[jira] [Commented] (SPARK-11381) Replace example code in mllib-linear-methods.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052513#comment-15052513 ]

Xusen Yin commented on SPARK-11381:
-----------------------------------

[~somi...@us.ibm.com] This JIRA is blocked by https://issues.apache.org/jira/browse/SPARK-11399. You can take it after that one is merged.

> Replace example code in mllib-linear-methods.md using include_example
> ---------------------------------------------------------------------
>
>                 Key: SPARK-11381
>                 URL: https://issues.apache.org/jira/browse/SPARK-11381
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Documentation
>            Reporter: Xusen Yin
>              Labels: starter
>
> This is similar to SPARK-11289 but for the example code in
> mllib-linear-methods.md.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6363) make scala 2.11 default language
[ https://issues.apache.org/jira/browse/SPARK-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052631#comment-15052631 ]

Ismael Juma commented on SPARK-6363:
------------------------------------

It's also worth pointing out that Scala 2.10 has not been maintained since March 2015.

> make scala 2.11 default language
> --------------------------------
>
>                 Key: SPARK-6363
>                 URL: https://issues.apache.org/jira/browse/SPARK-6363
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>            Reporter: antonkulaga
>            Priority: Minor
>              Labels: scala
>
> Most libraries have already moved to 2.11 and many are starting to drop 2.10
> support, so it would be better if Spark binaries were built with Scala
> 2.11 by default.
[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties
[ https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052617#comment-15052617 ]

Chandra Sekhar commented on SPARK-10625:
----------------------------------------

Can I test this now? Which version should I download that contains this change?

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds
> unserializable objects into connection properties
> --------------------------------------------------------------------
>
>                 Key: SPARK-10625
>                 URL: https://issues.apache.org/jira/browse/SPARK-10625
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.0
>         Environment: Ubuntu 14.04
>            Reporter: Peng Cheng
>              Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by
> adding new objects into the connection properties, which Spark then reuses
> and ships to workers. When some of these objects are not serializable,
> this triggers an org.apache.spark.SparkException: Task not serializable.
> The following test snippet demonstrates the problem using a modified H2 driver:
> {code}
> test("INSERT to JDBC Datasource with UnserializableH2Driver") {
>   object UnserializableH2Driver extends org.h2.Driver {
>     override def connect(url: String, info: Properties): Connection = {
>       val result = super.connect(url, info)
>       info.put("unserializableDriver", this)
>       result
>     }
>     override def getParentLogger: Logger = ???
>   }
>
>   import scala.collection.JavaConversions._
>
>   val oldDrivers = DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
>   oldDrivers.foreach { DriverManager.deregisterDriver }
>   DriverManager.registerDriver(UnserializableH2Driver)
>
>   sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
>   assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
>   assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).collect()(0).length)
>
>   DriverManager.deregisterDriver(UnserializableH2Driver)
>   oldDrivers.foreach { DriverManager.registerDriver }
> }
> {code}
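One direction for handling driver-injected objects like the one above is to drop entries that cannot be serialized before shipping the connection properties to workers. A minimal sketch in Python, with pickle standing in for Java serialization (the helper name is illustrative, not Spark's actual mechanism):

```python
import pickle

def serializable_properties(props):
    """Return a copy of props with entries that cannot be pickled removed."""
    clean = {}
    for key, value in props.items():
        try:
            pickle.dumps(value)
        except Exception:
            continue  # drop driver-injected objects that would break task serialization
        clean[key] = value
    return clean

# A lambda is unpicklable, much like the driver object H2 injects above.
props = {"user": "sa", "password": "", "unserializableDriver": lambda url: None}
assert set(serializable_properties(props)) == {"user", "password"}
```

The design trade-off mirrors the JIRA discussion: silently dropping entries can change driver behavior, so a real fix would likely need to filter only on the worker-bound copy.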
[jira] [Updated] (SPARK-6363) make scala 2.11 default language
[ https://issues.apache.org/jira/browse/SPARK-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6363:
-----------------------------
    Target Version/s: 2.0.0

> make scala 2.11 default language
> --------------------------------
>
>                 Key: SPARK-6363
>                 URL: https://issues.apache.org/jira/browse/SPARK-6363
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>            Reporter: antonkulaga
>            Priority: Minor
>              Labels: scala
>
> Most libraries have already moved to 2.11 and many are starting to drop 2.10
> support, so it would be better if Spark binaries were built with Scala
> 2.11 by default.
[jira] [Commented] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.
[ https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052603#comment-15052603 ]

Adam Roberts commented on SPARK-9858:
-------------------------------------

Modifying the UnsafeRowSerializer to always write/read in little-endian order fixes the problem, enabling Tungsten features to be fully exploited regardless of endianness (not yet sure why only the aggregate functions are impacted; I thought we'd see plenty of test failures). We can use LittleEndianDataInputStream/LittleEndianDataOutputStream to achieve this; they are part of the same package as ByteStreams. I will make sure the regular SparkSqlSerializer is OK too. We're hitting a similar problem with the DatasetAggregatorSuite (instead of 1 we get 9, instead of 2 we get 10, etc.); I expect the root cause to be the same. I'll get to work on the pull request, cheers.

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle
> partitions.
> -----------------------------------------------------------------------
>
>                 Key: SPARK-9858
>                 URL: https://issues.apache.org/jira/browse/SPARK-9858
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>             Fix For: 1.6.0
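The endianness fix described in the comment above amounts to pinning the byte order on both the write and read side instead of using the platform default. As an analogy (not the UnsafeRowSerializer code), Python's struct module shows the same idea: the "<" prefix fixes little-endian order, so the bytes round-trip identically on big- and little-endian hosts:

```python
import struct

def write_le(value):
    """Serialize a 64-bit integer in little-endian order, regardless of host endianness."""
    return struct.pack("<q", value)

def read_le(buf):
    """Deserialize a 64-bit little-endian integer."""
    return struct.unpack("<q", buf)[0]

# Round-trips the same everywhere because "<" pins the byte order.
assert read_le(write_le(123456789)) == 123456789
assert write_le(1) == b"\x01\x00\x00\x00\x00\x00\x00\x00"
```

Using native order (struct's "=" or no prefix with native alignment) would reproduce the bug: a value written on a big-endian machine and read on a little-endian one comes back byte-swapped.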
[jira] [Commented] (SPARK-12264) Could DataType provide a TypeTag?
[ https://issues.apache.org/jira/browse/SPARK-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052612#comment-15052612 ]

Andras Nemeth commented on SPARK-12264:
---------------------------------------

I guess my concrete proposal was a bit hidden in the last sentence of the original description: let's add a typeTag or scalaTypeTag method to DataType.

It's not that creating a mapping on the user side is terribly hard, although there are more complex types like maps and arrays which can be composed arbitrarily as far as I can tell, so you do have to do a bit of work to get it right. It's more that this user-implemented mapping is very fragile (e.g. I can definitely see more system types being added in the future) and duplicated at multiple clients.

Getting it at runtime from a concrete row is not great for many reasons:
- It only gives a ClassTag, not a TypeTag.
- You may easily end up with too concrete a Class. For example, maybe in the first row the first element is a one-element set, represented by a collection.immutable.HashSet$HashSet1, but that's not going to be a good class for all elements in the first column.
- It's not nice that you have to look at the actual data to understand what it is. What's the point of schemas then?

> Could DataType provide a TypeTag?
> ---------------------------------
>
>                 Key: SPARK-12264
>                 URL: https://issues.apache.org/jira/browse/SPARK-12264
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Andras Nemeth
>            Priority: Minor
>
> We are writing code that's dealing with generic DataFrames as inputs and
> further processes their contents with normal RDD operations (not SQL). We
> need some mechanism that tells us exactly what Scala types we will find
> inside a Row of a given DataFrame.
> The schema of the DataFrame contains this information in an abstract sense.
> But we need to map it to TypeTags, as that's what the rest of the system uses
> to identify what RDD contains what type of data - quite the natural choice in
> Scala.
> As far as I can tell, there is no good way to do this today. For now we have
> a hand-coded mapping, but that feels very fragile as Spark evolves. Is there
> a better way I'm missing? And if not, could we create one? Adding a typeTag
> or scalaTypeTag method to DataType, or at least to AtomicType, seems easy
> enough.
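The "hand-coded mapping" the reporter maintains looks roughly like the following sketch (all names here are illustrative; the schema is simplified to strings and tuples rather than real DataType objects). It shows both points of fragility: containers like arrays and maps force recursion, and any DataType not in the table falls through:

```python
# Illustrative hand-coded mapping from Spark SQL atomic type names to Python types.
ATOMIC = {"integer": int, "long": int, "double": float, "string": str, "boolean": bool}

def python_type(dt):
    """dt is a simplified schema node: a string for atomic types,
    ("array", elem) or ("map", key, value) for container types."""
    if isinstance(dt, str):
        if dt not in ATOMIC:
            # This is the fragility: the mapping breaks when Spark adds a type.
            raise KeyError("unmapped DataType: %s" % dt)
        return ATOMIC[dt]
    if dt[0] == "array":
        return (list, python_type(dt[1]))
    if dt[0] == "map":
        return (dict, python_type(dt[1]), python_type(dt[2]))
    raise KeyError("unmapped DataType: %r" % (dt,))

assert python_type("long") is int
assert python_type(("array", "string")) == (list, str)
```

Exposing the mapping from DataType itself, as the proposal suggests, would keep this table in one place instead of duplicated and drifting at every client.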
[jira] [Commented] (SPARK-6918) Secure HBase with Kerberos does not work over YARN
[ https://issues.apache.org/jira/browse/SPARK-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052737#comment-15052737 ]

Pierre Beauvois commented on SPARK-6918:
----------------------------------------

I found nothing about org.apache.spark.deploy.yarn.Client in the client code. The use case is 100% reproducible and the error happens for any HBase table. I opened a new ticket for Spark 1.5.2 (https://issues.apache.org/jira/browse/SPARK-12279).

> Secure HBase with Kerberos does not work over YARN
> --------------------------------------------------
>
>                 Key: SPARK-6918
>                 URL: https://issues.apache.org/jira/browse/SPARK-6918
>             Project: Spark
>          Issue Type: New Feature
>          Components: YARN
>    Affects Versions: 1.2.1, 1.3.0, 1.3.1
>            Reporter: Dean Chen
>            Assignee: Dean Chen
>             Fix For: 1.4.0
>
> Attempts to access HBase from Spark executors will fail at the auth to the
> metastore with: _GSSException: No valid credentials provided (Mechanism
> level: Failed to find any Kerberos tgt)_
> This is because the HBase Kerberos auth token is not sent to the executors.
> We will need something similar to obtainTokensForNamenodes (used for HDFS) in
> yarn/Client.scala. Storm also needed something similar:
> https://github.com/apache/storm/pull/226
> I've created a patch for this that required an HBase dependency in the YARN
> module that we've been using successfully at eBay, but am working on a version
> that does not require the HBase dependency by calling the class loader.
> Should be ready in a few days.
[jira] [Commented] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052759#comment-15052759 ]

Irakli Machabeli commented on SPARK-12218:
------------------------------------------

The bug itself is really dangerous: it would be acceptable if it simply crashed or threw an exception, but it silently produces wrong results. Imagine coding in Java and having to worry whether the compiler correctly interprets && and || in an if statement; that would be a disaster. For me this is not critical, since I'm still in try-out mode and can always upgrade to 1.6, but for someone who uses Spark 1.5 for real work, that's really bad.

> Boolean logic in sql does not work "not (A and B)" is not the same as "(not
> A) or (not B)"
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-12218
>                 URL: https://issues.apache.org/jira/browse/SPARK-12218
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Irakli Machabeli
>            Priority: Blocker
>
> Two identical queries produce different results
> In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not(
> PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff',
> 'PreviouslyChargedOff'))").count()
> Out[2]: 18
> In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and (
> not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff',
> 'PreviouslyChargedOff')))").count()
> Out[3]: 28
[jira] [Commented] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052763#comment-15052763 ]

Xiao Li commented on SPARK-12218:
---------------------------------

Agree! I will do a search to find out what happened in the push-down and which PR fixed the logical plan changes. Will keep you posted. Thanks!

> Boolean logic in sql does not work "not (A and B)" is not the same as "(not
> A) or (not B)"
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-12218
>                 URL: https://issues.apache.org/jira/browse/SPARK-12218
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Irakli Machabeli
>            Priority: Blocker
>
> Two identical queries produce different results
> In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not(
> PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff',
> 'PreviouslyChargedOff'))").count()
> Out[2]: 18
> In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and (
> not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff',
> 'PreviouslyChargedOff')))").count()
> Out[3]: 28
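The equivalence the reporter expects is just De Morgan's law: `not (A and B)` must equal `(not A) or (not B)` for every row, so the two filter counts must match. A minimal check in plain Python (the sample rows are made up; 0 and the roll strings stand in for the report's column tests):

```python
from itertools import product

# De Morgan's law, verified exhaustively over the booleans.
for a, b in product([True, False], repeat=2):
    assert (not (a and b)) == ((not a) or (not b))

# Applied to the two filters from the report over some illustrative rows.
rows = [(0, "PreviouslyPaidOff"), (1, "Current"), (0, "Current"), (2, "PreviouslyChargedOff")]
rolls = ("PreviouslyPaidOff", "PreviouslyChargedOff")
f1 = [r for r in rows if not (r[0] == 0 and r[1] in rolls)]
f2 = [r for r in rows if (not r[0] == 0) or (r[1] not in rolls)]
assert f1 == f2  # any engine that disagrees here is miscompiling the predicate
```

Since the law holds for all inputs, the differing counts (18 vs 28) can only come from the engine transforming one of the predicates incorrectly, e.g. during filter push-down, which is what the comment above proposes to track down.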
[jira] [Commented] (SPARK-11136) Warm-start support for ML estimator
[ https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052718#comment-15052718 ]

Xusen Yin commented on SPARK-11136:
-----------------------------------

I added a [design doc|https://docs.google.com/document/d/1LSRQDXOepVsOsCRT_PFwuiS9qmbgCzEskVPKXdqHoX0/edit?usp=sharing] here so that we can discuss the different implementations easily.

> Warm-start support for ML estimator
> -----------------------------------
>
>                 Key: SPARK-11136
>                 URL: https://issues.apache.org/jira/browse/SPARK-11136
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Xusen Yin
>            Priority: Minor
>
> The current implementation of Estimator does not support warm-start fitting,
> i.e. estimator.fit(data, params, partialModel). To support it, we first need
> to add warm-start to all ML estimators. This is an umbrella JIRA to add
> support for the warm-start estimator.
> Treat the model as a special parameter, passing it through ParamMap, e.g. val
> partialModel: Param[Option[M]] = new Param(...). If a model exists, we use it
> to warm-start; otherwise we start the training process from the beginning.
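The proposal above, treating an optional partial model as just another parameter, can be sketched with a toy iterative estimator (all names are illustrative, not the ML API; the "model" is a single scalar fit by fixed-step updates):

```python
def fit(data, iterations=10, initial_model=None):
    """Fit a 1-D mean estimate iteratively; resume from initial_model if given (warm start)."""
    w = 0.0 if initial_model is None else initial_model
    for _ in range(iterations):
        grad = sum(w - x for x in data) / len(data)
        w -= 0.5 * grad  # fixed step size, enough for the sketch
    return w

data = [1.0, 2.0, 3.0]
cold = fit(data, iterations=20)                     # train from scratch
partial = fit(data, iterations=10)                  # a half-trained model
warm = fit(data, iterations=10, initial_model=partial)  # resume instead of restarting
assert abs(cold - 2.0) < 1e-3
assert abs(warm - 2.0) < 1e-3  # 10 + 10 warm-started steps match 20 cold steps
```

This is the behavior the umbrella JIRA asks each estimator to support: when the partial-model parameter is set, continue from it; when it is absent, start from the default initialization.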
[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052852#comment-15052852 ]

Steve Loughran commented on SPARK-2356:
---------------------------------------

I've stuck up binaries compatible with Hadoop 2.6 & 2.7, to make installing things easier:

* https://github.com/steveloughran/winutils

Note also that Hadoop 2.8 includes HADOOP-10775, "fail with meaningful messages if winutils can't be found".

> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
> --------------------------------------------------------------------------
>
>                 Key: SPARK-2356
>                 URL: https://issues.apache.org/jira/browse/SPARK-2356
>             Project: Spark
>          Issue Type: Bug
>          Components: Windows
>    Affects Versions: 1.0.0
>            Reporter: Kostiantyn Kudriavtsev
>            Priority: Critical
>
> I'm trying to run some transformations on Spark. They work fine on a cluster
> (YARN, Linux machines). However, when I try to run them on a local machine
> (Windows 7) under a unit test, I get errors (I don't use Hadoop; I read files
> from the local filesystem):
> {code}
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the
> Hadoop binaries.
>     at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>     at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>     at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
>     at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
>     at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>     at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
>     at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>     at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>     at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>     at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
>     at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
>     at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
>     at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
>     at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
> {code}
> This happens because the Hadoop config is initialized each time a Spark
> context is created, regardless of whether Hadoop is required or not.
> I propose adding a special flag to indicate whether the Hadoop config is
> required (or starting this configuration manually).
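The "null\bin\winutils.exe" in the message comes from Hadoop building the binary path from an unset hadoop.home.dir (the JVM renders the missing value as the string "null"). A pre-flight check along these lines (an illustrative helper, not Spark or Hadoop code) makes the failure mode explicit before creating a SparkContext on Windows:

```python
import os

def expected_winutils(hadoop_home):
    """Where Hadoop will look for winutils.exe given HADOOP_HOME; None signals
    the unset-home case that surfaces as 'null\\bin\\winutils.exe' on the JVM side."""
    if not hadoop_home:
        return None
    return os.path.join(hadoop_home, "bin", "winutils.exe")

path = expected_winutils(os.environ.get("HADOOP_HOME"))
if path is None:
    print("HADOOP_HOME is unset; Hadoop would fail with 'null\\bin\\winutils.exe'")
```

The binaries linked in the comment above are meant to be dropped into exactly that `%HADOOP_HOME%\bin` location.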
[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver
[ https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052755#comment-15052755 ]

Sean Owen commented on SPARK-11193:
-----------------------------------

[~phibit] are you able to test the change in https://github.com/apache/spark/pull/10203 by any chance? Does it look OK to you?

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting
> KinesisReceiver
> ---------------------------------------------------------------
>
>                 Key: SPARK-11193
>                 URL: https://issues.apache.org/jira/browse/SPARK-11193
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.5.0, 1.5.1
>            Reporter: Phil Kallos
>         Attachments: screen.png
>
> After upgrading from Spark 1.4.x to 1.5.x, I am now unable to start a Kinesis
> Spark Streaming application and am consistently greeted with this exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast
> to scala.collection.mutable.SynchronizedMap
>     at org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>     at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>     at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>     at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>     at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>     at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>     at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:88)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on
> Amazon EMR (using the latest emr-release 4.1.0, which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector
> https://kinesis.us-east-1.amazonaws.com
[jira] [Updated] (SPARK-12275) No plan for BroadcastHint in some condition
[ https://issues.apache.org/jira/browse/SPARK-12275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yucai updated SPARK-12275:
--------------------------
    Description: 
*Summary*
No plan for BroadcastHint is generated in some conditions.

*Test Case*
{code}
val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
val parquetTempFile = "%s/SPARK-_%d.parquet".format(System.getProperty("java.io.tmpdir"), scala.util.Random.nextInt)
df1.write.parquet(parquetTempFile)
val pf1 = sqlContext.read.parquet(parquetTempFile)
#1. df1.join(broadcast(pf1)).count()
#2. broadcast(pf1).count()
{code}

*Result*
It will trigger an assertion in QueryPlanner.scala, like below:
{code}
scala> df1.join(broadcast(pf1)).count()
java.lang.AssertionError: assertion failed: No plan for BroadcastHint
+- Relation[key#6,value#7] ParquetRelation[hdfs://10.1.0.20:8020/tmp/SPARK-_1817830406.parquet]
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
    at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
{code}

  was:
*Summary*
No plan for BroadcastHint is generated in some conditions.

*Test Case*
{code}
val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
val parquetTempFile = "%s/SPARK-_%d.parquet".format(System.getProperty("java.io.tmpdir"), scala.util.Random.nextInt)
df1.write.parquet(parquetTempFile)
val pf1 = sqlContext.read.parquet(parquetTempFile)
df1.join(broadcast(pf1)).count()
{code}

*Result*
It will trigger an assertion in QueryPlanner.scala, like below:
{code}
scala> df1.join(broadcast(pf1)).count()
java.lang.AssertionError: assertion failed: No plan for BroadcastHint
+- Relation[key#6,value#7] ParquetRelation[hdfs://10.1.0.20:8020/tmp/SPARK-_1817830406.parquet]
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
    at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
{code}

> No plan for BroadcastHint in some condition
> -------------------------------------------
>
>                 Key: SPARK-12275
>                 URL: https://issues.apache.org/jira/browse/SPARK-12275
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: yucai
>
> *Summary*
> No plan for BroadcastHint is generated in some conditions.
> *Test Case*
> {code}
> val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
> val parquetTempFile =
> "%s/SPARK-_%d.parquet".format(System.getProperty("java.io.tmpdir"),
> scala.util.Random.nextInt)
> df1.write.parquet(parquetTempFile)
> val pf1 = sqlContext.read.parquet(parquetTempFile)
> #1. df1.join(broadcast(pf1)).count()
> #2. broadcast(pf1).count()
> {code}
> *Result*
> It will trigger an assertion in QueryPlanner.scala, like below:
> {code}
> scala> df1.join(broadcast(pf1)).count()
> java.lang.AssertionError: assertion failed: No plan for BroadcastHint
> +- Relation[key#6,value#7]
> ParquetRelation[hdfs://10.1.0.20:8020/tmp/SPARK-_1817830406.parquet]
> at scala.Predef$.assert(Predef.scala:179)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> {code}
[jira] [Created] (SPARK-12280) "--packages" command doesn't work in "spark-submit"
Anton Loss created SPARK-12280:
----------------------------------

             Summary: "--packages" command doesn't work in "spark-submit"
                 Key: SPARK-12280
                 URL: https://issues.apache.org/jira/browse/SPARK-12280
             Project: Spark
          Issue Type: Bug
            Reporter: Anton Loss
            Priority: Minor

When running "spark-shell", the "--packages" option works as expected, but with "spark-submit" it produces the following stack trace:

15/12/11 17:05:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/11 17:05:51 WARN Client: Resource file:/home/anton/data-tools-1.0-SNAPSHOT-jar-with-dependencies.jar added multiple times to distributed cache.
Exception in thread "main" java.io.FileNotFoundException: Requested file maprfs:///home/mapr/.ivy2/jars/com.databricks_spark-csv_2.11-1.3.0.jar does not exist.
    at com.mapr.fs.MapRFileSystem.getMapRFileStatus(MapRFileSystem.java:1332)
    at com.mapr.fs.MapRFileSystem.getFileStatus(MapRFileSystem.java:942)
    at com.mapr.fs.MFS.getFileStatus(MFS.java:151)
    at org.apache.hadoop.fs.AbstractFileSystem.resolvePath(AbstractFileSystem.java:467)
    at org.apache.hadoop.fs.FileContext$25.next(FileContext.java:2193)
    at org.apache.hadoop.fs.FileContext$25.next(FileContext.java:2189)
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
    at org.apache.hadoop.fs.FileContext.resolve(FileContext.java:2189)
    at org.apache.hadoop.fs.FileContext.resolvePath(FileContext.java:601)
    at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:242)
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$1.apply(Client.scala:366)
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$1.apply(Client.scala:360)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:360)
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:358)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:358)
    at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:842)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:881)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

It seems to be looking in the wrong place (maprfs:// instead of the local filesystem), as the jar is clearly present at:

file:///home/mapr/.ivy2/jars/com.databricks_spark-csv_2.11-1.3.0.jar
[jira] [Commented] (SPARK-12275) No plan for BroadcastHint in some condition
[ https://issues.apache.org/jira/browse/SPARK-12275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052916#comment-15052916 ]

yucai commented on SPARK-12275:
-------------------------------

*Root Cause*
When BasicOperators's "case BroadcastHint(child)" is hit (in SparkStrategies.scala), it recursively invokes BasicOperators.apply on "child":
{code}
case BroadcastHint(child) => apply(child)
{code}
If BasicOperators cannot process that child, this leads to "No plan for BroadcastHint":
{code}
case _ => Nil
{code}
In my example above, broadcast(pf1) hits "case BroadcastHint(child)", so the child is pf1, whose type is "Relation[key#91,value#92] ParquetRelation". Invoking BasicOperators.apply again with this child unfortunately matches nothing in BasicOperators.apply, so it returns Nil: "No plan for BroadcastHint".

*Solution*
Use planLater to invoke the other execution strategies that are available.

> No plan for BroadcastHint in some condition
> -------------------------------------------
>
>                 Key: SPARK-12275
>                 URL: https://issues.apache.org/jira/browse/SPARK-12275
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: yucai
>
> *Summary*
> No plan for BroadcastHint is generated in some condition.
> *Test Case*
> {code}
> val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
> val parquetTempFile =
> "%s/SPARK-_%d.parquet".format(System.getProperty("java.io.tmpdir"),
> scala.util.Random.nextInt)
> df1.write.parquet(parquetTempFile)
> val pf1 = sqlContext.read.parquet(parquetTempFile)
> df1.join(broadcast(pf1)).count()
> {code}
> *Result*
> It will trigger assertion in QueryPlanner.scala, like below:
> {code}
> scala> df1.join(broadcast(pf1)).count()
> java.lang.AssertionError: assertion failed: No plan for BroadcastHint
> +- Relation[key#6,value#7]
> ParquetRelation[hdfs://10.1.0.20:8020/tmp/SPARK-_1817830406.parquet]
> at scala.Predef$.assert(Predef.scala:179)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> {code}
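The root cause described above can be illustrated with a toy planner in Python: recursing into a single strategy's own apply dead-ends on node types that strategy does not handle, while deferring to the full planner (the planLater idea) consults every strategy. The names mirror the Catalyst ones, but this is a sketch, not Spark code:

```python
def basic_operators(node, plan_later):
    """One strategy; returns a list of physical-plan candidates (empty if no match)."""
    if node[0] == "BroadcastHint":
        # Buggy behavior: recurse into THIS strategy only, i.e.
        # basic_operators(node[1], plan_later), which returns [] for a
        # ParquetRelation child and trips "No plan for BroadcastHint".
        # Fixed behavior: defer the child to the full planner.
        return [("BroadcastExec", plan_later(node[1]))]
    return []

def data_sources(node, plan_later):
    """A second strategy that knows how to scan a ParquetRelation."""
    return [("ScanExec", node[1])] if node[0] == "ParquetRelation" else []

STRATEGIES = [basic_operators, data_sources]

def plan(node):
    """Try every strategy in turn, as Catalyst's QueryPlanner does."""
    for strategy in STRATEGIES:
        candidates = strategy(node, plan)
        if candidates:
            return candidates[0]
    raise AssertionError("No plan for %s" % node[0])

assert plan(("BroadcastHint", ("ParquetRelation", "pf1"))) == ("BroadcastExec", ("ScanExec", "pf1"))
```

With the buggy recursion, the ParquetRelation child never reaches the data-source strategy, which is exactly the assertion failure in the report.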
[jira] [Commented] (SPARK-9690) Add random seed Param to PySpark CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053588#comment-15053588 ] Apache Spark commented on SPARK-9690: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/10268 > Add random seed Param to PySpark CrossValidator > --- > > Key: SPARK-9690 > URL: https://issues.apache.org/jira/browse/SPARK-9690 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 1.4.1 >Reporter: Martin Menestret >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > The fold in the ML CrossValidator depends on a rand whose seed is set to 0 > and it leads the sql.functions rand to call sc._jvm.functions.rand() with no > seed. > In order to be able to unit test a Cross Validation it would be a good idea > to be able to set this seed so the output of the cross validation (with a > featureSubsetStrategy set to "all") would always be the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
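The motivation above can be illustrated with a seeded fold assignment. This is a sketch, not CrossValidator's actual internals: with an explicit seed, the fold split, and hence the cross-validation output, becomes reproducible, which is what makes it unit-testable.

```scala
import scala.util.Random

// Sketch of seeded k-fold assignment (illustrative, not Spark's code):
// each row index is assigned to one of `numFolds` folds by a seeded RNG,
// so the same seed always yields the same split.
def foldAssignments(numRows: Int, numFolds: Int, seed: Long): Seq[Int] = {
  val rng = new Random(seed)
  Seq.fill(numRows)(rng.nextInt(numFolds))
}
```

Two runs with the same seed produce identical folds; without a settable seed (the reported behavior, where the seed is effectively fixed upstream), a test cannot control this split.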
[jira] [Assigned] (SPARK-12276) Prevent RejectedExecutionException by checking if ThreadPoolExecutor is shutdown and its capacity
[ https://issues.apache.org/jira/browse/SPARK-12276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12276: Assignee: Apache Spark > Prevent RejectedExecutionException by checking if ThreadPoolExecutor is > shutdown and its capacity > - > > Key: SPARK-12276 > URL: https://issues.apache.org/jira/browse/SPARK-12276 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Minor > > We noticed that it is possible to throw RejectedExecutionException when > submitting a thread in AppClient. The error is like the following. We should add > some checks to prevent it. > java.util.concurrent.RejectedExecutionException: Task > java.util.concurrent.FutureTask@2077082c rejected from > java.util.concurrent.ThreadPoolExecutor@66b9915a[Running, pool size = 1, > active threads = 0, queued tasks = 0, completed tasks = 1] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) > at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372) > at > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110) > at > org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
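A guard of the kind proposed can be sketched as follows. This is a hypothetical helper, not the actual AppClient patch: check isShutdown before submitting, and still catch RejectedExecutionException in case a shutdown races the check.

```scala
import java.util.concurrent.{ExecutorService, Executors, RejectedExecutionException}

// Hypothetical guard (not the actual AppClient change): skip submission
// when the pool is already shut down, and treat a racing shutdown as a
// no-op instead of letting RejectedExecutionException propagate.
def safeSubmit(pool: ExecutorService, task: Runnable): Boolean = {
  if (pool.isShutdown) return false
  try {
    pool.execute(task)
    true
  } catch {
    // shutdown() may still win the race between the check and execute()
    case _: RejectedExecutionException => false
  }
}
```

The try/catch is still needed even with the isShutdown check, because another thread can call shutdown() between the check and execute().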
[jira] [Assigned] (SPARK-12276) Prevent RejectedExecutionException by checking if ThreadPoolExecutor is shutdown and its capacity
[ https://issues.apache.org/jira/browse/SPARK-12276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12276: Assignee: (was: Apache Spark) > Prevent RejectedExecutionException by checking if ThreadPoolExecutor is > shutdown and its capacity > - > > Key: SPARK-12276 > URL: https://issues.apache.org/jira/browse/SPARK-12276 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Liang-Chi Hsieh >Priority: Minor > > We noticed that it is possible to throw RejectedExecutionException when > submitting a thread in AppClient. The error is like the following. We should add > some checks to prevent it. > java.util.concurrent.RejectedExecutionException: Task > java.util.concurrent.FutureTask@2077082c rejected from > java.util.concurrent.ThreadPoolExecutor@66b9915a[Running, pool size = 1, > active threads = 0, queued tasks = 0, completed tasks = 1] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) > at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372) > at > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110) > at > org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9694) Add random seed Param to Scala CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9694: - Assignee: Yanbo Liang > Add random seed Param to Scala CrossValidator > - > > Key: SPARK-9694 > URL: https://issues.apache.org/jira/browse/SPARK-9694 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9694) Add random seed Param to Scala CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9694: - Shepherd: Joseph K. Bradley Target Version/s: 2.0.0 (was: ) > Add random seed Param to Scala CrossValidator > - > > Key: SPARK-9694 > URL: https://issues.apache.org/jira/browse/SPARK-9694 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12284) Output UnsafeRow from window function
Davies Liu created SPARK-12284: -- Summary: Output UnsafeRow from window function Key: SPARK-12284 URL: https://issues.apache.org/jira/browse/SPARK-12284 Project: Spark Issue Type: Improvement Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11529) Add section in user guide for StreamingLogisticRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-11529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11529: -- Target Version/s: (was: 1.6.0) > Add section in user guide for StreamingLogisticRegressionWithSGD > > > Key: SPARK-11529 > URL: https://issues.apache.org/jira/browse/SPARK-11529 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib >Reporter: Joseph K. Bradley > > [~freeman-lab] Would you be able to do this for 1.6? Or if there are others > who can, could you please ping them? Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12286) Support UnsafeRow in all SparkPlan (if possible)
Davies Liu created SPARK-12286: -- Summary: Support UnsafeRow in all SparkPlan (if possible) Key: SPARK-12286 URL: https://issues.apache.org/jira/browse/SPARK-12286 Project: Spark Issue Type: Epic Components: SQL Reporter: Davies Liu There are still some SparkPlans that do not support UnsafeRow (or do not support it well). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12289) Support UnsafeRow in TakeOrderedAndProject/Limit
Davies Liu created SPARK-12289: -- Summary: Support UnsafeRow in TakeOrderedAndProject/Limit Key: SPARK-12289 URL: https://issues.apache.org/jira/browse/SPARK-12289 Project: Spark Issue Type: Improvement Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11885) UDAF may nondeterministically generate wrong results
[ https://issues.apache.org/jira/browse/SPARK-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053616#comment-15053616 ] Yin Huai commented on SPARK-11885: -- [~davies] btw, which exprId was generated at executor side? > UDAF may nondeterministically generate wrong results > > > Key: SPARK-11885 > URL: https://issues.apache.org/jira/browse/SPARK-11885 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yin Huai >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.3 > > > I could not reproduce it in 1.6 branch (it can be easily reproduced in 1.5). > I think it is an issue in 1.5 branch. > Try the following in spark 1.5 (with a cluster) and you can see the problem. > {code} > import java.math.BigDecimal > import org.apache.spark.sql.expressions.MutableAggregationBuffer > import org.apache.spark.sql.expressions.UserDefinedAggregateFunction > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StructType, StructField, DataType, > DoubleType, LongType} > class GeometricMean extends UserDefinedAggregateFunction { > def inputSchema: StructType = > StructType(StructField("value", DoubleType) :: Nil) > def bufferSchema: StructType = StructType( > StructField("count", LongType) :: > StructField("product", DoubleType) :: Nil > ) > def dataType: DataType = DoubleType > def deterministic: Boolean = true > def initialize(buffer: MutableAggregationBuffer): Unit = { > buffer(0) = 0L > buffer(1) = 1.0 > } > def update(buffer: MutableAggregationBuffer,input: Row): Unit = { > buffer(0) = buffer.getAs[Long](0) + 1 > buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0) > } > def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { > buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0) > buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1) > } > def evaluate(buffer: Row): Any = { > math.pow(buffer.getDouble(1), 1.0d / buffer.getLong(0)) > } > } > 
sqlContext.udf.register("gm", new GeometricMean) > val df = Seq( > (1, "italy", "emilia", 42, BigDecimal.valueOf(100, 0), "john"), > (2, "italy", "toscana", 42, BigDecimal.valueOf(505, 1), "jim"), > (3, "italy", "puglia", 42, BigDecimal.valueOf(70, 0), "jenn"), > (4, "italy", "emilia", 42, BigDecimal.valueOf(75, 0), "jack"), > (5, "uk", "london", 42, BigDecimal.valueOf(200, 0), "carl"), > (6, "italy", "emilia", 42, BigDecimal.valueOf(42, 0), "john")). > toDF("receipt_id", "store_country", "store_region", "store_id", "amount", > "seller_name") > df.registerTempTable("receipts") > > val q = sql(""" > select store_country, > store_region, > avg(amount), > sum(amount), > gm(amount) > from receipts > where amount > 50 > and store_country = 'italy' > group by store_country, store_region > """) > q.show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-12047) Unhelpful error messages generated by JavaDoc while doing sbt unidoc
[ https://issues.apache.org/jira/browse/SPARK-12047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neelesh Srinivas Salian closed SPARK-12047. --- Resolution: Duplicate > Unhelpful error messages generated by JavaDoc while doing sbt unidoc > > > Key: SPARK-12047 > URL: https://issues.apache.org/jira/browse/SPARK-12047 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Cheng Lian > > I'm not quite familiar with the internal mechanism of the SBT Unidoc plugin, > but it seems that it tries to convert Scala files into Java files and then > run {{javadoc}} over generated files to produces JavaDoc pages. > During this process, {{javadoc}} keeps producing unhelpful error messages > like: > {noformat} > [error] > /Users/lian/local/src/spark/branch-1.6/mllib/target/java/org/apache/spark/ml/PredictionModel.java:16: > error: unknown tag: group > [error] /** @group setParam */ > [error] ^ > [error] > /Users/lian/local/src/spark/branch-1.6/graphx/target/java/org/apache/spark/graphx/lib/PageRank.java:83: > error: unknown tag: tparam > [error]* @tparam ED the original edge attribute (not used) > [error] ^ > [error] > /Users/lian/local/src/spark/branch-1.6/core/target/java/org/apache/spark/ContextCleaner.java:76: > error: BlockManagerMaster is not public in org.apache.spark.storage; cannot > be accessed from outside package > [error] private org.apache.spark.storage.BlockManagerMaster > blockManagerMaster () { throw new RuntimeException(); } > [error]^ > [error] > /Users/lian/local/src/spark/branch-1.6/mllib/target/java/org/apache/spark/mllib/linalg/distributed/BlockMatrix.java:72: > error: reference not found > [error]* if it is being added to a {@link DenseMatrix}. 
If two dense > matrices are added, the output will > [error] ^ > {noformat} > The {{scaladoc}} tool also produces tons of warning messages like: > {noformat} > [warn] > /Users/lian/local/src/spark/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/Column.scala:1117: > Could not find any member to link for "StructField". > [warn] /** > [warn] ^ > {noformat} > (This one is probably because of > [SI-3695|https://issues.scala-lang.org/browse/SI-3695] and > [SI-8734|https://issues.scala-lang.org/browse/SI-8734]). > The problem is that these messages obscure the real problems and make > API doc auditing difficult. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12290) Change the default value in SparkPlan
Davies Liu created SPARK-12290: -- Summary: Change the default value in SparkPlan Key: SPARK-12290 URL: https://issues.apache.org/jira/browse/SPARK-12290 Project: Spark Issue Type: Improvement Reporter: Davies Liu supportUnsafeRows = true supportSafeRows = false // outputUnsafeRows = true -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at first finding best splits
[ https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenmin Wu updated SPARK-12272: -- Attachment: screenshot-1.png > Gradient boosted trees: too slow at first finding best splits > - > > Key: SPARK-12272 > URL: https://issues.apache.org/jira/browse/SPARK-12272 > Project: Spark > Issue Type: Request > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Wenmin Wu > Attachments: screenshot-1.png, training-log1.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053810#comment-15053810 ] Josh Rosen commented on SPARK-6270: --- While I think that we should have this discussion about UI reconstruction of long-running applications, I think this is orthogonal to the right solution for this issue (SPARK-6270). The root problem here, related to the master / cluster manager dying, seems to be caused by a design flaw: why is the master responsible for serving historical UIs? The standalone history server process should have that responsibility, since UI serving might need a lot of memory. I think the right fix here is to just remove the Master's embedded history server; I just don't think it makes sense to assign history server responsibilities to the master when it's designed to be a very low-resource-use, high-stability, high-resiliency service. > Standalone Master hangs when streaming job completes and event logging is > enabled > - > > Key: SPARK-6270 > URL: https://issues.apache.org/jira/browse/SPARK-6270 > Project: Spark > Issue Type: Bug > Components: Deploy, Streaming >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.5.1 >Reporter: Tathagata Das >Priority: Critical > > If the event logging is enabled, the Spark Standalone Master tries to > recreate the web UI of a completed Spark application from its event logs. > However if this event log is huge (e.g. for a Spark Streaming application), > then the master hangs in its attempt to read and recreate the web ui. This > hang causes the whole standalone cluster to be unusable. > Workaround is to disable the event logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053808#comment-15053808 ] Evan Chen commented on SPARK-10931: --- Hey Joseph, Thanks for the suggestion. I was wondering which model abstraction and getattr method you are referring to. I modified every model on the Python side to reflect how it is being done on the Scala side. Let me know what you think. > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at first finding best splits
[ https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenmin Wu updated SPARK-12272: -- Attachment: training-log3.png > Gradient boosted trees: too slow at first finding best splits > - > > Key: SPARK-12272 > URL: https://issues.apache.org/jira/browse/SPARK-12272 > Project: Spark > Issue Type: Request > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Wenmin Wu > Attachments: training-log1.png, training-log2.pnd.png, > training-log3.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12282) Document spark.jars
Justin Bailey created SPARK-12282: - Summary: Document spark.jars Key: SPARK-12282 URL: https://issues.apache.org/jira/browse/SPARK-12282 Project: Spark Issue Type: Documentation Reporter: Justin Bailey The spark.jars property (as implemented in SparkSubmit.scala, https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L516) is not documented anywhere, and should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12280) "--packages" command doesn't work in "spark-submit"
[ https://issues.apache.org/jira/browse/SPARK-12280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12280: -- Component/s: Spark Submit > "--packages" command doesn't work in "spark-submit" > --- > > Key: SPARK-12280 > URL: https://issues.apache.org/jira/browse/SPARK-12280 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Reporter: Anton Loss >Priority: Minor > > When running "spark-shell", the "--packages" option works as expected, but > with "spark-submit" it produces the following stacktrace: > 15/12/11 17:05:48 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 15/12/11 17:05:51 WARN Client: Resource > file:/home/anton/data-tools-1.0-SNAPSHOT-jar-with-dependencies.jar added > multiple times to distributed cache. > Exception in thread "main" java.io.FileNotFoundException: Requested file > maprfs:///home/mapr/.ivy2/jars/com.databricks_spark-csv_2.11-1.3.0.jar does > not exist. 
> at > com.mapr.fs.MapRFileSystem.getMapRFileStatus(MapRFileSystem.java:1332) > at com.mapr.fs.MapRFileSystem.getFileStatus(MapRFileSystem.java:942) > at com.mapr.fs.MFS.getFileStatus(MFS.java:151) > at > org.apache.hadoop.fs.AbstractFileSystem.resolvePath(AbstractFileSystem.java:467) > at org.apache.hadoop.fs.FileContext$25.next(FileContext.java:2193) > at org.apache.hadoop.fs.FileContext$25.next(FileContext.java:2189) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.resolve(FileContext.java:2189) > at org.apache.hadoop.fs.FileContext.resolvePath(FileContext.java:601) > at > org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:242) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$1.apply(Client.scala:366) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$1.apply(Client.scala:360) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:360) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:358) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:358) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115) > at org.apache.spark.deploy.yarn.Client.run(Client.scala:842) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:881) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > it seems it's looking in the wrong place, as jar is clearly present here > file:///home/mapr/.ivy2/jars/com.databricks_spark-csv_2.11-1.3.0.jar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11497) PySpark RowMatrix Constructor Has Type Erasure Issue
[ https://issues.apache.org/jira/browse/SPARK-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11497: -- Target Version/s: 1.6.1, 2.0.0 > PySpark RowMatrix Constructor Has Type Erasure Issue > > > Key: SPARK-11497 > URL: https://issues.apache.org/jira/browse/SPARK-11497 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.5.0, 1.5.1, 1.6.0 >Reporter: Mike Dusenberry >Assignee: Mike Dusenberry >Priority: Minor > > Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark > RowMatrix constructor. As discussed on the dev list > [here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html], > there appears to be an issue with type erasure with RDDs coming from Java, > and by extension from PySpark. Although we are attempting to construct a > RowMatrix from an RDD[Vector] in > [PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115], > the Vector type is erased, resulting in an RDD[Object]. Thus, when calling > Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which > an Object cannot be cast to a Spark Vector. As noted in the aforementioned > dev list thread, this issue was also encountered with DecisionTrees, and the > fix involved an explicit retag of the RDD with a Vector type. Thus, this PR > will apply that fix to the createRowMatrix helper function in PythonMLlibAPI. > IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely > due to their related helper functions in PythonMLlibAPI creating the RDDs > explicitly from DataFrames with pattern matching, thus preserving the types. 
> The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0: > {code} > from pyspark.mllib.linalg.distributed import RowMatrix > rows = sc.parallelize([[3, -6], [4, -8], [0, 1]]) > mat = RowMatrix(rows) > mat._java_matrix_wrapper.call("tallSkinnyQR", True) > {code} > Should result in the following exception: > {code} > java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to > [Lorg.apache.spark.mllib.linalg.Vector; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
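The erasure problem described above is easy to reproduce without Spark. The sketch below is only an analogy for the reported ClassCastException (the RDD/Vector names in the report are the real ones; the array example is illustrative): a container whose runtime element type is Object cannot be cast to a more specific array type, which is why the fix has to retag the RDD with an explicit element class.

```scala
// Analogy for the reported error (not Spark code): the JVM tracks array
// element types at runtime, so an Array[AnyRef] that merely *contains*
// Strings is still not an Array[String] -- just as an RDD[Object] that
// contains Vectors is not an RDD[Vector] after erasure.
val erased: Array[AnyRef] = Array[AnyRef]("3", "-6")

def castFails(a: Array[AnyRef]): Boolean =
  try {
    val typed: Array[String] = a.asInstanceOf[Array[String]]
    typed.length < 0  // unreachable when the runtime type really is String[]
  } catch { case _: ClassCastException => true }

// The fix is analogous to Spark's retag: build the container with the
// correct runtime element type up front.
val retagged: Array[String] = Array("3", "-6")
```

Casting `erased` throws `[Ljava.lang.Object; cannot be cast to [Ljava.lang.String;`, mirroring the `[Lorg.apache.spark.mllib.linalg.Vector;` message in the report, while `retagged` carries the right runtime type from the start.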
[jira] [Created] (SPARK-12283) Use UnsafeRow as the buffer in SortBasedAggregation to avoid Unsafe/Safe conversion
Davies Liu created SPARK-12283: -- Summary: Use UnsafeRow as the buffer in SortBasedAggregation to avoid Unsafe/Safe conversion Key: SPARK-12283 URL: https://issues.apache.org/jira/browse/SPARK-12283 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu SortBasedAggregation uses GenericMutableRow as the aggregation buffer and also requires that the input not be UnsafeRow, because we can't compare/evaluate UnsafeRow and GenericInternalRow at the same time. TungstenSort outputs UnsafeRow, so multiple Safe/Unsafe projections will be inserted between them. If we can make sure that all mutation happens in ascending order, an UnsafeRow buffer could be used to update var-length objects (String, Binary, Struct, etc.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12217) Document invalid handling for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12217: Assignee: Apache Spark > Document invalid handling for StringIndexer > --- > > Key: SPARK-12217 > URL: https://issues.apache.org/jira/browse/SPARK-12217 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Benjamin Fradet >Assignee: Apache Spark >Priority: Minor > > Documentation is needed regarding the handling of invalid labels in > StringIndexer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12217) Document invalid handling for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12217: Assignee: (was: Apache Spark) > Document invalid handling for StringIndexer > --- > > Key: SPARK-12217 > URL: https://issues.apache.org/jira/browse/SPARK-12217 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Benjamin Fradet >Priority: Minor > > Documentation is needed regarding the handling of invalid labels in > StringIndexer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12217) Document invalid handling for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12217: Assignee: Apache Spark > Document invalid handling for StringIndexer > --- > > Key: SPARK-12217 > URL: https://issues.apache.org/jira/browse/SPARK-12217 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Benjamin Fradet >Assignee: Apache Spark >Priority: Minor > > Documentation is needed regarding the handling of invalid labels in > StringIndexer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12286) Support UnsafeRow in all SparkPlan (if possible)
[ https://issues.apache.org/jira/browse/SPARK-12286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12286: -- Assignee: Davies Liu > Support UnsafeRow in all SparkPlan (if possible) > > > Key: SPARK-12286 > URL: https://issues.apache.org/jira/browse/SPARK-12286 > Project: Spark > Issue Type: Epic > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > There are still some SparkPlans that do not support UnsafeRow (or do not > support it well). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12287) Support UnsafeRow in MapPartitions/MapGroups/CoGroup
[ https://issues.apache.org/jira/browse/SPARK-12287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12287: --- Issue Type: Improvement (was: Epic) > Support UnsafeRow in MapPartitions/MapGroups/CoGroup > > > Key: SPARK-12287 > URL: https://issues.apache.org/jira/browse/SPARK-12287 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11965) Update user guide for RFormula feature interactions
[ https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11965: -- Assignee: Yanbo Liang > Update user guide for RFormula feature interactions > --- > > Key: SPARK-11965 > URL: https://issues.apache.org/jira/browse/SPARK-11965 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Update the user guide for RFormula to cover feature interactions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12288) Support UnsafeRow in Coalesce/Except/Intersect
Davies Liu created SPARK-12288: -- Summary: Support UnsafeRow in Coalesce/Except/Intersect Key: SPARK-12288 URL: https://issues.apache.org/jira/browse/SPARK-12288 Project: Spark Issue Type: Improvement Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11885) UDAF may nondeterministically generate wrong results
[ https://issues.apache.org/jira/browse/SPARK-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053613#comment-15053613 ] Yin Huai commented on SPARK-11885: -- Thanks [~davies]! [~milad.bourh...@gmail.com] Can you try our latest branch 1.5 and see it is fixed for your case? > UDAF may nondeterministically generate wrong results > > > Key: SPARK-11885 > URL: https://issues.apache.org/jira/browse/SPARK-11885 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yin Huai >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.3 > > > I could not reproduce it in 1.6 branch (it can be easily reproduced in 1.5). > I think it is an issue in 1.5 branch. > Try the following in spark 1.5 (with a cluster) and you can see the problem. > {code} > import java.math.BigDecimal > import org.apache.spark.sql.expressions.MutableAggregationBuffer > import org.apache.spark.sql.expressions.UserDefinedAggregateFunction > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StructType, StructField, DataType, > DoubleType, LongType} > class GeometricMean extends UserDefinedAggregateFunction { > def inputSchema: StructType = > StructType(StructField("value", DoubleType) :: Nil) > def bufferSchema: StructType = StructType( > StructField("count", LongType) :: > StructField("product", DoubleType) :: Nil > ) > def dataType: DataType = DoubleType > def deterministic: Boolean = true > def initialize(buffer: MutableAggregationBuffer): Unit = { > buffer(0) = 0L > buffer(1) = 1.0 > } > def update(buffer: MutableAggregationBuffer,input: Row): Unit = { > buffer(0) = buffer.getAs[Long](0) + 1 > buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0) > } > def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { > buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0) > buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1) > } > def evaluate(buffer: Row): Any = { > 
math.pow(buffer.getDouble(1), 1.0d / buffer.getLong(0)) > } > } > sqlContext.udf.register("gm", new GeometricMean) > val df = Seq( > (1, "italy", "emilia", 42, BigDecimal.valueOf(100, 0), "john"), > (2, "italy", "toscana", 42, BigDecimal.valueOf(505, 1), "jim"), > (3, "italy", "puglia", 42, BigDecimal.valueOf(70, 0), "jenn"), > (4, "italy", "emilia", 42, BigDecimal.valueOf(75, 0), "jack"), > (5, "uk", "london", 42, BigDecimal.valueOf(200, 0), "carl"), > (6, "italy", "emilia", 42, BigDecimal.valueOf(42, 0), "john")). > toDF("receipt_id", "store_country", "store_region", "store_id", "amount", > "seller_name") > df.registerTempTable("receipts") > > val q = sql(""" > select store_country, > store_region, > avg(amount), > sum(amount), > gm(amount) > from receipts > where amount > 50 > and store_country = 'italy' > group by store_country, store_region > """) > q.show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
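[Editor's note] The quoted GeometricMean UDAF keeps a (count, product) buffer, merges partial buffers across partitions, and finalizes with pow(product, 1/count). A minimal, Spark-free Python sketch of that same buffer logic (function names are illustrative, not part of the Spark API) shows why a broken merge step would produce nondeterministic results only in distributed runs:

```python
import math

def initialize():
    # (count, product) buffer, mirroring the UDAF's bufferSchema
    return (0, 1.0)

def update(buffer, value):
    # Called once per input row within a partition
    count, product = buffer
    return (count + 1, product * value)

def merge(b1, b2):
    # Combines partial aggregates from two partitions; in a distributed
    # run this is the step whose buffer handling must be correct.
    return (b1[0] + b2[0], b1[1] * b2[1])

def evaluate(buffer):
    # Geometric mean: nth root of the product of n values
    count, product = buffer
    return math.pow(product, 1.0 / count)

# Simulate two partitions, each seeing one value (100.0 and 75.0)
part1 = update(initialize(), 100.0)
part2 = update(initialize(), 75.0)
result = evaluate(merge(part1, part2))
print(result)  # geometric mean of 100 and 75, i.e. sqrt(7500)
```

Running the real repro above on a single local node tends to exercise only `update`/`evaluate`, which is consistent with the report that the bug shows up "with a cluster", where `merge` combines buffers from different executors.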
[jira] [Commented] (SPARK-12047) Unhelpful error messages generated by JavaDoc while doing sbt unidoc
[ https://issues.apache.org/jira/browse/SPARK-12047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053614#comment-15053614 ] Neelesh Srinivas Salian commented on SPARK-12047: - Closing these since they are duplicated by the above mentioned JIRAs. Thank you. > Unhelpful error messages generated by JavaDoc while doing sbt unidoc > > > Key: SPARK-12047 > URL: https://issues.apache.org/jira/browse/SPARK-12047 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Cheng Lian > > I'm not quite familiar with the internal mechanism of the SBT Unidoc plugin, > but it seems that it tries to convert Scala files into Java files and then > run {{javadoc}} over generated files to produces JavaDoc pages. > During this process, {{javadoc}} keeps producing unhelpful error messages > like: > {noformat} > [error] > /Users/lian/local/src/spark/branch-1.6/mllib/target/java/org/apache/spark/ml/PredictionModel.java:16: > error: unknown tag: group > [error] /** @group setParam */ > [error] ^ > [error] > /Users/lian/local/src/spark/branch-1.6/graphx/target/java/org/apache/spark/graphx/lib/PageRank.java:83: > error: unknown tag: tparam > [error]* @tparam ED the original edge attribute (not used) > [error] ^ > [error] > /Users/lian/local/src/spark/branch-1.6/core/target/java/org/apache/spark/ContextCleaner.java:76: > error: BlockManagerMaster is not public in org.apache.spark.storage; cannot > be accessed from outside package > [error] private org.apache.spark.storage.BlockManagerMaster > blockManagerMaster () { throw new RuntimeException(); } > [error]^ > [error] > /Users/lian/local/src/spark/branch-1.6/mllib/target/java/org/apache/spark/mllib/linalg/distributed/BlockMatrix.java:72: > error: reference not found > [error]* if it is being added to a {@link DenseMatrix}. 
If two dense > matrices are added, the output will > [error] ^ > {noformat} > The {{scaladoc}} tool also produces tons of warning messages like: > {noformat} > [warn] > /Users/lian/local/src/spark/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/Column.scala:1117: > Could not find any member to link for "StructField". > [warn] /** > [warn] ^ > {noformat} > (This one is probably because of > [SI-3695|https://issues.scala-lang.org/browse/SI-3695] and > [SI-8734|https://issues.scala-lang.org/browse/SI-8734]). > The problem is that they covered the real problems, and bring difficulty for > API doc auditing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12273) Spark Streaming Web UI does not list Receivers in order
[ https://issues.apache.org/jira/browse/SPARK-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-12273. -- Resolution: Fixed Assignee: (was: Apache Spark) Fix Version/s: 2.0.0 > Spark Streaming Web UI does not list Receivers in order > --- > > Key: SPARK-12273 > URL: https://issues.apache.org/jira/browse/SPARK-12273 > Project: Spark > Issue Type: Improvement > Components: Streaming, Web UI >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > Fix For: 2.0.0 > > Attachments: Spark-12273.png > > > Currently the Streaming web UI does NOT list Receivers in order, while it > seems more convenient for the users if Receivers are listed in order. > !Spark-12273.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12281) Fixed potential exceptions when exiting a local cluster.
[ https://issues.apache.org/jira/browse/SPARK-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053648#comment-15053648 ] Apache Spark commented on SPARK-12281: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/10269 > Fixed potential exceptions when exiting a local cluster. > > > Key: SPARK-12281 > URL: https://issues.apache.org/jira/browse/SPARK-12281 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Fixed the following potential exceptions when exiting a local cluster. > {code} > java.lang.AssertionError: assertion failed: executor 4 state transfer from > RUNNING to RUNNING is illegal > at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > java.lang.IllegalStateException: Shutdown hooks cannot be modified during > shutdown. 
> at > org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246) > at > org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191) > at > org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180) > at > org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73) > at > org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12285) MLlib user guide: umbrella for missing sections
Joseph K. Bradley created SPARK-12285: - Summary: MLlib user guide: umbrella for missing sections Key: SPARK-12285 URL: https://issues.apache.org/jira/browse/SPARK-12285 Project: Spark Issue Type: Umbrella Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley This is an umbrella for updating the MLlib user/programming guide for new APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11606) ML 1.6 QA: Update user guide for new APIs
[ https://issues.apache.org/jira/browse/SPARK-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053734#comment-15053734 ] Joseph K. Bradley commented on SPARK-11606: --- I'm going to split off the remaining guide sections into a new umbrella JIRA so that I can close this one. > ML 1.6 QA: Update user guide for new APIs > - > > Key: SPARK-11606 > URL: https://issues.apache.org/jira/browse/SPARK-11606 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > Check the user guide vs. a list of new APIs (classes, methods, data members) > to see what items require updates to the user guide. > For each feature missing user guide doc: > * Create a JIRA for that feature, and assign it to the author of the feature > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). > Note: Now that we have algorithms in spark.ml which are not in spark.mllib, > we should make subsections for the spark.ml API as needed. We can follow the > structure of the spark.mllib user guide. > * The spark.ml user guide can provide: (a) code examples and (b) info on > algorithms which do not exist in spark.mllib. > * We should not duplicate info in the spark.ml guides. Since spark.mllib is > still the primary API, we should provide links to the corresponding > algorithms in the spark.mllib user guide for more info. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11606) ML 1.6 QA: Update user guide for new APIs
[ https://issues.apache.org/jira/browse/SPARK-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-11606. --- Resolution: Fixed Fix Version/s: 1.6.0 > ML 1.6 QA: Update user guide for new APIs > - > > Key: SPARK-11606 > URL: https://issues.apache.org/jira/browse/SPARK-11606 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 1.6.0 > > > Check the user guide vs. a list of new APIs (classes, methods, data members) > to see what items require updates to the user guide. > For each feature missing user guide doc: > * Create a JIRA for that feature, and assign it to the author of the feature > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). > Note: Now that we have algorithms in spark.ml which are not in spark.mllib, > we should make subsections for the spark.ml API as needed. We can follow the > structure of the spark.mllib user guide. > * The spark.ml user guide can provide: (a) code examples and (b) info on > algorithms which do not exist in spark.mllib. > * We should not duplicate info in the spark.ml guides. Since spark.mllib is > still the primary API, we should provide links to the corresponding > algorithms in the spark.mllib user guide for more info. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at the first finding best splits
[ https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenmin Wu updated SPARK-12272: -- Attachment: (was: screenshot-1.png) > Gradient boosted trees: too slow at the first finding best splits > - > > Key: SPARK-12272 > URL: https://issues.apache.org/jira/browse/SPARK-12272 > Project: Spark > Issue Type: Request > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Wenmin Wu > Attachments: training-log1.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at the first finding best splits
[ https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenmin Wu updated SPARK-12272: -- Attachment: training-log2.pnd.png > Gradient boosted trees: too slow at the first finding best splits > - > > Key: SPARK-12272 > URL: https://issues.apache.org/jira/browse/SPARK-12272 > Project: Spark > Issue Type: Request > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Wenmin Wu > Attachments: training-log1.png, training-log2.pnd.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
Ryan Blue created SPARK-12297: - Summary: Add work-around for Parquet/Hive int96 timestamp bug. Key: SPARK-12297 URL: https://issues.apache.org/jira/browse/SPARK-12297 Project: Spark Issue Type: Task Components: Spark Core Reporter: Ryan Blue Hive has a bug where timestamps in Parquet data are incorrectly adjusted, as though converting from the SQL session time zone to UTC. This is incorrect behavior because the values are SQL timestamps without time zone and should not be internally changed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
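[Editor's note] The nature of the SPARK-12297 bug can be illustrated without Spark or Hive: a SQL "timestamp without time zone" is a pure wall-clock value, so applying any zone conversion to it silently changes what the user stored. A small Python sketch (the offset and dates are illustrative assumptions, not taken from the ticket):

```python
from datetime import datetime, timedelta

# A SQL "timestamp without time zone" is just a wall-clock value; it
# carries no zone, so no conversion should be applied on write or read.
stored = datetime(2015, 12, 11, 10, 30, 0)

# The buggy behavior: pretend the value is in the session time zone
# (say, UTC-8) and "normalize" it to UTC when writing to Parquet.
session_utc_offset = timedelta(hours=-8)  # illustrative session offset
written = stored - session_utc_offset     # shifted forward by 8 hours

print(stored)   # what the user stored: 2015-12-11 10:30:00
print(written)  # what lands in the file: 2015-12-11 18:30:00
```

A reader in a different session time zone would then shift the value again by a different offset, which is why a work-around flag (rather than a one-time rewrite) is being proposed.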
[jira] [Assigned] (SPARK-12281) Fixed potential exceptions when exiting a local cluster.
[ https://issues.apache.org/jira/browse/SPARK-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12281: Assignee: Apache Spark > Fixed potential exceptions when exiting a local cluster. > > > Key: SPARK-12281 > URL: https://issues.apache.org/jira/browse/SPARK-12281 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Fixed the following potential exceptions when exiting a local cluster. > {code} > java.lang.AssertionError: assertion failed: executor 4 state transfer from > RUNNING to RUNNING is illegal > at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > java.lang.IllegalStateException: Shutdown hooks cannot be modified during > shutdown. 
> at > org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246) > at > org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191) > at > org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180) > at > org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73) > at > org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11885) UDAF may nondeterministically generate wrong results
[ https://issues.apache.org/jira/browse/SPARK-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053634#comment-15053634 ] Milad Bourhani commented on SPARK-11885: Sure, I'll give it a go next week :) I'll write the results here. > UDAF may nondeterministically generate wrong results > > > Key: SPARK-11885 > URL: https://issues.apache.org/jira/browse/SPARK-11885 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yin Huai >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.3 > > > I could not reproduce it in 1.6 branch (it can be easily reproduced in 1.5). > I think it is an issue in 1.5 branch. > Try the following in spark 1.5 (with a cluster) and you can see the problem. > {code} > import java.math.BigDecimal > import org.apache.spark.sql.expressions.MutableAggregationBuffer > import org.apache.spark.sql.expressions.UserDefinedAggregateFunction > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StructType, StructField, DataType, > DoubleType, LongType} > class GeometricMean extends UserDefinedAggregateFunction { > def inputSchema: StructType = > StructType(StructField("value", DoubleType) :: Nil) > def bufferSchema: StructType = StructType( > StructField("count", LongType) :: > StructField("product", DoubleType) :: Nil > ) > def dataType: DataType = DoubleType > def deterministic: Boolean = true > def initialize(buffer: MutableAggregationBuffer): Unit = { > buffer(0) = 0L > buffer(1) = 1.0 > } > def update(buffer: MutableAggregationBuffer,input: Row): Unit = { > buffer(0) = buffer.getAs[Long](0) + 1 > buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0) > } > def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { > buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0) > buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1) > } > def evaluate(buffer: Row): Any = { > math.pow(buffer.getDouble(1), 1.0d / buffer.getLong(0)) > } > } > 
sqlContext.udf.register("gm", new GeometricMean) > val df = Seq( > (1, "italy", "emilia", 42, BigDecimal.valueOf(100, 0), "john"), > (2, "italy", "toscana", 42, BigDecimal.valueOf(505, 1), "jim"), > (3, "italy", "puglia", 42, BigDecimal.valueOf(70, 0), "jenn"), > (4, "italy", "emilia", 42, BigDecimal.valueOf(75, 0), "jack"), > (5, "uk", "london", 42, BigDecimal.valueOf(200, 0), "carl"), > (6, "italy", "emilia", 42, BigDecimal.valueOf(42, 0), "john")). > toDF("receipt_id", "store_country", "store_region", "store_id", "amount", > "seller_name") > df.registerTempTable("receipts") > > val q = sql(""" > select store_country, > store_region, > avg(amount), > sum(amount), > gm(amount) > from receipts > where amount > 50 > and store_country = 'italy' > group by store_country, store_region > """) > q.show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11497) PySpark RowMatrix Constructor Has Type Erasure Issue
[ https://issues.apache.org/jira/browse/SPARK-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11497: -- Target Version/s: 1.5.3, 1.6.1, 2.0.0 (was: 1.6.1, 2.0.0) > PySpark RowMatrix Constructor Has Type Erasure Issue > > > Key: SPARK-11497 > URL: https://issues.apache.org/jira/browse/SPARK-11497 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.5.0, 1.5.1, 1.6.0 >Reporter: Mike Dusenberry >Assignee: Mike Dusenberry >Priority: Minor > Fix For: 1.5.3, 1.6.1, 2.0.0 > > > Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark > RowMatrix constructor. As discussed on the dev list > [here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html], > there appears to be an issue with type erasure with RDDs coming from Java, > and by extension from PySpark. Although we are attempting to construct a > RowMatrix from an RDD[Vector] in > [PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115], > the Vector type is erased, resulting in an RDD[Object]. Thus, when calling > Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which > an Object cannot be cast to a Spark Vector. As noted in the aforementioned > dev list thread, this issue was also encountered with DecisionTrees, and the > fix involved an explicit retag of the RDD with a Vector type. Thus, this PR > will apply that fix to the createRowMatrix helper function in PythonMLlibAPI. > IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely > due to their related helper functions in PythonMLlibAPI creating the RDDs > explicitly from DataFrames with pattern matching, thus preserving the types. 
> The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0: > {code} > from pyspark.mllib.linalg.distributed import RowMatrix > rows = sc.parallelize([[3, -6], [4, -8], [0, 1]]) > mat = RowMatrix(rows) > mat._java_matrix_wrapper.call("tallSkinnyQR", True) > {code} > Should result in the following exception: > {code} > java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to > [Lorg.apache.spark.mllib.linalg.Vector; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11497) PySpark RowMatrix Constructor Has Type Erasure Issue
[ https://issues.apache.org/jira/browse/SPARK-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-11497. --- Resolution: Fixed Fix Version/s: 1.6.1 1.5.3 2.0.0 Issue resolved by pull request 9458 [https://github.com/apache/spark/pull/9458] > PySpark RowMatrix Constructor Has Type Erasure Issue > > > Key: SPARK-11497 > URL: https://issues.apache.org/jira/browse/SPARK-11497 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.5.0, 1.5.1, 1.6.0 >Reporter: Mike Dusenberry >Assignee: Mike Dusenberry >Priority: Minor > Fix For: 2.0.0, 1.5.3, 1.6.1 > > > Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark > RowMatrix constructor. As discussed on the dev list > [here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html], > there appears to be an issue with type erasure with RDDs coming from Java, > and by extension from PySpark. Although we are attempting to construct a > RowMatrix from an RDD[Vector] in > [PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115], > the Vector type is erased, resulting in an RDD[Object]. Thus, when calling > Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which > an Object cannot be cast to a Spark Vector. As noted in the aforementioned > dev list thread, this issue was also encountered with DecisionTrees, and the > fix involved an explicit retag of the RDD with a Vector type. Thus, this PR > will apply that fix to the createRowMatrix helper function in PythonMLlibAPI. > IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely > due to their related helper functions in PythonMLlibAPI creating the RDDs > explicitly from DataFrames with pattern matching, thus preserving the types. 
> The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0: > {code} > from pyspark.mllib.linalg.distributed import RowMatrix > rows = sc.parallelize([[3, -6], [4, -8], [0, 1]]) > mat = RowMatrix(rows) > mat._java_matrix_wrapper.call("tallSkinnyQR", True) > {code} > Should result in the following exception: > {code} > java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to > [Lorg.apache.spark.mllib.linalg.Vector; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053721#comment-15053721 ] Apache Spark commented on SPARK-10931: -- User 'evanyc15' has created a pull request for this issue: https://github.com/apache/spark/pull/10270 > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6518) Add example code and user guide for bisecting k-means
[ https://issues.apache.org/jira/browse/SPARK-6518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6518: - Issue Type: Sub-task (was: Documentation) Parent: SPARK-12285 > Add example code and user guide for bisecting k-means > - > > Key: SPARK-6518 > URL: https://issues.apache.org/jira/browse/SPARK-6518 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Reporter: Yu Ishikawa >Assignee: Yu Ishikawa > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12215) User guide section for KMeans in spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12215: -- Issue Type: Sub-task (was: Documentation) Parent: SPARK-12285 > User guide section for KMeans in spark.ml > - > > Key: SPARK-12215 > URL: https://issues.apache.org/jira/browse/SPARK-12215 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yu Ishikawa > > [~yuu.ishik...@gmail.com] Will you have time to add a user guide section for > this? Thanks in advance! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12291) Support UnsafeRow in BroadcastLeftSemiJoinHash
Davies Liu created SPARK-12291: -- Summary: Support UnsafeRow in BroadcastLeftSemiJoinHash Key: SPARK-12291 URL: https://issues.apache.org/jira/browse/SPARK-12291 Project: Spark Issue Type: Improvement Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6725) Model export/import for Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6725: - Comment: was deleted (was: User 'anabranch' has created a pull request for this issue: https://github.com/apache/spark/pull/10179) > Model export/import for Pipeline API > > > Key: SPARK-6725 > URL: https://issues.apache.org/jira/browse/SPARK-6725 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for adding model export/import to the spark.ml API. > This JIRA is for adding the internal Saveable/Loadable API and Parquet-based > format, not for other formats like PMML. > This will require the following steps: > * Add export/import for all PipelineStages supported by spark.ml > ** This will include some Transformers which are not Models. > ** These can use almost the same format as the spark.mllib model save/load > functions, but the model metadata must store a different class name (marking > the class as a spark.ml class). > * After all PipelineStages support save/load, add an interface which forces > future additions to support save/load. > *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. > Other libraries and formats can support this, and it would be great if we > could too. We could do either of the following: > * save() optionally takes a dataset (or schema), and load will return a > (model, schema) pair. > * Models themselves save the input schema. > Both options would mean inheriting from new Saveable, Loadable types. > *UPDATE: DESIGN DOC*: Here's a design doc which I wrote. If you have > comments about the planned implementation, please comment in this JIRA. > Thanks! 
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing]
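The save/load plan above pairs model data with metadata that records the spark.ml class name. A minimal sketch of that metadata round trip — the helper names and the dict-backed store are hypothetical, and plain JSON stands in for the real Parquet-based format:

```python
import json

def save_metadata(store, class_name, param_map):
    # Hypothetical helper: persist the spark.ml class name next to the
    # param values so load() can reconstruct the right PipelineStage.
    store["metadata"] = json.dumps({
        "class": class_name,   # marks this as a spark.ml class, per the JIRA
        "paramMap": param_map,
    })

def load_metadata(store):
    # Round-trip the metadata back into a dict.
    return json.loads(store["metadata"])
```

A real implementation would write this beside the Parquet data and validate the class name on load.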
[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver
[ https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053639#comment-15053639 ] Phil Kallos commented on SPARK-11193: - yes, code looks great to me, thanks JB and Sean. any indication that this will make the 1.6 release? > Spark 1.5+ Kinesis Streaming - ClassCastException when starting > KinesisReceiver > --- > > Key: SPARK-11193 > URL: https://issues.apache.org/jira/browse/SPARK-11193 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.0, 1.5.1 >Reporter: Phil Kallos > Attachments: screen.png > > > After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis > Spark Streaming application, and am being consistently greeted with this > exception: > java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast > to scala.collection.mutable.SynchronizedMap > at > org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532) > at > org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) > at > org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at 
java.lang.Thread.run(Thread.java:745) > Worth noting that I am able to reproduce this issue locally, and also on > Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0). > Also, I am not able to run the included kinesis-asl example. > Built locally using: > git checkout v1.5.1 > mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package > Example run command: > bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector > https://kinesis.us-east-1.amazonaws.com
[jira] [Updated] (SPARK-12282) Document spark.jars
[ https://issues.apache.org/jira/browse/SPARK-12282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12282: -- Priority: Trivial (was: Major) Component/s: Documentation I don't see evidence this is intended to be exposed to end users, so I'd close this. > Document spark.jars > --- > > Key: SPARK-12282 > URL: https://issues.apache.org/jira/browse/SPARK-12282 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Justin Bailey >Priority: Trivial > > The spark.jars property (as implemented in SparkSubmit.scala, > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L516) > is not documented anywhere, and should be.
[jira] [Commented] (SPARK-11959) Document normal equation solver for ordinary least squares in user guide
[ https://issues.apache.org/jira/browse/SPARK-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053743#comment-15053743 ] Joseph K. Bradley commented on SPARK-11959: --- [~yanboliang] Will you have time to write this guide section? If not, please let me know. > Document normal equation solver for ordinary least squares in user guide > > > Key: SPARK-11959 > URL: https://issues.apache.org/jira/browse/SPARK-11959 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Assigning since you wrote the feature, but please reassign as needed.
[jira] [Updated] (SPARK-11959) Document normal equation solver for ordinary least squares in user guide
[ https://issues.apache.org/jira/browse/SPARK-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11959: -- Issue Type: Sub-task (was: Documentation) Parent: SPARK-12285 > Document normal equation solver for ordinary least squares in user guide > > > Key: SPARK-11959 > URL: https://issues.apache.org/jira/browse/SPARK-11959 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Assigning since you wrote the feature, but please reassign as needed.
[jira] [Updated] (SPARK-11965) Update user guide for RFormula feature interactions
[ https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11965: -- Issue Type: Sub-task (was: Documentation) Parent: SPARK-12285 > Update user guide for RFormula feature interactions > --- > > Key: SPARK-11965 > URL: https://issues.apache.org/jira/browse/SPARK-11965 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Joseph K. Bradley > > Update the user guide for RFormula to cover feature interactions
[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at finding the first best splits
[ https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenmin Wu updated SPARK-12272: -- Attachment: training-log1.png > Gradient boosted trees: too slow at finding the first best splits > - > > Key: SPARK-12272 > URL: https://issues.apache.org/jira/browse/SPARK-12272 > Project: Spark > Issue Type: Request > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Wenmin Wu > Attachments: training-log1.png >
[jira] [Created] (SPARK-12281) Fixed potential exceptions when exiting a local cluster.
Shixiong Zhu created SPARK-12281: Summary: Fixed potential exceptions when exiting a local cluster. Key: SPARK-12281 URL: https://issues.apache.org/jira/browse/SPARK-12281 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Shixiong Zhu Fixed the following potential exceptions when exiting a local cluster. {code} java.lang.AssertionError: assertion failed: executor 4 state transfer from RUNNING to RUNNING is illegal at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} {code} java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown. 
at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246) at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191) at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180) at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code}
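The second stack trace is a startup/shutdown race: ExecutorRunner.start tries to register a shutdown hook after the shutdown sequence has already begun. A small sketch of the guard involved — the manager class below is a hypothetical stand-in, not Spark's ShutdownHookManager:

```python
import threading

class ShutdownHookManager:
    """Illustrates the race: adding a hook after shutdown has begun
    must fail, so callers should check or tolerate that state."""
    def __init__(self):
        self._lock = threading.Lock()
        self._shutting_down = False
        self._hooks = []

    def add(self, hook):
        with self._lock:
            if self._shutting_down:
                # Mirrors the IllegalStateException in the trace above.
                raise RuntimeError(
                    "Shutdown hooks cannot be modified during shutdown.")
            self._hooks.append(hook)

    def run_all(self):
        with self._lock:
            self._shutting_down = True
            hooks = list(self._hooks)
        for hook in hooks:
            hook()
```

Fixing the race means either registering hooks before shutdown can start or swallowing this error when the process is already exiting.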
[jira] [Updated] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12247: -- Parent Issue: SPARK-12285 (was: SPARK-8517) > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS
[jira] [Updated] (SPARK-11529) Add section in user guide for StreamingLogisticRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-11529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11529: -- Issue Type: Sub-task (was: Documentation) Parent: SPARK-12285 > Add section in user guide for StreamingLogisticRegressionWithSGD > > > Key: SPARK-11529 > URL: https://issues.apache.org/jira/browse/SPARK-11529 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Reporter: Joseph K. Bradley > > [~freeman-lab] Would you be able to do this for 1.6? Or if there are others > who can, could you please ping them? Thanks!
[jira] [Created] (SPARK-12293) Support UnsafeRow in LocalTableScan
Davies Liu created SPARK-12293: -- Summary: Support UnsafeRow in LocalTableScan Key: SPARK-12293 URL: https://issues.apache.org/jira/browse/SPARK-12293 Project: Spark Issue Type: Improvement Reporter: Davies Liu
[jira] [Created] (SPARK-12294) Support UnsafeRow in HiveTableScan
Davies Liu created SPARK-12294: -- Summary: Support UnsafeRow in HiveTableScan Key: SPARK-12294 URL: https://issues.apache.org/jira/browse/SPARK-12294 Project: Spark Issue Type: Improvement Reporter: Davies Liu
[jira] [Comment Edited] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053808#comment-15053808 ] Evan Chen edited comment on SPARK-10931 at 12/11/15 11:51 PM: -- Hey Joseph, Thanks for the suggestion. What model abstraction and getattr method are you referring to? I modified every model on the Python side to reflect how it is being done on the Scala side. Let me know what you think. was (Author: evanchen92): Hey Joseph, Thanks for the suggestion. I was wondering what model abstraction and getattr method are you referring to? I modified every model on the Python side to reflect how it is being done on the Scala side. Let me know what you think. > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests.
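The fix direction from the issue description — have Estimator.fit copy Param values onto the returned model — can be sketched in plain Python. The class names below are stand-ins for illustration, not the actual pyspark.ml API:

```python
class Model:
    """Stand-in for a pyspark.ml Model wrapping a Java object."""
    def __init__(self, java_model):
        self.java_model = java_model
        self.param_map = {}  # empty until fit() copies params over

class Estimator:
    """Stand-in estimator whose fit() copies its params to the model."""
    def __init__(self, **params):
        self.param_map = dict(params)

    def fit(self, dataset):
        model = Model(java_model=None)  # the JVM handle is elided here
        # The copy step the JIRA asks for: without it, the Python-side
        # model carries no Param values at all.
        model.param_map = dict(self.param_map)
        return model
```

Copying (rather than sharing) the dict keeps later changes to the estimator's params from silently mutating an already-fitted model.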
[jira] [Created] (SPARK-12296) Feature parity for pyspark.mllib StandardScalerModel
Joseph K. Bradley created SPARK-12296: - Summary: Feature parity for pyspark.mllib StandardScalerModel Key: SPARK-12296 URL: https://issues.apache.org/jira/browse/SPARK-12296 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Joseph K. Bradley Priority: Minor Some methods are missing, such as ways to access the std, mean, etc. This JIRA is for feature parity for pyspark.mllib.feature.StandardScaler & StandardScalerModel
[jira] [Updated] (SPARK-12296) Feature parity for pyspark.mllib StandardScalerModel
[ https://issues.apache.org/jira/browse/SPARK-12296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12296: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-11937 > Feature parity for pyspark.mllib StandardScalerModel > > > Key: SPARK-12296 > URL: https://issues.apache.org/jira/browse/SPARK-12296 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > Some methods are missing, such as ways to access the std, mean, etc. This > JIRA is for feature parity for pyspark.mllib.feature.StandardScaler & > StandardScalerModel
[jira] [Commented] (SPARK-6523) Error when getting an attribute of StandardScalerModel when using the Python API
[ https://issues.apache.org/jira/browse/SPARK-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053830#comment-15053830 ] Joseph K. Bradley commented on SPARK-6523: -- You're right; sorry I did not see that PR as it was put into Spark. I just made one specific to your need: [SPARK-12296] > Error when getting an attribute of StandardScalerModel when using the Python API > > > Key: SPARK-6523 > URL: https://issues.apache.org/jira/browse/SPARK-6523 > Project: Spark > Issue Type: Bug > Components: Examples, MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: lee.xiaobo.2006 > > test code > === > from pyspark import SparkConf, SparkContext > from pyspark.mllib.util import MLUtils > from pyspark.mllib.linalg import Vectors > from pyspark.mllib.feature import StandardScaler > conf = SparkConf().setAppName('Test') > sc = SparkContext(conf=conf) > data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt") > label = data.map(lambda x: x.label) > features = data.map(lambda x: x.features) > scaler1 = StandardScaler().fit(features) > print scaler1.std # error > sc.stop() > --- > error: > Traceback (most recent call last): > File "/data1/s/apps/spark-app/app/test_ssm.py", line 22, in > print scaler1.std > AttributeError: 'StandardScalerModel' object has no attribute 'std' > 15/03/25 12:17:28 INFO Utils: path = > /data1/s/apps/spark-1.4.0-SNAPSHOT/data/spark-eb1ed7c0-a5ce-4748-a817-3cb0687ee282/blockmgr-5398b477-127d-4259-a71b-608a324e1cd3, > already present as root for deletion. > = > Another question: how to serialize or save the scaler model?
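The missing-attribute error above is what SPARK-12296 tracks. One way to close the gap is to expose std and mean as properties that delegate to the wrapped JVM model. A sketch with a simplified stand-in for the JVM wrapper — the dict-backed `call` here is hypothetical, not pyspark's actual delegation mechanism:

```python
class JavaModelWrapper:
    """Simplified stand-in: real pyspark delegates `call` to the JVM."""
    def __init__(self, values):
        self._values = values

    def call(self, name):
        return self._values[name]

class StandardScalerModel(JavaModelWrapper):
    # Without properties like these, `model.std` raises AttributeError,
    # exactly as in the traceback quoted above.
    @property
    def std(self):
        return self.call("std")

    @property
    def mean(self):
        return self.call("mean")
```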
[jira] [Assigned] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10931: Assignee: Apache Spark > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests.
[jira] [Assigned] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10931: Assignee: (was: Apache Spark) > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests.
[jira] [Created] (SPARK-12287) Support UnsafeRow in MapPartitions/MapGroups/CoGroup
Davies Liu created SPARK-12287: -- Summary: Support UnsafeRow in MapPartitions/MapGroups/CoGroup Key: SPARK-12287 URL: https://issues.apache.org/jira/browse/SPARK-12287 Project: Spark Issue Type: Epic Reporter: Davies Liu
[jira] [Created] (SPARK-12295) Manage the memory used by window function
Davies Liu created SPARK-12295: -- Summary: Manage the memory used by window function Key: SPARK-12295 URL: https://issues.apache.org/jira/browse/SPARK-12295 Project: Spark Issue Type: Improvement Reporter: Davies Liu The buffered rows for a given frame should use UnsafeRow and be stored as pages.
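The idea of buffering frame rows in pages rather than as one heap object per row can be sketched abstractly. The class below is my own illustration and ignores the actual UnsafeRow binary encoding:

```python
class RowPageBuffer:
    """Illustrative page-based row buffer: rows are appended into
    fixed-size pages, and iteration walks the pages in order."""
    def __init__(self, page_size=1024):
        self.page_size = page_size
        self.pages = [[]]

    def append(self, row):
        if len(self.pages[-1]) >= self.page_size:
            self.pages.append([])  # start a new page when the last is full
        self.pages[-1].append(row)

    def __iter__(self):
        for page in self.pages:
            yield from page
```

Paging lets the buffer grow in coarse memory-manager-visible chunks instead of per-row allocations, which is the point of the proposal.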
[jira] [Commented] (SPARK-11606) ML 1.6 QA: Update user guide for new APIs
[ https://issues.apache.org/jira/browse/SPARK-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053790#comment-15053790 ] Joseph K. Bradley commented on SPARK-11606: --- I'll close this now that [SPARK-12285] contains the remaining open tasks. > ML 1.6 QA: Update user guide for new APIs > - > > Key: SPARK-11606 > URL: https://issues.apache.org/jira/browse/SPARK-11606 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > Check the user guide vs. a list of new APIs (classes, methods, data members) > to see what items require updates to the user guide. > For each feature missing user guide doc: > * Create a JIRA for that feature, and assign it to the author of the feature > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). > Note: Now that we have algorithms in spark.ml which are not in spark.mllib, > we should make subsections for the spark.ml API as needed. We can follow the > structure of the spark.mllib user guide. > * The spark.ml user guide can provide: (a) code examples and (b) info on > algorithms which do not exist in spark.mllib. > * We should not duplicate info in the spark.ml guides. Since spark.mllib is > still the primary API, we should provide links to the corresponding > algorithms in the spark.mllib user guide for more info.
[jira] [Updated] (SPARK-12217) Document invalid handling for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12217: -- Assignee: Benjamin Fradet (was: Apache Spark) > Document invalid handling for StringIndexer > --- > > Key: SPARK-12217 > URL: https://issues.apache.org/jira/browse/SPARK-12217 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Benjamin Fradet >Assignee: Benjamin Fradet >Priority: Minor > Fix For: 1.6.1, 2.0.0 > > > Documentation is needed regarding the handling of invalid labels in > StringIndexer
[jira] [Commented] (SPARK-12183) Remove spark.mllib tree, forest implementations and use spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053823#comment-15053823 ] Joseph K. Bradley commented on SPARK-12183: --- Lower priority than both, really. This is more of a clean-up task. We could still improve the spark.ml code without doing this task, and GBT can be handled as a separate JIRA. I'd say moving GBT code to spark.ml is higher priority than this since that is blocking adding more output columns to GBTs (rawPrediction, probability). > Remove spark.mllib tree, forest implementations and use spark.ml > > > Key: SPARK-12183 > URL: https://issues.apache.org/jira/browse/SPARK-12183 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > This JIRA is for replacing the spark.mllib decision tree and random forest > implementations with the one from spark.ml. The spark.ml one should be used > as a wrapper. This should involve moving the implementation, but should > probably not require changing the tests (much). > This blocks on 1 improvement to spark.mllib which needs to be ported to > spark.ml: [SPARK-10064]
[jira] [Comment Edited] (SPARK-12282) Document spark.jars
[ https://issues.apache.org/jira/browse/SPARK-12282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053676#comment-15053676 ] Justin Bailey edited comment on SPARK-12282 at 12/11/15 10:18 PM: -- If you pass {{--conf spark.jars=".."}}, you can set this flag, which is actually pretty useful (it's a consistent way to set configuration). So maybe {{spark-submit}} should warn or throw if this configuration is included? was (Author: m4dc4p): If you pass `--conf spark.jars=".."`, you can set this flag, which is actually pretty useful (its a consistent way to set configuration). So maybe spark-submit should warn or throw if this configuration is included? > Document spark.jars > --- > > Key: SPARK-12282 > URL: https://issues.apache.org/jira/browse/SPARK-12282 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Justin Bailey >Priority: Trivial > > The spark.jars property (as implemented in SparkSubmit.scala, > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L516) > is not documented anywhere, and should be.
[jira] [Commented] (SPARK-12282) Document spark.jars
[ https://issues.apache.org/jira/browse/SPARK-12282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053676#comment-15053676 ] Justin Bailey commented on SPARK-12282: --- If you pass `--conf spark.jars=".."`, you can set this flag, which is actually pretty useful (it's a consistent way to set configuration). So maybe spark-submit should warn or throw if this configuration is included? > Document spark.jars > --- > > Key: SPARK-12282 > URL: https://issues.apache.org/jira/browse/SPARK-12282 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Justin Bailey >Priority: Trivial > > The spark.jars property (as implemented in SparkSubmit.scala, > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L516) > is not documented anywhere, and should be.
[jira] [Created] (SPARK-12292) Support UnsafeRow in Generate
Davies Liu created SPARK-12292: -- Summary: Support UnsafeRow in Generate Key: SPARK-12292 URL: https://issues.apache.org/jira/browse/SPARK-12292 Project: Spark Issue Type: Improvement Reporter: Davies Liu
[jira] [Resolved] (SPARK-12217) Document invalid handling for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-12217. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10257 [https://github.com/apache/spark/pull/10257] > Document invalid handling for StringIndexer > --- > > Key: SPARK-12217 > URL: https://issues.apache.org/jira/browse/SPARK-12217 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Benjamin Fradet >Assignee: Benjamin Fradet >Priority: Minor > Fix For: 2.0.0, 1.6.1 > > > Documentation is needed regarding the handling of invalid labels in > StringIndexer
[jira] [Issue Comment Deleted] (SPARK-11937) Python API coverage check found issues for ML during 1.6 QA
[ https://issues.apache.org/jira/browse/SPARK-11937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11937: -- Comment: was deleted (was: User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/10085) > Python API coverage check found issues for ML during 1.6 QA > --- > > Key: SPARK-11937 > URL: https://issues.apache.org/jira/browse/SPARK-11937 > Project: Spark > Issue Type: Umbrella > Components: Documentation, ML, MLlib, PySpark >Reporter: Yanbo Liang > > Here is the todo list of SPARK-11604 found issues: > Note: I did not list the SparkR-related features (such as > ml.feature.Interaction). We have supported RFormula as a wrapper on the Python > side; I think we should discuss whether it is necessary to support other R-related > features on the Python side. > * Missing classes > ** ml.attribute SPARK-8516 > ** ml.feature > *** QuantileDiscretizer SPARK-11922 > *** ChiSqSelector SPARK-11923 > ** ml.classification > *** OneVsRest SPARK-7861 > ** ml.clustering > *** LDA SPARK-11940 > ** mllib.clustering > *** BisectingKMeans SPARK-11944 > * Missing methods/parameters SPARK-11938 > ** ml.classification SPARK-11815 SPARK-11820 > ** ml.feature SPARK-11925 > ** ml.clustering SPARK-11945 > ** mllib.linalg SPARK-12040 SPARK-12041 > ** mllib.stat.test.StreamingTest SPARK-12042 > * Docs: > ** ml.classification SPARK-11875
[jira] [Commented] (SPARK-12272) Gradient boosted trees: too slow at finding the first best splits
[ https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053827#comment-15053827 ] Wenmin Wu commented on SPARK-12272: --- This wasn't a synthetic test; it was run on my company's click-through data, which has 28905 features and 18644639 records. I trained a GBDT model with 200 trees (equal to iterations) and maxDepth = 7. As 'training-log1' shows, finding the first splits takes 9.7 min, whereas the single-node xgboost implementation takes less than 10 secs. At first I thought this was due to the statistics communication, but I looked into the detailed log of a single executor, as 'training-log2' shows: within the single executor these steps take 8 - 9 min. I persist all the data in memory, as shown in 'training-log3'. I also looked into the source of the GBDT implementation in Spark and found that the time complexity of finding the first split is O(K * N), which is the same as in xgboost. So I am asking how I can accelerate GBDT training with Spark. > Gradient boosted trees: too slow at finding the first best splits > - > > Key: SPARK-12272 > URL: https://issues.apache.org/jira/browse/SPARK-12272 > Project: Spark > Issue Type: Request > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Wenmin Wu > Attachments: training-log1.png, training-log2.pnd.png, > training-log3.png > >
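For context on the O(K * N) claim: exhaustive split search scans each feature's sorted values once, accumulating label statistics as it goes, so the cost is one linear pass per feature. A self-contained sketch for a single feature under squared-error impurity — my own illustration, not Spark's implementation:

```python
def best_split(xs, ys):
    """Find the best threshold for one feature by a single sorted scan.
    Over K features this repeats K times, giving the O(K * N) cost
    discussed in the comment (plus the per-feature sort)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    total, n = float(sum(ys)), len(ys)
    left_sum = 0.0
    best_gain, best_thr = float("-inf"), None
    for rank, i in enumerate(order[:-1], start=1):
        left_sum += ys[i]
        right_sum = total - left_sum
        # Maximizing sum(side_sum^2 / side_count) minimizes squared error.
        gain = left_sum ** 2 / rank + right_sum ** 2 / (n - rank)
        nxt = order[rank]
        if xs[nxt] != xs[i] and gain > best_gain:  # only split between distinct values
            best_gain = gain
            best_thr = (xs[i] + xs[nxt]) / 2.0
    return best_thr
```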
[jira] [Commented] (SPARK-10285) Add @since annotation to pyspark.ml.util
[ https://issues.apache.org/jira/browse/SPARK-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053933#comment-15053933 ] Joseph K. Bradley commented on SPARK-10285: --- I'll close the issue. Thanks! > Add @since annotation to pyspark.ml.util > > > Key: SPARK-10285 > URL: https://issues.apache.org/jira/browse/SPARK-10285 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Xiangrui Meng >Assignee: Yu Ishikawa >Priority: Minor > Labels: starter >
[jira] [Closed] (SPARK-10285) Add @since annotation to pyspark.ml.util
[ https://issues.apache.org/jira/browse/SPARK-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-10285. - Resolution: Not A Problem Target Version/s: (was: 1.6.0) > Add @since annotation to pyspark.ml.util > > > Key: SPARK-10285 > URL: https://issues.apache.org/jira/browse/SPARK-10285 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Xiangrui Meng >Assignee: Yu Ishikawa >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10263) Add @Since annotation to ml.param and ml.*
[ https://issues.apache.org/jira/browse/SPARK-10263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10263: -- Target Version/s: (was: 1.6.0) > Add @Since annotation to ml.param and ml.* > -- > > Key: SPARK-10263 > URL: https://issues.apache.org/jira/browse/SPARK-10263 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Assignee: Hiroshi Takahashi >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12301) Remove final from classes in spark.ml trees and ensembles where possible
Joseph K. Bradley created SPARK-12301: - Summary: Remove final from classes in spark.ml trees and ensembles where possible Key: SPARK-12301 URL: https://issues.apache.org/jira/browse/SPARK-12301 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley There have been continuing requests (e.g., [SPARK-7131]) for allowing users to extend and modify MLlib models and algorithms. I want this to happen for the next release. For GBT, this may need to wait on some refactoring (to move the implementation to spark.ml). But it could be done for trees already. This will be broken into subtasks. If you are a user who needs these changes, please comment here about what specifically needs to be modified for your use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7131) Move tree,forest implementation from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15054000#comment-15054000 ] Joseph K. Bradley commented on SPARK-7131: -- Yes, I'm sorry about how long this has taken, but I have enough confidence in the API now to proceed. I've created a JIRA for doing this in the next release: [SPARK-12301], though I may not be able to look at this issue until January. Please post your thoughts there, and ping in early January if there is no activity. Thank you! > Move tree,forest implementation from spark.mllib to spark.ml > > > Key: SPARK-7131 > URL: https://issues.apache.org/jira/browse/SPARK-7131 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 1.5.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > We want to change and improve the spark.ml API for trees and ensembles, but > we cannot change the old API in spark.mllib. To support the changes we want > to make, we should move the implementation from spark.mllib to spark.ml. We > will generalize and modify it, but will also ensure that we do not change the > behavior of the old API. > There are several steps to this: > 1. Copy the implementation over to spark.ml and change the spark.ml classes > to use that implementation, rather than calling the spark.mllib > implementation. The current spark.ml tests will ensure that the 2 > implementations learn exactly the same models. Note: This should include > performance testing to make sure the updated code does not have any > regressions. --> *UPDATE*: I have run tests using spark-perf, and there were > no regressions. > 2. Remove the spark.mllib implementation, and make the spark.mllib APIs > wrappers around the spark.ml implementation. The spark.ml tests will again > ensure that we do not change any behavior. > 3. 
Move the unit tests to spark.ml, and change the spark.mllib unit tests to > verify model equivalence. > This JIRA is now for step 1 only. Steps 2 and 3 will be in separate JIRAs. > After these updates, we can more safely generalize and improve the spark.ml > implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053917#comment-15053917 ] holdenk edited comment on SPARK-2870 at 12/12/15 1:27 AM: -- So this seems to be resolved in Spark 1.6 with {code}createDataFrame{code} e.g.: {code} input = [{"a": 1}, {"b": "coffee"}] rdd = sc.parallelize(input) df = sqlContext.createDataFrame(rdd, samplingRatio=1.0) print df.schema {code} Results in {code}StructType(List(StructField(a,LongType,true),StructField(b,StringType,true))){code} Do you think it's OK to close this issue? was (Author: holdenk): So this seems to be resolved in Spark 1.6 with `createDataFrame` e.g.: {code} input = [{"a": 1}, {"b": "coffee"}] rdd = sc.parallelize(input) df = sqlContext.createDataFrame(rdd, samplingRatio=1.0) print df.schema {code} Results in `StructType(List(StructField(a,LongType,true),StructField(b,StringType,true)))` Do you think it's OK to close this issue? > Thorough schema inference directly on RDDs of Python dictionaries > - > > Key: SPARK-2870 > URL: https://issues.apache.org/jira/browse/SPARK-2870 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Reporter: Nicholas Chammas > > h4. Background > I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. > They process JSON text directly and infer a schema that covers the entire > source data set. > This is very important with semi-structured data like JSON since individual > elements in the data set are free to have different structures. Matching > fields across elements may even have different value types. > For example: > {code} > {"a": 5} > {"a": "cow"} > {code} > To get a queryable schema that covers the whole data set, you need to infer a > schema by looking at the whole data set. The aforementioned > {{SQLContext.json...()}} methods do this very well. > h4. Feature Request > What we need is for {{SQLContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of > Python dictionaries and does something functionally equivalent to this: > {code} > SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x))) > {code} > As of 1.0.2, > [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] > just looks at the first element in the data set. This won't help much when > the structure of the elements in the target RDD is variable. > h4. Example Use Case > * You have some JSON text data that you want to analyze using Spark SQL. > * You would use one of the {{SQLContext.json...()}} methods, but you need to > do some filtering on the data first to remove bad elements--basically, some > minimal schema validation. > * You deserialize the JSON objects to Python {{dict}} s and filter out the > bad ones. You now have an RDD of dictionaries. > * From this RDD, you want a SchemaRDD that captures the schema for the whole > data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053917#comment-15053917 ] holdenk edited comment on SPARK-2870 at 12/12/15 1:26 AM: -- So this seems to be resolved in Spark 1.6 with `createDataFrame` e.g.: {code} input = [{"a": 1}, {"b": "coffee"}] rdd = sc.parallelize(input) df = sqlContext.createDataFrame(rdd, samplingRatio=1.0) print df.schema {code} Results in `StructType(List(StructField(a,LongType,true),StructField(b,StringType,true)))` Do you think it's OK to close this issue? was (Author: holdenk): So this seems to be resolved in Spark 1.6 with `createDataFrame` e.g.: {code:python} input = [{"a": 1}, {"b": "coffee"}] rdd = sc.parallelize(input) df = sqlContext.createDataFrame(rdd, samplingRatio=1.0) print df.schema {code} Results in `StructType(List(StructField(a,LongType,true),StructField(b,StringType,true)))` Do you think it's OK to close this issue? > Thorough schema inference directly on RDDs of Python dictionaries > - > > Key: SPARK-2870 > URL: https://issues.apache.org/jira/browse/SPARK-2870 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Reporter: Nicholas Chammas > > h4. Background > I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. > They process JSON text directly and infer a schema that covers the entire > source data set. > This is very important with semi-structured data like JSON since individual > elements in the data set are free to have different structures. Matching > fields across elements may even have different value types. > For example: > {code} > {"a": 5} > {"a": "cow"} > {code} > To get a queryable schema that covers the whole data set, you need to infer a > schema by looking at the whole data set. The aforementioned > {{SQLContext.json...()}} methods do this very well. > h4. Feature Request > What we need is for {{SQLContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of > Python dictionaries and does something functionally equivalent to this: > {code} > SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x))) > {code} > As of 1.0.2, > [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] > just looks at the first element in the data set. This won't help much when > the structure of the elements in the target RDD is variable. > h4. Example Use Case > * You have some JSON text data that you want to analyze using Spark SQL. > * You would use one of the {{SQLContext.json...()}} methods, but you need to > do some filtering on the data first to remove bad elements--basically, some > minimal schema validation. > * You deserialize the JSON objects to Python {{dict}} s and filter out the > bad ones. You now have an RDD of dictionaries. > * From this RDD, you want a SchemaRDD that captures the schema for the whole > data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053917#comment-15053917 ] holdenk commented on SPARK-2870: So this seems to be resolved in Spark 1.6 with `createDataFrame` e.g.: {code:python} input = [{"a": 1}, {"b": "coffee"}] rdd = sc.parallelize(input) df = sqlContext.createDataFrame(rdd, samplingRatio=1.0) print df.schema {code} Results in `StructType(List(StructField(a,LongType,true),StructField(b,StringType,true)))` Do you think it's OK to close this issue? > Thorough schema inference directly on RDDs of Python dictionaries > - > > Key: SPARK-2870 > URL: https://issues.apache.org/jira/browse/SPARK-2870 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Reporter: Nicholas Chammas > > h4. Background > I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. > They process JSON text directly and infer a schema that covers the entire > source data set. > This is very important with semi-structured data like JSON since individual > elements in the data set are free to have different structures. Matching > fields across elements may even have different value types. > For example: > {code} > {"a": 5} > {"a": "cow"} > {code} > To get a queryable schema that covers the whole data set, you need to infer a > schema by looking at the whole data set. The aforementioned > {{SQLContext.json...()}} methods do this very well. > h4. Feature Request > What we need is for {{SQLContext.inferSchema()}} to do this, too. > Alternatively, we need a new {{SQLContext}} method that works on RDDs of > Python dictionaries and does something functionally equivalent to this: > {code} > SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x))) > {code} > As of 1.0.2, > [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] > just looks at the first element in the data set. 
This won't help much when > the structure of the elements in the target RDD is variable. > h4. Example Use Case > * You have some JSON text data that you want to analyze using Spark SQL. > * You would use one of the {{SQLContext.json...()}} methods, but you need to > do some filtering on the data first to remove bad elements--basically, some > minimal schema validation. > * You deserialize the JSON objects to Python {{dict}} s and filter out the > bad ones. You now have an RDD of dictionaries. > * From this RDD, you want a SchemaRDD that captures the schema for the whole > data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
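The requested "thorough" behavior — scanning every element and merging per-element schemas — can be sketched in plain Python. This is a hypothetical illustration with invented names, not PySpark's actual inference code; the widening-to-string fallback on type conflicts only loosely mirrors what the `jsonRDD` path does:

```python
# Hypothetical sketch: infer a schema from ALL dicts in a data set, not just
# the first element, merging field types and widening on conflicts.

def infer_type(value):
    # Check bool before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"

def infer_schema(records):
    """Merge field -> type over all records; conflicts widen to 'string'."""
    schema = {}
    for record in records:
        for field, value in record.items():
            t = infer_type(value)
            if field not in schema:
                schema[field] = t
            elif schema[field] != t:
                schema[field] = "string"  # widen on conflicting types
    return schema

data = [{"a": 5}, {"a": "cow", "b": 1.5}]
schema = infer_schema(data)
# schema: {"a": "string", "b": "double"} -- "a" widened because 5 and "cow"
# conflict, exactly the case a first-element-only inference would miss.
```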
[jira] [Commented] (SPARK-12298) Infinite loop in DataFrame.sortWithinPartitions(String, String*)
[ https://issues.apache.org/jira/browse/SPARK-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053978#comment-15053978 ] Apache Spark commented on SPARK-12298: -- User 'ankurdave' has created a pull request for this issue: https://github.com/apache/spark/pull/10271 > Infinite loop in DataFrame.sortWithinPartitions(String, String*) > > > Key: SPARK-12298 > URL: https://issues.apache.org/jira/browse/SPARK-12298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > > The String overload of DataFrame.sortWithinPartitions calls itself when it > should call the Column overload, causing an infinite loop: > {code} > Exception in thread "main" java.lang.StackOverflowError > at > org.apache.spark.sql.DataFrame.sortWithinPartitions(DataFrame.scala:612) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
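The bug pattern is simple enough to illustrate outside Scala. Below is a hypothetical Python analogue (class and method names invented): one "overload" is supposed to delegate to its sibling but calls itself, so the very first call recurses until the stack limit — the same failure mode as the StackOverflowError in the report:

```python
# Hypothetical analogue of SPARK-12298: a method that should delegate to a
# sibling "overload" accidentally calls itself and recurses forever.

class BuggyFrame:
    def sort_by_names(self, *cols):
        # BUG: should call self.sort_by_columns(...); calls itself instead.
        return self.sort_by_names(*cols)

class FixedFrame:
    def sort_by_names(self, *cols):
        # FIX: delegate to the sibling method, as the String overload of
        # sortWithinPartitions should delegate to the Column overload.
        return self.sort_by_columns(*cols)

    def sort_by_columns(self, *cols):
        return list(cols)

try:
    BuggyFrame().sort_by_names("a")
    looped_forever = False
except RecursionError:  # Python's analogue of java.lang.StackOverflowError
    looped_forever = True

result = FixedFrame().sort_by_names("a", "b")
```

In Scala the mistake is easy to make because both overloads share a name and the varargs String version can satisfy its own signature; the fix in the linked pull request simply routes the String overload through the Column overload.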
[jira] [Assigned] (SPARK-9578) Stemmer feature transformer
[ https://issues.apache.org/jira/browse/SPARK-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9578: --- Assignee: (was: Apache Spark) > Stemmer feature transformer > --- > > Key: SPARK-9578 > URL: https://issues.apache.org/jira/browse/SPARK-9578 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Transformer mentioned first in [SPARK-5571] based on suggestion from > [~aloknsingh]. Very standard NLP preprocessing task. > From [~aloknsingh]: > {quote} > We have one Scala stemmer in scalanlp%chalk > https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze > which can easily be copied (as it is an Apache project) and is in Scala too. > I think this will be a better alternative than Lucene's EnglishAnalyzer or > OpenNLP. > Note: we already use scalanlp%breeze via the Maven dependency, so I think > adding a scalanlp%chalk dependency is also an option. But as you said, we > can copy the code as it is small. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
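To make the proposal concrete, here is a deliberately tiny, hypothetical suffix-stripping stemmer in plain Python. It is not the Porter-style algorithm in scalanlp/chalk — just an illustration of the token normalization such a Transformer would perform on a tokenized text column:

```python
# Hypothetical, minimal suffix-stripping stemmer: strip the first matching
# suffix (checked longest-first), keeping a stem of at least 3 characters.
# Real stemmers (Porter, Snowball) apply ordered, conditional rewrite rules.

SUFFIXES = ("ingly", "edly", "ing", "ed", "es", "s")

def stem(token):
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = ["running", "jumped", "cats", "walk"]
stems = [stem(t) for t in tokens]
# stems: ["runn", "jump", "cat", "walk"] -- note "runn", not "run":
# a naive stripper has no rule to undo consonant doubling, which is
# exactly why borrowing a tested implementation was suggested above.
```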