[jira] [Created] (SPARK-10630) createDataFrame from a Java List
Xiangrui Meng created SPARK-10630: - Summary: createDataFrame from a Java List Key: SPARK-10630 URL: https://issues.apache.org/jira/browse/SPARK-10630 Project: Spark Issue Type: New Feature Components: SQL Reporter: Xiangrui Meng Priority: Minor It would be nice to support creating a DataFrame directly from a Java List of Row: {code} def createDataFrame(data: java.util.List[Row], schema: StructType): DataFrame {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
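The proposed overload would most naturally bridge the Java list to a Scala collection and delegate to the existing Scala-side `createDataFrame`. A minimal plain-Scala sketch of that conversion step (no Spark required; `toScalaSeq` and the String stand-in for `Row` are illustrative, not Spark's actual code):

```scala
import scala.collection.JavaConverters._

// Hypothetical helper: bridge a java.util.List to a Scala List, the first
// step such a createDataFrame overload would likely take before delegating
// to the existing Scala API. Strings stand in for Row to keep this runnable.
def toScalaSeq[T](data: java.util.List[T]): List[T] = data.asScala.toList

val javaRows = new java.util.ArrayList[String]()
javaRows.add("row1")
javaRows.add("row2")

val rows = toScalaSeq(javaRows)
println(rows) // List(row1, row2)
```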
[jira] [Created] (SPARK-10631) Add missing API doc in pyspark.mllib.linalg.Vector
Xiangrui Meng created SPARK-10631: - Summary: Add missing API doc in pyspark.mllib.linalg.Vector Key: SPARK-10631 URL: https://issues.apache.org/jira/browse/SPARK-10631 Project: Spark Issue Type: Documentation Components: Documentation, MLlib, PySpark Reporter: Xiangrui Meng Priority: Minor There are some missing API docs in pyspark.mllib.linalg.Vector (including DenseVector and SparseVector). We should add them based on their Scala counterparts.
[jira] [Resolved] (SPARK-10516) Add values as a property to DenseVector in PySpark
[ https://issues.apache.org/jira/browse/SPARK-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10516. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8682 [https://github.com/apache/spark/pull/8682] > Add values as a property to DenseVector in PySpark > -- > > Key: SPARK-10516 > URL: https://issues.apache.org/jira/browse/SPARK-10516 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Vinod KC >Priority: Trivial > Labels: starter > Fix For: 1.6.0 > > > We use `values` in Scala but `array` in PySpark. We should add `values` as a > property to match the Scala implementation.
[jira] [Commented] (SPARK-10630) createDataFrame from a Java List
[ https://issues.apache.org/jira/browse/SPARK-10630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746968#comment-14746968 ] holdenk commented on SPARK-10630: - Sounds good to me :) I'll give it a shot :) > createDataFrame from a Java List > - > > Key: SPARK-10630 > URL: https://issues.apache.org/jira/browse/SPARK-10630 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Xiangrui Meng >Priority: Minor > > It would be nice to support creating a DataFrame directly from a Java List of > Row: > {code} > def createDataFrame(data: java.util.List[Row], schema: StructType): DataFrame > {code}
[jira] [Updated] (SPARK-5314) java.lang.OutOfMemoryError in SparkSQL with GROUP BY
[ https://issues.apache.org/jira/browse/SPARK-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5314: - Assignee: Michael Armbrust > java.lang.OutOfMemoryError in SparkSQL with GROUP BY > > > Key: SPARK-5314 > URL: https://issues.apache.org/jira/browse/SPARK-5314 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Alex Baretta >Assignee: Michael Armbrust > Fix For: 1.5.0 > > > I am running a SparkSQL GROUP BY query on a largish Parquet table (a few > hundred million rows), weighing in at about 50GB. My cluster has 1.7 TB of > RAM, so it should have plenty of resources to cope with this query. > WARN TaskSetManager: Lost task 279.0 in stage 22.0 (TID 1229, > ds-model-w-21.c.eastern-gravity-771.internal): java.lang.OutOfMemoryError: GC > overhead limit exceeded > at scala.collection.SeqLike$class.distinct(SeqLike.scala:493) > at scala.collection.AbstractSeq.distinct(Seq.scala:40) > at > org.apache.spark.sql.catalyst.expressions.Coalesce.resolved$lzycompute(nullFunctions.scala:33) > at > org.apache.spark.sql.catalyst.expressions.Coalesce.resolved(nullFunctions.scala:33) > at > org.apache.spark.sql.catalyst.expressions.Coalesce.dataType(nullFunctions.scala:37) > at > org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:100) > at > org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:101) > at > org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:50) > at > org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:81) > at > org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:571) > at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167) > at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151) > at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) > at 
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745)
[jira] [Updated] (SPARK-5397) Assigning aliases to several return values of an UDF
[ https://issues.apache.org/jira/browse/SPARK-5397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5397: - Assignee: Michael Armbrust > Assigning aliases to several return values of an UDF > > > Key: SPARK-5397 > URL: https://issues.apache.org/jira/browse/SPARK-5397 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Max >Assignee: Michael Armbrust > Fix For: 1.5.0 > > > The query with the following syntax is not valid SQL in Spark due to the assignment > of multiple aliases. > So it does not seem possible to port former HiveQL queries with UDFs > returning multiple values to Spark SQL. > Query > > SELECT my_function(param_one, param_two) AS (return_one, return_two, > return_three) > FROM my_table; > Error > > Unsupported language features in query: SELECT my_function(param_one, > param_two) AS (return_one, return_two, return_three) > FROM my_table; > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > my_table > TOK_SELECT > TOK_SELEXPR > TOK_FUNCTION > my_function > TOK_TABLE_OR_COL > param_one > TOK_TABLE_OR_COL > param_two > return_one > return_two > return_three
[jira] [Updated] (SPARK-5302) Add support for SQLContext "partition" columns
[ https://issues.apache.org/jira/browse/SPARK-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5302: - Assignee: Cheng Lian > Add support for SQLContext "partition" columns > -- > > Key: SPARK-5302 > URL: https://issues.apache.org/jira/browse/SPARK-5302 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Bob Tiernay >Assignee: Cheng Lian > Fix For: 1.4.0 > > > For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to > support a virtual column that maps to part of the file path, similar to > what is done in Hive for partitions (e.g. {{/data/clicks/dt=2015-01-01/}} > where {{dt}} is a column of type {{TEXT}}). > The API could allow the user to type the column using an appropriate > {{DataType}} instance. This new field could be addressed in SQL statements > much the same as is done in Hive. > As a consequence, pruning of partitions would be possible when executing a > query, and it would also remove the need to materialize a column in each logical > partition that is already encoded in the path name. Furthermore, this would > provide a nice interop and migration strategy for Hive users who may one day > use {{SQLContext}} directly.
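The Hive-style layout described above can be sketched in plain Scala. This helper is purely illustrative (Spark's real partition discovery also handles type inference, escaping, and nested partition levels); the path and column names are made up:

```scala
// Illustrative only: pull a partition column's value out of a Hive-style
// path such as /data/clicks/dt=2015-01-01/part-00000.
def partitionValue(path: String, column: String): Option[String] = {
  // Match "<column>=<value>" as one path segment; quote the column name
  // so regex metacharacters in it are treated literally.
  val pattern = (java.util.regex.Pattern.quote(column) + "=([^/]+)").r
  pattern.findFirstMatchIn(path).map(_.group(1))
}

println(partitionValue("/data/clicks/dt=2015-01-01/part-00000", "dt")) // Some(2015-01-01)
println(partitionValue("/data/clicks/part-00000", "dt"))               // None
```

With such a virtual column, a predicate like `dt = '2015-01-01'` could prune whole directories without reading their files.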
[jira] [Updated] (SPARK-5421) SparkSql throw OOM at shuffle
[ https://issues.apache.org/jira/browse/SPARK-5421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5421: - Assignee: Michael Armbrust > SparkSql throw OOM at shuffle > - > > Key: SPARK-5421 > URL: https://issues.apache.org/jira/browse/SPARK-5421 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Hong Shen >Assignee: Michael Armbrust > Fix For: 1.5.0 > > > ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, > but Spark SQL's ShuffledRDD does not define an aggregator, so Spark SQL won't spill > at shuffle, and it's very easy to throw OOM at shuffle. I think Spark SQL also > needs to spill at shuffle. > Logs from one of the executors; here is stderr: > 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs > for shuffle 1, fetching them > 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker > actor = > Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484] > 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations > 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 > non-empty blocks out of 143 blocks > 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote > fetches in 72 ms > 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED > SIGNAL 15: SIGTERM > here is stdout: > 2015-01-27T07:44:43.487+0800: [Full GC 3961343K->3959868K(3961344K), > 29.8959290 secs] > 2015-01-27T07:45:13.460+0800: [Full GC 3961343K->3959992K(3961344K), > 27.9218150 secs] > 2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs] > 2015-01-27T07:45:52.950+0800: [Full GC 3961343K->3960113K(3961344K), > 29.3894670 secs] > 2015-01-27T07:46:22.393+0800: [Full GC 3961118K->3960240K(3961344K), > 28.9879600 secs] > 2015-01-27T07:46:51.393+0800: [Full GC 3960240K->3960213K(3961344K), > 34.1530900 secs] > # > # java.lang.OutOfMemoryError: Java heap 
space > # -XX:OnOutOfMemoryError="kill %p" > # Executing /bin/sh -c "kill 9050"... > 2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs]
[jira] [Updated] (SPARK-8786) Create a wrapper for BinaryType
[ https://issues.apache.org/jira/browse/SPARK-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8786: - Assignee: Josh Rosen > Create a wrapper for BinaryType > --- > > Key: SPARK-8786 > URL: https://issues.apache.org/jira/browse/SPARK-8786 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Josh Rosen > Fix For: 1.5.0 > > > The hashCode and equals() of Array[Byte] do not check the bytes; we should > create a wrapper (internally) to do that.
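The motivation is easy to demonstrate in plain Scala: JVM arrays use reference equality, so two `Array[Byte]` with identical contents neither compare equal nor hash alike. The wrapper below is a hypothetical sketch of the idea, not Spark's actual internal class:

```scala
val a = Array[Byte](1, 2, 3)
val b = Array[Byte](1, 2, 3)
// Arrays compare by reference, so identical contents are not "equal":
println(a == b) // false

// Hypothetical content-based wrapper (illustrative, not Spark's class):
final class ByteArrayWrapper(val bytes: Array[Byte]) {
  override def equals(other: Any): Boolean = other match {
    case w: ByteArrayWrapper => java.util.Arrays.equals(bytes, w.bytes)
    case _                   => false
  }
  // Hash by content so equal wrappers land in the same hash bucket.
  override def hashCode: Int = java.util.Arrays.hashCode(bytes)
}

println(new ByteArrayWrapper(a) == new ByteArrayWrapper(b)) // true
```

Without such a wrapper, binary values used as keys in aggregations or joins would be grouped by reference rather than by value.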
[jira] [Updated] (SPARK-6632) Optimize the parquetSchema to metastore schema reconciliation, so that the process is delegated to each map task itself
[ https://issues.apache.org/jira/browse/SPARK-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6632: - Assignee: Michael Armbrust > Optimize the parquetSchema to metastore schema reconciliation, so that the > process is delegated to each map task itself > --- > > Key: SPARK-6632 > URL: https://issues.apache.org/jira/browse/SPARK-6632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Yash Datta >Assignee: Michael Armbrust > Fix For: 1.5.0 > > > Currently in ParquetRelation2, the schema from all the part files is first > merged, and then reconciled with the metastore schema. This approach does not > scale when we have thousands of partitions in the table. We can take a > different approach where we go ahead with the metastore schema, and > reconcile the names of the columns within each map task, using the ReadSupport > hooks provided in Parquet.
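A hedged plain-Scala sketch of the proposed direction: treat the metastore schema as authoritative and, inside each task, match Parquet footer column names to the metastore names case-insensitively. The `Field` type and string-typed data types are invented for illustration; Spark's real reconciliation operates on `StructType` and Parquet's `ReadSupport`:

```scala
// Illustrative model of per-task name reconciliation (not Spark's code).
case class Field(name: String, dataType: String)

def reconcile(metastore: Seq[Field], parquetFooter: Seq[Field]): Seq[Field] = {
  // Index the footer columns by lower-cased name for case-insensitive lookup.
  val byLowerName = parquetFooter.map(f => f.name.toLowerCase -> f).toMap
  metastore.map { m =>
    byLowerName.get(m.name.toLowerCase)
      .map(p => m.copy(name = p.name)) // keep parquet's exact spelling for reads
      .getOrElse(m)                    // column absent from this file: keep as-is
  }
}

val meta   = Seq(Field("id", "int"), Field("userName", "string"))
val footer = Seq(Field("ID", "int32"))
println(reconcile(meta, footer)) // List(Field(ID,int), Field(userName,string))
```

Because each task only consults the footer of the files it actually reads, no global merge over thousands of part files is needed.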
[jira] [Commented] (SPARK-10631) Add missing API doc in pyspark.mllib.linalg.Vector
[ https://issues.apache.org/jira/browse/SPARK-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746994#comment-14746994 ] Vinod KC commented on SPARK-10631: -- I'm working on this > Add missing API doc in pyspark.mllib.linalg.Vector > -- > > Key: SPARK-10631 > URL: https://issues.apache.org/jira/browse/SPARK-10631 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Reporter: Xiangrui Meng >Priority: Minor > > There are some missing API docs in pyspark.mllib.linalg.Vector (including > DenseVector and SparseVector). We should add them based on their Scala > counterparts.
[jira] [Commented] (SPARK-10577) [PySpark] DataFrame hint for broadcast join
[ https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747058#comment-14747058 ] Maciej Bryński commented on SPARK-10577: Tested this patch today on 1.5.0. Works great. Thank you. > [PySpark] DataFrame hint for broadcast join > --- > > Key: SPARK-10577 > URL: https://issues.apache.org/jira/browse/SPARK-10577 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.0 >Reporter: Maciej Bryński > Labels: starter > > As in https://issues.apache.org/jira/browse/SPARK-8300 > there should be the possibility to add a hint for broadcast join in: > - PySpark
[jira] [Updated] (SPARK-10508) incorrect evaluation of searched case expression
[ https://issues.apache.org/jira/browse/SPARK-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10508: -- Assignee: Josh Rosen > incorrect evaluation of searched case expression > > > Key: SPARK-10508 > URL: https://issues.apache.org/jira/browse/SPARK-10508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: N Campbell >Assignee: Josh Rosen > Fix For: 1.4.1, 1.5.0 > > > The following case expression never evaluates to 'test1' when cdec is -1 or > 10, as it does for Hive 0.13. Instead it returns 'other' for all rows. > {code} > select rnum, cdec, case when cdec in ( -1,10,0.1 ) then 'test1' else 'other' > end from tdec > create table if not exists TDEC ( RNUM int , CDEC decimal(7, 2 )) > TERMINATED BY '\n' > STORED AS orc ; > 0|\N > 1|-1.00 > 2|0.00 > 3|1.00 > 4|0.10 > 5|10.00 > {code}
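The expected semantics can be modeled in plain Scala with `BigDecimal`, whose equality, like SQL's, ignores scale: the integer IN-list literal `-1` should match the `DECIMAL(7,2)` value `-1.00` after coercion. This models what the query should return for the sample rows; it is not Spark's evaluator:

```scala
// Model of the expected CASE ... WHEN cdec IN (-1, 10, 0.1) semantics.
// Scala's BigDecimal compares by value, so -1.00 matches the literal -1.
val inList = Seq(BigDecimal(-1), BigDecimal(10), BigDecimal("0.1"))

def caseLabel(cdec: Option[BigDecimal]): String =
  if (cdec.exists(inList.contains)) "test1" else "other"

println(caseLabel(Some(BigDecimal("-1.00")))) // test1 (row 1)
println(caseLabel(Some(BigDecimal("10.00")))) // test1 (row 5)
println(caseLabel(Some(BigDecimal("0.00"))))  // other (row 2)
println(caseLabel(None))                      // other (NULL, row 0)
```

The reported bug was that Spark returned 'other' even for the -1.00 and 10.00 rows, where the first two calls above show 'test1' is expected.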
[jira] [Updated] (SPARK-10632) Cannot save DataFrame with User Defined Types
[ https://issues.apache.org/jira/browse/SPARK-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joao updated SPARK-10632: - Description: Cannot save DataFrames that contain user-defined types. At first I thought it was a problem with my UDT class, then I tried the Vector class from MLlib and the error was the same. The code below should reproduce the error. {noformat} val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)), (2, Vectors.dense(2, 2, 2)))).toDF() df.write.format("json").mode(SaveMode.Overwrite).save(path) {noformat} The error log is below {noformat} 15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task. scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243) at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' closed. Now beginning upload 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' upload complete 15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt attempt_201509160958__m_00_0 aborted. 15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at
[jira] [Updated] (SPARK-10632) Cannot save DataFrame with User Defined Types
[ https://issues.apache.org/jira/browse/SPARK-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joao updated SPARK-10632: - Description: Cannot save DataFrames that contain user-defined types. I tried to save a DataFrame with instances of the Vector class from MLlib and got the error. The code below should reproduce the error. {noformat} val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)), (2, Vectors.dense(2, 2, 2)))).toDF() df.write.format("json").mode(SaveMode.Overwrite).save(path) {noformat} The error log is below {noformat} 15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task. scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243) at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' closed. Now beginning upload 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' upload complete 15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt attempt_201509160958__m_00_0 aborted. 15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at
[jira] [Updated] (SPARK-3231) select on a table in parquet format containing smallint as a field type does not work
[ https://issues.apache.org/jira/browse/SPARK-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3231: - Assignee: Alex Rovner > select on a table in parquet format containing smallint as a field type does > not work > - > > Key: SPARK-3231 > URL: https://issues.apache.org/jira/browse/SPARK-3231 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 > Environment: The table is created through Hive-0.13. > SparkSql 1.1 is used. >Reporter: chirag aggarwal >Assignee: Alex Rovner > Fix For: 1.5.0 > > > A table is created through Hive. This table has a field of type smallint. The > format of the table is Parquet. > A select on this table works perfectly in the Hive shell. > But when the select is run on this table from spark-sql, the query > fails. > Steps to reproduce the issue: > -- > hive> create table abct (a smallint, b int) row format delimited fields > terminated by '|' stored as textfile; > A text file is stored in hdfs for this table. 
> hive> create table abc (a smallint, b int) stored as parquet; > hive> insert overwrite table abc select * from abct; > hive> select * from abc; > 2 1 > 2 2 > 2 3 > spark-sql> select * from abc; > 10:08:46 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to > stage failure: Task 0.0 in stage 33.0 (TID 2340) had a not serializable > result: org.apache.hadoop.io.IntWritable > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1158) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1147) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1146) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1146) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:685) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:685) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:685) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > But if the type of this table is now changed to int, then spark-sql gives > the correct results. > hive> alter table abc change a a int; > spark-sql> select * from abc; > 2 1 > 2 2 > 2 3
[jira] [Updated] (SPARK-3617) Configurable case sensitivity
[ https://issues.apache.org/jira/browse/SPARK-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3617: - Assignee: Michael Armbrust > Configurable case sensitivity > - > > Key: SPARK-3617 > URL: https://issues.apache.org/jira/browse/SPARK-3617 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.5.0 > > > Right now SQLContext is case sensitive and HiveContext is not. It would be > better to make it configurable in both instances. All of the underlying > plumbing is there; we just need to expose an option.
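The option could boil down to choosing a name-equality function at analysis time. A minimal plain-Scala sketch of that idea (illustrative; the actual change would wire such a setting through SQLContext/HiveContext configuration):

```scala
// A "resolver" decides whether two identifiers refer to the same column.
// Making case sensitivity configurable means picking this function from config.
type Resolver = (String, String) => Boolean

def resolver(caseSensitive: Boolean): Resolver =
  if (caseSensitive) (a, b) => a == b
  else (a, b) => a.equalsIgnoreCase(b)

val strict  = resolver(caseSensitive = true)  // SQLContext's current behavior
val lenient = resolver(caseSensitive = false) // HiveContext's current behavior
println(strict("userName", "username"))  // false
println(lenient("userName", "username")) // true
```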
[jira] [Updated] (SPARK-2824) Allow saving Parquet files to the HiveMetastore
[ https://issues.apache.org/jira/browse/SPARK-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2824: - Assignee: Michael Armbrust > Allow saving Parquet files to the HiveMetastore > --- > > Key: SPARK-2824 > URL: https://issues.apache.org/jira/browse/SPARK-2824 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Aaron Davidson >Assignee: Michael Armbrust > Fix For: 1.5.0 > > > Currently, we use one code path for reading all data from the Hive Metastore, > and this precludes writing or loading data as custom ParquetRelations (they > are always MetastoreRelations).
[jira] [Updated] (SPARK-2695) Figure out a good way to handle NullType columns.
[ https://issues.apache.org/jira/browse/SPARK-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2695: - Assignee: Michael Armbrust > Figure out a good way to handle NullType columns. > - > > Key: SPARK-2695 > URL: https://issues.apache.org/jira/browse/SPARK-2695 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Michael Armbrust > Fix For: 1.5.0 > > > For example, if a JSON dataset has a field which only has null values, we > will get NullType when we turn that dataset into a SchemaRDD. We need to think > about what is a good way to deal with NullType columns.
[jira] [Created] (SPARK-10632) Cannot save DataFrame with User Defined Types
Joao created SPARK-10632: Summary: Cannot save DataFrame with User Defined Types Key: SPARK-10632 URL: https://issues.apache.org/jira/browse/SPARK-10632 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Joao Cannot save DataFrames that contain user-defined types. At first I thought it was a problem with my UDT class, then I tried the Vector class from MLlib and the error was the same. The code below should reproduce the error. {noformat} val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)), (2, Vectors.dense(2, 2, 2)))).toDF() df.write.format("json").mode(SaveMode.Overwrite).save(path) {noformat} The error log is below: {noformat} 15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task. scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185) at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' closed. Now beginning upload 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' upload complete 15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt attempt_201509160958__m_00_0 aborted. 15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at
[jira] [Commented] (SPARK-10629) Gradient boosted trees: mapPartitions input size increasing
[ https://issues.apache.org/jira/browse/SPARK-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747088#comment-14747088 ] Sean Owen commented on SPARK-10629: --- That sounds like the same issue in SPARK-10433; it's not clear that whatever it is only manifests as constantly increasing size. In any event, you should of course try 1.5+ to see if the fix fixes this anyway. I'd personally close this, given this information, until you are certain it happens now in the latest code, in which case it is something else. > Gradient boosted trees: mapPartitions input size increasing > > > Key: SPARK-10629 > URL: https://issues.apache.org/jira/browse/SPARK-10629 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.1 >Reporter: Wenmin Wu > > First of all, I think my problem is quite different from > https://issues.apache.org/jira/browse/SPARK-10433, which points out that the input > size increases at each iteration. > My problem is that the mapPartitions input size increases within one iteration. My > training samples have 2958359 features in total. Within one iteration, three > collectAsMap operations were called. Here is a summary of each call:
> | Stage Id | Description | Duration | Input | Shuffle Read | Shuffle Write |
> |:--:|:---|:--:|:--:|:--:|:--:|
> | 4 | mapPartitions at DecisionTree.scala:613 | 1.6 h | 710.2 MB | | 2.8 GB |
> | 5 | collectAsMap at DecisionTree.scala:642 | 1.8 min | | 2.8 GB | |
> | 6 | mapPartitions at DecisionTree.scala:613 | 1.2 h | 27.0 GB | | 5.6 GB |
> | 7 | collectAsMap at DecisionTree.scala:642 | 2.0 min | | 5.6 GB | |
> | 8 | mapPartitions at DecisionTree.scala:613 | 1.2 h | 26.5 GB | | 11.1 GB |
> | 9 | collectAsMap at DecisionTree.scala:642 | 2.0 min | | 8.3 GB | |
> The mapPartitions operations take far too long! It's so strange! I wonder whether there is a bug here. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType
[ https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747089#comment-14747089 ] Antonio Jesus Navarro commented on SPARK-10485: --- Now, the bug is into the checking of the DataTypes: https://github.com/apache/spark/blob/branch-1.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala#L33 else if (trueValue.dataType != falseValue.dataType) { TypeCheckResult.TypeCheckFailure(s"differing types in '$prettyString' " + s"(${trueValue.dataType.simpleString} and ${falseValue.dataType.simpleString}).") } If the true expression or the false expression is a NullType, now the type validation fails. > IF expression is not correctly resolved when one of the options have NullType > - > > Key: SPARK-10485 > URL: https://issues.apache.org/jira/browse/SPARK-10485 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Antonio Jesus Navarro > > If we have this query: > {code} > SELECT IF(column > 1, 1, NULL) FROM T1 > {code} > On Spark 1.4.1 we have this: > {code} > override lazy val resolved = childrenResolved && trueValue.dataType == > falseValue.dataType > {code} > So if one of the types is NullType, the if expression is not resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType
[ https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747089#comment-14747089 ] Antonio Jesus Navarro edited comment on SPARK-10485 at 9/16/15 7:32 AM: Now, the bug is into the checking of the DataTypes: https://github.com/apache/spark/blob/branch-1.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala#L33 {code} else if (trueValue.dataType != falseValue.dataType) { TypeCheckResult.TypeCheckFailure(s"differing types in '$prettyString' " + s"(${trueValue.dataType.simpleString} and ${falseValue.dataType.simpleString}).") } {code} If the true expression or the false expression is a NullType, now the type validation fails. was (Author: ajnavarro): Now, the bug is into the checking of the DataTypes: https://github.com/apache/spark/blob/branch-1.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala#L33 else if (trueValue.dataType != falseValue.dataType) { TypeCheckResult.TypeCheckFailure(s"differing types in '$prettyString' " + s"(${trueValue.dataType.simpleString} and ${falseValue.dataType.simpleString}).") } If the true expression or the false expression is a NullType, now the type validation fails. > IF expression is not correctly resolved when one of the options have NullType > - > > Key: SPARK-10485 > URL: https://issues.apache.org/jira/browse/SPARK-10485 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Antonio Jesus Navarro > > If we have this query: > {code} > SELECT IF(column > 1, 1, NULL) FROM T1 > {code} > On Spark 1.4.1 we have this: > {code} > override lazy val resolved = childrenResolved && trueValue.dataType == > falseValue.dataType > {code} > So if one of the types is NullType, the if expression is not resolved. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
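An illustrative direction for the fix (a sketch only, not the committed Spark patch): the type check could treat a NullType branch as compatible with the other branch's type, so that `IF(column > 1, 1, NULL)` resolves to `IntegerType`:

```scala
// Sketch under the assumption that a NullType branch should coerce to the
// sibling branch's type. branchTypesCompatible is a hypothetical helper,
// not part of Spark's actual API.
import org.apache.spark.sql.types.{DataType, NullType}

def branchTypesCompatible(trueType: DataType, falseType: DataType): Boolean =
  trueType == falseType || trueType == NullType || falseType == NullType

// IF(column > 1, 1, NULL): IntegerType vs NullType -> compatible,
// and the expression's result type would be the non-null branch's IntegerType.
```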
[jira] [Updated] (SPARK-3700) Improve the performance of scanning JSON datasets
[ https://issues.apache.org/jira/browse/SPARK-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3700: - Assignee: Yanbo Liang > Improve the performance of scanning JSON datasets > - > > Key: SPARK-3700 > URL: https://issues.apache.org/jira/browse/SPARK-3700 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yanbo Liang > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3833) Allow Spark SQL SchemaRDDs to be merged
[ https://issues.apache.org/jira/browse/SPARK-3833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3833: - Assignee: Michael Armbrust > Allow Spark SQL SchemaRDDs to be merged > --- > > Key: SPARK-3833 > URL: https://issues.apache.org/jira/browse/SPARK-3833 > Project: Spark > Issue Type: Wish > Components: SQL >Reporter: Chris Wood >Assignee: Michael Armbrust > Fix For: 1.4.0 > > > We have JSON flowing into Spark SQL. > I can successfully store them as parquet and read them with > sqlContext.jsonRDD, but the inferred schemas cannot be merged into a single > table to do queries. > I'd like a way to allow for parquet file schemas to be merged, whether they > match or not, since we know the schema should be a union of the schemas from > the files. > This will allow us to have the data define the schema and new columns will > just appear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
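The wish above was eventually served by Parquet schema merging in Spark SQL. A hedged sketch of the pattern, assuming a Spark 1.5-era `sqlContext` and illustrative paths under `/tmp/events`:

```scala
// Sketch: Parquet files with different-but-compatible schemas are unioned
// via the mergeSchema read option. Data and paths here are made up.
val batch1 = sqlContext.read.json(sc.parallelize(Seq("""{"id": 1}""")))
batch1.write.parquet("/tmp/events/day=1")

val batch2 = sqlContext.read.json(sc.parallelize(Seq("""{"id": 2, "country": "US"}""")))
batch2.write.parquet("/tmp/events/day=2")

// mergeSchema unions the per-file schemas, so new columns "just appear"
val all = sqlContext.read.option("mergeSchema", "true").parquet("/tmp/events")
all.printSchema()  // schema contains id plus the later-added country column
```

This matches the requested behavior: the data defines the schema, and a column introduced in a later batch shows up (null for earlier rows) without redefining any table.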
[jira] [Updated] (SPARK-3978) Schema change on Spark-Hive (Parquet file format) table not working
[ https://issues.apache.org/jira/browse/SPARK-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3978: - Assignee: Alex Rovner > Schema change on Spark-Hive (Parquet file format) table not working > --- > > Key: SPARK-3978 > URL: https://issues.apache.org/jira/browse/SPARK-3978 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Nilesh Barge >Assignee: Alex Rovner > Fix For: 1.5.0 > > > On following releases: > Spark 1.1.0 (built using sbt/sbt -Dhadoop.version=2.2.0 -Phive assembly) , > Apache HDFS 2.2 > Spark job is able to create/add/read data in hive, parquet formatted, tables > using HiveContext. > But, after changing schema, spark job is not able to read data and throws > following exception: > java.lang.ArrayIndexOutOfBoundsException: 2 > at > org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getStructFieldData(ArrayWritableObjectInspector.java:127) > > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:284) > > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:278) > > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) > > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) > > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > > at java.lang.Thread.run(Thread.java:744) > code snippet in short: > hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS people_table (name > String, age INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' > OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'"); > hiveContext.sql("INSERT INTO TABLE people_table SELECT name, age FROM > temp_table_people1"); > hiveContext.sql("SELECT * FROM people_table"); //Here, data read was > successful. > hiveContext.sql("ALTER TABLE people_table ADD COLUMNS (gender STRING)"); > hiveContext.sql("SELECT * FROM people_table"); //Not able to read existing > data and ArrayIndexOutOfBoundsException is thrown. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5738) Reuse mutable row for each record at jsonStringToRow
[ https://issues.apache.org/jira/browse/SPARK-5738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5738: - Assignee: Yanbo Liang > Reuse mutable row for each record at jsonStringToRow > > > Key: SPARK-5738 > URL: https://issues.apache.org/jira/browse/SPARK-5738 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Other table-scan-like operations, such as ParquetTableScan and HiveTableScan, use > a reusable mutable row for serialization to decrease garbage. We should apply the same > optimization to JSONRelation#buildScan(). > When serializing a JSON string to a row, reuse a mutable row for each record > and for inner nested structures instead of creating a new one each time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
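The row-reuse idea above is a general allocation-avoidance pattern. A generic sketch of it (deliberately simplified, not Spark's internal implementation):

```scala
// Generic sketch of the reuse pattern: one mutable buffer overwritten per
// record instead of a fresh allocation per record. Names are illustrative.
final class ReusableRow(arity: Int) {
  val values = new Array[Any](arity)
}

def toRows(records: Iterator[Array[String]], arity: Int): Iterator[ReusableRow] = {
  val row = new ReusableRow(arity)      // allocated once for the whole scan
  records.map { fields =>
    var i = 0
    while (i < arity) { row.values(i) = fields(i); i += 1 }
    row                                 // caller must consume/copy before next()
  }
}
```

The trade-off is the usual one for this pattern: far less garbage per record, at the cost that downstream code must copy the row before buffering it, since the same object is handed out on every iteration.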
[jira] [Updated] (SPARK-3804) Output of Generator expressions is not stable after serialization.
[ https://issues.apache.org/jira/browse/SPARK-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3804: - Assignee: Michael Armbrust > Output of Generator expressions is not stable after serialization. > -- > > Key: SPARK-3804 > URL: https://issues.apache.org/jira/browse/SPARK-3804 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.5.0 > > > The output of a generator expression (such as explode) can change after > serialization. This caused a nasty data corruption bug. While this bug has > since been addressed, it would be good to fix this since it violates the > general contract of query plans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5823) Reuse mutable rows for inner structures when parsing JSON objects
[ https://issues.apache.org/jira/browse/SPARK-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5823: - Assignee: Yanbo Liang > Reuse mutable rows for inner structures when parsing JSON objects > - > > Key: SPARK-5823 > URL: https://issues.apache.org/jira/browse/SPARK-5823 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yanbo Liang > > With SPARK-5738, we will reuse a mutable row for rows when parsing JSON > objects. We can do the same thing for inner structures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10511) Source releases should not include maven jars
[ https://issues.apache.org/jira/browse/SPARK-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10511. --- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Resolved by https://github.com/apache/spark/pull/8774 > Source releases should not include maven jars > - > > Key: SPARK-10511 > URL: https://issues.apache.org/jira/browse/SPARK-10511 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.0 >Reporter: Patrick Wendell >Assignee: Luciano Resende >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > I noticed our source jars seemed really big for 1.5.0. At least one > contributing factor is that, likely due to some change in the release script, > the maven jars are being bundled in with the source code in our build > directory. This runs afoul of the ASF policy on binaries in source releases - > we should fix it in 1.5.1. > The issue (I think) is that we might invoke maven to compute the version > between when we check out Spark from GitHub and when we package the source > file. I think it could be fixed by simply clearing out the build/ directory > after that statement runs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4559) Adding support for ucase and lcase
[ https://issues.apache.org/jira/browse/SPARK-4559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4559: - Assignee: Michael Armbrust > Adding support for ucase and lcase > -- > > Key: SPARK-4559 > URL: https://issues.apache.org/jira/browse/SPARK-4559 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Fei Wang >Assignee: Michael Armbrust > Fix For: 1.5.0 > > > Adding support for ucase and lcase in spark sql -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4273) Providing ExternalSet to avoid OOM when count(distinct)
[ https://issues.apache.org/jira/browse/SPARK-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4273: - Assignee: Michael Armbrust > Providing ExternalSet to avoid OOM when count(distinct) > --- > > Key: SPARK-4273 > URL: https://issues.apache.org/jira/browse/SPARK-4273 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Reporter: YanTang Zhai >Assignee: Michael Armbrust >Priority: Minor > Fix For: 1.5.0 > > > Some tasks may OOM on count(distinct) if they need to process many records. > CombineSetsAndCountFunction puts all records into an OpenHashSet; if it > fetches many records, it may occupy a large amount of memory. > I think a data structure ExternalSet, like ExternalAppendOnlyMap, could be > provided to store OpenHashSet data on disk when its capacity exceeds some > threshold. > For example, OpenHashSet1(ohs1) has [d, b, c, a]. It is spilled to file1 with > hashCode sorted, then the file1 contains [a, b, c, d]. The procedure could be > indicated as follows: > ohs1 [d, b, c, a] => [a, b, c, d] => file1 > ohs2 [e, f, g, a] => [a, e, f, g] => file2 > ohs3 [e, h, i, g] => [e, g, h, i] => file3 > ohs4 [j, h, a] => [a, h, j] => sortedSet > When output, all keys with the same hashCode will be put into an OpenHashSet, > and then the iterator of this OpenHashSet is accessed. The procedure could be > indicated as follows: > file1-> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a; > file1 -> b -> ohsB; ohsB -> b; > file1 -> c -> ohsC; ohsC -> c; > file1 -> d -> ohsD; ohsD -> d; > file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE-> e; > ... > I think using the ExternalSet could avoid OOM when count(distinct). Comments > welcome. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5109) Loading multiple parquet files into a single SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5109: - Assignee: Michael Armbrust > Loading multiple parquet files into a single SchemaRDD > -- > > Key: SPARK-5109 > URL: https://issues.apache.org/jira/browse/SPARK-5109 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sam Steingold >Assignee: Michael Armbrust > Fix For: 1.5.0 > > > {{[SQLContext.parquetFile(String)|http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/sql/SQLContext.html#parquetFile%28java.lang.String%29]}} > accepts a comma-separated list of files to load. > This feature prevents loading files with commas in their names (a rare use case, > admittedly), and it is also an _extremely_ unusual API. > This feature should be deprecated and new methods > {code} > SQLContext.parquetFile(String[]) > SQLContext.parquetFile(List) > {code} > should be added instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
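For comparison, the 1.4-era `DataFrameReader` already accepts multiple paths as varargs, which sidesteps the comma-splitting entirely. A sketch, assuming an existing `sqlContext`; the file paths are illustrative:

```scala
// Sketch: varargs form of DataFrameReader.parquet. Paths are made up.
import scala.collection.JavaConverters._

val fromVarargs = sqlContext.read.parquet("/data/2015/01.parquet", "/data/2015/02.parquet")

// A java.util.List<String>, as this issue proposes, can be adapted with the
// standard Java/Scala converters:
val paths = java.util.Arrays.asList("/data/2015/01.parquet", "/data/2015/02.parquet")
val fromList = sqlContext.read.parquet(paths.asScala: _*)
```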
[jira] [Updated] (SPARK-10515) When killing executor, the pending replacement executors will be lost
[ https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-10515: -- Summary: When killing executor, the pending replacement executors will be lost (was: When kill executor, there is no need to seed RequestExecutors to AM) > When killing executor, the pending replacement executors will be lost > - > > Key: SPARK-10515 > URL: https://issues.apache.org/jira/browse/SPARK-10515 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: KaiXinXIaoLei > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9032) scala.MatchError in DataFrameReader.json(String path)
[ https://issues.apache.org/jira/browse/SPARK-9032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9032: - Assignee: Josh Rosen > scala.MatchError in DataFrameReader.json(String path) > - > > Key: SPARK-9032 > URL: https://issues.apache.org/jira/browse/SPARK-9032 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 1.4.0 > Environment: Ubuntu 15.04 >Reporter: Philipp Poetter >Assignee: Josh Rosen > Fix For: 1.4.1 > > > Executing read().json() of SQLContext e.g. DataFrameReader raises a > MatchError with a stacktrace as follows while trying to read JSON data: > {code} > 15/07/14 11:25:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 15/07/14 11:25:26 INFO DAGScheduler: Job 0 finished: json at Example.java:23, > took 6.981330 s > Exception in thread "main" scala.MatchError: StringType (of class > org.apache.spark.sql.types.StringType$) > at org.apache.spark.sql.json.InferSchema$.apply(InferSchema.scala:58) > at > org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:139) > at > org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.json.JSONRelation.schema$lzycompute(JSONRelation.scala:137) > at org.apache.spark.sql.json.JSONRelation.schema(JSONRelation.scala:137) > at > org.apache.spark.sql.sources.LogicalRelation.(LogicalRelation.scala:30) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:213) > at com.hp.sparkdemo.Example.main(Example.java:23) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 15/07/14 11:25:26 INFO SparkContext: Invoking stop() from shutdown hook > 15/07/14 11:25:26 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040 > 15/07/14 11:25:26 INFO DAGScheduler: Stopping DAGScheduler > 15/07/14 11:25:26 INFO SparkDeploySchedulerBackend: Shutting down all > executors > 15/07/14 11:25:26 INFO SparkDeploySchedulerBackend: Asking each executor to > shut down > 15/07/14 11:25:26 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! 
> {code} > Offending code snippet (around line 23): > {code} >JavaSparkContext sctx = new JavaSparkContext(sparkConf); > SQLContext ctx = new SQLContext(sctx); > DataFrame frame = ctx.read().json(facebookJSON); > frame.printSchema(); > {code} > The exception is reproducible using the following JSON: > {code} > { >"data": [ > { > "id": "X999_Y999", > "from": { > "name": "Tom Brady", "id": "X12" > }, > "message": "Looking forward to 2010!", > "actions": [ > { >"name": "Comment", >"link": "http://www.facebook.com/X999/posts/Y999" > }, > { >"name": "Like", >"link": "http://www.facebook.com/X999/posts/Y999" > } > ], > "type": "status", > "created_time": "2010-08-02T21:27:44+", > "updated_time": "2010-08-02T21:27:44+" > }, > { > "id": "X998_Y998", > "from": { > "name": "Peyton Manning", "id": "X18" > }, > "message": "Where's my contract?", > "actions": [ > { >"name": "Comment", >"link": "http://www.facebook.com/X998/posts/Y998" > }, > { >"name": "Like", >"link": "http://www.facebook.com/X998/posts/Y998" > } > ], > "type": "status", > "created_time": "2010-08-02T21:27:44+", > "updated_time": "2010-08-02T21:27:44+" > } >] > } > {code} -- This message was sent by Atlassian JIRA
[jira] [Updated] (SPARK-9343) DROP TABLE ignores IF EXISTS clause
[ https://issues.apache.org/jira/browse/SPARK-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9343: - Assignee: Michael Armbrust > DROP TABLE ignores IF EXISTS clause > --- > > Key: SPARK-9343 > URL: https://issues.apache.org/jira/browse/SPARK-9343 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov >Assignee: Michael Armbrust > Labels: sql > Fix For: 1.5.0 > > > If a table is missing, {{DROP TABLE IF EXISTS _tableName_}} generates an > exception: > {code} > 15/07/25 15:17:32 ERROR Hive: NoSuchObjectException(message:default.test > table not found) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1560) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) > at com.sun.proxy.$Proxy27.get_table(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) > at com.sun.proxy.$Proxy28.getTable(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976) > at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:918) > at > 
org.apache.hadoop.hive.ql.exec.DDLTask.dropTableOrPartitions(DDLTask.java:3846) > > {code} > A standalone test with full spark-shell output is available at: > [https://gist.github.com/ssimeonov/eeb388d13f802689d772] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
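Until this is fixed, one hedged workaround is to consult the catalog before dropping, so the missing-table path is never hit. A sketch, assuming a HiveContext bound to `sqlContext` and a hypothetical table name `test`:

```scala
// Workaround sketch: avoid triggering the NoSuchObjectException error by
// checking table existence first. tableNames() lists the current database.
if (sqlContext.tableNames().contains("test")) {
  sqlContext.sql("DROP TABLE test")
}
```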
[jira] [Updated] (SPARK-9033) scala.MatchError: interface java.util.Map (of class java.lang.Class) with Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9033: - Assignee: Josh Rosen > scala.MatchError: interface java.util.Map (of class java.lang.Class) with > Spark SQL > --- > > Key: SPARK-9033 > URL: https://issues.apache.org/jira/browse/SPARK-9033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.2, 1.3.1 >Reporter: Pavel >Assignee: Josh Rosen > Fix For: 1.4.0 > > > I have a java.util.Map field in a POJO class and I'm trying to > use it with createDataFrame (1.3.1) / applySchema (1.2.2) on the SQLContext, > and I get the following error in both the 1.2.2 & 1.3.1 versions of Spark SQL: > *sample code: > {code} > SQLContext sqlCtx = new SQLContext(sc.sc()); > JavaRDD rdd = sc.textFile("/path").map(line-> Event.fromString(line)); > // the text line is split and assigned to the respective fields of the event class > here > DataFrame schemaRDD = sqlCtx.createDataFrame(rdd, Event.class); <-- error > thrown here > schemaRDD.registerTempTable("events"); > {code} > Event class is a Serializable containing a field of type > java.util.Map . This issue also occurs with Spark Streaming > when used with SQL. > {code} > JavaDStream receiverStream = jssc.receiverStream(new > StreamingReceiver()); > JavaDStream windowDStream = receiverStream.window(WINDOW_LENGTH, > SLIDE_INTERVAL); > jssc.checkpoint("event-streaming"); > windowDStream.foreachRDD(evRDD -> { >if(evRDD.count() == 0) return null; > DataFrame schemaRDD = sqlCtx.createDataFrame(evRDD, Event.class); > schemaRDD.registerTempTable("events"); > ... 
> } > {code} > *error: > {code} > scala.MatchError: interface java.util.Map (of class java.lang.Class) > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > ~[scala-library-2.10.5.jar:na] > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > ~[scala-library-2.10.5.jar:na] > at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > {code} > **also this occurs for fields of custom POJO classes: > {code} > scala.MatchError: class com.test.MyClass (of class java.lang.Class) > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > 
~[scala-library-2.10.5.jar:na] > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > ~[scala-library-2.10.5.jar:na] > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > ~[scala-library-2.10.5.jar:na] > at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > {code} > **also occurs for Calendar type: > {code} > scala.MatchError: class java.util.Calendar (of class java.lang.Class) > at >
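For illustration only, the failure mode above can be sketched outside Spark: a minimal Python analogue (the type table and function names here are invented for the sketch, not Spark's actual code) of reflection-based schema inference, where any bean property whose Java type hits no known case fails, just as the pattern match in `SQLContext.getSchema` throws `scala.MatchError` for `java.util.Map`, `Calendar`, and custom POJOs:

```python
# Hypothetical sketch: each bean property's Java type must match a
# known case; unsupported types fall through, which in Scala's
# getSchema surfaces as a scala.MatchError on the Class object.
SUPPORTED = {
    "java.lang.String": "StringType",
    "java.lang.Integer": "IntegerType",
    "java.lang.Long": "LongType",
    "java.lang.Double": "DoubleType",
}

def infer_field_type(java_type):
    if java_type not in SUPPORTED:
        # Analogue of scala.MatchError: no case matched the scrutinee.
        raise TypeError("no match for " + java_type)
    return SUPPORTED[java_type]

print(infer_field_type("java.lang.String"))   # StringType
```

The fix direction implied by the report is either supporting more types in that match or failing with a clearer error than a raw `MatchError`.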
[jira] [Updated] (SPARK-10437) Support aggregation expressions in Order By
[ https://issues.apache.org/jira/browse/SPARK-10437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10437: -- Assignee: Liang-Chi Hsieh > Support aggregation expressions in Order By > --- > > Key: SPARK-10437 > URL: https://issues.apache.org/jira/browse/SPARK-10437 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Harish Butani >Assignee: Liang-Chi Hsieh > Fix For: 1.6.0 > > > Followup on SPARK-6583 > The following still fails. > {code} > val df = sqlContext.read.json("examples/src/main/resources/people.json") > df.registerTempTable("t") > val df2 = sqlContext.sql("select age, count(*) from t group by age order by > count(*)") > df2.show() > {code} > {code:title=StackTrace} > Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No > function to evaluate expression. type: Count, tree: COUNT(1) > at > org.apache.spark.sql.catalyst.expressions.AggregateExpression.eval(aggregates.scala:41) > at > org.apache.spark.sql.catalyst.expressions.RowOrdering.compare(rows.scala:219) > {code} > In 1.4 the issue seemed to be BindReferences.bindReference didn't handle this > case. > Haven't looked at 1.5 code, but don't see a change to bindReference in this > patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
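The shape of the failing query is standard SQL; run against SQLite (purely to show the intended GROUP BY / ORDER BY-on-aggregate semantics, not Spark's behaviour), the same query works:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (age INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(30,), (30,), (19,)])

# Same shape as the failing Spark query: order the groups by count(*).
rows = conn.execute(
    "SELECT age, count(*) FROM t GROUP BY age ORDER BY count(*)"
).fetchall()
print(rows)  # [(19, 1), (30, 2)]
```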
[jira] [Updated] (SPARK-10475) improve column prunning for Project on Sort
[ https://issues.apache.org/jira/browse/SPARK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10475: -- Assignee: Wenchen Fan > improve column prunning for Project on Sort > --- > > Key: SPARK-10475 > URL: https://issues.apache.org/jira/browse/SPARK-10475 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10633) Persisting Spark stream to MySQL - Spark tries to create the table for every stream even if it exist already.
Lunen created SPARK-10633: - Summary: Persisting Spark stream to MySQL - Spark tries to create the table for every stream even if it exist already. Key: SPARK-10633 URL: https://issues.apache.org/jira/browse/SPARK-10633 Project: Spark Issue Type: Bug Components: SQL, Streaming Affects Versions: 1.5.0, 1.4.0 Environment: Ubuntu 14.04 IntelliJ IDEA 14.1.4 sbt mysql-connector-java 5.1.35 (Tested and working with Spark 1.3.1) Reporter: Lunen Priority: Blocker Persisting Spark Kafka stream to MySQL Spark 1.4+ tries to create a table automatically every time the stream gets sent to a specified table. Please note, Spark 1.3.1 works. Code sample: val url = "jdbc:mysql://host:port/db?user=user&password=password" val crp = RowSetProvider.newFactory() val crsSql: CachedRowSet = crp.createCachedRowSet() val crsTrg: CachedRowSet = crp.createCachedRowSet() crsSql.beforeFirst() crsTrg.beforeFirst() //Read Stream from Kafka //Produce SQL INSERT STRING streamT.foreachRDD { rdd => if (rdd.toLocalIterator.nonEmpty) { sqlContext.read.json(rdd).registerTempTable(serverEvents + "_events") while (crsSql.next) { sqlContext.sql("SQL INSERT STRING").write.jdbc(url, "SCHEMA_NAME", new Properties) println("Persisted Data: " + "SQL INSERT STRING") } crsSql.beforeFirst() } stmt.close() conn.close() } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10633) Persisting Spark stream to MySQL - Spark tries to create the table for every stream even if it exist already.
[ https://issues.apache.org/jira/browse/SPARK-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10633: -- Priority: Major (was: Blocker) [~lunendl] Please have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -- for example you should not set Blocker, as this can't really be considered one at this point. > Persisting Spark stream to MySQL - Spark tries to create the table for every > stream even if it exist already. > - > > Key: SPARK-10633 > URL: https://issues.apache.org/jira/browse/SPARK-10633 > Project: Spark > Issue Type: Bug > Components: SQL, Streaming >Affects Versions: 1.4.0, 1.5.0 > Environment: Ubuntu 14.04 > IntelliJ IDEA 14.1.4 > sbt > mysql-connector-java 5.1.35 (Tested and working with Spark 1.3.1) >Reporter: Lunen > > Persisting Spark Kafka stream to MySQL > Spark 1.4+ tries to create a table automatically every time the stream gets > sent to a specified table. > Please note, Spark 1.3.1 works. > Code sample: > val url = "jdbc:mysql://host:port/db?user=user&password=password" > val crp = RowSetProvider.newFactory() > val crsSql: CachedRowSet = crp.createCachedRowSet() > val crsTrg: CachedRowSet = crp.createCachedRowSet() > crsSql.beforeFirst() > crsTrg.beforeFirst() > //Read Stream from Kafka > //Produce SQL INSERT STRING > > streamT.foreachRDD { rdd => > if (rdd.toLocalIterator.nonEmpty) { > sqlContext.read.json(rdd).registerTempTable(serverEvents + "_events") > while (crsSql.next) { > sqlContext.sql("SQL INSERT STRING").write.jdbc(url, "SCHEMA_NAME", > new Properties) > println("Persisted Data: " + "SQL INSERT STRING") > } > crsSql.beforeFirst() > } > stmt.close() > conn.close() > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
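The behaviour the reporter expects, creating the table once and appending on every later micro-batch, can be sketched with SQLite standing in for MySQL (the table name and column are invented for the sketch, and this is not Spark's JDBC writer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def persist_batch(conn, table, rows):
    # Create the table only on first use, then append; each subsequent
    # batch inserts without attempting another CREATE TABLE.
    conn.execute("CREATE TABLE IF NOT EXISTS %s (v INTEGER)" % table)
    conn.executemany("INSERT INTO %s VALUES (?)" % table, rows)

persist_batch(conn, "events", [(1,), (2,)])
persist_batch(conn, "events", [(3,)])  # no error on the second batch
print(conn.execute("SELECT count(*) FROM events").fetchone()[0])  # 3
```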
[jira] [Resolved] (SPARK-10267) Add @Since annotation to ml.util
[ https://issues.apache.org/jira/browse/SPARK-10267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10267. --- Resolution: Not A Problem Assignee: Ehsan Mohyedin Kermani > Add @Since annotation to ml.util > > > Key: SPARK-10267 > URL: https://issues.apache.org/jira/browse/SPARK-10267 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Assignee: Ehsan Mohyedin Kermani >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10631) Add missing API doc in pyspark.mllib.linalg.Vector
[ https://issues.apache.org/jira/browse/SPARK-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10631: -- Assignee: Vinod KC > Add missing API doc in pyspark.mllib.linalg.Vector > -- > > Key: SPARK-10631 > URL: https://issues.apache.org/jira/browse/SPARK-10631 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Vinod KC >Priority: Minor > > There are some missing API docs in pyspark.mllib.linalg.Vector (including > DenseVector and SparseVector). We should add them based on their Scala > counterparts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10276) Add @since annotation to pyspark.mllib.recommendation
[ https://issues.apache.org/jira/browse/SPARK-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10276. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8677 [https://github.com/apache/spark/pull/8677] > Add @since annotation to pyspark.mllib.recommendation > - > > Key: SPARK-10276 > URL: https://issues.apache.org/jira/browse/SPARK-10276 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Yu Ishikawa >Priority: Minor > Labels: starter > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6810) Performance benchmarks for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14768853#comment-14768853 ] Yashwanth Kumar commented on SPARK-6810: Hi Shivaram Venkataraman, I would like to try this. > Performance benchmarks for SparkR > - > > Key: SPARK-6810 > URL: https://issues.apache.org/jira/browse/SPARK-6810 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Critical > > We should port some performance benchmarks from spark-perf to SparkR for > tracking performance regressions / improvements. > https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list > of PySpark performance benchmarks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10262) Add @Since annotation to ml.attribute
[ https://issues.apache.org/jira/browse/SPARK-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14768840#comment-14768840 ] Xiangrui Meng commented on SPARK-10262: --- [~tijoparacka] Are you still working on this? > Add @Since annotation to ml.attribute > - > > Key: SPARK-10262 > URL: https://issues.apache.org/jira/browse/SPARK-10262 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9492) LogisticRegression in R should provide model statistics
[ https://issues.apache.org/jira/browse/SPARK-9492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14768850#comment-14768850 ] Yashwanth Kumar commented on SPARK-9492: Can I have this task? > LogisticRegression in R should provide model statistics > --- > > Key: SPARK-9492 > URL: https://issues.apache.org/jira/browse/SPARK-9492 > Project: Spark > Issue Type: Sub-task > Components: ML, R >Reporter: Eric Liang > > Like ml LinearRegression, LogisticRegression should provide a training > summary including feature names and their coefficients. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
Prachi Burathoki created SPARK-10634: Summary: The spark sql fails if the where clause contains a string with " in it. Key: SPARK-10634 URL: https://issues.apache.org/jira/browse/SPARK-10634 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Prachi Burathoki When running a sql query in which the where clause contains a string with " in it, the sql parser throws an error. Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but identifier test found SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) at com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) at com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) at com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) at com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) at com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) ... 31 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14768938#comment-14768938 ] Sean Owen commented on SPARK-10634: --- Shouldn't the " be escaped in some way? or else I'm not sure how the parser would know the end of the literal from a quote inside the literal. > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > 
com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 31 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10614) SystemClock uses non-monotonic time in its wait logic
[ https://issues.apache.org/jira/browse/SPARK-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14768900#comment-14768900 ] Steve Loughran commented on SPARK-10614: Having done a little more detailed research on the current state of this clock, I'm now having doubts about this. On x86, it's generally assumed that {{System.nanoTime()}} uses the {{TSC}} counter to get the timestamp, which is fast and only goes forwards (albeit at a rate which depends on CPU power states). But it turns out that on manycore CPUs, because that could lead to different answers on different cores, the OS may use alternative mechanisms to return a counter: which may be neither monotonic nor fast. # [Inside the Hotspot VM: Clocks, Timers and Scheduling Events - Part I - Windows|https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks] # [JDK-6440250 : On Windows System.nanoTime() may be 25x slower than System.currentTimeMillis()|http://bugs.java.com/view_bug.do?bug_id=6440250] # [JDK-6458294 : nanoTime affected by system clock change on Linux (RH9) or in general lacks monotonicity|http://bugs.java.com/view_bug.do?bug_id=6458294] # [Redhat on timestamps in Linux|https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html] # [Timekeeping in VMware Virtual Machines|http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf] These docs imply that nanoTime may be fast but unreliable on multiple-socket systems (the latest manycore parts share one counter), and may downgrade to something slower than calls to currentTimeMillis(), or even something that isn't guaranteed to be monotonic. It's not clear that on deployments of physical many-core systems, moving to nanoTime actually offers much. I don't know about EC2 or other cloud infrastructures though. 
Maybe it's just best to WONTFIX this, so as not to raise unrealistic expectations about nanoTime. > SystemClock uses non-monotonic time in its wait logic > - > > Key: SPARK-10614 > URL: https://issues.apache.org/jira/browse/SPARK-10614 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Steve Loughran >Priority: Minor > > The consolidated (SPARK-4682) clock uses {{System.currentTimeMillis()}} for > measuring time, which means its {{waitTillTime()}} routine is brittle against > systems (VMs in particular) whose time can go backwards as well as forward. > For the {{ExecutorAllocationManager}} this appears to be a regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
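Whatever the platform caveats above, the wait logic itself can be written against a pluggable clock; a minimal Python sketch (the names are invented for the sketch, not Spark's SystemClock API) using `time.monotonic`, which by contract never goes backwards:

```python
import time

def wait_till_time(deadline, clock=time.monotonic):
    # Poll the clock until the deadline passes; with a monotonic clock
    # the remaining time can only shrink, so the loop must terminate
    # even if the wall clock is stepped backwards mid-wait.
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return clock()
        time.sleep(min(remaining, 0.01))

start = time.monotonic()
end = wait_till_time(start + 0.05)
print(end - start >= 0.05)  # True
```

The same shape works with `System.nanoTime()` on the JVM, subject to the reliability concerns discussed in the comment.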
[jira] [Updated] (SPARK-10635) pyspark - running on a different host
[ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Duffield updated SPARK-10635: - Description: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments). There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} was: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments). There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. 
Before 1.3.1 we were able to monkeypatch around this: def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket > pyspark - running on a different host > - > > Key: SPARK-10635 > URL: https://issues.apache.org/jira/browse/SPARK-10635 > Project: Spark > Issue Type: Improvement >Reporter: Ben Duffield > > At various points we assume we only ever talk to a driver on the same host. > e.g. > https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 > We use pyspark to connect to an existing driver (i.e. do not let pyspark > launch the driver itself, but instead construct the SparkContext with the > gateway and jsc arguments). > There are a few reasons for this, but essentially it's to allow more > flexibility when running in AWS. > Before 1.3.1 we were able to monkeypatch around this: > {code} > def _load_from_socket(port, serializer): > sock = socket.socket() > sock.settimeout(3) > try: > sock.connect((host, port)) > rf = sock.makefile("rb", 65536) > for item in serializer.load_stream(rf): > yield item > finally: > sock.close() > pyspark.rdd._load_from_socket = _load_from_socket > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
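A host-parameterised variant of that monkeypatch can be sketched as below; the local echo server and `lines` deserializer exist only so the sketch runs without Spark, and the extended signature is an assumption about what the requested change would look like, not PySpark's actual API:

```python
import socket
import threading

def _load_from_socket(port, deserialize, host="127.0.0.1"):
    # Hypothetical variant of pyspark.rdd._load_from_socket that takes
    # the driver host as a parameter instead of assuming localhost.
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))
        rf = sock.makefile("rb", 65536)
        for item in deserialize(rf):
            yield item
    finally:
        sock.close()

# Tiny one-shot server so the sketch is runnable without a Spark driver.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def serve():
    conn, _ = server.accept()
    conn.sendall(b"a\nb\n")
    conn.close()

threading.Thread(target=serve, daemon=True).start()

def lines(rf):
    # Stand-in deserializer: yield one decoded string per line.
    for line in rf:
        yield line.strip().decode()

items = list(_load_from_socket(port, lines, host="127.0.0.1"))
print(items)  # ['a', 'b']
```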
[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14769057#comment-14769057 ] Sean Owen commented on SPARK-10634: --- Don't you need to use "" to quote double quotes? I actually am not sure what Spark supports but that's not uncommon in SQL. I don't think you're escaping the quotes here. > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > 
com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 31 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
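Standard SQL escapes a quote character inside a string literal by doubling it, as suggested in the comment above; demonstrated here with SQLite, which uses single-quoted string literals (Spark SQL's own quoting and escaping rules may differ, so this only illustrates the SQL convention):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A quote inside a string literal is escaped by writing it twice.
row = conn.execute("SELECT 'this is a ''test'''").fetchone()
print(row[0])  # this is a 'test'
```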
[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14769011#comment-14769011 ] Prachi Burathoki commented on SPARK-10634: -- I tried by escaping with \, but still same error Caused by: java.lang.RuntimeException: [1.130] failure: ``)'' expected but identifier test found SELECT clistc1426336010, corc2125646118, candc2031403851, SYSIBM_ROW_NUMBER FROM TABLE_1 WHERE ((clistc1426336010 = "this is a \"test\"")) ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) at com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) at com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) at com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) at com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) at com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRef > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. 
> Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at >
[jira] [Updated] (SPARK-10635) pyspark - running on a different host
[ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Duffield updated SPARK-10635: - Description: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments. There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} was: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments. There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. 
Before 1.3.1 we were able to monkeypatch around this:
{code}
def _load_from_socket(port, serializer):
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()

pyspark.rdd._load_from_socket = _load_from_socket
{code}
> pyspark - running on a different host > - > > Key: SPARK-10635 > URL: https://issues.apache.org/jira/browse/SPARK-10635 > Project: Spark > Issue Type: Improvement >Reporter: Ben Duffield > > At various points we assume we only ever talk to a driver on the same host. > e.g. > https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 > We use pyspark to connect to an existing driver (i.e. do not let pyspark > launch the driver itself, but instead construct the SparkContext with the > gateway and jsc arguments. > There are a few reasons for this, but essentially it's to allow more > flexibility when running in AWS. > Before 1.3.1 we were able to monkeypatch around this: > {code} > def _load_from_socket(port, serializer): > sock = socket.socket() > sock.settimeout(3) > try: > sock.connect((host, port)) > rf = sock.makefile("rb", 65536) > for item in serializer.load_stream(rf): > yield item > finally: > sock.close() > pyspark.rdd._load_from_socket = _load_from_socket > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10635) pyspark - running on a different host
Ben Duffield created SPARK-10635: Summary: pyspark - running on a different host Key: SPARK-10635 URL: https://issues.apache.org/jira/browse/SPARK-10635 Project: Spark Issue Type: Improvement Reporter: Ben Duffield At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments. There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. Before 1.3.1 we were able to monkeypatch around this: def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6810) Performance benchmarks for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790700#comment-14790700 ] Xiangrui Meng commented on SPARK-6810: -- the ml part (glm) is about the same. all computation is on the Scala side. i would wait until 1.6 to benchmark GLM because we are going to implement the same algorithm as R in 1.6. > Performance benchmarks for SparkR > - > > Key: SPARK-6810 > URL: https://issues.apache.org/jira/browse/SPARK-6810 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Critical > > We should port some performance benchmarks from spark-perf to SparkR for > tracking performance regressions / improvements. > https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list > of PySpark performance benchmarks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790713#comment-14790713 ] Seth Hendrickson commented on SPARK-10602: -- Right now I have working versions of single pass algos for skewness and kurtosis. I'm still writing tests and need to work on performance profiling, but I just wanted to make sure we don't duplicate any work. I'll post back with more details soon. > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
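For context on what a single-pass implementation of these statistics can look like, here is a hedged sketch of the textbook one-pass (Welford/Pébay-style) recurrence for the first four central moments, from which skewness and kurtosis follow. This illustrates the technique only, not the commenter's or Spark's actual code.

```python
# One-pass update of count, mean, and the 2nd-4th central moments
# (m2, m3, m4). Each element is seen exactly once.
def streaming_moments(xs):
    n = 0
    mean = m2 = m3 = m4 = 0.0
    for x in xs:
        n1 = n
        n += 1
        delta = x - mean
        dn = delta / n
        dn2 = dn * dn
        term1 = delta * dn * n1
        mean += dn
        # Update order matters: m4 and m3 must use the *old* m2/m3.
        m4 += term1 * dn2 * (n * n - 3 * n + 3) + 6 * dn2 * m2 - 4 * dn * m3
        m3 += term1 * dn * (n - 2) - 3 * dn * m2
        m2 += term1
    return n, mean, m2, m3, m4

def skewness(n, m2, m3):
    return (n ** 0.5) * m3 / m2 ** 1.5

def kurtosis(n, m2, m4):
    return n * m4 / (m2 * m2) - 3.0  # excess kurtosis

n, mean, m2, m3, m4 = streaming_moments([1.0, 2.0, 3.0, 4.0, 5.0])
print(mean, m2, m3, m4)  # 3.0 10.0 0.0 34.0
```

Because the update only keeps five running values, the same recurrence also merges cleanly across partitions, which is what makes it a natural fit for a UDAF.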
[jira] [Created] (SPARK-10637) DataFrames: saving with nested User Data Types
Joao created SPARK-10637: Summary: DataFrames: saving with nested User Data Types Key: SPARK-10637 URL: https://issues.apache.org/jira/browse/SPARK-10637 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Joao Cannot save data frames using nested UserDefinedType I wrote a simple example to show the error. It causes the following error java.lang.IllegalArgumentException: Nested type should be repeated: required group array { required int32 num; } {code:java} import org.apache.spark.sql.SaveMode import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions.GenericMutableRow import org.apache.spark.sql.types._ @SQLUserDefinedType(udt = classOf[AUDT]) case class A(list:Seq[B]) class AUDT extends UserDefinedType[A] { override def sqlType: DataType = StructType(Seq(StructField("list", ArrayType(BUDT, containsNull = false), nullable = true))) override def userClass: Class[A] = classOf[A] override def serialize(obj: Any): Any = obj match { case A(list) => val row = new GenericMutableRow(1) row.update(0, new GenericArrayData(list.map(_.asInstanceOf[Any]).toArray)) row } override def deserialize(datum: Any): A = { datum match { case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq) } } } object AUDT extends AUDT @SQLUserDefinedType(udt = classOf[BUDT]) case class B(num:Int) class BUDT extends UserDefinedType[B] { override def sqlType: DataType = StructType(Seq(StructField("num", IntegerType, nullable = false))) override def userClass: Class[B] = classOf[B] override def serialize(obj: Any): Any = obj match { case B(num) => val row = new GenericMutableRow(1) row.setInt(0, num) row } override def deserialize(datum: Any): B = { datum match { case row: InternalRow => new B(row.getInt(0)) } } } object BUDT extends BUDT object TestNested { def main(args:Array[String]) = { val col = Seq(new A(Seq(new B(1), new B(2))), new A(Seq(new B(3), new B(4 val sc = 
new SparkContext(new SparkConf().setMaster("local[1]").setAppName("TestSpark")) val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ val df = sc.parallelize(1 to 2 zip col).toDF() df.show() df.write.mode(SaveMode.Overwrite).save(...) } } {code} The error log is shown below: {noformat} 15/09/16 16:44:36 WARN : Your hostname, X resolves to a loopback/non-reachable address: fe80:0:0:0:c4c7:8c4b:4a24:f8a1%14, but we couldn't find any external IP address! 15/09/16 16:44:38 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 15/09/16 16:44:38 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 15/09/16 16:44:38 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 15/09/16 16:44:38 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 15/09/16 16:44:38 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 15/09/16 16:44:38 INFO ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter 15/09/16 16:44:38 INFO DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter 15/09/16 16:44:38 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:50986 in memory (size: 1402.0 B, free: 973.6 MB) 15/09/16 16:44:38 INFO ContextCleaner: Cleaned accumulator 1 15/09/16 16:44:39 INFO SparkContext: Starting job: save at TestNested.scala:73 15/09/16 16:44:39 INFO DAGScheduler: Got job 1 (save at TestNested.scala:73) with 1 output partitions 15/09/16 16:44:39 INFO DAGScheduler: Final stage: ResultStage 1(save at TestNested.scala:73) 15/09/16 16:44:39 INFO DAGScheduler: Parents of final stage: List() 15/09/16 16:44:39 INFO DAGScheduler: Missing parents: List() 15/09/16 16:44:39 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[1] at rddToDataFrameHolder at TestNested.scala:69), 
which has no missing parents 15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(59832) called with curMem=0, maxMem=1020914565 15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 58.4 KB, free 973.6 MB) 15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(20794) called with curMem=59832, maxMem=1020914565 15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 20.3 KB, free 973.5 MB) 15/09/16 16:44:39 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:50986 (size: 20.3 KB, free: 973.6 MB) 15/09/16 16:44:39 INFO SparkContext: Created broadcast 1 from broadcast at
[jira] [Created] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
Glenn Strycker created SPARK-10636: -- Summary: RDD filter does not work after if..then..else RDD blocks Key: SPARK-10636 URL: https://issues.apache.org/jira/browse/SPARK-10636 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Glenn Strycker I have an RDD declaration of the following form: {code} val myRDD = if (some condition) { tempRDD1.some operations } else { tempRDD2.some operations}.filter(a => a._2._5 <= 50) {code} When I output the contents of myRDD, I found entries that clearly had a._2._5 > 50... the filter didn't work! If I move the filter inside of the if..then blocks, it suddenly does work: {code} val myRDD = if (some condition) { tempRDD1.some operations.filter(a => a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) } {code} I ran toDebugString after both of these code examples, and "filter" does appear in the DAG for the second example, but DOES NOT appear in the first DAG. Why not? Am I misusing the if..then..else syntax for Spark/Scala? Here is my actual code... ignore the crazy naming conventions and what it's doing... {code} // this does NOT work val myRDD = if (tempRDD2.count() > 0) { tempRDD1. map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). leftOuterJoin(tempRDD2). map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, a._2._2.getOrElse(1L. leftOuterJoin(tempRDD2). map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, a._2._2.getOrElse(1L. map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))) } else { tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))) }. filter(a => a._2._5 <= 50). partitionBy(partitioner). setName("myRDD"). 
persist(StorageLevel.MEMORY_AND_DISK_SER) myRDD.checkpoint() println(myRDD.toDebugString) // (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 [] // | MapPartitionsRDD[57] at map at myProgram.scala:2119 [] // | MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 [] // | MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 [] // | CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 [] // +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 [] // | | MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 [] // | | MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 [] // | | CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 [] // | +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 [] // | | | clusterGraphWithComponentsRDD MapPartitionsRDD[28] at reduceByKey at myProgram.scala:1689 [] // | | | CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B // | | | CheckpointRDD[29] at count at myProgram.scala:1701 [] // | +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at myProgram.scala:383 [] // | | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B // | | CheckpointRDD[17] at count at myProgram.scala:394 [] // +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at myProgram.scala:383 [] // | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B // | CheckpointRDD[17] at count at myProgram.scala:394 [] // this DOES work! val myRDD = if (tempRDD2.count() > 0) { tempRDD1. map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). leftOuterJoin(tempRDD2). map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, a._2._2.getOrElse(1L. leftOuterJoin(tempRDD2). map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, a._2._2.getOrElse(1L. 
map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))). filter(a => a._2._5 <= 50) } else { tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))). filter(a => a._2._5 <= 50) }. partitionBy(partitioner). setName("myRDD"). persist(StorageLevel.MEMORY_AND_DISK_SER) myRDD.checkpoint() println(myRDD.toDebugString) // (4) MapPartitionsRDD[59] at filter at myProgram.scala:2121 [] // | MapPartitionsRDD[58] at map at myProgram.scala:2120 [] // | MapPartitionsRDD[57] at map at myProgram.scala:2119 [] // | MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 [] // | MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 [] // | CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 [] // +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 [] // | | MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 [] // | |
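A likely explanation for the missing filter, inferred from Scala's parsing rules rather than from Spark internals: a method call written after an if/else expression binds to the last branch's expression only, not to the whole conditional. So in the first form the filter is attached solely to the else branch, and when the condition is true it never enters the lineage. A minimal sketch in plain Scala (no Spark needed):

```scala
// The trailing .filter binds to the else branch's block expression,
// not to the result of the whole if/else.
val xs = List(1, 60, 2, 70)

// Parsed as: if (cond) xs else (xs.filter(_ <= 50))
def unparenthesized(cond: Boolean): List[Int] =
  if (cond) { xs } else { xs }.filter(_ <= 50)

// Parenthesizing the whole if/else applies the filter on both paths.
def parenthesized(cond: Boolean): List[Int] =
  (if (cond) { xs } else { xs }).filter(_ <= 50)

assert(unparenthesized(true) == List(1, 60, 2, 70)) // filter skipped
assert(unparenthesized(false) == List(1, 2))
assert(parenthesized(true) == List(1, 2))
```

This matches the reported DAGs: moving the filter inside both branches, or wrapping the conditional in parentheses before chaining, makes it apply unconditionally.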
[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790634#comment-14790634 ] Prachi Burathoki commented on SPARK-10634: -- I tried escaping" with both \ and ".But got the same error.But I've found a workaround, in the where clause i changed it to listc = 'this is a"test"' and that works. Thanks > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > 
com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 31 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
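The single-quote workaround described above matches standard SQL string-literal rules, sketched here with sqlite3 purely for illustration (an assumption; this is not Spark's parser): put the literal in single quotes, and embedded double quotes then need no escaping at all.

```python
# Standard SQL quoting: single quotes delimit string literals, so a
# double quote inside the literal is just an ordinary character.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c TEXT)")
conn.execute("INSERT INTO t VALUES ('this is a \"test\"')")

row = conn.execute(
    "SELECT c FROM t WHERE c = 'this is a \"test\"'"
).fetchone()
print(row[0])  # this is a "test"

# An embedded single quote is escaped by doubling it:
row2 = conn.execute("SELECT 'it''s fine'").fetchone()
print(row2[0])  # it's fine
```

In standard SQL, double quotes delimit identifiers rather than strings, which is consistent with the parser error above: it read `test` inside the double-quoted value as an identifier.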
[jira] [Reopened] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-10634: --- > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 
31 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9296) variance, var_pop, and var_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9296: --- Assignee: Apache Spark > variance, var_pop, and var_samp aggregate functions > --- > > Key: SPARK-9296 > URL: https://issues.apache.org/jira/browse/SPARK-9296 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > > A short introduction on how to build aggregate functions based on our new > interface can be found at > https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.
[jira] [Commented] (SPARK-9296) variance, var_pop, and var_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790663#comment-14790663 ] Apache Spark commented on SPARK-9296: - User 'JihongMA' has created a pull request for this issue: https://github.com/apache/spark/pull/8778 > variance, var_pop, and var_samp aggregate functions > --- > > Key: SPARK-9296 > URL: https://issues.apache.org/jira/browse/SPARK-9296 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > A short introduction on how to build aggregate functions based on our new > interface can be found at > https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.
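For context, a minimal standalone sketch (an illustration only, not Spark's implementation, which is built on the new aggregate-function interface linked above) of the estimators behind these functions: {{var_pop}} divides the sum of squared deviations by n, while {{var_samp}} divides by n - 1:

```scala
// Standalone sketch of the two variance estimators (not Spark code):
// population variance divides the sum of squared deviations by n,
// sample variance divides by n - 1 (Bessel's correction).
object VarianceSketch {
  def varPop(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.length
    xs.map(x => math.pow(x - mean, 2)).sum / xs.length
  }

  def varSamp(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.length
    xs.map(x => math.pow(x - mean, 2)).sum / (xs.length - 1)
  }

  def main(args: Array[String]): Unit = {
    val xs = Seq(1.0, 2.0, 3.0, 4.0)
    println(varPop(xs))   // 1.25
    println(varSamp(xs))  // 1.666...
  }
}
```

A production implementation would use a single-pass, numerically stable update (e.g. Welford's algorithm) rather than two passes over the data.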
[jira] [Updated] (SPARK-10637) DataFrames: saving with nested User Data Types
[ https://issues.apache.org/jira/browse/SPARK-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joao updated SPARK-10637: - Description: Cannot save data frames using nested UserDefinedType I wrote a simple example to show the error. It causes the following error java.lang.IllegalArgumentException: Nested type should be repeated: required group array { required int32 num; } {code:java} import org.apache.spark.sql.SaveMode import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions.GenericMutableRow import org.apache.spark.sql.types._ @SQLUserDefinedType(udt = classOf[AUDT]) case class A(list:Seq[B]) class AUDT extends UserDefinedType[A] { override def sqlType: DataType = StructType(Seq(StructField("list", ArrayType(BUDT, containsNull = false), nullable = true))) override def userClass: Class[A] = classOf[A] override def serialize(obj: Any): Any = obj match { case A(list) => val row = new GenericMutableRow(1) row.update(0, new GenericArrayData(list.map(_.asInstanceOf[Any]).toArray)) row } override def deserialize(datum: Any): A = { datum match { case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq) } } } object AUDT extends AUDT @SQLUserDefinedType(udt = classOf[BUDT]) case class B(num:Int) class BUDT extends UserDefinedType[B] { override def sqlType: DataType = StructType(Seq(StructField("num", IntegerType, nullable = false))) override def userClass: Class[B] = classOf[B] override def serialize(obj: Any): Any = obj match { case B(num) => val row = new GenericMutableRow(1) row.setInt(0, num) row } override def deserialize(datum: Any): B = { datum match { case row: InternalRow => new B(row.getInt(0)) } } } object BUDT extends BUDT object TestNested { def main(args:Array[String]) = { val col = Seq(new A(Seq(new B(1), new B(2))), new A(Seq(new B(3), new B(4 val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("TestSpark")) 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ val df = sc.parallelize(1 to 2 zip col).toDF() df.show() df.write.mode(SaveMode.Overwrite).save(...) } } {code} The error log is shown below: {noformat} 15/09/16 16:44:36 WARN : Your hostname, X resolves to a loopback/non-reachable address: fe80:0:0:0:c4c7:8c4b:4a24:f8a1%14, but we couldn't find any external IP address! 15/09/16 16:44:38 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 15/09/16 16:44:38 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 15/09/16 16:44:38 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 15/09/16 16:44:38 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 15/09/16 16:44:38 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 15/09/16 16:44:38 INFO ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter 15/09/16 16:44:38 INFO DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter 15/09/16 16:44:38 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:50986 in memory (size: 1402.0 B, free: 973.6 MB) 15/09/16 16:44:38 INFO ContextCleaner: Cleaned accumulator 1 15/09/16 16:44:39 INFO SparkContext: Starting job: save at TestNested.scala:73 15/09/16 16:44:39 INFO DAGScheduler: Got job 1 (save at TestNested.scala:73) with 1 output partitions 15/09/16 16:44:39 INFO DAGScheduler: Final stage: ResultStage 1(save at TestNested.scala:73) 15/09/16 16:44:39 INFO DAGScheduler: Parents of final stage: List() 15/09/16 16:44:39 INFO DAGScheduler: Missing parents: List() 15/09/16 16:44:39 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[1] at rddToDataFrameHolder at TestNested.scala:69), which has no missing parents 15/09/16 16:44:39 INFO MemoryStore: 
ensureFreeSpace(59832) called with curMem=0, maxMem=1020914565 15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 58.4 KB, free 973.6 MB) 15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(20794) called with curMem=59832, maxMem=1020914565 15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 20.3 KB, free 973.5 MB) 15/09/16 16:44:39 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:50986 (size: 20.3 KB, free: 973.6 MB) 15/09/16 16:44:39 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861 15/09/16 16:44:39 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[1] at rddToDataFrameHolder at
[jira] [Commented] (SPARK-6810) Performance benchmarks for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790654#comment-14790654 ] Shivaram Venkataraman commented on SPARK-6810: -- So the DataFrame API doesn't need performance benchmarks as much, since we mostly wrap all our calls to Java / Scala. However, we are adding new ML API components, and [~mengxr] will be able to provide more guidance on this. cc [~sunrui] > Performance benchmarks for SparkR > - > > Key: SPARK-6810 > URL: https://issues.apache.org/jira/browse/SPARK-6810 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Critical > > We should port some performance benchmarks from spark-perf to SparkR for > tracking performance regressions / improvements. > https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list > of PySpark performance benchmarks
[jira] [Commented] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790681#comment-14790681 ] Glenn Strycker commented on SPARK-10636: I didn't "forget", I believed that "RDD = if {} else {} . something" would automatically take care of the associative property, and that anything after the final else {} would apply to both blocks. I didn't realize that braces behave similarly to parentheses and that I needed extras -- makes sense. I have now added these to my code. This wasn't a question for "user@ first", since I really did believe there was a bug. Jira is the place for submitting bug reports, even when the resolution is user error. > RDD filter does not work after if..then..else RDD blocks > > > Key: SPARK-10636 > URL: https://issues.apache.org/jira/browse/SPARK-10636 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > > I have an RDD declaration of the following form: > {code} > val myRDD = if (some condition) { tempRDD1.some operations } else { > tempRDD2.some operations}.filter(a => a._2._5 <= 50) > {code} > When I output the contents of myRDD, I found entries that clearly had a._2._5 > > 50... the filter didn't work! > If I move the filter inside of the if..then blocks, it suddenly does work: > {code} > val myRDD = if (some condition) { tempRDD1.some operations.filter(a => > a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) } > {code} > I ran toDebugString after both of these code examples, and "filter" does > appear in the DAG for the second example, but DOES NOT appear in the first > DAG. Why not? > Am I misusing the if..then..else syntax for Spark/Scala? > Here is my actual code... ignore the crazy naming conventions and what it's > doing... > {code} > // this does NOT work > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. > map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). 
> map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L. > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L. > map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))) >} else { > tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))) >}. >filter(a => a._2._5 <= 50). >partitionBy(partitioner). >setName("myRDD"). >persist(StorageLevel.MEMORY_AND_DISK_SER) > myRDD.checkpoint() > println(myRDD.toDebugString) > // (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 [] > // | MapPartitionsRDD[57] at map at myProgram.scala:2119 [] > // | MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 [] > // | MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 [] > // | CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 [] > // +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 [] > // | | MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 [] > // | | MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 [] > // | | CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 [] > // | +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 [] > // | | | clusterGraphWithComponentsRDD MapPartitionsRDD[28] at > reduceByKey at myProgram.scala:1689 [] > // | | | CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | | CheckpointRDD[29] at count at myProgram.scala:1701 [] > // | +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | CheckpointRDD[17] at count at myProgram.scala:394 [] > // +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | CachedPartitions: 4; MemorySize: 824.0 B; 
TachyonSize: 0.0 B; > DiskSize: 0.0 B > // | CheckpointRDD[17] at count at myProgram.scala:394 [] > // this DOES work! > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. > map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). > map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L. > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L. > map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else {
[jira] [Closed] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glenn Strycker closed SPARK-10636. -- > RDD filter does not work after if..then..else RDD blocks > > > Key: SPARK-10636 > URL: https://issues.apache.org/jira/browse/SPARK-10636 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker
[jira] [Closed] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prachi Burathoki closed SPARK-10634. Resolution: Done > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) 
> at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 
31 more
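The parse failure above comes from the unescaped {{"}} terminating the string literal early, leaving a dangling identifier ({{test}}) for the parser. A hypothetical helper (not part of Spark or the reporter's code) showing the usual workaround: emit the value as a single-quoted SQL literal, doubling any embedded single quotes per standard SQL, so embedded double quotes need no escaping at all:

```scala
// Hypothetical helper (not part of Spark): build a single-quoted SQL
// string literal. Single quotes inside the value are doubled ('' per
// standard SQL); double quotes can pass through unescaped.
object SqlQuoting {
  def quoteSqlLiteral(s: String): String =
    "'" + s.replace("'", "''") + "'"

  def main(args: Array[String]): Unit = {
    val lit = quoteSqlLiteral("this is a \"test\"")
    // Produces: ... WHERE clistc215647292 = 'this is a "test"'
    println(s"SELECT * FROM TABLE_1 WHERE clistc215647292 = $lit")
  }
}
```

Alternatively, the DataFrame API ({{df.filter(col("...") === value)}}) sidesteps SQL string quoting entirely.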
[jira] [Resolved] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10636. --- Resolution: Not A Problem In the first case, your {{.filter}} statement plainly applies only to the else block. That's the behavior you see. Did you forget to wrap the if-else in parens? I'd ask this as a question on user@ first, since I think you are posing this as a question. JIRA is not the right place to start with those. > RDD filter does not work after if..then..else RDD blocks > > > Key: SPARK-10636 > URL: https://issues.apache.org/jira/browse/SPARK-10636 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker
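The precedence at issue can be reproduced with plain Scala collections, no Spark required (a minimal sketch; {{xs}} and {{cond}} are made-up stand-ins for the RDDs and condition above): a trailing {{.filter}} attaches to the {{else}} expression only, not to the whole if-else, unless the if-else is parenthesized.

```scala
// Minimal sketch of the if..else precedence issue, using Lists in
// place of RDDs. The names xs and cond are illustrative only.
object IfElseFilter {
  val xs = List(1, 60, 2, 70)

  // Without parentheses, .filter binds to the else block alone, so
  // when cond is true the filter never runs.
  def bad(cond: Boolean): List[Int] =
    if (cond) { xs } else { xs }.filter(_ <= 50)

  // Parenthesizing the whole if-else applies the filter to either branch.
  def good(cond: Boolean): List[Int] =
    (if (cond) { xs } else { xs }).filter(_ <= 50)

  def main(args: Array[String]): Unit = {
    println(bad(true))   // List(1, 60, 2, 70) -- unfiltered
    println(good(true))  // List(1, 2)
  }
}
```

This matches the {{toDebugString}} observation in the report: in the unparenthesized form the filter stage only exists on the else path.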
[jira] [Resolved] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10634. --- Resolution: Not A Problem > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki
[jira] [Closed] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-10634. - > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 
31 more
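The root cause of the parse failure above is that the double quote delimits the string literal, so the embedded `"` before `test` terminates the literal early and leaves `test` dangling as an identifier. A hedged workaround sketch (not an official fix from this thread): delimit the literal with single quotes so interior double quotes are ordinary characters.

```scala
// Illustrative only -- the same predicate with a single-quoted literal.
// Table and column names are copied from the report.
val working =
  "SELECT clistc215647292 FROM TABLE_1 WHERE clistc215647292 = 'this is a \"test\"'"
```

With the single-quoted form the parser no longer sees a stray identifier after a prematurely closed string.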
[jira] [Assigned] (SPARK-9296) variance, var_pop, and var_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9296: --- Assignee: (was: Apache Spark) > variance, var_pop, and var_samp aggregate functions > --- > > Key: SPARK-9296 > URL: https://issues.apache.org/jira/browse/SPARK-9296 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > A short introduction on how to build aggregate functions based on our new > interface can be found at > https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.
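As background for implementers, the two named aggregates differ only in the denominator: population variance divides by n, sample variance by n - 1. A minimal standalone sketch of the definitions (function names here are illustrative; this is not Spark's aggregate interface):

```scala
// Population variance: mean squared deviation, denominator n.
def varPop(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.length
  xs.map(x => (x - mean) * (x - mean)).sum / xs.length
}

// Sample variance: Bessel's correction, denominator n - 1; needs n >= 2.
def varSamp(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.length
  xs.map(x => (x - mean) * (x - mean)).sum / (xs.length - 1)
}
```

A real single-pass UDAF would accumulate count, sum, and sum of squares (or running mean and M2) instead of materializing the sequence, but the denominators are the whole difference between the two functions.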
[jira] [Updated] (SPARK-10632) Cannot save DataFrame with User Defined Types
[ https://issues.apache.org/jira/browse/SPARK-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joao updated SPARK-10632: - Description: Cannot save DataFrames that contain user-defined types. I tried to save a DataFrame with instances of the Vector class from mllib and got the error. The code below should reproduce the error. {noformat} val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)), (2, Vectors.dense(2, 2, 2)))).toDF() df.write.format("json").mode(SaveMode.Overwrite).save(path) {noformat} The error log is below {noformat} 15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task. scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243) at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' closed. Now beginning upload 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' upload complete 15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt attempt_201509160958__m_00_0 aborted. 15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at
[jira] [Created] (SPARK-10642) Crash in rdd.lookup() with "java.lang.Long cannot be cast to java.lang.Integer"
Thouis Jones created SPARK-10642: Summary: Crash in rdd.lookup() with "java.lang.Long cannot be cast to java.lang.Integer" Key: SPARK-10642 URL: https://issues.apache.org/jira/browse/SPARK-10642 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.0 Environment: OSX Reporter: Thouis Jones Running this command: {code} sc.parallelize([(('a', 'b'), 'c')]).groupByKey().partitionBy(20).cache().lookup(('a', 'b')) {code} gives the following error: {noformat} 15/09/16 14:22:23 INFO SparkContext: Starting job: runJob at PythonRDD.scala:361 Traceback (most recent call last): File "", line 1, in File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/rdd.py", line 2199, in lookup return self.ctx.runJob(values, lambda x: x, [self.partitioner(key)]) File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/context.py", line 916, in runJob port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/sql/utils.py", line 36, in deco return f(*a, **kw) File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. 
: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitJob$1.apply(DAGScheduler.scala:530) at scala.collection.Iterator$class.find(Iterator.scala:780) at scala.collection.AbstractIterator.find(Iterator.scala:1157) at scala.collection.IterableLike$class.find(IterableLike.scala:79) at scala.collection.AbstractIterable.find(Iterable.scala:54) at org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:530) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839) at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:361) at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala) at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4440) Enhance the job progress API to expose more information
[ https://issues.apache.org/jira/browse/SPARK-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790935#comment-14790935 ] Josh Rosen commented on SPARK-4440: --- Does anyone still want these extensions? If so, can you please come up with a concrete proposal for exactly what you'd like to add? > Enhance the job progress API to expose more information > --- > > Key: SPARK-4440 > URL: https://issues.apache.org/jira/browse/SPARK-4440 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Rui Li > > The progress API introduced in SPARK-2321 provides a new way for user to > monitor job progress. However the information exposed in the API is > relatively limited. It'll be much more useful if we can enhance the API to > expose more data. > Some improvement for example may include but not limited to: > 1. Stage submission and completion time. > 2. Task metrics. > The requirement is initially identified for the hive on spark > project(HIVE-7292), other application should benefit as well.
[jira] [Resolved] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4442. --- Resolution: Won't Fix We're using test-jar dependencies instead, so this is "Won't Fix". > Move common unit test utilities into their own package / module > --- > > Key: SPARK-4442 > URL: https://issues.apache.org/jira/browse/SPARK-4442 > Project: Spark > Issue Type: Improvement > Components: Tests >Reporter: Josh Rosen >Priority: Minor > > We should move generally-useful unit test fixtures / utility methods to their > own test utilities set package / module to make them easier to find / use. > See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for > one example of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
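The test-jar approach mentioned in the resolution can be consumed from a downstream build via the published "tests" classifier. A hedged sbt sketch (coordinates and version are illustrative, not taken from this thread):

```scala
// build.sbt fragment (illustrative): depend on Spark's published test
// utilities through the "tests" artifact classifier instead of a
// separate test-utilities module.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0" % "test" classifier "tests"
```

This is why a separate test-utilities package was deemed unnecessary: the existing test fixtures already ship as classifier artifacts alongside each module.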
[jira] [Created] (SPARK-10645) Bivariate Statistics for continuous vs. continuous
Jihong MA created SPARK-10645: - Summary: Bivariate Statistics for continuous vs. continuous Key: SPARK-10645 URL: https://issues.apache.org/jira/browse/SPARK-10645 Project: Spark Issue Type: New Feature Reporter: Jihong MA This is an umbrella JIRA covering bivariate statistics for continuous vs. continuous columns, including covariance, Pearson's correlation, and Spearman's correlation (the latter applies to both continuous & categorical).
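For the continuous-vs.-continuous case, the first two statistics in the list reduce to sums of centered cross-products. A hedged standalone sketch of the definitions (illustrative names, not the eventual Spark API):

```scala
// Sample covariance (denominator n - 1); assumes xs and ys have the same
// length with at least two elements.
def cov(xs: Seq[Double], ys: Seq[Double]): Double = {
  val mx = xs.sum / xs.length
  val my = ys.sum / ys.length
  xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / (xs.length - 1)
}

// Pearson correlation: covariance normalized by both standard deviations.
// cov(xs, xs) is just the variance of xs.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double =
  cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))
```

Spearman's correlation is then Pearson's correlation applied to the ranks of the values rather than the values themselves, which is what lets it also cover ordered categorical columns.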
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790816#comment-14790816 ] Jihong MA commented on SPARK-10602: --- I went ahead and created SPARK-10641; since this JIRA is not listed as an umbrella, I couldn't link to it directly and instead linked it to SPARK-10384. @Joseph, can you assign SPARK-10641 to Seth and help fix the link? Thanks! > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA.
[jira] [Updated] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType
[ https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10485: - Affects Version/s: 1.5.0 > IF expression is not correctly resolved when one of the options have NullType > - > > Key: SPARK-10485 > URL: https://issues.apache.org/jira/browse/SPARK-10485 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Antonio Jesus Navarro > > If we have this query: > {code} > SELECT IF(column > 1, 1, NULL) FROM T1 > {code} > On Spark 1.4.1 we have this: > {code} > override lazy val resolved = childrenResolved && trueValue.dataType == > falseValue.dataType > {code} > So if one of the types is NullType, the if expression is not resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
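The quoted `resolved` check is exactly why the expression never resolves: the literal `NULL` has `NullType`, which fails the strict equality test against the integer branch. A toy sketch of the distinction (hypothetical types, not Spark's Catalyst classes):

```scala
// Hypothetical stand-ins for Catalyst data types, for illustration only.
sealed trait DataType
case object IntegerType extends DataType
case object NullType extends DataType

// 1.4.1-style check: the two branch types must match exactly, so
// IF(column > 1, 1, NULL) compares IntegerType against NullType and fails.
def resolvedStrict(t: DataType, f: DataType): Boolean = t == f

// Sketch of a lenient check: NullType is compatible with any branch type,
// since NULL can inhabit every (nullable) type.
def resolvedLenient(t: DataType, f: DataType): Boolean =
  t == f || t == NullType || f == NullType
```

The lenient rule lets the analyzer resolve the IF and give the whole expression the non-null branch's type.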
[jira] [Updated] (SPARK-10643) Support HDFS urls in spark-submit
[ https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Braithwaite updated SPARK-10643: - Description: When using mesos with docker and marathon, it would be nice to be able to make spark-submit deployable on marathon and have that download a jar from HDFS instead of having to package the jar with the docker. {code} $ docker run -it docker.example.com/spark:latest /usr/local/spark/bin/spark-submit --class com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:173) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} Although I'm aware that we can run in cluster mode with mesos, we've already built some nice tools surrounding marathon for logging and monitoring. Code in question: https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L685-L698 was: When using mesos with docker and marathon, it would be nice to be able to make spark-submit deployable on marathon and have that download a jar from HDFS instead of having to package the jar with the docker. 
{code} $ docker run -it docker.example.com/spark:latest /usr/local/spark/bin/spark-submit --class com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:173) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} Although I'm aware that we can run in cluster mode with mesos, we've already built some nice tools surrounding marathon for logging and monitoring. > Support HDFS urls in spark-submit > - > > Key: SPARK-10643 > URL: https://issues.apache.org/jira/browse/SPARK-10643 > Project: Spark > Issue Type: New Feature >Reporter: Alan Braithwaite >Priority: Minor > > When using mesos with docker and marathon, it would be nice to be able to > make spark-submit deployable on marathon and have that download a jar from > HDFS instead of having to package the jar with the docker. > {code} > $ docker run -it docker.example.com/spark:latest > /usr/local/spark/bin/spark-submit --class > com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar > Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. 
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.util.Utils$.classForName(Utils.scala:173) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) >
[jira] [Updated] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10641: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10384 > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Implementing skewness and kurtosis support based on following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
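The referenced Wikipedia algorithm maintains the central moments M2 through M4 in a single exact pass. A hedged standalone transcription in plain Scala (illustrative; not the eventual Spark UDAF code):

```scala
// One-pass accumulator for count, mean, and central moments M2, M3, M4,
// following the "higher-order statistics" update rules the ticket links to.
// Each update must read the OLD m2/m3, which the immutable copy guarantees.
final case class Moments(n: Long, mean: Double, m2: Double, m3: Double, m4: Double) {
  def add(x: Double): Moments = {
    val n1    = n + 1
    val delta = x - mean
    val dn    = delta / n1
    val dn2   = dn * dn
    val t1    = delta * dn * n
    Moments(
      n1,
      mean + dn,
      m2 + t1,
      m3 + t1 * dn * (n1 - 2) - 3 * dn * m2,
      m4 + t1 * dn2 * (n1 * n1 - 3 * n1 + 3) + 6 * dn2 * m2 - 4 * dn * m3
    )
  }
  // Sample skewness g1 and excess kurtosis from the accumulated moments.
  def skewness: Double = math.sqrt(n.toDouble) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n * m4 / (m2 * m2) - 3.0
}

val m = Seq(1.0, 2.0, 3.0, 4.0, 5.0).foldLeft(Moments(0L, 0.0, 0.0, 0.0, 0.0))(_.add(_))
```

The same reference also gives formulas for merging two such accumulators, which is what would make this usable as a distributed (per-partition, then combined) aggregate.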
[jira] [Resolved] (SPARK-10589) Add defense against external site framing
[ https://issues.apache.org/jira/browse/SPARK-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10589. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8745 [https://github.com/apache/spark/pull/8745] > Add defense against external site framing > - > > Key: SPARK-10589 > URL: https://issues.apache.org/jira/browse/SPARK-10589 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 1.6.0 > > > This came up as a minor point during a security audit using a common scanning > tool: It's best if Spark UIs try to actively defend against certain types of > frame-related vulnerabilities by setting X-Frame-Options. See > https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet > Easy PR coming ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2991) RDD transforms for scan and scanLeft
[ https://issues.apache.org/jira/browse/SPARK-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2991. --- Resolution: Won't Fix > RDD transforms for scan and scanLeft > - > > Key: SPARK-2991 > URL: https://issues.apache.org/jira/browse/SPARK-2991 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Erik Erlandson >Assignee: Erik Erlandson >Priority: Minor > Labels: features > > Provide RDD transforms analogous to Scala scan(z)(f) (parallel prefix scan) > and scanLeft(z)(f) (sequential prefix scan) > Discussion of a scanLeft implementation: > http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/ > Discussion of scan: > http://erikerlandson.github.io/blog/2014/08/12/implementing-parallel-prefix-scan-as-a-spark-rdd-transform/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
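The cascade idea from the linked posts can be shown on plain Scala collections: scan each partition locally, then shift each partition's scan by the fold of all partitions before it. A hedged sketch assuming `f` is associative with `z` as its identity (illustrative, not the proposed RDD API):

```scala
// Prefix scan over pre-partitioned data. Matches Seq.scanLeft semantics:
// the result has length (total elements + 1) and starts with z.
def scanLeftPartitions[A](parts: Seq[Seq[A]], z: A)(f: (A, A) => A): Seq[A] = {
  // Cumulative totals of everything before each partition (the "cascade").
  val offsets = parts.map(_.foldLeft(z)(f)).scanLeft(z)(f)
  // Scan each partition from its offset; drop each local seed so boundary
  // values are not repeated, then restore the single global seed z.
  z +: parts.zip(offsets).flatMap { case (p, off) => p.scanLeft(off)(f).tail }
}
```

On an RDD, the per-partition totals would come from one small job, and the shifted local scans from a second `mapPartitionsWithIndex` pass, giving the parallel scan; the sequential scanLeft variant instead chains partitions one after another.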
[jira] [Resolved] (SPARK-3497) Report serialized size of task binary
[ https://issues.apache.org/jira/browse/SPARK-3497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3497. --- Resolution: Fixed We now have an automatic warning-level log message for large closures. > Report serialized size of task binary > - > > Key: SPARK-3497 > URL: https://issues.apache.org/jira/browse/SPARK-3497 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Sandy Ryza > > This is useful for determining that task closures are larger than expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790792#comment-14790792 ] Sean Owen commented on SPARK-10636: --- It's a Scala syntax issue, as you say when you wondered if you're not using if..else in the way you think. Stuff happens, but if that crosses your mind, I'd just float a user@ question first or try a simple test in Scala to narrow down the behavior in question. > RDD filter does not work after if..then..else RDD blocks > > > Key: SPARK-10636 > URL: https://issues.apache.org/jira/browse/SPARK-10636 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > > I have an RDD declaration of the following form: > {code} > val myRDD = if (some condition) { tempRDD1.some operations } else { > tempRDD2.some operations}.filter(a => a._2._5 <= 50) > {code} > When I output the contents of myRDD, I found entries that clearly had a._2._5 > > 50... the filter didn't work! > If I move the filter inside of the if..then blocks, it suddenly does work: > {code} > val myRDD = if (some condition) { tempRDD1.some operations.filter(a => > a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) } > {code} > I ran toDebugString after both of these code examples, and "filter" does > appear in the DAG for the second example, but DOES NOT appear in the first > DAG. Why not? > Am I misusing the if..then..else syntax for Spark/Scala? > Here is my actual code... ignore the crazy naming conventions and what it's > doing... > {code} > // this does NOT work > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. > map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). > map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L. > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L. 
> map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))) >} else { > tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))) >}. >filter(a => a._2._5 <= 50). >partitionBy(partitioner). >setName("myRDD"). >persist(StorageLevel.MEMORY_AND_DISK_SER) > myRDD.checkpoint() > println(myRDD.toDebugString) > // (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 [] > // | MapPartitionsRDD[57] at map at myProgram.scala:2119 [] > // | MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 [] > // | MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 [] > // | CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 [] > // +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 [] > // | | MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 [] > // | | MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 [] > // | | CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 [] > // | +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 [] > // | | | clusterGraphWithComponentsRDD MapPartitionsRDD[28] at > reduceByKey at myProgram.scala:1689 [] > // | | | CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | | CheckpointRDD[29] at count at myProgram.scala:1701 [] > // | +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | CheckpointRDD[17] at count at myProgram.scala:394 [] > // +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; > DiskSize: 0.0 B > // | CheckpointRDD[17] at count at myProgram.scala:394 [] > // this DOES work! > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. 
> map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). > map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L. > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L. > map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))). > filter(a => a._2._5 <= 50) >} else { > tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))). > filter(a => a._2._5 <= 50) >}. >partitionBy(partitioner). >setName("myRDD"). >
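As Sean notes, this is Scala precedence rather than an RDD bug: a method call after `else { ... }` binds to the else block alone, so the trailing `.filter` never applies to the then-branch. A minimal demonstration on plain collections:

```scala
val xs   = List(10, 60)
val cond = true

// Parses as: if (cond) { xs } else ({ xs }.filter(_ <= 50))
// -- the filter is part of the else branch only.
val unparenthesized = if (cond) { xs } else { xs }.filter(_ <= 50)

// Parenthesizing the whole if/else applies the filter to either branch.
val parenthesized = (if (cond) { xs } else { xs }).filter(_ <= 50)
```

This is also why `filter` shows up in the DAG only when moved inside both branches: in the original form the then-branch's lineage simply never contains it.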
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790859#comment-14790859 ] Joseph K. Bradley commented on SPARK-10602: --- Yeah, JIRA only allows 2 levels of subtasks (a long-time annoyance of mine!). I'd recommend linking here using "contains." I'll fix the link for now. > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA.
[jira] [Resolved] (SPARK-869) Retrofit rest of RDD api to use proper serializer type
[ https://issues.apache.org/jira/browse/SPARK-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-869. -- Resolution: Done Going to resolve this as Done; please open a new JIRA if you find specific examples where we're using the wrong serializer. > Retrofit rest of RDD api to use proper serializer type > -- > > Key: SPARK-869 > URL: https://issues.apache.org/jira/browse/SPARK-869 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 0.8.0 >Reporter: Dmitriy Lyubimov > > SPARK-826 and SPARK-827 resolved proper serialization support for some RDD > method parameters, but not all. > This issue is to address the rest of RDD api and operations. Most of the time > this is due to wrapping RDD parameters into a closure which can only use a > closure serializer to communicate to the backend. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied
[ https://issues.apache.org/jira/browse/SPARK-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-10640: - Assignee: Thomas Graves > Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied > -- > > Key: SPARK-10640 > URL: https://issues.apache.org/jira/browse/SPARK-10640 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > > I'm seeing an exception from the spark history server trying to read a > history file: > scala.MatchError: TaskCommitDenied (of class java.lang.String) > at > org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775) > at > org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531) > at > org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289) > at > 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
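The scala.MatchError above indicates the JSON reader has no branch for the "TaskCommitDenied" tag when mapping a task-end reason back to a structured value. A minimal illustrative sketch of that kind of tag dispatch (in Python, not Spark's actual Scala JsonProtocol; the field names are assumptions):

```python
# Hypothetical sketch of tag-based dispatch for task-end reasons. Without
# the "TaskCommitDenied" branch, an unknown tag falls through and fails,
# mirroring the MatchError in the report.

def task_end_reason_from_json(event):
    tag = event["Reason"]
    if tag == "Success":
        return {"kind": "Success"}
    if tag == "TaskCommitDenied":
        # The branch whose absence produced the scala.MatchError above.
        return {
            "kind": "TaskCommitDenied",
            "job_id": event.get("Job ID"),          # field names assumed
            "partition_id": event.get("Partition ID"),
            "attempt_id": event.get("Attempt ID"),
        }
    raise ValueError("unknown task end reason: %s" % tag)
```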
[jira] [Updated] (SPARK-10371) Optimize sequential projections
[ https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10371: -- Description: In ML pipelines, each transformer/estimator appends new columns to the input DataFrame. For example, it might produce DataFrames like the following columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used. It would be nice to detect this pattern and re-use intermediate values. {code} val input = sqlContext.range(10) val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 2) output.explain(true) == Parsed Logical Plan == 'Project [*,('x * 2) AS y#254] Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 == Analyzed Logical Plan == id: bigint, x: bigint, y: bigint Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L] Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 == Optimized Logical Plan == Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L] LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 == Physical Plan == TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L] Scan PhysicalRDD[id#252L] Code Generation: true input: org.apache.spark.sql.DataFrame = [id: bigint] output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint] {code} was: In ML pipelines, each transformer/estimator appends new columns to the input DataFrame. For example, it might produce DataFrames like the following columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). Some UDFs could be expensive. 
However, if we materialize c and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used. It would be nice to detect this pattern and re-use intermediate values. > Optimize sequential projections > --- > > Key: SPARK-10371 > URL: https://issues.apache.org/jira/browse/SPARK-10371 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > > In ML pipelines, each transformer/estimator appends new columns to the input > DataFrame. For example, it might produce DataFrames like the following > columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), > and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c > and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used. > It would be nice to detect this pattern and re-use intermediate values. > {code} > val input = sqlContext.range(10) > val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * > 2) > output.explain(true) > == Parsed Logical Plan == > 'Project [*,('x * 2) AS y#254] > Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Analyzed Logical Plan == > id: bigint, x: bigint, y: bigint > Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L] > Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Optimized Logical Plan == > Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Physical Plan == > TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS > y#254L] > Scan PhysicalRDD[id#252L] > Code Generation: true > input: org.apache.spark.sql.DataFrame = [id: bigint] > output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, 
e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
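The reuse being requested can be illustrated outside Catalyst with a toy expression-tree evaluator that memoizes on expression structure (purely illustrative; the udf names follow the description above, and this is not how Spark would implement it):

```python
# Toy sketch: memoize evaluation of a nested expression tree so a shared
# intermediate like c = udf_c(udf_b(a)) is computed once even when several
# output columns depend on it.

calls = {"udf_b": 0, "udf_c": 0, "udf_d": 0}

def udf_b(a):
    calls["udf_b"] += 1
    return a + 1

def udf_c(b):
    calls["udf_c"] += 1
    return b * 2

def udf_d(c):
    calls["udf_d"] += 1
    return c - 3

UDFS = {"udf_b": udf_b, "udf_c": udf_c, "udf_d": udf_d}

def evaluate(expr, cache):
    # expr is a literal or a tuple ("udf_name", sub_expr); the cache keys
    # on the expression's structure, so identical subtrees compute once.
    if not isinstance(expr, tuple):
        return expr
    if expr in cache:
        return cache[expr]
    name, arg = expr
    result = UDFS[name](evaluate(arg, cache))
    cache[expr] = result
    return result

# Columns as in the description: b = udf_b(a), c = udf_c(b), d = udf_d(c).
a = 5
c_expr = ("udf_c", ("udf_b", a))
d_expr = ("udf_d", c_expr)

cache = {}
c_val = evaluate(c_expr, cache)   # computes udf_b, udf_c
d_val = evaluate(d_expr, cache)   # reuses cached c, computes only udf_d
```

With a shared cache, udf_b and udf_c each run once; without it, materializing both c and d would trigger them twice, as the issue describes.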
[jira] [Resolved] (SPARK-3489) support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params
[ https://issues.apache.org/jira/browse/SPARK-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3489. --- Resolution: Won't Fix Resolving as "Won't Fix" per PR discussion. > support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params > -- > > Key: SPARK-3489 > URL: https://issues.apache.org/jira/browse/SPARK-3489 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2 >Reporter: Mohit Jaggi >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
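For reference, the requested variadic zip can be approximated today by folding pairwise zips and flattening the nested pairs; a toy sketch using plain Python lists as stand-ins for RDDs (on real RDDs one would chain `rdd.zip(other)` with a `map` that flattens the tuples):

```python
# Sketch of a variadic zip built from pairwise zips; lists stand in for RDDs.
from functools import reduce

def zip_many(first, *rest):
    def pair(acc, nxt):
        # acc holds tuples built so far; append the next element to each.
        return [a + (b,) for a, b in zip(acc, nxt)]
    return reduce(pair, rest, [(x,) for x in first])
```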
[jira] [Created] (SPARK-10644) Applications wait even if free executors are available
Balagopal Nair created SPARK-10644: -- Summary: Applications wait even if free executors are available Key: SPARK-10644 URL: https://issues.apache.org/jira/browse/SPARK-10644 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.5.0 Environment: RHEL 6.5 64 bit Reporter: Balagopal Nair Number of workers: 21 Number of executors: 63 Steps to reproduce: 1. Run 4 jobs each with max cores set to 10 2. The first 3 jobs run with 10 each. (30 executors consumed so far) 3. The 4th job waits even though there are 33 idle executors. The reason is that a job will not get executors unless the total number of EXECUTORS in use < the number of WORKERS. If there are executors available, resources should be allocated to the pending job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
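The reported condition can be reproduced in a few lines; this is a sketch of the behaviour as described in the report (an assumption about the scheduler reconstructed from the symptom, not the scheduler's actual code):

```python
# Sketch of the allocation condition from the report: a job is granted
# executors only while the executors already in use stay below the worker
# count, regardless of how many executor slots remain idle.

NUM_WORKERS = 21
TOTAL_EXECUTORS = 63
EXECUTORS_PER_JOB = 10

def can_grant(executors_in_use, num_workers=NUM_WORKERS):
    return executors_in_use < num_workers

in_use = 0
granted = []
for job in range(1, 5):          # four jobs, max cores 10 each
    if can_grant(in_use):
        granted.append(job)
        in_use += EXECUTORS_PER_JOB

idle = TOTAL_EXECUTORS - in_use  # executors sitting idle while job 4 waits
```

Jobs 1 through 3 are granted 10 executors each; once 30 are in use (more than the 21 workers), job 4 waits even though 33 executor slots are idle, matching the steps above.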
[jira] [Resolved] (SPARK-4568) Publish release candidates under $VERSION-RCX instead of $VERSION
[ https://issues.apache.org/jira/browse/SPARK-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4568. --- Resolution: Fixed We now do this. > Publish release candidates under $VERSION-RCX instead of $VERSION > - > > Key: SPARK-4568 > URL: https://issues.apache.org/jira/browse/SPARK-4568 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4568) Publish release candidates under $VERSION-RCX instead of $VERSION
[ https://issues.apache.org/jira/browse/SPARK-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790945#comment-14790945 ] Josh Rosen edited comment on SPARK-4568 at 9/16/15 7:00 PM: We now do this. Specifically, we publish RC's under both names. was (Author: joshrosen): We now do this. > Publish release candidates under $VERSION-RCX instead of $VERSION > - > > Key: SPARK-4568 > URL: https://issues.apache.org/jira/browse/SPARK-4568 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791016#comment-14791016 ] Nicholas Chammas commented on SPARK-4216: - Thanks Josh! > Eliminate duplicate Jenkins GitHub posts from AMPLab > > > Key: SPARK-4216 > URL: https://issues.apache.org/jira/browse/SPARK-4216 > Project: Spark > Issue Type: Bug > Components: Build, Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > * [Real Jenkins | > https://github.com/apache/spark/pull/2988#issuecomment-60873361] > * [Imposter Jenkins | > https://github.com/apache/spark/pull/2988#issuecomment-60873366] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10641) skewness and kurtosis support
Jihong MA created SPARK-10641: - Summary: skewness and kurtosis support Key: SPARK-10641 URL: https://issues.apache.org/jira/browse/SPARK-10641 Project: Spark Issue Type: New Feature Components: ML, SQL Reporter: Jihong MA Implementing skewness and kurtosis support based on the following algorithm: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
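The linked one-pass algorithm maintains the mean and the second through fourth central moments incrementally; skewness and kurtosis then fall out of M2, M3, and M4. A sketch in Python (illustrative of the algorithm, not the eventual Spark implementation):

```python
# One-pass higher-order moments, following the update formulas on the
# Wikipedia page linked above. Note the update order matters: M4 uses the
# old M2 and M3, and M3 uses the old M2.
import math

class MomentAggregator:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = self.m3 = self.m4 = 0.0

    def add(self, x):
        n1 = self.n
        self.n += 1
        delta = x - self.mean
        delta_n = delta / self.n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        self.m4 += (term1 * delta_n2 * (self.n * self.n - 3 * self.n + 3)
                    + 6 * delta_n2 * self.m2 - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (self.n - 2) - 3 * delta_n * self.m2
        self.m2 += term1

    def skewness(self):
        return math.sqrt(self.n) * self.m3 / self.m2 ** 1.5

    def kurtosis(self):
        # Excess kurtosis (normal distribution -> 0).
        return self.n * self.m4 / (self.m2 * self.m2) - 3.0
```

Because each update only touches a fixed set of running sums, this form parallelizes naturally (per-partition aggregators merged at the end), which is presumably why the ticket points at this family of algorithms.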
[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790804#comment-14790804 ] Sudarshan Kadambi commented on SPARK-10320: --- Sure, a function as proposed that allows for the topic, partitions and offsets to be specified in a fine-grained manner is needed to provide the full flexibility we desire (starting at an arbitrary offset within each topic partition). If separate DStreams are desired for each topic, do you intend for createDirectStream to be called multiple times (with a different subscription topic each time) both before and after the streaming context is started? Also, what kind of defaults did you have in mind? For example, I might require the ability to specify new topics after the streaming context is started but might not want the burden of being aware of the partitions within the topic or the offsets. I might simply want to default to either the start or the end of each partition that exists for that topic. > Kafka Support new topic subscriptions without requiring restart of the > streaming context > > > Key: SPARK-10320 > URL: https://issues.apache.org/jira/browse/SPARK-10320 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Sudarshan Kadambi > > Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe > from current ones once the streaming context has been started. Restarting the > streaming context increases the latency of update handling. > Consider a streaming application subscribed to n topics. Let's say 1 of the > topics is no longer needed in streaming analytics and hence should be > dropped. We could do this by stopping the streaming context, removing that > topic from the topic list and restarting the streaming context. Since with > some DStreams such as DirectKafkaStream, the per-partition offsets are > maintained by Spark, we should be able to resume uninterrupted (I think?) > from where we left off with a minor delay.
However, in instances where > expensive state initialization (from an external datastore) may be needed for > datasets published to all topics, before streaming updates can be applied to > it, it is more convenient to only subscribe or unsubscribe to the incremental > changes to the topic list. Without such a feature, updates go unprocessed for > longer than they need to be, thus affecting QoS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
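The "default to the start or end of each partition" idea from the comment above can be sketched as a small offset-resolution helper (names and signature are hypothetical, not a real Spark or Kafka API):

```python
# Hypothetical helper: when a topic is added after the streaming context
# starts, resolve every partition of that topic to a default offset,
# either the earliest (0) or the latest known offset, instead of requiring
# the caller to enumerate partitions and offsets explicitly.

EARLIEST, LATEST = "earliest", "latest"

def resolve_offsets(new_topics, partition_counts, latest_offsets,
                    default=LATEST):
    """Return a {(topic, partition): offset} map for newly added topics."""
    resolved = {}
    for topic in new_topics:
        for partition in range(partition_counts[topic]):
            if default == EARLIEST:
                resolved[(topic, partition)] = 0
            else:
                resolved[(topic, partition)] = latest_offsets[(topic, partition)]
    return resolved
```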
[jira] [Created] (SPARK-10643) Support HDFS urls in spark-submit
Alan Braithwaite created SPARK-10643: Summary: Support HDFS urls in spark-submit Key: SPARK-10643 URL: https://issues.apache.org/jira/browse/SPARK-10643 Project: Spark Issue Type: New Feature Reporter: Alan Braithwaite Priority: Minor When using Mesos with Docker and Marathon, it would be nice to be able to make spark-submit deployable on Marathon and have it download a jar from HDFS instead of having to package the jar with the Docker image. {code} $ docker run -it docker.example.com/spark:latest /usr/local/spark/bin/spark-submit --class com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:173) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} Although I'm aware that we can run in cluster mode with Mesos, we've already built some nice tools surrounding Marathon for logging and monitoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
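Until spark-submit accepts hdfs:// jars in this mode, one workaround is to copy the jar to local disk before submitting. A hedged sketch of a wrapper that builds the two commands (paths are illustrative, and the commands are only constructed here, not executed):

```python
# Sketch of a submit wrapper: fetch the jar from HDFS to a local path with
# `hdfs dfs -get`, then hand spark-submit a local jar instead of a remote
# URL. Run the returned commands with subprocess.check_call in practice.

def build_fetch_and_submit(hdfs_jar, local_jar, main_class):
    fetch = ["hdfs", "dfs", "-get", hdfs_jar, local_jar]
    submit = ["spark-submit", "--class", main_class, local_jar]
    return fetch, submit
```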
[jira] [Resolved] (SPARK-4738) Update the netty-3.x version in spark-assembly-*.jar
[ https://issues.apache.org/jira/browse/SPARK-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4738. --- Resolution: Incomplete Resolving as "Incomplete" since this is an old issue and it doesn't look like there's any action to take here. In newer Spark versions, you should be able to use the various user-classpath-first configurations to work around this. > Update the netty-3.x version in spark-assembly-*.jar > > > Key: SPARK-4738 > URL: https://issues.apache.org/jira/browse/SPARK-4738 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.1.0 >Reporter: Tobias Pfeiffer >Priority: Minor > > It seems as if the version of akka-remote (2.2.3-shaded-protobuf) that is > bundled in the spark-assembly-1.1.1-hadoop2.4.0.jar file pulls in an ancient > version of netty, namely io.netty:netty:3.6.6.Final (using the package > org.jboss.netty). This means that when using spark-submit, there will always > be this netty version on the classpath before any versions added by the user. > This may lead to issues with other packages that depend on newer versions and > may fail with java.lang.NoSuchMethodError etc. (finagle-http in my case). > I wonder if it is possible to manually include a newer netty version, like > netty-3.8.0.Final. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
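For reference, the user-classpath-first settings mentioned in the resolution look like the following in newer Spark versions (a config fragment, not code; property names may vary by version, so verify against the configuration docs for your release):

```
spark.driver.userClassPathFirst    true
spark.executor.userClassPathFirst  true
```

With these set, user-supplied jars (e.g. a newer netty) take precedence over the versions bundled in the Spark assembly.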
[jira] [Closed] (SPARK-4087) Only use broadcast for large tasks
[ https://issues.apache.org/jira/browse/SPARK-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen closed SPARK-4087. - Resolution: Won't Fix > Only use broadcast for large tasks > -- > > Key: SPARK-4087 > URL: https://issues.apache.org/jira/browse/SPARK-4087 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Davies Liu >Priority: Critical > > After we started broadcasting every task, some regressions were introduced because > broadcast is not stable enough. > So we would like to only use broadcast for large tasks, which will keep the > same behaviour as 1.0 in most cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org