[jira] [Created] (SPARK-10630) createDataFrame from a Java List

2015-09-16 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10630:
-

 Summary: createDataFrame from a Java List
 Key: SPARK-10630
 URL: https://issues.apache.org/jira/browse/SPARK-10630
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Xiangrui Meng
Priority: Minor


It would be nice to support creating a DataFrame directly from a Java List of 
Row:

{code}
def createDataFrame(data: java.util.List[Row], schema: StructType): DataFrame
{code}
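
For illustration, a minimal sketch of how the proposed overload might be used (hypothetical until implemented; assumes an existing {{sqlContext}}):

{code}
import java.util.Arrays

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// The data arrives as a java.util.List[Row], e.g. handed over from Java code.
val rows: java.util.List[Row] = Arrays.asList(Row(1, "a"), Row(2, "b"))
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

// Proposed overload (not yet available when this issue was filed):
val df = sqlContext.createDataFrame(rows, schema)
{code}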






[jira] [Created] (SPARK-10631) Add missing API doc in pyspark.mllib.linalg.Vector

2015-09-16 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10631:
-

 Summary: Add missing API doc in pyspark.mllib.linalg.Vector
 Key: SPARK-10631
 URL: https://issues.apache.org/jira/browse/SPARK-10631
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib, PySpark
Reporter: Xiangrui Meng
Priority: Minor


There are some missing API docs in pyspark.mllib.linalg.Vector (including 
DenseVector and SparseVector). We should add them based on their Scala 
counterparts.






[jira] [Resolved] (SPARK-10516) Add values as a property to DenseVector in PySpark

2015-09-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10516.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8682
[https://github.com/apache/spark/pull/8682]

> Add values as a property to DenseVector in PySpark
> --
>
> Key: SPARK-10516
> URL: https://issues.apache.org/jira/browse/SPARK-10516
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Vinod KC
>Priority: Trivial
>  Labels: starter
> Fix For: 1.6.0
>
>
> We use `values` in Scala but `array` in PySpark. We should add `values` as a 
> property to match the Scala implementation.
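
For reference, a minimal Scala sketch of the counterpart that PySpark would mirror (assumes spark-mllib on the classpath):

{code}
import org.apache.spark.mllib.linalg.DenseVector

// In Scala the underlying array is exposed as `values`; this issue asks for the
// same property name on the PySpark DenseVector.
val dv = new DenseVector(Array(1.0, 2.0, 3.0))
val raw: Array[Double] = dv.values
{code}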






[jira] [Commented] (SPARK-10630) createDataFrame from a Java List

2015-09-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746968#comment-14746968
 ] 

holdenk commented on SPARK-10630:
-

Sounds good to me :) I'll give it a shot :)

> createDataFrame from a Java List
> -
>
> Key: SPARK-10630
> URL: https://issues.apache.org/jira/browse/SPARK-10630
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Minor
>
> It would be nice to support creating a DataFrame directly from a Java List of 
> Row:
> {code}
> def createDataFrame(data: java.util.List[Row], schema: StructType): DataFrame
> {code}






[jira] [Updated] (SPARK-5314) java.lang.OutOfMemoryError in SparkSQL with GROUP BY

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5314:
-
Assignee: Michael Armbrust

> java.lang.OutOfMemoryError in SparkSQL with GROUP BY
> 
>
> Key: SPARK-5314
> URL: https://issues.apache.org/jira/browse/SPARK-5314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Alex Baretta
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> I am running a SparkSQL GROUP BY query on a largish Parquet table (a few 
> hundred million rows), weighing in at about 50 GB. My cluster has 1.7 TB of 
> RAM, so it should have more than enough resources to cope with this query.
> WARN TaskSetManager: Lost task 279.0 in stage 22.0 (TID 1229, 
> ds-model-w-21.c.eastern-gravity-771.internal): java.lang.OutOfMemoryError: GC 
> overhead limit exceeded
> at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
> at scala.collection.AbstractSeq.distinct(Seq.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.Coalesce.resolved$lzycompute(nullFunctions.scala:33)
> at 
> org.apache.spark.sql.catalyst.expressions.Coalesce.resolved(nullFunctions.scala:33)
> at 
> org.apache.spark.sql.catalyst.expressions.Coalesce.dataType(nullFunctions.scala:37)
> at 
> org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:100)
> at 
> org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:101)
> at 
> org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:571)
> at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)
> at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)






[jira] [Updated] (SPARK-5397) Assigning aliases to several return values of an UDF

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5397:
-
Assignee: Michael Armbrust

> Assigning aliases to several return values of an UDF
> 
>
> Key: SPARK-5397
> URL: https://issues.apache.org/jira/browse/SPARK-5397
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Max
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> A query with the following syntax is not valid SQL in Spark due to the assignment 
> of multiple aliases.
> So it does not seem possible for me to port former HiveQL queries with UDFs 
> returning multiple values to Spark SQL.
> Query 
>  
> SELECT my_function(param_one, param_two) AS (return_one, return_two,
> return_three) 
> FROM my_table; 
> Error 
>  
> Unsupported language features in query: SELECT my_function(param_one,
> param_two) AS (return_one, return_two, return_three) 
> FROM my_table; 
> TOK_QUERY 
>   TOK_FROM 
> TOK_TABREF 
>   TOK_TABNAME 
> my_table 
> TOK_SELECT 
>   TOK_SELEXPR 
> TOK_FUNCTION 
>   my_function 
>   TOK_TABLE_OR_COL 
> param_one 
>   TOK_TABLE_OR_COL 
> param_two 
> return_one 
> return_two 
> return_three 






[jira] [Updated] (SPARK-5302) Add support for SQLContext "partition" columns

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5302:
-
Assignee: Cheng Lian

> Add support for SQLContext "partition" columns
> --
>
> Key: SPARK-5302
> URL: https://issues.apache.org/jira/browse/SPARK-5302
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Bob Tiernay
>Assignee: Cheng Lian
> Fix For: 1.4.0
>
>
> For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to 
> support a virtual column that maps to part of the file path, similar to 
> what is done in Hive for partitions (e.g. {{/data/clicks/dt=2015-01-01/}} 
> where {{dt}} is a column of type {{TEXT}}). 
> The API could allow the user to type the column using an appropriate 
> {{DataType}} instance. This new field could be addressed in SQL statements 
> much the same as is done in Hive. 
> As a consequence, pruning of partitions could be possible when executing a 
> query and also remove the need to materialize a column in each logical 
> partition that is already encoded in the path name. Furthermore, this would 
> provide a nice interop and migration strategy for Hive users who may one day 
> use {{SQLContext}} directly.






[jira] [Updated] (SPARK-5421) SparkSql throw OOM at shuffle

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5421:
-
Assignee: Michael Armbrust

> SparkSql throw OOM at shuffle
> -
>
> Key: SPARK-5421
> URL: https://issues.apache.org/jira/browse/SPARK-5421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Hong Shen
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> ExternalAppendOnlyMap is only used for Spark jobs whose aggregator is defined, 
> but Spark SQL's ShuffledRDD does not define an aggregator, so Spark SQL will not 
> spill at shuffle and it is very easy to hit an OOM there. I think Spark SQL also 
> needs to spill at shuffle.
> One of the executor's log, here is  stderr:
> 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
> for shuffle 1, fetching them
> 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
> actor = 
> Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484]
> 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations
> 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 
> non-empty blocks out of 143 blocks
> 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote 
> fetches in 72 ms
> 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
> SIGNAL 15: SIGTERM
> here is  stdout:
> 2015-01-27T07:44:43.487+0800: [Full GC 3961343K->3959868K(3961344K), 
> 29.8959290 secs]
> 2015-01-27T07:45:13.460+0800: [Full GC 3961343K->3959992K(3961344K), 
> 27.9218150 secs]
> 2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs]
> 2015-01-27T07:45:52.950+0800: [Full GC 3961343K->3960113K(3961344K), 
> 29.3894670 secs]
> 2015-01-27T07:46:22.393+0800: [Full GC 3961118K->3960240K(3961344K), 
> 28.9879600 secs]
> 2015-01-27T07:46:51.393+0800: [Full GC 3960240K->3960213K(3961344K), 
> 34.1530900 secs]
> #
> # java.lang.OutOfMemoryError: Java heap space
> # -XX:OnOutOfMemoryError="kill %p"
> #   Executing /bin/sh -c "kill 9050"...
> 2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs]






[jira] [Updated] (SPARK-8786) Create a wrapper for BinaryType

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8786:
-
Assignee: Josh Rosen

> Create a wrapper for BinaryType
> ---
>
> Key: SPARK-8786
> URL: https://issues.apache.org/jira/browse/SPARK-8786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Josh Rosen
> Fix For: 1.5.0
>
>
> The hashCode and equals() of Array[Byte] do not compare the actual bytes 
> (arrays use reference equality), so we should create a wrapper (internally) that 
> does.
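
For illustration, a quick demonstration of the problem and one possible wrapper shape (a generic sketch, not Spark's actual internal class):

{code}
// Equal byte sequences still compare as different, because JVM arrays use
// reference equality for equals() and identity-based hashCode().
val a = Array[Byte](1, 2, 3)
val b = Array[Byte](1, 2, 3)
println(a == b)                         // false
println(java.util.Arrays.equals(a, b))  // true

// Sketch of a value wrapper with content-based equals/hashCode.
class ByteArrayWrapper(val bytes: Array[Byte]) {
  override def equals(other: Any): Boolean = other match {
    case o: ByteArrayWrapper => java.util.Arrays.equals(bytes, o.bytes)
    case _ => false
  }
  override def hashCode: Int = java.util.Arrays.hashCode(bytes)
}
{code}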






[jira] [Updated] (SPARK-6632) Optimize the parquetSchema to metastore schema reconciliation, so that the process is delegated to each map task itself

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6632:
-
Assignee: Michael Armbrust

> Optimize the parquetSchema to metastore schema reconciliation, so that the 
> process is delegated to each map task itself
> ---
>
> Key: SPARK-6632
> URL: https://issues.apache.org/jira/browse/SPARK-6632
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yash Datta
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> Currently in ParquetRelation2, the schema from all the part files is first 
> merged and then reconciled with the metastore schema. This approach does not 
> scale when we have thousands of partitions for the table. We can take a 
> different approach where we go ahead with the metastore schema and 
> reconcile the names of the columns within each map task, using the ReadSupport 
> hooks provided by Parquet.






[jira] [Commented] (SPARK-10631) Add missing API doc in pyspark.mllib.linalg.Vector

2015-09-16 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746994#comment-14746994
 ] 

Vinod KC commented on SPARK-10631:
--

I'm working on this

> Add missing API doc in pyspark.mllib.linalg.Vector
> --
>
> Key: SPARK-10631
> URL: https://issues.apache.org/jira/browse/SPARK-10631
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Priority: Minor
>
> There are some missing API docs in pyspark.mllib.linalg.Vector (including 
> DenseVector and SparseVector). We should add them based on their Scala 
> counterparts.






[jira] [Commented] (SPARK-10577) [PySpark] DataFrame hint for broadcast join

2015-09-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747058#comment-14747058
 ] 

Maciej Bryński commented on SPARK-10577:


Tested this patch today on 1.5.0.
Works great.
Thank you.

> [PySpark] DataFrame hint for broadcast join
> ---
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>  Labels: starter
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be the possibility to add a hint for a broadcast join in:
> - PySpark (see the Scala sketch of the existing hint below)
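
A minimal Scala sketch of the hint that SPARK-8300 added and that this issue asks to expose from PySpark (assumes two DataFrames named {{large}} and {{small}} sharing an {{id}} column):

{code}
import org.apache.spark.sql.functions.broadcast

// Marking `small` as broadcastable nudges the planner toward a broadcast hash join.
val joined = large.join(broadcast(small), "id")
{code}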






[jira] [Updated] (SPARK-10508) incorrect evaluation of searched case expression

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10508:
--
Assignee: Josh Rosen

> incorrect evaluation of searched case expression
> 
>
> Key: SPARK-10508
> URL: https://issues.apache.org/jira/browse/SPARK-10508
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: N Campbell
>Assignee: Josh Rosen
> Fix For: 1.4.1, 1.5.0
>
>
> The following case expression never evaluates to 'test1' when cdec is -1 or 
> 10, as it does for Hive 0.13. Instead it returns 'other' for all rows.
> {code}
> select rnum, cdec, case when cdec in ( -1,10,0.1 )  then 'test1' else 'other' 
> end from tdec 
> create table  if not exists TDEC ( RNUM int , CDEC decimal(7, 2 ))
> TERMINATED BY '\n' 
>  STORED AS orc  ;
> 0|\N
> 1|-1.00
> 2|0.00
> 3|1.00
> 4|0.10
> 5|10.00
> {code}






[jira] [Updated] (SPARK-10632) Cannot save DataFrame with User Defined Types

2015-09-16 Thread Joao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joao updated SPARK-10632:
-
Description: 
Cannot save DataFrames that contain user-defined types.
At first I thought it was a problem with my UDT class, then I tried the Vector 
class from MLlib and the error was the same.


The code below should reproduce the error.
{noformat}
val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)),
  (2, Vectors.dense(2, 2, 2)))).toDF()
df.write.format("json").mode(SaveMode.Overwrite).save(path)
{noformat}

The error log is below

{noformat}
15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task.
scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133)
at 
org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 
'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6'
 closed. Now beginning upload
15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 
'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6'
 upload complete
15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt 
attempt_201509160958__m_00_0 aborted.
15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Task failed while writing rows.
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103)
at 

[jira] [Updated] (SPARK-10632) Cannot save DataFrame with User Defined Types

2015-09-16 Thread Joao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joao updated SPARK-10632:
-
Description: 
Cannot save DataFrames that contain user-defined types.
I tried to save a DataFrame with instances of the Vector class from MLlib and 
got the same error.

The code below should reproduce the error.
{noformat}
val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)),
  (2, Vectors.dense(2, 2, 2)))).toDF()
df.write.format("json").mode(SaveMode.Overwrite).save(path)
{noformat}

The error log is below

{noformat}
15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task.
scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133)
at 
org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 
'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6'
 closed. Now beginning upload
15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 
'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6'
 upload complete
15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt 
attempt_201509160958__m_00_0 aborted.
15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Task failed while writing rows.
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103)
at 

[jira] [Updated] (SPARK-3231) select on a table in parquet format containing smallint as a field type does not work

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3231:
-
Assignee: Alex Rovner

> select on a table in parquet format containing smallint as a field type does 
> not work
> -
>
> Key: SPARK-3231
> URL: https://issues.apache.org/jira/browse/SPARK-3231
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
> Environment: The table is created through Hive-0.13.
> SparkSql 1.1 is used.
>Reporter: chirag aggarwal
>Assignee: Alex Rovner
> Fix For: 1.5.0
>
>
> A table is created through Hive. This table has a field of type smallint, and the 
> format of the table is Parquet.
> A select on this table works perfectly in the Hive shell.
> But when the select is run on this table from spark-sql, the query 
> fails.
> Steps to reproduce the issue:
> --
> hive> create table abct (a smallint, b int) row format delimited fields 
> terminated by '|' stored as textfile;
> A text file is stored in hdfs for this table.
> hive> create table abc (a smallint, b int) stored as parquet; 
> hive> insert overwrite table abc select * from abct;
> hive> select * from abc;
> 2 1
> 2 2
> 2 3
> spark-sql> select * from abc;
> 10:08:46 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to 
> stage failure: Task 0.0 in stage 33.0 (TID 2340) had a not serializable 
> result: org.apache.hadoop.io.IntWritable
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1158)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1147)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1146)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1146)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:685)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:685)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:685)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> But if the type of this column is now changed to int, then spark-sql gives 
> the correct results.
> hive> alter table abc change a a int;
> spark-sql> select * from abc;
> 2 1
> 2 2
> 2 3






[jira] [Updated] (SPARK-3617) Configurable case sensitivity

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3617:
-
Assignee: Michael Armbrust

> Configurable case sensitivity
> -
>
> Key: SPARK-3617
> URL: https://issues.apache.org/jira/browse/SPARK-3617
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> Right now SQLContext is case sensitive and HiveContext is not. It would be 
> better to make it configurable in both instances. All of the underlying 
> plumbing is there; we just need to expose an option.
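
A sketch of what exposing the switch might look like from user code (the conf key name is an assumption, not a confirmed API):

{code}
// Hypothetical: toggle analyzer case sensitivity through a SQL conf entry.
sqlContext.setConf("spark.sql.caseSensitive", "false")
sqlContext.sql("SELECT colA FROM t")  // would resolve against a column registered as `cola`
{code}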






[jira] [Updated] (SPARK-2824) Allow saving Parquet files to the HiveMetastore

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2824:
-
Assignee: Michael Armbrust

> Allow saving Parquet files to the HiveMetastore
> ---
>
> Key: SPARK-2824
> URL: https://issues.apache.org/jira/browse/SPARK-2824
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Aaron Davidson
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> Currently, we use one code path for reading all data from the Hive Metastore, 
> and this precludes writing or loading data as custom ParquetRelations (they 
> are always MetastoreRelations).






[jira] [Updated] (SPARK-2695) Figure out a good way to handle NullType columns.

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2695:
-
Assignee: Michael Armbrust

> Figure out a good way to handle NullType columns.
> -
>
> Key: SPARK-2695
> URL: https://issues.apache.org/jira/browse/SPARK-2695
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> For example, if a JSON dataset has a field which only has null values, we 
> will get NullType when we turn that dataset into a SchemaRDD. We need to think 
> about a good way to deal with NullType columns.






[jira] [Created] (SPARK-10632) Cannot save DataFrame with User Defined Types

2015-09-16 Thread Joao (JIRA)
Joao created SPARK-10632:


 Summary: Cannot save DataFrame with User Defined Types
 Key: SPARK-10632
 URL: https://issues.apache.org/jira/browse/SPARK-10632
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Joao


Cannot save DataFrames that contain user-defined types.
At first I thought it was a problem with my UDT class, then I tried the Vector 
class from MLlib and the error was the same.


The code below should reproduce the error.
{noformat}
val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)),
  (2, Vectors.dense(2, 2, 2)))).toDF()
df.write.format("json").mode(SaveMode.Overwrite).save(path)
{noformat}

The error log is below

{noformat}
15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task.
scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133)
at 
org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 
'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6'
 closed. Now beginning upload
15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 
'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6'
 upload complete
15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt 
attempt_201509160958__m_00_0 aborted.
15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Task failed while writing rows.
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179)
at 

[jira] [Commented] (SPARK-10629) Gradient boosted trees: mapPartitions input size increasing

2015-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747088#comment-14747088
 ] 

Sean Owen commented on SPARK-10629:
---

That sounds like the same issue in SPARK-10433; it's not clear that whatever it 
is only manifests as constantly increasing size. In any event, you should of 
course try 1.5+ to see if the fix fixes this anyway. I'd personally close this, 
given this information, until you are certain it happens now in the latest 
code, in which case it is something else.

> Gradient boosted trees: mapPartitions input size increasing 
> 
>
> Key: SPARK-10629
> URL: https://issues.apache.org/jira/browse/SPARK-10629
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1
>Reporter: Wenmin Wu
>
> First of all, I think my problem is quite different from 
> https://issues.apache.org/jira/browse/SPARK-10433, which points out that the input 
> size increases at each iteration.
> My problem is that the mapPartitions input size increases within one iteration. My 
> training samples have 2958359 features in total. Within one iteration, 3 
> collectAsMap operations were called. Here is a summary of each call.
> | Stage Id | Description | Duration | Input | Shuffle Read | Shuffle Write |
> |:--:|:---:|:---:|:---:|:---:|:---:|
> | 4 | mapPartitions at DecisionTree.scala:613 | 1.6 h | 710.2 MB | | 2.8 GB |
> | 5 | collectAsMap at DecisionTree.scala:642 | 1.8 min | | 2.8 GB | |
> | 6 | mapPartitions at DecisionTree.scala:613 | 1.2 h | 27.0 GB | | 5.6 GB |
> | 7 | collectAsMap at DecisionTree.scala:642 | 2.0 min | | 5.6 GB | |
> | 8 | mapPartitions at DecisionTree.scala:613 | 1.2 h | 26.5 GB | | 11.1 GB |
> | 9 | collectAsMap at DecisionTree.scala:642 | 2.0 min | | 8.3 GB | |
> The mapPartitions operations take far too long! It's so strange! I wonder 
> whether there is a bug here?






[jira] [Commented] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType

2015-09-16 Thread Antonio Jesus Navarro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747089#comment-14747089
 ] 

Antonio Jesus Navarro commented on SPARK-10485:
---

Now the bug is in the checking of the DataTypes:

https://github.com/apache/spark/blob/branch-1.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala#L33

else if (trueValue.dataType != falseValue.dataType) {
  TypeCheckResult.TypeCheckFailure(s"differing types in '$prettyString' " +
    s"(${trueValue.dataType.simpleString} and ${falseValue.dataType.simpleString}).")
}

If either the true or the false expression has NullType, the type 
validation now fails.

> IF expression is not correctly resolved when one of the options have NullType
> -
>
> Key: SPARK-10485
> URL: https://issues.apache.org/jira/browse/SPARK-10485
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Antonio Jesus Navarro
>
> If we have this query:
> {code}
> SELECT IF(column > 1, 1, NULL) FROM T1
> {code}
> On Spark 1.4.1 we have this:
> {code}
> override lazy val resolved = childrenResolved && trueValue.dataType == 
> falseValue.dataType
> {code}
> So if one of the types is NullType, the if expression is not resolved.






[jira] [Comment Edited] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType

2015-09-16 Thread Antonio Jesus Navarro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747089#comment-14747089
 ] 

Antonio Jesus Navarro edited comment on SPARK-10485 at 9/16/15 7:32 AM:


Now the bug is in the checking of the DataTypes:

https://github.com/apache/spark/blob/branch-1.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala#L33

{code}
else if (trueValue.dataType != falseValue.dataType) {
  TypeCheckResult.TypeCheckFailure(s"differing types in '$prettyString' " +
    s"(${trueValue.dataType.simpleString} and ${falseValue.dataType.simpleString}).")
}
{code}

If either the true or the false expression has NullType, the type 
validation now fails.


was (Author: ajnavarro):
Now, the bug is into the checking of the DataTypes:

https://github.com/apache/spark/blob/branch-1.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala#L33

else if (trueValue.dataType != falseValue.dataType) {
TypeCheckResult.TypeCheckFailure(s"differing types in '$prettyString' " +
  s"(${trueValue.dataType.simpleString} and 
${falseValue.dataType.simpleString}).")
}

If the true expression or the false expression is a NullType, now the type 
validation fails.

> IF expression is not correctly resolved when one of the options have NullType
> -
>
> Key: SPARK-10485
> URL: https://issues.apache.org/jira/browse/SPARK-10485
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Antonio Jesus Navarro
>
> If we have this query:
> {code}
> SELECT IF(column > 1, 1, NULL) FROM T1
> {code}
> On Spark 1.4.1 we have this:
> {code}
> override lazy val resolved = childrenResolved && trueValue.dataType == 
> falseValue.dataType
> {code}
> So if one of the types is NullType, the if expression is not resolved.






[jira] [Updated] (SPARK-3700) Improve the performance of scanning JSON datasets

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3700:
-
Assignee: Yanbo Liang

> Improve the performance of scanning JSON datasets
> -
>
> Key: SPARK-3700
> URL: https://issues.apache.org/jira/browse/SPARK-3700
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yanbo Liang
> Fix For: 1.5.0
>
>







[jira] [Updated] (SPARK-3833) Allow Spark SQL SchemaRDDs to be merged

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3833:
-
Assignee: Michael Armbrust

> Allow Spark SQL SchemaRDDs to be merged
> ---
>
> Key: SPARK-3833
> URL: https://issues.apache.org/jira/browse/SPARK-3833
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Reporter: Chris Wood
>Assignee: Michael Armbrust
> Fix For: 1.4.0
>
>
> We have JSON flowing into Spark SQL.
> I can successfully store them as parquet and read them with 
> sqlContext.jsonRDD, but the inferred schemas cannot be merged into a single 
> table to do queries.
> I'd like a way to allow for parquet file schemas to be merged, whether they 
> match or not, since we know the schema should be a union of the schemas from 
> the files.
> This will allow us to have the data define the schema and new columns will 
> just appear.






[jira] [Updated] (SPARK-3978) Schema change on Spark-Hive (Parquet file format) table not working

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3978:
-
Assignee: Alex Rovner

> Schema change on Spark-Hive (Parquet file format) table not working
> ---
>
> Key: SPARK-3978
> URL: https://issues.apache.org/jira/browse/SPARK-3978
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Nilesh Barge
>Assignee: Alex Rovner
> Fix For: 1.5.0
>
>
> On the following releases: 
> Spark 1.1.0 (built using sbt/sbt -Dhadoop.version=2.2.0 -Phive assembly), 
> Apache HDFS 2.2 
> A Spark job is able to create/add/read data in Hive Parquet-formatted tables 
> using HiveContext. 
> But after changing the schema, the Spark job is not able to read data and throws 
> the following exception: 
> java.lang.ArrayIndexOutOfBoundsException: 2 
> at 
> org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getStructFieldData(ArrayWritableObjectInspector.java:127)
>  
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:284)
>  
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:278)
>  
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) 
> at scala.collection.AbstractIterator.to(Iterator.scala:1157) 
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) 
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) 
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
> at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774) 
> at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774) 
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
>  
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
>  
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) 
> at org.apache.spark.scheduler.Task.run(Task.scala:54) 
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  
> at java.lang.Thread.run(Thread.java:744)
> code snippet in short: 
> hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS people_table (name 
> String, age INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' 
> OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'"); 
> hiveContext.sql("INSERT INTO TABLE people_table SELECT name, age FROM 
> temp_table_people1"); 
> hiveContext.sql("SELECT * FROM people_table"); //Here, data read was 
> successful.  
> hiveContext.sql("ALTER TABLE people_table ADD COLUMNS (gender STRING)"); 
> hiveContext.sql("SELECT * FROM people_table"); //Not able to read existing 
> data and ArrayIndexOutOfBoundsException is thrown.






[jira] [Updated] (SPARK-5738) Reuse mutable row for each record at jsonStringToRow

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5738:
-
Assignee: Yanbo Liang

> Reuse mutable row for each record at jsonStringToRow
> 
>
> Key: SPARK-5738
> URL: https://issues.apache.org/jira/browse/SPARK-5738
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Other table-scan-like operations such as ParquetTableScan and HiveTableScan use 
> a reusable mutable row for serialization to reduce garbage. We should apply the 
> same optimization to JSONRelation#buildScan().
> When serializing a JSON string to a row, reuse a mutable row for each record 
> and for inner nested structures instead of creating a new one each time.
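
A generic sketch of the reuse pattern described here (illustrative only; these are not Spark's internal row classes, and the input is assumed to be simple "id,payload" lines):

{code}
// One mutable record is allocated up front and overwritten per input line;
// downstream consumers must copy a record if they need to retain it.
final class MutableRecord(var id: Int, var payload: String)

def parseAll(lines: Iterator[String]): Iterator[MutableRecord] = {
  val reused = new MutableRecord(0, null)  // single allocation for the whole scan
  lines.map { line =>
    val parts = line.split(",", 2)
    reused.id = parts(0).toInt
    reused.payload = parts(1)
    reused
  }
}
{code}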






[jira] [Updated] (SPARK-3804) Output of Generator expressions is not stable after serialization.

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3804:
-
Assignee: Michael Armbrust

> Output of Generator expressions is not stable after serialization.
> --
>
> Key: SPARK-3804
> URL: https://issues.apache.org/jira/browse/SPARK-3804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> The output of a generator expression (such as explode) can change after 
> serialization. This caused a nasty data corruption bug. While that bug has 
> since been addressed, it would be good to fix this since it violates the 
> general contract of query plans.






[jira] [Updated] (SPARK-5823) Reuse mutable rows for inner structures when parsing JSON objects

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5823:
-
Assignee: Yanbo Liang

> Reuse mutable rows for inner structures when parsing JSON objects
> -
>
> Key: SPARK-5823
> URL: https://issues.apache.org/jira/browse/SPARK-5823
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yanbo Liang
>
> With SPARK-5738, we will reuse a mutable row for rows when parsing JSON 
> objects. We can do the same thing for inner structures.






[jira] [Resolved] (SPARK-10511) Source releases should not include maven jars

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10511.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Resolved by https://github.com/apache/spark/pull/8774

> Source releases should not include maven jars
> -
>
> Key: SPARK-10511
> URL: https://issues.apache.org/jira/browse/SPARK-10511
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Patrick Wendell
>Assignee: Luciano Resende
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> I noticed our source jars seemed really big for 1.5.0. At least one 
> contributing factor is that, likely due to some change in the release script, 
> the maven jars are being bundled in with the source code in our build 
> directory. This runs afoul of the ASF policy on binaries in source releases - 
> we should fix it in 1.5.1.
> The issue (I think) is that we might invoke maven to compute the version 
> between when we checkout Spark from github and when we package the source 
> file. I think it could be fixed by simply clearing out the build/ directory 
> after that statement runs.






[jira] [Updated] (SPARK-4559) Adding support for ucase and lcase

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4559:
-
Assignee: Michael Armbrust

> Adding support for ucase and lcase
> --
>
> Key: SPARK-4559
> URL: https://issues.apache.org/jira/browse/SPARK-4559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Fei Wang
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> Adding support for ucase and lcase in Spark SQL






[jira] [Updated] (SPARK-4273) Providing ExternalSet to avoid OOM when count(distinct)

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4273:
-
Assignee: Michael Armbrust

> Providing ExternalSet to avoid OOM when count(distinct)
> ---
>
> Key: SPARK-4273
> URL: https://issues.apache.org/jira/browse/SPARK-4273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Reporter: YanTang Zhai
>Assignee: Michael Armbrust
>Priority: Minor
> Fix For: 1.5.0
>
>
> Some tasks may OOM on count(distinct) if they need to process many records. 
> CombineSetsAndCountFunction puts all records into an OpenHashSet; if it 
> fetches many records, it may occupy a large amount of memory.
> I think a data structure ExternalSet, analogous to ExternalAppendOnlyMap, could be 
> provided to spill OpenHashSet data to disk when its capacity exceeds some 
> threshold.
> For example, OpenHashSet1 (ohs1) holds [d, b, c, a]. It is spilled to file1 sorted 
> by hashCode, so file1 contains [a, b, c, d]. The procedure could be 
> illustrated as follows:
> ohs1 [d, b, c, a] => [a, b, c, d] => file1
> ohs2 [e, f, g, a] => [a, e, f, g] => file2
> ohs3 [e, h, i, g] => [e, g, h, i] => file3
> ohs4 [j, h, a] => [a, h, j] => sortedSet
> When output, all keys with the same hashCode will be put into a OpenHashSet, 
> then the iterator of this OpenHashSet is accessing. The procedure could be 
> indicated as follows:
> file1-> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a;
> file1 -> b -> ohsB; ohsB -> b;
> file1 -> c -> ohsC; ohsC -> c;
> file1 -> d -> ohsD; ohsD -> d;
> file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE-> e;
> ...
> I think using the ExternalSet could avoid OOM when count(distinct). Comments 
> are welcome.
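
A minimal standalone sketch of the spill idea described above, assuming plain JDK serialization and an illustrative threshold; it is not Spark's ExternalAppendOnlyMap and, for brevity, it re-collects spilled data at iteration time instead of stream-merging the sorted files hash bucket by hash bucket:
{code}
import java.io._
import scala.collection.mutable

class ExternalSet[T <: Serializable](spillThreshold: Int = 100000) {
  private val inMemory   = mutable.HashSet.empty[T]
  private val spillFiles = mutable.ArrayBuffer.empty[File]

  def insert(elem: T): Unit = {
    inMemory += elem
    if (inMemory.size >= spillThreshold) spill()
  }

  // Write the current set to a temp file, sorted by hashCode so that a real
  // implementation could later merge the spill files bucket by bucket.
  private def spill(): Unit = {
    val file = File.createTempFile("external-set-", ".spill")
    file.deleteOnExit()
    val out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)))
    try {
      out.writeInt(inMemory.size)
      inMemory.toSeq.sortBy(_.hashCode).foreach(out.writeObject)
    } finally out.close()
    spillFiles += file
    inMemory.clear()
  }

  // Distinct view over everything inserted: spilled batches plus the live set.
  def iterator: Iterator[T] = {
    val spilled = spillFiles.iterator.flatMap { f =>
      val in = new ObjectInputStream(new BufferedInputStream(new FileInputStream(f)))
      try (0 until in.readInt()).map(_ => in.readObject().asInstanceOf[T]) finally in.close()
    }
    (spilled ++ inMemory.iterator).toSet.iterator
  }
}
{code}
For the count(distinct) case, the caller would insert every value and read iterator.size at the end.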



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5109) Loading multiple parquet files into a single SchemaRDD

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5109:
-
Assignee: Michael Armbrust

> Loading multiple parquet files into a single SchemaRDD
> --
>
> Key: SPARK-5109
> URL: https://issues.apache.org/jira/browse/SPARK-5109
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Sam Steingold
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> {{[SQLContext.parquetFile(String)|http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/sql/SQLContext.html#parquetFile%28java.lang.String%29]}}
>  accepts a comma-separated list of files to load.
> This feature prevents loading files with commas in their names (a rare use case, 
> admittedly); it is also an _extremely_ unusual feature.
> This feature should be deprecated and new methods
> {code}
> SQLContext.parquetFile(String[])
> SQLContext.parquetFile(List)
> {code} 
> should be added instead.
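
For reference, a hedged sketch of what multi-path loading can look like with the 1.4+ DataFrameReader, assuming an existing SQLContext named sqlContext; the paths are illustrative:
{code}
// Load each file separately and union, instead of relying on comma-splitting.
val paths = Seq("/data/events-2015-09-01.parquet", "/data/events-2015-09-02.parquet")
val combined = paths.map(p => sqlContext.read.parquet(p)).reduce(_ unionAll _)

// Since 1.4 the reader also takes varargs, which covers the array/List proposal above.
val combined2 = sqlContext.read.parquet(paths: _*)
{code}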



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10515) When killing executor, the pending replacement executors will be lost

2015-09-16 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-10515:
--
Summary: When killing executor, the pending replacement executors will be 
lost  (was: When kill executor, there is no need to seed RequestExecutors to AM)

> When killing executor, the pending replacement executors will be lost
> -
>
> Key: SPARK-10515
> URL: https://issues.apache.org/jira/browse/SPARK-10515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9032) scala.MatchError in DataFrameReader.json(String path)

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9032:
-
Assignee: Josh Rosen

> scala.MatchError in DataFrameReader.json(String path)
> -
>
> Key: SPARK-9032
> URL: https://issues.apache.org/jira/browse/SPARK-9032
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.4.0
> Environment: Ubuntu 15.04
>Reporter: Philipp Poetter
>Assignee: Josh Rosen
> Fix For: 1.4.1
>
>
> Executing read().json() on a SQLContext (i.e. via DataFrameReader) raises a 
> MatchError with the following stack trace while trying to read JSON data:
> {code}
> 15/07/14 11:25:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
> have all completed, from pool 
> 15/07/14 11:25:26 INFO DAGScheduler: Job 0 finished: json at Example.java:23, 
> took 6.981330 s
> Exception in thread "main" scala.MatchError: StringType (of class 
> org.apache.spark.sql.types.StringType$)
>   at org.apache.spark.sql.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:139)
>   at 
> org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.json.JSONRelation.schema$lzycompute(JSONRelation.scala:137)
>   at org.apache.spark.sql.json.JSONRelation.schema(JSONRelation.scala:137)
>   at 
> org.apache.spark.sql.sources.LogicalRelation.(LogicalRelation.scala:30)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:213)
>   at com.hp.sparkdemo.Example.main(Example.java:23)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 15/07/14 11:25:26 INFO SparkContext: Invoking stop() from shutdown hook
> 15/07/14 11:25:26 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040
> 15/07/14 11:25:26 INFO DAGScheduler: Stopping DAGScheduler
> 15/07/14 11:25:26 INFO SparkDeploySchedulerBackend: Shutting down all 
> executors
> 15/07/14 11:25:26 INFO SparkDeploySchedulerBackend: Asking each executor to 
> shut down
> 15/07/14 11:25:26 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> {code}
> Offending code snippet (around line 23):
> {code}
>JavaSparkContext sctx = new JavaSparkContext(sparkConf);
> SQLContext ctx = new SQLContext(sctx);
> DataFrame frame = ctx.read().json(facebookJSON);
> frame.printSchema();
> {code}
> The exception is reproducible using the following JSON:
> {code}
> {
>"data": [
>   {
>  "id": "X999_Y999",
>  "from": {
> "name": "Tom Brady", "id": "X12"
>  },
>  "message": "Looking forward to 2010!",
>  "actions": [
> {
>"name": "Comment",
>"link": "http://www.facebook.com/X999/posts/Y999;
> },
> {
>"name": "Like",
>"link": "http://www.facebook.com/X999/posts/Y999;
> }
>  ],
>  "type": "status",
>  "created_time": "2010-08-02T21:27:44+",
>  "updated_time": "2010-08-02T21:27:44+"
>   },
>   {
>  "id": "X998_Y998",
>  "from": {
> "name": "Peyton Manning", "id": "X18"
>  },
>  "message": "Where's my contract?",
>  "actions": [
> {
>"name": "Comment",
>"link": "http://www.facebook.com/X998/posts/Y998;
> },
> {
>"name": "Like",
>"link": "http://www.facebook.com/X998/posts/Y998;
> }
>  ],
>  "type": "status",
>  "created_time": "2010-08-02T21:27:44+",
>  "updated_time": "2010-08-02T21:27:44+"
>   }
>]
> }
> {code}
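
Worth noting for reproduction: the 1.x JSON data source expects one complete JSON object per line, so a pretty-printed file like the one above leaves every line unparseable, which is likely what pushes schema inference into the unhandled StringType branch. A hedged Scala sketch of compacting the file before reading it, assuming an existing SparkContext sc and SQLContext sqlContext, with facebookJSON being the path variable from the snippet above:
{code}
// Collapse the pretty-printed file into one line per file so each record is a
// self-contained JSON object, then hand the RDD[String] to the JSON reader.
val oneLinePerRecord = sc.wholeTextFiles(facebookJSON)
  .map { case (_, content) => content.replaceAll("\\s*\n\\s*", " ") }
val frame = sqlContext.read.json(oneLinePerRecord)
frame.printSchema()
{code}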



--
This message was sent by Atlassian JIRA

[jira] [Updated] (SPARK-9343) DROP TABLE ignores IF EXISTS clause

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9343:
-
Assignee: Michael Armbrust

> DROP TABLE ignores IF EXISTS clause
> ---
>
> Key: SPARK-9343
> URL: https://issues.apache.org/jira/browse/SPARK-9343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>Assignee: Michael Armbrust
>  Labels: sql
> Fix For: 1.5.0
>
>
> If a table is missing, {{DROP TABLE IF EXISTS _tableName_}} generates an 
> exception:
> {code}
> 15/07/25 15:17:32 ERROR Hive: NoSuchObjectException(message:default.test 
> table not found)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1560)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
>   at com.sun.proxy.$Proxy27.get_table(Unknown Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
>   at com.sun.proxy.$Proxy28.getTable(Unknown Source)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:918)
>   at 
> org.apache.hadoop.hive.ql.exec.DDLTask.dropTableOrPartitions(DDLTask.java:3846)
>  
> {code}
> A standalone test with full spark-shell output is available at: 
> [https://gist.github.com/ssimeonov/eeb388d13f802689d772]
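
For versions where this still happens (the report is against 1.4.1), a hedged workaround sketch is to consult the catalog before dropping; this assumes a HiveContext (or SQLContext) named sqlContext:
{code}
// Only issue the DROP when the table is actually known to the catalog,
// avoiding the NoSuchObjectException noise shown above.
if (sqlContext.tableNames().contains("test")) {
  sqlContext.sql("DROP TABLE test")
}
{code}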



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9033) scala.MatchError: interface java.util.Map (of class java.lang.Class) with Spark SQL

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9033:
-
Assignee: Josh Rosen

> scala.MatchError: interface java.util.Map (of class java.lang.Class) with 
> Spark SQL
> ---
>
> Key: SPARK-9033
> URL: https://issues.apache.org/jira/browse/SPARK-9033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.2, 1.3.1
>Reporter: Pavel
>Assignee: Josh Rosen
> Fix For: 1.4.0
>
>
> I have a java.util.Map field in a POJO class and I'm trying to 
> use it with createDataFrame (1.3.1) / applySchema (1.2.2) on the SQLContext, 
> and I get the following error in both the 1.2.2 and 1.3.1 versions of Spark SQL:
> *sample code:
> {code}
> SQLContext sqlCtx = new SQLContext(sc.sc());
> JavaRDD rdd = sc.textFile("/path").map(line-> Event.fromString(line)); 
> //text line is splitted and assigned to respective field of the event class 
> here
> DataFrame schemaRDD  = sqlCtx.createDataFrame(rdd, Event.class); <-- error 
> thrown here
> schemaRDD.registerTempTable("events");
> {code}
> The Event class is Serializable and contains a field of type 
> java.util.Map. This issue also occurs with Spark Streaming 
> when used with SQL.
> {code}
> JavaDStream receiverStream = jssc.receiverStream(new 
> StreamingReceiver());
> JavaDStream windowDStream = receiverStream.window(WINDOW_LENGTH, 
> SLIDE_INTERVAL);
> jssc.checkpoint("event-streaming");
> windowDStream.foreachRDD(evRDD -> {
>if(evRDD.count() == 0) return null;
> DataFrame schemaRDD = sqlCtx.createDataFrame(evRDD, Event.class);
> schemaRDD.registerTempTable("events");
>   ...
> }
> {code}
> *error:
> {code}
> scala.MatchError: interface java.util.Map (of class java.lang.Class)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193)
>  ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192)
>  ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  ~[scala-library-2.10.5.jar:na]
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  ~[scala-library-2.10.5.jar:na]
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  ~[scala-library-2.10.5.jar:na]
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
> ~[scala-library-2.10.5.jar:na]
>   at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244) 
> ~[scala-library-2.10.5.jar:na]
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) 
> ~[scala-library-2.10.5.jar:na]
>   at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192) 
> ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437) 
> ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465) 
> ~[spark-sql_2.10-1.3.1.jar:1.3.1]
> {code}
> **also this occurs for fields of custom POJO classes:
> {code}
> scala.MatchError: class com.test.MyClass (of class java.lang.Class)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193)
>  ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192)
>  ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  ~[scala-library-2.10.5.jar:na]
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  ~[scala-library-2.10.5.jar:na]
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  ~[scala-library-2.10.5.jar:na]
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
> ~[scala-library-2.10.5.jar:na]
>   at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244) 
> ~[scala-library-2.10.5.jar:na]
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) 
> ~[scala-library-2.10.5.jar:na]
>   at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192) 
> ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437) 
> ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465) 
> ~[spark-sql_2.10-1.3.1.jar:1.3.1]
> {code}
> **also occurs for Calendar  type:
> {code}
> scala.MatchError: class java.util.Calendar (of class java.lang.Class)
>   at 
> 
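
For the affected versions, a hedged workaround sketch in Scala (the reports above are Java): bypass the bean-reflection schema inference, where the MatchError is thrown, by declaring the schema explicitly, including the Map field, and building Rows by hand. The field names and getId/getAttributes accessors are illustrative, not the reporter's actual Event API:
{code}
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Describe the layout explicitly; MapType is supported here even though bean
// reflection does not handle java.util.Map fields.
val eventSchema = StructType(Seq(
  StructField("id", StringType, nullable = false),
  StructField("attributes", MapType(StringType, StringType), nullable = true)
))

// rdd: RDD[Event] as in the sample code above; convert each Event to a Row.
val rowRDD = rdd.map(e => Row(e.getId, e.getAttributes.asScala.toMap))
val schemaRDD = sqlCtx.createDataFrame(rowRDD, eventSchema)
schemaRDD.registerTempTable("events")
{code}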

[jira] [Updated] (SPARK-10437) Support aggregation expressions in Order By

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10437:
--
Assignee: Liang-Chi Hsieh

> Support aggregation expressions in Order By
> ---
>
> Key: SPARK-10437
> URL: https://issues.apache.org/jira/browse/SPARK-10437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Harish Butani
>Assignee: Liang-Chi Hsieh
> Fix For: 1.6.0
>
>
> Followup on SPARK-6583
> The following still fails. 
> {code}
> val df = sqlContext.read.json("examples/src/main/resources/people.json")
> df.registerTempTable("t")
> val df2 = sqlContext.sql("select age, count(*) from t group by age order by 
> count(*)")
> df2.show()
> {code}
> {code:title=StackTrace}
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No 
> function to evaluate expression. type: Count, tree: COUNT(1)
>   at 
> org.apache.spark.sql.catalyst.expressions.AggregateExpression.eval(aggregates.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.RowOrdering.compare(rows.scala:219)
> {code}
> In 1.4 the issue seemed to be that BindReferences.bindReference didn't handle this 
> case.
> I haven't looked at the 1.5 code, but I don't see a change to bindReference in this 
> patch.
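
On affected versions, a hedged workaround sketch is to alias the aggregate in the SELECT list and sort on the alias, so the ordering expression is a plain column reference:
{code}
val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.registerTempTable("t")
val df2 = sqlContext.sql(
  "select age, count(*) as cnt from t group by age order by cnt")
df2.show()
{code}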



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10475) improve column prunning for Project on Sort

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10475:
--
Assignee: Wenchen Fan

> improve column prunning for Project on Sort
> ---
>
> Key: SPARK-10475
> URL: https://issues.apache.org/jira/browse/SPARK-10475
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10633) Persisting Spark stream to MySQL - Spark tries to create the table for every stream even if it exist already.

2015-09-16 Thread Lunen (JIRA)
Lunen created SPARK-10633:
-

 Summary: Persisting Spark stream to MySQL - Spark tries to create 
the table for every stream even if it exist already.
 Key: SPARK-10633
 URL: https://issues.apache.org/jira/browse/SPARK-10633
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Affects Versions: 1.5.0, 1.4.0
 Environment: Ubuntu 14.04
IntelliJ IDEA 14.1.4
sbt
mysql-connector-java 5.1.35 (Tested and working with Spark 1.3.1)
Reporter: Lunen
Priority: Blocker


Persisting Spark Kafka stream to MySQL 
Spark 1.4+ tries to create the table automatically every time the stream gets 
written to the specified table.
Please note that Spark 1.3.1 works.
Code sample:
val url = "jdbc:mysql://host:port/db?user=user&password=password"
val crp = RowSetProvider.newFactory()
val crsSql: CachedRowSet = crp.createCachedRowSet()
val crsTrg: CachedRowSet = crp.createCachedRowSet()
crsSql.beforeFirst()
crsTrg.beforeFirst()

//Read Stream from Kafka
//Produce SQL INSERT STRING

streamT.foreachRDD { rdd =>
  if (rdd.toLocalIterator.nonEmpty) {
sqlContext.read.json(rdd).registerTempTable(serverEvents + "_events")
while (crsSql.next) {
  sqlContext.sql("SQL INSERT STRING").write.jdbc(url, "SCHEMA_NAME", 
new Properties)
  println("Persisted Data: " + 'SQL INSERT STRING')
}
crsSql.beforeFirst()
  }
  stmt.close()
  conn.close()
}
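
If the goal is to reuse a pre-created MySQL table, a hedged sketch of stating that explicitly through the 1.4+ DataFrameWriter save mode; the table name is illustrative and "SQL INSERT STRING" is the placeholder from the sample above:
{code}
import java.util.Properties
import org.apache.spark.sql.SaveMode

// Append into the existing table rather than letting the writer try to create it.
sqlContext.sql("SQL INSERT STRING")
  .write
  .mode(SaveMode.Append)
  .jdbc(url, "target_table", new Properties)
{code}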



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10633) Persisting Spark stream to MySQL - Spark tries to create the table for every stream even if it exist already.

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10633:
--
Priority: Major  (was: Blocker)

[~lunendl] Please have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -- for 
example you should not set Blocker, as this can't really be considered one at 
this point.

> Persisting Spark stream to MySQL - Spark tries to create the table for every 
> stream even if it exist already.
> -
>
> Key: SPARK-10633
> URL: https://issues.apache.org/jira/browse/SPARK-10633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 1.4.0, 1.5.0
> Environment: Ubuntu 14.04
> IntelliJ IDEA 14.1.4
> sbt
> mysql-connector-java 5.1.35 (Tested and working with Spark 1.3.1)
>Reporter: Lunen
>
> Persisting Spark Kafka stream to MySQL 
> Spark 1.4+ tries to create the table automatically every time the stream gets 
> written to the specified table.
> Please note that Spark 1.3.1 works.
> Code sample:
> val url = "jdbc:mysql://host:port/db?user=user&password=password"
> val crp = RowSetProvider.newFactory()
> val crsSql: CachedRowSet = crp.createCachedRowSet()
> val crsTrg: CachedRowSet = crp.createCachedRowSet()
> crsSql.beforeFirst()
> crsTrg.beforeFirst()
> //Read Stream from Kafka
> //Produce SQL INSERT STRING
> 
> streamT.foreachRDD { rdd =>
>   if (rdd.toLocalIterator.nonEmpty) {
> sqlContext.read.json(rdd).registerTempTable(serverEvents + "_events")
> while (crsSql.next) {
>   sqlContext.sql("SQL INSERT STRING").write.jdbc(url, "SCHEMA_NAME", 
> new Properties)
>   println("Persisted Data: " + 'SQL INSERT STRING')
> }
> crsSql.beforeFirst()
>   }
>   stmt.close()
>   conn.close()
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10267) Add @Since annotation to ml.util

2015-09-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10267.
---
Resolution: Not A Problem
  Assignee: Ehsan Mohyedin Kermani

> Add @Since annotation to ml.util
> 
>
> Key: SPARK-10267
> URL: https://issues.apache.org/jira/browse/SPARK-10267
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Ehsan Mohyedin Kermani
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10631) Add missing API doc in pyspark.mllib.linalg.Vector

2015-09-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10631:
--
Assignee: Vinod KC

> Add missing API doc in pyspark.mllib.linalg.Vector
> --
>
> Key: SPARK-10631
> URL: https://issues.apache.org/jira/browse/SPARK-10631
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Vinod KC
>Priority: Minor
>
> There are some missing API docs in pyspark.mllib.linalg.Vector (including 
> DenseVector and SparseVector). We should add them based on their Scala 
> counterparts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10276) Add @since annotation to pyspark.mllib.recommendation

2015-09-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10276.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8677
[https://github.com/apache/spark/pull/8677]

> Add @since annotation to pyspark.mllib.recommendation
> -
>
> Key: SPARK-10276
> URL: https://issues.apache.org/jira/browse/SPARK-10276
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6810) Performance benchmarks for SparkR

2015-09-16 Thread Yashwanth Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768853#comment-14768853
 ] 

Yashwanth Kumar commented on SPARK-6810:


Hi Shivaram Venkataraman, I would like to try this.

> Performance benchmarks for SparkR
> -
>
> Key: SPARK-6810
> URL: https://issues.apache.org/jira/browse/SPARK-6810
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Priority: Critical
>
> We should port some performance benchmarks from spark-perf to SparkR for 
> tracking performance regressions / improvements.
> https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list 
> of PySpark performance benchmarks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10262) Add @Since annotation to ml.attribute

2015-09-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768840#comment-14768840
 ] 

Xiangrui Meng commented on SPARK-10262:
---

[~tijoparacka] Are you still working on this?

> Add @Since annotation to ml.attribute
> -
>
> Key: SPARK-10262
> URL: https://issues.apache.org/jira/browse/SPARK-10262
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9492) LogisticRegression in R should provide model statistics

2015-09-16 Thread Yashwanth Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768850#comment-14768850
 ] 

Yashwanth Kumar commented on SPARK-9492:


Can I have this task?

> LogisticRegression in R should provide model statistics
> ---
>
> Key: SPARK-9492
> URL: https://issues.apache.org/jira/browse/SPARK-9492
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, R
>Reporter: Eric Liang
>
> Like ml LinearRegression, LogisticRegression should provide a training 
> summary including feature names and their coefficients.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.

2015-09-16 Thread Prachi Burathoki (JIRA)
Prachi Burathoki created SPARK-10634:


 Summary: The spark sql fails if the where clause contains a string 
with " in it.
 Key: SPARK-10634
 URL: https://issues.apache.org/jira/browse/SPARK-10634
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Prachi Burathoki


When running a sql query in which the where clause contains a string with " in 
it, the sql parser throws an error.

Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but 
identifier test found

SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER FROM 
TABLE_1 WHERE ((clistc215647292 = "this is a "test""))

  ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933)
at 
com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106)
at 
com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93)
at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153)
at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291)
at 
com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840)
at 
com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752)
at 
com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011)
... 31 more




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.

2015-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768938#comment-14768938
 ] 

Sean Owen commented on SPARK-10634:
---

Shouldn't the " be escaped in some way? or else I'm not sure how the parser 
would know the end of the literal from a quote inside the literal.

> The spark sql fails if the where clause contains a string with " in it.
> ---
>
> Key: SPARK-10634
> URL: https://issues.apache.org/jira/browse/SPARK-10634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Prachi Burathoki
>
> When running a sql query in which the where clause contains a string with " 
> in it, the sql parser throws an error.
> Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but 
> identifier test found
> SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER 
> FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test""))
>   
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>   at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933)
>   at 
> com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106)
>   at 
> com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93)
>   at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153)
>   at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752)
>   at 
> com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011)
>   ... 31 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10614) SystemClock uses non-monotonic time in its wait logic

2015-09-16 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768900#comment-14768900
 ] 

Steve Loughran commented on SPARK-10614:


Having done a little more detailed research on the current state of this clock, 
I'm now having doubts about this.

On x86, it's generally assumed that {{System.nanoTime()}} uses the {{TSC}} 
counter to get the timestamp, which is fast and only goes forwards (albeit at a 
rate which depends on CPU power states). But it turns out that on many-core 
CPUs, because the TSC could give different answers on different cores, the OS 
may fall back to alternative mechanisms for the counter, which may be neither 
monotonic nor fast.

# [Inside the Hotspot VM: Clocks, Timers and Scheduling Events - Part I - 
Windows|https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks]
# [JDK-6440250 : On Windows System.nanoTime() may be 25x slower than 
System.currentTimeMillis()|http://bugs.java.com/view_bug.do?bug_id=6440250]
# [JDK-6458294 : nanoTime affected by system clock change on Linux (RH9) or in 
general lacks monotonicity|http://bugs.java.com/view_bug.do?bug_id=6458294]
# [Redhat on timestamps in 
Linux|https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html]
# [Timekeeping in VMware Virtual 
Machines|http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf]

These docs imply that nanoTime may be fast but unreliable on multiple-socket 
systems (the latest many-core parts share one counter), and may downgrade to 
something slower than calls to currentTimeMillis(), or even to something that 
isn't guaranteed to be monotonic. 

It's not clear that, on deployments on physical many-core systems, moving to 
nanoTime actually offers much. I don't know about EC2 or other cloud 
infrastructures, though.

Maybe it's just best to WONTFIX this, so we don't raise unrealistic expectations 
about nanoTime working.
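
For context, a minimal sketch of the alternative being weighed, with all of the caveats above about what nanoTime actually guarantees on a given platform; this is not Spark's SystemClock:
{code}
// Measure elapsed time from a fixed origin with System.nanoTime(), rather than
// comparing wall-clock readings that can jump backwards.
class MonotonicWaiter {
  private val originNanos = System.nanoTime()

  def elapsedMillis(): Long = (System.nanoTime() - originNanos) / 1000000L

  // Block until at least targetMillis have elapsed since construction.
  def waitTillElapsed(targetMillis: Long): Long = {
    var now = elapsedMillis()
    while (now < targetMillis) {
      Thread.sleep(math.max(1L, targetMillis - now))
      now = elapsedMillis()
    }
    now
  }
}
{code}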

> SystemClock uses non-monotonic time in its wait logic
> -
>
> Key: SPARK-10614
> URL: https://issues.apache.org/jira/browse/SPARK-10614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Steve Loughran
>Priority: Minor
>
> The consolidated (SPARK-4682) clock uses {{System.currentTimeMillis()}} for 
> measuring time, which means its {{waitTillTime()}} routine is brittle against 
> systems (VMs in particular) whose time can go backwards as well as forward.
> For the {{ExecutorAllocationManager}} this appears to be a regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10635) pyspark - running on a different host

2015-09-16 Thread Ben Duffield (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Duffield updated SPARK-10635:
-
Description: 
At various points we assume we only ever talk to a driver on the same host.
e.g. 
https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch 
the driver itself, but instead construct the SparkContext with the gateway and 
jsc arguments.

There are a few reasons for this, but essentially it's to allow more 
flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:  
{code}
def _load_from_socket(port, serializer):
sock = socket.socket()
sock.settimeout(3)
try:
sock.connect((host, port))
rf = sock.makefile("rb", 65536)
for item in serializer.load_stream(rf):
yield item
finally:
sock.close()
pyspark.rdd._load_from_socket = _load_from_socket
{/code}

  was:
At various points we assume we only ever talk to a driver on the same host.
e.g. 
https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch 
the driver itself, but instead construct the SparkContext with the gateway and 
jsc arguments.

There are a few reasons for this, but essentially it's to allow more 
flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:  

def _load_from_socket(port, serializer):
sock = socket.socket()
sock.settimeout(3)
try:
sock.connect((host, port))
rf = sock.makefile("rb", 65536)
for item in serializer.load_stream(rf):
yield item
finally:
sock.close()
pyspark.rdd._load_from_socket = _load_from_socket



> pyspark - running on a different host
> -
>
> Key: SPARK-10635
> URL: https://issues.apache.org/jira/browse/SPARK-10635
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ben Duffield
>
> At various points we assume we only ever talk to a driver on the same host.
> e.g. 
> https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615
> We use pyspark to connect to an existing driver (i.e. do not let pyspark 
> launch the driver itself, but instead construct the SparkContext with the 
> gateway and jsc arguments.
> There are a few reasons for this, but essentially it's to allow more 
> flexibility when running in AWS.
> Before 1.3.1 we were able to monkeypatch around this:  
> {code}
> def _load_from_socket(port, serializer):
> sock = socket.socket()
> sock.settimeout(3)
> try:
> sock.connect((host, port))
> rf = sock.makefile("rb", 65536)
> for item in serializer.load_stream(rf):
> yield item
> finally:
> sock.close()
> pyspark.rdd._load_from_socket = _load_from_socket
> {/code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.

2015-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14769057#comment-14769057
 ] 

Sean Owen commented on SPARK-10634:
---

Don't you need to use "" to quote double quotes? I actually am not sure what 
Spark supports but that's not uncommon in SQL. I don't think you're escaping 
the quotes here.
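
One hedged way to sidestep the ambiguity is a single-quoted string literal, which Spark SQL (like HiveQL) accepts and which lets the embedded double quotes pass through unescaped; column and table names are taken from the report above:
{code}
val df = sqlContext.sql(
  """SELECT clistc215647292 FROM TABLE_1 WHERE clistc215647292 = 'this is a "test"'""")
df.show()
{code}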

> The spark sql fails if the where clause contains a string with " in it.
> ---
>
> Key: SPARK-10634
> URL: https://issues.apache.org/jira/browse/SPARK-10634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Prachi Burathoki
>
> When running a sql query in which the where clause contains a string with " 
> in it, the sql parser throws an error.
> Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but 
> identifier test found
> SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER 
> FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test""))
>   
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>   at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933)
>   at 
> com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106)
>   at 
> com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93)
>   at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153)
>   at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752)
>   at 
> com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011)
>   ... 31 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.

2015-09-16 Thread Prachi Burathoki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14769011#comment-14769011
 ] 

Prachi Burathoki commented on SPARK-10634:
--

I tried escaping with \, but I still get the same error:

Caused by: java.lang.RuntimeException: [1.130] failure: ``)'' expected but 
identifier test found

SELECT clistc1426336010, corc2125646118, candc2031403851, SYSIBM_ROW_NUMBER 
FROM TABLE_1 WHERE ((clistc1426336010 = "this is a \"test\""))

 ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933)
at 
com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106)
at 
com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93)
at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153)
at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291)
at 
com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840)
at 
com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752)
at 
com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRef

> The spark sql fails if the where clause contains a string with " in it.
> ---
>
> Key: SPARK-10634
> URL: https://issues.apache.org/jira/browse/SPARK-10634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Prachi Burathoki
>
> When running a sql query in which the where clause contains a string with " 
> in it, the sql parser throws an error.
> Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but 
> identifier test found
> SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER 
> FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test""))
>   
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
>   at 
> 

[jira] [Updated] (SPARK-10635) pyspark - running on a different host

2015-09-16 Thread Ben Duffield (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Duffield updated SPARK-10635:
-
Description: 
At various points we assume we only ever talk to a driver on the same host.
e.g. 
https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch 
the driver itself, but instead construct the SparkContext with the gateway and 
jsc arguments.

There are a few reasons for this, but essentially it's to allow more 
flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:  
{code}
def _load_from_socket(port, serializer):
sock = socket.socket()
sock.settimeout(3)
try:
sock.connect((host, port))
rf = sock.makefile("rb", 65536)
for item in serializer.load_stream(rf):
yield item
finally:
sock.close()
pyspark.rdd._load_from_socket = _load_from_socket
{code}

  was:
At various points we assume we only ever talk to a driver on the same host.
e.g. 
https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch 
the driver itself, but instead construct the SparkContext with the gateway and 
jsc arguments.

There are a few reasons for this, but essentially it's to allow more 
flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:  
{code}
def _load_from_socket(port, serializer):
sock = socket.socket()
sock.settimeout(3)
try:
sock.connect((host, port))
rf = sock.makefile("rb", 65536)
for item in serializer.load_stream(rf):
yield item
finally:
sock.close()
pyspark.rdd._load_from_socket = _load_from_socket
{/code}


> pyspark - running on a different host
> -
>
> Key: SPARK-10635
> URL: https://issues.apache.org/jira/browse/SPARK-10635
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ben Duffield
>
> At various points we assume we only ever talk to a driver on the same host.
> e.g. 
> https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615
> We use pyspark to connect to an existing driver (i.e. do not let pyspark 
> launch the driver itself, but instead construct the SparkContext with the 
> gateway and jsc arguments.
> There are a few reasons for this, but essentially it's to allow more 
> flexibility when running in AWS.
> Before 1.3.1 we were able to monkeypatch around this:  
> {code}
> def _load_from_socket(port, serializer):
> sock = socket.socket()
> sock.settimeout(3)
> try:
> sock.connect((host, port))
> rf = sock.makefile("rb", 65536)
> for item in serializer.load_stream(rf):
> yield item
> finally:
> sock.close()
> pyspark.rdd._load_from_socket = _load_from_socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10635) pyspark - running on a different host

2015-09-16 Thread Ben Duffield (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Duffield updated SPARK-10635:
-
Description: 
At various points we assume we only ever talk to a driver on the same host.
e.g. 
https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch 
the driver itself, but instead construct the SparkContext with the gateway and 
jsc arguments.

There are a few reasons for this, but essentially it's to allow more 
flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:  
{code}
def _load_from_socket(port, serializer):
sock = socket.socket()
sock.settimeout(3)
try:
sock.connect((host, port))
rf = sock.makefile("rb", 65536)
for item in serializer.load_stream(rf):
yield item
finally:
sock.close()
pyspark.rdd._load_from_socket = _load_from_socket
{code}

  was:
At various points we assume we only ever talk to a driver on the same host.
e.g. 
https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch 
the driver itself, but instead construct the SparkContext with the gateway and 
jsc arguments.

There are a few reasons for this, but essentially it's to allow more 
flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:  
{code}
def _load_from_socket(port, serializer):
sock = socket.socket()
sock.settimeout(3)
try:
sock.connect((host, port))
rf = sock.makefile("rb", 65536)
for item in serializer.load_stream(rf):
yield item
finally:
sock.close()
pyspark.rdd._load_from_socket = _load_from_socket
{code}


> pyspark - running on a different host
> -
>
> Key: SPARK-10635
> URL: https://issues.apache.org/jira/browse/SPARK-10635
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ben Duffield
>
> At various points we assume we only ever talk to a driver on the same host.
> e.g. 
> https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615
> We use pyspark to connect to an existing driver (i.e. do not let pyspark 
> launch the driver itself, but instead construct the SparkContext with the 
> gateway and jsc arguments.
> There are a few reasons for this, but essentially it's to allow more 
> flexibility when running in AWS.
> Before 1.3.1 we were able to monkeypatch around this:  
> {code}
> def _load_from_socket(port, serializer):
> sock = socket.socket()
> sock.settimeout(3)
> try:
> sock.connect((host, port))
> rf = sock.makefile("rb", 65536)
> for item in serializer.load_stream(rf):
> yield item
> finally:
> sock.close()
> pyspark.rdd._load_from_socket = _load_from_socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10635) pyspark - running on a different host

2015-09-16 Thread Ben Duffield (JIRA)
Ben Duffield created SPARK-10635:


 Summary: pyspark - running on a different host
 Key: SPARK-10635
 URL: https://issues.apache.org/jira/browse/SPARK-10635
 Project: Spark
  Issue Type: Improvement
Reporter: Ben Duffield


At various points we assume we only ever talk to a driver on the same host.
e.g. 
https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch 
the driver itself, but instead construct the SparkContext with the gateway and 
jsc arguments.

There are a few reasons for this, but essentially it's to allow more 
flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:  

def _load_from_socket(port, serializer):
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
pyspark.rdd._load_from_socket = _load_from_socket




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6810) Performance benchmarks for SparkR

2015-09-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790700#comment-14790700
 ] 

Xiangrui Meng commented on SPARK-6810:
--

The ML part (GLM) is about the same; all computation is on the Scala side. I 
would wait until 1.6 to benchmark GLM, because we are going to implement the 
same algorithm as R in 1.6.

> Performance benchmarks for SparkR
> -
>
> Key: SPARK-6810
> URL: https://issues.apache.org/jira/browse/SPARK-6810
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Priority: Critical
>
> We should port some performance benchmarks from spark-perf to SparkR for 
> tracking performance regressions / improvements.
> https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list 
> of PySpark performance benchmarks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats

2015-09-16 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790713#comment-14790713
 ] 

Seth Hendrickson commented on SPARK-10602:
--

Right now I have working versions of single pass algos for skewness and 
kurtosis. I'm still writing tests and need to work on performance profiling, 
but I just wanted to make sure we don't duplicate any work. I'll post back with 
more details soon.
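For context, a minimal sketch of the kind of single-pass update such aggregates rely on is 
shown below. This is not the implementation under discussion; the class and method names are 
made up, and it simply applies the standard online central-moment update formulas.

{code}
// Hypothetical single-pass accumulator for skewness and kurtosis (illustrative
// only, not Spark's UDAF code). Maintains count, mean, and the 2nd-4th central
// moment sums, updated one value at a time.
final class MomentAccumulator {
  private var n = 0L
  private var mean = 0.0
  private var m2 = 0.0  // sum of squared deviations from the mean
  private var m3 = 0.0
  private var m4 = 0.0

  def update(x: Double): Unit = {
    val n1 = n
    n += 1
    val delta = x - mean
    val deltaN = delta / n
    val deltaN2 = deltaN * deltaN
    val term1 = delta * deltaN * n1
    mean += deltaN
    m4 += term1 * deltaN2 * (n * n - 3 * n + 3) + 6 * deltaN2 * m2 - 4 * deltaN * m3
    m3 += term1 * deltaN * (n - 2) - 3 * deltaN * m2
    m2 += term1
  }

  // Both statistics are undefined for fewer than two distinct values.
  def skewness: Double = math.sqrt(n.toDouble) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n * m4 / (m2 * m2) - 3.0  // excess kurtosis
}
{code}

A real UDAF would additionally need a merge step to combine partial accumulators from 
different partitions, which is where most of the care (and testing) goes.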

> Univariate statistics as UDAFs: single-pass continuous stats
> 
>
> Key: SPARK-10602
> URL: https://issues.apache.org/jira/browse/SPARK-10602
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>
> See parent JIRA for more details.  This subtask covers statistics for 
> continuous values requiring a single pass over the data, such as min and max.
> This JIRA is an umbrella.  For individual stats, please create and link a new 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10637) DataFrames: saving with nested User Data Types

2015-09-16 Thread Joao (JIRA)
Joao created SPARK-10637:


 Summary: DataFrames: saving with nested User Data Types
 Key: SPARK-10637
 URL: https://issues.apache.org/jira/browse/SPARK-10637
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Joao


Cannot save DataFrames using a nested UserDefinedType.
I wrote a simple example to show the error.

It causes the following error: java.lang.IllegalArgumentException: Nested type 
should be repeated: required group array {
  required int32 num;
}

{code:java}
import org.apache.spark.sql.SaveMode
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[AUDT])
case class A(list:Seq[B])

class AUDT extends UserDefinedType[A] {
  override def sqlType: DataType = StructType(Seq(StructField("list", 
ArrayType(BUDT, containsNull = false), nullable = true)))
  override def userClass: Class[A] = classOf[A]
  override def serialize(obj: Any): Any = obj match {
case A(list) =>
  val row = new GenericMutableRow(1)
  row.update(0, new GenericArrayData(list.map(_.asInstanceOf[Any]).toArray))
  row
  }

  override def deserialize(datum: Any): A = {
datum match {
  case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq)
}
  }
}

object AUDT extends AUDT

@SQLUserDefinedType(udt = classOf[BUDT])
case class B(num:Int)

class BUDT extends UserDefinedType[B] {
  override def sqlType: DataType = StructType(Seq(StructField("num", 
IntegerType, nullable = false)))
  override def userClass: Class[B] = classOf[B]
  override def serialize(obj: Any): Any = obj match {
case B(num) =>
  val row = new GenericMutableRow(1)
  row.setInt(0, num)
  row
  }

  override def deserialize(datum: Any): B = {
datum match {
  case row: InternalRow => new B(row.getInt(0))
}
  }
}

object BUDT extends BUDT

object TestNested {
  def main(args:Array[String]) = {
val col = Seq(new A(Seq(new B(1), new B(2))),
  new A(Seq(new B(3), new B(4))))

val sc = new SparkContext(new 
SparkConf().setMaster("local[1]").setAppName("TestSpark"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(1 to 2 zip col).toDF()

df.show()

df.write.mode(SaveMode.Overwrite).save(...)


  }
}
{code}

The error log is shown below:

{noformat}
15/09/16 16:44:36 WARN : Your hostname, X resolves to a loopback/non-reachable 
address: fe80:0:0:0:c4c7:8c4b:4a24:f8a1%14, but we couldn't find any external 
IP address!
15/09/16 16:44:38 INFO deprecation: mapred.job.id is deprecated. Instead, use 
mapreduce.job.id
15/09/16 16:44:38 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
mapreduce.task.id
15/09/16 16:44:38 INFO deprecation: mapred.task.id is deprecated. Instead, use 
mapreduce.task.attempt.id
15/09/16 16:44:38 INFO deprecation: mapred.task.is.map is deprecated. Instead, 
use mapreduce.task.ismap
15/09/16 16:44:38 INFO deprecation: mapred.task.partition is deprecated. 
Instead, use mapreduce.task.partition
15/09/16 16:44:38 INFO ParquetRelation: Using default output committer for 
Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
15/09/16 16:44:38 INFO DefaultWriterContainer: Using user defined output 
committer class org.apache.parquet.hadoop.ParquetOutputCommitter
15/09/16 16:44:38 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 
localhost:50986 in memory (size: 1402.0 B, free: 973.6 MB)
15/09/16 16:44:38 INFO ContextCleaner: Cleaned accumulator 1
15/09/16 16:44:39 INFO SparkContext: Starting job: save at TestNested.scala:73
15/09/16 16:44:39 INFO DAGScheduler: Got job 1 (save at TestNested.scala:73) 
with 1 output partitions
15/09/16 16:44:39 INFO DAGScheduler: Final stage: ResultStage 1(save at 
TestNested.scala:73)
15/09/16 16:44:39 INFO DAGScheduler: Parents of final stage: List()
15/09/16 16:44:39 INFO DAGScheduler: Missing parents: List()
15/09/16 16:44:39 INFO DAGScheduler: Submitting ResultStage 1 
(MapPartitionsRDD[1] at rddToDataFrameHolder at TestNested.scala:69), which has 
no missing parents
15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(59832) called with 
curMem=0, maxMem=1020914565
15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 58.4 KB, free 973.6 MB)
15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(20794) called with 
curMem=59832, maxMem=1020914565
15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in 
memory (estimated size 20.3 KB, free 973.5 MB)
15/09/16 16:44:39 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 
localhost:50986 (size: 20.3 KB, free: 973.6 MB)
15/09/16 16:44:39 INFO SparkContext: Created broadcast 1 from broadcast at 

[jira] [Created] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks

2015-09-16 Thread Glenn Strycker (JIRA)
Glenn Strycker created SPARK-10636:
--

 Summary: RDD filter does not work after if..then..else RDD blocks
 Key: SPARK-10636
 URL: https://issues.apache.org/jira/browse/SPARK-10636
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Glenn Strycker


I have an RDD declaration of the following form:

{code}
val myRDD = if (some condition) { tempRDD1.some operations } else { 
tempRDD2.some operations}.filter(a => a._2._5 <= 50)
{code}

When I output the contents of myRDD, I found entries that clearly had a._2._5 > 
50... the filter didn't work!

If I move the filter inside of the if..then blocks, it suddenly does work:

{code}
val myRDD = if (some condition) { tempRDD1.some operations.filter(a => a._2._5 
<= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) }
{code}

I ran toDebugString after both of these code examples, and "filter" does appear 
in the DAG for the second example, but DOES NOT appear in the first DAG.  Why 
not?

Am I misusing the if..then..else syntax for Spark/Scala?


Here is my actual code... ignore the crazy naming conventions and what it's 
doing...

{code}
// this does NOT work

val myRDD = if (tempRDD2.count() > 0) {
   tempRDD1.
 map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
 leftOuterJoin(tempRDD2).
 map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, 
a._2._2.getOrElse(1L)))).
 leftOuterJoin(tempRDD2).
 map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, 
a._2._2.getOrElse(1L)))).
 map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), 
(a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if 
(a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4)))
   } else {
 tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L)))
   }.
   filter(a => a._2._5 <= 50).
   partitionBy(partitioner).
   setName("myRDD").
   persist(StorageLevel.MEMORY_AND_DISK_SER)

myRDD.checkpoint()

println(myRDD.toDebugString)

// (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 []
//  |  MapPartitionsRDD[57] at map at myProgram.scala:2119 []
//  |  MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 []
//  |  MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 []
//  |  CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 []
//  +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 []
//  |  |  MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 []
//  |  |  MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 []
//  |  |  CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 []
//  |  +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 []
//  |  |  |  clusterGraphWithComponentsRDD MapPartitionsRDD[28] at reduceByKey 
at myProgram.scala:1689 []
//  |  |  |  CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 B; 
DiskSize: 0.0 B
//  |  |  |  CheckpointRDD[29] at count at myProgram.scala:1701 []
//  |  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at 
myProgram.scala:383 []
//  | |  CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; 
DiskSize: 0.0 B
//  | |  CheckpointRDD[17] at count at myProgram.scala:394 []
//  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at 
myProgram.scala:383 []
// |  CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; 
DiskSize: 0.0 B
// |  CheckpointRDD[17] at count at myProgram.scala:394 []




// this DOES work!

val myRDD = if (tempRDD2.count() > 0) {
   tempRDD1.
 map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
 leftOuterJoin(tempRDD2).
 map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, 
a._2._2.getOrElse(1L)))).
 leftOuterJoin(tempRDD2).
 map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, 
a._2._2.getOrElse(1L)))).
 map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), 
(a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if 
(a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))).
 filter(a => a._2._5 <= 50)
   } else {
 tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))).
 filter(a => a._2._5 <= 50)
   }.
   partitionBy(partitioner).
   setName("myRDD").
   persist(StorageLevel.MEMORY_AND_DISK_SER)

myRDD.checkpoint()

println(myRDD.toDebugString)

// (4) MapPartitionsRDD[59] at filter at myProgram.scala:2121 []
//  |  MapPartitionsRDD[58] at map at myProgram.scala:2120 []
//  |  MapPartitionsRDD[57] at map at myProgram.scala:2119 []
//  |  MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 []
//  |  MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 []
//  |  CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 []
//  +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 []
//  |  |  MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 []
//  |  |  

[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.

2015-09-16 Thread Prachi Burathoki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790634#comment-14790634
 ] 

Prachi Burathoki commented on SPARK-10634:
--

I tried escaping the " with both \ and ", but got the same error. But I've found a 
workaround: in the where clause I changed it to listc = 'this is a"test"' and 
that works.
Thanks
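For illustration, a small sketch of that workaround (the table and column names below are 
placeholders, not taken from the report, and an existing SQLContext is assumed): single-quote 
the SQL string literal so the embedded double quotes need no escaping.

{code}
// Sketch only: sqlContext is an existing org.apache.spark.sql.SQLContext and
// TABLE_1 is a registered table with a string column named listc.
val query =
  """SELECT listc FROM TABLE_1 WHERE listc = 'this is a "test"'"""
val result = sqlContext.sql(query)
result.show()
{code}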

> The spark sql fails if the where clause contains a string with " in it.
> ---
>
> Key: SPARK-10634
> URL: https://issues.apache.org/jira/browse/SPARK-10634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Prachi Burathoki
>
> When running a sql query in which the where clause contains a string with " 
> in it, the sql parser throws an error.
> Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but 
> identifier test found
> SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER 
> FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test""))
>   
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>   at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933)
>   at 
> com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106)
>   at 
> com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93)
>   at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153)
>   at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752)
>   at 
> com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011)
>   ... 31 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-10634:
---

> The spark sql fails if the where clause contains a string with " in it.
> ---
>
> Key: SPARK-10634
> URL: https://issues.apache.org/jira/browse/SPARK-10634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Prachi Burathoki
>
> When running a sql query in which the where clause contains a string with " 
> in it, the sql parser throws an error.
> Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but 
> identifier test found
> SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER 
> FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test""))
>   
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>   at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933)
>   at 
> com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106)
>   at 
> com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93)
>   at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153)
>   at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752)
>   at 
> com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011)
>   ... 31 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9296) variance, var_pop, and var_samp aggregate functions

2015-09-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9296:
---

Assignee: Apache Spark

> variance, var_pop, and var_samp aggregate functions
> ---
>
> Key: SPARK-9296
> URL: https://issues.apache.org/jira/browse/SPARK-9296
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.
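As a small point of reference (not the aggregate interface referred to above), the three 
functions differ only in the normalization applied to the accumulated sum of squared 
deviations; a tiny sketch with made-up helper names:

{code}
// Illustrative only: given a single-pass count n and m2 (the sum of squared
// deviations from the running mean), the two variants are:
def varPop(n: Long, m2: Double): Double =
  if (n > 0) m2 / n else Double.NaN        // population variance

def varSamp(n: Long, m2: Double): Double =
  if (n > 1) m2 / (n - 1) else Double.NaN  // sample (unbiased) variance

// Which of the two a plain `variance` alias maps to depends on the SQL dialect.
{code}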



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9296) variance, var_pop, and var_samp aggregate functions

2015-09-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790663#comment-14790663
 ] 

Apache Spark commented on SPARK-9296:
-

User 'JihongMA' has created a pull request for this issue:
https://github.com/apache/spark/pull/8778

> variance, var_pop, and var_samp aggregate functions
> ---
>
> Key: SPARK-9296
> URL: https://issues.apache.org/jira/browse/SPARK-9296
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10637) DataFrames: saving with nested User Data Types

2015-09-16 Thread Joao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joao updated SPARK-10637:
-
Description: 
Cannot save DataFrames using a nested UserDefinedType.
I wrote a simple example to show the error.

It causes the following error: java.lang.IllegalArgumentException: Nested type 
should be repeated: required group array {
  required int32 num;
}

{code:java}
import org.apache.spark.sql.SaveMode
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[AUDT])
case class A(list:Seq[B])

class AUDT extends UserDefinedType[A] {
  override def sqlType: DataType = StructType(Seq(StructField("list", 
ArrayType(BUDT, containsNull = false), nullable = true)))
  override def userClass: Class[A] = classOf[A]
  override def serialize(obj: Any): Any = obj match {
case A(list) =>
  val row = new GenericMutableRow(1)
  row.update(0, new GenericArrayData(list.map(_.asInstanceOf[Any]).toArray))
  row
  }

  override def deserialize(datum: Any): A = {
datum match {
  case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq)
}
  }
}

object AUDT extends AUDT

@SQLUserDefinedType(udt = classOf[BUDT])
case class B(num:Int)

class BUDT extends UserDefinedType[B] {
  override def sqlType: DataType = StructType(Seq(StructField("num", 
IntegerType, nullable = false)))
  override def userClass: Class[B] = classOf[B]
  override def serialize(obj: Any): Any = obj match {
case B(num) =>
  val row = new GenericMutableRow(1)
  row.setInt(0, num)
  row
  }

  override def deserialize(datum: Any): B = {
datum match {
  case row: InternalRow => new B(row.getInt(0))
}
  }
}

object BUDT extends BUDT

object TestNested {
  def main(args:Array[String]) = {
val col = Seq(new A(Seq(new B(1), new B(2))),
  new A(Seq(new B(3), new B(4))))

val sc = new SparkContext(new 
SparkConf().setMaster("local[1]").setAppName("TestSpark"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(1 to 2 zip col).toDF()

df.show()

df.write.mode(SaveMode.Overwrite).save(...)


  }
}
{code}

The error log is shown below:

{noformat}
15/09/16 16:44:36 WARN : Your hostname, X resolves to a loopback/non-reachable 
address: fe80:0:0:0:c4c7:8c4b:4a24:f8a1%14, but we couldn't find any external 
IP address!
15/09/16 16:44:38 INFO deprecation: mapred.job.id is deprecated. Instead, use 
mapreduce.job.id
15/09/16 16:44:38 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
mapreduce.task.id
15/09/16 16:44:38 INFO deprecation: mapred.task.id is deprecated. Instead, use 
mapreduce.task.attempt.id
15/09/16 16:44:38 INFO deprecation: mapred.task.is.map is deprecated. Instead, 
use mapreduce.task.ismap
15/09/16 16:44:38 INFO deprecation: mapred.task.partition is deprecated. 
Instead, use mapreduce.task.partition
15/09/16 16:44:38 INFO ParquetRelation: Using default output committer for 
Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
15/09/16 16:44:38 INFO DefaultWriterContainer: Using user defined output 
committer class org.apache.parquet.hadoop.ParquetOutputCommitter
15/09/16 16:44:38 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 
localhost:50986 in memory (size: 1402.0 B, free: 973.6 MB)
15/09/16 16:44:38 INFO ContextCleaner: Cleaned accumulator 1
15/09/16 16:44:39 INFO SparkContext: Starting job: save at TestNested.scala:73
15/09/16 16:44:39 INFO DAGScheduler: Got job 1 (save at TestNested.scala:73) 
with 1 output partitions
15/09/16 16:44:39 INFO DAGScheduler: Final stage: ResultStage 1(save at 
TestNested.scala:73)
15/09/16 16:44:39 INFO DAGScheduler: Parents of final stage: List()
15/09/16 16:44:39 INFO DAGScheduler: Missing parents: List()
15/09/16 16:44:39 INFO DAGScheduler: Submitting ResultStage 1 
(MapPartitionsRDD[1] at rddToDataFrameHolder at TestNested.scala:69), which has 
no missing parents
15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(59832) called with 
curMem=0, maxMem=1020914565
15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 58.4 KB, free 973.6 MB)
15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(20794) called with 
curMem=59832, maxMem=1020914565
15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in 
memory (estimated size 20.3 KB, free 973.5 MB)
15/09/16 16:44:39 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 
localhost:50986 (size: 20.3 KB, free: 973.6 MB)
15/09/16 16:44:39 INFO SparkContext: Created broadcast 1 from broadcast at 
DAGScheduler.scala:861
15/09/16 16:44:39 INFO DAGScheduler: Submitting 1 missing tasks from 
ResultStage 1 (MapPartitionsRDD[1] at rddToDataFrameHolder at 

[jira] [Commented] (SPARK-6810) Performance benchmarks for SparkR

2015-09-16 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790654#comment-14790654
 ] 

Shivaram Venkataraman commented on SPARK-6810:
--

So the DataFrame API doesn't need many performance benchmarks, as we mostly 
wrap all our calls to Java / Scala. However, we are adding new ML API 
components, and [~mengxr] will be able to provide more guidance for this.

cc [~sunrui] 

> Performance benchmarks for SparkR
> -
>
> Key: SPARK-6810
> URL: https://issues.apache.org/jira/browse/SPARK-6810
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Priority: Critical
>
> We should port some performance benchmarks from spark-perf to SparkR for 
> tracking performance regressions / improvements.
> https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list 
> of PySpark performance benchmarks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks

2015-09-16 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790681#comment-14790681
 ] 

Glenn Strycker commented on SPARK-10636:


I didn't "forget", I believed that "RDD = if {} else {} . something" would 
automatically take care of the associative property, and that anything after 
the final else {} would apply to both blocks.  I didn't realize that braces 
behave similarly to parentheses and that I needed extras -- makes sense.  I 
have now added these to my code.

This wasn't a question for "user@ first", since I really did believe there was 
a bug.  Jira is the place for submitting bug reports, even when the resolution 
is user error.

> RDD filter does not work after if..then..else RDD blocks
> 
>
> Key: SPARK-10636
> URL: https://issues.apache.org/jira/browse/SPARK-10636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I have an RDD declaration of the following form:
> {code}
> val myRDD = if (some condition) { tempRDD1.some operations } else { 
> tempRDD2.some operations}.filter(a => a._2._5 <= 50)
> {code}
> When I output the contents of myRDD, I found entries that clearly had a._2._5 
> > 50... the filter didn't work!
> If I move the filter inside of the if..then blocks, it suddenly does work:
> {code}
> val myRDD = if (some condition) { tempRDD1.some operations.filter(a => 
> a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) }
> {code}
> I ran toDebugString after both of these code examples, and "filter" does 
> appear in the DAG for the second example, but DOES NOT appear in the first 
> DAG.  Why not?
> Am I misusing the if..then..else syntax for Spark/Scala?
> Here is my actual code... ignore the crazy naming conventions and what it's 
> doing...
> {code}
> // this does NOT work
> val myRDD = if (tempRDD2.count() > 0) {
>tempRDD1.
>  map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
>  leftOuterJoin(tempRDD2).
>  map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, 
> a._2._2.getOrElse(1L.
>  leftOuterJoin(tempRDD2).
>  map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, 
> a._2._2.getOrElse(1L.
>  map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), 
> (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if 
> (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4)))
>} else {
>  tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L)))
>}.
>filter(a => a._2._5 <= 50).
>partitionBy(partitioner).
>setName("myRDD").
>persist(StorageLevel.MEMORY_AND_DISK_SER)
> myRDD.checkpoint()
> println(myRDD.toDebugString)
> // (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 []
> //  |  MapPartitionsRDD[57] at map at myProgram.scala:2119 []
> //  |  MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 []
> //  |  MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 []
> //  |  CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 []
> //  +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 []
> //  |  |  MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 []
> //  |  |  MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 []
> //  |  |  CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 []
> //  |  +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 []
> //  |  |  |  clusterGraphWithComponentsRDD MapPartitionsRDD[28] at 
> reduceByKey at myProgram.scala:1689 []
> //  |  |  |  CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 
> B; DiskSize: 0.0 B
> //  |  |  |  CheckpointRDD[29] at count at myProgram.scala:1701 []
> //  |  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at 
> myProgram.scala:383 []
> //  | |  CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 
> B; DiskSize: 0.0 B
> //  | |  CheckpointRDD[17] at count at myProgram.scala:394 []
> //  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at 
> myProgram.scala:383 []
> // |  CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; 
> DiskSize: 0.0 B
> // |  CheckpointRDD[17] at count at myProgram.scala:394 []
> // this DOES work!
> val myRDD = if (tempRDD2.count() > 0) {
>tempRDD1.
>  map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
>  leftOuterJoin(tempRDD2).
>  map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, 
> a._2._2.getOrElse(1L.
>  leftOuterJoin(tempRDD2).
>  map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, 
> a._2._2.getOrElse(1L.
>  map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), 
> (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if 
> (a._1._1 <= a._1._2) { a._2._4 } else { 

[jira] [Closed] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks

2015-09-16 Thread Glenn Strycker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glenn Strycker closed SPARK-10636.
--

> RDD filter does not work after if..then..else RDD blocks
> 
>
> Key: SPARK-10636
> URL: https://issues.apache.org/jira/browse/SPARK-10636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I have an RDD declaration of the following form:
> {code}
> val myRDD = if (some condition) { tempRDD1.some operations } else { 
> tempRDD2.some operations}.filter(a => a._2._5 <= 50)
> {code}
> When I output the contents of myRDD, I found entries that clearly had a._2._5 
> > 50... the filter didn't work!
> If I move the filter inside of the if..then blocks, it suddenly does work:
> {code}
> val myRDD = if (some condition) { tempRDD1.some operations.filter(a => 
> a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) }
> {code}
> I ran toDebugString after both of these code examples, and "filter" does 
> appear in the DAG for the second example, but DOES NOT appear in the first 
> DAG.  Why not?
> Am I misusing the if..then..else syntax for Spark/Scala?
> Here is my actual code... ignore the crazy naming conventions and what it's 
> doing...
> {code}
> // this does NOT work
> val myRDD = if (tempRDD2.count() > 0) {
>tempRDD1.
>  map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
>  leftOuterJoin(tempRDD2).
>  map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, 
> a._2._2.getOrElse(1L.
>  leftOuterJoin(tempRDD2).
>  map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, 
> a._2._2.getOrElse(1L.
>  map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), 
> (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if 
> (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4)))
>} else {
>  tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L)))
>}.
>filter(a => a._2._5 <= 50).
>partitionBy(partitioner).
>setName("myRDD").
>persist(StorageLevel.MEMORY_AND_DISK_SER)
> myRDD.checkpoint()
> println(myRDD.toDebugString)
> // (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 []
> //  |  MapPartitionsRDD[57] at map at myProgram.scala:2119 []
> //  |  MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 []
> //  |  MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 []
> //  |  CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 []
> //  +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 []
> //  |  |  MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 []
> //  |  |  MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 []
> //  |  |  CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 []
> //  |  +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 []
> //  |  |  |  clusterGraphWithComponentsRDD MapPartitionsRDD[28] at 
> reduceByKey at myProgram.scala:1689 []
> //  |  |  |  CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 
> B; DiskSize: 0.0 B
> //  |  |  |  CheckpointRDD[29] at count at myProgram.scala:1701 []
> //  |  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at 
> myProgram.scala:383 []
> //  | |  CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 
> B; DiskSize: 0.0 B
> //  | |  CheckpointRDD[17] at count at myProgram.scala:394 []
> //  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at 
> myProgram.scala:383 []
> // |  CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; 
> DiskSize: 0.0 B
> // |  CheckpointRDD[17] at count at myProgram.scala:394 []
> // this DOES work!
> val myRDD = if (tempRDD2.count() > 0) {
>tempRDD1.
>  map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
>  leftOuterJoin(tempRDD2).
>  map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, 
> a._2._2.getOrElse(1L.
>  leftOuterJoin(tempRDD2).
>  map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, 
> a._2._2.getOrElse(1L.
>  map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), 
> (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if 
> (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))).
>  filter(a => a._2._5 <= 50)
>} else {
>  tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))).
>  filter(a => a._2._5 <= 50)
>}.
>partitionBy(partitioner).
>setName("myRDD").
>persist(StorageLevel.MEMORY_AND_DISK_SER)
> myRDD.checkpoint()
> println(myRDD.toDebugString)
> // (4) MapPartitionsRDD[59] at filter at myProgram.scala:2121 []
> //  |  MapPartitionsRDD[58] at map at myProgram.scala:2120 []
> //  |  MapPartitionsRDD[57] at map at myProgram.scala:2119 []
> //  |  MapPartitionsRDD[56] 

[jira] [Closed] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.

2015-09-16 Thread Prachi Burathoki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prachi Burathoki closed SPARK-10634.

Resolution: Done

> The spark sql fails if the where clause contains a string with " in it.
> ---
>
> Key: SPARK-10634
> URL: https://issues.apache.org/jira/browse/SPARK-10634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Prachi Burathoki
>
> When running a sql query in which the where clause contains a string with " 
> in it, the sql parser throws an error.
> Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but 
> identifier test found
> SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER 
> FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test""))
>   
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>   at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933)
>   at 
> com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106)
>   at 
> com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93)
>   at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153)
>   at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752)
>   at 
> com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011)
>   ... 31 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10636.
---
Resolution: Not A Problem

In the first case, your {{.filter}} statement plainly applies only to the else 
block. That's the behavior you see. Did you forget to wrap the if-else in 
parens?
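A tiny self-contained illustration of that precedence (toy data, not the reporter's code):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object IfElseFilterDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[1]").setAppName("IfElseFilterDemo"))
    val nums = sc.parallelize(Seq(1, 10, 100))
    val cond = true

    // Without parentheses the trailing .filter attaches only to the else branch,
    // so when cond is true nothing is filtered out.
    val unwrapped = if (cond) nums.map(_ * 2) else nums.map(_ * 3).filter(_ <= 50)

    // Wrapping the whole if/else expression applies the filter to either branch.
    val wrapped = (if (cond) nums.map(_ * 2) else nums.map(_ * 3)).filter(_ <= 50)

    println(unwrapped.collect().mkString(","))  // 2,20,200
    println(wrapped.collect().mkString(","))    // 2,20
    sc.stop()
  }
}
{code}

Wrapping the if/else (or assigning it to an intermediate val) is also what makes the filter 
show up in the lineage of both branches in toDebugString.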

I'd ask this as a question on user@ first, since I think you are posing this as 
a question. JIRA is not the right place to start with those.

> RDD filter does not work after if..then..else RDD blocks
> 
>
> Key: SPARK-10636
> URL: https://issues.apache.org/jira/browse/SPARK-10636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I have an RDD declaration of the following form:
> {code}
> val myRDD = if (some condition) { tempRDD1.some operations } else { 
> tempRDD2.some operations}.filter(a => a._2._5 <= 50)
> {code}
> When I output the contents of myRDD, I found entries that clearly had a._2._5 
> > 50... the filter didn't work!
> If I move the filter inside of the if..then blocks, it suddenly does work:
> {code}
> val myRDD = if (some condition) { tempRDD1.some operations.filter(a => 
> a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) }
> {code}
> I ran toDebugString after both of these code examples, and "filter" does 
> appear in the DAG for the second example, but DOES NOT appear in the first 
> DAG.  Why not?
> Am I misusing the if..then..else syntax for Spark/Scala?
> Here is my actual code... ignore the crazy naming conventions and what it's 
> doing...
> {code}
> // this does NOT work
> val myRDD = if (tempRDD2.count() > 0) {
>tempRDD1.
>  map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
>  leftOuterJoin(tempRDD2).
>  map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, 
> a._2._2.getOrElse(1L.
>  leftOuterJoin(tempRDD2).
>  map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, 
> a._2._2.getOrElse(1L.
>  map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), 
> (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if 
> (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4)))
>} else {
>  tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L)))
>}.
>filter(a => a._2._5 <= 50).
>partitionBy(partitioner).
>setName("myRDD").
>persist(StorageLevel.MEMORY_AND_DISK_SER)
> myRDD.checkpoint()
> println(myRDD.toDebugString)
> // (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 []
> //  |  MapPartitionsRDD[57] at map at myProgram.scala:2119 []
> //  |  MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 []
> //  |  MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 []
> //  |  CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 []
> //  +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 []
> //  |  |  MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 []
> //  |  |  MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 []
> //  |  |  CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 []
> //  |  +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 []
> //  |  |  |  clusterGraphWithComponentsRDD MapPartitionsRDD[28] at 
> reduceByKey at myProgram.scala:1689 []
> //  |  |  |  CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 
> B; DiskSize: 0.0 B
> //  |  |  |  CheckpointRDD[29] at count at myProgram.scala:1701 []
> //  |  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at 
> myProgram.scala:383 []
> //  | |  CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 
> B; DiskSize: 0.0 B
> //  | |  CheckpointRDD[17] at count at myProgram.scala:394 []
> //  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at 
> myProgram.scala:383 []
> // |  CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; 
> DiskSize: 0.0 B
> // |  CheckpointRDD[17] at count at myProgram.scala:394 []
> // this DOES work!
> val myRDD = if (tempRDD2.count() > 0) {
>tempRDD1.
>  map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
>  leftOuterJoin(tempRDD2).
>  map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, 
> a._2._2.getOrElse(1L.
>  leftOuterJoin(tempRDD2).
>  map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, 
> a._2._2.getOrElse(1L.
>  map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), 
> (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if 
> (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))).
>  filter(a => a._2._5 <= 50)
>} else {
>  tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))).
>  filter(a => a._2._5 <= 50)
>}.
>partitionBy(partitioner).
>

[jira] [Resolved] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10634.
---
Resolution: Not A Problem

> The spark sql fails if the where clause contains a string with " in it.
> ---
>
> Key: SPARK-10634
> URL: https://issues.apache.org/jira/browse/SPARK-10634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Prachi Burathoki
>
> When running a sql query in which the where clause contains a string with " 
> in it, the sql parser throws an error.
> Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but 
> identifier test found
> SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER 
> FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test""))
>   
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>   at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933)
>   at 
> com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106)
>   at 
> com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93)
>   at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153)
>   at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752)
>   at 
> com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011)
>   ... 31 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-10634.
-

> The spark sql fails if the where clause contains a string with " in it.
> ---
>
> Key: SPARK-10634
> URL: https://issues.apache.org/jira/browse/SPARK-10634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Prachi Burathoki
>
> When running a sql query in which the where clause contains a string with " 
> in it, the sql parser throws an error.
> Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but 
> identifier test found
> SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER 
> FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test""))
>   
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>   at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933)
>   at 
> com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106)
>   at 
> com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93)
>   at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153)
>   at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752)
>   at 
> com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011)
>   ... 31 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9296) variance, var_pop, and var_samp aggregate functions

2015-09-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9296:
---

Assignee: (was: Apache Spark)

> variance, var_pop, and var_samp aggregate functions
> ---
>
> Key: SPARK-9296
> URL: https://issues.apache.org/jira/browse/SPARK-9296
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.
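Independent of the exact aggregate interface, a minimal plain-Scala sketch of the
merge-able state such a function needs (an update step per row and a merge step
for combining partitions); this is only an illustration, not the Catalyst API the
ticket targets:

{code}
// Sketch only: tracks (count, mean, M2) so partitions can be combined
// without a second pass over the data.
case class VarBuffer(n: Long, mean: Double, m2: Double) {
  def update(x: Double): VarBuffer = {
    val n1 = n + 1
    val delta = x - mean
    val newMean = mean + delta / n1
    VarBuffer(n1, newMean, m2 + delta * (x - newMean))
  }
  def merge(other: VarBuffer): VarBuffer = {
    if (n == 0L) other
    else if (other.n == 0L) this
    else {
      val total = n + other.n
      val delta = other.mean - mean
      VarBuffer(total,
        mean + delta * other.n / total,
        m2 + other.m2 + delta * delta * n * other.n / total)
    }
  }
  def varPop: Double  = if (n > 0) m2 / n else Double.NaN
  def varSamp: Double = if (n > 1) m2 / (n - 1) else Double.NaN
}

// e.g. Seq(1.0, 2.0, 4.0).foldLeft(VarBuffer(0, 0, 0))(_ update _).varSamp
{code}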



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10632) Cannot save DataFrame with User Defined Types

2015-09-16 Thread Joao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joao updated SPARK-10632:
-
Description: 
Cannot save DataFrames that contain user-defined types.
I tried to save a DataFrame with instances of the Vector class from MLlib and 
got the error.

The code below should reproduce the error.
{noformat}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SaveMode

val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)), (2, Vectors.dense(2, 2, 2)))).toDF()
df.write.format("json").mode(SaveMode.Overwrite).save(path)
{noformat}

The error log is below

{noformat}
15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task.
scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133)
at 
org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 
'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6'
 closed. Now beginning upload
15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 
'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6'
 upload complete
15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt 
attempt_201509160958__m_00_0 aborted.
15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Task failed while writing rows.
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179)
at 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103)
at 

[jira] [Created] (SPARK-10642) Crash in rdd.lookup() with "java.lang.Long cannot be cast to java.lang.Integer"

2015-09-16 Thread Thouis Jones (JIRA)
Thouis Jones created SPARK-10642:


 Summary: Crash in rdd.lookup() with "java.lang.Long cannot be cast 
to java.lang.Integer"
 Key: SPARK-10642
 URL: https://issues.apache.org/jira/browse/SPARK-10642
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.0
 Environment: OSX
Reporter: Thouis Jones


Running this command:

{code}
sc.parallelize([(('a', 'b'), 
'c')]).groupByKey().partitionBy(20).cache().lookup(('a', 'b'))
{code}

gives the following error:
{noformat}
15/09/16 14:22:23 INFO SparkContext: Starting job: runJob at PythonRDD.scala:361
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/rdd.py", 
line 2199, in lookup
return self.ctx.runJob(values, lambda x: x, [self.partitioner(key)])
  File 
"/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/context.py", line 
916, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
partitions)
  File 
"/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/sql/utils.py", 
line 36, in deco
return f(*a, **kw)
  File 
"/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
 line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: java.lang.ClassCastException: java.lang.Long cannot be cast to 
java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$submitJob$1.apply(DAGScheduler.scala:530)
at scala.collection.Iterator$class.find(Iterator.scala:780)
at scala.collection.AbstractIterator.find(Iterator.scala:1157)
at scala.collection.IterableLike$class.find(IterableLike.scala:79)
at scala.collection.AbstractIterable.find(Iterable.scala:54)
at 
org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:530)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:361)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4440) Enhance the job progress API to expose more information

2015-09-16 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790935#comment-14790935
 ] 

Josh Rosen commented on SPARK-4440:
---

Does anyone still want these extensions? If so, can you please come up with a 
concrete proposal for exactly what you'd like to add?
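For concreteness, a small sketch of what the existing progress API already exposes
(assuming a live SparkContext named sc); any extension would presumably grow the
info classes returned here:

{code}
// Sketch of the existing status API (SparkStatusTracker, available since 1.2);
// assumes a live SparkContext named sc.
val tracker = sc.statusTracker
for (jobId <- tracker.getActiveJobIds(); job <- tracker.getJobInfo(jobId)) {
  println(s"job $jobId: status=${job.status}")
  for (stageId <- job.stageIds; stage <- tracker.getStageInfo(stageId)) {
    println(s"  stage $stageId: ${stage.numCompletedTasks}/${stage.numTasks} tasks, " +
      s"${stage.numFailedTasks} failed")
  }
}
{code}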

> Enhance the job progress API to expose more information
> ---
>
> Key: SPARK-4440
> URL: https://issues.apache.org/jira/browse/SPARK-4440
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Rui Li
>
> The progress API introduced in SPARK-2321 provides a new way for user to 
> monitor job progress. However the information exposed in the API is 
> relatively limited. It'll be much more useful if we can enhance the API to 
> expose more data.
> Some improvement for example may include but not limited to:
> 1. Stage submission and completion time.
> 2. Task metrics.
> The requirement is initially identified for the hive on spark 
> project(HIVE-7292), other application should benefit as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4442) Move common unit test utilities into their own package / module

2015-09-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4442.
---
Resolution: Won't Fix

We're using test-jar dependencies instead, so this is "Won't Fix".

> Move common unit test utilities into their own package / module
> ---
>
> Key: SPARK-4442
> URL: https://issues.apache.org/jira/browse/SPARK-4442
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Josh Rosen
>Priority: Minor
>
> We should move generally-useful unit test fixtures / utility methods to their 
> own test-utilities package / module to make them easier to find / use.
> See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for 
> one example of this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10645) Bivariate Statistics for continuous vs. continuous

2015-09-16 Thread Jihong MA (JIRA)
Jihong MA created SPARK-10645:
-

 Summary: Bivariate Statistics for continuous vs. continuous
 Key: SPARK-10645
 URL: https://issues.apache.org/jira/browse/SPARK-10645
 Project: Spark
  Issue Type: New Feature
Reporter: Jihong MA


This is an umbrella JIRA covering bivariate statistics for continuous vs. 
continuous columns, including covariance, Pearson's correlation, and Spearman's 
correlation (for both continuous & categorical).
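For reference, a minimal sketch of what the RDD-based MLlib API already offers for
two continuous columns (assuming an existing SparkContext sc); this umbrella
presumably covers DataFrame-native equivalents:

{code}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Two continuous "columns" as RDD[Double]; assumes an existing SparkContext sc.
val x: RDD[Double] = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))
val y: RDD[Double] = sc.parallelize(Seq(2.1, 3.9, 6.0, 8.2, 9.8))

val pearson  = Statistics.corr(x, y, "pearson")   // Pearson's correlation
val spearman = Statistics.corr(x, y, "spearman")  // Spearman's rank correlation
// Covariance is also available on DataFrames via df.stat.cov("a", "b") (1.4+).
{code}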



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats

2015-09-16 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790816#comment-14790816
 ] 

Jihong MA commented on SPARK-10602:
---

I went ahead and created SPARK-10641. Since this JIRA is not listed as an 
umbrella, I couldn't link to it directly and instead linked to SPARK-10384. 
@Joseph, can you assign SPARK-10641 to Seth and help fix the link? Thanks!

> Univariate statistics as UDAFs: single-pass continuous stats
> 
>
> Key: SPARK-10602
> URL: https://issues.apache.org/jira/browse/SPARK-10602
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>
> See parent JIRA for more details.  This subtask covers statistics for 
> continuous values requiring a single pass over the data, such as min and max.
> This JIRA is an umbrella.  For individual stats, please create and link a new 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType

2015-09-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10485:
-
Affects Version/s: 1.5.0

> IF expression is not correctly resolved when one of the options have NullType
> -
>
> Key: SPARK-10485
> URL: https://issues.apache.org/jira/browse/SPARK-10485
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Antonio Jesus Navarro
>
> If we have this query:
> {code}
> SELECT IF(column > 1, 1, NULL) FROM T1
> {code}
> On Spark 1.4.1 we have this:
> {code}
> override lazy val resolved = childrenResolved && trueValue.dataType == 
> falseValue.dataType
> {code}
> So if one of the types is NullType, the if expression is not resolved.
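A self-contained toy model (simplified types, not Spark's actual classes) of why a
NULL branch fails the quoted check, and what a more permissive rule would accept;
the real fix may instead coerce the NULL branch:

{code}
sealed trait DType
case object IntType  extends DType
case object NullType extends DType

// The quoted 1.4.1 rule in miniature: both branches must have the same type.
def resolvedStrict(t: DType, f: DType): Boolean = t == f

// A more permissive rule that tolerates a NullType branch.
def resolvedLenient(t: DType, f: DType): Boolean =
  t == f || t == NullType || f == NullType

resolvedStrict(IntType, NullType)   // false -> IF(cond, 1, NULL) never resolves
resolvedLenient(IntType, NullType)  // true
{code}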



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10643) Support HDFS urls in spark-submit

2015-09-16 Thread Alan Braithwaite (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Braithwaite updated SPARK-10643:
-
Description: 
When using mesos with docker and marathon, it would be nice to be able to make 
spark-submit deployable on marathon and have that download a jar from HDFS 
instead of having to package the jar with the docker.

{code}
$ docker run -it docker.example.com/spark:latest 
/usr/local/spark/bin/spark-submit  --class 
com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar 
Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

Although I'm aware that we can run in cluster mode with mesos, we've already 
built some nice tools surrounding marathon for logging and monitoring.

Code in question:
https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L685-L698

  was:
When using mesos with docker and marathon, it would be nice to be able to make 
spark-submit deployable on marathon and have that download a jar from HDFS 
instead of having to package the jar with the docker.

{code}
$ docker run -it docker.example.com/spark:latest 
/usr/local/spark/bin/spark-submit  --class 
com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar 
Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

Although I'm aware that we can run in cluster mode with mesos, we've already 
built some nice tools surrounding marathon for logging and monitoring.


> Support HDFS urls in spark-submit
> -
>
> Key: SPARK-10643
> URL: https://issues.apache.org/jira/browse/SPARK-10643
> Project: Spark
>  Issue Type: New Feature
>Reporter: Alan Braithwaite
>Priority: Minor
>
> When using mesos with docker and marathon, it would be nice to be able to 
> make spark-submit deployable on marathon and have that download a jar from 
> HDFS instead of having to package the jar with the docker.
> {code}
> $ docker run -it docker.example.com/spark:latest 
> /usr/local/spark/bin/spark-submit  --class 
> com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar 
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 

[jira] [Updated] (SPARK-10641) skewness and kurtosis support

2015-09-16 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10641:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-10384

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> Implementing skewness and kurtosis support based on the following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10589) Add defense against external site framing

2015-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10589.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8745
[https://github.com/apache/spark/pull/8745]

> Add defense against external site framing
> -
>
> Key: SPARK-10589
> URL: https://issues.apache.org/jira/browse/SPARK-10589
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.6.0
>
>
> This came up as a minor point during a security audit using a common scanning 
> tool: It's best if Spark UIs try to actively defend against certain types of 
> frame-related vulnerabilities by setting X-Frame-Options. See 
> https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet
> Easy PR coming ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2991) RDD transforms for scan and scanLeft

2015-09-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2991.
---
Resolution: Won't Fix

> RDD transforms for scan and scanLeft 
> -
>
> Key: SPARK-2991
> URL: https://issues.apache.org/jira/browse/SPARK-2991
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Minor
>  Labels: features
>
> Provide RDD transforms analogous to Scala scan(z)(f) (parallel prefix scan) 
> and scanLeft(z)(f) (sequential prefix scan)
> Discussion of a scanLeft implementation:
> http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/
> Discussion of scan:
> http://erikerlandson.github.io/blog/2014/08/12/implementing-parallel-prefix-scan-as-a-spark-rdd-transform/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3497) Report serialized size of task binary

2015-09-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3497.
---
Resolution: Fixed

We now have an automatic warning-level log message for large closures.

> Report serialized size of task binary
> -
>
> Key: SPARK-3497
> URL: https://issues.apache.org/jira/browse/SPARK-3497
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Sandy Ryza
>
> This is useful for determining that task closures are larger than expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks

2015-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790792#comment-14790792
 ] 

Sean Owen commented on SPARK-10636:
---

It's a Scala syntax issue, as you suspected when you wondered whether you're 
using if..else the way you think. Stuff happens, but when that crosses your mind, 
I'd float a user@ question first or try a simple test in plain Scala to narrow 
down the behavior in question.
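A minimal plain-Scala illustration (nothing Spark-specific) of the parsing
behavior: a trailing method call binds to the else-branch expression rather than
to the whole if/else, unless the conditional is parenthesized.

{code}
val n = 5

// .map applies only to the else branch, so for n > 0 it is silently skipped.
val a = if (n > 0) { List(1, 2, 3) } else { List(4, 5, 6) }.map(_ * 10)
// a == List(1, 2, 3)

// Parenthesizing the conditional applies .map to whichever branch is taken.
val b = (if (n > 0) { List(1, 2, 3) } else { List(4, 5, 6) }).map(_ * 10)
// b == List(10, 20, 30)
{code}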

> RDD filter does not work after if..then..else RDD blocks
> 
>
> Key: SPARK-10636
> URL: https://issues.apache.org/jira/browse/SPARK-10636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I have an RDD declaration of the following form:
> {code}
> val myRDD = if (some condition) { tempRDD1.some operations } else { 
> tempRDD2.some operations}.filter(a => a._2._5 <= 50)
> {code}
> When I output the contents of myRDD, I found entries that clearly had a._2._5 
> > 50... the filter didn't work!
> If I move the filter inside of the if..then blocks, it suddenly does work:
> {code}
> val myRDD = if (some condition) { tempRDD1.some operations.filter(a => 
> a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) }
> {code}
> I ran toDebugString after both of these code examples, and "filter" does 
> appear in the DAG for the second example, but DOES NOT appear in the first 
> DAG.  Why not?
> Am I misusing the if..then..else syntax for Spark/Scala?
> Here is my actual code... ignore the crazy naming conventions and what it's 
> doing...
> {code}
> // this does NOT work
> val myRDD = if (tempRDD2.count() > 0) {
>tempRDD1.
>  map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
>  leftOuterJoin(tempRDD2).
>  map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, 
> a._2._2.getOrElse(1L.
>  leftOuterJoin(tempRDD2).
>  map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, 
> a._2._2.getOrElse(1L.
>  map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), 
> (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if 
> (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4)))
>} else {
>  tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L)))
>}.
>filter(a => a._2._5 <= 50).
>partitionBy(partitioner).
>setName("myRDD").
>persist(StorageLevel.MEMORY_AND_DISK_SER)
> myRDD.checkpoint()
> println(myRDD.toDebugString)
> // (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 []
> //  |  MapPartitionsRDD[57] at map at myProgram.scala:2119 []
> //  |  MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 []
> //  |  MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 []
> //  |  CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 []
> //  +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 []
> //  |  |  MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 []
> //  |  |  MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 []
> //  |  |  CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 []
> //  |  +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 []
> //  |  |  |  clusterGraphWithComponentsRDD MapPartitionsRDD[28] at 
> reduceByKey at myProgram.scala:1689 []
> //  |  |  |  CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 
> B; DiskSize: 0.0 B
> //  |  |  |  CheckpointRDD[29] at count at myProgram.scala:1701 []
> //  |  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at 
> myProgram.scala:383 []
> //  | |  CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 
> B; DiskSize: 0.0 B
> //  | |  CheckpointRDD[17] at count at myProgram.scala:394 []
> //  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at 
> myProgram.scala:383 []
> // |  CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; 
> DiskSize: 0.0 B
> // |  CheckpointRDD[17] at count at myProgram.scala:394 []
> // this DOES work!
> val myRDD = if (tempRDD2.count() > 0) {
>tempRDD1.
>  map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
>  leftOuterJoin(tempRDD2).
>  map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, 
> a._2._2.getOrElse(1L.
>  leftOuterJoin(tempRDD2).
>  map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, 
> a._2._2.getOrElse(1L.
>  map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), 
> (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if 
> (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))).
>  filter(a => a._2._5 <= 50)
>} else {
>  tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))).
>  filter(a => a._2._5 <= 50)
>}.
>partitionBy(partitioner).
>setName("myRDD").
>

[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats

2015-09-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790859#comment-14790859
 ] 

Joseph K. Bradley commented on SPARK-10602:
---

Yeah, JIRA only allows 2 levels of subtasks  (a long-time annoyance of mine!).  
I'd recommend linking here using "contains."  I'll fix the link for now.

> Univariate statistics as UDAFs: single-pass continuous stats
> 
>
> Key: SPARK-10602
> URL: https://issues.apache.org/jira/browse/SPARK-10602
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>
> See parent JIRA for more details.  This subtask covers statistics for 
> continuous values requiring a single pass over the data, such as min and max.
> This JIRA is an umbrella.  For individual stats, please create and link a new 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-869) Retrofit rest of RDD api to use proper serializer type

2015-09-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-869.
--
Resolution: Done

Going to resolve this as Done; please open a new JIRA if you find specific 
examples where we're using the wrong serializer.

> Retrofit rest of RDD api to use proper serializer type
> --
>
> Key: SPARK-869
> URL: https://issues.apache.org/jira/browse/SPARK-869
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.8.0
>Reporter: Dmitriy Lyubimov
>
> SPARK-826 and SPARK-827 resolved proper serialization support for some RDD 
> method parameters, but not all. 
> This issue is to address the rest of the RDD API and operations. Most of the 
> time this is due to wrapping RDD parameters into a closure, which can only use 
> the closure serializer to communicate with the backend.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied

2015-09-16 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-10640:
-

Assignee: Thomas Graves

> Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied
> --
>
> Key: SPARK-10640
> URL: https://issues.apache.org/jira/browse/SPARK-10640
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> I'm seeing an exception from the spark history server trying to read a 
> history file:
> scala.MatchError: TaskCommitDenied (of class java.lang.String)
> at 
> org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775)
> at 
> org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531)
> at 
> org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10371) Optimize sequential projections

2015-09-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10371:
--
Description: 
In ML pipelines, each transformer/estimator appends new columns to the input 
DataFrame. For example, it might produce a DataFrame with the following columns: 
a, b, c, d, where a is from the raw input, b = udf_b(a), c = udf_c(b), and d = 
udf_d(c). Some UDFs could be expensive. However, if we materialize c and d, then 
udf_b and udf_c are triggered twice, i.e., the value of c is not re-used.

It would be nice to detect this pattern and re-use intermediate values.

{code}
val input = sqlContext.range(10)
val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 2)
output.explain(true)

== Parsed Logical Plan ==
'Project [*,('x * 2) AS y#254]
 Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
  LogicalRDD [id#252L], MapPartitionsRDD[458] at range at <console>:30

== Analyzed Logical Plan ==
id: bigint, x: bigint, y: bigint
Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L]
 Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
  LogicalRDD [id#252L], MapPartitionsRDD[458] at range at <console>:30

== Optimized Logical Plan ==
Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
 LogicalRDD [id#252L], MapPartitionsRDD[458] at range at <console>:30

== Physical Plan ==
TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
 Scan PhysicalRDD[id#252L]

Code Generation: true
input: org.apache.spark.sql.DataFrame = [id: bigint]
output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint]
{code}

  was:
In ML pipelines, each transformer/estimator appends new columns to the input 
DataFrame. For example, it might produce DataFrames like the following columns: 
a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), and d = 
udf_d(c). Some UDFs could be expensive. However, if we materialize c and d, 
udf_b, and udf_c are triggered twice, i.e., value c is not re-used.

It would be nice to detect this pattern and re-use intermediate values.


> Optimize sequential projections
> ---
>
> Key: SPARK-10371
> URL: https://issues.apache.org/jira/browse/SPARK-10371
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>
> In ML pipelines, each transformer/estimator appends new columns to the input 
> DataFrame. For example, it might produce a DataFrame with the following 
> columns: a, b, c, d, where a is from the raw input, b = udf_b(a), c = udf_c(b), 
> and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c 
> and d, then udf_b and udf_c are triggered twice, i.e., the value of c is not re-used.
> It would be nice to detect this pattern and re-use intermediate values.
> {code}
> val input = sqlContext.range(10)
> val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 
> 2)
> output.explain(true)
> == Parsed Logical Plan ==
> 'Project [*,('x * 2) AS y#254]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at <console>:30
> == Analyzed Logical Plan ==
> id: bigint, x: bigint, y: bigint
> Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at <console>:30
> == Optimized Logical Plan ==
> Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
>  LogicalRDD [id#252L], MapPartitionsRDD[458] at range at <console>:30
> == Physical Plan ==
> TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS 
> y#254L]
>  Scan PhysicalRDD[id#252L]
> Code Generation: true
> input: org.apache.spark.sql.DataFrame = [id: bigint]
> output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3489) support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params

2015-09-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3489.
---
Resolution: Won't Fix

Resolving as "Won't Fix" per PR discussion.

> support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params
> --
>
> Key: SPARK-3489
> URL: https://issues.apache.org/jira/browse/SPARK-3489
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Mohit Jaggi
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10644) Applications wait even if free executors are available

2015-09-16 Thread Balagopal Nair (JIRA)
Balagopal Nair created SPARK-10644:
--

 Summary: Applications wait even if free executors are available
 Key: SPARK-10644
 URL: https://issues.apache.org/jira/browse/SPARK-10644
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.5.0
 Environment: RHEL 6.5 64 bit
Reporter: Balagopal Nair


Number of workers: 21
Number of executors: 63

Steps to reproduce:
1. Run 4 jobs each with max cores set to 10
2. The first 3 jobs run with 10 each. (30 executors consumed so far)
3. The 4th job waits even though there are 33 idle executors.

The reason is that a job will not get executors unless 
the total number of EXECUTORS in use is less than the number of WORKERS.

If there are executors available, resources should be allocated to the pending 
job.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4568) Publish release candidates under $VERSION-RCX instead of $VERSION

2015-09-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4568.
---
Resolution: Fixed

We now do this.

> Publish release candidates under $VERSION-RCX instead of $VERSION
> -
>
> Key: SPARK-4568
> URL: https://issues.apache.org/jira/browse/SPARK-4568
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4568) Publish release candidates under $VERSION-RCX instead of $VERSION

2015-09-16 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790945#comment-14790945
 ] 

Josh Rosen edited comment on SPARK-4568 at 9/16/15 7:00 PM:


We now do this. Specifically, we publish RC's under both names.


was (Author: joshrosen):
We now do this.

> Publish release candidates under $VERSION-RCX instead of $VERSION
> -
>
> Key: SPARK-4568
> URL: https://issues.apache.org/jira/browse/SPARK-4568
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab

2015-09-16 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791016#comment-14791016
 ] 

Nicholas Chammas commented on SPARK-4216:
-

Thanks Josh!

> Eliminate duplicate Jenkins GitHub posts from AMPLab
> 
>
> Key: SPARK-4216
> URL: https://issues.apache.org/jira/browse/SPARK-4216
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> * [Real Jenkins | 
> https://github.com/apache/spark/pull/2988#issuecomment-60873361]
> * [Imposter Jenkins | 
> https://github.com/apache/spark/pull/2988#issuecomment-60873366]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10641) skewness and kurtosis support

2015-09-16 Thread Jihong MA (JIRA)
Jihong MA created SPARK-10641:
-

 Summary: skewness and kurtosis support
 Key: SPARK-10641
 URL: https://issues.apache.org/jira/browse/SPARK-10641
 Project: Spark
  Issue Type: New Feature
  Components: ML, SQL
Reporter: Jihong MA


Implementing skewness and kurtosis support based on the following algorithm:
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
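A minimal plain-Scala sketch of the one-pass update step from the linked algorithm
(not the SQL aggregate interface this ticket targets); a partition-merge step
would follow the same reference:

{code}
// Tracks the first four central moments incrementally (Welford-style updates).
case class Moments(n: Long, mean: Double, m2: Double, m3: Double, m4: Double) {
  def update(x: Double): Moments = {
    val n1    = n + 1
    val delta = x - mean
    val dn    = delta / n1
    val dn2   = dn * dn
    val t     = delta * dn * n
    Moments(
      n1,
      mean + dn,
      m2 + t,
      m3 + t * dn * (n1 - 2) - 3 * dn * m2,
      m4 + t * dn2 * (n1 * n1 - 3 * n1 + 3) + 6 * dn2 * m2 - 4 * dn * m3)
  }
  def skewness: Double = math.sqrt(n.toDouble) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n * m4 / (m2 * m2) - 3.0  // excess kurtosis
}

val m = Seq(1.0, 2.0, 2.0, 3.0, 9.0).foldLeft(Moments(0, 0, 0, 0, 0))(_ update _)
// m.skewness, m.kurtosis
{code}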



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context

2015-09-16 Thread Sudarshan Kadambi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790804#comment-14790804
 ] 

Sudarshan Kadambi commented on SPARK-10320:
---

Sure, a function as proposed, which allows the topic, partitions, and offsets to 
be specified in a fine-grained manner, is needed to provide the full flexibility 
we desire (starting at an arbitrary offset within each topic partition). If 
separate DStreams are desired for each topic, do you intend for 
createDirectStream to be called multiple times (with a different subscription 
topic each time), both before and after the streaming context is started?

Also, what kind of defaults did you have in mind? For example, I might require 
the ability to specify new topics after the streaming context is started, but 
might not want the burden of being aware of the partitions within the topic or 
the offsets. I might simply want to default to either the start or the end of 
each partition that exists for that topic.
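For reference, a sketch of the existing fine-grained entry point under discussion
(spark-streaming-kafka; the broker, topic names, and offsets below are
placeholders, and ssc is assumed to be an existing StreamingContext):

{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder broker, topic, and offsets.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val fromOffsets = Map(
  TopicAndPartition("events", 0) -> 0L,
  TopicAndPartition("events", 1) -> 42L)

val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
{code}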

> Kafka Support new topic subscriptions without requiring restart of the 
> streaming context
> 
>
> Key: SPARK-10320
> URL: https://issues.apache.org/jira/browse/SPARK-10320
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Sudarshan Kadambi
>
> Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe 
> from current ones once the streaming context has been started. Restarting the 
> streaming context increases the latency of update handling.
> Consider a streaming application subscribed to n topics. Let's say 1 of the 
> topics is no longer needed in streaming analytics and hence should be 
> dropped. We could do this by stopping the streaming context, removing that 
> topic from the topic list and restarting the streaming context. Since with 
> some DStreams such as DirectKafkaStream, the per-partition offsets are 
> maintained by Spark, we should be able to resume uninterrupted (I think?) 
> from where we left off with a minor delay. However, in instances where 
> expensive state initialization (from an external datastore) may be needed for 
> datasets published to all topics, before streaming updates can be applied to 
> it, it is more convenient to only subscribe to or unsubscribe from the incremental 
> changes to the topic list. Without such a feature, updates go unprocessed for 
> longer than they need to be, thus affecting QoS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10643) Support HDFS urls in spark-submit

2015-09-16 Thread Alan Braithwaite (JIRA)
Alan Braithwaite created SPARK-10643:


 Summary: Support HDFS urls in spark-submit
 Key: SPARK-10643
 URL: https://issues.apache.org/jira/browse/SPARK-10643
 Project: Spark
  Issue Type: New Feature
Reporter: Alan Braithwaite
Priority: Minor


When using mesos with docker and marathon, it would be nice to be able to make 
spark-submit deployable on marathon and have that download a jar from HDFS 
instead of having to package the jar with the docker.

{code}
$ docker run -it docker.example.com/spark:latest 
/usr/local/spark/bin/spark-submit  --class 
com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar 
Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

Although I'm aware that we can run in cluster mode with mesos, we've already 
built some nice tools surrounding marathon for logging and monitoring.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4738) Update the netty-3.x version in spark-assembly-*.jar

2015-09-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4738.
---
Resolution: Incomplete

Resolving as "Incomplete" since this an old issue and it doesn't look like 
there's any action to take here. In newer Spark versions, you should be able to 
use the various user-classpath-first configurationst o work around this.
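For example (a sketch; assumes Spark 1.3+, where these configuration keys exist):

{code}
// Prefer user-provided jars (e.g. a newer netty) over Spark's bundled ones.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")
{code}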

> Update the netty-3.x version in spark-assembly-*.jar
> 
>
> Key: SPARK-4738
> URL: https://issues.apache.org/jira/browse/SPARK-4738
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.1.0
>Reporter: Tobias Pfeiffer
>Priority: Minor
>
> It seems as if the version of akka-remote (2.2.3-shaded-protobuf) that is 
> bundled in the spark-assembly-1.1.1-hadoop2.4.0.jar file pulls in an ancient 
> version of netty, namely io.netty:netty:3.6.6.Final (using the package 
> org.jboss.netty). This means that when using spark-submit, there will always 
> be this netty version on the classpath before any versions added by the user. 
> This may lead to issues with other packages that depend on newer versions and 
> may fail with java.lang.NoSuchMethodError etc. (finagle-http in my case).
> I wonder if it is possible to manually include a newer netty version, like 
> netty-3.8.0.Final.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4087) Only use broadcast for large tasks

2015-09-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen closed SPARK-4087.
-
Resolution: Won't Fix

> Only use broadcast for large tasks
> --
>
> Key: SPARK-4087
> URL: https://issues.apache.org/jira/browse/SPARK-4087
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Davies Liu
>Priority: Critical
>
> After we started broadcasting every task, some regressions were introduced 
> because broadcast is not stable enough.
> So we would like to only use broadcast for large tasks, which will keep the 
> same behaviour as 1.0 for most of the cases. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


