[jira] [Commented] (SPARK-8128) Schema Merging Broken: Dataframe Fails to Recognize Column in Schema

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558088#comment-15558088
 ] 

Hyukjin Kwon commented on SPARK-8128:
-

I am not 100% sure, but I recall seeing a similar issue that was resolved. Could you 
confirm this is still happening in recent versions? - [~brdwrd]

> Schema Merging Broken: Dataframe Fails to Recognize Column in Schema
> 
>
> Key: SPARK-8128
> URL: https://issues.apache.org/jira/browse/SPARK-8128
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0
>Reporter: Brad Willard
>
> I'm loading a folder of about 600 parquet files into one dataframe, so schema 
> merging is involved. There is a bug in the schema merging: if you print the 
> schema it shows all the attributes, but when you run a query and filter on one 
> of those attributes, it errors saying the attribute is not in the schema. The 
> query is incorrectly pushed down to one of the parquet files that does not 
> have that attribute.
> sdf = sql_context.parquet('/parquet/big_data_folder')
> sdf.printSchema()
> root
>  |-- _id: string (nullable = true)
>  |-- addedOn: string (nullable = true)
>  |-- attachment: string (nullable = true)
>  ...
>  |-- items: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- _id: string (nullable = true)
>  |    |    |-- addedOn: string (nullable = true)
>  |    |    |-- authorId: string (nullable = true)
>  |    |    |-- mediaProcessingState: long (nullable = true)
>  |-- mediaProcessingState: long (nullable = true)
>  |-- title: string (nullable = true)
>  |-- key: string (nullable = true)
> sdf.filter(sdf.mediaProcessingState == 3).count()
> causes this exception
> Py4JJavaError: An error occurred while calling o67.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 1106 in stage 4.0 failed 30 times, most recent failure: Lost task 1106.29 in 
> stage 4.0 (TID 70565, XXX): java.lang.IllegalArgumentException: 
> Column [mediaProcessingState] was not found in schema!
> at parquet.Preconditions.checkArgument(Preconditions.java:47)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
> at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
> at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
> at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
> at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
> at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
> at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> 
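
A possible mitigation while this is being confirmed (a sketch only, assuming a 
Spark 1.4+ shell session; shown in Scala, though the report uses PySpark and the 
sqlContext.setConf equivalent works the same way there) is to disable Parquet 
filter pushdown so the row-group filter never checks the missing column:

{code}
// Sketch: evaluate the predicate in Spark instead of pushing it into the
// Parquet row-group filter that raises "Column ... was not found in schema!".
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

val sdf = sqlContext.read.parquet("/parquet/big_data_folder")
sdf.filter(sdf("mediaProcessingState") === 3).count()
{code}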

[jira] [Closed] (SPARK-17540) SparkR array serde cannot work correctly when array length == 0

2016-10-08 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu closed SPARK-17540.
--
Resolution: Won't Fix

> SparkR array serde cannot work correctly when array length == 0
> ---
>
> Key: SPARK-17540
> URL: https://issues.apache.org/jira/browse/SPARK-17540
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SparkR
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>
> SparkR cannot handle array serde when the array length == 0.
> When the length is 0, the R side sets the element type to class("somestring"),
> so the Scala side receives it as a string array; but the array we need to
> transfer may be of another type, which causes problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10427) Spark-sql -f or -e will output some

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10427.
---
Resolution: Not A Problem

> Spark-sql -f or -e will output some
> ---
>
> Key: SPARK-10427
> URL: https://issues.apache.org/jira/browse/SPARK-10427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1
> Environment: Spark 1.4.1 
>Reporter: cen yuhai
>Priority: Minor
>
> We use  spark-sql -f 1.sql  > 1.txt 
> It prints this information in 1.txt:
> spark.sql.parquet.binaryAsString=...
> spark.sql.hive.metastore.version=.
> .etc and so on 
> We don't need this information, and Hive does not print it to the standard 
> output stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16903) nullValue in first field is not respected by CSV source when read

2016-10-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16903.
--
Resolution: Duplicate

[~falaki] I am going to mark this as a duplicate because the PR was merged and 
I am sure we all agree with closing this.

> nullValue in first field is not respected by CSV source when read
> -
>
> Key: SPARK-16903
> URL: https://issues.apache.org/jira/browse/SPARK-16903
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> file:
> {code}
> a,-
> -,10
> {code}
> Query:
> {code}
> create temporary table test(key string, val decimal) 
> using com.databricks.spark.csv 
> options (path "/tmp/hossein2/null.csv", header "false", delimiter ",", 
> nullValue "-");
> {code}
> Result:
> {code}
> select count(*) from test where key is null
> 0
> {code}
> But
> {code}
> select count(*) from test where val is null
> 1
> {code}
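
For reference, a minimal sketch of the equivalent read with the built-in CSV 
source in Spark 2.x (assuming a spark-shell session; the path is the one from 
the report). Once nullValue is respected for the first field, both counts below 
should be 1:

{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("key", StringType),
  StructField("val", DecimalType(10, 2))))

val df = spark.read
  .option("header", "false")
  .option("delimiter", ",")
  .option("nullValue", "-")
  .schema(schema)
  .csv("/tmp/hossein2/null.csv")

df.filter(col("key").isNull).count()   // row "-,10" -> expected 1
df.filter(col("val").isNull).count()   // row "a,-"  -> expected 1
{code}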



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7012) Add support for NOT NULL modifier for column definitions on DDLParser

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-7012.
--
Resolution: Not A Problem

> Add support for NOT NULL modifier for column definitions on DDLParser
> -
>
> Key: SPARK-7012
> URL: https://issues.apache.org/jira/browse/SPARK-7012
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Santiago M. Mola
>Priority: Minor
>  Labels: easyfix
>
> Add support for NOT NULL modifier for column definitions on DDLParser. This 
> would add support for the following syntax:
> CREATE TEMPORARY TABLE (field INTEGER NOT NULL) ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558512#comment-15558512
 ] 

Xiao Li commented on SPARK-9442:


Is it still a problem in the latest branch?

> java.lang.ArithmeticException: / by zero when reading Parquet
> -
>
> Key: SPARK-9442
> URL: https://issues.apache.org/jira/browse/SPARK-9442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: DB Tsai
>
> I am counting how many records are in my nested parquet file, which has this schema:
> {code}
> scala> u1aTesting.printSchema
> root
>  |-- profileId: long (nullable = true)
>  |-- country: string (nullable = true)
>  |-- data: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- videoId: long (nullable = true)
>  |||-- date: long (nullable = true)
>  |||-- label: double (nullable = true)
>  |||-- weight: double (nullable = true)
>  |||-- features: vector (nullable = true)
> {code}
> and the number of records in the nested data array is around 10k, and 
> each of the parquet files is around 600MB. The total size is around 120GB. 
> I am doing a simple count
> {code}
> scala> u1aTesting.count
> parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
> file 
> hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>   at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArithmeticException: / by zero
>   at 
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
>   ... 21 more
> {code}
> BTW, not all the tasks fail; some of them are successful. 
> Another note: by explicitly looping through the data to count, it works.
> {code}
> sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 
> 1L).reduce((x, y) => x + y) 
> {code}
> I think maybe some metadata in the parquet files is corrupted. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558474#comment-15558474
 ] 

Cody Koeninger commented on SPARK-17344:


I think this is premature until you have a fully operational battlestation, er, 
structured stream, that has all the necessary features for 0.10

Regarding the conversation with Michael about possibly using the kafka protocol 
directly as a way to work around the differences between 0.8 and 0.10, please 
don't consider that.  Every kafka consumer implementation I've ever used has 
bugs, and we don't need to spend time writing another buggy one.  

By contrast, writing a streaming source shim around the existing simple 
consumer-based 0.8 Spark RDD would be a weekend project; it just wouldn't have 
stuff like SSL, dynamic topics, or offset committing.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10703) Physical filter operators should replace the general AND/OR/equality/etc with a special version that treats null as false

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558509#comment-15558509
 ] 

Xiao Li commented on SPARK-10703:
-

The problem has been resolved, I think. Try the latest branch with:
{noformat}
val df = Seq(("moose","ice"), (null,"fire")).toDF("animals", "elements")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (value: String) => value.length > 2)
df.filter($"animals".rlike(".*")).filter(callUDF("simpleUDF", $"animals")).show()
{noformat}

> Physical filter operators should replace the general AND/OR/equality/etc with 
> a special version that treats null as false
> -
>
> Key: SPARK-10703
> URL: https://issues.apache.org/jira/browse/SPARK-10703
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Mingyu Kim
>
> {noformat}
> val df = Seq(("moose","ice"), (null,"fire")).toDF("animals", "elements")
> df.filter($"animals".rlike(".*"))
>   .filter(callUDF({(value: String) => value.length > 2}, BooleanType, 
> $"animals"))
>   .collect()
> {noformat}
> This code throws an NPE because:
> * Catalyst combines the filters with an AND
> * the first filter's predicate returns null on the first input
> * the second filter tries to read the length of that null
> This feels weird. Reading that code, I wouldn't expect null to be passed to 
> the second filter. Even weirder is that if you call collect() after the first 
> filter you won't see nulls, and if you write the data to disk and reread it, 
> the NPE won't happen.
> After the discussion on the dev list, [~rxin] suggested,
> {quote}
> we can add a rule for the physical filter operator to replace the general 
> AND/OR/equality/etc with a special version that treats null as false. This 
> rule needs to be carefully written because it should only apply to subtrees 
> of AND/OR/equality/etc (e.g. it shouldn't rewrite children of isnull).
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10703) Physical filter operators should replace the general AND/OR/equality/etc with a special version that treats null as false

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10703.
---
Resolution: Not A Problem

> Physical filter operators should replace the general AND/OR/equality/etc with 
> a special version that treats null as false
> -
>
> Key: SPARK-10703
> URL: https://issues.apache.org/jira/browse/SPARK-10703
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Mingyu Kim
>
> {noformat}
> val df = Seq(("moose","ice"), (null,"fire")).toDF("animals", "elements")
> df.filter($"animals".rlike(".*"))
>   .filter(callUDF({(value: String) => value.length > 2}, BooleanType, 
> $"animals"))
>   .collect()
> {noformat}
> This code throws an NPE because:
> * Catalyst combines the filters with an AND
> * the first filter's predicate returns null on the first input
> * the second filter tries to read the length of that null
> This feels weird. Reading that code, I wouldn't expect null to be passed to 
> the second filter. Even weirder is that if you call collect() after the first 
> filter you won't see nulls, and if you write the data to disk and reread it, 
> the NPE won't happen.
> After the discussion on the dev list, [~rxin] suggested,
> {quote}
> we can add a rule for the physical filter operator to replace the general 
> AND/OR/equality/etc with a special version that treats null as false. This 
> rule needs to be carefully written because it should only apply to subtrees 
> of AND/OR/equality/etc (e.g. it shouldn't rewrite children of isnull).
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets

2016-10-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558506#comment-15558506
 ] 

Cody Koeninger commented on SPARK-17812:


So I'm willing to do this work, mostly because I've already done it, but there 
are some user interface issues here that need to get figured out.

You already chose the name "startingOffset" for specifying the equivalent of 
auto.offset.reset.  Now we're looking at actually adding starting offsets.  
Furthermore, it should be possible to specify starting offsets for some 
partitions, while relying on the equivalent of auto.offset.reset for other 
unspecified ones (the existing DStream does this).

What are you expecting configuration of this to look like?  I can see a couple 
of options:

1. Try to cram everything into startingOffset with some horrible string-based 
DSL
2. Have a separate option for specifying starting offsets for real, with a name 
that makes it clear what it is, yet doesn't use "startingoffset".  As for the 
value, I guess in JSON form of some kind? { "topicfoo" : { "0": 1234, "1": 4567 }}

Somewhat related is that Assign needs a way of specifying topicpartitions.

As far as the idea to seek back X offsets, I think it'd be better to look at 
offset time indexing.
If you are going to do the X offsets back idea, the offsets -1L and -2L already 
have special meaning, so it's going to be kind of confusing to allow negative 
numbers in an interface that is specifying offsets.
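
To make option 2 concrete, a rough sketch (purely illustrative; not the eventual 
option names or API, and assuming kafka-clients on the classpath for 
TopicPartition) of modeling per-partition starting offsets with a reset-policy 
fallback for unlisted partitions:

{code}
import org.apache.kafka.common.TopicPartition

// Illustrative model only: explicit offsets for some partitions, plus an
// auto.offset.reset-like fallback for the partitions that are not listed.
sealed trait StartingOffsets
case object EarliestOffsets extends StartingOffsets
case object LatestOffsets extends StartingOffsets
case class SpecificOffsets(
    offsets: Map[TopicPartition, Long],
    fallback: StartingOffsets) extends StartingOffsets

val starting = SpecificOffsets(
  Map(new TopicPartition("topicfoo", 0) -> 1234L,
      new TopicPartition("topicfoo", 1) -> 4567L),
  fallback = LatestOffsets)
{code}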


> More granular control of starting offsets
> -
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek back {{X}} offsets in the stream from the moment the query starts
>  - seek to user specified offsets



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11428) Schema Merging Broken for Some Queries

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558095#comment-15558095
 ] 

Hyukjin Kwon commented on SPARK-11428:
--

How about https://issues.apache.org/jira/browse/SPARK-8128 ?

> Schema Merging Broken for Some Queries
> --
>
> Key: SPARK-11428
> URL: https://issues.apache.org/jira/browse/SPARK-11428
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.5.1
> Environment: AWS,
>Reporter: Brad Willard
>  Labels: dataframe, parquet, pyspark, schema, sparksql
>
> I have data being written into parquet format via spark streaming. The data 
> can change slightly so schema merging is required. I load a dataframe like 
> this
> {code}
> urls = [
> "/streaming/parquet/events/key=2015-10-30*",
> "/streaming/parquet/events/key=2015-10-29*"
> ]
> sdf = sql_context.read.option("mergeSchema", "true").parquet(*urls)
> sdf.registerTempTable('events')
> {code}
> If I print the schema you can see the contested column
> {code}
> sdf.printSchema()
> root
>  |-- _id: string (nullable = true)
> ...
>  |-- d__device_s: string (nullable = true)
>  |-- d__isActualPageLoad_s: string (nullable = true)
>  |-- d__landing_s: string (nullable = true)
>  |-- d__lang_s: string (nullable = true)
>  |-- d__os_s: string (nullable = true)
>  |-- d__performance_i: long (nullable = true)
>  |-- d__product_s: string (nullable = true)
>  |-- d__refer_s: string (nullable = true)
>  |-- d__rk_i: long (nullable = true)
>  |-- d__screen_s: string (nullable = true)
>  |-- d__submenuName_s: string (nullable = true)
> {code}
> The column that's in one but not the other file is  d__product_s
> So I'm able to run this query and it works fine.
> {code}
> sql_context.sql('''
> select 
> distinct(d__product_s) 
> from 
> events
> where 
> n = 'view'
> ''').collect()
> [Row(d__product_s=u'website'),
>  Row(d__product_s=u'store'),
>  Row(d__product_s=None),
>  Row(d__product_s=u'page')]
> {code}
> However if I instead use that column in the where clause things break.
> {code}
> sql_context.sql('''
> select 
> * 
> from 
> events
> where 
> n = 'view' and d__product_s = 'page'
> ''').take(1)
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
>   6 where
>   7 n = 'frontsite_view' and d__product_s = 'page'
> > 8 ''').take(1)
> /root/spark/python/pyspark/sql/dataframe.pyc in take(self, num)
> 303 with SCCallSiteSync(self._sc) as css:
> 304 port = 
> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(
> --> 305 self._jdf, num)
> 306 return list(_load_from_socket(port, 
> BatchedSerializer(PickleSerializer(
> 307 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /root/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
>  34 def deco(*a, **kw):
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
>  38 s = e.java_exception.toString()
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 15.0 failed 30 times, most recent failure: Lost task 0.29 in stage 
> 15.0 (TID 6536, 10.X.X.X): java.lang.IllegalArgumentException: Column 
> [d__product_s] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> 

[jira] [Commented] (SPARK-10044) AnalysisException in resolving reference for sorting with aggregation

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558470#comment-15558470
 ] 

Xiao Li commented on SPARK-10044:
-

This has been resolved, at least in Spark 2.0. Thus, it should not be a 
problem now. Let me close it. Thanks!

> AnalysisException in resolving reference for sorting with aggregation
> -
>
> Key: SPARK-10044
> URL: https://issues.apache.org/jira/browse/SPARK-10044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> Unit test as:
> {code}
> withTempTable("mytable") {
>   sqlContext.sparkContext.parallelize(1 to 10).map(i => (i, i.toString))
> .toDF("key", "value")
> .registerTempTable("mytable")
>   checkAnswer(sql(
> """select max(value) from mytable group by key % 2
>   |order by max(concat(value,",", key)), min(substr(value, 0, 4))
>   |""".stripMargin), Row("8") :: Row("9") :: Nil)
> }
> {code}
> Exception like:
> {code}
> cannot resolve '_aggOrdering' given input columns _c0, _aggOrdering, 
> _aggOrdering;
> org.apache.spark.sql.AnalysisException: cannot resolve '_aggOrdering' given 
> input columns _c0, _aggOrdering, _aggOrdering;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10044) AnalysisException in resolving reference for sorting with aggregation

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10044.
---
Resolution: Not A Problem

> AnalysisException in resolving reference for sorting with aggregation
> -
>
> Key: SPARK-10044
> URL: https://issues.apache.org/jira/browse/SPARK-10044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> Unit test as:
> {code}
> withTempTable("mytable") {
>   sqlContext.sparkContext.parallelize(1 to 10).map(i => (i, i.toString))
> .toDF("key", "value")
> .registerTempTable("mytable")
>   checkAnswer(sql(
> """select max(value) from mytable group by key % 2
>   |order by max(concat(value,",", key)), min(substr(value, 0, 4))
>   |""".stripMargin), Row("8") :: Row("9") :: Nil)
> }
> {code}
> Exception like:
> {code}
> cannot resolve '_aggOrdering' given input columns _c0, _aggOrdering, 
> _aggOrdering;
> org.apache.spark.sql.AnalysisException: cannot resolve '_aggOrdering' given 
> input columns _c0, _aggOrdering, _aggOrdering;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5511) [SQL] Possible optimisations for predicate pushdowns from Spark SQL to Parquet

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557667#comment-15557667
 ] 

Hyukjin Kwon edited comment on SPARK-5511 at 10/8/16 4:52 PM:
--

1. I agree it needs a change on Parquet.

2. We already supported this before via a user-defined filter, but it was removed 
due to the performance cost of filtering record-by-record. There is now an attempt 
to add this back using combinations of OR operators. See SPARK-17091


was (Author: hyukjin.kwon):
1. I agree it needs a change on Spark.

2. We already supported this before via a user-defined filter, but it was removed 
due to the performance cost of filtering record-by-record. There is now an attempt 
to add this back using combinations of OR operators. See SPARK-17091
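
For item 2, a rough sketch of how an IN predicate can be expressed today as 
nested ORs with Parquet's filter2 API (the column name and values are 
illustrative assumptions):

{code}
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}
import org.apache.parquet.io.api.Binary

// name IN ('a', 'b', 'c') on a binary (string) column, rewritten as nested ORs.
val column = FilterApi.binaryColumn("name")
val values = Seq("a", "b", "c").map(Binary.fromString)
val inAsOrs: FilterPredicate = values
  .map(v => FilterApi.eq(column, v): FilterPredicate)
  .reduce((l, r) => FilterApi.or(l, r))
{code}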

> [SQL] Possible optimisations for predicate pushdowns from Spark SQL to Parquet
> --
>
> Key: SPARK-5511
> URL: https://issues.apache.org/jira/browse/SPARK-5511
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Mick Davies
>Priority: Minor
>
> The following changes could make predicate pushdown more effective under 
> certain conditions, which are not uncommon.
> 1. Parquet predicate evaluation does not use dictionary compression 
> information, furthermore it circumvents dictionary decoding optimisations 
> (https://issues.apache.org/jira/browse/PARQUET-36). This means predicates are 
> re-evaluated repeatedly for the same Strings, and also Binary->String 
> conversions are repeated. This is a change purely on the Parquet side.
> 2. Support IN clauses in predicate pushdown. This requires changes to Parquet 
> and then subsequently in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10545) HiveMetastoreTypes.toMetastoreType should handle interval type

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558489#comment-15558489
 ] 

Xiao Li commented on SPARK-10545:
-

Neither Hive nor Spark supports INTERVAL as a column data type. Thus, it 
should not be a bug, right? 

> HiveMetastoreTypes.toMetastoreType should handle interval type
> --
>
> Key: SPARK-10545
> URL: https://issues.apache.org/jira/browse/SPARK-10545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Minor
>
> We need to handle interval type at 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L946-L965.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7012) Add support for NOT NULL modifier for column definitions on DDLParser

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558466#comment-15558466
 ] 

Xiao Li commented on SPARK-7012:


Since 2.0, we have a native Parser. Thus, this has been resolved. Thanks!

> Add support for NOT NULL modifier for column definitions on DDLParser
> -
>
> Key: SPARK-7012
> URL: https://issues.apache.org/jira/browse/SPARK-7012
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Santiago M. Mola
>Priority: Minor
>  Labels: easyfix
>
> Add support for NOT NULL modifier for column definitions on DDLParser. This 
> would add support for the following syntax:
> CREATE TEMPORARY TABLE (field INTEGER NOT NULL) ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10972) UDFs in SQL joins

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558577#comment-15558577
 ] 

Xiao Li commented on SPARK-10972:
-

There is a workaround: you can apply the filter on top of the join, as sketched 
below. The performance might not be as good as treating it as a join condition, 
though.
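
A minimal sketch of that workaround, assuming Spark 2.x; the column names, the 
coarse join key, and euclideanDistance (here just an absolute difference) are 
illustrative stand-ins:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("udf-join-filter").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a", 0.10), (2, "b", 0.95)).toDF("id", "grp", "x")
val right = Seq((10, "a", 0.12), (20, "b", 0.50)).toDF("rid", "grp", "y")

// Stand-in for a real distance UDF.
val euclideanDistance = udf((a: Double, b: Double) => math.abs(a - b))

// Join on a coarse key first, then apply the UDF as a filter on top of the join.
val result = left.join(right, "grp").filter(euclideanDistance($"x", $"y") < 0.1)
result.show()
{code}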

> UDFs in SQL joins
> -
>
> Key: SPARK-10972
> URL: https://issues.apache.org/jira/browse/SPARK-10972
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Michael Malak
>
> Currently expressions used to .join() in DataFrames are limited to column 
> names plus the operators exposed in org.apache.spark.sql.Column.
> It would be nice to be able to .join() based on a UDF, such as, say, 
> euclideanDistance(col1, col2) < 0.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7097) Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-7097.
--
Resolution: Won't Fix

> Partitioned tables should only consider referred partitions in query during 
> size estimation for checking against autoBroadcastJoinThreshold
> ---
>
> Key: SPARK-7097
> URL: https://issues.apache.org/jira/browse/SPARK-7097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1
>Reporter: Yash Datta
>
> Currently, when deciding whether to create a HashJoin or a ShuffleHashJoin, 
> the size estimation of the partitioned tables involved considers the size of 
> the entire table. This results in many query plans using shuffle hash joins, 
> where in fact only a small number of partitions may be referenced by the 
> actual query (due to additional filters), and hence these could be run using 
> a BroadcastHash join.
> The query plan should consider the size of only the referred partitions in 
> such cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7097) Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558718#comment-15558718
 ] 

Xiao Li commented on SPARK-7097:


This will be resolved in the ongoing CBO work. Thus, closing it now. Thanks!

> Partitioned tables should only consider referred partitions in query during 
> size estimation for checking against autoBroadcastJoinThreshold
> ---
>
> Key: SPARK-7097
> URL: https://issues.apache.org/jira/browse/SPARK-7097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1
>Reporter: Yash Datta
>
> Currently, when deciding whether to create a HashJoin or a ShuffleHashJoin, 
> the size estimation of the partitioned tables involved considers the size of 
> the entire table. This results in many query plans using shuffle hash joins, 
> where in fact only a small number of partitions may be referenced by the 
> actual query (due to additional filters), and hence these could be run using 
> a BroadcastHash join.
> The query plan should consider the size of only the referred partitions in 
> such cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2016-10-08 Thread Ioana Delaney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558762#comment-15558762
 ] 

Ioana Delaney commented on SPARK-17626:
---

[~mikewzh] Thank you. Yes, having informational RI constraints available in 
Spark will open many opportunities for optimizations. Star schema detection is 
just one of them. Our team here at IBM already started some initial design 
discussions in this direction. We are hoping to have something more concrete 
soon. 


> TPC-DS performance improvements using star-schema heuristics
> 
>
> Key: SPARK-17626
> URL: https://issues.apache.org/jira/browse/SPARK-17626
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Priority: Critical
> Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schemas, which are star schemas with 
> dimensions linking to dimensions. A star schema consists of a fact table 
> referencing a number of dimension tables. The fact table holds the main data 
> about a business. A dimension table, usually a smaller table, describes data 
> reflecting a dimension/attribute of the business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans for large fact table joins. Manually rewriting 
> some of the queries into selective fact-dimension joins resulted in 
> significant performance improvements. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. Performance 
> testing using the *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed                     99
> Failed                      0
> Total q time (s)       14,962
> Max time                1,467
> Min time                    3
> Mean time                 145
> Geomean                    44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)               -19%
> Mean time improved (%)                -19%
> Geomean improved (%)                  -24%
> End to end improved (seconds)       -3,603
> Number of queries improved (>10%)       45
> Number of queries degraded (>10%)        6
> Number of queries unchanged             48
> Top 10 queries improved (%)           -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base 
> tables. Selectivity hint is optional and it was not used in the above TPC-DS 
> tests. 
> \\



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10101) Spark JDBC writer mapping String to TEXT or VARCHAR

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558901#comment-15558901
 ] 

Xiao Li commented on SPARK-10101:
-

This has been resolved in the master. If you still hit any bug, please open a 
new JIRA.
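
One possible approach, sketched against the JdbcDialect API (the URL prefix and 
the 255 length are assumptions for illustration, not a recommendation):

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Map StringType to a bounded VARCHAR for targets that lack a TEXT type.
object VarcharDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(255)", Types.VARCHAR))
    case _          => None
  }
}

JdbcDialects.registerDialect(VarcharDialect)
{code}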

> Spark JDBC writer mapping String to TEXT or VARCHAR
> ---
>
> Key: SPARK-10101
> URL: https://issues.apache.org/jira/browse/SPARK-10101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Rama Mullapudi
>
> Currently the JDBC writer maps the String data type to TEXT on the database, 
> but VARCHAR is the ANSI SQL standard, and some older databases like Oracle, 
> DB2, Teradata, etc. do not support TEXT as a data type.
> Since VARCHAR needs a max length to be specified, and different databases 
> support different max values, what would be the best way to implement this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10101) Spark JDBC writer mapping String to TEXT or VARCHAR

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10101.
---
Resolution: Not A Problem

> Spark JDBC writer mapping String to TEXT or VARCHAR
> ---
>
> Key: SPARK-10101
> URL: https://issues.apache.org/jira/browse/SPARK-10101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Rama Mullapudi
>
> Currently the JDBC writer maps the String data type to TEXT on the database, 
> but VARCHAR is the ANSI SQL standard, and some older databases like Oracle, 
> DB2, Teradata, etc. do not support TEXT as a data type.
> Since VARCHAR needs a max length to be specified, and different databases 
> support different max values, what would be the best way to implement this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17837) Disaster recovery of offsets from WAL

2016-10-08 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-17837:
---
Summary: Disaster recovery of offsets from WAL  (was: Disaster recover of 
offsets from WAL)

> Disaster recovery of offsets from WAL
> -
>
> Key: SPARK-17837
> URL: https://issues.apache.org/jira/browse/SPARK-17837
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Cody Koeninger
>
> "The SQL offsets are stored in a WAL at $checkpointLocation/offsets/$batchId. 
> As reynold suggests though, we should change this to use a less opaque 
> format."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17815) Report committed offsets

2016-10-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558528#comment-15558528
 ] 

Cody Koeninger commented on SPARK-17815:


So if you start committing offsets to kafka, there are going to be potentially 
three places offsets are stored:

1.  structured WAL
2. kafka commit topic
3. downstream store

It's going to be easy to get confused as to what the source of truth is.


> Report committed offsets
> 
>
> Key: SPARK-17815
> URL: https://issues.apache.org/jira/browse/SPARK-17815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Since we manage our own offsets, we have turned off auto-commit.  However, 
> this means that external tools are not able to report on how far behind a 
> given streaming job is.  When the user manually gives us a group.id, we 
> should report back to it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17837) Disaster recover of offsets from WAL

2016-10-08 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-17837:
--

 Summary: Disaster recover of offsets from WAL
 Key: SPARK-17837
 URL: https://issues.apache.org/jira/browse/SPARK-17837
 Project: Spark
  Issue Type: Sub-task
Reporter: Cody Koeninger


"The SQL offsets are stored in a WAL at $checkpointLocation/offsets/$batchId. 
As reynold suggests though, we should change this to use a less opaque format."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14212) Add configuration element for --packages option

2016-10-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-14212:

Component/s: (was: Spark Shell)
 (was: Spark Core)
 Documentation

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>  Labels: config, fun, happy, pants, spark-shell
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10804) "LOCAL" in LOAD DATA LOCAL INPATH means "remote"

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558532#comment-15558532
 ] 

Xiao Li commented on SPARK-10804:
-

In Spark 2.0, we rewrote the whole part, especially the load command and the 
write path. If you still have an issue, could you open a new JIRA to document 
it? Thanks!

> "LOCAL" in LOAD DATA LOCAL INPATH means "remote"
> 
>
> Key: SPARK-10804
> URL: https://issues.apache.org/jira/browse/SPARK-10804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Antonio Piccolboni
>
> Connecting with a remote thriftserver with a custom JDBC client or beeline, 
> load data local inpath fails. Hiveserver2 docs explain in a quick comment 
> that local now means local to the server. I think this is just a 
> rationalization for a bug. When a user types "local" 
> # it needs to be local to him, not some server 
> # Failing 1., one needs to have a way to determine what local means and 
> create a "local" item under the new definition. 
> With the thriftserver, I have a host to connect to, but I don't have any way 
> to create a file local to that host, at least in Spark. It may not be 
> desirable to create user directories on the thriftserver host or to run file 
> transfer services like scp. Moreover, it appears that this syntax is unique 
> to Hive and Spark but its origin can be traced to  LOAD DATA LOCAL INFILE in 
> Oracle and was adopted by mysql. In the latter docs we can read "If LOCAL is 
> specified, the file is read by the client program on the client host and sent 
> to the server. The file can be given as a full path name to specify its exact 
> location. If given as a relative path name, the name is interpreted relative 
> to the directory in which the client program was started". This is not to say 
> that the spark or hive teams are bound to what Oracle and Mysql do, but to 
> support the idea that the meaning of LOCAL is settled. For instance, the 
> Impala documentation says: "Currently, the Impala LOAD DATA statement only 
> imports files from HDFS, not from the local filesystem. It does not support 
> the LOCAL keyword of the Hive LOAD DATA statement." I think this is a better 
> solution. The way things are in thriftserver, I developed a client under the 
> assumption that I could use LOAD DATA LOCAL INPATH and all tests were 
> passing in standalone mode, only to find with the first distributed test that 
> # LOCAL means "local to server", a.k.a. "remote"
> # INSERT INTO ... VALUES is not supported
> # There is really no workaround unless one assumes access to whatever data 
> store Spark is running against, like HDFS, and that the user can upload data to 
> it. 
> In the space of workarounds it is not terrible, but if you are trying to 
> write a self-contained spark package, that's a defeat and makes writing tests 
> particularly hard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10804) "LOCAL" in LOAD DATA LOCAL INPATH means "remote"

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10804.
---
Resolution: Not A Problem

> "LOCAL" in LOAD DATA LOCAL INPATH means "remote"
> 
>
> Key: SPARK-10804
> URL: https://issues.apache.org/jira/browse/SPARK-10804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Antonio Piccolboni
>
> Connecting with a remote thriftserver with a custom JDBC client or beeline, 
> load data local inpath fails. Hiveserver2 docs explain in a quick comment 
> that local now means local to the server. I think this is just a 
> rationalization for a bug. When a user types "local" 
> # it needs to be local to him, not some server 
> # Failing 1., one needs to have a way to determine what local means and 
> create a "local" item under the new definition. 
> With the thriftserver, I have a host to connect to, but I don't have any way 
> to create a file local to that host, at least in Spark. It may not be 
> desirable to create user directories on the thriftserver host or to run file 
> transfer services like scp. Moreover, it appears that this syntax is unique 
> to Hive and Spark but its origin can be traced to  LOAD DATA LOCAL INFILE in 
> Oracle and was adopted by mysql. In the latter docs we can read "If LOCAL is 
> specified, the file is read by the client program on the client host and sent 
> to the server. The file can be given as a full path name to specify its exact 
> location. If given as a relative path name, the name is interpreted relative 
> to the directory in which the client program was started". This is not to say 
> that the spark or hive teams are bound to what Oracle and Mysql do, but to 
> support the idea that the meaning of LOCAL is settled. For instance, the 
> Impala documentation says: "Currently, the Impala LOAD DATA statement only 
> imports files from HDFS, not from the local filesystem. It does not support 
> the LOCAL keyword of the Hive LOAD DATA statement." I think this is a better 
> solution. The way things are in thriftserver, I developed a client under the 
> assumption that I could use LOAD DATA LOCAL INPATH and all tests were 
> passing in standalone mode, only to find with the first distributed test that 
> # LOCAL means "local to server", a.k.a. "remote"
> # INSERT INTO ... VALUES is not supported
> # There is really no workaround unless one assumes access to whatever data 
> store Spark is running against, like HDFS, and that the user can upload data to 
> it. 
> In the space of workarounds it is not terrible, but if you are trying to 
> write a self-contained spark package, that's a defeat and makes writing tests 
> particularly hard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14212) Add configuration element for --packages option

2016-10-08 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558540#comment-15558540
 ] 

holdenk commented on SPARK-14212:
-

So I think this would be a good option to document for Python users, although 
the root CSV issue has been fixed by including the CSV format inside of Spark 
itself. You configure a package using `spark.jars.packages` in 
spark-defaults.conf.

If someone is interested in adding this to the documentation, 
`docs/configuration.md` would probably be a good place to document the 
`spark.jars.packages` configuration value (and you can see how it is used by 
looking at SparkSubmit & SparkSubmitArguments together).
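
For reference, a minimal sketch of what that entry could look like in 
conf/spark-defaults.conf; the Maven coordinate below just reuses the spark-csv 
example from the description, any --packages coordinate works the same way:

{code}
# conf/spark-defaults.conf -- equivalent of passing --packages on the command line
spark.jars.packages  com.databricks:spark-csv_2.10:1.4.0
{code}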

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>  Labels: config, fun, happy, pants, spark-shell
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14212) Add configuration element for --packages option

2016-10-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-14212:

Labels: config starter  (was: config fun happy pants spark-shell)

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>  Labels: config, starter
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14212) Add configuration element for --packages option

2016-10-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-14212:

Priority: Trivial  (was: Major)

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>Priority: Trivial
>  Labels: config, starter
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers

2016-10-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558557#comment-15558557
 ] 

Cody Koeninger commented on SPARK-4960:
---

Is this idea pretty much dead at this point? It seems like most attention has 
moved off of receiver-based dstream.

> Interceptor pattern in receivers
> 
>
> Key: SPARK-4960
> URL: https://issues.apache.org/jira/browse/SPARK-4960
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Tathagata Das
>
> Sometimes it is good to intercept a message received through a receiver and 
> modify / do something with the message before it is stored into Spark. This 
> is often referred to as the interceptor pattern. There should be a general 
> way to specify an interceptor function that gets applied to all receivers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10860) Bivariate Statistics: Chi-Squared independence test

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-10860:

Assignee: (was: Jihong MA)

> Bivariate Statistics: Chi-Squared independence test
> ---
>
> Key: SPARK-10860
> URL: https://issues.apache.org/jira/browse/SPARK-10860
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Jihong MA
>
> Pearson's chi-squared independence test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10860) Bivariate Statistics: Chi-Squared independence test

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-10860:

Component/s: (was: SQL)

> Bivariate Statistics: Chi-Squared independence test
> ---
>
> Key: SPARK-10860
> URL: https://issues.apache.org/jira/browse/SPARK-10860
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Jihong MA
>Assignee: Jihong MA
>
> Pearson's chi-squared independence test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared goodness of fit test

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-10646:

Assignee: (was: Jihong MA)

> Bivariate Statistics: Pearson's Chi-Squared goodness of fit test
> 
>
> Key: SPARK-10646
> URL: https://issues.apache.org/jira/browse/SPARK-10646
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> Pearson's chi-squared goodness of fit test for observed against the expected 
> distribution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10972) UDFs in SQL joins

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558581#comment-15558581
 ] 

Xiao Li edited comment on SPARK-10972 at 10/8/16 7:34 PM:
--

Also try to use the SQL interface? It can be mixed with the Dataset/DataFrame 
APIs.


was (Author: smilegator):
Also try to use the SQL interface?

> UDFs in SQL joins
> -
>
> Key: SPARK-10972
> URL: https://issues.apache.org/jira/browse/SPARK-10972
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Michael Malak
>
> Currently expressions used to .join() in DataFrames are limited to column 
> names plus the operators exposed in org.apache.spark.sql.Column.
> It would be nice to be able to .join() based on a UDF, such as, say, 
> euclideanDistance(col1, col2) < 0.1.
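
For illustration, a minimal sketch of the SQL-interface route suggested in the 
comment above, assuming Spark 2.0+ with a SparkSession named spark and two 
DataFrames dfA and dfB; the table/column names and the distance function body 
are hypothetical stand-ins:

{code}
// register a UDF and use it in the join condition through the SQL interface
spark.udf.register("euclideanDistance", (x: Double, y: Double) => math.abs(x - y))
dfA.createOrReplaceTempView("a")
dfB.createOrReplaceTempView("b")
val joined = spark.sql(
  "SELECT * FROM a JOIN b ON euclideanDistance(a.col1, b.col2) < 0.1")
{code}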



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10972) UDFs in SQL joins

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558581#comment-15558581
 ] 

Xiao Li commented on SPARK-10972:
-

Also try to use the SQL interface?

> UDFs in SQL joins
> -
>
> Key: SPARK-10972
> URL: https://issues.apache.org/jira/browse/SPARK-10972
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Michael Malak
>
> Currently expressions used to .join() in DataFrames are limited to column 
> names plus the operators exposed in org.apache.spark.sql.Column.
> It would be nice to be able to .join() based on a UDF, such as, say, 
> euclideanDistance(col1, col2) < 0.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10933) Spark SQL Joins should have option to fail query when row multiplication is encountered

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10933.
---
Resolution: Won't Fix

> Spark SQL Joins should have option to fail query when row multiplication is 
> encountered
> ---
>
> Key: SPARK-10933
> URL: https://issues.apache.org/jira/browse/SPARK-10933
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephen Link
>Priority: Minor
>
> When constructing spark sql queries, we commonly run into scenarios where 
> users have inadvertently caused a cartesian product/row expansion. It is 
> sometimes possible to detect this in advance with separate queries, but it 
> would be far more ideal if it was possible to have a setting that disallowed 
> join keys showing up multiple times on both sides of a join operation.
> This setting would belong in SQLConf. The functionality could likely be 
> implemented by forcing a sorted shuffle, then checking for duplication on the 
> streamed results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10933) Spark SQL Joins should have option to fail query when row multiplication is encountered

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558625#comment-15558625
 ] 

Xiao Li commented on SPARK-10933:
-

Now we have a conf, `spark.sql.crossJoin.enabled`. 

Let me close this now. If you still think an extra conf is needed, please 
reopen it. Thanks!
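
For reference, a minimal sketch of toggling that conf, assuming a Spark 2.0+ 
SparkSession named spark; by default it is false, so queries that produce a 
cartesian product fail unless cross joins are explicitly enabled:

{code}
// opt in to cross joins / cartesian products explicitly
spark.conf.set("spark.sql.crossJoin.enabled", "true")

// or per session via SQL
spark.sql("SET spark.sql.crossJoin.enabled=true")
{code}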

> Spark SQL Joins should have option to fail query when row multiplication is 
> encountered
> ---
>
> Key: SPARK-10933
> URL: https://issues.apache.org/jira/browse/SPARK-10933
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephen Link
>Priority: Minor
>
> When constructing spark sql queries, we commonly run into scenarios where 
> users have inadvertently caused a cartesian product/row expansion. It is 
> sometimes possible to detect this in advance with separate queries, but it 
> would be far more ideal if it was possible to have a setting that disallowed 
> join keys showing up multiple times on both sides of a join operation.
> This setting would belong in SQLConf. The functionality could likely be 
> implemented by forcing a sorted shuffle, then checking for duplication on the 
> streamed results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10427) Spark-sql -f or -e will output some

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558525#comment-15558525
 ] 

Xiao Li commented on SPARK-10427:
-

This is not an issue since 2.0. Thanks!

> Spark-sql -f or -e will output some
> ---
>
> Key: SPARK-10427
> URL: https://issues.apache.org/jira/browse/SPARK-10427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1
> Environment: Spark 1.4.1 
>Reporter: cen yuhai
>Priority: Minor
>
> We use: spark-sql -f 1.sql > 1.txt 
> It will print this information in 1.txt: 
> spark.sql.parquet.binaryAsString=...
> spark.sql.hive.metastore.version=.
> etc. 
> We don't need this information, and Hive does not print it to the standard 
> output stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6649) DataFrame created through SQLContext.jdbc() failed if columns table must be quoted

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-6649.

Resolution: Fixed

> DataFrame created through SQLContext.jdbc() failed if columns table must be 
> quoted
> --
>
> Key: SPARK-6649
> URL: https://issues.apache.org/jira/browse/SPARK-6649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Frédéric Blanc
>Priority: Minor
>
> If I want to import the contents of a table from Oracle that contains a 
> column named COMMENT (a reserved keyword), I cannot use a DataFrame that 
> maps all the columns of this table.
> {code:title=ddl.sql|borderStyle=solid}
> CREATE TABLE TEST_TABLE (
> "COMMENT" VARCHAR2(10)
> );
> {code}
> {code:title=test.java|borderStyle=solid}
> SQLContext sqlContext = ...
> DataFrame df = sqlContext.jdbc(databaseURL, "TEST_TABLE");
> df.rdd();   // => failed if the table contains a column with a reserved 
> keyword
> {code}
> The same problem can be encountered if a reserved keyword is used as the 
> table name.
> The JDBCRDD Scala class could be improved if the columnList initializer 
> appended double quotes around each column (line 225).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets

2016-10-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558572#comment-15558572
 ] 

Cody Koeninger commented on SPARK-17147:


I talked with Sean in person about this, and think there's a way to move 
forward.  I'll start hacking on it.

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
> 
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10805) JSON Data Frame does not return correct string lengths

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-10805.
-
Resolution: Won't Fix

> JSON Data Frame does not return correct string lengths
> --
>
> Key: SPARK-10805
> URL: https://issues.apache.org/jira/browse/SPARK-10805
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Jeff Li
>Priority: Minor
>
> Here is the sample code to run the test 
> @Test
>   public void runSchemaTest() throws Exception {
>   DataFrame jsonDataFrame = 
> sqlContext.jsonFile("src/test/resources/jsontransform/json.sampledata.json");
>   jsonDataFrame.printSchema();
>   StructType jsonSchema = jsonDataFrame.schema();
>   StructField[] dataFields = jsonSchema.fields();
>   for ( int fieldIndex = 0; fieldIndex < dataFields.length;  
> fieldIndex++) {
>   StructField aField = dataFields[fieldIndex];
>   DataType aType = aField.dataType();
>   System.out.println("name: " + aField.name() + " type: " 
> + aType.typeName()
>   + " size: " +aType.defaultSize());
>   }
>  }
> name: _id type: string size: 4096
> name: firstName type: string size: 4096
> name: lastName type: string size: 4096
> In my case, the _id is 1 character, the first name is 4 characters, and the 
> last name is 7 characters. 
> The Spark JSON DataFrame should have a way to tell the maximum length of 
> each JSON string element in the JSON document.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10805) JSON Data Frame does not return correct string lengths

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558595#comment-15558595
 ] 

Xiao Li commented on SPARK-10805:
-

Finding the max length for each field is pretty expensive: it means we need to 
read all the records. The schema is inferred from the file when it is read, 
and even if we computed the lengths then, new records could be appended 
afterwards. 

Now that CBO is being implemented, this should be resolved as part of that work.
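
If the maximum lengths are really needed, they can be computed explicitly 
today with a full scan. A minimal sketch, assuming a SparkSession named spark 
and the column names from the example above:

{code}
import org.apache.spark.sql.functions.{col, length, max}

val jsonDF = spark.read.json("src/test/resources/jsontransform/json.sampledata.json")

// aggregate the maximum string length per column (reads every record)
jsonDF.agg(
  max(length(col("_id"))).as("max_id_len"),
  max(length(col("firstName"))).as("max_firstName_len"),
  max(length(col("lastName"))).as("max_lastName_len")
).show()
{code}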

> JSON Data Frame does not return correct string lengths
> --
>
> Key: SPARK-10805
> URL: https://issues.apache.org/jira/browse/SPARK-10805
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Jeff Li
>Priority: Minor
>
> Here is the sample code to run the test 
> @Test
>   public void runSchemaTest() throws Exception {
>   DataFrame jsonDataFrame = 
> sqlContext.jsonFile("src/test/resources/jsontransform/json.sampledata.json");
>   jsonDataFrame.printSchema();
>   StructType jsonSchema = jsonDataFrame.schema();
>   StructField[] dataFields = jsonSchema.fields();
>   for ( int fieldIndex = 0; fieldIndex < dataFields.length;  
> fieldIndex++) {
>   StructField aField = dataFields[fieldIndex];
>   DataType aType = aField.dataType();
>   System.out.println("name: " + aField.name() + " type: " 
> + aType.typeName()
>   + " size: " +aType.defaultSize());
>   }
>  }
> name: _id type: string size: 4096
> name: firstName type: string size: 4096
> name: lastName type: string size: 4096
> In my case, the _id is 1 character, the first name is 4 characters, and the 
> last name is 7 characters. 
> The Spark JSON DataFrame should have a way to tell the maximum length of 
> each JSON string element in the JSON document.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11055) Use mixing hash-based and sort-based aggregation in TungstenAggregationIterator

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-11055.
---
Resolution: Duplicate

> Use mixing hash-based and sort-based aggregation in 
> TungstenAggregationIterator
> ---
>
> Key: SPARK-11055
> URL: https://issues.apache.org/jira/browse/SPARK-11055
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> In TungstenAggregationIterator we switch to sort-based aggregation when we 
> can't allocate more memory for the hashmap.
> However, using external sorter-based aggregation writes too many key-value 
> pairs to disk. We should mix hash-based and sort-based aggregation to reduce 
> the number of key-value pairs that need to be written to disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11055) Use mixing hash-based and sort-based aggregation in TungstenAggregationIterator

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558706#comment-15558706
 ] 

Xiao Li commented on SPARK-11055:
-

Based on the PR, Davies did similar work in [SPARK-11425] / [SPARK-11486] 
(Improve hybrid aggregation).

> Use mixing hash-based and sort-based aggregation in 
> TungstenAggregationIterator
> ---
>
> Key: SPARK-11055
> URL: https://issues.apache.org/jira/browse/SPARK-11055
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> In TungstenAggregationIterator we switch to sort-based aggregation when we 
> can't allocate more memory for the hashmap.
> However, using external sorter-based aggregation writes too many key-value 
> pairs to disk. We should mix hash-based and sort-based aggregation to reduce 
> the number of key-value pairs that need to be written to disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5818) unable to use "add jar" in hql

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-5818.
--
Resolution: Not A Problem

This has been supported. Please try the latest branch. Thanks!

> unable to use "add jar" in hql
> --
>
> Key: SPARK-5818
> URL: https://issues.apache.org/jira/browse/SPARK-5818
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1
>Reporter: pengxu
>
> In Spark 1.2.1 and 1.2.0, it is not possible to use the Hive command "add 
> jar" in hql.
> It seems that the problem in SPARK-2219 still exists.
> The problem can be reproduced as described below. Suppose the jar file is 
> named brickhouse-0.6.0.jar and is placed in the /tmp directory
> {code}
> spark-shell>import org.apache.spark.sql.hive._
> spark-shell>val sqlContext = new HiveContext(sc)
> spark-shell>import sqlContext._
> spark-shell>hql("add jar /tmp/brickhouse-0.6.0.jar")
> {code}
> the error message is showed as blow
> {code:title=Error Log}
> 15/02/15 01:36:31 ERROR SessionState: Unable to register 
> /tmp/brickhouse-0.6.0.jar
> Exception: org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be 
> cast to java.net.URLClassLoader
> java.lang.ClassCastException: 
> org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to 
> java.net.URLClassLoader
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.addToClassPath(Utilities.java:1921)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.registerJar(SessionState.java:599)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState$ResourceType$2.preHook(SessionState.java:658)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:732)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:717)
>   at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:54)
>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:319)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
>   at 
> org.apache.spark.sql.hive.execution.AddJar.sideEffectResult$lzycompute(commands.scala:74)
>   at 
> org.apache.spark.sql.hive.execution.AddJar.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
>   at org.apache.spark.sql.hive.execution.AddJar.execute(commands.scala:68)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
>   at 
> org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
>   at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108)
>   at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:102)
>   at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:106)
>   at 
> $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
>   at 
> $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:29)
>   at 
> $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
>   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
>   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
>   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)
>   at $line30.$read$$iwC$$iwC$$iwC$$iwC.(:39)
>   at $line30.$read$$iwC$$iwC$$iwC.(:41)
>   at $line30.$read$$iwC$$iwC.(:43)
>   at $line30.$read$$iwC.(:45)
>   at $line30.$read.(:47)
>   at $line30.$read$.(:51)
>   at $line30.$read$.()
>   at $line30.$eval$.(:7)
>   at $line30.$eval$.()
>   at $line30.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
>   at 

[jira] [Commented] (SPARK-10496) Efficient DataFrame cumulative sum

2016-10-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558724#comment-15558724
 ] 

Reynold Xin commented on SPARK-10496:
-

I think there are two separate issues here:

1. The API to run cumulative sum right now is fairly awkward. Either do it 
through a complicated join, or through window functions that still look fairly 
verbose. I've created a notebook that contains two short examples to do this in 
SQL and in DataFrames: 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html

It would make sense to me to create a simpler API for this case, since it is 
very common. This API under the hood can just call the existing window function 
API.

2. The implementation, for cases when there is a single window partition, is 
slow, because it requires shuffling all the data. This can technically be run 
as just a prefix scan. In this case, I'd add an optimizer rule or physical 
plan change to improve this.
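
For concreteness, a minimal sketch of the window-function form in the 
DataFrame API; it assumes a DataFrame df with an ordering column id and a 
numeric column x (hypothetical names), and a running Spark 1.4+/2.x session:

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// note: with no partitionBy, all rows land in a single window partition,
// which is exactly the shuffle/performance problem described in point 2
val w = Window.orderBy("id").rowsBetween(Long.MinValue, 0L)
val withCumSum = df.withColumn("cum_x", sum("x").over(w))
{code}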



> Efficient DataFrame cumulative sum
> --
>
> Key: SPARK-10496
> URL: https://issues.apache.org/jira/browse/SPARK-10496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Goal: Given a DataFrame with a numeric column X, create a new column Y which 
> is the cumulative sum of X.
> This can be done with window functions, but it is not efficient for a large 
> number of rows.  It could be done more efficiently using a prefix sum/scan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9265) Dataframe.limit joined with another dataframe can be non-deterministic

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-9265.
--
Resolution: Not A Problem

> Dataframe.limit joined with another dataframe can be non-deterministic
> --
>
> Key: SPARK-9265
> URL: https://issues.apache.org/jira/browse/SPARK-9265
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Tathagata Das
>Priority: Critical
>
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> val recentFailures = table("failed_suites").cache()
> val topRecentFailures = 
> recentFailures.groupBy('suiteName).agg(count("*").as('failCount)).orderBy('failCount.desc).limit(10)
> topRecentFailures.show(100)
> val mot = topRecentFailures.as("a").join(recentFailures.as("b"), 
> $"a.suiteName" === $"b.suiteName")
>   
> (1 to 10).foreach { i => 
>   println(s"$i: " + mot.count())
> }
> {code}
> This shows.
> {code}
> ++-+
> |   suiteName|failCount|
> ++-+
> |org.apache.spark|   85|
> |org.apache.spark|   26|
> |org.apache.spark|   26|
> |org.apache.spark|   17|
> |org.apache.spark|   17|
> |org.apache.spark|   15|
> |org.apache.spark|   13|
> |org.apache.spark|   13|
> |org.apache.spark|   11|
> |org.apache.spark|9|
> ++-+
> 1: 174
> 2: 166
> 3: 174
> 4: 106
> 5: 158
> 6: 110
> 7: 174
> 8: 158
> 9: 166
> 10: 106
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9265) Dataframe.limit joined with another dataframe can be non-deterministic

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558875#comment-15558875
 ] 

Xiao Li commented on SPARK-9265:


This has been resolved since our Optimizer pushes down `Limit` below `Sort`. 
Closing it now. Thanks!

> Dataframe.limit joined with another dataframe can be non-deterministic
> --
>
> Key: SPARK-9265
> URL: https://issues.apache.org/jira/browse/SPARK-9265
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Tathagata Das
>Priority: Critical
>
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> val recentFailures = table("failed_suites").cache()
> val topRecentFailures = 
> recentFailures.groupBy('suiteName).agg(count("*").as('failCount)).orderBy('failCount.desc).limit(10)
> topRecentFailures.show(100)
> val mot = topRecentFailures.as("a").join(recentFailures.as("b"), 
> $"a.suiteName" === $"b.suiteName")
>   
> (1 to 10).foreach { i => 
>   println(s"$i: " + mot.count())
> }
> {code}
> This shows.
> {code}
> ++-+
> |   suiteName|failCount|
> ++-+
> |org.apache.spark|   85|
> |org.apache.spark|   26|
> |org.apache.spark|   26|
> |org.apache.spark|   17|
> |org.apache.spark|   17|
> |org.apache.spark|   15|
> |org.apache.spark|   13|
> |org.apache.spark|   13|
> |org.apache.spark|   11|
> |org.apache.spark|9|
> ++-+
> 1: 174
> 2: 166
> 3: 174
> 4: 106
> 5: 158
> 6: 110
> 7: 174
> 8: 158
> 9: 166
> 10: 106
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14017) dataframe.dtypes -> pyspark.sql.types aliases

2016-10-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-14017.
---
Resolution: Won't Fix

Thanks for bringing this issue up - I don't think we necessarily want to add 
these aliases - the type differences are documented in 
http://spark.apache.org/docs/latest/sql-programming-guide.html

> dataframe.dtypes -> pyspark.sql.types aliases
> -
>
> Key: SPARK-14017
> URL: https://issues.apache.org/jira/browse/SPARK-14017
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.5.0
> Environment: Python 2.7; Spark 1.5; Java 1.7; Hadoop 2.6; Scala 2.10
>Reporter: Ruslan Dautkhanov
>Priority: Minor
>  Labels: dataframe, datatypes, pyspark, python
>
> Running the following:
> # fix schema for gaid, which should not be Double 
> from pyspark.sql.types import *
> customSchema = StructType()
> for (col, typ) in tsp_orig.dtypes:
>     if col == 'Agility_GAID':
>         typ = 'string'
>     customSchema.add(col, typ, True)
> Getting: 
>   ValueError: Could not parse datatype: bigint
> Looks like pyspark.sql.types doesn't know anything about bigint. 
> Should it be aliased to LongType in pyspark.sql.types?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2016-10-08 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger resolved SPARK-3146.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

> Improve the flexibility of Spark Streaming Kafka API to offer user the 
> ability to process message before storing into BM
> 
>
> Key: SPARK-3146
> URL: https://issues.apache.org/jira/browse/SPARK-3146
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Saisai Shao
> Fix For: 1.3.0
>
>
> Currently the Spark Streaming Kafka API stores the key and value of each 
> message into BM for processing; potentially this loses flexibility for 
> different requirements:
> 1. Currently topic/partition/offset information for each message is discarded 
> by KafkaInputDStream. In some scenarios people may need this information to 
> better filter the message, as SPARK-2388 describes.
> 2. People may need to add a timestamp to each message when feeding it into 
> Spark Streaming, which can better measure the system latency.
> 3. Checkpointing the partition/offsets or others...
> So here we add a messageHandler to the interface to give people the 
> flexibility to preprocess the message before storing it into BM. In the 
> meantime, this improvement stays compatible with the current API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2016-10-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558550#comment-15558550
 ] 

Cody Koeninger commented on SPARK-3146:
---

SPARK-4964 / the direct stream added a messageHandler.
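
For reference, a minimal sketch of that messageHandler with the 
spark-streaming-kafka 0.8 direct stream API; ssc, kafkaParams and fromOffsets 
are assumed to be defined elsewhere:

{code}
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// keep topic/partition/offset alongside the value instead of discarding them
val messageHandler = (mmd: MessageAndMetadata[String, String]) =>
  (mmd.topic, mmd.partition, mmd.offset, mmd.message())

val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, Int, Long, String)](
  ssc, kafkaParams, fromOffsets, messageHandler)
{code}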


> Improve the flexibility of Spark Streaming Kafka API to offer user the 
> ability to process message before storing into BM
> 
>
> Key: SPARK-3146
> URL: https://issues.apache.org/jira/browse/SPARK-3146
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Saisai Shao
>
> Currently the Spark Streaming Kafka API stores the key and value of each 
> message into BM for processing; potentially this loses flexibility for 
> different requirements:
> 1. Currently topic/partition/offset information for each message is discarded 
> by KafkaInputDStream. In some scenarios people may need this information to 
> better filter the message, as SPARK-2388 describes.
> 2. People may need to add a timestamp to each message when feeding it into 
> Spark Streaming, which can better measure the system latency.
> 3. Checkpointing the partition/offsets or others...
> So here we add a messageHandler to the interface to give people the 
> flexibility to preprocess the message before storing it into BM. In the 
> meantime, this improvement stays compatible with the current API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10794) Spark-SQL- select query on table column with binary Data Type displays error message- java.lang.ClassCastException: java.lang.String cannot be cast to [B

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558631#comment-15558631
 ] 

Xiao Li commented on SPARK-10794:
-

The related parts have changed a lot. Could you retry it? Thanks!

> Spark-SQL- select query on table column with binary Data Type displays error 
> message- java.lang.ClassCastException: java.lang.String cannot be cast to [B
> -
>
> Key: SPARK-10794
> URL: https://issues.apache.org/jira/browse/SPARK-10794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Spark 1.5.0 running on MapR 5.0 sandbox
>Reporter: Anilkumar Kalshetti
>Priority: Critical
> Attachments: binaryDataType.png, spark_1_5_0.png, testbinary.txt
>
>
> Spark-SQL connected to Hive Metastore-- MapR5.0 has Hive 1.0.0
> Use beeline interface for Spark-SQL
> 1] Execute below query to create Table,
> CREATE TABLE default.testbinary  ( 
> c1 binary, 
> c2 string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;
> 2] Copy the attachment file: testbinary.txt in VM directory - /home/mapr/data/
> and execute below script to load data in table
> LOAD DATA LOCAL INPATH '/home/mapr/data/testbinary.txt' INTO TABLE testbinary
> //testbinary.txt  contains data
> 1001,'russia'
> 3] Execute below 'Describe' command to get table information, and select 
> command to get table data
> describe  testbinary;
> SELECT c1 FROM testbinary;
> 4] Select query displays error message:
>  java.lang.ClassCastException: java.lang.String cannot be cast to [B 
> Info:  for same table - select query on column c2 - string datatype works 
> properly
> SELECT c2 FROM testbinary;
> Please refer screenshot- binaryDataType.png



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11062) Thrift server does not support operationLog

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-11062.
-
Resolution: Duplicate

> Thrift server does not support operationLog
> ---
>
> Key: SPARK-11062
> URL: https://issues.apache.org/jira/browse/SPARK-11062
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Navis
>Priority: Trivial
>
> Currently, SparkExecuteStatementOperation is skipping beforeRun/afterRun 
> method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9205) org.apache.spark.sql.hive.HiveSparkSubmitSuite failing for Scala 2.11

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559052#comment-15559052
 ] 

Xiao Li commented on SPARK-9205:


This is not an issue, right? Since this JIRA is stale, let us close it now. If 
needed, we can create a new JIRA against new versions.

> org.apache.spark.sql.hive.HiveSparkSubmitSuite failing for Scala 2.11
> -
>
> Key: SPARK-9205
> URL: https://issues.apache.org/jira/browse/SPARK-9205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-Maven/AMPLAB_JENKINS_BUILD_PROFILE=scala2.11,label=centos/7/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9205) org.apache.spark.sql.hive.HiveSparkSubmitSuite failing for Scala 2.11

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-9205.
--
Resolution: Cannot Reproduce

> org.apache.spark.sql.hive.HiveSparkSubmitSuite failing for Scala 2.11
> -
>
> Key: SPARK-9205
> URL: https://issues.apache.org/jira/browse/SPARK-9205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-Maven/AMPLAB_JENKINS_BUILD_PROFILE=scala2.11,label=centos/7/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9359) Support IntervalType for Parquet

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-9359.

Resolution: Duplicate

> Support IntervalType for Parquet
> 
>
> Key: SPARK-9359
> URL: https://issues.apache.org/jira/browse/SPARK-9359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Liang-Chi Hsieh
>
> SPARK-8753 introduced {{IntervalType}} which corresponds to Parquet 
> {{INTERVAL}} logical type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9359) Support IntervalType for Parquet

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-9359.
--
Assignee: (was: Liang-Chi Hsieh)

> Support IntervalType for Parquet
> 
>
> Key: SPARK-9359
> URL: https://issues.apache.org/jira/browse/SPARK-9359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>
> SPARK-8753 introduced {{IntervalType}} which corresponds to Parquet 
> {{INTERVAL}} logical type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559068#comment-15559068
 ] 

Xiao Li commented on SPARK-11087:
-

Can you retry it using the latest master/2.0.1 branch? Thanks!

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query from the table with where clause>
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I still 
> expected the ORC pushdown predicate to be generated (because of the where 
> clause).
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_namedata_type   comment 
>
> date  int 
> hhint 
> x int 
> y int 
> heightfloat   
> u float   
> v float   
> w float   
> phfloat   
> phb   float   
> t float   
> p float   
> pbfloat   
> qvaporfloat   
> qgraupfloat   
> qnice float   
> qnrainfloat   
> tke_pbl   float   
> el_pblfloat   
> qcloudfloat   
>
> # Partition Information
> # col_namedata_type   comment 
>
> zone  int 
> z int 
> year  int 
> month int 
>
> # Detailed Table Information   
> Database: default  
> Owner:patcharee
> CreateTime:   Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D   
>  
> Table Type:   EXTERNAL_TABLE   
> Table Parameters:  
>   EXTERNALTRUE
>   comment this table is imported from rwf_data/*/wrf/*
>   last_modified_bypatcharee   
>   last_modified_time  1439806692  
>   orc.compressZLIB
>   transient_lastDdlTime   1439806692  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:  org.apache.hadoop.hive.ql.io.orc.OrcInputFormat  
> OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>  
> Compressed:   No   
> Num Buckets:  -1   
> Bucket Columns:   []   
> Sort Columns: []   
> Storage Desc Params:   
>   serialization.format1   
> Time taken: 0.388 seconds, Fetched: 58 row(s)
> 
> Data was inserted into this table by another spark job>
> 

[jira] [Commented] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas

2016-10-08 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559240#comment-15559240
 ] 

holdenk commented on SPARK-11758:
-

I believe dropping the index field is intentional (but we should probably 
document it). I'm less certain about the time information - what do you think 
we should do with timestamp records?

> Missing Index column while creating a DataFrame from Pandas 
> 
>
> Key: SPARK-11758
> URL: https://issues.apache.org/jira/browse/SPARK-11758
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Linux Debian, PySpark, in local testing.
>Reporter: Leandro Ferrado
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked from a 
> pandas.DataFrame while indicating a 'schema' with StructFields, the function 
> _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - the index column, because of the flag index=False
> - timestamp records, because a Date column can't be the index and Pandas 
> doesn't convert its records to Timestamp type.
> So converting a DataFrame from Pandas to SQL is poor in scenarios with 
> temporal records.
> Doc: 
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
> """
> Create an RDD for DataFrame from an list or pandas.DataFrame, returns
> the RDD and schema.
> """
> if has_pandas and isinstance(data, pandas.DataFrame):
> if schema is None:
> schema = [str(x) for x in data.columns]
> data = [r.tolist() for r in data.to_records(index=False)]  # HERE
> # ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10501) support UUID as an atomic type

2016-10-08 Thread Russell Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559271#comment-15559271
 ] 

Russell Spitzer commented on SPARK-10501:
-

It's not that we need it as a unique identifier. It's already a datatype in the 
Cassandra database, but there is no direct translation to a Spark SQL type, so 
a conversion to string must be done. In addition, TimeUUIDs require a custom 
non-bytewise comparator, so a greater-than or less-than lexical comparison of 
them is always incorrect. 

https://datastax-oss.atlassian.net/browse/SPARKC-405

> support UUID as an atomic type
> --
>
> Key: SPARK-10501
> URL: https://issues.apache.org/jira/browse/SPARK-10501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jon Haddad
>Priority: Minor
>
> It's pretty common to use UUIDs instead of integers in order to avoid 
> distributed counters.  
> I've added this, which at least lets me load dataframes that use UUIDs that I 
> can cast to strings:
> {code}
> class UUIDType(AtomicType):
> pass
> _type_mappings[UUID] = UUIDType
> _atomic_types.append(UUIDType)
> {code}
> But if I try to do anything else with the UUIDs, like this:
> {code}
> ratings.select("userid").distinct().collect()
> {code}
> I get this pile of fun: 
> {code}
> scala.MatchError: UUIDType (of class 
> org.apache.spark.sql.cassandra.types.UUIDType$)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11523) spark_partition_id() considered invalid function

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-11523.
---
Resolution: Not A Problem

> spark_partition_id() considered invalid function
> 
>
> Key: SPARK-11523
> URL: https://issues.apache.org/jira/browse/SPARK-11523
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Simeon Simeonov
>  Labels: hive, sql, views
>
> {{spark_partition_id()}} works correctly in top-level {{SELECT}} statements 
> but is not recognized in {{SELECT}} statements that define views. It seems 
> DDL processing vs. execution in Spark SQL use two different parsers and/or 
> environments.
> In the following examples, instead of the {{test_data}} table you can use any 
> defined table name.
> A top-level statement works:
> {code}
> scala> ctx.sql("select spark_partition_id() as partition_id from 
> test_data").show
> ++
> |partition_id|
> ++
> |   0|
> ...
> |   0|
> ++
> only showing top 20 rows
> {code}
> The same query in a view definition fails with {{Invalid function 
> 'spark_partition_id'}}.
> {code}
> scala> ctx.sql("create view test_view as select spark_partition_id() as 
> partition_id from test_data")
> 15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as 
> select spark_partition_id() as partition_id from test_data
> 15/11/05 01:05:38 INFO ParseDriver: Parse Completed
> 15/11/05 01:05:38 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as 
> select spark_partition_id() as partition_id from test_data
> 15/11/05 01:05:38 INFO ParseDriver: Parse Completed
> 15/11/05 01:05:38 INFO PerfLogger:  end=1446703538519 duration=1 from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO CalcitePlanner: Starting Semantic Analysis
> 15/11/05 01:05:38 INFO CalcitePlanner: Creating view default.test_view 
> position=12
> 15/11/05 01:05:38 INFO HiveMetaStore: 0: get_database: default
> 15/11/05 01:05:38 INFO audit: ugi=sim ip=unknown-ip-addr  
> cmd=get_database: default
> 15/11/05 01:05:38 INFO CalcitePlanner: Completed phase 1 of Semantic Analysis
> 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for source tables
> 15/11/05 01:05:38 INFO HiveMetaStore: 0: get_table : db=default tbl=test_data
> 15/11/05 01:05:38 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=test_data
> 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for subqueries
> 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for destination tables
> 15/11/05 01:05:38 INFO Context: New scratch dir is 
> hdfs://localhost:9000/tmp/hive/sim/3fce9b7e-011f-4632-b673-e29067779fa0/hive_2015-11-05_01-05-38_518_4526721093949438849-1
> 15/11/05 01:05:38 INFO CalcitePlanner: Completed getting MetaData in Semantic 
> Analysis
> 15/11/05 01:05:38 INFO BaseSemanticAnalyzer: Not invoking CBO because the 
> statement doesn't have QUERY or EXPLAIN as root and not a CTAS; has create 
> view
> 15/11/05 01:05:38 ERROR Driver: FAILED: SemanticException [Error 10011]: Line 
> 1:32 Invalid function 'spark_partition_id'
> org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:32 Invalid function 
> 'spark_partition_id'
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:925)
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1265)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:95)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:79)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:133)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:110)
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:205)
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:149)
>   at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:10512)
>   at 
> 

[jira] [Commented] (SPARK-11523) spark_partition_id() considered invalid function

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559072#comment-15559072
 ] 

Xiao Li commented on SPARK-11523:
-

Native views have been supported since 2.0, so this JIRA is no longer needed. Thanks!

> spark_partition_id() considered invalid function
> 
>
> Key: SPARK-11523
> URL: https://issues.apache.org/jira/browse/SPARK-11523
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Simeon Simeonov
>  Labels: hive, sql, views
>
> {{spark_partition_id()}} works correctly in top-level {{SELECT}} statements 
> but is not recognized in {{SELECT}} statements that define views. It seems 
> DDL processing vs. execution in Spark SQL use two different parsers and/or 
> environments.
> In the following examples, instead of the {{test_data}} table you can use any 
> defined table name.
> A top-level statement works:
> {code}
> scala> ctx.sql("select spark_partition_id() as partition_id from 
> test_data").show
> ++
> |partition_id|
> ++
> |   0|
> ...
> |   0|
> ++
> only showing top 20 rows
> {code}
> The same query in a view definition fails with {{Invalid function 
> 'spark_partition_id'}}.
> {code}
> scala> ctx.sql("create view test_view as select spark_partition_id() as 
> partition_id from test_data")
> 15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as 
> select spark_partition_id() as partition_id from test_data
> 15/11/05 01:05:38 INFO ParseDriver: Parse Completed
> 15/11/05 01:05:38 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as 
> select spark_partition_id() as partition_id from test_data
> 15/11/05 01:05:38 INFO ParseDriver: Parse Completed
> 15/11/05 01:05:38 INFO PerfLogger:  end=1446703538519 duration=1 from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO CalcitePlanner: Starting Semantic Analysis
> 15/11/05 01:05:38 INFO CalcitePlanner: Creating view default.test_view 
> position=12
> 15/11/05 01:05:38 INFO HiveMetaStore: 0: get_database: default
> 15/11/05 01:05:38 INFO audit: ugi=sim ip=unknown-ip-addr  
> cmd=get_database: default
> 15/11/05 01:05:38 INFO CalcitePlanner: Completed phase 1 of Semantic Analysis
> 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for source tables
> 15/11/05 01:05:38 INFO HiveMetaStore: 0: get_table : db=default tbl=test_data
> 15/11/05 01:05:38 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=test_data
> 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for subqueries
> 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for destination tables
> 15/11/05 01:05:38 INFO Context: New scratch dir is 
> hdfs://localhost:9000/tmp/hive/sim/3fce9b7e-011f-4632-b673-e29067779fa0/hive_2015-11-05_01-05-38_518_4526721093949438849-1
> 15/11/05 01:05:38 INFO CalcitePlanner: Completed getting MetaData in Semantic 
> Analysis
> 15/11/05 01:05:38 INFO BaseSemanticAnalyzer: Not invoking CBO because the 
> statement doesn't have QUERY or EXPLAIN as root and not a CTAS; has create 
> view
> 15/11/05 01:05:38 ERROR Driver: FAILED: SemanticException [Error 10011]: Line 
> 1:32 Invalid function 'spark_partition_id'
> org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:32 Invalid function 
> 'spark_partition_id'
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:925)
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1265)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:95)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:79)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:133)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:110)
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:205)
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:149)
>   at 
> 

[jira] [Commented] (SPARK-6413) For data source tables, we should provide better output for DESCRIBE FORMATTED

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559074#comment-15559074
 ] 

Xiao Li commented on SPARK-6413:


This has been well supported since Spark 2.0, so I am closing it now. Thanks!

> For data source tables, we should provide better output for DESCRIBE FORMATTED
> --
>
> Key: SPARK-6413
> URL: https://issues.apache.org/jira/browse/SPARK-6413
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> Right now, we will show Hive's stuff like SerDe. Users will be confused when 
> they see the output of "DESCRIBE FORMATTED" (it is a Hive native command for 
> now) and think the table is not stored in the "right" format. Actually, the 
> table is indeed stored in the right format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10318) Getting issue in spark connectivity with cassandra

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559103#comment-15559103
 ] 

Xiao Li commented on SPARK-10318:
-

Yeah, I will follow your guidelines in the future. Thanks!

> Getting issue in spark connectivity with cassandra
> --
>
> Key: SPARK-10318
> URL: https://issues.apache.org/jira/browse/SPARK-10318
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Spark on local mode with centos 6.x
>Reporter: Poorvi Lashkary
>Priority: Minor
>
> Use case: I have to create a Spark SQL DataFrame from a table on Cassandra 
> with the JDBC driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11479) add kmeans example for Dataset

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-11479.
---
Resolution: Won't Fix

> add kmeans example for Dataset
> --
>
> Key: SPARK-11479
> URL: https://issues.apache.org/jira/browse/SPARK-11479
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11479) add kmeans example for Dataset

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559114#comment-15559114
 ] 

Xiao Li commented on SPARK-11479:
-

Based on the PR, we should close it now. Please reopen it if you still think it 
is needed. Thanks!

> add kmeans example for Dataset
> --
>
> Key: SPARK-11479
> URL: https://issues.apache.org/jira/browse/SPARK-11479
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14420) keepLastCheckpoint Param for Python LDA with EM

2016-10-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-14420.
---
Resolution: Duplicate

> keepLastCheckpoint Param for Python LDA with EM
> ---
>
> Key: SPARK-14420
> URL: https://issues.apache.org/jira/browse/SPARK-14420
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See the linked JIRA for the Scala API. This would add it in spark.ml; adding it to 
> spark.mllib is optional IMO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14420) keepLastCheckpoint Param for Python LDA with EM

2016-10-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-14420:

Fix Version/s: 2.0.0

> keepLastCheckpoint Param for Python LDA with EM
> ---
>
> Key: SPARK-14420
> URL: https://issues.apache.org/jira/browse/SPARK-14420
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.0.0
>
>
> See the linked JIRA for the Scala API. This would add it in spark.ml; adding it to 
> spark.mllib is optional IMO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2016-10-08 Thread Ron Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559044#comment-15559044
 ] 

Ron Hu commented on SPARK-17626:


In the CBO design spec we posted in 
https://issues.apache.org/jira/browse/SPARK-16026,
we illustrated a Multi-way Join Ordering Optimization algorithm using a dynamic 
programming technique. This algorithm should be able to pick the best join 
re-ordering plan, but the search space can be large, so we need some 
heuristics to reduce it. 

As Zhenhua pointed out, we can identify all the primary-key/foreign-key joins, 
since we collect the number of distinct values and can use it to infer whether a 
join column is a primary key. If a join relation has a primary-key join column, 
then it is a dimension table; if a join relation has foreign-key join columns, 
then it is a fact table. Once a fact table is identified, we form a star schema 
by finding all the dimension tables that have join conditions with the given 
fact table.

As for the selectivity hint, we do not need one to deal with comparison 
expressions like:
  column_name <comparison operator> constant_value
where the comparison operator is =, <, <=, >, >=, etc. 
This is because, with the histogram we are implementing now in CBO, we can 
estimate the filtering selectivity properly. However, for the following cases, a 
selectivity hint will be helpful.

Case 1:
  WHERE o_comment not like '%special%request%'  /* TPC-H Q13 */
A histogram cannot provide such detailed statistics for a string 
pattern, which can be a complex regular expression.

Case 2:
  WHERE l_commitdate < l_receiptdate /* TPC-H Q4 */
Today we define a one-dimensional histogram to keep track of the data distribution 
of a single column. We do not handle non-equality relationships between two 
columns.
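
To make the primary-key inference above concrete, here is a rough sketch of the idea (illustration only, not the CBO implementation; assumes Spark 2.1+ with a SparkSession named spark):

{code}
// Illustration only, not the CBO code: treat a join column as a likely
// primary key when its distinct-value count is close to the row count.
import org.apache.spark.sql.functions.{approx_count_distinct, count}

def likelyPrimaryKey(table: String, column: String): Boolean = {
  val stats = spark.table(table)
    .agg(count(column).as("rows"), approx_count_distinct(column).as("ndv"))
    .head()
  val rows = stats.getAs[Long]("rows")
  val ndv  = stats.getAs[Long]("ndv")
  // Small tolerance because approx_count_distinct is approximate.
  rows > 0 && ndv.toDouble / rows > 0.95
}

// A relation joined on a (nearly) unique column is treated as a dimension
// table; a relation joined on a non-unique column is treated as a fact table.
{code}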


> TPC-DS performance improvements using star-schema heuristics
> 
>
> Key: SPARK-17626
> URL: https://issues.apache.org/jira/browse/SPARK-17626
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Priority: Critical
> Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schemas, which are star schemas 
> with dimensions linking to dimensions. A star schema consists of a fact table 
> referencing a number of dimension tables. The fact table holds the main data 
> about a business; a dimension table, usually a smaller table, describes data 
> reflecting a dimension/attribute of the business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans of large fact tables joins. Manual rewrite of 
> some of the queries into selective fact-dimensions joins resulted in 
> significant performance improvement. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. The performance 
> testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed                 99
> Failed                  0
> Total q time (s)   14,962
> Max time (s)        1,467
> Min time (s)            3
> Mean time (s)         145
> Geomean (s)            44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)               -19%
> Mean time improved (%)                -19%
> Geomean improved (%)                  -24%
> End to end improved (seconds)       -3,603
> Number of queries improved (>10%)       45
> Number of queries degraded (>10%)        6
> Number of queries unchanged             48
> Top 10 queries improved (%)           -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base 
> tables. Selectivity hint is optional and it was not used in the above TPC-DS 
> tests. 
> \\



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Closed] (SPARK-6413) For data source tables, we should provide better output for DESCRIBE FORMATTED

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-6413.
--
Resolution: Not A Problem

> For data source tables, we should provide better output for DESCRIBE FORMATTED
> --
>
> Key: SPARK-6413
> URL: https://issues.apache.org/jira/browse/SPARK-6413
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> Right now, we will show Hive's stuff like SerDe. Users will be confused when 
> they see the output of "DESCRIBE FORMATTED" (it is a Hive native command for 
> now) and think the table is not stored in the "right" format. Actually, the 
> table is indeed stored in the right format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7659) Sort by attributes that are not present in the SELECT clause when there is windowfunction analysis error

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559117#comment-15559117
 ] 

Xiao Li commented on SPARK-7659:


This should have been fixed in 2.0. Please reopen it if you still hit it. 
Thanks!

> Sort by attributes that are not present in the SELECT clause when there is 
> windowfunction analysis error
> 
>
> Key: SPARK-7659
> URL: https://issues.apache.org/jira/browse/SPARK-7659
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Fei Wang
>
> The following SQL gets an error:
> select month,
> sum(product) over (partition by month)
> from windowData order by area



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7659) Sort by attributes that are not present in the SELECT clause when there is windowfunction analysis error

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-7659.
--
Resolution: Not A Problem

> Sort by attributes that are not present in the SELECT clause when there is 
> windowfunction analysis error
> 
>
> Key: SPARK-7659
> URL: https://issues.apache.org/jira/browse/SPARK-7659
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Fei Wang
>
> The following SQL gets an error:
> select month,
> sum(product) over (partition by month)
> from windowData order by area



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8115) Remove TestData

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-8115.

Resolution: Later

It sounds like this is not being fixed in the short term. Please reopen it if 
it is still needed.

> Remove TestData
> ---
>
> Key: SPARK-8115
> URL: https://issues.apache.org/jira/browse/SPARK-8115
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Andrew Or
>Priority: Minor
>
> TestData was from the era when we didn't have easy ways to generate test 
> datasets. Now that we have implicits on Seq + toDF, it'd make more sense to put 
> the test datasets closer to the test suites.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10502) tidy up the exception message text to be less verbose/"User friendly"

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10502.
---
Resolution: Won't Fix

> tidy up the exception message text to be less verbose/"User friendly"
> -
>
> Key: SPARK-10502
> URL: https://issues.apache.org/jira/browse/SPARK-10502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: N Campbell
>Priority: Minor
>
> When a statement fails to parse, it would be preferable if the exception text were 
> more aligned with other vendors in indicating the syntax error, without the 
> inclusion of the verbose parse tree.
>  select tbint.rnum,tbint.cbint, nth_value( tbint.cbint, '4' ) over ( order by 
> tbint.rnum) from certstring.tbint 
> Errors:
> org.apache.spark.sql.AnalysisException: 
> Unsupported language features in query: select tbint.rnum,tbint.cbint, 
> nth_value( tbint.cbint, '4' ) over ( order by tbint.rnum) from 
> certstring.tbint
> TOK_QUERY 1, 0,40, 94
>   TOK_FROM 1, 36,40, 94
> TOK_TABREF 1, 38,40, 94
>   TOK_TABNAME 1, 38,40, 94
> certstring 1, 38,38, 94
> tbint 1, 40,40, 105
>   TOK_INSERT 0, -1,34, 0
> TOK_DESTINATION 0, -1,-1, 0
>   TOK_DIR 0, -1,-1, 0
> TOK_TMP_FILE 0, -1,-1, 0
> TOK_SELECT 1, 0,34, 12
>   TOK_SELEXPR 1, 2,4, 12
> . 1, 2,4, 12
>   TOK_TABLE_OR_COL 1, 2,2, 7
> tbint 1, 2,2, 7
>   rnum 1, 4,4, 13
>   TOK_SELEXPR 1, 6,8, 23
> . 1, 6,8, 23
>   TOK_TABLE_OR_COL 1, 6,6, 18
> tbint 1, 6,6, 18
>   cbint 1, 8,8, 24
>   TOK_SELEXPR 1, 11,34, 31
> TOK_FUNCTION 1, 11,34, 31
>   nth_value 1, 11,11, 31
>   . 1, 14,16, 47
> TOK_TABLE_OR_COL 1, 14,14, 42
>   tbint 1, 14,14, 42
> cbint 1, 16,16, 48
>   '4' 1, 19,19, 55
>   TOK_WINDOWSPEC 1, 25,34, 82
> TOK_PARTITIONINGSPEC 1, 27,33, 82
>   TOK_ORDERBY 1, 27,33, 82
> TOK_TABSORTCOLNAMEASC 1, 31,33, 82
>   . 1, 31,33, 82
> TOK_TABLE_OR_COL 1, 31,31, 77
>   tbint 1, 31,31, 77
> rnum 1, 33,33, 83
> scala.NotImplementedError: No parse rules for ASTNode type: 882, text: 
> TOK_WINDOWSPEC :
> TOK_WINDOWSPEC 1, 25,34, 82
>   TOK_PARTITIONINGSPEC 1, 27,33, 82
> TOK_ORDERBY 1, 27,33, 82
>   TOK_TABSORTCOLNAMEASC 1, 31,33, 82
> . 1, 31,33, 82
>   TOK_TABLE_OR_COL 1, 31,31, 77
> tbint 1, 31,31, 77
>   rnum 1, 33,33, 83
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1261)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10318) Getting issue in spark connectivity with cassandra

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10318.
---
Resolution: Fixed

> Getting issue in spark connectivity with cassandra
> --
>
> Key: SPARK-10318
> URL: https://issues.apache.org/jira/browse/SPARK-10318
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Spark on local mode with centos 6.x
>Reporter: Poorvi Lashkary
>Priority: Minor
>
> Use case: I have to create a Spark SQL DataFrame from a table on Cassandra 
> with the JDBC driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17835) Optimize NaiveBayes mllib wrapper to eliminate extra pass on data

2016-10-08 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17835:

Description: 
SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and left 
mllib as a wrapper. However, there are some differences between mllib and ml in 
how they handle {{labels}}:
* mllib allows input labels such as {-1, +1}, whereas ml assumes the input labels 
are in the range [0, numClasses).
* mllib {{NaiveBayesModel}} exposes {{labels}}, but ml does not due to the 
assumption mentioned above.

During the copy in SPARK-14077, we use {{val labels = 
data.map(_.label).distinct().collect().sorted}} to get the distinct labels 
first, and then encode the labels for training. This involves an extra Spark job 
compared with the original implementation. Since {{NaiveBayes}} only does one 
aggregation pass during training, adding another one seems less efficient. We can 
get the labels in a single pass along with {{NaiveBayes}} training and send 
them to the MLlib side.

  was:
SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and left 
mllib as a wrapper. However, there are some differences between mllib and ml in 
how they handle {{labels}}:
* mllib allows input labels such as {-1, +1}, whereas ml assumes the input labels 
are in the range [0, numClasses).
* mllib {{NaiveBayesModel}} exposes {{labels}}, but ml does not due to the 
assumption mentioned above.

During the copy in SPARK-14077, we use {{val labels = 
data.map(_.label).distinct().collect().sorted}} to get the distinct labels 
first, and then feed them to training. It involves another extra Spark job 
compared with the original implementation. Since {{NaiveBayes}} only does one 
aggregation during training, adding another one seems not as efficient. We can get 
the labels in a single pass along with {{NaiveBayes}} training and send them to 
the MLlib side.


> Optimize NaiveBayes mllib wrapper to eliminate extra pass on data
> -
>
> Key: SPARK-17835
> URL: https://issues.apache.org/jira/browse/SPARK-17835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>
> SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and 
> left mllib as a wrapper. However, there are some differences between mllib and 
> ml in how they handle {{labels}}:
> * mllib allows input labels such as {-1, +1}, whereas ml assumes the input 
> labels are in the range [0, numClasses).
> * mllib {{NaiveBayesModel}} exposes {{labels}}, but ml does not due to the 
> assumption mentioned above.
> During the copy in SPARK-14077, we use {{val labels = 
> data.map(_.label).distinct().collect().sorted}} to get the distinct labels 
> first, and then encode the labels for training. This involves an extra Spark job 
> compared with the original implementation. Since {{NaiveBayes}} only does one 
> aggregation pass during training, adding another one seems less efficient. We 
> can get the labels in a single pass along with {{NaiveBayes}} training and 
> send them to the MLlib side.
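
To illustrate the extra pass being discussed, a rough sketch (not the ml/mllib code; assumes a DataFrame named data with a Double column called label):

{code}
// Illustration of the two approaches, not the actual ml/mllib code.
// Assumes a DataFrame `data` with a Double column named "label".

// Extra pass: a separate Spark job only to discover the distinct labels.
val labelsExtraPass: Array[Double] =
  data.select("label").distinct().collect().map(_.getDouble(0)).sorted

// Single pass: training already aggregates per label (count() stands in for
// the sufficient statistics), so the labels can be read off that result.
val perLabel = data.groupBy("label").count().collect()
val labelsSinglePass: Array[Double] = perLabel.map(_.getDouble(0)).sorted
{code}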



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17820) Spark sqlContext.sql() performs only first insert for HiveQL "FROM target INSERT INTO dest" command to insert into multiple target tables from same source

2016-10-08 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557433#comment-15557433
 ] 

Jiang Xingbo commented on SPARK-17820:
--

Looks like we could support this by expanding the `multiInsertQueryBody` rule. Should we 
do this? [~hvanhovell]

> Spark sqlContext.sql() performs only first insert for HiveQL "FROM target 
> INSERT INTO dest" command to insert into multiple target tables from same 
> source
> --
>
> Key: SPARK-17820
> URL: https://issues.apache.org/jira/browse/SPARK-17820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Cloudera Quickstart VM 5.7
>Reporter: Kiran Miryala
>
> I am executing a HiveQL statement in spark-shell; I intend to insert a record into 2 
> destination tables from the same source table using the same statement, but it 
> inserts into only the first destination table. My statement:
> scala>val departmentsData = sqlContext.sql("from sqoop_import.departments 
> insert into sqoop_import.names_count1 select department_name, count(1) where 
> department_id=2 group by department_name insert into 
> sqoop_import.names_count2 select department_name, count(1) where 
> department_id=4 group by department_name")
> Same query inserts into both destination tables on hive shell:
> from sqoop_import.departments 
> insert into sqoop_import.names_count1 
> select department_name, count(1) 
> where department_id=2 group by department_name 
> insert into sqoop_import.names_count2 
> select department_name, count(1) 
> where department_id=4 group by department_name;
> Both target table definitions are:
> hive>use sqoop_import;
> hive> create table names_count1 (department_name String, count Int);
> hive> create table names_count2 (department_name String, count Int);
> Not sure why it is skipping the next one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17820) Spark sqlContext.sql() performs only first insert for HiveQL "FROM target INSERT INTO dest" command to insert into multiple target tables from same source

2016-10-08 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557458#comment-15557458
 ] 

Herman van Hovell commented on SPARK-17820:
---

Yeah, sure, we should take a look at this. Just know that although we (should) 
support multi-insert, it might not be as performant as one might think: we plan 
these as separate queries.
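
Until the multi-insert form works end to end, a sketch of the obvious workaround, i.e. issuing the inserts as separate statements (assumes the sqoop_import tables from the report and the spark-shell sqlContext; Spark plans a multi-insert as separate queries anyway):

{code}
// Workaround sketch: run the two inserts as separate statements.
// Assumes the sqoop_import tables from the report and a SQLContext named
// `sqlContext`, as in the original spark-shell session.
sqlContext.sql("""
  insert into table sqoop_import.names_count1
  select department_name, count(1)
  from sqoop_import.departments
  where department_id = 2
  group by department_name""")

sqlContext.sql("""
  insert into table sqoop_import.names_count2
  select department_name, count(1)
  from sqoop_import.departments
  where department_id = 4
  group by department_name""")
{code}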

> Spark sqlContext.sql() performs only first insert for HiveQL "FROM target 
> INSERT INTO dest" command to insert into multiple target tables from same 
> source
> --
>
> Key: SPARK-17820
> URL: https://issues.apache.org/jira/browse/SPARK-17820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Cloudera Quickstart VM 5.7
>Reporter: Kiran Miryala
>
> I am executing a HiveQL statement in spark-shell; I intend to insert a record into 2 
> destination tables from the same source table using the same statement, but it 
> inserts into only the first destination table. My statement:
> {noformat}
> scala>val departmentsData = sqlContext.sql("from sqoop_import.departments 
> insert into sqoop_import.names_count1 select department_name, count(1) where 
> department_id=2 group by department_name insert into 
> sqoop_import.names_count2 select department_name, count(1) where 
> department_id=4 group by department_name")
> {noformat}
> Same query inserts into both destination tables on hive shell:
> {noformat}
> from sqoop_import.departments 
> insert into sqoop_import.names_count1 
> select department_name, count(1) 
> where department_id=2 group by department_name 
> insert into sqoop_import.names_count2 
> select department_name, count(1) 
> where department_id=4 group by department_name;
> {noformat}
> Both target table definitions are:
> {noformat}
> hive>use sqoop_import;
> hive> create table names_count1 (department_name String, count Int);
> hive> create table names_count2 (department_name String, count Int);
> {noformat}
> Not sure why it is skipping the next one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-10318) Getting issue in spark connectivity with cassandra

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-10318:
---

[~smilegator] shouldn't we resolve this as a duplicate of the main, fixed issue 
rather than the other way around?

> Getting issue in spark connectivity with cassandra
> --
>
> Key: SPARK-10318
> URL: https://issues.apache.org/jira/browse/SPARK-10318
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Spark on local mode with centos 6.x
>Reporter: Poorvi Lashkary
>Priority: Minor
>
> Use case: I have to create a Spark SQL DataFrame from a table on Cassandra 
> with the JDBC driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10318) Getting issue in spark connectivity with cassandra

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10318.
---
Resolution: Duplicate

> Getting issue in spark connectivity with cassandra
> --
>
> Key: SPARK-10318
> URL: https://issues.apache.org/jira/browse/SPARK-10318
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Spark on local mode with centos 6.x
>Reporter: Poorvi Lashkary
>Priority: Minor
>
> Use case: I have to create a Spark SQL DataFrame from a table on Cassandra 
> with the JDBC driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8377) Identifiers caseness information should be available at any time

2016-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557542#comment-15557542
 ] 

Sean Owen commented on SPARK-8377:
--

This sounds more like "Not a Problem" if the resolution wasn't a particular 
change, but just taking a different action?

> Identifiers caseness information should be available at any time
> 
>
> Key: SPARK-8377
> URL: https://issues.apache.org/jira/browse/SPARK-8377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Santiago M. Mola
>
> Currently, we have the option of having a case sensitive catalog or not. A 
> case insensitive catalog just lowercases all identifiers. However, when 
> pushing down to a data source, we lose the information about if an identifier 
> should be case insensitive or strictly lowercase.
> Ideally, we would be able to distinguish a case insensitive identifier from a 
> case sensitive one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1792) Missing Spark-Shell Configure Options

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1792:
-
Fix Version/s: 1.1.0

> Missing Spark-Shell Configure Options
> -
>
> Key: SPARK-1792
> URL: https://issues.apache.org/jira/browse/SPARK-1792
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Reporter: Joseph E. Gonzalez
> Fix For: 1.1.0
>
>
> The `conf/spark-env.sh.template` does not have configuration options for the 
> spark shell.   For example, to enable Kryo for GraphX when using the spark 
> shell in standalone mode, it appears you must add:
> {code}
> SPARK_SUBMIT_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer
>  "
> SPARK_SUBMIT_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator
>   "
> {code}
> However SPARK_SUBMIT_OPTS is not documented anywhere.  Perhaps the 
> spark-shell should have its own options (e.g., SPARK_SHELL_OPTS).
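
For comparison, a sketch of the same settings expressed programmatically through SparkConf (the serializer and registrator class names are taken from the report above; this is an illustration, not a documented replacement for the env variable):

{code}
// Sketch: equivalent configuration through SparkConf instead of
// SPARK_SUBMIT_OPTS (class names taken from the report above).
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.spark.graphx.GraphKryoRegistrator")

// Pass `conf` to a new SparkContext, or supply the same keys via
// `spark-shell --conf key=value`.
{code}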



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17836) Use cross validation to determine the number of clusters for EM or KMeans algorithms

2016-10-08 Thread Lei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei Wang updated SPARK-17836:
-
Issue Type: New Feature  (was: Bug)

> Use cross validation to determine the number of clusters for EM or KMeans 
> algorithms
> 
>
> Key: SPARK-17836
> URL: https://issues.apache.org/jira/browse/SPARK-17836
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Lei Wang
>
> Sometimes it's not easy for users to determine the number of clusters.
> It would be very useful if Spark ML could support this. 
> There are several methods to do this according to the wiki page 
> https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
> Weka uses cross-validation.
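
A sketch of one simple selection loop along these lines (not an existing Spark API; assumes Spark 2.x ML and a DataFrame named dataset with a "features" vector column):

{code}
// Sketch of a simple selection loop (not an existing Spark API): fit KMeans
// for several values of k and compare the within-cluster cost. Assumes
// Spark 2.x ML and a DataFrame `dataset` with a "features" vector column.
import org.apache.spark.ml.clustering.KMeans

val costs = (2 to 10).map { k =>
  val model = new KMeans().setK(k).setSeed(1L).fit(dataset)
  (k, model.computeCost(dataset)) // within-set sum of squared errors
}

// Pick k at the "elbow" of the cost curve, or evaluate each k on held-out
// splits for the cross-validation style selection the description mentions.
costs.foreach { case (k, cost) => println(s"k=$k  cost=$cost") }
{code}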



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9685) "Unsupported dataType: char(X)" in Hive

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9685.
--
Resolution: Duplicate

> "Unsupported dataType: char(X)" in Hive
> ---
>
> Key: SPARK-9685
> URL: https://issues.apache.org/jira/browse/SPARK-9685
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ángel Álvarez
> Attachments: SPARK-9685.1.patch.txt
>
>
> I'm getting the following error when I try to read a Hive table with char(X) 
> fields:
> {code}
> 15/08/06 11:38:51 INFO parse.ParseDriver: Parse Completed
> org.apache.spark.sql.types.DataTypeException: Unsupported dataType: char(8). 
> If you have a struct and a field name of it has any special characters, 
> please use backticks (`) to quote that field name, e.g. `x+y`. Please note 
> that backtick itself is not supported in a field name.
> at 
> org.apache.spark.sql.types.DataTypeParser$class.toDataType(DataTypeParser.scala:95)
> at 
> org.apache.spark.sql.types.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:107)
> at 
> org.apache.spark.sql.types.DataTypeParser$.parse(DataTypeParser.scala:111)
> at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:769)
> at 
> org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:742)
> at 
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$44.apply(HiveMetastoreCatalog.scala:752)
> at 
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$44.apply(HiveMetastoreCatalog.scala:752)
> {code}
> It seems there is no "char" DataType defined in the DataTypeParser class
> {code}
>   protected lazy val primitiveType: Parser[DataType] =
> "(?i)string".r ^^^ StringType |
> "(?i)float".r ^^^ FloatType |
> "(?i)(?:int|integer)".r ^^^ IntegerType |
> "(?i)tinyint".r ^^^ ByteType |
> "(?i)smallint".r ^^^ ShortType |
> "(?i)double".r ^^^ DoubleType |
> "(?i)(?:bigint|long)".r ^^^ LongType |
> "(?i)binary".r ^^^ BinaryType |
> "(?i)boolean".r ^^^ BooleanType |
> fixedDecimalType |
> "(?i)decimal".r ^^^ DecimalType.USER_DEFAULT |
> "(?i)date".r ^^^ DateType |
> "(?i)timestamp".r ^^^ TimestampType |
> varchar
> {code}
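
For illustration, a tiny standalone sketch of how a char(n) rule could be accepted alongside varchar (not the actual Spark DataTypeParser; mapping both to a plain string type is one option, and the sketch assumes the scala-parser-combinators library is on the classpath):

{code}
// Standalone sketch, not the Spark DataTypeParser: accept char(n)/varchar(n)
// and map both to a string type. Requires scala-parser-combinators.
import scala.util.parsing.combinator.RegexParsers

object CharTypeParser extends RegexParsers {
  val charType: Parser[String] =
    "(?i)(?:var)?char".r ~ "(" ~ "\\d+".r ~ ")" ^^ { _ => "StringType" }

  def parse(s: String): String =
    parseAll(charType, s).getOrElse("unsupported")
}

// CharTypeParser.parse("char(8)")  // "StringType"
{code}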



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8842) Spark SQL - Insert into table Issue

2016-10-08 Thread James Greenwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557584#comment-15557584
 ] 

James Greenwood commented on SPARK-8842:


No, it does not work.

> Spark SQL - Insert into table Issue
> ---
>
> Key: SPARK-8842
> URL: https://issues.apache.org/jira/browse/SPARK-8842
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: James Greenwood
>
> I am running spark 1.4 and currently experiencing an issue when inserting 
> data into a table. The data is loaded into an initial table and then selected 
> from this table, processed and then inserted into a second table. The issue 
> is that some of the data goes missing when inserted into the second table 
> when running in a multi-worker configuration (a master, a worker on the 
> master and then a worker on a different host). 
> I have narrowed down the problem to the insert into the second table. An 
> example process to generate the problem is below. 
> Generate a file (for example /home/spark/test) with the numbers 1 to 50 on 
> separate lines. 
> spark-sql --master spark://spark-master:7077 --hiveconf 
> hive.metastore.warehouse.dir=/spark 
> (/spark is shared between all hosts) 
> create table test(field string); 
> load data inpath '/home/spark/test' into table test; 
> create table processed(field string); 
> from test insert into table processed select *; 
> select * from processed; 
> The result from the final select does not contain all the numbers 1 to 50. 
> I have also run the above example in some different configurations:
> - When there is just one worker running on the master. The result of the 
> final select is the rows 1-50, i.e. all data as expected. 
> - When there is just one worker running on a host which is not the master. 
> The final select returns no rows.
> No errors are logged in the log files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-9685) "Unsupported dataType: char(X)" in Hive

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-9685:
--

> "Unsupported dataType: char(X)" in Hive
> ---
>
> Key: SPARK-9685
> URL: https://issues.apache.org/jira/browse/SPARK-9685
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ángel Álvarez
> Attachments: SPARK-9685.1.patch.txt
>
>
> I'm getting the following error when I try to read a Hive table with char(X) 
> fields:
> {code}
> 15/08/06 11:38:51 INFO parse.ParseDriver: Parse Completed
> org.apache.spark.sql.types.DataTypeException: Unsupported dataType: char(8). 
> If you have a struct and a field name of it has any special characters, 
> please use backticks (`) to quote that field name, e.g. `x+y`. Please note 
> that backtick itself is not supported in a field name.
> at 
> org.apache.spark.sql.types.DataTypeParser$class.toDataType(DataTypeParser.scala:95)
> at 
> org.apache.spark.sql.types.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:107)
> at 
> org.apache.spark.sql.types.DataTypeParser$.parse(DataTypeParser.scala:111)
> at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:769)
> at 
> org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:742)
> at 
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$44.apply(HiveMetastoreCatalog.scala:752)
> at 
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$44.apply(HiveMetastoreCatalog.scala:752)
> {code}
> It seems there is no "char" DataType defined in the DataTypeParser class
> {code}
>   protected lazy val primitiveType: Parser[DataType] =
> "(?i)string".r ^^^ StringType |
> "(?i)float".r ^^^ FloatType |
> "(?i)(?:int|integer)".r ^^^ IntegerType |
> "(?i)tinyint".r ^^^ ByteType |
> "(?i)smallint".r ^^^ ShortType |
> "(?i)double".r ^^^ DoubleType |
> "(?i)(?:bigint|long)".r ^^^ LongType |
> "(?i)binary".r ^^^ BinaryType |
> "(?i)boolean".r ^^^ BooleanType |
> fixedDecimalType |
> "(?i)decimal".r ^^^ DecimalType.USER_DEFAULT |
> "(?i)date".r ^^^ DateType |
> "(?i)timestamp".r ^^^ TimestampType |
> varchar
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15989) PySpark SQL python-only UDTs don't support nested types

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15989:
--
Assignee: Liang-Chi Hsieh

> PySpark SQL python-only UDTs don't support nested types
> ---
>
> Key: SPARK-15989
> URL: https://issues.apache.org/jira/browse/SPARK-15989
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.1, 2.1.0
>
>
> [This 
> notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/611202526513296/1653464426712019/latest.html]
>  demonstrates the bug.
> The obvious issue is that nested UDTs are not supported if the UDT is 
> Python-only. Looking at the exception thrown, this seems to be because the 
> encoder on the Java end tries to encode the UDT as a Java class, which 
> doesn't exist for the [[PythonOnlyUDT]].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16186) Support partition batch pruning with `IN` predicate in InMemoryTableScanExec

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16186:
--
Assignee: Dongjoon Hyun

> Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
> 
>
> Key: SPARK-16186
> URL: https://issues.apache.org/jira/browse/SPARK-16186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> One of the most frequent usage patterns for Spark SQL is using **cached 
> tables**.
> This issue improves `InMemoryTableScanExec` to handle `IN` predicates 
> efficiently by pruning partition batches.
> Of course, the performance improvement varies across queries and 
> datasets. For the following simple query, the query duration in the Spark UI goes 
> from 9 seconds to 50~90 ms, which is over 100 times faster.
> {code}
> $ bin/spark-shell --driver-memory 6G
> scala> val df = spark.range(20)
> scala> df.createOrReplaceTempView("t")
> scala> spark.catalog.cacheTable("t")
> scala> sql("select id from t where id = 1").collect()// About 2 mins
> scala> sql("select id from t where id = 1").collect()// less than 90ms
> scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds 
> (Before)
> scala> sql("select id from t where id in (1,2,3)").collect() // less than 
> 90ms (After)
> {code}
> This issue impacts over 35 TPC-DS queries if the tables are cached.
> Note that this optimization is applied to IN. To apply it to an IN predicate with 
> more than 10 items, the *spark.sql.optimizer.inSetConversionThreshold* option 
> should be increased.
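
A sketch of raising that threshold, assuming the cached temp view t from the example above and a SparkSession named spark:

{code}
// Sketch: raise the In-to-InSet conversion threshold so larger IN lists keep
// the In form this pruning applies to (10 is the default). Assumes the cached
// temp view `t` from the example above and a SparkSession named `spark`.
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", 100L)
spark.sql("select id from t where id in (1,2,3,4,5,6,7,8,9,10,11,12)").collect()
{code}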



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15487:
--
Assignee: Gurvinder

> Spark Master UI to reverse proxy Application and Workers UI
> ---
>
> Key: SPARK-15487
> URL: https://issues.apache.org/jira/browse/SPARK-15487
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gurvinder
>Assignee: Gurvinder
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, when running in standalone mode, the Spark UI's links to workers and 
> application drivers point to internal/protected network endpoints. So to 
> access the worker/application UIs, the user's machine has to connect to a VPN 
> or have direct access to the internal network.
> Therefore the proposal is to make the Spark master UI reverse-proxy this 
> information back to the user, so that only the Spark master UI needs to be 
> opened up to the internet. 
> The minimal change can be done by adding another route, e.g. 
> http://spark-master.com/target/[target-id]/, so that when a request goes to 
> that route, the ProxyServlet kicks in, takes the target id, forwards the 
> request to the target, and sends the response back to the user.
> More information about the discussion of this feature can be found in this 
> mailing list thread: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html
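
A sketch of what enabling this could look like once merged; the property names below are assumed from the 2.1 reverse-proxy work and the URL is a placeholder:

{code}
// Sketch only: enable the master UI reverse proxy via SparkConf (the same
// keys can also go in spark-defaults.conf). Property names are assumed from
// the 2.1 feature added by this issue; the URL is a placeholder.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.reverseProxy", "true")
  .set("spark.ui.reverseProxyUrl", "https://spark-master.example.com")
{code}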



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16804:
--
Assignee: Nattavut Sutyanyong

> Correlated subqueries containing non-deterministic operators return incorrect 
> results
> -
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>Assignee: Nattavut Sutyanyong
> Fix For: 2.1.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates into join 
> predicates and neglects the semantics of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.
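
Since EXISTS only checks for at least one matching row, a sketch of the semantically equivalent query without the LIMIT (assuming the temp views t1/t2 from the example above), which should return both rows:

{code}
// Sketch (assumes the temp views t1/t2 defined in the example above): EXISTS
// only checks for at least one match, so dropping the LIMIT is semantically
// equivalent and should return both rows of t1.
sql("select c1 from t1 where exists (select 1 from t2 where t1.c1 = t2.c2)").show()
// expected:
// +---+
// | c1|
// +---+
// |  1|
// |  2|
// +---+
{code}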



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16596) Refactor DataSourceScanExec to do partition discovery at execution instead of planning time

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16596:
--
Assignee: Eric Liang

> Refactor DataSourceScanExec to do partition discovery at execution instead of 
> planning time
> ---
>
> Key: SPARK-16596
> URL: https://issues.apache.org/jira/browse/SPARK-16596
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Partition discovery is rather expensive, so we should do it at execution time 
> instead of during physical planning. Right now there is not much benefit 
> since ListingFileCatalog will scan all partitions at planning time 
> anyway, but this can be optimized in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16525) Enable Row Based HashMap in HashAggregateExec

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16525:
--
Assignee: Qifan Pu

> Enable Row Based HashMap in HashAggregateExec
> -
>
> Key: SPARK-16525
> URL: https://issues.apache.org/jira/browse/SPARK-16525
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Qifan Pu
>Assignee: Qifan Pu
> Fix For: 2.1.0
>
>
> Allow `RowBasedHashMapGenerator` to be used in `HashAggregateExec`, so that 
> we can turn on codegen for `RowBasedHashMap`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10221) RowReaderFactory does not work with blobs

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557326#comment-15557326
 ] 

Xiao Li edited comment on SPARK-10221 at 10/8/16 6:09 AM:
--

This should be a bug in the connector. Thus, close it now. Thanks!


was (Author: smilegator):
This should be the bug in the connector. Thus, close it now. Thanks!

> RowReaderFactory does not work with blobs
> -
>
> Key: SPARK-10221
> URL: https://issues.apache.org/jira/browse/SPARK-10221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Max Schmidt
>
> While using a RowReaderFactory from the Util API here: 
> com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowToTuple(, 
> Class) against a Cassandra table with a column which is described 
> as a ByteBuffer, I get the following stacktrace:
> {quote}
> 8786 [task-result-getter-0] ERROR org.apache.spark.scheduler.TaskSetManager  
> - Task 0.0 in stage 0.0 (TID 0) had a not serializable result: 
> java.nio.HeapByteBuffer
> Serialization stack:
> - object not serializable (class: java.nio.HeapByteBuffer, value: 
> java.nio.HeapByteBuffer[pos=0 lim=2 cap=2])
> - field (class: scala.Tuple4, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple4, 
> (/104.130.160.121,java.nio.HeapByteBuffer[pos=0 lim=2 cap=2],Tue Aug 25 
> 11:00:23 CEST 2015,76.808)); not retrying
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable 
> result: java.nio.HeapByteBuffer
> Serialization stack:
> - object not serializable (class: java.nio.HeapByteBuffer, value: 
> java.nio.HeapByteBuffer[pos=0 lim=2 cap=2])
> - field (class: scala.Tuple4, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple4, 
> (/104.130.160.121,java.nio.HeapByteBuffer[pos=0 lim=2 cap=2],Tue Aug 25 
> 11:00:23 CEST 2015,76.808))
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {quote}
> Using a kind of wrapper-class following bean conventions, doesn't work either.
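
A connector-agnostic sketch of the usual workaround, i.e. converting the non-serializable ByteBuffer into a plain byte array inside the row-mapping step before the result is collected:

{code}
// Sketch: copy a java.nio.ByteBuffer into a serializable Array[Byte] before
// the value leaves the executor (e.g. inside the row-mapping function).
import java.nio.ByteBuffer

def toBytes(buf: ByteBuffer): Array[Byte] = {
  val copy = buf.duplicate()            // leave the original position untouched
  val arr  = new Array[Byte](copy.remaining())
  copy.get(arr)
  arr
}

// toBytes(ByteBuffer.wrap(Array[Byte](1, 2)))  // Array(1, 2)
{code}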



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10221) RowReaderFactory does not work with blobs

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10221.
---
Resolution: Won't Fix

> RowReaderFactory does not work with blobs
> -
>
> Key: SPARK-10221
> URL: https://issues.apache.org/jira/browse/SPARK-10221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Max Schmidt
>
> While using a RowReaderFactory from the Util API here: 
> com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowToTuple(, 
> Class) against a Cassandra table with a column which is described 
> as a ByteBuffer, I get the following stacktrace:
> {quote}
> 8786 [task-result-getter-0] ERROR org.apache.spark.scheduler.TaskSetManager  
> - Task 0.0 in stage 0.0 (TID 0) had a not serializable result: 
> java.nio.HeapByteBuffer
> Serialization stack:
> - object not serializable (class: java.nio.HeapByteBuffer, value: 
> java.nio.HeapByteBuffer[pos=0 lim=2 cap=2])
> - field (class: scala.Tuple4, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple4, 
> (/104.130.160.121,java.nio.HeapByteBuffer[pos=0 lim=2 cap=2],Tue Aug 25 
> 11:00:23 CEST 2015,76.808)); not retrying
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable 
> result: java.nio.HeapByteBuffer
> Serialization stack:
> - object not serializable (class: java.nio.HeapByteBuffer, value: 
> java.nio.HeapByteBuffer[pos=0 lim=2 cap=2])
> - field (class: scala.Tuple4, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple4, 
> (/104.130.160.121,java.nio.HeapByteBuffer[pos=0 lim=2 cap=2],Tue Aug 25 
> 11:00:23 CEST 2015,76.808))
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {quote}
> Using a kind of wrapper-class following bean conventions, doesn't work either.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10502) tidy up the exception message text to be less verbose/"User friendly"

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557353#comment-15557353
 ] 

Xiao Li commented on SPARK-10502:
-

In 2.0, we introduced a new parser, so this issue is no longer valid. 

> tidy up the exception message text to be less verbose/"User friendly"
> -
>
> Key: SPARK-10502
> URL: https://issues.apache.org/jira/browse/SPARK-10502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: N Campbell
>Priority: Minor
>
> When a statement fails to parse, it would be preferable if the exception text 
> were more aligned with other vendors, indicating the syntax error without 
> including the verbose parse tree.
>  select tbint.rnum,tbint.cbint, nth_value( tbint.cbint, '4' ) over ( order by 
> tbint.rnum) from certstring.tbint 
> Errors:
> org.apache.spark.sql.AnalysisException: 
> Unsupported language features in query: select tbint.rnum,tbint.cbint, 
> nth_value( tbint.cbint, '4' ) over ( order by tbint.rnum) from 
> certstring.tbint
> TOK_QUERY 1, 0,40, 94
>   TOK_FROM 1, 36,40, 94
> TOK_TABREF 1, 38,40, 94
>   TOK_TABNAME 1, 38,40, 94
> certstring 1, 38,38, 94
> tbint 1, 40,40, 105
>   TOK_INSERT 0, -1,34, 0
> TOK_DESTINATION 0, -1,-1, 0
>   TOK_DIR 0, -1,-1, 0
> TOK_TMP_FILE 0, -1,-1, 0
> TOK_SELECT 1, 0,34, 12
>   TOK_SELEXPR 1, 2,4, 12
> . 1, 2,4, 12
>   TOK_TABLE_OR_COL 1, 2,2, 7
> tbint 1, 2,2, 7
>   rnum 1, 4,4, 13
>   TOK_SELEXPR 1, 6,8, 23
> . 1, 6,8, 23
>   TOK_TABLE_OR_COL 1, 6,6, 18
> tbint 1, 6,6, 18
>   cbint 1, 8,8, 24
>   TOK_SELEXPR 1, 11,34, 31
> TOK_FUNCTION 1, 11,34, 31
>   nth_value 1, 11,11, 31
>   . 1, 14,16, 47
> TOK_TABLE_OR_COL 1, 14,14, 42
>   tbint 1, 14,14, 42
> cbint 1, 16,16, 48
>   '4' 1, 19,19, 55
>   TOK_WINDOWSPEC 1, 25,34, 82
> TOK_PARTITIONINGSPEC 1, 27,33, 82
>   TOK_ORDERBY 1, 27,33, 82
> TOK_TABSORTCOLNAMEASC 1, 31,33, 82
>   . 1, 31,33, 82
> TOK_TABLE_OR_COL 1, 31,31, 77
>   tbint 1, 31,31, 77
> rnum 1, 33,33, 83
> scala.NotImplementedError: No parse rules for ASTNode type: 882, text: 
> TOK_WINDOWSPEC :
> TOK_WINDOWSPEC 1, 25,34, 82
>   TOK_PARTITIONINGSPEC 1, 27,33, 82
> TOK_ORDERBY 1, 27,33, 82
>   TOK_TABSORTCOLNAMEASC 1, 31,33, 82
> . 1, 31,33, 82
>   TOK_TABLE_OR_COL 1, 31,31, 77
> tbint 1, 31,31, 77
>   rnum 1, 33,33, 83
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1261)
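
A minimal sketch of the requested behaviour from the caller's side 
(illustrative only, not a change to the parser itself): report only the short 
error summary and drop the multi-line parse-tree dump that follows the first 
newline of the AnalysisException message.

{code}
import org.apache.spark.sql.{AnalysisException, SQLContext}

// Hypothetical sketch: keep only the first line of the error message; in the
// report above, the verbose parse tree follows after the first newline.
object ConciseErrors {
  def runQuietly(sqlContext: SQLContext, query: String): Unit =
    try {
      sqlContext.sql(query).show()
    } catch {
      case e: AnalysisException =>
        val summary = e.getMessage.split("\n", 2)(0)
        System.err.println(s"Syntax error: $summary")
    }
}
{code}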






[jira] [Updated] (SPARK-17835) Optimize NaiveBayes mllib wrapper to eliminate extra pass on data

2016-10-08 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17835:

Description: 
SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and left 
mllib as a wrapper. However, there are some differences between mllib and ml in 
how they handle {{labels}}:
* mllib allows input labels such as {-1, +1}, whereas ml assumes the input 
labels are in the range [0, numClasses).
* mllib's {{NaiveBayesModel}} exposes {{labels}}, but ml does not, due to the 
assumption mentioned above.

During the copy in SPARK-14077, we used {{val labels = 
data.map(_.label).distinct().collect().sorted}} to get the distinct labels 
first and then fed them to training. This involves an extra Spark job compared 
with the original implementation. Since {{NaiveBayes}} performs only one 
aggregation during training, adding another pass is not efficient. We can get 
the labels in a single pass during {{NaiveBayes}} training and send them to the 
MLlib side.

  was:
SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and left 
ml as a wrapper. However, there are some differences between mllib and ml in 
how they handle {{labels}}:
* mllib allows input labels such as {-1, +1}, whereas ml assumes the input 
labels are in the range [0, numClasses).
* mllib's {{NaiveBayesModel}} exposes {{labels}}, but ml does not, due to the 
assumption mentioned above.

During the copy in SPARK-14077, we used {{val labels = 
data.map(_.label).distinct().collect().sorted}} to get the distinct labels 
first and then fed them to training. This involves an extra Spark job compared 
with the original implementation. Since {{NaiveBayes}} performs only one 
aggregation during training, adding another pass is not efficient. We can get 
the labels in a single pass during {{NaiveBayes}} training and send them to the 
MLlib side.


> Optimize NaiveBayes mllib wrapper to eliminate extra pass on data
> -
>
> Key: SPARK-17835
> URL: https://issues.apache.org/jira/browse/SPARK-17835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>
> SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and 
> left mllib as a wrapper. However, there are some differences between mllib 
> and ml in how they handle {{labels}}:
> * mllib allows input labels such as {-1, +1}, whereas ml assumes the input 
> labels are in the range [0, numClasses).
> * mllib's {{NaiveBayesModel}} exposes {{labels}}, but ml does not, due to the 
> assumption mentioned above.
> During the copy in SPARK-14077, we used {{val labels = 
> data.map(_.label).distinct().collect().sorted}} to get the distinct labels 
> first and then fed them to training. This involves an extra Spark job 
> compared with the original implementation. Since {{NaiveBayes}} performs only 
> one aggregation during training, adding another pass is not efficient. We can 
> get the labels in a single pass during {{NaiveBayes}} training and send them 
> to the MLlib side.
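
A minimal sketch of the single-pass idea (illustrative only, not the actual ml 
implementation; the per-label counts merely stand in for the real sufficient 
statistics): the distinct labels fall out of the same aggregation, so no 
separate distinct()/collect() job is needed.

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical sketch: one treeAggregate over the data accumulates per-label
// counts; the distinct labels are just the keys of that map, sorted, so no
// extra Spark job is required to discover them.
object SinglePassLabels {
  def countsAndLabels(data: RDD[LabeledPoint]): (Map[Double, Long], Array[Double]) = {
    val counts = data.treeAggregate(Map.empty[Double, Long])(
      seqOp = (m, p) => m.updated(p.label, m.getOrElse(p.label, 0L) + 1L),
      combOp = (m1, m2) => m2.foldLeft(m1) { case (m, (label, c)) =>
        m.updated(label, m.getOrElse(label, 0L) + c)
      })
    (counts, counts.keys.toArray.sorted)
  }
}
{code}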






[jira] [Updated] (SPARK-17835) Optimize NaiveBayes mllib wrapper to eliminate extra pass on data

2016-10-08 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17835:

Description: 
SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and left 
ml as a wrapper. However, there are some differences between mllib and ml in 
how they handle {{labels}}:
* mllib allows input labels such as {-1, +1}, whereas ml assumes the input 
labels are in the range [0, numClasses).
* mllib's {{NaiveBayesModel}} exposes {{labels}}, but ml does not, due to the 
assumption mentioned above.

During the copy in SPARK-14077, we used {{val labels = 
data.map(_.label).distinct().collect().sorted}} to get the distinct labels 
first and then fed them to training. This involves an extra Spark job compared 
with the original implementation. Since {{NaiveBayes}} performs only one 
aggregation during training, adding another pass is not efficient. We can get 
the labels in a single pass during {{NaiveBayes}} training and send them to the 
MLlib side.

  was:
SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and left 
ml as a wrapper. However, there are some differences between mllib and ml in 
how they handle {{labels}}:
* mllib allows input labels such as {-1, +1}, whereas ml assumes the input 
labels are in the range [0, numClasses).
* mllib's {{NaiveBayesModel}} exposes {{labels}}, but ml does not, due to the 
assumption mentioned above.
During the copy in SPARK-14077, we used {{val labels = 
data.map(_.label).distinct().collect().sorted}} to get the distinct labels 
first and then fed them to training. This involves an extra Spark job compared 
with the original implementation. Since {{NaiveBayes}} performs only one 
aggregation during training, adding another pass is not efficient. We can get 
the labels in a single pass during {{NaiveBayes}} training and send them to the 
MLlib side.


> Optimize NaiveBayes mllib wrapper to eliminate extra pass on data
> -
>
> Key: SPARK-17835
> URL: https://issues.apache.org/jira/browse/SPARK-17835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>
> SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and 
> left ml as a wrapper. However, there are some differences between mllib and 
> ml in how they handle {{labels}}:
> * mllib allows input labels such as {-1, +1}, whereas ml assumes the input 
> labels are in the range [0, numClasses).
> * mllib's {{NaiveBayesModel}} exposes {{labels}}, but ml does not, due to the 
> assumption mentioned above.
> During the copy in SPARK-14077, we used {{val labels = 
> data.map(_.label).distinct().collect().sorted}} to get the distinct labels 
> first and then fed them to training. This involves an extra Spark job 
> compared with the original implementation. Since {{NaiveBayes}} performs only 
> one aggregation during training, adding another pass is not efficient. We can 
> get the labels in a single pass during {{NaiveBayes}} training and send them 
> to the MLlib side.






[jira] [Updated] (SPARK-17835) Optimize NaiveBayes mllib wrapper to eliminate extra pass on data

2016-10-08 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17835:

Issue Type: Improvement  (was: Bug)

> Optimize NaiveBayes mllib wrapper to eliminate extra pass on data
> -
>
> Key: SPARK-17835
> URL: https://issues.apache.org/jira/browse/SPARK-17835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>
> SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and 
> left ml as a wrapper. However, there are some differences between mllib and 
> ml in how they handle {{labels}}:
> * mllib allows input labels such as {-1, +1}, whereas ml assumes the input 
> labels are in the range [0, numClasses).
> * mllib's {{NaiveBayesModel}} exposes {{labels}}, but ml does not, due to the 
> assumption mentioned above.
> During the copy in SPARK-14077, we used {{val labels = 
> data.map(_.label).distinct().collect().sorted}} to get the distinct labels 
> first and then fed them to training. This involves an extra Spark job 
> compared with the original implementation. Since {{NaiveBayes}} performs only 
> one aggregation during training, adding another pass is not efficient. We can 
> get the labels in a single pass during {{NaiveBayes}} training and send them 
> to the MLlib side.





