[jira] [Resolved] (SPARK-8115) Remove TestData

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-8115.

Resolution: Later

It sounds like this is not being fixed in the short term. Please reopen it if 
it is still needed.

> Remove TestData
> ---
>
> Key: SPARK-8115
> URL: https://issues.apache.org/jira/browse/SPARK-8115
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Andrew Or
>Priority: Minor
>
> TestData was from the era when we didn't have easy ways to generate test 
> datasets. Now that we have implicits on Seq + toDF, it would make more sense to 
> put the test datasets closer to the test suites.
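With the SQL implicits in scope (assuming a Spark 2.x session named {{spark}}), the Seq + toDF pattern the description mentions looks like this:

{code}
import spark.implicits._

// Build a small test DataFrame inline, right next to the suite that uses it.
val testData = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "value")
testData.createOrReplaceTempView("testData")
{code}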



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10501) support UUID as an atomic type

2016-10-08 Thread Russell Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559271#comment-15559271
 ] 

Russell Spitzer commented on SPARK-10501:
-

It's not that we need it as a unique identifier. It's already a datatype in the 
Cassandra database, but there is no direct translation to a Spark SQL type, so a 
conversion to string must be done. In addition, TimeUUIDs require a custom 
non-bytewise comparator, so a greater-than or less-than lexical comparison of 
them is always incorrect. 

https://datastax-oss.atlassian.net/browse/SPARKC-405

> support UUID as an atomic type
> --
>
> Key: SPARK-10501
> URL: https://issues.apache.org/jira/browse/SPARK-10501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jon Haddad
>Priority: Minor
>
> It's pretty common to use UUIDs instead of integers in order to avoid 
> distributed counters.  
> I've added this, which at least lets me load dataframes that use UUIDs that I 
> can cast to strings:
> {code}
> class UUIDType(AtomicType):
>     pass
> _type_mappings[UUID] = UUIDType
> _atomic_types.append(UUIDType)
> {code}
> But if I try to do anything else with the UUIDs, like this:
> {code}
> ratings.select("userid").distinct().collect()
> {code}
> I get this pile of fun: 
> {code}
> scala.MatchError: UUIDType (of class 
> org.apache.spark.sql.cassandra.types.UUIDType$)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas

2016-10-08 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559240#comment-15559240
 ] 

holdenk commented on SPARK-11758:
-

I believe dropping the index field is intentional (but we should probably 
document it). I'm less certain about time information; what do you think we 
should do with timestamp records?

> Missing Index column while creating a DataFrame from Pandas 
> 
>
> Key: SPARK-11758
> URL: https://issues.apache.org/jira/browse/SPARK-11758
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Linux Debian, PySpark, in local testing.
>Reporter: Leandro Ferrado
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked on a 
> pandas.DataFrame with a 'schema' of StructFields, the function 
> _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - the index column, because of the flag index=False
> - timestamp records, because a Date column can't be the index and Pandas 
> doesn't convert its records to Timestamp type.
> So, converting a DataFrame from Pandas to SQL loses information in scenarios 
> with temporal records.
> Doc: 
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> {code}
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from an list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>         # ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14420) keepLastCheckpoint Param for Python LDA with EM

2016-10-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-14420:

Fix Version/s: 2.0.0

> keepLastCheckpoint Param for Python LDA with EM
> ---
>
> Key: SPARK-14420
> URL: https://issues.apache.org/jira/browse/SPARK-14420
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.0.0
>
>
> See the linked JIRA for the Scala API.  This can be added in spark.ml.  Adding 
> it to spark.mllib is optional IMO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14420) keepLastCheckpoint Param for Python LDA with EM

2016-10-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-14420.
---
Resolution: Duplicate

> keepLastCheckpoint Param for Python LDA with EM
> ---
>
> Key: SPARK-14420
> URL: https://issues.apache.org/jira/browse/SPARK-14420
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See the linked JIRA for the Scala API.  This can be added in spark.ml.  Adding 
> it to spark.mllib is optional IMO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7659) Sort by attributes that are not present in the SELECT clause when there is windowfunction analysis error

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-7659.
--
Resolution: Not A Problem

> Sort by attributes that are not present in the SELECT clause when there is 
> windowfunction analysis error
> 
>
> Key: SPARK-7659
> URL: https://issues.apache.org/jira/browse/SPARK-7659
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Fei Wang
>
> The following SQL gets an analysis error:
> select month,
> sum(product) over (partition by month)
> from windowData order by area



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7659) Sort by attributes that are not present in the SELECT clause when there is windowfunction analysis error

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559117#comment-15559117
 ] 

Xiao Li commented on SPARK-7659:


This should have been fixed in 2.0. Please reopen it if you still hit it. 
Thanks!

> Sort by attributes that are not present in the SELECT clause when there is 
> windowfunction analysis error
> 
>
> Key: SPARK-7659
> URL: https://issues.apache.org/jira/browse/SPARK-7659
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Fei Wang
>
> The following SQL gets an analysis error:
> select month,
> sum(product) over (partition by month)
> from windowData order by area



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11479) add kmeans example for Dataset

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-11479.
---
Resolution: Won't Fix

> add kmeans example for Dataset
> --
>
> Key: SPARK-11479
> URL: https://issues.apache.org/jira/browse/SPARK-11479
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11479) add kmeans example for Dataset

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559114#comment-15559114
 ] 

Xiao Li commented on SPARK-11479:
-

Based on the PR, we should close it now. Please reopen it if you still think it 
is needed. Thanks!

> add kmeans example for Dataset
> --
>
> Key: SPARK-11479
> URL: https://issues.apache.org/jira/browse/SPARK-11479
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10318) Getting issue in spark connectivity with cassandra

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559103#comment-15559103
 ] 

Xiao Li commented on SPARK-10318:
-

Yeah. Will follow your guideline in the future. Thanks!

> Getting issue in spark connectivity with cassandra
> --
>
> Key: SPARK-10318
> URL: https://issues.apache.org/jira/browse/SPARK-10318
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Spark on local mode with centos 6.x
>Reporter: Poorvi Lashkary
>Priority: Minor
>
> Use case: I have to create a Spark SQL DataFrame from a table on Cassandra 
> with a JDBC driver.
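A minimal sketch of this use case with the DataStax spark-cassandra-connector rather than a plain JDBC driver (keyspace and table names below are placeholders):

{code}
// Requires the spark-cassandra-connector package on the classpath.
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .load()
df.registerTempTable("my_table")
{code}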



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6413) For data source tables, we should provide better output for DESCRIBE FORMATTED

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-6413.
--
Resolution: Not A Problem

> For data source tables, we should provide better output for DESCRIBE FORMATTED
> --
>
> Key: SPARK-6413
> URL: https://issues.apache.org/jira/browse/SPARK-6413
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> Right now, we show Hive-specific details like the SerDe. Users will be confused 
> when they see the output of "DESCRIBE FORMATTED" (it is a Hive native command for 
> now) and think the table is not stored in the "right" format. Actually, the 
> table is indeed stored in the right format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6413) For data source tables, we should provide better output for DESCRIBE FORMATTED

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559074#comment-15559074
 ] 

Xiao Li commented on SPARK-6413:


This has been well supported since Spark 2.0, so closing it now. Thanks!
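On 2.0 the command can be issued directly through the SQL interface; a quick check (the table name is a placeholder):

{code}
spark.sql("DESCRIBE FORMATTED my_parquet_table").show(100, truncate = false)
{code}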

> For data source tables, we should provide better output for DESCRIBE FORMATTED
> --
>
> Key: SPARK-6413
> URL: https://issues.apache.org/jira/browse/SPARK-6413
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> Right now, we show Hive-specific details like the SerDe. Users will be confused 
> when they see the output of "DESCRIBE FORMATTED" (it is a Hive native command for 
> now) and think the table is not stored in the "right" format. Actually, the 
> table is indeed stored in the right format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11523) spark_partition_id() considered invalid function

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559072#comment-15559072
 ] 

Xiao Li commented on SPARK-11523:
-

Native views have been supported since 2.0, so this JIRA is no longer needed. Thanks!

> spark_partition_id() considered invalid function
> 
>
> Key: SPARK-11523
> URL: https://issues.apache.org/jira/browse/SPARK-11523
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Simeon Simeonov
>  Labels: hive, sql, views
>
> {{spark_partition_id()}} works correctly in top-level {{SELECT}} statements 
> but is not recognized in {{SELECT}} statements that define views. It seems 
> DDL processing vs. execution in Spark SQL use two different parsers and/or 
> environments.
> In the following examples, instead of the {{test_data}} table you can use any 
> defined table name.
> A top-level statement works:
> {code}
> scala> ctx.sql("select spark_partition_id() as partition_id from 
> test_data").show
> +------------+
> |partition_id|
> +------------+
> |           0|
> ...
> |           0|
> +------------+
> only showing top 20 rows
> {code}
> The same query in a view definition fails with {{Invalid function 
> 'spark_partition_id'}}.
> {code}
> scala> ctx.sql("create view test_view as select spark_partition_id() as 
> partition_id from test_data")
> 15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as 
> select spark_partition_id() as partition_id from test_data
> 15/11/05 01:05:38 INFO ParseDriver: Parse Completed
> 15/11/05 01:05:38 INFO PerfLogger: <PERFLOG method=... from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger: <PERFLOG method=... from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger: <PERFLOG method=... from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger: <PERFLOG method=... from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as 
> select spark_partition_id() as partition_id from test_data
> 15/11/05 01:05:38 INFO ParseDriver: Parse Completed
> 15/11/05 01:05:38 INFO PerfLogger: </PERFLOG method=... end=1446703538519 duration=1 from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger: <PERFLOG method=... from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO CalcitePlanner: Starting Semantic Analysis
> 15/11/05 01:05:38 INFO CalcitePlanner: Creating view default.test_view 
> position=12
> 15/11/05 01:05:38 INFO HiveMetaStore: 0: get_database: default
> 15/11/05 01:05:38 INFO audit: ugi=sim ip=unknown-ip-addr  
> cmd=get_database: default
> 15/11/05 01:05:38 INFO CalcitePlanner: Completed phase 1 of Semantic Analysis
> 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for source tables
> 15/11/05 01:05:38 INFO HiveMetaStore: 0: get_table : db=default tbl=test_data
> 15/11/05 01:05:38 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=test_data
> 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for subqueries
> 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for destination tables
> 15/11/05 01:05:38 INFO Context: New scratch dir is 
> hdfs://localhost:9000/tmp/hive/sim/3fce9b7e-011f-4632-b673-e29067779fa0/hive_2015-11-05_01-05-38_518_4526721093949438849-1
> 15/11/05 01:05:38 INFO CalcitePlanner: Completed getting MetaData in Semantic 
> Analysis
> 15/11/05 01:05:38 INFO BaseSemanticAnalyzer: Not invoking CBO because the 
> statement doesn't have QUERY or EXPLAIN as root and not a CTAS; has create 
> view
> 15/11/05 01:05:38 ERROR Driver: FAILED: SemanticException [Error 10011]: Line 
> 1:32 Invalid function 'spark_partition_id'
> org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:32 Invalid function 
> 'spark_partition_id'
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:925)
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1265)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:95)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:79)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:133)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:110)
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:205)
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:149)
>   at 
> 

[jira] [Closed] (SPARK-11523) spark_partition_id() considered invalid function

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-11523.
---
Resolution: Not A Problem

> spark_partition_id() considered invalid function
> 
>
> Key: SPARK-11523
> URL: https://issues.apache.org/jira/browse/SPARK-11523
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Simeon Simeonov
>  Labels: hive, sql, views
>
> {{spark_partition_id()}} works correctly in top-level {{SELECT}} statements 
> but is not recognized in {{SELECT}} statements that define views. It seems 
> DDL processing vs. execution in Spark SQL use two different parsers and/or 
> environments.
> In the following examples, instead of the {{test_data}} table you can use any 
> defined table name.
> A top-level statement works:
> {code}
> scala> ctx.sql("select spark_partition_id() as partition_id from 
> test_data").show
> +------------+
> |partition_id|
> +------------+
> |           0|
> ...
> |           0|
> +------------+
> only showing top 20 rows
> {code}
> The same query in a view definition fails with {{Invalid function 
> 'spark_partition_id'}}.
> {code}
> scala> ctx.sql("create view test_view as select spark_partition_id() as 
> partition_id from test_data")
> 15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as 
> select spark_partition_id() as partition_id from test_data
> 15/11/05 01:05:38 INFO ParseDriver: Parse Completed
> 15/11/05 01:05:38 INFO PerfLogger: <PERFLOG method=... from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger: <PERFLOG method=... from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger: <PERFLOG method=... from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger: <PERFLOG method=... from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as 
> select spark_partition_id() as partition_id from test_data
> 15/11/05 01:05:38 INFO ParseDriver: Parse Completed
> 15/11/05 01:05:38 INFO PerfLogger: </PERFLOG method=... end=1446703538519 duration=1 from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO PerfLogger: <PERFLOG method=... from=org.apache.hadoop.hive.ql.Driver>
> 15/11/05 01:05:38 INFO CalcitePlanner: Starting Semantic Analysis
> 15/11/05 01:05:38 INFO CalcitePlanner: Creating view default.test_view 
> position=12
> 15/11/05 01:05:38 INFO HiveMetaStore: 0: get_database: default
> 15/11/05 01:05:38 INFO audit: ugi=sim ip=unknown-ip-addr  
> cmd=get_database: default
> 15/11/05 01:05:38 INFO CalcitePlanner: Completed phase 1 of Semantic Analysis
> 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for source tables
> 15/11/05 01:05:38 INFO HiveMetaStore: 0: get_table : db=default tbl=test_data
> 15/11/05 01:05:38 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=test_data
> 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for subqueries
> 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for destination tables
> 15/11/05 01:05:38 INFO Context: New scratch dir is 
> hdfs://localhost:9000/tmp/hive/sim/3fce9b7e-011f-4632-b673-e29067779fa0/hive_2015-11-05_01-05-38_518_4526721093949438849-1
> 15/11/05 01:05:38 INFO CalcitePlanner: Completed getting MetaData in Semantic 
> Analysis
> 15/11/05 01:05:38 INFO BaseSemanticAnalyzer: Not invoking CBO because the 
> statement doesn't have QUERY or EXPLAIN as root and not a CTAS; has create 
> view
> 15/11/05 01:05:38 ERROR Driver: FAILED: SemanticException [Error 10011]: Line 
> 1:32 Invalid function 'spark_partition_id'
> org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:32 Invalid function 
> 'spark_partition_id'
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:925)
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1265)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:95)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:79)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:133)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:110)
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:205)
>   at 
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:149)
>   at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:10512)
>   at 
> 

[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559068#comment-15559068
 ] 

Xiao Li commented on SPARK-11087:
-

Can you retry it using the latest master/2.0.1 branch? Thanks!
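One way to verify this on a recent build is to look for PushedFilters in the physical plan; a rough sketch (the path is a placeholder, column names are taken from the query in the report):

{code}
spark.conf.set("spark.sql.orc.filterPushdown", "true")
val df = spark.read.orc("/path/to/4D").filter("zone = 2 and x = 320 and y = 117")
df.explain()  // the scan node should list PushedFilters when pushdown applies
{code}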

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external Hive table stored as a partitioned ORC file (see the table 
> schema below). I tried to query the table with a where clause:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")
> But in the log file, with debug logging enabled, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate to be generated anyway (because of the 
> where clause).
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_name              data_type   comment
> 
> date                    int
> hh                      int
> x                       int
> y                       int
> height                  float
> u                       float
> v                       float
> w                       float
> ph                      float
> phb                     float
> t                       float
> p                       float
> pb                      float
> qvapor                  float
> qgraup                  float
> qnice                   float
> qnrain                  float
> tke_pbl                 float
> el_pbl                  float
> qcloud                  float
> 
> # Partition Information
> # col_name              data_type   comment
> 
> zone                    int
> z                       int
> year                    int
> month                   int
> 
> # Detailed Table Information
> Database:               default
> Owner:                  patcharee
> CreateTime:             Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:         UNKNOWN
> Protect Mode:           None
> Retention:              0
> Location:               hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D
> Table Type:             EXTERNAL_TABLE
> Table Parameters:
>   EXTERNAL                TRUE
>   comment                 this table is imported from rwf_data/*/wrf/*
>   last_modified_by        patcharee
>   last_modified_time      1439806692
>   orc.compress            ZLIB
>   transient_lastDdlTime   1439806692
> 
> # Storage Information
> SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed:             No
> Num Buckets:            -1
> Bucket Columns:         []
> Sort Columns:           []
> Storage Desc Params:
>   serialization.format    1
> Time taken: 0.388 seconds, Fetched: 58 row(s)
> 
> Data was inserted into this table by another Spark job:
> 

[jira] [Closed] (SPARK-9359) Support IntervalType for Parquet

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-9359.
--
Assignee: (was: Liang-Chi Hsieh)

> Support IntervalType for Parquet
> 
>
> Key: SPARK-9359
> URL: https://issues.apache.org/jira/browse/SPARK-9359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>
> SPARK-8753 introduced {{IntervalType}} which corresponds to Parquet 
> {{INTERVAL}} logical type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9359) Support IntervalType for Parquet

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-9359.

Resolution: Duplicate

> Support IntervalType for Parquet
> 
>
> Key: SPARK-9359
> URL: https://issues.apache.org/jira/browse/SPARK-9359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Liang-Chi Hsieh
>
> SPARK-8753 introduced {{IntervalType}} which corresponds to Parquet 
> {{INTERVAL}} logical type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9205) org.apache.spark.sql.hive.HiveSparkSubmitSuite failing for Scala 2.11

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559052#comment-15559052
 ] 

Xiao Li commented on SPARK-9205:


This is not an issue, right? Since this JIRA is stale, let us close it now. If 
needed, we can create a new JIRA against new versions.

> org.apache.spark.sql.hive.HiveSparkSubmitSuite failing for Scala 2.11
> -
>
> Key: SPARK-9205
> URL: https://issues.apache.org/jira/browse/SPARK-9205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-Maven/AMPLAB_JENKINS_BUILD_PROFILE=scala2.11,label=centos/7/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9205) org.apache.spark.sql.hive.HiveSparkSubmitSuite failing for Scala 2.11

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-9205.
--
Resolution: Cannot Reproduce

> org.apache.spark.sql.hive.HiveSparkSubmitSuite failing for Scala 2.11
> -
>
> Key: SPARK-9205
> URL: https://issues.apache.org/jira/browse/SPARK-9205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-Maven/AMPLAB_JENKINS_BUILD_PROFILE=scala2.11,label=centos/7/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2016-10-08 Thread Ron Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559044#comment-15559044
 ] 

Ron Hu commented on SPARK-17626:


In the CBO design spec we posted in 
https://issues.apache.org/jira/browse/SPARK-16026,
we illustrated a Multi-way Join Ordering Optimization algorithm using a dynamic 
programming technique.  This algorithm should be able to pick the best join 
re-ordering plan. It is possible that the search space is big.  We need some 
heuristics to reduce the search space. 

As Zhenhua pointed out, we can identify all the primary-key/foreign-key joins 
as we collect the number of distinct values to infer whether or not a join column 
is a primary key.  If a join relation has a primary-key join column, then it is a 
dimension table.  If a join relation has foreign-key columns, then it is a fact 
table.  Once a fact table is identified, we form a star schema by finding 
all the dimension tables that have join conditions with the given fact table.

As for the selectivity hint, we do not need a selectivity hint to deal with 
comparison expressions like:
  column_name [comparison operator] constant_value
where the comparison operator is =, <, <=, >, >=, etc. 
This is because, with the histogram we are implementing now in CBO, we can find 
the filtering selectivity properly.  However, for the following cases, a 
selectivity hint will be helpful.

Case 1:
  WHERE o_comment not like '%special%request%'  /* TPC-H Q13 */
A histogram cannot provide such detailed statistics information for a string 
pattern, which can be a complex regular expression.

Case 2:
  WHERE l_commitdate < l_receiptdate /* TPC-H Q4 */
Today we define a one-dimensional histogram to keep track of the data distribution 
of a single column.  We do not handle non-equality relationships between two 
columns.
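A rough sketch of the primary-key inference mentioned above, assuming per-table row counts and per-column distinct counts are already collected (the names and the 0.95 tolerance are illustrative, not from the design spec):

{code}
case class ColumnStats(distinctCount: Long)
case class TableStats(rowCount: Long, cols: Map[String, ColumnStats])

// A join column whose distinct count is close to the table's row count is treated
// as a primary key, so its table is classified as a dimension table; otherwise the
// column is a candidate foreign key and its table a fact table.
def isLikelyPrimaryKey(t: TableStats, col: String, tolerance: Double = 0.95): Boolean =
  t.cols.get(col).exists(_.distinctCount >= tolerance * t.rowCount)
{code}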


> TPC-DS performance improvements using star-schema heuristics
> 
>
> Key: SPARK-17626
> URL: https://issues.apache.org/jira/browse/SPARK-17626
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Priority: Critical
> Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schemas, which are star schemas 
> with dimensions linking to dimensions. A star schema consists of a fact table 
> referencing a number of dimension tables. The fact table holds the main data 
> about a business. A dimension table, usually a smaller table, describes data 
> reflecting a dimension/attribute of the business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans of large fact table joins. Manual rewrite of 
> some of the queries into selective fact-dimensions joins resulted in 
> significant performance improvement. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. The performance 
> testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed                99
> Failed                 0
> Total q time (s)   14,962
> Max time            1,467
> Min time                3
> Mean time             145
> Geomean                44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)              -19%
> Mean time improved (%)               -19%
> Geomean improved (%)                 -24%
> End to end improved (seconds)      -3,603
> Number of queries improved (>10%)      45
> Number of queries degraded (>10%)       6
> Number of queries unchanged            48
> Top 10 queries improved (%)          -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base 
> tables. Selectivity hint is optional and it was not used in the above TPC-DS 
> tests. 
> \\



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Closed] (SPARK-10101) Spark JDBC writer mapping String to TEXT or VARCHAR

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10101.
---
Resolution: Not A Problem

> Spark JDBC writer mapping String to TEXT or VARCHAR
> ---
>
> Key: SPARK-10101
> URL: https://issues.apache.org/jira/browse/SPARK-10101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Rama Mullapudi
>
> Currently the JDBC writer maps the String data type to TEXT on the database, but 
> VARCHAR is the ANSI SQL standard, and some older databases like Oracle, DB2, and 
> Teradata do not support TEXT as a data type.
> Since VARCHAR needs a maximum length to be specified and different databases 
> support different maximum values, what would be the best way to implement this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10101) Spark JDBC writer mapping String to TEXT or VARCHAR

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558901#comment-15558901
 ] 

Xiao Li commented on SPARK-10101:
-

This has been resolved in the master. If you still hit any bug, please open a 
new JIRA.
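For anyone needing a different mapping on an older branch, Spark's {{org.apache.spark.sql.jdbc.JdbcDialect}} API allows overriding the default type mapping per database; a minimal sketch (the URL prefix and VARCHAR length are placeholders):

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Map StringType to VARCHAR2(255) instead of TEXT for Oracle-style JDBC URLs.
object VarcharDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR2(255)", Types.VARCHAR))
    case _          => None
  }
}

JdbcDialects.registerDialect(VarcharDialect)
{code}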

> Spark JDBC writer mapping String to TEXT or VARCHAR
> ---
>
> Key: SPARK-10101
> URL: https://issues.apache.org/jira/browse/SPARK-10101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Rama Mullapudi
>
> Currently the JDBC writer maps the String data type to TEXT on the database, but 
> VARCHAR is the ANSI SQL standard, and some older databases like Oracle, DB2, and 
> Teradata do not support TEXT as a data type.
> Since VARCHAR needs a maximum length to be specified and different databases 
> support different maximum values, what would be the best way to implement this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9265) Dataframe.limit joined with another dataframe can be non-deterministic

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-9265.
--
Resolution: Not A Problem

> Dataframe.limit joined with another dataframe can be non-deterministic
> --
>
> Key: SPARK-9265
> URL: https://issues.apache.org/jira/browse/SPARK-9265
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Tathagata Das
>Priority: Critical
>
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> val recentFailures = table("failed_suites").cache()
> val topRecentFailures = 
> recentFailures.groupBy('suiteName).agg(count("*").as('failCount)).orderBy('failCount.desc).limit(10)
> topRecentFailures.show(100)
> val mot = topRecentFailures.as("a").join(recentFailures.as("b"), 
> $"a.suiteName" === $"b.suiteName")
>   
> (1 to 10).foreach { i => 
>   println(s"$i: " + mot.count())
> }
> {code}
> This shows.
> {code}
> +----------------+---------+
> |       suiteName|failCount|
> +----------------+---------+
> |org.apache.spark|       85|
> |org.apache.spark|       26|
> |org.apache.spark|       26|
> |org.apache.spark|       17|
> |org.apache.spark|       17|
> |org.apache.spark|       15|
> |org.apache.spark|       13|
> |org.apache.spark|       13|
> |org.apache.spark|       11|
> |org.apache.spark|        9|
> +----------------+---------+
> 1: 174
> 2: 166
> 3: 174
> 4: 106
> 5: 158
> 6: 110
> 7: 174
> 8: 158
> 9: 166
> 10: 106
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9265) Dataframe.limit joined with another dataframe can be non-deterministic

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558875#comment-15558875
 ] 

Xiao Li commented on SPARK-9265:


This has been resolved since our Optimizer pushes `Limit` down below `Sort`. 
Closing it now. Thanks!

> Dataframe.limit joined with another dataframe can be non-deterministic
> --
>
> Key: SPARK-9265
> URL: https://issues.apache.org/jira/browse/SPARK-9265
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Tathagata Das
>Priority: Critical
>
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> val recentFailures = table("failed_suites").cache()
> val topRecentFailures = 
> recentFailures.groupBy('suiteName).agg(count("*").as('failCount)).orderBy('failCount.desc).limit(10)
> topRecentFailures.show(100)
> val mot = topRecentFailures.as("a").join(recentFailures.as("b"), 
> $"a.suiteName" === $"b.suiteName")
>   
> (1 to 10).foreach { i => 
>   println(s"$i: " + mot.count())
> }
> {code}
> This shows.
> {code}
> +----------------+---------+
> |       suiteName|failCount|
> +----------------+---------+
> |org.apache.spark|       85|
> |org.apache.spark|       26|
> |org.apache.spark|       26|
> |org.apache.spark|       17|
> |org.apache.spark|       17|
> |org.apache.spark|       15|
> |org.apache.spark|       13|
> |org.apache.spark|       13|
> |org.apache.spark|       11|
> |org.apache.spark|        9|
> +----------------+---------+
> 1: 174
> 2: 166
> 3: 174
> 4: 106
> 5: 158
> 6: 110
> 7: 174
> 8: 158
> 9: 166
> 10: 106
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2016-10-08 Thread Ioana Delaney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558762#comment-15558762
 ] 

Ioana Delaney commented on SPARK-17626:
---

[~mikewzh] Thank you. Yes, having informational RI constraints available in 
Spark will open many opportunities for optimizations. Star schema detection is 
just one of them. Our team here at IBM has already started some initial design 
discussions in this direction. We are hoping to have something more concrete 
soon. 


> TPC-DS performance improvements using star-schema heuristics
> 
>
> Key: SPARK-17626
> URL: https://issues.apache.org/jira/browse/SPARK-17626
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Priority: Critical
> Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schemas, which are star schemas 
> with dimensions linking to dimensions. A star schema consists of a fact table 
> referencing a number of dimension tables. The fact table holds the main data 
> about a business. A dimension table, usually a smaller table, describes data 
> reflecting a dimension/attribute of the business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans of large fact table joins. Manual rewrite of 
> some of the queries into selective fact-dimensions joins resulted in 
> significant performance improvement. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. The performance 
> testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed                99
> Failed                 0
> Total q time (s)   14,962
> Max time            1,467
> Min time                3
> Mean time             145
> Geomean                44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)              -19%
> Mean time improved (%)               -19%
> Geomean improved (%)                 -24%
> End to end improved (seconds)      -3,603
> Number of queries improved (>10%)      45
> Number of queries degraded (>10%)       6
> Number of queries unchanged            48
> Top 10 queries improved (%)          -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base 
> tables. Selectivity hint is optional and it was not used in the above TPC-DS 
> tests. 
> \\



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10496) Efficient DataFrame cumulative sum

2016-10-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558724#comment-15558724
 ] 

Reynold Xin commented on SPARK-10496:
-

I think there are two separate issues here:

1. The API to run cumulative sum right now is fairly awkward. Either do it 
through a complicated join, or through window functions that still look fairly 
verbose. I've created a notebook that contains two short examples to do this in 
SQL and in DataFrames: 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html

It would make sense to me to create a simpler API for this case, since it is 
very common. This API under the hood can just call the existing window function 
API.

2. The implementation, for cases when there is a single window partition, is 
slow, because it requires shuffling all the data. This can technically be run as 
just a prefix scan. In this case, I'd add an optimizer rule or physical plan 
changes to improve this.
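For example, the window-function form of a running total (the verbose version referred to above) looks like this, assuming a Spark 2.x session named {{spark}} and made-up column names:

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

val df = spark.range(1, 6).toDF("id").withColumn("x", col("id") * 10)
// No partitionBy, so all rows are shuffled into a single partition -- exactly the
// inefficiency described in point 2.
val w = Window.orderBy("id").rowsBetween(Long.MinValue, 0)
df.withColumn("y", sum("x").over(w)).show()
{code}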



> Efficient DataFrame cumulative sum
> --
>
> Key: SPARK-10496
> URL: https://issues.apache.org/jira/browse/SPARK-10496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Goal: Given a DataFrame with a numeric column X, create a new column Y which 
> is the cumulative sum of X.
> This can be done with window functions, but it is not efficient for a large 
> number of rows.  It could be done more efficiently using a prefix sum/scan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5818) unable to use "add jar" in hql

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-5818.
--
Resolution: Not A Problem

This is now supported. Please try the latest branch. Thanks!
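For example, on a 2.x branch the same command goes through the SQL interface (using the jar path from the report; assumes a SparkSession named {{spark}}):

{code}
spark.sql("ADD JAR /tmp/brickhouse-0.6.0.jar")
{code}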

> unable to use "add jar" in hql
> --
>
> Key: SPARK-5818
> URL: https://issues.apache.org/jira/browse/SPARK-5818
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1
>Reporter: pengxu
>
> In Spark 1.2.0 and 1.2.1, it is not possible to use the Hive command "add jar" 
> in hql.
> It seems that the problem in SPARK-2219 still exists.
> The problem can be reproduced as described below. Suppose the jar file 
> is named brickhouse-0.6.0.jar and is placed in the /tmp directory.
> {code}
> spark-shell>import org.apache.spark.sql.hive._
> spark-shell>val sqlContext = new HiveContext(sc)
> spark-shell>import sqlContext._
> spark-shell>hql("add jar /tmp/brickhouse-0.6.0.jar")
> {code}
> The error message is shown below.
> {code:title=Error Log}
> 15/02/15 01:36:31 ERROR SessionState: Unable to register 
> /tmp/brickhouse-0.6.0.jar
> Exception: org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be 
> cast to java.net.URLClassLoader
> java.lang.ClassCastException: 
> org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to 
> java.net.URLClassLoader
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.addToClassPath(Utilities.java:1921)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.registerJar(SessionState.java:599)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState$ResourceType$2.preHook(SessionState.java:658)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:732)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:717)
>   at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:54)
>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:319)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
>   at 
> org.apache.spark.sql.hive.execution.AddJar.sideEffectResult$lzycompute(commands.scala:74)
>   at 
> org.apache.spark.sql.hive.execution.AddJar.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
>   at org.apache.spark.sql.hive.execution.AddJar.execute(commands.scala:68)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
>   at 
> org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
>   at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108)
>   at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:102)
>   at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:106)
>   at 
> $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
>   at 
> $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:29)
>   at 
> $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
>   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
>   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
>   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)
>   at $line30.$read$$iwC$$iwC$$iwC$$iwC.(:39)
>   at $line30.$read$$iwC$$iwC$$iwC.(:41)
>   at $line30.$read$$iwC$$iwC.(:43)
>   at $line30.$read$$iwC.(:45)
>   at $line30.$read.(:47)
>   at $line30.$read$.(:51)
>   at $line30.$read$.()
>   at $line30.$eval$.(:7)
>   at $line30.$eval$.()
>   at $line30.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
>   at 

[jira] [Commented] (SPARK-7097) Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558718#comment-15558718
 ] 

Xiao Li commented on SPARK-7097:


This will be resolved by the ongoing CBO work, so closing it now. Thanks!

> Partitioned tables should only consider referred partitions in query during 
> size estimation for checking against autoBroadcastJoinThreshold
> ---
>
> Key: SPARK-7097
> URL: https://issues.apache.org/jira/browse/SPARK-7097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1
>Reporter: Yash Datta
>
> Currently, when deciding whether to create a HashJoin or a ShuffleHashJoin, 
> the size estimation of the partitioned tables involved considers the size of the 
> entire table. This results in many query plans using shuffle hash joins, 
> where in fact only a small number of partitions may be referenced by the 
> actual query (due to additional filters), and hence these could be run using a 
> BroadcastHash join.
> The query plan should consider the size of only the referenced partitions in 
> such cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7097) Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-7097.
--
Resolution: Won't Fix

> Partitioned tables should only consider referred partitions in query during 
> size estimation for checking against autoBroadcastJoinThreshold
> ---
>
> Key: SPARK-7097
> URL: https://issues.apache.org/jira/browse/SPARK-7097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1
>Reporter: Yash Datta
>
> Currently, when deciding whether to create a HashJoin or a ShuffleHashJoin, 
> the size estimation of the partitioned tables involved considers the size of the 
> entire table. This results in many query plans using shuffle hash joins, 
> where in fact only a small number of partitions may be referenced by the 
> actual query (due to additional filters), and hence these could be run using a 
> BroadcastHash join.
> The query plan should consider the size of only the referenced partitions in 
> such cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11062) Thrift server does not support operationLog

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-11062.
-
Resolution: Duplicate

> Thrift server does not support operationLog
> ---
>
> Key: SPARK-11062
> URL: https://issues.apache.org/jira/browse/SPARK-11062
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Navis
>Priority: Trivial
>
> Currently, SparkExecuteStatementOperation skips the beforeRun/afterRun 
> methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11055) Use mixing hash-based and sort-based aggregation in TungstenAggregationIterator

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558706#comment-15558706
 ] 

Xiao Li commented on SPARK-11055:
-

Based on the PR, Davies did similar work in [SPARK-11425] and [SPARK-11486]: Improve 
hybrid aggregation.

> Use mixing hash-based and sort-based aggregation in 
> TungstenAggregationIterator
> ---
>
> Key: SPARK-11055
> URL: https://issues.apache.org/jira/browse/SPARK-11055
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> In TungstenAggregationIterator we switch to sort-based aggregation when we 
> can't allocate more memory for the hashmap.
> However, using external sorter-based aggregation writes too many 
> key-value pairs to disk. We should use mixed hash-based and sort-based 
> aggregation to reduce the key-value pairs that need to be written to disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11055) Use mixing hash-based and sort-based aggregation in TungstenAggregationIterator

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-11055.
---
Resolution: Duplicate

> Use mixing hash-based and sort-based aggregation in 
> TungstenAggregationIterator
> ---
>
> Key: SPARK-11055
> URL: https://issues.apache.org/jira/browse/SPARK-11055
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> In TungstenAggregationIterator we switch to sort-based aggregation when we 
> can't allocate more memory for the hashmap.
> However, using external sorter-based aggregation writes too many 
> key-value pairs to disk. We should use mixed hash-based and sort-based 
> aggregation to reduce the key-value pairs that need to be written to disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10794) Spark-SQL- select query on table column with binary Data Type displays error message- java.lang.ClassCastException: java.lang.String cannot be cast to [B

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558631#comment-15558631
 ] 

Xiao Li commented on SPARK-10794:
-

The related parts have changed a lot. Could you retry it? Thanks!

> Spark-SQL- select query on table column with binary Data Type displays error 
> message- java.lang.ClassCastException: java.lang.String cannot be cast to [B
> -
>
> Key: SPARK-10794
> URL: https://issues.apache.org/jira/browse/SPARK-10794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Spark 1.5.0 running on MapR 5.0 sandbox
>Reporter: Anilkumar Kalshetti
>Priority: Critical
> Attachments: binaryDataType.png, spark_1_5_0.png, testbinary.txt
>
>
> Spark-SQL connected to Hive Metastore-- MapR5.0 has Hive 1.0.0
> Use the beeline interface for Spark-SQL.
> 1] Execute the query below to create the table:
> CREATE TABLE default.testbinary  ( 
> c1 binary, 
> c2 string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;
> 2] Copy the attachment file: testbinary.txt in VM directory - /home/mapr/data/
> and execute below script to load data in table
> LOAD DATA LOCAL INPATH '/home/mapr/data/testbinary.txt' INTO TABLE testbinary
> //testbinary.txt  contains data
> 1001,'russia'
> 3] Execute below 'Describe' command to get table information, and select 
> command to get table data
> describe  testbinary;
> SELECT c1 FROM testbinary;
> 4] Select query displays error message:
>  java.lang.ClassCastException: java.lang.String cannot be cast to [B 
> Info:  for same table - select query on column c2 - string datatype works 
> properly
> SELECT c2 FROM testbinary;
> Please refer screenshot- binaryDataType.png



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10933) Spark SQL Joins should have option to fail query when row multiplication is encountered

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10933.
---
Resolution: Won't Fix

> Spark SQL Joins should have option to fail query when row multiplication is 
> encountered
> ---
>
> Key: SPARK-10933
> URL: https://issues.apache.org/jira/browse/SPARK-10933
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephen Link
>Priority: Minor
>
> When constructing spark sql queries, we commonly run into scenarios where 
> users have inadvertently caused a cartesian product/row expansion. It is 
> sometimes possible to detect this in advance with separate queries, but it 
> would be far more ideal if it was possible to have a setting that disallowed 
> join keys showing up multiple times on both sides of a join operation.
> This setting would belong in SQLConf. The functionality could likely be 
> implemented by forcing a sorted shuffle, then checking for duplication on the 
> streamed results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10933) Spark SQL Joins should have option to fail query when row multiplication is encountered

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558625#comment-15558625
 ] 

Xiao Li commented on SPARK-10933:
-

Now, we have a conf `spark.sql.crossJoin.enabled`. 
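
For reference, a minimal sketch of how the flag can be set (assuming a Spark 
2.0+ session named `spark`; this is just an illustration):

{code}
// Allow plans that contain a cartesian product for the current session.
spark.conf.set("spark.sql.crossJoin.enabled", "true")

// Or set it once when the session is built.
val session = org.apache.spark.sql.SparkSession.builder()
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()
{code}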

Let me close it now. If you still think we need an extra conf, please reopen it. 
Thanks!

> Spark SQL Joins should have option to fail query when row multiplication is 
> encountered
> ---
>
> Key: SPARK-10933
> URL: https://issues.apache.org/jira/browse/SPARK-10933
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephen Link
>Priority: Minor
>
> When constructing spark sql queries, we commonly run into scenarios where 
> users have inadvertently caused a cartesian product/row expansion. It is 
> sometimes possible to detect this in advance with separate queries, but it 
> would be far more ideal if it was possible to have a setting that disallowed 
> join keys showing up multiple times on both sides of a join operation.
> This setting would belong in SQLConf. The functionality could likely be 
> implemented by forcing a sorted shuffle, then checking for duplication on the 
> streamed results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10805) JSON Data Frame does not return correct string lengths

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558595#comment-15558595
 ] 

Xiao Li commented on SPARK-10805:
-

Finding the max length for each field is pretty expensive: it means reading all 
the records. When you read a file, the schema is inferred from it; even if we 
computed the max lengths at that point, new records could be appended later and 
invalidate them. 

Now, CBO is being implemented. Thus, this part should be resolved with CBO.
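
If the maximum observed lengths are really needed today, they can be computed 
explicitly with an aggregation (a sketch only; it performs exactly the full 
scan described above, and `jsonDF` stands for a DataFrame loaded with the JSON 
reader):

{code}
import org.apache.spark.sql.functions.{col, length, max}
import org.apache.spark.sql.types.StringType

// Max observed length per string column; this reads every record.
val maxLens = jsonDF.schema.fields
  .filter(_.dataType == StringType)
  .map(f => max(length(col(f.name))).alias(f.name))

jsonDF.agg(maxLens.head, maxLens.tail: _*).show()
{code}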

> JSON Data Frame does not return correct string lengths
> --
>
> Key: SPARK-10805
> URL: https://issues.apache.org/jira/browse/SPARK-10805
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Jeff Li
>Priority: Minor
>
> Here is the sample code to run the test 
> @Test
>   public void runSchemaTest() throws Exception {
>   DataFrame jsonDataFrame = 
> sqlContext.jsonFile("src/test/resources/jsontransform/json.sampledata.json");
>   jsonDataFrame.printSchema();
>   StructType jsonSchema = jsonDataFrame.schema();
>   StructField[] dataFields = jsonSchema.fields();
>   for ( int fieldIndex = 0; fieldIndex < dataFields.length;  
> fieldIndex++) {
>   StructField aField = dataFields[fieldIndex];
>   DataType aType = aField.dataType();
>   System.out.println("name: " + aField.name() + " type: " 
> + aType.typeName()
>   + " size: " +aType.defaultSize());
>   }
>  }
> name: _id type: string size: 4096
> name: firstName type: string size: 4096
> name: lastName type: string size: 4096
> In my case, the _id: 1 character, first name: 4 characters, and last name: 7 
> characters). 
> The Spark JSON Data frame should have a way to tell the maximum length of 
> each JSON String elements in the JSON document.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10805) JSON Data Frame does not return correct string lengths

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-10805.
-
Resolution: Won't Fix

> JSON Data Frame does not return correct string lengths
> --
>
> Key: SPARK-10805
> URL: https://issues.apache.org/jira/browse/SPARK-10805
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Jeff Li
>Priority: Minor
>
> Here is the sample code to run the test 
> @Test
>   public void runSchemaTest() throws Exception {
>   DataFrame jsonDataFrame = 
> sqlContext.jsonFile("src/test/resources/jsontransform/json.sampledata.json");
>   jsonDataFrame.printSchema();
>   StructType jsonSchema = jsonDataFrame.schema();
>   StructField[] dataFields = jsonSchema.fields();
>   for ( int fieldIndex = 0; fieldIndex < dataFields.length;  
> fieldIndex++) {
>   StructField aField = dataFields[fieldIndex];
>   DataType aType = aField.dataType();
>   System.out.println("name: " + aField.name() + " type: " 
> + aType.typeName()
>   + " size: " +aType.defaultSize());
>   }
>  }
> name: _id type: string size: 4096
> name: firstName type: string size: 4096
> name: lastName type: string size: 4096
> In my case, the _id: 1 character, first name: 4 characters, and last name: 7 
> characters). 
> The Spark JSON Data frame should have a way to tell the maximum length of 
> each JSON String elements in the JSON document.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10972) UDFs in SQL joins

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558581#comment-15558581
 ] 

Xiao Li edited comment on SPARK-10972 at 10/8/16 7:34 PM:
--

Also try to use the SQL interface? It can be mixed with the Dataset/DataFrame 
APIs.


was (Author: smilegator):
Also try to use the SQL interface?

> UDFs in SQL joins
> -
>
> Key: SPARK-10972
> URL: https://issues.apache.org/jira/browse/SPARK-10972
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Michael Malak
>
> Currently expressions used to .join() in DataFrames are limited to column 
> names plus the operators exposed in org.apache.spark.sql.Column.
> It would be nice to be able to .join() based on a UDF, such as, say, 
> euclideanDistance(col1, col2) < 0.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10972) UDFs in SQL joins

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558581#comment-15558581
 ] 

Xiao Li commented on SPARK-10972:
-

Also try to use the SQL interface?
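
A rough sketch of what using the SQL interface for this could look like (the 
view names, column names and the `euclideanDistance` UDF are placeholders, and 
a non-equi condition like this may also need spark.sql.crossJoin.enabled=true):

{code}
// Register the UDF and expose the DataFrames as temp views (names are placeholders).
spark.udf.register("euclideanDistance", (x: Double, y: Double) => math.abs(x - y))
left.createOrReplaceTempView("l")
right.createOrReplaceTempView("r")

// The UDF can then appear directly in the join condition.
val joined = spark.sql(
  "SELECT * FROM l JOIN r ON euclideanDistance(l.col1, r.col2) < 0.1")
{code}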

> UDFs in SQL joins
> -
>
> Key: SPARK-10972
> URL: https://issues.apache.org/jira/browse/SPARK-10972
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Michael Malak
>
> Currently expressions used to .join() in DataFrames are limited to column 
> names plus the operators exposed in org.apache.spark.sql.Column.
> It would be nice to be able to .join() based on a UDF, such as, say, 
> euclideanDistance(col1, col2) < 0.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10972) UDFs in SQL joins

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558577#comment-15558577
 ] 

Xiao Li commented on SPARK-10972:
-

There is a workaround: you can specify the filter above the join. 

Yeah, the performance might not be as good as treating it as a join condition. 
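
A sketch of that workaround with the DataFrame API (the input DataFrames, 
column names and the distance UDF are placeholders):

{code}
import org.apache.spark.sql.functions.{col, udf}

// Placeholder UDF; any predicate UDF works the same way.
val euclideanDistance = udf((x: Double, y: Double) => math.abs(x - y))

// Join without a condition (a cartesian product, which may require
// spark.sql.crossJoin.enabled=true), then apply the UDF as a filter on top.
val joined = left.join(right)
  .filter(euclideanDistance(col("col1"), col("col2")) < 0.1)
{code}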

> UDFs in SQL joins
> -
>
> Key: SPARK-10972
> URL: https://issues.apache.org/jira/browse/SPARK-10972
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Michael Malak
>
> Currently expressions used to .join() in DataFrames are limited to column 
> names plus the operators exposed in org.apache.spark.sql.Column.
> It would be nice to be able to .join() based on a UDF, such as, say, 
> euclideanDistance(col1, col2) < 0.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets

2016-10-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558572#comment-15558572
 ] 

Cody Koeninger commented on SPARK-17147:


I talked with Sean in person about this, and think there's a way to move 
forward.  I'll start hacking on it.

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
> 
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>
> When Kafka does log compaction offsets often end up with gaps, meaning the 
> next requested offset will be frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers

2016-10-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558557#comment-15558557
 ] 

Cody Koeninger commented on SPARK-4960:
---

Is this idea pretty much dead at this point? It seems like most attention has 
moved off of the receiver-based DStream.

> Interceptor pattern in receivers
> 
>
> Key: SPARK-4960
> URL: https://issues.apache.org/jira/browse/SPARK-4960
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Tathagata Das
>
> Sometimes it is good to intercept a message received through a receiver and 
> modify / do something with the message before it is stored into Spark. This 
> is often referred to as the interceptor pattern. There should be general way 
> to specify an interceptor function that gets applied to all receivers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10860) Bivariate Statistics: Chi-Squared independence test

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-10860:

Assignee: (was: Jihong MA)

> Bivariate Statistics: Chi-Squared independence test
> ---
>
> Key: SPARK-10860
> URL: https://issues.apache.org/jira/browse/SPARK-10860
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Jihong MA
>
> Pearson's chi-squared independence test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared goodness of fit test

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-10646:

Assignee: (was: Jihong MA)

> Bivariate Statistics: Pearson's Chi-Squared goodness of fit test
> 
>
> Key: SPARK-10646
> URL: https://issues.apache.org/jira/browse/SPARK-10646
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> Pearson's chi-squared goodness of fit test for observed against the expected 
> distribution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10860) Bivariate Statistics: Chi-Squared independence test

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-10860:

Component/s: (was: SQL)

> Bivariate Statistics: Chi-Squared independence test
> ---
>
> Key: SPARK-10860
> URL: https://issues.apache.org/jira/browse/SPARK-10860
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Jihong MA
>Assignee: Jihong MA
>
> Pearson's chi-squared independence test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2016-10-08 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger resolved SPARK-3146.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

> Improve the flexibility of Spark Streaming Kafka API to offer user the 
> ability to process message before storing into BM
> 
>
> Key: SPARK-3146
> URL: https://issues.apache.org/jira/browse/SPARK-3146
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Saisai Shao
> Fix For: 1.3.0
>
>
> Currently Spark Streaming Kafka API stores the key and value of each message 
> into BM for processing, potentially this may lose the flexibility for 
> different requirements:
> 1. currently topic/partition/offset information for each message is discarded 
> by KafkaInputDStream. In some scenarios people may need this information to 
> better filter the message, like SPARK-2388 described.
> 2. People may need to add timestamp for each message when feeding into Spark 
> Streaming, which can better measure the system latency.
> 3. Checkpointing the partition/offsets or others...
> So here we add a messageHandler in interface to give people the flexibility 
> to preprocess the message before storing into BM. In the meantime time this 
> improvement keep compatible with current API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2016-10-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558550#comment-15558550
 ] 

Cody Koeninger commented on SPARK-3146:
---

SPARK-4964 / the direct stream added a messageHandler.
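
For reference, a rough sketch of how the direct stream's messageHandler is used 
(this assumes an existing StreamingContext `ssc` and a `kafkaParams` map; the 
topic, partition and offset values are placeholders):

{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder starting offsets for the partitions we want to read.
val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> 0L)

// The last argument is the messageHandler: it sees topic/partition/offset
// for every record before the record is stored.
val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, Int, Long, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) =>
    (mmd.topic, mmd.partition, mmd.offset, mmd.message))
{code}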


> Improve the flexibility of Spark Streaming Kafka API to offer user the 
> ability to process message before storing into BM
> 
>
> Key: SPARK-3146
> URL: https://issues.apache.org/jira/browse/SPARK-3146
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Saisai Shao
>
> Currently Spark Streaming Kafka API stores the key and value of each message 
> into BM for processing, potentially this may lose the flexibility for 
> different requirements:
> 1. currently topic/partition/offset information for each message is discarded 
> by KafkaInputDStream. In some scenarios people may need this information to 
> better filter the message, like SPARK-2388 described.
> 2. People may need to add timestamp for each message when feeding into Spark 
> Streaming, which can better measure the system latency.
> 3. Checkpointing the partition/offsets or others...
> So here we add a messageHandler in interface to give people the flexibility 
> to preprocess the message before storing into BM. In the meantime time this 
> improvement keep compatible with current API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6649) DataFrame created through SQLContext.jdbc() failed if columns table must be quoted

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-6649.

Resolution: Fixed

> DataFrame created through SQLContext.jdbc() failed if columns table must be 
> quoted
> --
>
> Key: SPARK-6649
> URL: https://issues.apache.org/jira/browse/SPARK-6649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Frédéric Blanc
>Priority: Minor
>
> If I want to import the content a table from oracle, that contains a column 
> with name COMMENT (a reserved keyword), I cannot use a DataFrame that map all 
> the columns of this table.
> {code:title=ddl.sql|borderStyle=solid}
> CREATE TABLE TEST_TABLE (
> "COMMENT" VARCHAR2(10)
> );
> {code}
> {code:title=test.java|borderStyle=solid}
> SQLContext sqlContext = ...
> DataFrame df = sqlContext.jdbc(databaseURL, "TEST_TABLE");
> df.rdd();   // => failed if the table contains a column with a reserved 
> keyword
> {code}
> The same problem can be encounter if reserved keyword are used on table name.
> The JDBCRDD scala class could be improved, if the columnList initializer 
> append the double-quote for each column. (line : 225)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14212) Add configuration element for --packages option

2016-10-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-14212:

Labels: config starter  (was: config fun happy pants spark-shell)

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>  Labels: config, starter
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf, would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14212) Add configuration element for --packages option

2016-10-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-14212:

Priority: Trivial  (was: Major)

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>Priority: Trivial
>  Labels: config, starter
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf, would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14212) Add configuration element for --packages option

2016-10-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-14212:

Component/s: (was: Spark Shell)
 (was: Spark Core)
 Documentation

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>  Labels: config, fun, happy, pants, spark-shell
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf, would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14212) Add configuration element for --packages option

2016-10-08 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558540#comment-15558540
 ] 

holdenk commented on SPARK-14212:
-

So I think this would be a good option to document for Python users, although 
the root CSV issue has been fixed by including the CSV format inside of Spark 
itself. You configure a package using `spark.jars.packages` in 
spark-defaults.conf.

If someone is interested in adding this to the documentation, 
`docs/configuration.md` would probably be a good place to document the 
`spark.jars.packages` configuration value (you can see how it is handled by 
looking at SparkSubmit and SparkSubmitArguments together).
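
For example, a spark-defaults.conf entry like the following (the coordinates 
are just the ones from the description, shown for illustration):

{code}
# conf/spark-defaults.conf -- equivalent of passing --packages on the command line
spark.jars.packages  com.databricks:spark-csv_2.10:1.4.0
{code}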

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>  Labels: config, fun, happy, pants, spark-shell
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf, would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14017) dataframe.dtypes -> pyspark.sql.types aliases

2016-10-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-14017.
---
Resolution: Won't Fix

Thanks for bringing this issue up - I don't think we necessarily want to add 
these aliases - the type differences are documented in 
http://spark.apache.org/docs/latest/sql-programming-guide.html

> dataframe.dtypes -> pyspark.sql.types aliases
> -
>
> Key: SPARK-14017
> URL: https://issues.apache.org/jira/browse/SPARK-14017
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.5.0
> Environment: Python 2.7; Spark 1.5; Java 1.7; Hadoop 2.6; Scala 2.10
>Reporter: Ruslan Dautkhanov
>Priority: Minor
>  Labels: dataframe, datatypes, pyspark, python
>
> Running following:
> #fix schema for gaid which should not be Double 
> from pyspark.sql.types import *
> customSchema = StructType()
> for (col,typ) in tsp_orig.dtypes:
> if col=='Agility_GAID':
> typ='string'
> customSchema.add(col,typ,True)
> Getting 
>   ValueError: Could not parse datatype: bigint
> Looks like pyspark.sql.types doesn't know anything about bigint.. 
> Should it be aliased to LongType in pyspark.sql.types?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10804) "LOCAL" in LOAD DATA LOCAL INPATH means "remote"

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558532#comment-15558532
 ] 

Xiao Li commented on SPARK-10804:
-

In Spark 2.0, we rewrote the whole part, especially the load command and the 
write path. If you still have an issue, could you open a new JIRA to document 
it? Thanks!

> "LOCAL" in LOAD DATA LOCAL INPATH means "remote"
> 
>
> Key: SPARK-10804
> URL: https://issues.apache.org/jira/browse/SPARK-10804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Antonio Piccolboni
>
> Connecting with a remote thriftserver with a custom JDBC client or beeline, 
> load data local inpath fails. Hiveserver2 docs explain in a quick comment 
> that local now means local to the server. I think this is just a 
> rationalization for a bug. When a user types "local" 
> # it needs to be local to him, not some server 
> # Failing 1., one needs to have a way to determine what local means and 
> create a "local" item under the new definition. 
> With the thirftserver, I have a host to connect to, but I don't have any way 
> to create a file local to that host, at least in spark. It may not be 
> desirable to create user directories on the thriftserver host or running file 
> transfer services like scp. Moreover, it appears that this syntax is unique 
> to Hive and Spark but its origin can be traced to  LOAD DATA LOCAL INFILE in 
> Oracle and was adopted by mysql. In the latter docs we can read "If LOCAL is 
> specified, the file is read by the client program on the client host and sent 
> to the server. The file can be given as a full path name to specify its exact 
> location. If given as a relative path name, the name is interpreted relative 
> to the directory in which the client program was started". This is not to say 
> that the spark or hive teams are bound to what Oracle and Mysql do, but to 
> support the idea that the meaning of LOCAL is settled. For instance, the 
> Impala documentation says: "Currently, the Impala LOAD DATA statement only 
> imports files from HDFS, not from the local filesystem. It does not support 
> the LOCAL keyword of the Hive LOAD DATA statement." I think this is a better 
> solution. The way things are in thriftserver, I developed a client under the 
> assumption that I could use LOAD DATA LOCAL INPATH and all tests where 
> passing in standalone mode, only to find with the first distributed test that 
> # LOCAL means "local to server", a.k.a. "remote"
> # INSERT INTO ... VALUES is not supported
> # There is really no workaround unless one assumes access what data store 
> spark is running against , like HDFS, and that the user can upload data to 
> it. 
> In the space of workarounds it is not terrible, but if you are trying to 
> write a self-contained spark package, that's a defeat and makes writing tests 
> particularly hard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10804) "LOCAL" in LOAD DATA LOCAL INPATH means "remote"

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10804.
---
Resolution: Not A Problem

> "LOCAL" in LOAD DATA LOCAL INPATH means "remote"
> 
>
> Key: SPARK-10804
> URL: https://issues.apache.org/jira/browse/SPARK-10804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Antonio Piccolboni
>
> Connecting with a remote thriftserver with a custom JDBC client or beeline, 
> load data local inpath fails. Hiveserver2 docs explain in a quick comment 
> that local now means local to the server. I think this is just a 
> rationalization for a bug. When a user types "local" 
> # it needs to be local to him, not some server 
> # Failing 1., one needs to have a way to determine what local means and 
> create a "local" item under the new definition. 
> With the thirftserver, I have a host to connect to, but I don't have any way 
> to create a file local to that host, at least in spark. It may not be 
> desirable to create user directories on the thriftserver host or running file 
> transfer services like scp. Moreover, it appears that this syntax is unique 
> to Hive and Spark but its origin can be traced to  LOAD DATA LOCAL INFILE in 
> Oracle and was adopted by mysql. In the latter docs we can read "If LOCAL is 
> specified, the file is read by the client program on the client host and sent 
> to the server. The file can be given as a full path name to specify its exact 
> location. If given as a relative path name, the name is interpreted relative 
> to the directory in which the client program was started". This is not to say 
> that the spark or hive teams are bound to what Oracle and Mysql do, but to 
> support the idea that the meaning of LOCAL is settled. For instance, the 
> Impala documentation says: "Currently, the Impala LOAD DATA statement only 
> imports files from HDFS, not from the local filesystem. It does not support 
> the LOCAL keyword of the Hive LOAD DATA statement." I think this is a better 
> solution. The way things are in thriftserver, I developed a client under the 
> assumption that I could use LOAD DATA LOCAL INPATH and all tests where 
> passing in standalone mode, only to find with the first distributed test that 
> # LOCAL means "local to server", a.k.a. "remote"
> # INSERT INTO ... VALUES is not supported
> # There is really no workaround unless one assumes access what data store 
> spark is running against , like HDFS, and that the user can upload data to 
> it. 
> In the space of workarounds it is not terrible, but if you are trying to 
> write a self-contained spark package, that's a defeat and makes writing tests 
> particularly hard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17837) Disaster recovery of offsets from WAL

2016-10-08 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-17837:
---
Summary: Disaster recovery of offsets from WAL  (was: Disaster recover of 
offsets from WAL)

> Disaster recovery of offsets from WAL
> -
>
> Key: SPARK-17837
> URL: https://issues.apache.org/jira/browse/SPARK-17837
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Cody Koeninger
>
> "The SQL offsets are stored in a WAL at $checkpointLocation/offsets/$batchId. 
> As reynold suggests though, we should change this to use a less opaque 
> format."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17837) Disaster recover of offsets from WAL

2016-10-08 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-17837:
--

 Summary: Disaster recover of offsets from WAL
 Key: SPARK-17837
 URL: https://issues.apache.org/jira/browse/SPARK-17837
 Project: Spark
  Issue Type: Sub-task
Reporter: Cody Koeninger


"The SQL offsets are stored in a WAL at $checkpointLocation/offsets/$batchId. 
As reynold suggests though, we should change this to use a less opaque format."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17815) Report committed offsets

2016-10-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558528#comment-15558528
 ] 

Cody Koeninger commented on SPARK-17815:


So if you start committing offsets to kafka, there are going to be potentially 
three places offsets are stored:

1.  structured WAL
2. kafka commit topic
3. downstream store

It's going to be easy to get confused as to what the source of truth is.


> Report committed offsets
> 
>
> Key: SPARK-17815
> URL: https://issues.apache.org/jira/browse/SPARK-17815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Since we manage our own offsets, we have turned off auto-commit.  However, 
> this means that external tools are not able to report on how far behind a 
> given streaming job is.  When the user manually gives us a group.id, we 
> should report back to it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10427) Spark-sql -f or -e will output some

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558525#comment-15558525
 ] 

Xiao Li commented on SPARK-10427:
-

This has not been an issue since 2.0. Thanks!

> Spark-sql -f or -e will output some
> ---
>
> Key: SPARK-10427
> URL: https://issues.apache.org/jira/browse/SPARK-10427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1
> Environment: Spark 1.4.1 
>Reporter: cen yuhai
>Priority: Minor
>
> We use  spark-sql -f 1.sql  > 1.txt 
> It will print these information in 1.txt :
> spark.sql.parquet.binaryAsString=...
> spark.sql.hive.metastore.version=.
> .etc and so on 
> We dont' need these information and hive will not print these in the standard 
> outputstream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10427) Spark-sql -f or -e will output some

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10427.
---
Resolution: Not A Problem

> Spark-sql -f or -e will output some
> ---
>
> Key: SPARK-10427
> URL: https://issues.apache.org/jira/browse/SPARK-10427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1
> Environment: Spark 1.4.1 
>Reporter: cen yuhai
>Priority: Minor
>
> We use  spark-sql -f 1.sql  > 1.txt 
> It will print these information in 1.txt :
> spark.sql.parquet.binaryAsString=...
> spark.sql.hive.metastore.version=.
> .etc and so on 
> We dont' need these information and hive will not print these in the standard 
> outputstream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558512#comment-15558512
 ] 

Xiao Li commented on SPARK-9442:


Is it still a problem in the latest branch?

> java.lang.ArithmeticException: / by zero when reading Parquet
> -
>
> Key: SPARK-9442
> URL: https://issues.apache.org/jira/browse/SPARK-9442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: DB Tsai
>
> I am counting how many records in my nested parquet file with this schema,
> {code}
> scala> u1aTesting.printSchema
> root
>  |-- profileId: long (nullable = true)
>  |-- country: string (nullable = true)
>  |-- data: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- videoId: long (nullable = true)
>  |||-- date: long (nullable = true)
>  |||-- label: double (nullable = true)
>  |||-- weight: double (nullable = true)
>  |||-- features: vector (nullable = true)
> {code}
> and the number of the records in the nested data array is around 10k, and 
> each of the parquet file is around 600MB. The total size is around 120GB. 
> I am doing a simple count
> {code}
> scala> u1aTesting.count
> parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
> file 
> hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>   at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArithmeticException: / by zero
>   at 
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
>   ... 21 more
> {code}
> BTW, no all the tasks fail, and some of them are successful. 
> Another note: By explicitly looping through the data to count, it will works.
> {code}
> sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 
> 1L).reduce((x, y) => x + y) 
> {code}
> I think maybe some metadata in parquet files are corrupted. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10703) Physical filter operators should replace the general AND/OR/equality/etc with a special version that treats null as false

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558509#comment-15558509
 ] 

Xiao Li commented on SPARK-10703:
-

The problem has been resolved, I think. Try it on the latest branch with:
{noformat}
val df = Seq(("moose","ice"), (null,"fire")).toDF("animals", "elements")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (value: String) => value.length > 2)
df.filter($"animals".rlike(".*")).filter(callUDF("simpleUDF", 
$"animals")).show()
{noformat}

> Physical filter operators should replace the general AND/OR/equality/etc with 
> a special version that treats null as false
> -
>
> Key: SPARK-10703
> URL: https://issues.apache.org/jira/browse/SPARK-10703
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Mingyu Kim
>
> {noformat}
> val df = Seq(("moose","ice"), (null,"fire")).toDF("animals", "elements")
> df.filter($"animals".rlike(".*"))
>   .filter(callUDF({(value: String) => value.length > 2}, BooleanType, 
> $"animals"))
>   .collect()
> {noformat}
> This code throws a NPE because:
> * Catalyst combines the filters with an AND
> * the first filter passes returns null on the first input
> * the second filter tries to read the length of that null
> This feels weird. Reading that code, I wouldn't expect null to be passed to 
> the second filter. Even weirder is that if you call collect() after the first 
> filter you won't see nulls, and if you write the data to disk and reread it, 
> the NPE won't happen.
> After the discussion on the dev list, [~rxin] suggested,
> {quote}
> we can add a rule for the physical filter operator to replace the general 
> AND/OR/equality/etc with a special version that treats null as false. This 
> rule needs to be carefully written because it should only apply to subtrees 
> of AND/OR/equality/etc (e.g. it shouldn't rewrite children of isnull).
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10703) Physical filter operators should replace the general AND/OR/equality/etc with a special version that treats null as false

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10703.
---
Resolution: Not A Problem

> Physical filter operators should replace the general AND/OR/equality/etc with 
> a special version that treats null as false
> -
>
> Key: SPARK-10703
> URL: https://issues.apache.org/jira/browse/SPARK-10703
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Mingyu Kim
>
> {noformat}
> val df = Seq(("moose","ice"), (null,"fire")).toDF("animals", "elements")
> df.filter($"animals".rlike(".*"))
>   .filter(callUDF({(value: String) => value.length > 2}, BooleanType, 
> $"animals"))
>   .collect()
> {noformat}
> This code throws a NPE because:
> * Catalyst combines the filters with an AND
> * the first filter passes returns null on the first input
> * the second filter tries to read the length of that null
> This feels weird. Reading that code, I wouldn't expect null to be passed to 
> the second filter. Even weirder is that if you call collect() after the first 
> filter you won't see nulls, and if you write the data to disk and reread it, 
> the NPE won't happen.
> After the discussion on the dev list, [~rxin] suggested,
> {quote}
> we can add a rule for the physical filter operator to replace the general 
> AND/OR/equality/etc with a special version that treats null as false. This 
> rule needs to be carefully written because it should only apply to subtrees 
> of AND/OR/equality/etc (e.g. it shouldn't rewrite children of isnull).
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets

2016-10-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558506#comment-15558506
 ] 

Cody Koeninger commented on SPARK-17812:


So I'm willing to do this work, mostly because I've already done it, but there 
are some user interface issues here that need to get figured out.

You already chose the name "startingOffset" for specifying the equivalent of 
auto.offset.reset.  Now we're looking at actually adding starting offsets.  
Furthermore, it should be possible to specify starting offsets for some 
partitions, while relying on the equivalent of auto.offset.reset for other 
unspecified ones (the existing DStream does this).

What are you expecting configuration of this to look like?  I can see a couple 
of options:

1. Try to cram everything into startingOffset with some horrible string-based 
DSL
2. Have a separate option for specifying starting offsets for real, with a name 
that makes it clear what it is, yet doesn't use "startingoffset".  As for the 
value, I guess in json form of some kind?   { "topicfoo" : { "0": 1234, "1": 
4567 }}

Somewhat related is that Assign needs a way of specifying topicpartitions.

As far as the idea to seek back X offsets, I think it'd be better to look at 
offset time indexing.
If you are going to do the X offsets back idea, the offsets -1L and -2L already 
have special meaning, so it's going to be kind of confusing to allow negative 
numbers in an interface that is specifying offsets.
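
To make option 2 concrete, a purely hypothetical sketch of what it could look 
like from the user's side (the option name and the JSON shape are exactly what 
is up for discussion here, so treat this as an illustration only):

{code}
// Hypothetical only -- neither the option name nor the JSON format is settled.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "topicfoo")
  // Per-partition starting offsets; unlisted partitions would fall back to
  // the auto.offset.reset-style behavior.
  .option("startingOffsets", """{"topicfoo": {"0": 1234, "1": 4567}}""")
  .load()
{code}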


> More granular control of starting offsets
> -
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek back {{X}} offsets in the stream from the moment the query starts
>  - seek to user specified offsets



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10545) HiveMetastoreTypes.toMetastoreType should handle interval type

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558489#comment-15558489
 ] 

Xiao Li commented on SPARK-10545:
-

Neither Hive nor Spark supports INTERVAL as a column data type. Thus, it 
should not be a bug, right? 
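
For illustration, a small sketch of the behavior as I understand it (interval 
literals work in expressions, but interval is not accepted as a column type):

{code}
// Works: interval literals in expressions.
spark.sql("SELECT current_timestamp() + INTERVAL 1 DAY").show()

// Not supported: interval as a column data type
// (there is no metastore type mapping for it).
// spark.sql("CREATE TABLE t (i INTERVAL)")  // fails
{code}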

> HiveMetastoreTypes.toMetastoreType should handle interval type
> --
>
> Key: SPARK-10545
> URL: https://issues.apache.org/jira/browse/SPARK-10545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Minor
>
> We need to handle interval type at 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L946-L965.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558474#comment-15558474
 ] 

Cody Koeninger commented on SPARK-17344:


I think this is premature until you have a fully operational battlestation, er, 
structured stream, that has all the necessary features for 0.10

Regarding the conversation with Michael about possibly using the kafka protocol 
directly as a way to work around the differences between 0.8 and 0.10, please 
don't consider that.  Every kafka consumer implementation I've ever used has 
bugs, and we don't need to spend time writing another buggy one.  

By contrast, writing a streaming source shim around the existing simple 
consumer-based 0.8 spark rdd would be a weekend project; it just wouldn't have 
stuff like SSL, dynamic topics, or offset committing.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10044) AnalysisException in resolving reference for sorting with aggregation

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10044.
---
Resolution: Not A Problem

> AnalysisException in resolving reference for sorting with aggregation
> -
>
> Key: SPARK-10044
> URL: https://issues.apache.org/jira/browse/SPARK-10044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> Unit test as:
> {code}
> withTempTable("mytable") {
>   sqlContext.sparkContext.parallelize(1 to 10).map(i => (i, i.toString))
> .toDF("key", "value")
> .registerTempTable("mytable")
>   checkAnswer(sql(
> """select max(value) from mytable group by key % 2
>   |order by max(concat(value,",", key)), min(substr(value, 0, 4))
>   |""".stripMargin), Row("8") :: Row("9") :: Nil)
> }
> {code}
> Exception like:
> {code}
> cannot resolve '_aggOrdering' given input columns _c0, _aggOrdering, 
> _aggOrdering;
> org.apache.spark.sql.AnalysisException: cannot resolve '_aggOrdering' given 
> input columns _c0, _aggOrdering, _aggOrdering;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10044) AnalysisException in resolving reference for sorting with aggregation

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558470#comment-15558470
 ] 

Xiao Li commented on SPARK-10044:
-

This has been resolved, at least in Spark 2.0. Thus, it should not be a 
problem now. Let me close it. Thanks!

> AnalysisException in resolving reference for sorting with aggregation
> -
>
> Key: SPARK-10044
> URL: https://issues.apache.org/jira/browse/SPARK-10044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> Unit test as:
> {code}
> withTempTable("mytable") {
>   sqlContext.sparkContext.parallelize(1 to 10).map(i => (i, i.toString))
> .toDF("key", "value")
> .registerTempTable("mytable")
>   checkAnswer(sql(
> """select max(value) from mytable group by key % 2
>   |order by max(concat(value,",", key)), min(substr(value, 0, 4))
>   |""".stripMargin), Row("8") :: Row("9") :: Nil)
> }
> {code}
> Exception like:
> {code}
> cannot resolve '_aggOrdering' given input columns _c0, _aggOrdering, 
> _aggOrdering;
> org.apache.spark.sql.AnalysisException: cannot resolve '_aggOrdering' given 
> input columns _c0, _aggOrdering, _aggOrdering;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7012) Add support for NOT NULL modifier for column definitions on DDLParser

2016-10-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-7012.
--
Resolution: Not A Problem

> Add support for NOT NULL modifier for column definitions on DDLParser
> -
>
> Key: SPARK-7012
> URL: https://issues.apache.org/jira/browse/SPARK-7012
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Santiago M. Mola
>Priority: Minor
>  Labels: easyfix
>
> Add support for NOT NULL modifier for column definitions on DDLParser. This 
> would add support for the following syntax:
> CREATE TEMPORARY TABLE (field INTEGER NOT NULL) ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7012) Add support for NOT NULL modifier for column definitions on DDLParser

2016-10-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558466#comment-15558466
 ] 

Xiao Li commented on SPARK-7012:


Since 2.0, we have a native SQL parser, so this has been resolved. Thanks!

> Add support for NOT NULL modifier for column definitions on DDLParser
> -
>
> Key: SPARK-7012
> URL: https://issues.apache.org/jira/browse/SPARK-7012
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Santiago M. Mola
>Priority: Minor
>  Labels: easyfix
>
> Add support for NOT NULL modifier for column definitions on DDLParser. This 
> would add support for the following syntax:
> CREATE TEMPORARY TABLE (field INTEGER NOT NULL) ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5511) [SQL] Possible optimisations for predicate pushdowns from Spark SQL to Parquet

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557667#comment-15557667
 ] 

Hyukjin Kwon edited comment on SPARK-5511 at 10/8/16 4:52 PM:
--

1. I agree it needs a change on Parquet.

2. We already supported this before via a user-defined filter, but it was removed 
due to the performance cost of filtering record-by-record. There is now an attempt 
to add this back with combinations of OR operators; see SPARK-17091 and the sketch 
below.
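
A minimal sketch of the idea against the public data source filter API (the actual 
SPARK-17091 change may look different; the helper name below is made up for 
illustration):

{code}
import org.apache.spark.sql.sources.{EqualTo, Filter, In, Or}

// Rewrite IN (v1, v2, ...) as EqualTo OR EqualTo OR ..., so that it can reuse
// the existing equality pushdown path. Returns None for an empty value list.
def inToOrChain(in: In): Option[Filter] =
  in.values
    .map(v => EqualTo(in.attribute, v): Filter)
    .reduceOption[Filter](Or(_, _))
{code}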


was (Author: hyukjin.kwon):
1. I agree it needs a change on Spark.

2. We already supported this before via a user-defined filter, but it was removed 
due to the performance cost of filtering record-by-record. There is now an attempt 
to add this back with combinations of OR operators. See SPARK-17091

> [SQL] Possible optimisations for predicate pushdowns from Spark SQL to Parquet
> --
>
> Key: SPARK-5511
> URL: https://issues.apache.org/jira/browse/SPARK-5511
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Mick Davies
>Priority: Minor
>
> The following changes could make predicate pushdown more effective under 
> certain conditions, which are not uncommon.
> 1. Parquet predicate evaluation does not use dictionary compression 
> information, furthermore it circumvents dictionary decoding optimisations 
> (https://issues.apache.org/jira/browse/PARQUET-36). This means predicates are 
> re-evaluated repeatedly for the same Strings, and also Binary->String 
> conversions are repeated. This is a change purely on the Parquet side.
> 2. Support IN clauses in predicate pushdown. This requires changes to Parquet 
> and then subsequently in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17540) SparkR array serde cannot work correctly when array length == 0

2016-10-08 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu closed SPARK-17540.
--
Resolution: Won't Fix

> SparkR array serde cannot work correctly when array length == 0
> ---
>
> Key: SPARK-17540
> URL: https://issues.apache.org/jira/browse/SPARK-17540
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SparkR
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>
> SparkR cannot handle array serde when the array length == 0.
> When the length is 0, the R side sets the element type as class("somestring"),
> so the Scala side receives it as a string array, but the array we need to
> transfer may be of another type, which causes problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11428) Schema Merging Broken for Some Queries

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558095#comment-15558095
 ] 

Hyukjin Kwon commented on SPARK-11428:
--

How about https://issues.apache.org/jira/browse/SPARK-8128 ?

> Schema Merging Broken for Some Queries
> --
>
> Key: SPARK-11428
> URL: https://issues.apache.org/jira/browse/SPARK-11428
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.5.1
> Environment: AWS,
>Reporter: Brad Willard
>  Labels: dataframe, parquet, pyspark, schema, sparksql
>
> I have data being written into parquet format via spark streaming. The data 
> can change slightly so schema merging is required. I load a dataframe like 
> this
> {code}
> urls = [
> "/streaming/parquet/events/key=2015-10-30*",
> "/streaming/parquet/events/key=2015-10-29*"
> ]
> sdf = sql_context.read.option("mergeSchema", "true").parquet(*urls)
> sdf.registerTempTable('events')
> {code}
> If I print the schema you can see the contested column
> {code}
> sdf.printSchema()
> root
>  |-- _id: string (nullable = true)
> ...
>  |-- d__device_s: string (nullable = true)
>  |-- d__isActualPageLoad_s: string (nullable = true)
>  |-- d__landing_s: string (nullable = true)
>  |-- d__lang_s: string (nullable = true)
>  |-- d__os_s: string (nullable = true)
>  |-- d__performance_i: long (nullable = true)
>  |-- d__product_s: string (nullable = true)
>  |-- d__refer_s: string (nullable = true)
>  |-- d__rk_i: long (nullable = true)
>  |-- d__screen_s: string (nullable = true)
>  |-- d__submenuName_s: string (nullable = true)
> {code}
> The column that's in one but not the other file is  d__product_s
> So I'm able to run this query and it works fine.
> {code}
> sql_context.sql('''
> select 
> distinct(d__product_s) 
> from 
> events
> where 
> n = 'view'
> ''').collect()
> [Row(d__product_s=u'website'),
>  Row(d__product_s=u'store'),
>  Row(d__product_s=None),
>  Row(d__product_s=u'page')]
> {code}
> However if I instead use that column in the where clause things break.
> {code}
> sql_context.sql('''
> select 
> * 
> from 
> events
> where 
> n = 'view' and d__product_s = 'page'
> ''').take(1)
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
>   6 where
>   7 n = 'frontsite_view' and d__product_s = 'page'
> > 8 ''').take(1)
> /root/spark/python/pyspark/sql/dataframe.pyc in take(self, num)
> 303 with SCCallSiteSync(self._sc) as css:
> 304 port = 
> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(
> --> 305 self._jdf, num)
> 306 return list(_load_from_socket(port, 
> BatchedSerializer(PickleSerializer(
> 307 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /root/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
>  34 def deco(*a, **kw):
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
>  38 s = e.java_exception.toString()
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 15.0 failed 30 times, most recent failure: Lost task 0.29 in stage 
> 15.0 (TID 6536, 10.X.X.X): java.lang.IllegalArgumentException: Column 
> [d__product_s] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> 

[jira] [Commented] (SPARK-8128) Schema Merging Broken: Dataframe Fails to Recognize Column in Schema

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558088#comment-15558088
 ] 

Hyukjin Kwon commented on SPARK-8128:
-

I am not 100% sure, but I recall seeing a similar issue being resolved. Could you 
confirm whether this is still happening in recent versions? - [~brdwrd]

> Schema Merging Broken: Dataframe Fails to Recognize Column in Schema
> 
>
> Key: SPARK-8128
> URL: https://issues.apache.org/jira/browse/SPARK-8128
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0
>Reporter: Brad Willard
>
> I'm loading a folder of about 600 parquet files into one dataframe, so schema 
> merging is involved. There is some bug with the schema merging: printing the 
> schema shows all the attributes, but running a query that filters on one of 
> those attributes errors out, saying it's not in the schema. The query is 
> incorrectly going to one of the parquet files that does not have that attribute.
> sdf = sql_context.parquet('/parquet/big_data_folder')
> sdf.printSchema()
> root
>  \|-- _id: string (nullable = true)
>  \|-- addedOn: string (nullable = true)
>  \|-- attachment: string (nullable = true)
>  ...
> \|-- items: array (nullable = true)
>  \||-- element: struct (containsNull = true)
>  \|||-- _id: string (nullable = true)
>  \|||-- addedOn: string (nullable = true)
>  \|||-- authorId: string (nullable = true)
>  \|||-- mediaProcessingState: long (nullable = true)
>  \|-- mediaProcessingState: long (nullable = true)
>  \|-- title: string (nullable = true)
>  \|-- key: string (nullable = true)
> sdf.filter(sdf.mediaProcessingState == 3).count()
> causes this exception
> Py4JJavaError: An error occurred while calling o67.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 1106 in stage 4.0 failed 30 times, most recent failure: Lost task 1106.29 in 
> stage 4.0 (TID 70565, XXX): java.lang.IllegalArgumentException: 
> Column [mediaProcessingState] was not found in schema!
> at parquet.Preconditions.checkArgument(Preconditions.java:47)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
> at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
> at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
> at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
> at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
> at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
> at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> 

[jira] [Resolved] (SPARK-16903) nullValue in first field is not respected by CSV source when read

2016-10-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16903.
--
Resolution: Duplicate

[~falaki] I am going to mark this as a duplicate because the PR was merged and 
I am sure we all agree with closing this.

> nullValue in first field is not respected by CSV source when read
> -
>
> Key: SPARK-16903
> URL: https://issues.apache.org/jira/browse/SPARK-16903
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> file:
> {code}
> a,-
> -,10
> {code}
> Query:
> {code}
> create temporary table test(key string, val decimal) 
> using com.databricks.spark.csv 
> options (path "/tmp/hossein2/null.csv", header "false", delimiter ",", 
> nullValue "-");
> {code}
> Result:
> {code}
> select count(*) from test where key is null
> 0
> {code}
> But
> {code}
> select count(*) from test where val is null
> 1
> {code}
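
For reference, a minimal sketch of the equivalent read with the Spark 2.x 
{{DataFrameReader}}, assuming the same file as above (the decimal precision below 
is an illustrative choice):

{code}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("key", StringType),
  StructField("val", DecimalType(10, 0))))

val df = spark.read
  .schema(schema)
  .option("header", "false")
  .option("delimiter", ",")
  .option("nullValue", "-")
  .csv("/tmp/hossein2/null.csv")

// With the fix, both counts should be 1: row 1 has a null val, row 2 a null key.
df.filter("key is null").count()
df.filter("val is null").count()
{code}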



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16720) Loading CSV file with 2k+ columns fails during attribute resolution on action

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558067#comment-15558067
 ] 

Hyukjin Kwon commented on SPARK-16720:
--

Hi [~holdenk], do you mind if I ask to close this? I tried to reproduce it 
before, but it'd be great if you could confirm.

> Loading CSV file with 2k+ columns fails during attribute resolution on action
> -
>
> Key: SPARK-16720
> URL: https://issues.apache.org/jira/browse/SPARK-16720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: holdenk
>
> Example shell for repro:
> {quote}
> scala> val df =spark.read.format("csv").option("header", 
> "true").option("inferSchema", "true").load("/home/holden/Downloads/ex*.csv")
> df: org.apache.spark.sql.DataFrame = [Date: string, Lifetime Total Likes: int 
> ... 2125 more fields]
> scala> df.schema
> res0: org.apache.spark.sql.types.StructType = 
> StructType(StructField(Date,StringType,true), StructField(Lifetime Total 
> Likes,IntegerType,true), StructField(Daily New Likes,IntegerType,true), 
> StructField(Daily Unlikes,IntegerType,true), StructField(Daily Page Engaged 
> Users,IntegerType,true), StructField(Weekly Page Engaged 
> Users,IntegerType,true), StructField(28 Days Page Engaged 
> Users,IntegerType,true), StructField(Daily Like Sources - On Your 
> Page,IntegerType,true), StructField(Daily Total Reach,IntegerType,true), 
> StructField(Weekly Total Reach,IntegerType,true), StructField(28 Days Total 
> Reach,IntegerType,true), StructField(Daily Organic Reach,IntegerType,true), 
> StructField(Weekly Organic Reach,IntegerType,true), StructField(28 Days 
> Organic Reach,IntegerType,true), StructField(Daily T...
> scala> df.take(1)
> [GIANT LIST OF COLUMNS]
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
>   at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> 

[jira] [Comment Edited] (SPARK-16386) SQLContext and HiveContext parse a query string differently

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558062#comment-15558062
 ] 

Hyukjin Kwon edited comment on SPARK-16386 at 10/8/16 2:29 PM:
---

I can't reproduce the problematic case 2 in the current master

{code}
context.sql("select 'a\\'b'").show()
+---+
|a'b|
+---+
|a'b|
+---+
{code}


was (Author: hyukjin.kwon):
I can't reproduce the problematic case 2

{code}
context.sql("select 'a\\'b'").show()
+---+
|a'b|
+---+
|a'b|
+---+
{code}

> SQLContext and HiveContext parse a query string differently
> ---
>
> Key: SPARK-16386
> URL: https://issues.apache.org/jira/browse/SPARK-16386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
> Environment: scala 2.10, 2.11
>Reporter: Hao Ren
>  Labels: patch
>
> I just want to figure out why the two contexts behave differently even on a 
> simple query.
> In a nutshell, I have a query with a String containing a single quote and a 
> cast to Array/Map.
> I have tried all the combinations of the different SQL contexts and query call 
> APIs (sql, df.select, df.selectExpr).
> I can't find one that rules them all.
> Here is the code for reproducing the problem.
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.{SparkConf, SparkContext}
> object Test extends App {
>   val sc  = new SparkContext("local[2]", "test", new SparkConf)
>   val hiveContext = new HiveContext(sc)
>   val sqlContext  = new SQLContext(sc)
>   val context = hiveContext
>   //  val context = sqlContext
>   import context.implicits._
>   val df = Seq((Seq(1, 2), 2)).toDF("a", "b")
>   df.registerTempTable("tbl")
>   df.printSchema()
>   // case 1
>   context.sql("select cast(a as array) from tbl").show()
>   // HiveContext => org.apache.spark.sql.AnalysisException: cannot recognize 
> input near 'array' '<' 'string' in primitive type specification; line 1 pos 17
>   // SQLContext => OK
>   // case 2
>   context.sql("select 'a\\'b'").show()
>   // HiveContext => OK
>   // SQLContext => failure: ``union'' expected but ErrorToken(unclosed string 
> literal) found
>   // case 3
>   df.selectExpr("cast(a as array)").show() // OK with HiveContext and 
> SQLContext
>   // case 4
>   df.selectExpr("'a\\'b'").show() // HiveContext, SQLContext => failure: end 
> of input expected
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16386) SQLContext and HiveContext parse a query string differently

2016-10-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16386.
--
Resolution: Cannot Reproduce

I can't reproduce the problematic case 2

{code}
context.sql("select 'a\\'b'").show()
+---+
|a'b|
+---+
|a'b|
+---+
{code}

> SQLContext and HiveContext parse a query string differently
> ---
>
> Key: SPARK-16386
> URL: https://issues.apache.org/jira/browse/SPARK-16386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
> Environment: scala 2.10, 2.11
>Reporter: Hao Ren
>  Labels: patch
>
> I just want to figure out why the two contexts behave differently even on a 
> simple query.
> In a nutshell, I have a query with a String containing a single quote and a 
> cast to Array/Map.
> I have tried all the combinations of the different SQL contexts and query call 
> APIs (sql, df.select, df.selectExpr).
> I can't find one that rules them all.
> Here is the code for reproducing the problem.
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.{SparkConf, SparkContext}
> object Test extends App {
>   val sc  = new SparkContext("local[2]", "test", new SparkConf)
>   val hiveContext = new HiveContext(sc)
>   val sqlContext  = new SQLContext(sc)
>   val context = hiveContext
>   //  val context = sqlContext
>   import context.implicits._
>   val df = Seq((Seq(1, 2), 2)).toDF("a", "b")
>   df.registerTempTable("tbl")
>   df.printSchema()
>   // case 1
>   context.sql("select cast(a as array) from tbl").show()
>   // HiveContext => org.apache.spark.sql.AnalysisException: cannot recognize 
> input near 'array' '<' 'string' in primitive type specification; line 1 pos 17
>   // SQLContext => OK
>   // case 2
>   context.sql("select 'a\\'b'").show()
>   // HiveContext => OK
>   // SQLContext => failure: ``union'' expected but ErrorToken(unclosed string 
> literal) found
>   // case 3
>   df.selectExpr("cast(a as array)").show() // OK with HiveContext and 
> SQLContext
>   // case 4
>   df.selectExpr("'a\\'b'").show() // HiveContext, SQLContext => failure: end 
> of input expected
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15577) Java can't import DataFrame type alias

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558049#comment-15558049
 ] 

Hyukjin Kwon commented on SPARK-15577:
--

If I were a naive user from Spark 1.x, I think I would prefer to use 
{{DataFrame}} in Java too.

> Java can't import DataFrame type alias
> --
>
> Key: SPARK-15577
> URL: https://issues.apache.org/jira/browse/SPARK-15577
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, SQL
>Affects Versions: 2.0.0
>Reporter: holdenk
>
> After SPARK-13244, all Java code needs to be updated to use Dataset<Row> 
> instead of DataFrame, since we used a type alias. Should we consider adding a 
> DataFrame class to the Java API which just extends Dataset<Row> for compatibility?
> cc [~liancheng] ?
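
For context, a rough Scala-side sketch: in 2.0 the alias is defined in the 
{{org.apache.spark.sql}} package object as {{type DataFrame = Dataset[Row]}}, and 
Java cannot import a Scala type alias, so Java code has to spell out 
{{Dataset<Row>}}. A minimal illustration of the Scala side:

{code}
import org.apache.spark.sql.{DataFrame, Dataset, Row}

// In Scala the alias resolves, so these two signatures are interchangeable:
def describe(df: DataFrame): Unit = df.printSchema()
def describeExplicit(df: Dataset[Row]): Unit = df.printSchema()
{code}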



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14621) add oracle hint optimizer

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14621.
---
Resolution: Not A Problem

> add oracle hint optimizer
> -
>
> Key: SPARK-14621
> URL: https://issues.apache.org/jira/browse/SPARK-14621
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Qingyang Hong
>Priority: Minor
>
> The current SQL parser in Spark SQL can't recognize optimizer hints in a query, e.g. 
> SELECT /*+index(o IDX_BILLORDER_SEND_UPDATE)+*/ ID, BILL_CODE, DATE FROM 
> BILL_TABLE. It would be useful to add such a feature, which would increase query 
> efficiency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14766) Attribute reference mismatch with Dataset filter + mapPartitions

2016-10-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-14766.
--
Resolution: Cannot Reproduce

{code}
scala> Seq((1, 1)).toDS().filter(_._1 != 0).mapPartitions { iter => iter 
}.count()
res4: Long = 1
{code}

I can't reproduce this either.

> Attribute reference mismatch with Dataset filter + mapPartitions
> 
>
> Key: SPARK-14766
> URL: https://issues.apache.org/jira/browse/SPARK-14766
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>
> After a filter, the Dataset references seem not to be copied properly, leading 
> to an exception. To reproduce, you may use the following code:
> {code}
> Seq((1, 1)).toDS().filter(_._1 != 0).mapPartitions { iter => iter }.count()
> {code}
> Using explain shows the problem:
> {code}
> == Physical Plan ==
> !MapPartitions , newInstance(class scala.Tuple2), [input[0, 
> scala.Tuple2]._1 AS _1#38521,input[0, scala.Tuple2]._2 AS _2#38522]
> +- WholeStageCodegen
>:  +- Filter .apply
>: +- INPUT
>+- LocalTableScan [_1#38512,_2#38513], [[0,1,1]]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14621) add oracle hint optimizer

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558034#comment-15558034
 ] 

Hyukjin Kwon commented on SPARK-14621:
--

+1 for closing this.

> add oracle hint optimizer
> -
>
> Key: SPARK-14621
> URL: https://issues.apache.org/jira/browse/SPARK-14621
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Qingyang Hong
>Priority: Minor
>
> The current SQL parser in Spark SQL can't recognize optimizer hints in a query, e.g. 
> SELECT /*+index(o IDX_BILLORDER_SEND_UPDATE)+*/ ID, BILL_CODE, DATE FROM 
> BILL_TABLE. It would be useful to add such a feature, which would increase query 
> efficiency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558029#comment-15558029
 ] 

Hyukjin Kwon commented on SPARK-14393:
--

Still happens in the current master/2.0

{code}
scala> 
spark.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
warning: there was one deprecation warning; re-run with -deprecation for details
+-----------------------------+
|monotonically_increasing_id()|
+-----------------------------+
|                            0|
|                            0|
|                            1|
|                            2|
|                            0|
|                            1|
|                            0|
|                            1|
|                            2|
|                            0|
+-----------------------------+
{code}
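
A hedged workaround sketch (not a fix for this ticket): materialize globally 
consecutive ids with {{RDD.zipWithIndex}} before coalescing, instead of relying on 
{{monotonically_increasing_id}} being evaluated after the coalesce:

{code}
import spark.implicits._

val df = spark.range(10).repartition(5).toDF("value")

val withId = df.rdd
  .zipWithIndex()                                    // (Row, Long)
  .map { case (row, idx) => (idx, row.getLong(0)) }  // (id, value)
  .toDF("id", "value")

withId.coalesce(1).orderBy("id").show()
{code}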

> monotonicallyIncreasingId not monotonically increasing with downstream 
> coalesce
> ---
>
> Key: SPARK-14393
> URL: https://issues.apache.org/jira/browse/SPARK-14393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Jason Piper
>
> When utilising monotonicallyIncreasingId with a coalesce, it appears that 
> every partition uses the same offset (0) leading to non-monotonically 
> increasing IDs.
> See examples below
> {code}
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---------------------------+
> |monotonicallyincreasingid()|
> +---------------------------+
> |                25769803776|
> |                51539607552|
> |                77309411328|
> |               103079215104|
> |               128849018880|
> |               163208757248|
> |               188978561024|
> |               214748364800|
> |               240518168576|
> |               266287972352|
> +---------------------------+
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---------------------------+
> |monotonicallyincreasingid()|
> +---------------------------+
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> +---------------------------+
> >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---------------------------+
> |monotonicallyincreasingid()|
> +---------------------------+
> |                          0|
> |                          1|
> |                          0|
> |                          0|
> |                          1|
> |                          2|
> |                          3|
> |                          0|
> |                          1|
> |                          2|
> +---------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557996#comment-15557996
 ] 

Hyukjin Kwon commented on SPARK-13699:
--

Thank you. I will try to follow it.

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing the dataframe to a Hive table with the "SaveMode.Overwrite" option, 
> e.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drops the table instead of truncating it.
> This causes an error while overwriting.
> Adding the stacktrace & commands to reproduce the issue.
> Thanks & Regards,
> Dhaval



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13699.
---
Resolution: Duplicate

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing the dataframe to a Hive table with the "SaveMode.Overwrite" option, 
> e.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drops the table instead of truncating it.
> This causes an error while overwriting.
> Adding the stacktrace & commands to reproduce the issue.
> Thanks & Regards,
> Dhaval



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-13699:
---

I think it's clearer to resolve these as Duplicate when they are clearly a 
duplicate, even if it's true that it's also "Fixed".

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing the dataframe to a Hive table with the "SaveMode.Overwrite" option, 
> e.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drops the table instead of truncating it.
> This causes an error while overwriting.
> Adding the stacktrace & commands to reproduce the issue.
> Thanks & Regards,
> Dhaval



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-10-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-13699.
--
Resolution: Fixed

We introduced the {{truncate}} option in https://github.com/apache/spark/pull/14086. 
Please revert my change if I am wrong.
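
A minimal usage sketch, assuming the {{truncate}} option referred to is the one on 
the JDBC writer path (the URL, table name and credentials below are placeholders):

{code}
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

tgtFinal.write
  .mode(SaveMode.Overwrite)
  .option("truncate", "true")  // truncate the existing table instead of dropping it
  .jdbc("jdbc:postgresql://host:5432/db", "tgt_table", props)
{code}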

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing the dataframe to a Hive table with the "SaveMode.Overwrite" option, 
> e.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drops the table instead of truncating it.
> This causes an error while overwriting.
> Adding the stacktrace & commands to reproduce the issue.
> Thanks & Regards,
> Dhaval



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16295) Extract SQL programming guide example snippets from source files instead of hard code them

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16295:
--
Assignee: Cheng Lian

> Extract SQL programming guide example snippets from source files instead of 
> hard code them
> --
>
> Key: SPARK-16295
> URL: https://issues.apache.org/jira/browse/SPARK-16295
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.1
>
>
> Currently, all example snippets in the SQL programming guide are hard-coded, 
> which can be pretty hard to update and verify. On the contrary, ML document 
> pages are using the {{include_example}} Jekyll plugin to extract snippets 
> from actual source files under the {{examples}} sub-project. In this way, we 
> can guarantee that Java and Scala code are compilable, and it would be much 
> easier to verify these example snippets since they are part of complete Spark 
> applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16223) Codegen failure with a Dataframe program using an array

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16223.
---
   Resolution: Duplicate
Fix Version/s: (was: 2.1.0)

> Codegen failure with a Dataframe program using an array
> ---
>
> Key: SPARK-16223
> URL: https://issues.apache.org/jira/browse/SPARK-16223
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> When we compile a Dataframe program with an operation on a large array, a 
> compilation failure occurs. This is because the local variable 
> {{inputadapter_value}} cannot be referenced in the {{apply()}} method that is 
> generated by {{CodegenContext.splitExpressions()}}; the local variable is 
> defined in the {{processNext()}} method.
> What is a better approach to resolve this? Is it better to pass 
> {{inputadapter_value}} to the {{apply()}} method?
> Example program
> {code}
> val n = 500
> val statement = (0 to n - 1).map(i => s"value + 1.0d")
>   .mkString("Array(", ",", ")")
> sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
>   .selectExpr(statement).showString(1)
> {code}
> Generated code and stack trace
> {code:java}
> 23:10:45.801 ERROR 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 30, Column 36: Expression "inputadapter_value" is not 
> an rvalue
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private scala.collection.Iterator inputadapter_input;
> /* 008 */   private Object[] project_values;
> /* 009 */   private UnsafeRow project_result;
> /* 010 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
> /* 011 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> project_rowWriter;
> /* 012 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> project_arrayWriter;
> /* 013 */
> /* 014 */   public GeneratedIterator(Object[] references) {
> /* 015 */ this.references = references;
> /* 016 */   }
> /* 017 */
> /* 018 */   public void init(int index, scala.collection.Iterator inputs[]) {
> /* 019 */ partitionIndex = index;
> /* 020 */ inputadapter_input = inputs[0];
> /* 021 */ this.project_values = null;
> /* 022 */ project_result = new UnsafeRow(1);
> /* 023 */ this.project_holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result,
>  32);
> /* 024 */ this.project_rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder,
>  1);
> /* 025 */ this.project_arrayWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 026 */   }
> /* 027 */
> /* 028 */   private void project_apply_0(InternalRow inputadapter_row) {
> /* 029 */ double project_value1 = -1.0;
> /* 030 */ project_value1 = inputadapter_value + 1.0D;
> /* 031 */ if (false) {
> /* 032 */   project_values[0] = null;
> /* 033 */ } else {
> /* 034 */   project_values[0] = project_value1;
> /* 035 */ }
> /* 036 */
> /* 037 */ double project_value4 = -1.0;
> /* 038 */ project_value4 = inputadapter_value + 1.0D;
> /* 039 */ if (false) {
> /* 040 */   project_values[1] = null;
> /* 041 */ } else {
> /* 042 */   project_values[1] = project_value4;
> /* 043 */ }
> ...
> /* 4032 */   }
> /* 4033 */
> /* 4034 */   protected void processNext() throws java.io.IOException {
> /* 4035 */ while (inputadapter_input.hasNext()) {
> /* 4036 */   InternalRow inputadapter_row = (InternalRow) 
> inputadapter_input.next();
> /* 4037 */   System.out.println("row: " + inputadapter_row.getClass() + 
> ", " + inputadapter_row);
> /* 4038 */   double inputadapter_value = inputadapter_row.getDouble(0);
> /* 4039 */
> /* 4040 */   final boolean project_isNull = false;
> /* 4041 */   this.project_values = new 
> Object[500];project_apply_0(inputadapter_row);
> /* 4042 */   project_apply_1(inputadapter_row);
> /* 4043 */   /* final ArrayData project_value = 
> org.apache.spark.sql.catalyst.util.GenericArrayData.allocate(project_values); 
> */
> /* 4044 */   final ArrayData project_value = new 
> org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
> /* 4045 */   this.project_values = null;
> /* 4046 */   project_holder.reset();
> /* 4047 */
> /* 4048 */   project_rowWriter.zeroOutNullBytes();
> /* 4049 */
> /* 4050 */   if (project_isNull) {
> 

[jira] [Reopened] (SPARK-16223) Codegen failure with a Dataframe program using an array

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-16223:
---

Actually I think this is a duplicate

> Codegen failure with a Dataframe program using an array
> ---
>
> Key: SPARK-16223
> URL: https://issues.apache.org/jira/browse/SPARK-16223
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> When we compile a Dataframe program with an operation on a large array, a 
> compilation failure occurs. This is because the local variable 
> {{inputadapter_value}} cannot be referenced in the {{apply()}} method that is 
> generated by {{CodegenContext.splitExpressions()}}; the local variable is 
> defined in the {{processNext()}} method.
> What is a better approach to resolve this? Is it better to pass 
> {{inputadapter_value}} to the {{apply()}} method?
> Example program
> {code}
> val n = 500
> val statement = (0 to n - 1).map(i => s"value + 1.0d")
>   .mkString("Array(", ",", ")")
> sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
>   .selectExpr(statement).showString(1)
> {code}
> Generated code and stack trace
> {code:java}
> 23:10:45.801 ERROR 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 30, Column 36: Expression "inputadapter_value" is not 
> an rvalue
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private scala.collection.Iterator inputadapter_input;
> /* 008 */   private Object[] project_values;
> /* 009 */   private UnsafeRow project_result;
> /* 010 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
> /* 011 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> project_rowWriter;
> /* 012 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> project_arrayWriter;
> /* 013 */
> /* 014 */   public GeneratedIterator(Object[] references) {
> /* 015 */ this.references = references;
> /* 016 */   }
> /* 017 */
> /* 018 */   public void init(int index, scala.collection.Iterator inputs[]) {
> /* 019 */ partitionIndex = index;
> /* 020 */ inputadapter_input = inputs[0];
> /* 021 */ this.project_values = null;
> /* 022 */ project_result = new UnsafeRow(1);
> /* 023 */ this.project_holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result,
>  32);
> /* 024 */ this.project_rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder,
>  1);
> /* 025 */ this.project_arrayWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 026 */   }
> /* 027 */
> /* 028 */   private void project_apply_0(InternalRow inputadapter_row) {
> /* 029 */ double project_value1 = -1.0;
> /* 030 */ project_value1 = inputadapter_value + 1.0D;
> /* 031 */ if (false) {
> /* 032 */   project_values[0] = null;
> /* 033 */ } else {
> /* 034 */   project_values[0] = project_value1;
> /* 035 */ }
> /* 036 */
> /* 037 */ double project_value4 = -1.0;
> /* 038 */ project_value4 = inputadapter_value + 1.0D;
> /* 039 */ if (false) {
> /* 040 */   project_values[1] = null;
> /* 041 */ } else {
> /* 042 */   project_values[1] = project_value4;
> /* 043 */ }
> ...
> /* 4032 */   }
> /* 4033 */
> /* 4034 */   protected void processNext() throws java.io.IOException {
> /* 4035 */ while (inputadapter_input.hasNext()) {
> /* 4036 */   InternalRow inputadapter_row = (InternalRow) 
> inputadapter_input.next();
> /* 4037 */   System.out.println("row: " + inputadapter_row.getClass() + 
> ", " + inputadapter_row);
> /* 4038 */   double inputadapter_value = inputadapter_row.getDouble(0);
> /* 4039 */
> /* 4040 */   final boolean project_isNull = false;
> /* 4041 */   this.project_values = new 
> Object[500];project_apply_0(inputadapter_row);
> /* 4042 */   project_apply_1(inputadapter_row);
> /* 4043 */   /* final ArrayData project_value = 
> org.apache.spark.sql.catalyst.util.GenericArrayData.allocate(project_values); 
> */
> /* 4044 */   final ArrayData project_value = new 
> org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
> /* 4045 */   this.project_values = null;
> /* 4046 */   project_holder.reset();
> /* 4047 */
> /* 4048 */   project_rowWriter.zeroOutNullBytes();
> /* 4049 */
> /* 4050 */   if (project_isNull) {
> /* 4051 */ 

[jira] [Updated] (SPARK-8144) For PySpark SQL, automatically convert values provided in readwriter options to string

2016-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8144:
-
Assignee: Yijie Shen

> For PySpark SQL, automatically convert values provided in readwriter options 
> to string
> --
>
> Key: SPARK-8144
> URL: https://issues.apache.org/jira/browse/SPARK-8144
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Yijie Shen
> Fix For: 1.4.2, 1.5.0
>
>
> Because of typos in lines 81 and 240 of:
> [https://github.com/apache/spark/blob/16fc49617e1dfcbe9122b224f7f63b7bfddb36ce/python/pyspark/sql/readwriter.py]
> (Search for "option(")
> CC: [~yhuai] [~davies]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12916) Support Row.fromSeq and Row.toSeq methods in pyspark

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557964#comment-15557964
 ] 

Hyukjin Kwon commented on SPARK-12916:
--

I am pretty sure we don't need this, but I would like to cc [~holdenk] here.

> Support Row.fromSeq and Row.toSeq methods in pyspark
> 
>
> Key: SPARK-12916
> URL: https://issues.apache.org/jira/browse/SPARK-12916
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Shubhanshu Mishra
>Priority: Minor
>  Labels: dataframe, pyspark, row, sql
>
> Pyspark should also have access to the Row functions like fromSeq and toSeq 
> which are exposed in the scala api. 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row
> This will be useful when constructing custom columns from functions applied to 
> dataframes. A good example is present in the following SO thread: 
> http://stackoverflow.com/questions/32196207/derive-multiple-columns-from-a-single-column-in-a-spark-dataframe
> {code:scala}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> def foobarFunc(x: Long, y: Double, z: String): Seq[Any] = 
>   Seq(x * y, z.head.toInt * y)
> val schema = StructType(df.schema.fields ++
>   Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))
> val rows = df.rdd.map(r => Row.fromSeq(
>   r.toSeq ++
>   foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"
> val df2 = sqlContext.createDataFrame(rows, schema)
> df2.show
> // +---++---++-+
> // |  x|   y|  z| foo|  bar|
> // +---++---++-+
> // |  1| 3.0|  a| 3.0|291.0|
> // |  2|-1.0|  b|-2.0|-98.0|
> // |  3| 0.0|  c| 0.0|  0.0|
> // +---++---++-+
> {code}
> I am ready to work on this feature. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12497) thriftServer does not support semicolon in sql

2016-10-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon closed SPARK-12497.

Resolution: Duplicate

It seems the duplicate is clearly linked. Please reopen this if you strongly 
think this is not a duplicate.

> thriftServer does not support semicolon in sql 
> ---
>
> Key: SPARK-12497
> URL: https://issues.apache.org/jira/browse/SPARK-12497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: nilonealex
>
> 0: jdbc:hive2://192.168.128.130:14005> SELECT ';' from tx_1 limit 1 ;
> Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
> '' '' '' in select clause; line 1 pos 8 (state=,code=0)
> 0: jdbc:hive2://192.168.128.130:14005> 
> 0: jdbc:hive2://192.168.128.130:14005> select '\;' from tx_1 limit 1 ; 
> Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
> '' '' '' in select clause; line 1 pos 9 (state=,code=0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11995) Partitioning Parquet by DateType

2016-10-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon closed SPARK-11995.

Resolution: Duplicate

As SPARK-17388 has a PR, I will mark this as a duplicate. Please reopen if you 
feel strongly that this is not a duplicate.

> Partitioning Parquet by DateType
> 
>
> Key: SPARK-11995
> URL: https://issues.apache.org/jira/browse/SPARK-11995
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Jack Arenas
>Priority: Minor
>
> ... After writing to s3 and partitioning by a DateType column, reads on the 
> parquet "table" (i.e. s3n://s3_bucket_url/table where date partitions break 
> the table into date-based s3n://s3_bucket_url/table/date=2015-11-25 chunks) 
> will show the partitioned date column as a StringType...
> https://github.com/databricks/spark-redshift/issues/122
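
For context, a minimal sketch of the write/read round trip this describes, assuming 
{{df}} has a DateType column named {{date}} (the path is the placeholder from the 
description):

{code}
// Write partitioned by the DateType column.
df.write.partitionBy("date").parquet("s3n://s3_bucket_url/table")

// On read, the partition column currently comes back inferred as StringType
// rather than DateType, which is what this ticket asks to improve.
spark.read.parquet("s3n://s3_bucket_url/table").printSchema()
{code}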



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11868) wrong results returned from dataframe created from Rows without consistent schema on pyspark

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557944#comment-15557944
 ] 

Hyukjin Kwon edited comment on SPARK-11868 at 10/8/16 1:12 PM:
---

FYI, it now prints differently:

{code}
>>> dicts = [{'1':1,'2':2,'3':3}]*10+[{'1':1,'3':3}]
>>> rows = [pyspark.sql.Row(**r) for r in dicts]
>>> rows_rdd = sc.parallelize(rows)
>>> dicts_rdd = sc.parallelize(dicts)
>>> rows_df = sqlContext.createDataFrame(rows_rdd)
>>> dicts_df = sqlContext.createDataFrame(dicts_rdd)
.../spark/python/pyspark/sql/session.py:336: UserWarning: Using RDD of dict to 
inferSchema is deprecated. Use pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. "
>>>
>>> print(rows_df.select(['2']).collect()[10])
16/10/08 22:10:03 ERROR Executor: Exception in task 7.0 in stage 2.0 (TID 9)
java.lang.IllegalStateException: Input row doesn't have expected number of 
values required by the schema. 3 fields are required while 2 values are 
provided.
at 
org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:136)
at 
org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:656)
...
>>> print(dicts_df.select(['2']).collect()[10])
Row(2=None)
{code}


was (Author: hyukjin.kwon):
FYI, it now prints differently:

{code}
>>> dicts = [{'1':1,'2':2,'3':3}]*10+[{'1':1,'3':3}]
>>> rows = [pyspark.sql.Row(**r) for r in dicts]
>>> rows_rdd = sc.parallelize(rows)
>>> dicts_rdd = sc.parallelize(dicts)
>>> rows_df = sqlContext.createDataFrame(rows_rdd)
>>> dicts_df = sqlContext.createDataFrame(dicts_rdd)
/Users/hyukjinkwon/Desktop/workspace/local/forked/spark/python/pyspark/sql/session.py:336:
 UserWarning: Using RDD of dict to inferSchema is deprecated. Use 
pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. "
>>>
>>> print(rows_df.select(['2']).collect()[10])
16/10/08 22:10:03 ERROR Executor: Exception in task 7.0 in stage 2.0 (TID 9)
java.lang.IllegalStateException: Input row doesn't have expected number of 
values required by the schema. 3 fields are required while 2 values are 
provided.
at 
org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:136)
at 
org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:656)
...
>>> print(dicts_df.select(['2']).collect()[10])
Row(2=None)
{code}

> wrong results returned from dataframe created from Rows without consistent 
> schema on pyspark
> --
>
> Key: SPARK-11868
> URL: https://issues.apache.org/jira/browse/SPARK-11868
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
> Environment: pyspark
>Reporter: Yuval Tanny
>
> When the schema is inconsistent (but is the same for the first 10 rows), it's 
> possible to create a dataframe from dictionaries, and if a key is missing, its 
> value is None. But when trying to create a dataframe from the corresponding 
> Rows, we get inconsistent behavior (wrong values under the keys) without any 
> exception. See example below.
> The problems seem to be:
> 1. The schema is not verified against all rows.
> 2. In pyspark.sql.types._create_converter, None is set when converting a 
> dictionary and a field does not exist:
> {code}
> return tuple([conv(d.get(name)) for name, conv in zip(names, converters)])
> {code}
> But for Rows, it is just assumed that the number of fields in the tuple equals 
> the number of fields in the inferred schema, so otherwise values are silently 
> placed under the wrong keys:
> {code}
> return tuple(conv(v) for v, conv in zip(obj, converters))
> {code}
> Thanks. 
> example:
> {code}
> dicts = [{'1':1,'2':2,'3':3}]*10+[{'1':1,'3':3}]
> rows = [pyspark.sql.Row(**r) for r in dicts]
> rows_rdd = sc.parallelize(rows)
> dicts_rdd = sc.parallelize(dicts)
> rows_df = sqlContext.createDataFrame(rows_rdd)
> dicts_df = sqlContext.createDataFrame(dicts_rdd)
> print(rows_df.select(['2']).collect()[10])
> print(dicts_df.select(['2']).collect()[10])
> {code}
> output:
> {code}
> Row(2=3)
> Row(2=None)
> {code}
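
To make the two quoted conversion paths concrete, here is a hypothetical, 
stripped-down illustration of by-name versus positional matching in plain Python 
(this is not the actual pyspark.sql.types code):

{code}
# Schema as inferred from the first ten records.
names = ['1', '2', '3']

# The eleventh record, which is missing key '2'.
short_dict = {'1': 1, '3': 3}
short_row = (1, 3)  # the same record once turned into a Row (a tuple of values)

# Dict path: fields are matched by name, so the missing key becomes None.
print(tuple(short_dict.get(name) for name in names))  # (1, None, 3)

# Row path: values are zipped positionally against the schema, so the value
# for field '3' silently lands under column '2' and no error is raised.
print(dict(zip(names, short_row)))  # {'1': 1, '2': 3}
{code}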



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11868) wrong results returned from dataframe created from Rows without consistent schema on pyspark

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557944#comment-15557944
 ] 

Hyukjin Kwon edited comment on SPARK-11868 at 10/8/16 1:11 PM:
---

FYI, it now prints differently:

{code}
>>> dicts = [{'1':1,'2':2,'3':3}]*10+[{'1':1,'3':3}]
>>> rows = [pyspark.sql.Row(**r) for r in dicts]
>>> rows_rdd = sc.parallelize(rows)
>>> dicts_rdd = sc.parallelize(dicts)
>>> rows_df = sqlContext.createDataFrame(rows_rdd)
>>> dicts_df = sqlContext.createDataFrame(dicts_rdd)
/Users/hyukjinkwon/Desktop/workspace/local/forked/spark/python/pyspark/sql/session.py:336:
 UserWarning: Using RDD of dict to inferSchema is deprecated. Use 
pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. "
>>>
>>> print(rows_df.select(['2']).collect()[10])
16/10/08 22:10:03 ERROR Executor: Exception in task 7.0 in stage 2.0 (TID 9)
java.lang.IllegalStateException: Input row doesn't have expected number of 
values required by the schema. 3 fields are required while 2 values are 
provided.
at 
org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:136)
at 
org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:656)
...
>>> print(dicts_df.select(['2']).collect()[10])
Row(2=None)
{code}


was (Author: hyukjin.kwon):
FYI, it now prints differently:

{code}
>>> dicts = [{'1':1,'2':2,'3':3}]*10+[{'1':1,'3':3}]
>>> rows = [pyspark.sql.Row(**r) for r in dicts]
>>> rows_rdd = sc.parallelize(rows)
>>> dicts_rdd = sc.parallelize(dicts)
>>> rows_df = sqlContext.createDataFrame(rows_rdd)
>>> dicts_df = sqlContext.createDataFrame(dicts_rdd)
/Users/hyukjinkwon/Desktop/workspace/local/forked/spark/python/pyspark/sql/session.py:336:
 UserWarning: Using RDD of dict to inferSchema is deprecated. Use 
pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. "
>>>
>>> print(rows_df.select(['2']).collect()[10])
16/10/08 22:10:03 ERROR Executor: Exception in task 7.0 in stage 2.0 (TID 9)
java.lang.IllegalStateException: Input row doesn't have expected number of 
values required by the schema. 3 fields are required while 2 values are 
provided.
at 
org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:136)
at 
org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:656)
at 
org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:656)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/08 22:10:03 WARN TaskSetManager: Lost task 7.0 in stage 2.0 (TID 9, 
localhost): java.lang.IllegalStateException: Input row doesn't have expected 
number of values required by the schema. 3 fields are required while 2 values 
are provided.
at 
org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:136)
at 
org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:656)
at 
org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:656)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 

[jira] [Commented] (SPARK-11868) wrong results returned from dataframe created from Rows without consistent schema on pyspark

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557944#comment-15557944
 ] 

Hyukjin Kwon commented on SPARK-11868:
--

FYI, it now prints differently:

{code}
>>> dicts = [{'1':1,'2':2,'3':3}]*10+[{'1':1,'3':3}]
>>> rows = [pyspark.sql.Row(**r) for r in dicts]
>>> rows_rdd = sc.parallelize(rows)
>>> dicts_rdd = sc.parallelize(dicts)
>>> rows_df = sqlContext.createDataFrame(rows_rdd)
>>> dicts_df = sqlContext.createDataFrame(dicts_rdd)
/Users/hyukjinkwon/Desktop/workspace/local/forked/spark/python/pyspark/sql/session.py:336:
 UserWarning: Using RDD of dict to inferSchema is deprecated. Use 
pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. "
>>>
>>> print(rows_df.select(['2']).collect()[10])
16/10/08 22:10:03 ERROR Executor: Exception in task 7.0 in stage 2.0 (TID 9)
java.lang.IllegalStateException: Input row doesn't have expected number of 
values required by the schema. 3 fields are required while 2 values are 
provided.
at 
org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:136)
at 
org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:656)
at 
org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:656)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/08 22:10:03 WARN TaskSetManager: Lost task 7.0 in stage 2.0 (TID 9, 
localhost): java.lang.IllegalStateException: Input row doesn't have expected 
number of values required by the schema. 3 fields are required while 2 values 
are provided.
at 
org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:136)
at 
org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:656)
at 
org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:656)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at 

[jira] [Commented] (SPARK-11784) enable Timestamp filter pushdown

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557940#comment-15557940
 ] 

Hyukjin Kwon commented on SPARK-11784:
--

Could you fill in the description? It seems you are referring to 
{{TimestampType}} filter pushdown support in the Parquet data source.

> enable Timestamp filter pushdown
> 
>
> Key: SPARK-11784
> URL: https://issues.apache.org/jira/browse/SPARK-11784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Ian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11660) Spark Thrift GetResultSetMetadata describes a VARCHAR as a STRING

2016-10-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon closed SPARK-11660.

Resolution: Duplicate

This seems to be a duplicate. Please reopen this if you feel strongly that it is 
not a duplicate and is a different issue.

> Spark Thrift GetResultSetMetadata describes a VARCHAR as a STRING
> -
>
> Key: SPARK-11660
> URL: https://issues.apache.org/jira/browse/SPARK-11660
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Chip Sands
>
> In the Spark SQL thrift interface, the GetResultSetMetadata reply packet that 
> describes the result set metadata reports a column that is defined as a 
> VARCHAR in the database as the native type STRING. Data still returns 
> correctly as the thrift string type, but ODBC/JDBC is not able to correctly 
> describe the data type being returned or its defined maximum length.
> FYI Hive returns it correctly.
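
This lines up with Spark SQL generally modelling VARCHAR(n) columns as its own 
StringType, which is presumably why the Thrift server reports STRING in 
GetResultSetMetadata. A hedged sketch (table and column names are made up; a 
Hive-enabled session is assumed):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS varchar_demo (name VARCHAR(10))")

# The declared VARCHAR length is not surfaced in the Spark schema:
# root
#  |-- name: string (nullable = true)
spark.table("varchar_demo").printSchema()
{code}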



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557931#comment-15557931
 ] 

Hyukjin Kwon commented on SPARK-11620:
--

[~swethakasireddy] Could you please check if this still happens in the current 
master or latest versions?

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
> parquet.io.ParquetEncodingException
> 
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: swetha k
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


