[jira] [Assigned] (SPARK-22668) CodegenContext.splitExpressions() creates incorrect results with global variable arguments

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22668:


Assignee: (was: Apache Spark)

> CodegenContext.splitExpressions() creates incorrect results with global 
> variable arguments 
> ---
>
> Key: SPARK-22668
> URL: https://issues.apache.org/jira/browse/SPARK-22668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> {{CodegenContext.splitExpressions()}} creates incorrect results with 
> arguments that were declared as global variables.
> {code}
> class Test {
>   int global1;
>   void splittedFunction(int global1) {
> ...
> global1 = 2;
>   }
>   void apply() {
> global1 = 1;
> ...
> splittedFunction(global1);
> // global1 should be 2
>   }
> }
> {code}
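One plausible shape of a fix, sketched here in Scala purely for illustration (the generated code itself is Java, and the actual change is the pull request referenced in the comment below): have the split-out function reference the field directly instead of taking a same-named parameter that shadows it.

{code}
// Illustrative sketch only, not Spark's generated code.
class SplitSketch {
  var global1: Int = 0

  // The split-out function updates the field directly rather than a
  // by-value parameter copy of it.
  private def splitFunction(): Unit = {
    global1 = 2
  }

  def apply(): Int = {
    global1 = 1
    splitFunction()
    global1 // correctly 2 here
  }
}
{code}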






[jira] [Commented] (SPARK-22668) CodegenContext.splitExpressions() creates incorrect results with global variable arguments

2017-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275439#comment-16275439
 ] 

Apache Spark commented on SPARK-22668:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19865

> CodegenContext.splitExpressions() creates incorrect results with global 
> variable arguments 
> ---
>
> Key: SPARK-22668
> URL: https://issues.apache.org/jira/browse/SPARK-22668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> {{CodegenContext.splitExpressions()}} creates incorrect results with 
> arguments that were declared as global variables.
> {code}
> class Test {
>   int global1;
>   void splittedFunction(int global1) {
> ...
> global1 = 2;
>   }
>   void apply() {
> global1 = 1;
> ...
> splittedFunction(global1);
> // global1 should be 2
>   }
> }
> {code}






[jira] [Assigned] (SPARK-22668) CodegenContext.splitExpressions() creates incorrect results with global variable arguments

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22668:


Assignee: Apache Spark

> CodegenContext.splitExpressions() creates incorrect results with global 
> variable arguments 
> ---
>
> Key: SPARK-22668
> URL: https://issues.apache.org/jira/browse/SPARK-22668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> {{CodegenContext.splitExpressions()}} creates incorrect results with 
> arguments that were declared as global variables.
> {code}
> class Test {
>   int global1;
>   void splittedFunction(int global1) {
> ...
> global1 = 2;
>   }
>   void apply() {
> global1 = 1;
> ...
> splittedFunction(global1);
> // global1 should be 2
>   }
> }
> {code}






[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-01 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275426#comment-16275426
 ] 

Nick Pentreath commented on SPARK-8418:
---

*1 I'm OK with throwing an exception. We can update the previous and
in-progress PRs accordingly.

*2 Where we are modifying an existing API, obviously we need to keep both.

But I prefer only inputCols for new components. We can provide a convenience
method to set a single (or a few) input columns - I did that for
FeatureHasher.

Something like setInputCol(col: String, others: String*), but the param set
under the hood is inputCols; a sketch follows below. Java callers would still
have to use setInputCols, since the vararg form only works for Scala, I think.

We can also deprecate the single-column variants in 3.0 if we like?

*3 Yes, we must thoroughly test this before the 2.3 release. I think it should
be fine, as it's just adding a few new parameters, which is nothing out of
the ordinary.

*4 I will create JIRAs for the Python APIs - ideally we'd like them for 2.3.
Fortunately they should be pretty trivial to complete.
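A minimal sketch of that convenience-setter idea, using hypothetical names (this is not the actual Spark params API):

{code}
// Sketch only: a single-column call is stored in the multi-column
// `inputCols` param under the hood.
class MultiColParamsSketch {
  private var inputCols: Array[String] = Array.empty

  def setInputCols(values: Array[String]): this.type = {
    inputCols = values
    this
  }

  // Scala-only convenience: setInputCol("a") or setInputCol("a", "b", "c").
  // Java callers would still use setInputCols(Array(...)).
  def setInputCol(col: String, others: String*): this.type =
    setInputCols((col +: others).toArray)

  def getInputCols: Array[String] = inputCols
}
{code}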



> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication






[jira] [Resolved] (SPARK-22651) Calling ImageSchema.readImages initiate multiple Hive clients

2017-12-01 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22651.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19845
[https://github.com/apache/spark/pull/19845]

> Calling ImageSchema.readImages initiate multiple Hive clients
> -
>
> Key: SPARK-22651
> URL: https://issues.apache.org/jira/browse/SPARK-22651
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> While playing with images, I realised that calling {{ImageSchema.readImages}} 
> multiple times appears to create multiple Hive clients.
> {code}
> from pyspark.ml.image import ImageSchema
> data_path = 'data/mllib/images/kittens'
> _ = ImageSchema.readImages(data_path, recursive=True, 
> dropImageFailures=True).collect()
> _ = ImageSchema.readImages(data_path, recursive=True, 
> dropImageFailures=True).collect()
> {code}
> {code}
> ...
> org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test 
> connection to the given database. JDBC url = 
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see 
> the next exception for details.
> ...
>   at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
> ...
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
> ...
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
> ...
>   at 
> org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
>   at 
> org.apache.spark.ml.image.ImageSchema$$anonfun$readImages$2$$anonfun$apply$1.apply(ImageSchema.scala:253)
> ...
> Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see 
> the next exception for details.
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   ... 121 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /.../spark/metastore_db.
> ...
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/ml/image.py", line 190, in readImages
> dropImageFailures, float(sampleRatio), seed)
>   File "/.../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 
> 1160, in __call__
>   File "/.../spark/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: 
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
> {code}
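As a general pattern (a hedged sketch, not necessarily the exact change made in the pull request above), repeated entry points like this can reuse an already-running SparkSession instead of building state from scratch, so repeated calls do not spin up extra Hive clients:

{code}
import org.apache.spark.sql.SparkSession

// Sketch: prefer the active or default session; only build one as a last resort.
def currentSession(): SparkSession =
  SparkSession.getActiveSession
    .orElse(SparkSession.getDefaultSession)
    .getOrElse(SparkSession.builder().getOrCreate())
{code}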






[jira] [Assigned] (SPARK-22651) Calling ImageSchema.readImages initiate multiple Hive clients

2017-12-01 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22651:


Assignee: Hyukjin Kwon

> Calling ImageSchema.readImages initiate multiple Hive clients
> -
>
> Key: SPARK-22651
> URL: https://issues.apache.org/jira/browse/SPARK-22651
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> While playing with images, I realised that calling {{ImageSchema.readImages}} 
> multiple times appears to create multiple Hive clients.
> {code}
> from pyspark.ml.image import ImageSchema
> data_path = 'data/mllib/images/kittens'
> _ = ImageSchema.readImages(data_path, recursive=True, 
> dropImageFailures=True).collect()
> _ = ImageSchema.readImages(data_path, recursive=True, 
> dropImageFailures=True).collect()
> {code}
> {code}
> ...
> org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test 
> connection to the given database. JDBC url = 
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see 
> the next exception for details.
> ...
>   at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
> ...
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
> ...
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
> ...
>   at 
> org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
>   at 
> org.apache.spark.ml.image.ImageSchema$$anonfun$readImages$2$$anonfun$apply$1.apply(ImageSchema.scala:253)
> ...
> Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see 
> the next exception for details.
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   ... 121 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /.../spark/metastore_db.
> ...
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/ml/image.py", line 190, in readImages
> dropImageFailures, float(sampleRatio), seed)
>   File "/.../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 
> 1160, in __call__
>   File "/.../spark/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: 
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
> {code}






[jira] [Commented] (SPARK-22673) InMemoryRelation should utilize on-disk table stats whenever possible

2017-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275141#comment-16275141
 ] 

Apache Spark commented on SPARK-22673:
--

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/19864

> InMemoryRelation should utilize on-disk table stats whenever possible
> -
>
> Key: SPARK-22673
> URL: https://issues.apache.org/jira/browse/SPARK-22673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Nan Zhu
>
> The current implementation of InMemoryRelation always uses the most expensive 
> execution plan when writing the cache.
> With CBO enabled, we can actually have a more accurate estimate of the 
> underlying table size...






[jira] [Assigned] (SPARK-22673) InMemoryRelation should utilize on-disk table stats whenever possible

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22673:


Assignee: (was: Apache Spark)

> InMemoryRelation should utilize on-disk table stats whenever possible
> -
>
> Key: SPARK-22673
> URL: https://issues.apache.org/jira/browse/SPARK-22673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Nan Zhu
>
> The current implementation of InMemoryRelation always uses the most expensive 
> execution plan when writing the cache.
> With CBO enabled, we can actually have a more accurate estimate of the 
> underlying table size...






[jira] [Assigned] (SPARK-22673) InMemoryRelation should utilize on-disk table stats whenever possible

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22673:


Assignee: Apache Spark

> InMemoryRelation should utilize on-disk table stats whenever possible
> -
>
> Key: SPARK-22673
> URL: https://issues.apache.org/jira/browse/SPARK-22673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Nan Zhu
>Assignee: Apache Spark
>
> The current implementation of InMemoryRelation always uses the most expensive 
> execution plan when writing the cache.
> With CBO enabled, we can actually have a more accurate estimate of the 
> underlying table size...






[jira] [Created] (SPARK-22673) InMemoryRelation should utilize on-disk table stats whenever possible

2017-12-01 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-22673:
---

 Summary: InMemoryRelation should utilize on-disk table stats 
whenever possible
 Key: SPARK-22673
 URL: https://issues.apache.org/jira/browse/SPARK-22673
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Nan Zhu


The current implementation of InMemoryRelation always uses the most expensive 
execution plan when writing the cache.

With CBO enabled, we can actually have a more accurate estimate of the 
underlying table size...
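As a spark-shell style illustration of the estimate CBO can provide (a hedged sketch: {{my_table}} is a placeholder, and statistics must have been collected first):

{code}
// Collect table-level statistics so the optimizer has an accurate size.
spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS")

// Inspect the optimizer's size estimate for the relation.
val plan = spark.table("my_table").queryExecution.optimizedPlan
println(plan.stats.sizeInBytes) // size estimate; the stats accessor's shape varies slightly across Spark versions
{code}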






[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2017-12-01 Thread Dmitriy Reshetnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275066#comment-16275066
 ] 

Dmitriy Reshetnikov commented on SPARK-7736:


Spark 2.2 is still facing that issue.
In my case Azkaban executes a Spark job, and the finalStatus of this job in the
Resource Manager is SUCCESS in any case.

> Exception not failing Python applications (in yarn cluster mode)
> 
>
> Key: SPARK-7736
> URL: https://issues.apache.org/jira/browse/SPARK-7736
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
> Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
>Reporter: Shay Rojansky
>Assignee: Marcelo Vanzin
> Fix For: 1.5.1, 1.6.0
>
>
> It seems that exceptions thrown in Python spark apps after the SparkContext 
> is instantiated don't cause the application to fail, at least in Yarn: the 
> application is marked as SUCCEEDED.
> Note that any exception right before the SparkContext correctly places the 
> application in FAILED state.






[jira] [Assigned] (SPARK-22672) Move OrcTest to `sql/core`

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22672:


Assignee: Apache Spark

> Move OrcTest to `sql/core`
> --
>
> Key: SPARK-22672
> URL: https://issues.apache.org/jira/browse/SPARK-22672
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>
> To support ORC tests without Hive, we had better have `OrcTest` in `sql/core` 
> instead of `sql/hive`.






[jira] [Commented] (SPARK-22672) Move OrcTest to `sql/core`

2017-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275053#comment-16275053
 ] 

Apache Spark commented on SPARK-22672:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19863

> Move OrcTest to `sql/core`
> --
>
> Key: SPARK-22672
> URL: https://issues.apache.org/jira/browse/SPARK-22672
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> To support ORC tests without Hive, we had better have `OrcTest` in `sql/core` 
> instead of `sql/hive`.






[jira] [Assigned] (SPARK-22672) Move OrcTest to `sql/core`

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22672:


Assignee: (was: Apache Spark)

> Move OrcTest to `sql/core`
> --
>
> Key: SPARK-22672
> URL: https://issues.apache.org/jira/browse/SPARK-22672
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> To support ORC tests without Hive, we had better have `OrcTest` in `sql/core` 
> instead of `sql/hive`.






[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275049#comment-16275049
 ] 

Joseph K. Bradley commented on SPARK-8418:
--

I just glanced through the various PRs adding multi-column support and wanted 
to get consensus about a few items to make sure we have consistent APIs.  CC 
[~mlnick], [~yuhaoyan], [~yanboliang], [~WeichenXu123], [~huaxing], [~viirya]  
Let me know what you think!

*1. When both inputCol and inputCols are specified, what should we do?*

* [SPARK-20542]: Bucketizer: logWarning
* [SPARK-13030]: OneHotEncoder: n/a (no single-column support)
* [SPARK-11215]: StringIndexer: throw exception
* [SPARK-22397]: QuantileDiscretizer: logWarning
* my vote: throw an exception (safer, since it's easier for users to recognize 
their error); see the sketch at the end of this comment

*2. Should we have single- and multi-column support or just multi-column?  
E.g., should we have (a) inputCol and inputCols or (b) only inputCols?*

Currently, [SPARK-13030] only has multi-column support for the new 
OneHotEncoderEstimator.  The other PRs have both single- and multi-column 
support since they are modifying existing APIs.
*Q*: Should we add single-column to OneHotEncoderEstimator for consistency or 
not bother?  I'm ambivalent.

*3. Backwards compatibility for ML persistence*

We'll have to be aware of whether we're breaking compatibility.  I don't see 
problems in most PRs but have not tested them manually.  The only PR with an 
issue is [SPARK-13030] for OneHotEncoder; however, it seems reasonable to 
break persistence compatibility there.

*4. Python APIs*

I don't see follow-ups for Python APIs yet.  Are those planned for 2.3?
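A hedged sketch of the check proposed in item 1 above (the parameter names are illustrative, not the actual Spark params API):

{code}
// Fail fast when both the single-column and multi-column params are set.
def validateSingleVsMultiCol(
    inputCol: Option[String],
    inputCols: Option[Array[String]]): Unit = {
  if (inputCol.isDefined && inputCols.isDefined) {
    throw new IllegalArgumentException(
      "Only one of inputCol and inputCols may be set, but both were set.")
  }
}
{code}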

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication






[jira] [Created] (SPARK-22672) Move OrcTest to `sql/core`

2017-12-01 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-22672:
-

 Summary: Move OrcTest to `sql/core`
 Key: SPARK-22672
 URL: https://issues.apache.org/jira/browse/SPARK-22672
 Project: Spark
  Issue Type: Task
  Components: SQL, Tests
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun
Priority: Trivial


To support ORC tests without Hive, we had better have `OrcTest` in `sql/core` 
instead of `sql/hive`.






[jira] [Resolved] (SPARK-22638) Use a separate queue for StreamingQueryListenerBus

2017-12-01 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-22638.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

> Use a separate queue for StreamingQueryListenerBus
> --
>
> Key: SPARK-22638
> URL: https://issues.apache.org/jira/browse/SPARK-22638
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: We can use a separate Spark event queue for 
> StreamingQueryListenerBus so that if there are many non-streaming events, 
> streaming query listeners don't need to wait for other Spark listeners and 
> can catch up.
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.3.0
>
>







[jira] [Assigned] (SPARK-22671) SortMergeJoin read more data when wholeStageCodegen is off compared with when it is on

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22671:


Assignee: (was: Apache Spark)

> SortMergeJoin read more data when wholeStageCodegen is off compared with when 
> it is on
> --
>
> Key: SPARK-22671
> URL: https://issues.apache.org/jira/browse/SPARK-22671
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Chenzhao Guo
>
> In SortMergeJoin (with wholeStageCodegen), an optimization already exists: if 
> the left table of a partition is empty, then there is no need to read the 
> right table of the corresponding partition. This benefits the case in which 
> many partitions of the left table are empty and the right table is big.
> In the code path without wholeStageCodegen, however, this optimization doesn't 
> happen.






[jira] [Commented] (SPARK-16139) Audit tests for leaked threads

2017-12-01 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274647#comment-16274647
 ] 

Gabor Somogyi commented on SPARK-16139:
---

I would like to solve this ticket. Please notify me if somebody is already 
working on it.


> Audit tests for leaked threads
> --
>
> Key: SPARK-16139
> URL: https://issues.apache.org/jira/browse/SPARK-16139
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>
> Lots of our tests don't properly shut down everything they create, and end up 
> leaking lots of threads.  For example, {{TaskSetManagerSuite}} doesn't stop 
> the extra {{TaskScheduler}} and {{DAGScheduler}} it creates.  There are a 
> couple more instances I've run into recently, e.g. in 
> [{{DAGSchedulerSuite}}|https://github.com/apache/spark/commit/cf1995a97645f0b44c997f4fdbba631fd6b91a16#diff-f3b410b16818d8f34bb1eb4120a60d51R235
>  ]
> I'm fixing these piecemeal when I see them (e.g., TaskSetManagerSuite should 
> be fixed by my PR for SPARK-16136), but it would be great to have a 
> comprehensive audit and fix this across all tests.
> This should be semi-automatable.  In {{SparkFunSuite}}, you could grab all 
> threads before the test starts and after it completes.  Then you could 
> clearly log all threads started after the test started but still going.  
> Unfortunately this isn't perfect: it seems that netty threads aren't killed 
> immediately on shutdown.  It's OK if some of them linger beyond the test, so 
> you may need to do some whitelisting based on thread name and a little more 
> manual inspection.  But you could at least clearly log the relevant info, so 
> that after a Jenkins run you could easily pull the info from the logs.
> Bonus points if you can figure out some way to make this output visible 
> outside of the logs, ideally even in the test report that makes it to GitHub, 
> but that isn't necessary, and unless it's very easy it's probably best left for a 
> separate task.
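A minimal sketch of the semi-automated audit described above, assuming it would be wired into a harness like {{SparkFunSuite}} (names are illustrative, not the actual change):

{code}
import scala.collection.JavaConverters._

// Snapshot the names of all live threads.
def snapshotThreadNames(): Set[String] =
  Thread.getAllStackTraces.keySet.asScala.map(_.getName).toSet

// Run a test body and log any threads started during it that are still alive.
def runWithThreadAudit(suiteName: String)(body: => Unit): Unit = {
  val before = snapshotThreadNames()
  try body finally {
    val leaked = snapshotThreadNames() -- before
    // A whitelist of known long-lived threads (e.g. netty) would go here.
    if (leaked.nonEmpty) {
      println(s"POSSIBLE THREAD LEAK IN SUITE $suiteName: ${leaked.mkString(", ")}")
    }
  }
}
{code}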






[jira] [Assigned] (SPARK-22671) SortMergeJoin read more data when wholeStageCodegen is off compared with when it is on

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22671:


Assignee: Apache Spark

> SortMergeJoin read more data when wholeStageCodegen is off compared with when 
> it is on
> --
>
> Key: SPARK-22671
> URL: https://issues.apache.org/jira/browse/SPARK-22671
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Chenzhao Guo
>Assignee: Apache Spark
>
> In SortMergeJoin (with wholeStageCodegen), an optimization already exists: if 
> the left table of a partition is empty, then there is no need to read the 
> right table of the corresponding partition. This benefits the case in which 
> many partitions of the left table are empty and the right table is big.
> In the code path without wholeStageCodegen, however, this optimization doesn't 
> happen.






[jira] [Commented] (SPARK-22671) SortMergeJoin read more data when wholeStageCodegen is off compared with when it is on

2017-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274646#comment-16274646
 ] 

Apache Spark commented on SPARK-22671:
--

User 'gczsjdy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19862

> SortMergeJoin read more data when wholeStageCodegen is off compared with when 
> it is on
> --
>
> Key: SPARK-22671
> URL: https://issues.apache.org/jira/browse/SPARK-22671
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Chenzhao Guo
>
> In SortMergeJoin (with wholeStageCodegen), an optimization already exists: if 
> the left table of a partition is empty, then there is no need to read the 
> right table of the corresponding partition. This benefits the case in which 
> many partitions of the left table are empty and the right table is big.
> In the code path without wholeStageCodegen, however, this optimization doesn't 
> happen.






[jira] [Created] (SPARK-22671) SortMergeJoin read more data when wholeStageCodegen is off compared with when it is on

2017-12-01 Thread Chenzhao Guo (JIRA)
Chenzhao Guo created SPARK-22671:


 Summary: SortMergeJoin read more data when wholeStageCodegen is 
off compared with when it is on
 Key: SPARK-22671
 URL: https://issues.apache.org/jira/browse/SPARK-22671
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Chenzhao Guo


In SortMergeJoin (with wholeStageCodegen), an optimization already exists: if 
the left table of a partition is empty, then there is no need to read the right 
table of the corresponding partition. This benefits the case in which many 
partitions of the left table are empty and the right table is big.

In the code path without wholeStageCodegen, however, this optimization doesn't 
happen.
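A minimal sketch of that short-circuit in plain Scala (illustrative only, not the actual SortMergeJoinExec code): the right side of a partition is only materialized once the left side is known to be non-empty.

{code}
// Build the right-side iterator lazily so an empty left side never
// triggers a read of the right side.
def joinPartition[L, R](
    leftIter: Iterator[L],
    buildRight: () => Iterator[R])(
    merge: (Iterator[L], Iterator[R]) => Iterator[(L, R)]): Iterator[(L, R)] = {
  if (!leftIter.hasNext) {
    Iterator.empty                 // empty left partition: skip the right side entirely
  } else {
    merge(leftIter, buildRight())  // pay the cost of the right side only now
  }
}
{code}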






[jira] [Commented] (SPARK-22267) Spark SQL incorrectly reads ORC file when column order is different

2017-12-01 Thread Mark Petruska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274588#comment-16274588
 ] 

Mark Petruska commented on SPARK-22267:
---

All evidence suggests that this is a Hive bug.
In https://github.com/apache/spark/pull/19744 I tried a couple of 
configurations/properties for the Hive `OrcInputFormat` and `OrcSerde`; none of 
them had any effect: the data was always read in the order it was written (and 
not in the order requested).

> Spark SQL incorrectly reads ORC file when column order is different
> ---
>
> Key: SPARK-22267
> URL: https://issues.apache.org/jira/browse/SPARK-22267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>
> For a long time, Apache Spark SQL has returned incorrect results when the ORC 
> file schema is different from the metastore schema order.
> {code}
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("parquet").mode("overwrite").save("/tmp/p")
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save("/tmp/o")
> scala> sql("CREATE EXTERNAL TABLE p(c2 INT, c1 INT) STORED AS parquet 
> LOCATION '/tmp/p'")
> scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION 
> '/tmp/o'")
> scala> spark.table("p").show  // Parquet is good.
> +---+---+
> | c2| c1|
> +---+---+
> |  2|  1|
> +---+---+
> scala> spark.table("o").show// This is wrong.
> +---+---+
> | c2| c1|
> +---+---+
> |  1|  2|
> +---+---+
> scala> spark.read.orc("/tmp/o").show  // This is correct.
> +---+---+
> | c1| c2|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> *TESTCASE*
> {code}
>   test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order 
> is different") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
>   Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save(path)
>   checkAnswer(spark.read.orc(path), Row(1, 2))
>   Seq("true", "false").foreach { value =>
> withTable("t") {
>   withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
> sql(s"CREATE EXTERNAL TABLE t(c2 INT, c1 INT) STORED AS ORC 
> LOCATION '$path'")
> checkAnswer(spark.table("t"), Row(2, 1))
>   }
> }
>   }
> }
>   }
> {code}






[jira] [Commented] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset

2017-12-01 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274586#comment-16274586
 ] 

Marco Gaido commented on SPARK-20299:
-

I found the problems. I think some checks are missing in Spark, but the main 
problem is in how Java accesses Scala tuples. Indeed, if you have this Scala 
class:
{code}
class Foo {
  def intsNullTuple = (null.asInstanceOf[Int], 2)
  def intAndStringNullTuple = (null.asInstanceOf[Int], "2")
}
{code}

and you try to access it from a Java class, this is the surprising behavior:

{code}
Tuple2 t = (new Foo()).intsNullTuple();
t._1();           // returns 0 !
Object a = t._1;  // a is null
Tuple2 t2 = (new Foo()).intAndStringNullTuple();
t2._1();          // returns null
Object b = t2._1; // b is null
{code}

I am not sure about the reason for this behavior, and I am worried about 
changing this code because it might break backward compatibility. Does anybody 
know more about the reason for this behavior?

> NullPointerException when null and string are in a tuple while encoding 
> Dataset
> ---
>
> Key: SPARK-20299
> URL: https://issues.apache.org/jira/browse/SPARK-20299
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When creating a Dataset from a tuple with {{null}} and a string, NPE is 
> reported. When either is removed, it works fine.
> {code}
> scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS
> res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
> scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top 
> level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product 
> input object), - root class: "scala.Tuple2")._2 AS _2#475
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
>   at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)
>   at 
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
>   ... 58 more
> {code}






[jira] [Commented] (SPARK-22267) Spark SQL incorrectly reads ORC file when column order is different

2017-12-01 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274526#comment-16274526
 ] 

Wenchen Fan commented on SPARK-22267:
-

Just to confirm, is it a Hive bug or a Spark-only bug?

> Spark SQL incorrectly reads ORC file when column order is different
> ---
>
> Key: SPARK-22267
> URL: https://issues.apache.org/jira/browse/SPARK-22267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>
> For a long time, Apache Spark SQL has returned incorrect results when the ORC 
> file schema is different from the metastore schema order.
> {code}
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("parquet").mode("overwrite").save("/tmp/p")
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save("/tmp/o")
> scala> sql("CREATE EXTERNAL TABLE p(c2 INT, c1 INT) STORED AS parquet 
> LOCATION '/tmp/p'")
> scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION 
> '/tmp/o'")
> scala> spark.table("p").show  // Parquet is good.
> +---+---+
> | c2| c1|
> +---+---+
> |  2|  1|
> +---+---+
> scala> spark.table("o").show// This is wrong.
> +---+---+
> | c2| c1|
> +---+---+
> |  1|  2|
> +---+---+
> scala> spark.read.orc("/tmp/o").show  // This is correct.
> +---+---+
> | c1| c2|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> *TESTCASE*
> {code}
>   test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order 
> is different") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
>   Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save(path)
>   checkAnswer(spark.read.orc(path), Row(1, 2))
>   Seq("true", "false").foreach { value =>
> withTable("t") {
>   withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
> sql(s"CREATE EXTERNAL TABLE t(c2 INT, c1 INT) STORED AS ORC 
> LOCATION '$path'")
> checkAnswer(spark.table("t"), Row(2, 1))
>   }
> }
>   }
> }
>   }
> {code}






[jira] [Commented] (SPARK-22387) propagate session configs to data source read/write options

2017-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274456#comment-16274456
 ] 

Apache Spark commented on SPARK-22387:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/19861

> propagate session configs to data source read/write options
> ---
>
> Key: SPARK-22387
> URL: https://issues.apache.org/jira/browse/SPARK-22387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> This is an open discussion. The general idea is that we should allow users to set 
> some common configs in the session conf so that they don't need to type them 
> again and again for each data source operation.
> Proposal 1:
> propagate every session config which starts with {{spark.datasource.config.}} 
> to data source options. The downside is that users may only want to set some 
> common configs for a specific data source.
> Proposal 2:
> propagate session configs which start with 
> {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} 
> operations. One downside is that some data sources may not have a short name, 
> which makes the config key pretty long, e.g. 
> {{spark.datasource.config.com.company.foo.bar.key1}}.
> Proposal 3:
> Introduce a trait `WithSessionConfig` which defines a session config key 
> prefix. Then we can pick session configs with this key prefix and propagate 
> them to this particular data source.
> Another thing worth thinking about: sometimes it's really annoying if 
> users have a typo in the config key and spend a lot of time figuring out why 
> things don't work as expected. We should allow data sources to validate the 
> given options and throw an exception if an option can't be recognized.
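A hedged sketch of Proposal 3 (the trait and helper names are illustrative, not an actual Spark API):

{code}
// A data source declares a session-config key prefix; matching session
// configs are copied into its options with the prefix stripped.
trait WithSessionConfig {
  def keyPrefix: String
}

def propagateSessionConfigs(
    sessionConf: Map[String, String],
    source: WithSessionConfig): Map[String, String] = {
  val prefix = s"spark.datasource.${source.keyPrefix}."
  sessionConf.collect {
    case (key, value) if key.startsWith(prefix) => key.stripPrefix(prefix) -> value
  }
}
{code}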






[jira] [Assigned] (SPARK-22387) propagate session configs to data source read/write options

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22387:


Assignee: (was: Apache Spark)

> propagate session configs to data source read/write options
> ---
>
> Key: SPARK-22387
> URL: https://issues.apache.org/jira/browse/SPARK-22387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> This is an open discussion. The general idea is that we should allow users to set 
> some common configs in the session conf so that they don't need to type them 
> again and again for each data source operation.
> Proposal 1:
> propagate every session config which starts with {{spark.datasource.config.}} 
> to data source options. The downside is that users may only want to set some 
> common configs for a specific data source.
> Proposal 2:
> propagate session configs which start with 
> {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} 
> operations. One downside is that some data sources may not have a short name, 
> which makes the config key pretty long, e.g. 
> {{spark.datasource.config.com.company.foo.bar.key1}}.
> Proposal 3:
> Introduce a trait `WithSessionConfig` which defines a session config key 
> prefix. Then we can pick session configs with this key prefix and propagate 
> them to this particular data source.
> Another thing worth thinking about: sometimes it's really annoying if 
> users have a typo in the config key and spend a lot of time figuring out why 
> things don't work as expected. We should allow data sources to validate the 
> given options and throw an exception if an option can't be recognized.






[jira] [Assigned] (SPARK-22387) propagate session configs to data source read/write options

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22387:


Assignee: Apache Spark

> propagate session configs to data source read/write options
> ---
>
> Key: SPARK-22387
> URL: https://issues.apache.org/jira/browse/SPARK-22387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> This is an open discussion. The general idea is that we should allow users to set 
> some common configs in the session conf so that they don't need to type them 
> again and again for each data source operation.
> Proposal 1:
> propagate every session config which starts with {{spark.datasource.config.}} 
> to data source options. The downside is that users may only want to set some 
> common configs for a specific data source.
> Proposal 2:
> propagate session configs which start with 
> {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} 
> operations. One downside is that some data sources may not have a short name, 
> which makes the config key pretty long, e.g. 
> {{spark.datasource.config.com.company.foo.bar.key1}}.
> Proposal 3:
> Introduce a trait `WithSessionConfig` which defines a session config key 
> prefix. Then we can pick session configs with this key prefix and propagate 
> them to this particular data source.
> Another thing worth thinking about: sometimes it's really annoying if 
> users have a typo in the config key and spend a lot of time figuring out why 
> things don't work as expected. We should allow data sources to validate the 
> given options and throw an exception if an option can't be recognized.






[jira] [Commented] (SPARK-22267) Spark SQL incorrectly reads ORC file when column order is different

2017-12-01 Thread Mark Petruska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274447#comment-16274447
 ] 

Mark Petruska commented on SPARK-22267:
---

Hi [~dongjoon], sorry for the late reply. I do not have a patch for that case; 
my first idea would be to force the `convertMetastoreOrc=true` behaviour even 
when it is set to false. Is it viable? Is there a better solution?

> Spark SQL incorrectly reads ORC file when column order is different
> ---
>
> Key: SPARK-22267
> URL: https://issues.apache.org/jira/browse/SPARK-22267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>
> For a long time, Apache Spark SQL has returned incorrect results when the ORC 
> file schema is different from the metastore schema order.
> {code}
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("parquet").mode("overwrite").save("/tmp/p")
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save("/tmp/o")
> scala> sql("CREATE EXTERNAL TABLE p(c2 INT, c1 INT) STORED AS parquet 
> LOCATION '/tmp/p'")
> scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION 
> '/tmp/o'")
> scala> spark.table("p").show  // Parquet is good.
> +---+---+
> | c2| c1|
> +---+---+
> |  2|  1|
> +---+---+
> scala> spark.table("o").show// This is wrong.
> +---+---+
> | c2| c1|
> +---+---+
> |  1|  2|
> +---+---+
> scala> spark.read.orc("/tmp/o").show  // This is correct.
> +---+---+
> | c1| c2|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> *TESTCASE*
> {code}
>   test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order 
> is different") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
>   Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save(path)
>   checkAnswer(spark.read.orc(path), Row(1, 2))
>   Seq("true", "false").foreach { value =>
> withTable("t") {
>   withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
> sql(s"CREATE EXTERNAL TABLE t(c2 INT, c1 INT) STORED AS ORC 
> LOCATION '$path'")
> checkAnswer(spark.table("t"), Row(2, 1))
>   }
> }
>   }
> }
>   }
> {code}






[jira] [Commented] (SPARK-22669) Avoid unnecessary function calls in code generation

2017-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274388#comment-16274388
 ] 

Apache Spark commented on SPARK-22669:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19860

> Avoid unnecessary function calls in code generation
> ---
>
> Key: SPARK-22669
> URL: https://issues.apache.org/jira/browse/SPARK-22669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>
> In many parts of the code generation codebase, we are splitting the code 
> to avoid exceptions due to the 64KB method size limit. This generates a 
> lot of methods which are called every time, even though sometimes this is not 
> needed. As pointed out here: 
> https://github.com/apache/spark/pull/19752#discussion_r153081547, this is a 
> non-negligible overhead which can be avoided.
> In this JIRA, I propose to use the same approach throughout all the other 
> cases, when possible. I am going to submit a PR soon.
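A minimal sketch of the idea (illustrative only; the real logic lives in CodegenContext and the linked pull request): only wrap generated code into helper functions, and pay the extra calls, when the combined code is actually large enough to need splitting.

{code}
// Returns the code for the call site plus any helper function definitions
// that would need to be added to the generated class.
def splitIfNeeded(blocks: Seq[String], threshold: Int = 1024): (String, Seq[String]) = {
  val combined = blocks.mkString("\n")
  if (combined.length <= threshold) {
    (combined, Seq.empty)  // small enough: inline directly, no extra function calls
  } else {
    val helpers = blocks.zipWithIndex.map { case (body, i) =>
      s"private void helper_$i() { $body }"
    }
    val calls = blocks.indices.map(i => s"helper_$i();").mkString("\n")
    (calls, helpers)
  }
}
{code}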






[jira] [Assigned] (SPARK-22669) Avoid unnecessary function calls in code generation

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22669:


Assignee: (was: Apache Spark)

> Avoid unnecessary function calls in code generation
> ---
>
> Key: SPARK-22669
> URL: https://issues.apache.org/jira/browse/SPARK-22669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>
> In many parts of the code generation codebase, we are splitting the code 
> to avoid exceptions due to the 64KB method size limit. This generates a 
> lot of methods which are called every time, even though sometimes this is not 
> needed. As pointed out here: 
> https://github.com/apache/spark/pull/19752#discussion_r153081547, this is a 
> non-negligible overhead which can be avoided.
> In this JIRA, I propose to use the same approach throughout all the other 
> cases, when possible. I am going to submit a PR soon.






[jira] [Assigned] (SPARK-22669) Avoid unnecessary function calls in code generation

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22669:


Assignee: Apache Spark

> Avoid unnecessary function calls in code generation
> ---
>
> Key: SPARK-22669
> URL: https://issues.apache.org/jira/browse/SPARK-22669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>
> In many parts of the code generation codebase, we are splitting the code 
> to avoid exceptions due to the 64KB method size limit. This generates a 
> lot of methods which are called every time, even though sometimes this is not 
> needed. As pointed out here: 
> https://github.com/apache/spark/pull/19752#discussion_r153081547, this is a 
> non-negligible overhead which can be avoided.
> In this JIRA, I propose to use the same approach throughout all the other 
> cases, when possible. I am going to submit a PR soon.






[jira] [Commented] (SPARK-22670) Not able to create table in HIve with SparkSession when JavaSparkContext is already initialized.

2017-12-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274353#comment-16274353
 ] 

Sean Owen commented on SPARK-22670:
---

This is because you already created and configured a context, and then try to 
configure it again. Don't create the context first.
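A hedged sketch of that suggestion (the metastore host is a placeholder): build the Hive-enabled SparkSession first and derive the JavaSparkContext from its underlying SparkContext, rather than creating a SparkContext up front and trying to reconfigure it afterwards.

{code}
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("hive.metastore.uris", "thrift://<metastore-host>:9083") // placeholder
  .config("spark.sql.warehouse.dir", "/apps/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

// Wrap the session's context instead of constructing a separate one first.
val jsc = new JavaSparkContext(spark.sparkContext)
{code}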

> Not able to create table in HIve with SparkSession when JavaSparkContext is 
> already initialized.
> 
>
> Key: SPARK-22670
> URL: https://issues.apache.org/jira/browse/SPARK-22670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Naresh Meena
>Priority: Blocker
> Fix For: 2.1.1
>
>
> Not able to create a table in Hive with SparkSession when a SparkContext is 
> already initialized.
> Below are the code snippet and error logs.
> JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
> SparkSession hiveCtx = SparkSession
>   .builder()
>   
> .config(HiveConf.ConfVars.METASTOREURIS.toString(),
>   "..:9083")
>   .config("spark.sql.warehouse.dir", 
> "/apps/hive/warehouse")
>   .enableHiveSupport().getOrCreate();
> 2017-11-29 13:11:33 Driver [ERROR] SparkBatchSubmitter - Failed to start the 
> driver for Batch_JDBC_PipelineTest
> org.apache.spark.sql.AnalysisException: 
> Hive support is required to insert into the following tables:
> `default`.`testhivedata`
>;;
> 'InsertIntoTable 'SimpleCatalogRelation default, CatalogTable(
>   Table: `default`.`testhivedata`
>   Created: Wed Nov 29 13:11:33 IST 2017
>   Last Access: Thu Jan 01 05:29:59 IST 1970
>   Type: MANAGED
>   Schema: [StructField(empID,LongType,true), 
> StructField(empDate,DateType,true), StructField(empName,StringType,true), 
> StructField(empSalary,DoubleType,true), 
> StructField(empLocation,StringType,true), 
> StructField(empConditions,BooleanType,true), 
> StructField(empCity,StringType,true), 
> StructField(empSystemIP,StringType,true)]
>   Provider: hive
>   Storage(Location: 
> file:/hadoop/yarn/local/usercache/sax/appcache/application_1511627000183_0190/container_e34_1511627000183_0190_01_01/spark-warehouse/testhivedata,
>  InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), 
> OverwriteOptions(false,Map()), false
> +- LogicalRDD [empID#49L, empDate#50, empName#51, empSalary#52, 
> empLocation#53, empConditions#54, empCity#55, empSystemIP#56]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:405)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:76)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:76)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
>   at 
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:263)
>   at 
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:243)
>   at 
> com.streamanalytix.spark.processor.HiveEmitter.persistRDDToHive(HiveEmitter.java:690)
>

[jira] [Assigned] (SPARK-22634) Update Bouncy castle dependency

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22634:


Assignee: (was: Apache Spark)

> Update Bouncy castle dependency
> ---
>
> Key: SPARK-22634
> URL: https://issues.apache.org/jira/browse/SPARK-22634
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Lior Regev
>Priority: Minor
>
> Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka 
> streaming, uses Bouncy Castle version 1.51.
> This is an outdated version; the latest one is 1.58.
> This, in turn, renders packages such as 
> [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds]
>  unusable, since these require 1.58 and Spark's distributions come with 1.51.
> My own attempt was to run on EMR, and since I automatically get all of 
> Spark's dependencies (Bouncy Castle 1.51 being one of them) on the 
> classpath, using the library to parse blockchain data failed due to missing 
> functionality.
> I have also opened an 
> [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency]
>  with jets3t to update their dependency as well, but along with that Spark 
> would have to update its own or at least be packaged with a newer version.
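
A possible user-side mitigation, sketched here under the assumption of an sbt-built application (it is not something proposed in this ticket), is to pin the newer Bouncy Castle provider explicitly:

{code}
// build.sbt fragment (sketch): pin the Bouncy Castle provider to 1.58 so it wins
// over the 1.51 version pulled in transitively via Spark/jets3t.
dependencyOverrides += "org.bouncycastle" % "bcprov-jdk15on" % "1.58"
{code}

Whether such an override actually wins at runtime depends on how the application classpath is composed relative to Spark's own jars (e.g. settings like spark.driver.userClassPathFirst), so it is a starting point rather than a substitute for updating Spark's dependency.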



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22634) Update Bouncy castle dependency

2017-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274349#comment-16274349
 ] 

Apache Spark commented on SPARK-22634:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/19859

> Update Bouncy castle dependency
> ---
>
> Key: SPARK-22634
> URL: https://issues.apache.org/jira/browse/SPARK-22634
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Lior Regev
>Priority: Minor
>
> Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka 
> streaming, uses Bouncy Castle version 1.51.
> This is an outdated version; the latest one is 1.58.
> This, in turn, renders packages such as 
> [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds]
>  unusable, since these require 1.58 and Spark's distributions come with 1.51.
> My own attempt was to run on EMR, and since I automatically get all of 
> Spark's dependencies (Bouncy Castle 1.51 being one of them) on the 
> classpath, using the library to parse blockchain data failed due to missing 
> functionality.
> I have also opened an 
> [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency]
>  with jets3t to update their dependency as well, but along with that Spark 
> would have to update its own or at least be packaged with a newer version.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22634) Update Bouncy castle dependency

2017-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22634:


Assignee: Apache Spark

> Update Bouncy castle dependency
> ---
>
> Key: SPARK-22634
> URL: https://issues.apache.org/jira/browse/SPARK-22634
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Lior Regev
>Assignee: Apache Spark
>Priority: Minor
>
> Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka 
> streaming, uses Bouncy Castle version 1.51.
> This is an outdated version; the latest one is 1.58.
> This, in turn, renders packages such as 
> [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds]
>  unusable, since these require 1.58 and Spark's distributions come with 1.51.
> My own attempt was to run on EMR, and since I automatically get all of 
> Spark's dependencies (Bouncy Castle 1.51 being one of them) on the 
> classpath, using the library to parse blockchain data failed due to missing 
> functionality.
> I have also opened an 
> [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency]
>  with jets3t to update their dependency as well, but along with that Spark 
> would have to update its own or at least be packaged with a newer version.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22670) Not able to create table in Hive with SparkSession when JavaSparkContext is already initialized.

2017-12-01 Thread Naresh Meena (JIRA)
Naresh Meena created SPARK-22670:


 Summary: Not able to create table in Hive with SparkSession when 
JavaSparkContext is already initialized.
 Key: SPARK-22670
 URL: https://issues.apache.org/jira/browse/SPARK-22670
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Naresh Meena
Priority: Blocker
 Fix For: 2.1.1


Not able to create table in Hive with SparkSession when SparkContext is already 
initialized.

Below is the code snippet and error logs.

JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
SparkSession hiveCtx = SparkSession
        .builder()
        .config(HiveConf.ConfVars.METASTOREURIS.toString(), "..:9083")
        .config("spark.sql.warehouse.dir", "/apps/hive/warehouse")
        .enableHiveSupport()
        .getOrCreate();




2017-11-29 13:11:33 Driver [ERROR] SparkBatchSubmitter - Failed to start the 
driver for Batch_JDBC_PipelineTest
org.apache.spark.sql.AnalysisException: 
Hive support is required to insert into the following tables:
`default`.`testhivedata`
   ;;
'InsertIntoTable 'SimpleCatalogRelation default, CatalogTable(
Table: `default`.`testhivedata`
Created: Wed Nov 29 13:11:33 IST 2017
Last Access: Thu Jan 01 05:29:59 IST 1970
Type: MANAGED
Schema: [StructField(empID,LongType,true), 
StructField(empDate,DateType,true), StructField(empName,StringType,true), 
StructField(empSalary,DoubleType,true), 
StructField(empLocation,StringType,true), 
StructField(empConditions,BooleanType,true), 
StructField(empCity,StringType,true), StructField(empSystemIP,StringType,true)]
Provider: hive
Storage(Location: 
file:/hadoop/yarn/local/usercache/sax/appcache/application_1511627000183_0190/container_e34_1511627000183_0190_01_01/spark-warehouse/testhivedata,
 InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), 
OverwriteOptions(false,Map()), false
+- LogicalRDD [empID#49L, empDate#50, empName#51, empSalary#52, empLocation#53, 
empConditions#54, empCity#55, empSystemIP#56]

at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:405)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:76)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:76)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
at 
org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:73)
at 
org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
at 
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
at 
org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at 
org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:263)
at 
org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:243)
at 
com.streamanalytix.spark.processor.HiveEmitter.persistRDDToHive(HiveEmitter.java:690)
at 
com.streamanalytix.spark.processor.HiveEmitter.executeWithRDD(HiveEmitter.java:395)
at 
com.streamanalytix.spark.core.AbstractProcessor.processRDDMap(AbstractProcessor.java:227)
at 
com.streamanalytix.spark.core.pipeline.SparkBatchSubmitter.definePipelineFlow(SparkBatchSubmitter.java:353)
at 
com.streamanalytix.spark.core.pipeline.SparkBatchSubmitter.getContext(SparkBatchSubmitter.java:302)
at 

[jira] [Commented] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-12-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274310#comment-16274310
 ] 

Steve Loughran commented on SPARK-22657:


No, it's more that we need to change how that service search works; changing the 
rescan could be part of it.

Today: 
* loads the entire FS class and calls a getScheme() method to get the scheme. 
* Works for FileSystem, but not FileContext, which is always registered by a 
property. 
* if a scheme for a FileSystem service isn't found, falls back to fs.<scheme>.impl
* because the implementation class is loaded, it transitively loads all 
dependencies of the class, getting into trouble if they aren't on the CP (was: 
fail, then: warn, now: silent)
* and if you have shaded dependencies, takes a very long time to load, even 
when a scheme isn't used.

That scan is expensive, and we mustn't redo it automatically.

I'd prefer
# a minimal service class which only declares: (scheme, fs impl class, file 
context impl class, version, homepage).
# a way to reset the cache, or at least repeat the scan.

Anyway, for now, use s3a, which is self-registered, and don't worry about this.
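
For reference, the configuration fallback mentioned above looks roughly like this (a sketch, not part of the ticket; s3a normally registers itself, so this only illustrates the fs.<scheme>.impl mechanism, and the implementation class still has to be on the classpath):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Registering a filesystem implementation explicitly via fs.<scheme>.impl bypasses
// the ServiceLoader scan described above.
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
{code}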


> Hadoop fs implementation classes are not loaded if they are part of the app 
> jar or other jar when --packages flag is used 
> --
>
> Key: SPARK-22657
> URL: https://issues.apache.org/jira/browse/SPARK-22657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>
> To reproduce this issue run:
> ./bin/spark-submit --master mesos://leader.mesos:5050 \
> --packages com.github.scopt:scopt_2.11:3.5.0 \
> --conf spark.cores.max=8 \
> --conf 
> spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
>  \
> --conf spark.mesos.executor.docker.forcePullImage=true \
> --class S3Job 
> http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
>  \
> --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
> s3n://arand-sandbox-mesosphere/linecount.out
> within a container created with 
> mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image
> You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
> for scheme: s3n"
> This can be reproduced with local[*] as well; there is no need to use Mesos, this 
> is not a Mesos bug.
> The specific spark job used above can be found here: 
> https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
>   
> Can be built with sbt assembly in that dir.
> Using this code: 
> https://gist.github.com/skonto/4f5ff1e5ede864f90b323cc20bf1e1cb at the 
> beginning of the main method...
> you get the following output: 
> https://gist.github.com/skonto/d22b8431586b6663ddd720e179030da4
> (Use 
> http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
>  to get the modified job)
> The job works fine if --packages is not used.
> The commit that introduced this issue is (before that, things work as 
> expected):
> 5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add 
> glob support for resources adding to Spark (5 months ago) 
> Thu, 6 Jul 2017 15:32:49 +0800
> The exception comes from here: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311
> https://github.com/apache/spark/pull/18235/files, check line 950, this is 
> where a filesystem is first created.
> The FileSystem class is initialized there, before the main method of the Spark job 
> is launched... the reason is that the --packages logic uses Hadoop libraries to 
> download files.
> Maven resolution happens before the app jar and the resolved jars are added 
> to the classpath. So at that moment there is no s3n entry to add to the static map 
> when the FileSystem static members are first initialized and filled by 
> the first FileSystem instance created (SERVICE_FILE_SYSTEMS).
> Later, in the Spark job's main method, where we try to access the s3n filesystem (create 
> a second filesystem), we get the exception (at this point the app jar has the 
> s3n implementation in it and it's on the class path, but that scheme is not 
> loaded in the static map of the FileSystem class)... 
> hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the 
> problem is with the static map, which is filled once and only once.
> That's why we see two prints of the map contents in the output (gist) above 
> when --packages is used. The first print is before creating the s3n 
> filesystem. We use reflection there to get the static map's entries. When 
> --packages is not used 

[jira] [Resolved] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2017-12-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22393.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19846
[https://github.com/apache/spark/pull/19846]

> spark-shell can't find imported types in class constructors, extends clause
> ---
>
> Key: SPARK-22393
> URL: https://issues.apache.org/jira/browse/SPARK-22393
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
> Fix For: 2.3.0
>
>
> {code}
> $ spark-shell
> …
> scala> import org.apache.spark.Partition
> import org.apache.spark.Partition
> scala> class P(p: Partition)
> :11: error: not found: type Partition
>class P(p: Partition)
>   ^
> scala> class P(val index: Int) extends Partition
> :11: error: not found: type Partition
>class P(val index: Int) extends Partition
>^
> {code}
> Any class that I {{import}} gives "not found: type ___" when used as a 
> parameter to a class, or in an extends clause; this applies to classes I 
> import from JARs I provide via {{--jars}} as well as core Spark classes as 
> above.
> This worked in 1.6.3 but has been broken since 2.0.0.
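
One workaround that is sometimes suggested for the affected versions (it is not taken from this ticket, so treat it as an unverified sketch) is to bypass the REPL import and spell out the fully qualified type name:

{code}
scala> class P(p: org.apache.spark.Partition)

scala> class Q(val index: Int) extends org.apache.spark.Partition
{code}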



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2017-12-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22393:
-

Assignee: Mark Petruska

> spark-shell can't find imported types in class constructors, extends clause
> ---
>
> Key: SPARK-22393
> URL: https://issues.apache.org/jira/browse/SPARK-22393
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Ryan Williams
>Assignee: Mark Petruska
>Priority: Minor
> Fix For: 2.3.0
>
>
> {code}
> $ spark-shell
> …
> scala> import org.apache.spark.Partition
> import org.apache.spark.Partition
> scala> class P(p: Partition)
> :11: error: not found: type Partition
>class P(p: Partition)
>   ^
> scala> class P(val index: Int) extends Partition
> :11: error: not found: type Partition
>class P(val index: Int) extends Partition
>^
> {code}
> Any class that I {{import}} gives "not found: type ___" when used as a 
> parameter to a class, or in an extends clause; this applies to classes I 
> import from JARs I provide via {{--jars}} as well as core Spark classes as 
> above.
> This worked in 1.6.3 but has been broken since 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22664) The logs about "Connected to Zookeeper" in ReliableKafkaReceiver.scala are in the wrong position

2017-12-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22664.
---
Resolution: Not A Problem

> The logs about "Connected to Zookeeper" in ReliableKafkaReceiver.scala are 
> in the wrong position
> -
>
> Key: SPARK-22664
> URL: https://issues.apache.org/jira/browse/SPARK-22664
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: liuzhaokun
>Priority: Trivial
>
> The logs about "Connecting to Zookeeper" and "Connected to Zookeeper" in 
> ReliableKafkaReceiver.scala should be printed where the zkClient is created with 
> new zkClient() at line 122.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22668) CodegenContext.splitExpressions() creates incorrect results with global variable arguments

2017-12-01 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22668:
-
Summary: CodegenContext.splitExpressions() creates incorrect results with 
global variable arguments   (was: CodegenContext.splitExpression() creates 
incorrect results with global variable arguments )

> CodegenContext.splitExpressions() creates incorrect results with global 
> variable arguments 
> ---
>
> Key: SPARK-22668
> URL: https://issues.apache.org/jira/browse/SPARK-22668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> {{CodegenContext.splitExpression()}} creates incorrect results with arguments 
> that were declared as global variables.
> {code}
> class Test {
>   int global1;
>   void splittedFunction(int global1) {
> ...
> global1 = 2;
>   }
>   void apply() {
> global1 = 1;
> ...
> splittedFunction(global1);
> // global1 should be 2
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22668) CodegenContext.splitExpressions() creates incorrect results with global variable arguments

2017-12-01 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22668:
-
Description: 
{{CodegenContext.splitExpressions()}} creates incorrect results with arguments 
that were declared as global variables.

{code}
class Test {
  int global1;

  void splittedFunction(int global1) {
...
global1 = 2;
  }

  void apply() {
global1 = 1;
...
splittedFunction(global1);
// global1 should be 2
  }
}
{code}

  was:
{{CodegenContext.splitExpression()}} creates incorrect results with arguments 
that were declared as global variables.

{code}
class Test {
  int global1;

  void splittedFunction(int global1) {
...
global1 = 2;
  }

  void apply() {
global1 = 1;
...
splittedFunction(global1);
// global1 should be 2
  }
}
{code}


> CodegenContext.splitExpressions() creates incorrect results with global 
> variable arguments 
> ---
>
> Key: SPARK-22668
> URL: https://issues.apache.org/jira/browse/SPARK-22668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> {{CodegenContext.splitExpressions()}} creates incorrect results with 
> arguments that were declared as global variables.
> {code}
> class Test {
>   int global1;
>   void splittedFunction(int global1) {
> ...
> global1 = 2;
>   }
>   void apply() {
> global1 = 1;
> ...
> splittedFunction(global1);
> // global1 should be 2
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22635) FileNotFoundException again while reading ORC files containing special characters

2017-12-01 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22635:
-
Fix Version/s: 2.2.2

> FileNotFoundException again while reading ORC files containing special 
> characters
> -
>
> Key: SPARK-22635
> URL: https://issues.apache.org/jira/browse/SPARK-22635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
> Fix For: 2.2.2, 2.3.0
>
>
> SPARK-22146 fixed the issue only for {{inferSchema}}, i.e. only for 
> schema inference, but it doesn't fix the problem when actually reading the 
> data. Thus nearly the same exception happens when someone tries to use the 
> data.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 98, host-172-22-127-77.example.com, executor 3): 
> java.io.FileNotFoundException: File does not exist: 
> hdfs://XXX/tmp/aaa/start=2017-11-27%2009%253A30%253A00/part-0-c1477c9f-9d48-4341-89de-81056b6b618e.c000.snappy.orc
> It is possible the underlying files have been updated. You can explicitly 
> invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
> SQL or by recreating the Dataset/DataFrame involved.
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
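
A compact way to see the difference (a hedged sketch that reuses the directory layout from SPARK-22146 rather than the reporter's exact setup): schema inference on such paths now succeeds, but materialising the rows still fails as above.

{code}
// Assuming ORC files already exist under directory names containing '%', e.g.
//   /tmp/orc_test/folder %3Aa/orc1.orc   (the SPARK-22146 layout)
val df = spark.read.format("orc").load("/tmp/orc_test/*/*")
df.printSchema()   // schema inference succeeds since SPARK-22146
df.show()          // reading the rows still hits the FileNotFoundException above
{code}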



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22146) FileNotFoundException while reading ORC files containing '%'

2017-12-01 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22146:
-
Fix Version/s: 2.2.2

> FileNotFoundException while reading ORC files containing '%'
> 
>
> Key: SPARK-22146
> URL: https://issues.apache.org/jira/browse/SPARK-22146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
> Fix For: 2.2.2, 2.3.0
>
>
> Reading ORC files containing "strange" characters like '%' fails with a 
> FileNotFoundException.
> For instance, if you have:
> {noformat}
> /tmp/orc_test/folder %3Aa/orc1.orc
> /tmp/orc_test/folder %3Ab/orc2.orc
> {noformat}
> and you try to read the ORC files with:
> {noformat}
> spark.read.format("orc").load("/tmp/orc_test/*/*").show
> {noformat}
> you will get a:
> {noformat}
> java.io.FileNotFoundException: File 
> file:/tmp/orc_test/folder%20%253Aa/orc1.orc does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.listLeafStatuses(SparkHadoopUtil.scala:194)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.listOrcFiles(OrcFileOperator.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:67)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
>   ... 48 elided
> {noformat}
> Note that the same code works for Parquet and text files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22669) Avoid unnecessary function calls in code generation

2017-12-01 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22669:
---

 Summary: Avoid unnecessary function calls in code generation
 Key: SPARK-22669
 URL: https://issues.apache.org/jira/browse/SPARK-22669
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


In many parts of the code generation codebase, we are splitting the code to 
avoid exceptions due to the 64KB method size limit. This generates a lot of 
methods which are called every time, even though sometimes this is not needed. 
As pointed out here: 
https://github.com/apache/spark/pull/19752#discussion_r153081547, this is a 
non-negligible overhead which can be avoided.

In this JIRA, I propose to use the same approach throughout all the other 
cases, when possible. I am going to submit a PR soon.
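
A minimal, non-Spark sketch of the idea (all names below are invented for illustration; Spark's generated code is far more involved): a split-out helper should only be invoked when its result is actually needed.

{code}
object SplitCallSketch {
  // Stand-in for a method produced by expression splitting (hypothetical).
  def splitHelper(value: Int): Int = value * 2

  // Current pattern: the helper runs for every row, even when the guard
  // makes its result unnecessary.
  def evalAlways(isNull: Boolean, value: Int): Int = {
    val computed = splitHelper(value)
    if (isNull) 0 else computed
  }

  // Proposed direction: guard the call, so its overhead is paid only when
  // the value is actually used.
  def evalGuarded(isNull: Boolean, value: Int): Int =
    if (isNull) 0 else splitHelper(value)
}
{code}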



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22489) Shouldn't change broadcast join buildSide if user clearly specified

2017-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274135#comment-16274135
 ] 

Apache Spark commented on SPARK-22489:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/19858

> Shouldn't change broadcast join buildSide if user clearly specified
> ---
>
> Key: SPARK-22489
> URL: https://issues.apache.org/jira/browse/SPARK-22489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>  Labels: release-notes
> Fix For: 2.3.0
>
>
> How to reproduce:
> {code:java}
> import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
> spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value").createTempView("table1")
> spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value").createTempView("table2")
> val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON 
> t1.key = t2.key").queryExecution.executedPlan
> println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide)
> {code}
> The result is {{BuildRight}}, but should be {{BuildLeft}}.
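
To make the expectation explicit, a sketch of the check on the repro above, assuming Spark 2.x where {{BuildLeft}} and {{BuildRight}} live in {{org.apache.spark.sql.execution.joins}}:

{code}
import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, BuildLeft}

// bl is the executedPlan from the repro above; with the MAPJOIN(t1) hint,
// t1 is expected to be the broadcast (build) side.
assert(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide == BuildLeft)
{code}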



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22668) CodegenContext.splitExpression() creates incorrect results with global variable arguments

2017-12-01 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274128#comment-16274128
 ] 

Kazuaki Ishizaki commented on SPARK-22668:
--

I am working on this.

> CodegenContext.splitExpression() creates incorrect results with global 
> variable arguments 
> --
>
> Key: SPARK-22668
> URL: https://issues.apache.org/jira/browse/SPARK-22668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> {{CodegenContext.splitExpression()}} creates incorrect results with arguments 
> that were declared as global variables.
> {code}
> class Test {
>   int global1;
>   void splittedFunction(int global1) {
> ...
> global1 = 2;
>   }
>   void apply() {
> global1 = 1;
> ...
> splittedFunction(global1);
> // global1 should be 2
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22668) CodegenContext.splitExpression() creates incorrect results with global variable arguments

2017-12-01 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-22668:


 Summary: CodegenContext.splitExpression() creates incorrect 
results with global variable arguments 
 Key: SPARK-22668
 URL: https://issues.apache.org/jira/browse/SPARK-22668
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


{{CodegenContext.splitExpression()}} creates incorrect results with arguments 
that were declared as global variables.

{code}
class Test {
  int global1;

  void splittedFunction(int global1) {
...
global1 = 2;
  }

  void apply() {
global1 = 1;
...
splittedFunction(global1);
// global1 should be 2
  }
}
{code}
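
One way to picture the intended behaviour (a sketch in Scala terms; the real code is Java generated by Spark, and this is not the actual fix): the split-out method must not shadow the class-level state with a same-named parameter, otherwise the caller never observes the update.

{code}
class TestFixed {
  var global1: Int = 0

  // Reference the field directly instead of taking a parameter named global1,
  // so the assignment is visible to the caller.
  def splittedFunction(): Unit = {
    global1 = 2
  }

  def apply(): Unit = {
    global1 = 1
    splittedFunction()
    // global1 == 2 here, which is what the caller expects
  }
}
{code}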



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org