[jira] [Updated] (SPARK-48148) JSON objects should not be modified when read as STRING

2024-05-06 Thread Eric Maynard (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Maynard updated SPARK-48148:
-
Description: 
Currently, when reading a JSON like this:


bq. {"a": {"b": -999.995}}


With the schema:

bq. a STRING


Spark will yield a result like this:


bq. {"b": -1000.0}


This is due to how we convert a non-string value to a string in JacksonParser: instead of preserving the original text of the nested object, the parsed value is re-serialized, which can alter numeric literals (here, -999.995 becomes -1000.0).

  was:
Currently, when reading a JSON like this:

```
{"a": {"b": -999.995}}
```

With the schema:

```
a STRING
```

Spark will yield a result like this:

```
{"b": -1000.0}
```

This is due to how we convert a non-string value to a string in JacksonParser


> JSON objects should not be modified when read as STRING
> ---
>
> Key: SPARK-48148
> URL: https://issues.apache.org/jira/browse/SPARK-48148
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Eric Maynard
>Priority: Major
>
> Currently, when reading a JSON like this:
> bq. {"a": {"b": -999.995}}
> With the schema:
> bq. a STRING
> Spark will yield a result like this:
> bq. {"b": -1000.0}
> This is due to how we convert a non-string value to a string in JacksonParser
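
For reference, a minimal repro sketch of the behavior described above (Scala, assuming a spark-shell session where {{spark}} and its implicits are available; the dataset is just an inline placeholder, not taken from the ticket):

{code:scala}
// Read a nested JSON object as STRING and observe that the object's text
// is not preserved verbatim.
import spark.implicits._

val input = Seq("""{"a": {"b": -999.995}}""").toDS()
val df = spark.read.schema("a STRING").json(input)
df.show(false)
// Expected: {"b": -999.995}   Observed (per the ticket): {"b": -1000.0}
{code}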






[jira] [Created] (SPARK-41995) schema_of_json only accepts foldable expressions

2023-01-11 Thread Eric Maynard (Jira)
Eric Maynard created SPARK-41995:


 Summary: schema_of_json only accepts foldable expressions
 Key: SPARK-41995
 URL: https://issues.apache.org/jira/browse/SPARK-41995
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.1
Reporter: Eric Maynard


Right now, schema_of_json only accepts foldable expressions (i.e. literals), but it 
could be extended to accept arbitrary expressions.
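
For illustration, a rough sketch of the current behavior versus the proposed extension (spark-shell Scala; the table and column names below are made up):

{code:scala}
// Works today: the argument is a foldable expression (a literal).
spark.sql("""SELECT schema_of_json('{"a": 1}')""").show(false)

// Fails analysis today because the argument is a non-foldable column reference;
// the proposal is to allow expressions like this as well.
// spark.sql("SELECT schema_of_json(json_col) FROM json_table")
{code}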






[jira] [Updated] (SPARK-24936) Better error message when trying a shuffle fetch over 2 GB

2019-06-18 Thread Eric Maynard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Maynard updated SPARK-24936:
-
Description: 
*strong text*After SPARK-24297, Spark will try to fetch shuffle blocks to disk 
if they're over 2 GB.  However, this will fail with an external shuffle service 
running Spark < 2.2, with an unhelpful error message like:

{noformat}
18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 
(TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1
, xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: 
java.lang.UnsupportedOperationException
at 
org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
at 
org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
...
{noformat}

We can't do anything to make the shuffle succeed in this situation, but we 
should fail with a better error message.

  was:
After SPARK-24297, Spark will try to fetch shuffle blocks to disk if they're over 
2 GB.  However, this will fail with an external shuffle service running Spark < 
2.2, with an unhelpful error message like:

{noformat}
18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 
(TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1
, xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: 
java.lang.UnsupportedOperationException
at 
org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
at 
org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
...
{noformat}

We can't do anything to make the shuffle succeed in this situation, but we 
should fail with a better error message.


> Better error message when trying a shuffle fetch over 2 GB
> --
>
> Key: SPARK-24936
> URL: https://issues.apache.org/jira/browse/SPARK-24936
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> *strong text*After SPARK-24297, Spark will try to fetch shuffle blocks to 
> disk if they're over 2 GB.  However, this will fail with an external shuffle 
> service running Spark < 2.2, with an unhelpful error message like:
> {noformat}
> 18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 
> (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1
> , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
> org.apache.spark.shuffle.FetchFailedException: 
> java.lang.UnsupportedOperationException
> at 
> org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
> ...
> {noformat}
> We can't do anything to make the shuffle succeed in this situation, but we 
> should fail with a better error message.
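
As an illustration only, the kind of change the ticket asks for could look roughly like the sketch below. This is not the actual Spark code; the helper name and message wording are invented:

{code:scala}
// Hypothetical helper: rethrow the opaque failure from an old external shuffle
// service with an actionable message instead of a bare UnsupportedOperationException.
def openStreamOrExplain[T](openStream: () => T): T =
  try {
    openStream()
  } catch {
    case e: UnsupportedOperationException =>
      throw new UnsupportedOperationException(
        "Fetching shuffle blocks over 2 GB to disk requires an external shuffle " +
          "service running Spark 2.2 or later; please upgrade the shuffle service.", e)
  }
{code}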






[jira] [Updated] (SPARK-24936) Better error message when trying a shuffle fetch over 2 GB

2019-06-18 Thread Eric Maynard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Maynard updated SPARK-24936:
-
Description: 
After SPARK-24297, Spark will try to fetch shuffle blocks to disk if they're over 
2 GB.  However, this will fail with an external shuffle service running Spark < 
2.2, with an unhelpful error message like:

{noformat}
18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 
(TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1
, xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: 
java.lang.UnsupportedOperationException
at 
org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
at 
org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
...
{noformat}

We can't do anything to make the shuffle succeed in this situation, but we 
should fail with a better error message.

  was:
*strong text*After SPARK-24297, Spark will try to fetch shuffle blocks to disk 
if they're over 2 GB.  However, this will fail with an external shuffle service 
running Spark < 2.2, with an unhelpful error message like:

{noformat}
18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 
(TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1
, xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: 
java.lang.UnsupportedOperationException
at 
org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
at 
org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
...
{noformat}

We can't do anything to make the shuffle succeed in this situation, but we 
should fail with a better error message.


> Better error message when trying a shuffle fetch over 2 GB
> --
>
> Key: SPARK-24936
> URL: https://issues.apache.org/jira/browse/SPARK-24936
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> After SPARK-24297, Spark will try to fetch shuffle blocks to disk if they're 
> over 2 GB.  However, this will fail with an external shuffle service running 
> Spark < 2.2, with an unhelpful error message like:
> {noformat}
> 18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 
> (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1
> , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
> org.apache.spark.shuffle.FetchFailedException: 
> java.lang.UnsupportedOperationException
> at 
> org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
> ...
> {noformat}
> We can't do anything to make the shuffle succeed in this situation, but we 
> should fail with a better error message.






[jira] [Updated] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table

2019-04-10 Thread Eric Maynard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Maynard updated SPARK-27421:
-
Affects Version/s: 2.4.1

> RuntimeException when querying a view on a partitioned parquet table
> 
>
> Key: SPARK-27421
> URL: https://issues.apache.org/jira/browse/SPARK-27421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1
> Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit 
> Server VM, Java 1.8.0_141)
>Reporter: Eric Maynard
>Priority: Minor
>
> When running a simple query, I get the following stacktrace:
> {code}
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
>  at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
>  at scala.collection.immutable.List.foreach(List.scala:392)
>  at 
> 

[jira] [Commented] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table

2019-04-10 Thread Eric Maynard (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814872#comment-16814872
 ] 

Eric Maynard commented on SPARK-27421:
--

[~shivuson...@gmail.com] Any hiccup confirming the issue with the 3 lines in 
the Jira? I am able to replicate this pretty reliably on 2.4.0
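
For readers without the ticket at hand: the exact three-line repro is not included in this excerpt, but a repro of the shape being discussed typically looks like the sketch below. The table, view, and partition names are made up; only the config workaround is taken from the error message itself.

{code:scala}
// Assumed-shape repro: a view over a partitioned parquet table, queried with a partition filter.
spark.sql("CREATE TABLE part_tbl (id INT) PARTITIONED BY (dt STRING) STORED AS PARQUET")
spark.sql("INSERT INTO part_tbl PARTITION (dt = '2019-04-09') VALUES (1)")
spark.sql("CREATE VIEW part_view AS SELECT * FROM part_tbl")
spark.sql("SELECT * FROM part_view WHERE dt = '2019-04-09'").show()

// The error message itself names the workaround (at the cost of degraded performance):
//   --conf spark.sql.hive.manageFilesourcePartitions=false
{code}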

> RuntimeException when querying a view on a partitioned parquet table
> 
>
> Key: SPARK-27421
> URL: https://issues.apache.org/jira/browse/SPARK-27421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit 
> Server VM, Java 1.8.0_141)
>Reporter: Eric Maynard
>Priority: Minor
>
> When running a simple query, I get the following stacktrace:
> {code}
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
>  at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
>  at 
> 

[jira] [Updated] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table

2019-04-09 Thread Eric Maynard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Maynard updated SPARK-27421:
-
Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server 
VM, Java 1.8.0_141)

> RuntimeException when querying a view on a partitioned parquet table
> 
>
> Key: SPARK-27421
> URL: https://issues.apache.org/jira/browse/SPARK-27421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit 
> Server VM, Java 1.8.0_141)
>Reporter: Eric Maynard
>Priority: Minor
>
> When running a simple query, I get the following stacktrace:
> {code}
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
>  at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
>  at scala.collection.immutable.List.foreach(List.scala:392)
>  at 
> 

[jira] [Created] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table

2019-04-09 Thread Eric Maynard (JIRA)
Eric Maynard created SPARK-27421:


 Summary: RuntimeException when querying a view on a partitioned 
parquet table
 Key: SPARK-27421
 URL: https://issues.apache.org/jira/browse/SPARK-27421
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Eric Maynard


When running a simple query, I get the following stacktrace:


{code}
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
 at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261)
 at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
 at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
 at 
org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
 at scala.collection.immutable.List.foreach(List.scala:392)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
 at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
 at 
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
 at 

[jira] [Commented] (SPARK-24469) Support collations in Spark SQL

2018-06-06 Thread Eric Maynard (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503687#comment-16503687
 ] 

Eric Maynard commented on SPARK-24469:
--

Ah, I see, I was wrongly thinking of the second case where you use e.g. MIN to 
get some legitimate input value. But I can see how *min* would yield bad 
performance. 
Maybe try *first* instead?
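
A quick sketch of that suggestion as SQL ({{t}} and {{text}} are placeholder names):

{code:scala}
// The suggestion from the comment above: return some actual input value
// per case-insensitive group via first() rather than min().
spark.sql("SELECT first(text) AS text FROM t GROUP BY upper(text)").show(false)
{code}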

> Support collations in Spark SQL
> ---
>
> Key: SPARK-24469
> URL: https://issues.apache.org/jira/browse/SPARK-24469
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alexander Shkapsky
>Priority: Major
>
> One of our use cases is to support case-insensitive comparison in operations, 
> including aggregation and text comparison filters.  Another use case is to 
> sort via a collator.  Support for collations throughout the query processor 
> appears to be the proper way to support these needs.
> Language-based workarounds (for the aggregation case) are insufficient:
>  # SELECT UPPER(text)...GROUP BY UPPER(text)
> introduces invalid values into the output set
>  # SELECT MIN(text)...GROUP BY UPPER(text)
> results in poor performance in our case, in part due to use of a sort-based 
> aggregate
> Examples of collation support in RDBMS:
>  * [PostgreSQL|https://www.postgresql.org/docs/10/static/collation.html]
>  * [MySQL|https://dev.mysql.com/doc/refman/8.0/en/charset.html]
>  * 
> [Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/linguistic-sorting-and-matching.html]
>  * [SQL 
> Server|https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017]
>  * 
> [DB2|https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.nls.doc/com.ibm.db2.luw.admin.nls.doc-gentopic2.html]
>  






[jira] [Commented] (SPARK-24469) Support collations in Spark SQL

2018-06-06 Thread Eric Maynard (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503586#comment-16503586
 ] 

Eric Maynard commented on SPARK-24469:
--

bq. SELECT UPPER(text)...GROUP BY UPPER(text)
bq. introduces invalid values into the output set

Can you elaborate on this?

> Support collations in Spark SQL
> ---
>
> Key: SPARK-24469
> URL: https://issues.apache.org/jira/browse/SPARK-24469
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alexander Shkapsky
>Priority: Major
>
> One of our use cases is to support case-insensitive comparison in operations, 
> including aggregation and text comparison filters.  Another use case is to 
> sort via a collator.  Support for collations throughout the query processor 
> appears to be the proper way to support these needs.
> Language-based workarounds (for the aggregation case) are insufficient:
>  # SELECT UPPER(text)...GROUP BY UPPER(text)
> introduces invalid values into the output set
>  # SELECT MIN(text)...GROUP BY UPPER(text)
> results in poor performance in our case, in part due to use of a sort-based 
> aggregate
> Examples of collation support in RDBMS:
>  * [PostgreSQL|https://www.postgresql.org/docs/10/static/collation.html]
>  * [MySQL|https://dev.mysql.com/doc/refman/8.0/en/charset.html]
>  * 
> [Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/linguistic-sorting-and-matching.html]
>  * [SQL 
> Server|https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017]
>  * 
> [DB2|https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.nls.doc/com.ibm.db2.luw.admin.nls.doc-gentopic2.html]
>  






[jira] [Commented] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object

2018-04-26 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454365#comment-16454365
 ] 

Eric Maynard commented on SPARK-23830:
--

[~jerryshao] I agree we should not support using a {{class}}. However, I also 
believe that it's bad practice to throw a non-descriptive NPE. I created a PR 
here which adds more useful logging in the event that a proper main method 
isn't found and prevents throwing an NPE:
https://github.com/apache/spark/pull/21168  
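
For illustration only (this is not the linked PR), a guard of that general shape might look like the sketch below. It reuses the names from the ApplicationMaster snippet quoted in the issue ({{userClassLoader}}, {{args.userClass}}, {{userArgs}}), so it is not self-contained:

{code:scala}
// Hypothetical sketch: fail with a descriptive error when the user application is a
// class whose main method is not static, instead of letting invoke(null, ...) throw an NPE.
import java.lang.reflect.Modifier

val mainMethod = userClassLoader.loadClass(args.userClass)
  .getMethod("main", classOf[Array[String]])
if (!Modifier.isStatic(mainMethod.getModifiers)) {
  throw new IllegalStateException(
    s"The main method of ${args.userClass} is not static. " +
      "Define your application as a Scala object (or a class with a static main).")
}
mainMethod.invoke(null, userArgs.toArray)
{code}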

> Spark on YARN in cluster deploy mode fail with NullPointerException when a 
> Spark application is a Scala class not object
> 
>
> Key: SPARK-23830
> URL: https://issues.apache.org/jira/browse/SPARK-23830
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> As reported on StackOverflow in [Why does Spark on YARN fail with “Exception 
> in thread ”Driver“ 
> java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344]
>  the following Spark application fails with {{Exception in thread "Driver" 
> java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode:
> {code}
> class MyClass {
>   def main(args: Array[String]): Unit = {
> val c = new MyClass()
> c.process()
>   }
>   def process(): Unit = {
> val sparkConf = new SparkConf().setAppName("my-test")
> val sparkSession: SparkSession = 
> SparkSession.builder().config(sparkConf).getOrCreate()
> import sparkSession.implicits._
> 
>   }
>   ...
> }
> {code}
> The exception is as follows:
> {code}
> 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a 
> separate Thread
> 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context 
> initialization...
> Exception in thread "Driver" java.lang.NullPointerException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
> {code}
> I think the reason for the exception {{Exception in thread "Driver" 
> java.lang.NullPointerException}} is due to [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]:
> {code}
> val mainMethod = userClassLoader.loadClass(args.userClass)
>   .getMethod("main", classOf[Array[String]])
> {code}
> So when {{mainMethod}} is used in [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706]
>  it simply gives NPE.
> {code}
> mainMethod.invoke(null, userArgs.toArray)
> {code}
> That could be easily avoided with an extra check that the {{mainMethod}} is 
> initialized, giving the user a message about what the reason may have been.






[jira] [Commented] (SPARK-23852) Parquet MR bug can lead to incorrect SQL results

2018-04-24 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450018#comment-16450018
 ] 

Eric Maynard commented on SPARK-23852:
--

>There is no upstream release of Parquet that contains the fix for PARQUET-1217, 
although a 1.10 release is planned.

PARQUET-1217 seems to have been merged into Parquet 1.8.3 today.

> Parquet MR bug can lead to incorrect SQL results
> 
>
> Key: SPARK-23852
> URL: https://issues.apache.org/jira/browse/SPARK-23852
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Henry Robinson
>Priority: Blocker
>  Labels: correctness
>
> Parquet MR 1.9.0 and 1.8.2 both have a bug, PARQUET-1217, that means that 
> pushing certain predicates to Parquet scanners can return fewer results than 
> they should.
> The bug triggers in Spark when:
>  * The Parquet file being scanner has stats for the null count, but not the 
> max or min on the column with the predicate (Apache Impala writes files like 
> this).
>  * The vectorized Parquet reader path is not taken, and the parquet-mr reader 
> is used.
>  * A suitable <, <=, > or >= predicate is pushed down to Parquet.
> The bug is that the parquet-mr interprets the max and min of a row-group's 
> column as 0 in the absence of stats. So {{col > 0}} will filter all results, 
> even if some are > 0.
> There is no upstream release of Parquet that contains the fix for 
> PARQUET-1217, although a 1.10 release is planned.
> The least impactful workaround is to set the Parquet configuration 
> {{parquet.filter.stats.enabled}} to {{false}}.
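
For concreteness, one common way to apply that workaround; the config key comes from the issue text, while the way of setting it is a sketch:

{code:scala}
// Disable Parquet's stats-based row-group filtering for an existing session.
spark.sparkContext.hadoopConfiguration.set("parquet.filter.stats.enabled", "false")

// Or at submit time, via the spark.hadoop.* prefix that forwards settings into
// the Hadoop configuration used by the Parquet reader:
//   --conf spark.hadoop.parquet.filter.stats.enabled=false
{code}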






[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2018-04-24 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449948#comment-16449948
 ] 

Eric Maynard commented on SPARK-23519:
--

Why does the fact that you dynamically generate the statement mean that you can't 
alias the columns in your select statement? You can generate aliases as well. 
This seems like a non-issue.
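
For example, a generated statement can alias the duplicated columns before creating the view (a sketch using the table and view names from the issue below; {{c1}}/{{c2}} are arbitrary generated alias names):

{code:scala}
// Aliasing the select columns avoids the duplicate-column-name assertion,
// even when the statement is generated dynamically.
spark.sql(
  "CREATE VIEW default.aview (int1, int2) AS " +
    "SELECT col1 AS c1, col1 AS c2 FROM atable")
{code}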

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Critical
>
> 1. Create and populate a Hive table. I did this in a Hive CLI session [not that this matters]:
> create table atable (col1 int);
> insert into atable values (10), (100);
> 2. Create a view from the table. [These actions were performed from a spark shell.]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)






[jira] [Commented] (SPARK-22541) Dataframes: applying multiple filters one after another using udfs and accumulators results in faulty accumulators

2017-11-17 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257004#comment-16257004
 ] 

Eric Maynard commented on SPARK-22541:
--

Yeah, this is a common problem when you have side effects in your 
transformations. If you need to enforce a specific order on your 
transformations or otherwise split them up (rather than letting Spark combine 
them), you can try putting a barrier -- like a repartition or an explicit 
action -- between the transformation functions.

> Dataframes: applying multiple filters one after another using udfs and 
> accumulators results in faulty accumulators
> --
>
> Key: SPARK-22541
> URL: https://issues.apache.org/jira/browse/SPARK-22541
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: pyspark 2.2.0, ubuntu
>Reporter: Janne K. Olesen
>
> I'm using udf filters and accumulators to keep track of filtered rows in 
> dataframes.
> If I'm applying multiple filters one after the other, they seem to be 
> executed in parallel, not in sequence, which messes with the accumulators i'm 
> using to keep track of filtered data. 
> {code:title=example.py|borderStyle=solid}
> from pyspark.sql.functions import udf, col
> from pyspark.sql.types import BooleanType
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> sc = spark.sparkContext
> df = spark.createDataFrame([("a", 1, 1), ("b", 2, 2), ("c", 3, 3)], ["key", 
> "val1", "val2"])
> def __myfilter(val, acc):
> if val < 2:
> return True
> else:
> acc.add(1)
> return False
> acc1 = sc.accumulator(0)
> acc2 = sc.accumulator(0)
> def myfilter1(val):
> return __myfilter(val, acc1)
> def myfilter2(val):
> return __myfilter(val, acc2)
> my_udf1 = udf(myfilter1, BooleanType())
> my_udf2 = udf(myfilter2, BooleanType())
> df.show()
> # +---+++
> # |key|val1|val2|
> # +---+++
> # |  a|   1|   1|
> # |  b|   2|   2|
> # |  c|   3|   3|
> # +---+++
> df = df.filter(my_udf1(col("val1")))
> # df.show()
> # +---+++
> # |key|val1|val2|
> # +---+++
> # |  a|   1|   1|
> # +---+++
> # expected acc1: 2
> # expected acc2: 0
> df = df.filter(my_udf2(col("val2")))
> # df.show()
> # +---+++
> # |key|val1|val2|
> # +---+++
> # |  a|   1|   1|
> # +---+++
> # expected acc1: 2
> # expected acc2: 0
> df.show()
> print("acc1: %s" % acc1.value)  # expected 2, is 2 OK
> print("acc2: %s" % acc2.value)  # expected 0, is 2 !!!
> {code}






[jira] [Commented] (SPARK-22436) New function strip() to remove all whitespace from string

2017-11-03 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238712#comment-16238712
 ] 

Eric Maynard commented on SPARK-22436:
--

[~asmaier] Wouldn't the right way to implement this be to use the existing 
function `regexp_replace`?
I'm not sure why this needs to be a built-in function.
Furthermore, if this were implemented, IMO it should exist in all APIs, not just 
PySpark.
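
A sketch of the regexp_replace approach (DataFrame API in Scala; {{df}} and the {{text}} column are placeholders):

{code:scala}
import org.apache.spark.sql.functions.{col, regexp_replace}

// strip(): remove all leading and trailing whitespace characters (not just spaces).
val stripped = df.withColumn("text_stripped",
  regexp_replace(col("text"), "^\\s+|\\s+$", ""))
{code}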

> New function strip() to remove all whitespace from string
> -
>
> Key: SPARK-22436
> URL: https://issues.apache.org/jira/browse/SPARK-22436
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>Priority: Minor
>  Labels: features
>
> Since ticket SPARK-17299 the [trim() 
> function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim]
> will not remove any whitespace characters from the beginning and end of a string 
> but only spaces. This is correct with regard to the SQL standard, but it opens 
> a gap in functionality.
> My suggestion is to add to the Spark API, in analogy to Python's standard 
> library, the functions l/r/strip(), which should remove all whitespace 
> characters from the beginning and/or end of a string, respectively.






[jira] [Comment Edited] (SPARK-19713) saveAsTable

2017-03-16 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928778#comment-15928778
 ] 

Eric Maynard edited comment on SPARK-19713 at 3/16/17 8:08 PM:
---

Not really relevant here, but to address:

>1. Hive will not be able to create the table as the folder already exists
You absolutely can construct a Hive external table on top of an existing folder.

>2. Hive cannot drop the table because the spark has not updated HiveMetaStore
The canonical solution to this is to run  `MSCK REPAIR TABLE myTable;` in Hive. 


was (Author: emaynard1121):
Not really relevant here, but to address:
>2. Hive cannot drop the table because the spark has not updated HiveMetaStore
The canonical solution to this is to run  `MSCK REPAIR TABLE myTable;` in Hive. 

> saveAsTable
> ---
>
> Key: SPARK-19713
> URL: https://issues.apache.org/jira/browse/SPARK-19713
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Balaram R Gadiraju
>
> Hi,
> I just observed that when we use dataframe.saveAsTable("table") -- in old 
> versions -- or dataframe.write.saveAsTable("table") -- in the newer versions -- 
> for example the method “df3.saveAsTable("brokentable")” in Scala code, this 
> creates a folder in HDFS but doesn’t tell the Hive metastore that it plans to 
> create the table. So if anything goes wrong in between, the folder still exists 
> and Hive is not aware of the folder creation. This will block users from 
> creating the table “brokentable” as the folder already exists; we can remove 
> the folder using “hadoop fs -rmr /data/hive/databases/testdb.db/brokentable”. 
> So below is a workaround which will enable you to continue the development work.
> Current Code:
> val df3 = sqlContext.sql("select * from testtable")
> df3.saveAsTable("brokentable")
> THE WORKAROUND:
> Registering the DataFrame as a table and then using a SQL command to load the 
> data will resolve the issue. EX:
> val df3 = sqlContext.sql("select * from testtable").registerTempTable("df3")
> sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")






[jira] [Commented] (SPARK-19713) saveAsTable

2017-03-16 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928778#comment-15928778
 ] 

Eric Maynard commented on SPARK-19713:
--

Not really relevant here, but to address:
>2. Hive cannot drop the table because the spark has not updated HiveMetaStore
The canonical solution to this is to run  `MSCK REPAIR TABLE myTable;` in Hive. 

> saveAsTable
> ---
>
> Key: SPARK-19713
> URL: https://issues.apache.org/jira/browse/SPARK-19713
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Balaram R Gadiraju
>
> Hi,
> I just observed that when we use dataframe.saveAsTable("table") -- in old 
> versions -- or dataframe.write.saveAsTable("table") -- in the newer versions -- 
> for example the method “df3.saveAsTable("brokentable")” in Scala code, this 
> creates a folder in HDFS but doesn’t tell the Hive metastore that it plans to 
> create the table. So if anything goes wrong in between, the folder still exists 
> and Hive is not aware of the folder creation. This will block users from 
> creating the table “brokentable” as the folder already exists; we can remove 
> the folder using “hadoop fs -rmr /data/hive/databases/testdb.db/brokentable”. 
> So below is a workaround which will enable you to continue the development work.
> Current Code:
> val df3 = sqlContext.sql("select * from testtable")
> df3.saveAsTable("brokentable")
> THE WORKAROUND:
> Registering the DataFrame as a table and then using a SQL command to load the 
> data will resolve the issue. EX:
> val df3 = sqlContext.sql("select * from testtable").registerTempTable("df3")
> sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")






[jira] [Comment Edited] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896194#comment-15896194
 ] 

Eric Maynard edited comment on SPARK-19656 at 3/5/17 11:51 AM:
---

Here is a complete working example in Java:

{code:title=AvroTest.java|borderStyle=solid}
public class AvroTest {

public static void main(String[] args){

//build spark session:
System.setProperty("hadoop.home.dir", "C:\\Hadoop");//windows hack
SparkSession spark = 
SparkSession.builder().master("local").appName("Avro Test")
.config("spark.sql.warehouse.dir", 
"file:///c:/tmp/spark-warehouse")//another windows hack
.getOrCreate();

//create data:
ArrayList<CustomClass> list = new ArrayList<>();
CustomClass cc = new CustomClass();
cc.setA(5);
cc.setB(6);
list.add(cc);
spark.createDataFrame(list, 
CustomClass.class).write().mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("C:\\tmp\\file.avro");

//read data:
Row row = 
(spark.read().format("com.databricks.spark.avro").load("C:\\tmp\\file.avro").head());
System.out.println(row);
System.out.println(row.get(0));
System.out.println(row.get(1));
System.out.println("Success =\t" + ((Integer)row.get(0) == 5));
}
}
{code}

With a simple custom class:
{code:title=CustomClass.java|borderStyle=solid}
import java.io.Serializable;

public class CustomClass implements Serializable {
private int a;
public void setA(int value){this.a = value;}
public int getA(){return this.a;}

private int b;
public void setB(int value) {this.b = value;}
public int getB(){return this.b;}
}
{code}  
  
Everything looks OK to me, and after running, stdout looks like this:
{code}
[5,6]
5
6
Success =   true
{code}
  
In the future, please make sure the issue isn't in your own application before 
opening a JIRA. Also, as an aside, I really recommend picking up some Scala, as 
IMO the Scala API is much friendlier, especially around the edges for things 
like the Avro library.


was (Author: emaynard):
Here is a complete working example in Java:

{code:title=AvroTest.java|borderStyle=solid}
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.ArrayList;

public class AvroTest {

public static void main(String[] args){

//build spark session:
System.setProperty("hadoop.home.dir", "C:\\Hadoop");//windows hack
SparkSession spark = 
SparkSession.builder().master("local").appName("Avro Test")
.config("spark.sql.warehouse.dir", 
"file:///c:/tmp/spark-warehouse")//another windows hack
.getOrCreate();

//create data:
ArrayList<CustomClass> list = new ArrayList<>();
CustomClass cc = new CustomClass();
cc.setValue(5);
list.add(cc);
spark.createDataFrame(list, 
CustomClass.class).write().format("com.databricks.spark.avro").save("C:\\tmp\\file.avro");

//read data:
Row row = 
(spark.read().format("com.databricks.spark.avro").load("C:\\tmp\\file.avro").head());
System.out.println("Success =\t" + ((Integer)row.get(0) == 5));
}
}



{code}

With a simple custom class:
{code:title=CustomClass.java|borderStyle=solid}
import java.io.Serializable;

public class CustomClass implements Serializable {
public int value;
public void setValue(int value){this.value = value;}
public int getValue(){return this.value;}
}
{code}  
  
Everything looks OK to me, and the main function prints "Success = true". In 
the future, please make sure the issue isn't in your own application before 
opening a JIRA. Also, as an aside, I really recommend picking up some Scala, as 
IMO the Scala API is much friendlier, especially around the edges for things 
like the Avro library.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase

[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896194#comment-15896194
 ] 

Eric Maynard commented on SPARK-19656:
--

Here is a complete working example in Java:

{code:title=AvroTest.java|borderStyle=solid}
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.ArrayList;

public class AvroTest {

public static void main(String[] args){

//build spark session:
System.setProperty("hadoop.home.dir", "C:\\Hadoop");//windows hack
SparkSession spark = 
SparkSession.builder().master("local").appName("Avro Test")
.config("spark.sql.warehouse.dir", 
"file:///c:/tmp/spark-warehouse")//another windows hack
.getOrCreate();

//create data:
ArrayList<CustomClass> list = new ArrayList<>();
CustomClass cc = new CustomClass();
cc.setValue(5);
list.add(cc);
spark.createDataFrame(list, 
CustomClass.class).write().format("com.databricks.spark.avro").save("C:\\tmp\\file.avro");

//read data:
Row row = 
(spark.read().format("com.databricks.spark.avro").load("C:\\tmp\\file.avro").head());
System.out.println("Success =\t" + ((Integer)row.get(0) == 5));
}
}



{code}

With a simple custom class:
{code:title=CustomClass.java|borderStyle=solid}
import java.io.Serializable;

public class CustomClass implements Serializable {
public int value;
public void setValue(int value){this.value = value;}
public int getValue(){return this.value;}
}
{code}  
  
Everything looks OK to me, and the main function prints "Success = true". In 
the future, please make sure the issue isn't in your own application before 
opening a JIRA. Also, as an aside, I really recommend picking up some Scala, as 
IMO the Scala API is much friendlier, especially around the edges for things 
like the Avro library.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass>{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat<MyCustomAvroKey, NullWritable>{
> @Override
> public RecordReader<MyCustomAvroKey, NullWritable> 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-04 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895998#comment-15895998
 ] 

Eric Maynard edited comment on SPARK-19656 at 3/5/17 12:58 AM:
---

Normally after getting the `datum` you should call `asInstanceOf` to cast it 
properly.
  
In any event in Spark 2.0 the easier way to achieve what you want is probably 
something like this:

{code:java}
import com.databricks.spark.avro._
val df = spark.read.avro("file.avro")
val extracted = df.map(row => (row(0).asInstanceOf[MyCustomClass]))
{code}


was (Author: emaynard):
Normally after getting the `datum` you should call `asInstanceOf` to cast it 
properly.
  
In any event in Spark 2.0 the easier way to achieve what you want is probably 
something like this:

{code:scala}
import com.databricks.spark.avro._
val df = spark.read.avro("file.avro")
val extracted = df.map(row => (row(0).asInstanceOf[MyCustomClass]))
{code}

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass>{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat<MyCustomAvroKey, NullWritable>{
> @Override
> public RecordReader<MyCustomAvroKey, NullWritable> 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-04 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895998#comment-15895998
 ] 

Eric Maynard commented on SPARK-19656:
--

Normally after getting the `datum` you should call `asInstanceOf` to cast it 
properly.
  
In any event in Spark 2.0 the easier way to achieve what you want is probably 
something like this:

{code:scala}
import com.databricks.spark.avro._
val df = spark.read.avro("file.avro")
val extracted = df.map(row => (row(0).asInstanceOf[MyCustomClass]))
{code}

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass>{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat<MyCustomAvroKey, NullWritable>{
> @Override
> public RecordReader<MyCustomAvroKey, NullWritable> 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19713) saveAsTable

2017-03-04 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895991#comment-15895991
 ] 

Eric Maynard commented on SPARK-19713:
--

In general, instead of using `DataFrameWriter.saveAsTable`, I find it's better 
to create the table in advance and then insert data into it with 
`DataFrameWriter.insertInto`. If you do choose to use 
`DataFrameWriter.saveAsTable`, there is a chance of the folder being created 
without the Hive table being updated, but as a developer you can handle those 
failures with `HiveContext.refreshTable` or by using `FileSystem.delete`. I 
don't think this is an issue.
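
A minimal sketch of that pattern, for illustration only (Spark 1.6-era API in a Hive-enabled shell; `hiveContext`, `df`, the table name, and the schema below are hypothetical, not taken from the reporter's job):

{code:scala}
import org.apache.spark.sql.SaveMode

// Create the table up front so the metastore and the HDFS folder stay in sync.
hiveContext.sql("CREATE TABLE IF NOT EXISTS testdb.mytable (id INT, name STRING) STORED AS PARQUET")

// Append the DataFrame's rows into the existing table instead of calling saveAsTable.
df.write.mode(SaveMode.Append).insertInto("testdb.mytable")

// If an earlier saveAsTable failed partway and left a stale folder behind, refresh
// the metastore view before retrying (and remove the orphaned folder by hand).
hiveContext.refreshTable("testdb.mytable")
{code}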

> saveAsTable
> ---
>
> Key: SPARK-19713
> URL: https://issues.apache.org/jira/browse/SPARK-19713
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Balaram R Gadiraju
>
> Hi,
> I just observed an issue when we use dataframe.saveAsTable("table") -- in 
> old versions -- and dataframe.write.saveAsTable("table") -- in the newer 
> versions. When using the method "df3.saveAsTable("brokentable")" in 
> Scala code, this creates a folder in HDFS but doesn't tell the Hive metastore 
> that it plans to create the table. So if anything goes wrong in between, the 
> folder still exists and Hive is not aware of the folder creation. This will 
> block users from creating the table "brokentable" since the folder already 
> exists; we can remove the folder using "hadoop fs -rmr 
> /data/hive/databases/testdb.db/brokentable". Below is a workaround 
> which will enable you to continue development work.
> Current Code:
> val df3 = sqlContext.sql("select * fromtesttable")
> df3.saveAsTable("brokentable")
> THE WORKAROUND:
> Registering the DataFrame as a temporary table and then using a SQL command 
> to load the data will resolve the issue. Example:
> val df3 = sqlContext.sql("select * from testtable").registerTempTable("df3")
> sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18005) optional binary Dataframe Column throws (UTF8) is not a group while loading a Dataframe

2017-02-09 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859913#comment-15859913
 ] 

Eric Maynard commented on SPARK-18005:
--

This bug appears in the 1.6.x branch as well.

> optional binary Dataframe Column throws (UTF8) is not a group while loading a 
> Dataframe
> ---
>
> Key: SPARK-18005
> URL: https://issues.apache.org/jira/browse/SPARK-18005
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: ABHISHEK CHOUDHARY
>
> In some scenarios, while loading a Parquet file, Spark throws an exception 
> such as:
> java.lang.ClassCastException: optional binary CertificateChains (UTF8) is not 
> a group
> The entire DataFrame is not corrupted, as I managed to load the first 20 rows 
> of the data, but trying to load the next row throws the error, and any 
> operation over the entire dataset, such as count, throws the same exception.
> Full exception stack:
> {quote}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 594.0 failed 4 times, most recent failure: Lost task 2.3 in stage 594.0 
> (TID 6726, ): java.lang.ClassCastException: optional binary CertificateChains 
> (UTF8) is not a group
>   at org.apache.parquet.schema.Type.asGroupType(Type.java:202)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:122)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:269)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroupFields(ParquetReadSupport.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetSchema(ParquetReadSupport.scala:111)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.init(ParquetReadSupport.scala:67)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:168)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:192)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryRelation.scala:151)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
>   at 
>