[jira] [Created] (SPARK-45675) Specify number of partitions when creating spark dataframe from pandas dataframe

2023-10-26 Thread Jelmer Kuperus (Jira)
Jelmer Kuperus created SPARK-45675:
--

 Summary: Specify number of partitions when creating spark 
dataframe from pandas dataframe
 Key: SPARK-45675
 URL: https://issues.apache.org/jira/browse/SPARK-45675
 Project: Spark
  Issue Type: Improvement
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Jelmer Kuperus


When converting a large pandas DataFrame to a Spark DataFrame like so:


 
{code:python}
import pandas as pd

pdf = pd.DataFrame([{"board_id": "3074457346698037360_0", "file_name": "board-content", "value": "A" * 119251} for i in range(0, 2)])

spark.createDataFrame(pdf).write.mode("overwrite").format("delta").saveAsTable("catalog.schema.table")
{code}
 
 

You can encounter the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Serialized 
task 11:1 was 366405365 bytes, which exceeds max allowed: 
spark.rpc.message.maxSize (268435456 bytes). Consider increasing 
spark.rpc.message.maxSize or using broadcast variables for large values.

 

As far as I can tell, Spark first converts the pandas DataFrame into a Python list and then constructs an RDD out of that list. This means the parallelism is determined by the value of spark.sparkContext.defaultParallelism, and if the pandas DataFrame is very large while the number of available cores is low, you end up with very large tasks that exceed the limits imposed on task size.
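
For illustration, here is a rough driver-side sketch of that behaviour (this is an assumption about the non-Arrow conversion path, not the actual internal code; pdf and spark are the objects from the snippet above):

{code:python}
# Rough sketch (assumed non-Arrow path): the whole pandas DataFrame is materialised
# as local Python objects and split across only defaultParallelism slices, so with
# few cores each serialized task carries a huge chunk of the data.
records = pdf.to_dict("records")                # every row becomes a driver-side dict
rdd = spark.sparkContext.parallelize(records)   # numSlices defaults to defaultParallelism
print(rdd.getNumPartitions())                   # few cores => few, very large partitions
{code}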

 

Methods like spark.sparkContext.parallelize allow you to pass in the number of partitions of the resulting dataset. I think having a similar capability when creating a DataFrame from a pandas DataFrame makes a lot of sense. Right now the only workaround I can think of is changing the value of spark.default.parallelism, but that is a system-wide setting. A rough per-job alternative is sketched below.
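
A rough per-job sketch (it assumes going through an RDD explicitly; the partition count of 64 is an arbitrary example, not a recommendation) is to parallelize the rows with an explicit numSlices and build the DataFrame from that RDD:

{code:python}
from pyspark.sql import Row

# Workaround sketch: choose the number of partitions explicitly instead of relying
# on spark.default.parallelism; 64 is an arbitrary example value.
rows = [Row(**r) for r in pdf.to_dict("records")]
rdd = spark.sparkContext.parallelize(rows, numSlices=64)
sdf = spark.createDataFrame(rdd)
sdf.write.mode("overwrite").format("delta").saveAsTable("catalog.schema.table")
{code}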






[jira] [Updated] (SPARK-42622) StackOverflowError reading json that does not conform to schema

2023-02-28 Thread Jelmer Kuperus (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jelmer Kuperus updated SPARK-42622:
---
Description: 
Databricks Runtime 12.1 uses a pre-release version of Spark 3.4.x. We encountered the following problem:

 

!https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png!

  
was:!https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png!


> StackOverflowError reading json that does not conform to schema
> ---
>
> Key: SPARK-42622
> URL: https://issues.apache.org/jira/browse/SPARK-42622
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.4.0
>Reporter: Jelmer Kuperus
>Priority: Critical
>
> Databricks Runtime 12.1 uses a pre-release version of Spark 3.4.x. We 
> encountered the following problem:
>  
> !https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png!






[jira] [Updated] (SPARK-42622) StackOverflowError reading json that does not conform to schema

2023-02-28 Thread Jelmer Kuperus (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jelmer Kuperus updated SPARK-42622:
---
Flags: Patch,Important

Patch available here: https://github.com/apache/spark/pull/40219

> StackOverflowError reading json that does not conform to schema
> ---
>
> Key: SPARK-42622
> URL: https://issues.apache.org/jira/browse/SPARK-42622
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.4.0
>Reporter: Jelmer Kuperus
>Priority: Critical
>
> !https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png!






[jira] [Created] (SPARK-42622) StackOverflowError reading json that does not conform to schema

2023-02-28 Thread Jelmer Kuperus (Jira)
Jelmer Kuperus created SPARK-42622:
--

 Summary: StackOverflowError reading json that does not conform to 
schema
 Key: SPARK-42622
 URL: https://issues.apache.org/jira/browse/SPARK-42622
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 3.4.0
Reporter: Jelmer Kuperus


!https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png!






[jira] [Comment Edited] (SPARK-27710) ClassNotFoundException: $line196400984558.$read$ in OuterScopes

2020-02-12 Thread Jelmer Kuperus (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034532#comment-17034532
 ] 

Jelmer Kuperus edited comment on SPARK-27710 at 2/12/20 8:11 AM:
-

This also happens in Apache Toree

 
{code:java}

val mySpark = spark 
import mySpark.implicits._

case class AttributeRow(categoryId: String, key: String, count: Long, label: String)

mySpark.read.parquet("/user/jkuperus/foo").as[AttributeRow]
  .limit(1)
  .map(r => r)
  .show()
{code}
 

Gives

 
{noformat}
StackTrace:
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485){noformat}
 


was (Author: jelmer):
This also happens in Apache Toree

 
{code:java}
case class AttributeRow(categoryId: String, key: String, count: Long, label: String)

val mySpark = spark
import mySpark.implicits._
spark.read.parquet("/user/jkuperus/foo").as[AttributeRow]
  .limit(1)
  .map(r => r)
  .show()
{code}
 

Gives

 
{noformat}
StackTrace:
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485){noformat}
 

> ClassNotFoundException: $line196400984558.$read$ in OuterScopes
> ---
>
> Key: SPARK-27710
> URL: https://issues.apache.org/jira/browse/SPARK-27710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> My colleague hit the following exception when using Spark in a Zeppelin 
> notebook:
> {code:java}
> java.lang.ClassNotFoundException: $line196400984558.$read$
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
>   at 
> org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> 

[jira] [Commented] (SPARK-27710) ClassNotFoundException: $line196400984558.$read$ in OuterScopes

2020-02-12 Thread Jelmer Kuperus (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035135#comment-17035135
 ] 

Jelmer Kuperus commented on SPARK-27710:


Turns out it was https://issues.apache.org/jira/browse/SPARK-25587 for me; defining the case class on its own line works as an ugly workaround. Polynote does not suffer from this because it does not use the Scala shell underneath.

 

> ClassNotFoundException: $line196400984558.$read$ in OuterScopes
> ---
>
> Key: SPARK-27710
> URL: https://issues.apache.org/jira/browse/SPARK-27710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> My colleague hit the following exception when using Spark in a Zeppelin 
> notebook:
> {code:java}
> java.lang.ClassNotFoundException: $line196400984558.$read$
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
>   at 
> org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$class.prepareArguments(objects.scala:98)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.prepareArguments(objects.scala:431)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:483)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.execution.DeserializeToObjectExec.doConsume(objects.scala:84)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.constructDoConsumeFunction(WholeStageCodegenExec.scala:211)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:182)
>   at 
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:357)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:386)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:90)
>   at 
> 

[jira] [Comment Edited] (SPARK-27710) ClassNotFoundException: $line196400984558.$read$ in OuterScopes

2020-02-11 Thread Jelmer Kuperus (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034532#comment-17034532
 ] 

Jelmer Kuperus edited comment on SPARK-27710 at 2/11/20 3:07 PM:
-

This also happens in Apache Toree

 
{code:java}
case class AttributeRow(categoryId: String, key: String, count: Long, label: String)

val mySpark = spark
import mySpark.implicits._
spark.read.parquet("/user/jkuperus/foo").as[AttributeRow]
  .limit(1)
  .map(r => r)
  .show()
{code}
 

Gives

 
{noformat}
StackTrace:
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485){noformat}
 


was (Author: jelmer):
This also happens in Apache Toree

 
{code:java}
val mySpark = spark
import mySpark.implicits._
spark.read.parquet("/user/jkuperus/foo").as[AttributeRow]
  .limit(1)
  .map(r => r)
  .show()
{code}
 

Gives

 
{noformat}
StackTrace:
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485){noformat}
 

> ClassNotFoundException: $line196400984558.$read$ in OuterScopes
> ---
>
> Key: SPARK-27710
> URL: https://issues.apache.org/jira/browse/SPARK-27710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> My colleague hit the following exception when using Spark in a Zeppelin 
> notebook:
> {code:java}
> java.lang.ClassNotFoundException: $line196400984558.$read$
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
>   at 
> org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   

[jira] [Commented] (SPARK-27710) ClassNotFoundException: $line196400984558.$read$ in OuterScopes

2020-02-11 Thread Jelmer Kuperus (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034532#comment-17034532
 ] 

Jelmer Kuperus commented on SPARK-27710:


This also happens in Apache Toree

 
{code:java}
val mySpark = spark
import mySpark.implicits._
spark.read.parquet("/user/jkuperus/foo").as[AttributeRow]
  .limit(1)
  .map(r => r)
  .show()
{code}
 

Gives

 
{noformat}
StackTrace:
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485){noformat}
 

> ClassNotFoundException: $line196400984558.$read$ in OuterScopes
> ---
>
> Key: SPARK-27710
> URL: https://issues.apache.org/jira/browse/SPARK-27710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> My colleague hit the following exception when using Spark in a Zeppelin 
> notebook:
> {code:java}
> java.lang.ClassNotFoundException: $line196400984558.$read$
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
>   at 
> org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$class.prepareArguments(objects.scala:98)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.prepareArguments(objects.scala:431)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:483)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.execution.DeserializeToObjectExec.doConsume(objects.scala:84)
>   at 
> 

[jira] [Comment Edited] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2018-06-19 Thread Jelmer Kuperus (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516856#comment-16516856
 ] 

Jelmer Kuperus edited comment on SPARK-5158 at 6/19/18 9:33 AM:


I ended up with the following workaround which at first glance seems to work

1. create a _.java.login.config_ file in the home directory of the user running Spark, with the following contents:
{noformat}
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache="true"
  ticketCache="/tmp/krb5cc_0"
  keyTab="/path/to/my.keytab"
  principal="u...@foo.com";
};{noformat}
2. put a krb5.conf file in /etc/krb5.conf

3. place your hadoop configuration in /etc/hadoop/conf and in `core-site.xml` set:
 * fs.defaultFS to webhdfs://your_hostname:14000/webhdfs/v1
 * hadoop.security.authentication to kerberos
 * hadoop.security.authorization to true

4. make sure the hadoop config is on the classpath of Spark, e.g. the process should have something like this in it:
{noformat}
-cp /etc/spark/:/usr/share/spark/jars/*:/etc/hadoop/conf/{noformat}
 

This configures a single principal for the entire Spark process. If you want to change the default paths to the configuration files you can use:
{noformat}
-Djava.security.krb5.conf=/etc/krb5.conf 
-Djava.security.auth.login.config=/path/to/jaas.conf{noformat}
 

 

 


was (Author: jelmer):
I ended up with the following workaround which at first glance seems to work

1. create a `.java.login.config` file in the home directory of the user running Spark, with the following contents:
{noformat}
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache="true"
  ticketCache="/tmp/krb5cc_0"
  keyTab="/path/to/my.keytab"
  principal="u...@foo.com";
};{noformat}
2. put a krb5.conf file in /etc/krb5.conf

3. place your hadoop configuration in /etc/hadoop/conf and in `core-site.xml` set:
 * fs.defaultFS to webhdfs://your_hostname:14000/webhdfs/v1
 * hadoop.security.authentication to kerberos
 * hadoop.security.authorization to true

4. make sure the hadoop config is on the classpath of Spark, e.g. the process should have something like this in it:
{noformat}
-cp /etc/spark/:/usr/share/spark/jars/*:/etc/hadoop/conf/{noformat}
 

This configures a single principal for the entire Spark process. If you want to change the default paths to the configuration files you can use:
{noformat}
-Djava.security.krb5.conf=/etc/krb5.conf 
-Djava.security.auth.login.config=/path/to/jaas.conf{noformat}
 

 

 

> Allow for keytab-based HDFS security in Standalone mode
> ---
>
> Key: SPARK-5158
> URL: https://issues.apache.org/jira/browse/SPARK-5158
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Matthew Cheah
>Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS 
> clusters in standalone mode. The main reason we haven't accepted these 
> patches have been that they rely on insecure distribution of token files from 
> the driver to the other components.
> As a simpler solution, I wonder if we should just provide a way to have the 
> Spark driver and executors independently log in and acquire credentials using 
> a keytab. This would work for users who have a dedicated, single-tenant, 
> Spark clusters (i.e. they are willing to have a keytab on every machine 
> running Spark for their application). It wouldn't address all possible 
> deployment scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark streaming jobs, which often run on dedicated 
> hardware since they are long-running services.






[jira] [Comment Edited] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2018-06-19 Thread Jelmer Kuperus (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516856#comment-16516856
 ] 

Jelmer Kuperus edited comment on SPARK-5158 at 6/19/18 9:32 AM:


I ended up with the following workaround which at first glance seems to work

1. create a `.java.login.config` file in the home directory of the user running Spark, with the following contents:
{noformat}
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache="true"
  ticketCache="/tmp/krb5cc_0"
  keyTab="/path/to/my.keytab"
  principal="u...@foo.com";
};{noformat}
2. put a krb5.conf file in /etc/krb5.conf

3. place your hadoop configuration in /etc/hadoop/conf and in `core-site.xml` set:
 * fs.defaultFS to webhdfs://your_hostname:14000/webhdfs/v1
 * hadoop.security.authentication to kerberos
 * hadoop.security.authorization to true

4. make sure the hadoop config is on the classpath of Spark, e.g. the process should have something like this in it:
{noformat}
-cp /etc/spark/:/usr/share/spark/jars/*:/etc/hadoop/conf/{noformat}
 

This configures a single principal for the entire Spark process. If you want to change the default paths to the configuration files you can use:
{noformat}
-Djava.security.krb5.conf=/etc/krb5.conf 
-Djava.security.auth.login.config=/path/to/jaas.conf{noformat}
 

 

 


was (Author: jelmer):
I ended up with the following workaround which at first glance seems to work

1. create a `.java.login.config` file in the home directory of the user running Spark, with the following contents:


{noformat}
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache="true"
  ticketCache="/tmp/krb5cc_0"
  keyTab="/path/to/my.keytab"
  principal="u...@foo.com";
};{noformat}

2. put a krb5.conf file in /etc/krb5.conf

3. place your hadoop configuration in /etc/hadoop/conf and in `core-site.xml` set:
 * fs.defaultFS to webhdfs://your_hostname:14000/webhdfs/v1
 * hadoop.security.authentication to kerberos
 * hadoop.security.authorization to true

4. make sure the hadoop config is on the classpath of Spark, e.g. the process should have something like this in it:
{noformat}
-cp /etc/spark/:/usr/share/spark/jars/*:/etc/hadoop/conf/{noformat}
 

 

> Allow for keytab-based HDFS security in Standalone mode
> ---
>
> Key: SPARK-5158
> URL: https://issues.apache.org/jira/browse/SPARK-5158
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Matthew Cheah
>Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS 
> clusters in standalone mode. The main reason we haven't accepted these 
> patches have been that they rely on insecure distribution of token files from 
> the driver to the other components.
> As a simpler solution, I wonder if we should just provide a way to have the 
> Spark driver and executors independently log in and acquire credentials using 
> a keytab. This would work for users who have a dedicated, single-tenant, 
> Spark clusters (i.e. they are willing to have a keytab on every machine 
> running Spark for their application). It wouldn't address all possible 
> deployment scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark streaming jobs, which often run on dedicated 
> hardware since they are long-running services.






[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2018-06-19 Thread Jelmer Kuperus (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516856#comment-16516856
 ] 

Jelmer Kuperus commented on SPARK-5158:
---

I ended up with the following workaround which at first glance seems to work

1. create a `.java.login.config` file in the home directory of the user running Spark, with the following contents:


{noformat}
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache="true"
  ticketCache="/tmp/krb5cc_0"
  keyTab="/path/to/my.keytab"
  principal="u...@foo.com";
};{noformat}

2. put a krb5.conf file in /etc/krb5.conf

3. place your hadoop configuration in /etc/hadoop/conf and in `core-site.xml` set:
 * fs.defaultFS to webhdfs://your_hostname:14000/webhdfs/v1
 * hadoop.security.authentication to kerberos
 * hadoop.security.authorization to true

4. make sure the hadoop config is on the classpath of Spark, e.g. the process should have something like this in it:
{noformat}
-cp /etc/spark/:/usr/share/spark/jars/*:/etc/hadoop/conf/{noformat}
 

 

> Allow for keytab-based HDFS security in Standalone mode
> ---
>
> Key: SPARK-5158
> URL: https://issues.apache.org/jira/browse/SPARK-5158
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Matthew Cheah
>Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS 
> clusters in standalone mode. The main reason we haven't accepted these 
> patches have been that they rely on insecure distribution of token files from 
> the driver to the other components.
> As a simpler solution, I wonder if we should just provide a way to have the 
> Spark driver and executors independently log in and acquire credentials using 
> a keytab. This would work for users who have a dedicated, single-tenant, 
> Spark clusters (i.e. they are willing to have a keytab on every machine 
> running Spark for their application). It wouldn't address all possible 
> deployment scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark streaming jobs, which often run on dedicated 
> hardware since they are long-running services.






[jira] [Commented] (SPARK-18750) spark should be able to control the number of executor and should not throw stack overslow

2017-01-22 Thread Jelmer Kuperus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833386#comment-15833386
 ] 

Jelmer Kuperus commented on SPARK-18750:


I am seeing the exact same issue when using dynamic allocation and running just a basic Spark SQL query over a large dataset.

> spark should be able to control the number of executor and should not throw 
> stack overslow
> --
>
> Key: SPARK-18750
> URL: https://issues.apache.org/jira/browse/SPARK-18750
> Project: Spark
>  Issue Type: Bug
>Reporter: Neerja Khattar
>
> When running SQL queries on large datasets, the job fails with a stack overflow 
> warning and shows that it is requesting lots of executors.
> Looks like there is no limit on the number of executors, not even an upper 
> bound based on the available YARN resources.
> {noformat}
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s).
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s).
> 16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 
> time(s) in a row.
> java.lang.StackOverflowError
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:57)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:36)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:156)
>   at 
> scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:60)
>   at scala.collection.immutable.Map$Map4.updated(Map.scala:172)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:173)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:158)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
>