[jira] [Created] (SPARK-45675) Specify number of partitions when creating spark dataframe from pandas dataframe
Jelmer Kuperus created SPARK-45675:
--

Summary: Specify number of partitions when creating spark dataframe from pandas dataframe
Key: SPARK-45675
URL: https://issues.apache.org/jira/browse/SPARK-45675
Project: Spark
Issue Type: Improvement
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Jelmer Kuperus

When converting a large pandas DataFrame to a Spark DataFrame like so

{code:java}
import pandas as pd

pdf = pd.DataFrame([{"board_id": "3074457346698037360_0", "file_name": "board-content", "value": "A" * 119251} for i in range(0, 2)])
spark.createDataFrame(pdf).write.mode("overwrite").format("delta").saveAsTable("catalog.schema.table")
{code}

you can encounter the following error:

{noformat}
org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 11:1 was 366405365 bytes, which exceeds max allowed: spark.rpc.message.maxSize (268435456 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.
{noformat}

As far as I can tell, Spark first converts the pandas DataFrame into a Python list and then constructs an RDD out of that list, which means the parallelism is determined by the value of spark.sparkContext.defaultParallelism. If the pandas DataFrame is very large and the number of available cores is low, you end up with very large tasks that exceed the limit imposed on task size.

Methods like spark.sparkContext.parallelize allow you to pass in the number of partitions of the resulting dataset, and I think having a similar capability when creating a DataFrame from a pandas DataFrame makes a lot of sense. Right now the only workaround I can think of is changing the value of spark.default.parallelism, but that is a system-wide setting.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
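One possible workaround sketch until such a parameter exists (my own illustration, not an official API; the partition count of 64 is arbitrary): since spark.sparkContext.parallelize already accepts a partition count, the RDD can be built explicitly and the DataFrame created from it.

{code:python}
from pyspark.sql import Row

# Build Row objects from the pandas DataFrame ourselves, then parallelize with
# an explicit partition count so no single serialized task exceeds the RPC limit.
rows = [Row(**r) for r in pdf.to_dict("records")]
rdd = spark.sparkContext.parallelize(rows, numSlices=64)  # 64 chosen for illustration

# Schema is inferred from the Row objects; the write mirrors the example above.
spark.createDataFrame(rdd).write.mode("overwrite").format("delta").saveAsTable("catalog.schema.table")
{code}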
[jira] [Updated] (SPARK-42622) StackOverflowError reading json that does not conform to schema
[ https://issues.apache.org/jira/browse/SPARK-42622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jelmer Kuperus updated SPARK-42622:
---

Description: Databricks Runtime 12.1 uses a pre-release version of Spark 3.4.x; we encountered the following problem:

!https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png!

was: !https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png!

> StackOverflowError reading json that does not conform to schema
> ---
>
> Key: SPARK-42622
> URL: https://issues.apache.org/jira/browse/SPARK-42622
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.4.0
> Reporter: Jelmer Kuperus
> Priority: Critical
>
> Databricks Runtime 12.1 uses a pre-release version of Spark 3.4.x; we
> encountered the following problem:
>
> !https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42622) StackOverflowError reading json that does not conform to schema
[ https://issues.apache.org/jira/browse/SPARK-42622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jelmer Kuperus updated SPARK-42622:
---

Flags: Patch, Important

Patch available here: https://github.com/apache/spark/pull/40219

> StackOverflowError reading json that does not conform to schema
> ---
>
> Key: SPARK-42622
> URL: https://issues.apache.org/jira/browse/SPARK-42622
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.4.0
> Reporter: Jelmer Kuperus
> Priority: Critical
>
> !https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42622) StackOverflowError reading json that does not conform to schema
Jelmer Kuperus created SPARK-42622:
--

Summary: StackOverflowError reading json that does not conform to schema
Key: SPARK-42622
URL: https://issues.apache.org/jira/browse/SPARK-42622
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 3.4.0
Reporter: Jelmer Kuperus

!https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27710) ClassNotFoundException: $line196400984558.$read$ in OuterScopes
[ https://issues.apache.org/jira/browse/SPARK-27710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034532#comment-17034532 ]

Jelmer Kuperus edited comment on SPARK-27710 at 2/12/20 8:11 AM:
-

This also happens in Apache Toree

{code:java}
val mySpark = spark
import mySpark.implicits._

case class AttributeRow(categoryId: String, key: String, count: Long, label: String)

mySpark.read.parquet("/user/jkuperus/foo").as[AttributeRow]
  .limit(1)
  .map(r => r)
  .show()
{code}

Gives

{noformat}
StackTrace:
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
{noformat}

was (Author: jelmer):
This also happens in Apache Toree

{code:java}
case class AttributeRow(categoryId: String, key: String, count: Long, label: String)

val mySpark = spark
import mySpark.implicits._

spark.read.parquet("/user/jkuperus/foo").as[AttributeRow]
  .limit(1)
  .map(r => r)
  .show()
{code}

Gives

{noformat}
StackTrace:
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
{noformat}

> ClassNotFoundException: $line196400984558.$read$ in OuterScopes
> ---
>
> Key: SPARK-27710
> URL: https://issues.apache.org/jira/browse/SPARK-27710
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Josh Rosen
> Priority: Major
>
> My colleague hit the following exception when using Spark in a Zeppelin
> notebook:
> {code:java}
> java.lang.ClassNotFoundException: $line196400984558.$read$
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
> at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
> at scala.Option.map(Option.scala:146)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
> at org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
> at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
> at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at
[jira] [Commented] (SPARK-27710) ClassNotFoundException: $line196400984558.$read$ in OuterScopes
[ https://issues.apache.org/jira/browse/SPARK-27710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035135#comment-17035135 ]

Jelmer Kuperus commented on SPARK-27710:


Turns out it was https://issues.apache.org/jira/browse/SPARK-25587 for me, and defining the case class on its own line works as an ugly workaround. Polynote does not suffer from this because it does not use the Scala shell underneath.

> ClassNotFoundException: $line196400984558.$read$ in OuterScopes
> ---
>
> Key: SPARK-27710
> URL: https://issues.apache.org/jira/browse/SPARK-27710
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Josh Rosen
> Priority: Major
>
> My colleague hit the following exception when using Spark in a Zeppelin
> notebook:
> {code:java}
> java.lang.ClassNotFoundException: $line196400984558.$read$
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
> at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
> at scala.Option.map(Option.scala:146)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
> at org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
> at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
> at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$class.prepareArguments(objects.scala:98)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.prepareArguments(objects.scala:431)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:483)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
> at org.apache.spark.sql.execution.DeserializeToObjectExec.doConsume(objects.scala:84)
> at org.apache.spark.sql.execution.CodegenSupport$class.constructDoConsumeFunction(WholeStageCodegenExec.scala:211)
> at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:182)
> at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:357)
> at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:386)
> at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:90)
> at
[jira] [Comment Edited] (SPARK-27710) ClassNotFoundException: $line196400984558.$read$ in OuterScopes
[ https://issues.apache.org/jira/browse/SPARK-27710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034532#comment-17034532 ]

Jelmer Kuperus edited comment on SPARK-27710 at 2/11/20 3:07 PM:
-

This also happens in Apache Toree

{code:java}
case class AttributeRow(categoryId: String, key: String, count: Long, label: String)

val mySpark = spark
import mySpark.implicits._

spark.read.parquet("/user/jkuperus/foo").as[AttributeRow]
  .limit(1)
  .map(r => r)
  .show()
{code}

Gives

{noformat}
StackTrace:
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
{noformat}

was (Author: jelmer):
This also happens in Apache Toree

{code:java}
val mySpark = spark
import mySpark.implicits._

spark.read.parquet("/user/jkuperus/foo").as[AttributeRow]
  .limit(1)
  .map(r => r)
  .show()
{code}

Gives

{noformat}
StackTrace:
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
{noformat}

> ClassNotFoundException: $line196400984558.$read$ in OuterScopes
> ---
>
> Key: SPARK-27710
> URL: https://issues.apache.org/jira/browse/SPARK-27710
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Josh Rosen
> Priority: Major
>
> My colleague hit the following exception when using Spark in a Zeppelin
> notebook:
> {code:java}
> java.lang.ClassNotFoundException: $line196400984558.$read$
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
> at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
> at scala.Option.map(Option.scala:146)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
> at org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
> at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
> at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>
[jira] [Commented] (SPARK-27710) ClassNotFoundException: $line196400984558.$read$ in OuterScopes
[ https://issues.apache.org/jira/browse/SPARK-27710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034532#comment-17034532 ]

Jelmer Kuperus commented on SPARK-27710:


This also happens in Apache Toree

{code:java}
val mySpark = spark
import mySpark.implicits._

spark.read.parquet("/user/jkuperus/foo").as[AttributeRow]
  .limit(1)
  .map(r => r)
  .show()
{code}

Gives

{noformat}
StackTrace:
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
{noformat}

> ClassNotFoundException: $line196400984558.$read$ in OuterScopes
> ---
>
> Key: SPARK-27710
> URL: https://issues.apache.org/jira/browse/SPARK-27710
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Josh Rosen
> Priority: Major
>
> My colleague hit the following exception when using Spark in a Zeppelin
> notebook:
> {code:java}
> java.lang.ClassNotFoundException: $line196400984558.$read$
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
> at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
> at scala.Option.map(Option.scala:146)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
> at org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
> at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
> at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$class.prepareArguments(objects.scala:98)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.prepareArguments(objects.scala:431)
> at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:483)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
> at org.apache.spark.sql.execution.DeserializeToObjectExec.doConsume(objects.scala:84)
> at
[jira] [Comment Edited] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516856#comment-16516856 ]

Jelmer Kuperus edited comment on SPARK-5158 at 6/19/18 9:33 AM:

I ended up with the following workaround, which at first glance seems to work:

1. Create a _.java.login.config_ file in the home directory of the user running Spark, with the following contents:

{noformat}
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache="true"
  ticketCache="/tmp/krb5cc_0"
  keyTab="/path/to/my.keytab"
  principal="u...@foo.com";
};
{noformat}

2. Put a krb5.conf file in /etc/krb5.conf

3. Place your Hadoop configuration in /etc/hadoop/conf and in `core-site.xml` set:
* fs.defaultFS to webhdfs://your_hostname:14000/webhdfs/v1
* hadoop.security.authentication to kerberos
* hadoop.security.authorization to true

4. Make sure the Hadoop config is on the classpath of Spark, e.g. the process should have something like this in it:

{noformat}
-cp /etc/spark/:/usr/share/spark/jars/*:/etc/hadoop/conf/
{noformat}

This configures a single principal for the entire Spark process. If you want to change the default paths to the configuration files you can use:

{noformat}
-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/path/to/jaas.conf
{noformat}

was (Author: jelmer):
I ended up with the following workaround, which at first glance seems to work:

1. Create a `.java.login.config` file in the home directory of the user running Spark, with the following contents:

{noformat}
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache="true"
  ticketCache="/tmp/krb5cc_0"
  keyTab="/path/to/my.keytab"
  principal="u...@foo.com";
};
{noformat}

2. Put a krb5.conf file in /etc/krb5.conf

3. Place your Hadoop configuration in /etc/hadoop/conf and in `core-site.xml` set:
* fs.defaultFS to webhdfs://your_hostname:14000/webhdfs/v1
* hadoop.security.authentication to kerberos
* hadoop.security.authorization to true

4. Make sure the Hadoop config is on the classpath of Spark, e.g. the process should have something like this in it:

{noformat}
-cp /etc/spark/:/usr/share/spark/jars/*:/etc/hadoop/conf/
{noformat}

This configures a single principal for the entire Spark process. If you want to change the default paths to the configuration files you can use:

{noformat}
-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/path/to/jaas.conf
{noformat}

> Allow for keytab-based HDFS security in Standalone mode
> ---
>
> Key: SPARK-5158
> URL: https://issues.apache.org/jira/browse/SPARK-5158
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Patrick Wendell
> Assignee: Matthew Cheah
> Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS
> clusters in standalone mode. The main reason we haven't accepted these
> patches have been that they rely on insecure distribution of token files from
> the driver to the other components.
> As a simpler solution, I wonder if we should just provide a way to have the
> Spark driver and executors independently log in and acquire credentials using
> a keytab. This would work for users who have a dedicated, single-tenant,
> Spark clusters (i.e. they are willing to have a keytab on every machine
> running Spark for their application). It wouldn't address all possible
> deployment scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark streaming jobs, which often run on dedicated
> hardware since they are long-running services.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516856#comment-16516856 ]

Jelmer Kuperus edited comment on SPARK-5158 at 6/19/18 9:32 AM:

I ended up with the following workaround, which at first glance seems to work:

1. Create a `.java.login.config` file in the home directory of the user running Spark, with the following contents:

{noformat}
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache="true"
  ticketCache="/tmp/krb5cc_0"
  keyTab="/path/to/my.keytab"
  principal="u...@foo.com";
};
{noformat}

2. Put a krb5.conf file in /etc/krb5.conf

3. Place your Hadoop configuration in /etc/hadoop/conf and in `core-site.xml` set:
* fs.defaultFS to webhdfs://your_hostname:14000/webhdfs/v1
* hadoop.security.authentication to kerberos
* hadoop.security.authorization to true

4. Make sure the Hadoop config is on the classpath of Spark, e.g. the process should have something like this in it:

{noformat}
-cp /etc/spark/:/usr/share/spark/jars/*:/etc/hadoop/conf/
{noformat}

This configures a single principal for the entire Spark process. If you want to change the default paths to the configuration files you can use:

{noformat}
-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/path/to/jaas.conf
{noformat}

was (Author: jelmer):
I ended up with the following workaround, which at first glance seems to work:

1. Create a `.java.login.config` file in the home directory of the user running Spark, with the following contents:

{noformat}
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache="true"
  ticketCache="/tmp/krb5cc_0"
  keyTab="/path/to/my.keytab"
  principal="u...@foo.com";
};
{noformat}

2. Put a krb5.conf file in /etc/krb5.conf

3. Place your Hadoop configuration in /etc/hadoop/conf and in `core-site.xml` set:
* fs.defaultFS to webhdfs://your_hostname:14000/webhdfs/v1
* hadoop.security.authentication to kerberos
* hadoop.security.authorization to true

4. Make sure the Hadoop config is on the classpath of Spark, e.g. the process should have something like this in it:

{noformat}
-cp /etc/spark/:/usr/share/spark/jars/*:/etc/hadoop/conf/
{noformat}

> Allow for keytab-based HDFS security in Standalone mode
> ---
>
> Key: SPARK-5158
> URL: https://issues.apache.org/jira/browse/SPARK-5158
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Patrick Wendell
> Assignee: Matthew Cheah
> Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS
> clusters in standalone mode. The main reason we haven't accepted these
> patches have been that they rely on insecure distribution of token files from
> the driver to the other components.
> As a simpler solution, I wonder if we should just provide a way to have the
> Spark driver and executors independently log in and acquire credentials using
> a keytab. This would work for users who have a dedicated, single-tenant,
> Spark clusters (i.e. they are willing to have a keytab on every machine
> running Spark for their application). It wouldn't address all possible
> deployment scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark streaming jobs, which often run on dedicated
> hardware since they are long-running services.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516856#comment-16516856 ]

Jelmer Kuperus commented on SPARK-5158:
---

I ended up with the following workaround, which at first glance seems to work:

1. Create a `.java.login.config` file in the home directory of the user running Spark, with the following contents:

{noformat}
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache="true"
  ticketCache="/tmp/krb5cc_0"
  keyTab="/path/to/my.keytab"
  principal="u...@foo.com";
};
{noformat}

2. Put a krb5.conf file in /etc/krb5.conf

3. Place your Hadoop configuration in /etc/hadoop/conf and in `core-site.xml` set:
* fs.defaultFS to webhdfs://your_hostname:14000/webhdfs/v1
* hadoop.security.authentication to kerberos
* hadoop.security.authorization to true

4. Make sure the Hadoop config is on the classpath of Spark, e.g. the process should have something like this in it:

{noformat}
-cp /etc/spark/:/usr/share/spark/jars/*:/etc/hadoop/conf/
{noformat}

> Allow for keytab-based HDFS security in Standalone mode
> ---
>
> Key: SPARK-5158
> URL: https://issues.apache.org/jira/browse/SPARK-5158
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Patrick Wendell
> Assignee: Matthew Cheah
> Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS
> clusters in standalone mode. The main reason we haven't accepted these
> patches have been that they rely on insecure distribution of token files from
> the driver to the other components.
> As a simpler solution, I wonder if we should just provide a way to have the
> Spark driver and executors independently log in and acquire credentials using
> a keytab. This would work for users who have a dedicated, single-tenant,
> Spark clusters (i.e. they are willing to have a keytab on every machine
> running Spark for their application). It wouldn't address all possible
> deployment scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark streaming jobs, which often run on dedicated
> hardware since they are long-running services.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
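For what it's worth, a minimal smoke test of this setup could look like the following (a sketch only; the path is a hypothetical example, and it assumes the JAAS login, krb5.conf, and Hadoop configuration described above are in place, so paths resolve against the webhdfs fs.defaultFS):

{code:python}
# Hypothetical smoke test: with the keytab-based JAAS login configured above,
# a plain read through the default (webhdfs) filesystem should succeed
# without running kinit first. The file path is an arbitrary example.
df = spark.read.text("/tmp/kerberos-smoke-test.txt")
df.show(5, truncate=False)
{code}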
[jira] [Commented] (SPARK-18750) Spark should be able to control the number of executors and should not throw stack overflow
[ https://issues.apache.org/jira/browse/SPARK-18750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833386#comment-15833386 ]

Jelmer Kuperus commented on SPARK-18750:

I am seeing the exact same issue when using dynamic allocation and doing just a basic Spark SQL query over a large dataset.

> Spark should be able to control the number of executors and should not throw
> stack overflow
> --
>
> Key: SPARK-18750
> URL: https://issues.apache.org/jira/browse/SPARK-18750
> Project: Spark
> Issue Type: Bug
> Reporter: Neerja Khattar
>
> When running SQL queries on large datasets, the job fails with a stack
> overflow warning and it shows it is requesting lots of executors.
> Looks like there is no limit on the number of executors, not even an
> upper bound based on available YARN resources.
> {noformat}
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : bdtcstr61n5.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : bdtcstr61n8.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : bdtcstr61n2.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 32770 executor(s).
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor containers, each with 1 cores and 6758 MB memory including 614 MB overhead
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 52902 executor(s).
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : bdtcstr61n5.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : bdtcstr61n8.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : bdtcstr61n2.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 32770 executor(s).
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor containers, each with 1 cores and 6758 MB memory including 614 MB overhead
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 52902 executor(s).
> 16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 time(s) in a row.
> java.lang.StackOverflowError
> at scala.collection.immutable.HashMap.$plus(HashMap.scala:57)
> at scala.collection.immutable.HashMap.$plus(HashMap.scala:36)
> at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
> at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
> at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
> at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
> at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
> at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
> at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:24)
> at scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:156)
> at scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105)
> at scala.collection.immutable.HashMap.$plus(HashMap.scala:60)
> at scala.collection.immutable.Map$Map4.updated(Map.scala:172)
> at scala.collection.immutable.Map$Map4.$plus(Map.scala:173)
> at scala.collection.immutable.Map$Map4.$plus(Map.scala:158)
> at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
> at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
> at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
> at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
> at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
> at
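This does not address the underlying StackOverflowError, but the runaway executor requests can at least be bounded when dynamic allocation is enabled. A sketch (these are standard Spark configuration properties; the values are illustrative, not recommendations):

{code:python}
from pyspark.sql import SparkSession

# Cap dynamic allocation so the driver can never request an unbounded number
# of executors, no matter how many tasks are pending. Values are illustrative.
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")  # required for dynamic allocation on YARN
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "200")
    .getOrCreate()
)
{code}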