[jira] [Commented] (SPARK-17368) Scala value classes create encoder problems and break at runtime
[ https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583873#comment-15583873 ]

Aris Vlasakakis commented on SPARK-17368:
-----------------------------------------

That is great, thank you for the help with this.

> Scala value classes create encoder problems and break at runtime
> ----------------------------------------------------------------
>
>                 Key: SPARK-17368
>                 URL: https://issues.apache.org/jira/browse/SPARK-17368
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.6.2, 2.0.0
>         Environment: JDK 8 on MacOS
>                      Scala 2.11.8
>                      Spark 2.0.0
>            Reporter: Aris Vlasakakis
>            Assignee: Jakob Odersky
>             Fix For: 2.1.0
>
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0
> and 1.6.X.
> This simple Spark 2 application demonstrates that the code will compile, but
> will break at runtime with the error. The value class is of course
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding:
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
>    +- assertnotnull(input[0, int, true], top level non-flat input object)
>       +- input[0, int, true]
> 	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
> 	at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> 	at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> {noformat}
> Test code for Spark 2.0.0:
> {noformat}
> import org.apache.spark.sql.{Dataset, SparkSession}
>
> object BreakSpark {
>   case class FeatureId(v: Int) extends AnyVal
>
>   def main(args: Array[String]): Unit = {
>     val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
>     val spark = SparkSession.builder.getOrCreate()
>     import spark.implicits._
>     spark.sparkContext.setLogLevel("warn")
>     val ds: Dataset[FeatureId] = spark.createDataset(seq)
>     println(s"BREAK HERE: ${ds.count}")
>   }
> }
> {noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
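The "Couldn't find v on int" error comes from value-class erasure: the Scala compiler replaces FeatureId with its underlying Int wherever it can, so reflection-driven code such as Catalyst's encoder sees a bare primitive with no field `v`. A minimal Spark-free sketch of that erasure (object and method names here are illustrative, not from the ticket):

```scala
object ValueClassErasure {
  // A value class: wraps an Int without a runtime allocation in most positions.
  case class FeatureId(v: Int) extends AnyVal

  // At the bytecode level this method's parameter erases to a plain int.
  def unwrap(id: FeatureId): Int = id.v

  def main(args: Array[String]): Unit = {
    // Ask the JVM, via reflection, what parameter type `unwrap` really takes.
    val paramType = this.getClass.getMethods
      .find(_.getName == "unwrap").get
      .getParameterTypes.head
    println(paramType.getName) // the primitive, not the wrapper class
  }
}
```

Running this prints `int`, which is exactly the type Catalyst trips over when it goes looking for a `v` field.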
[jira] [Commented] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)
[ https://issues.apache.org/jira/browse/SPARK-17131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514085#comment-15514085 ]

Aris Vlasakakis commented on SPARK-17131:
-----------------------------------------

Hi there, I discovered a bug, and it also pertains to code generation with many columns -- although in my case the bugs within Janino code generation in Catalyst start after several hundred columns. Are these somehow related? My bug report was merged into this one: [https://issues.apache.org/jira/browse/SPARK-16845]

> Code generation fails when running SQL expressions against a wide dataset (thousands of columns)
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17131
>                 URL: https://issues.apache.org/jira/browse/SPARK-17131
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Iaroslav Zeigerman
>
> When reading a CSV file that contains 1776 columns, Spark and Janino fail to
> generate the code with message:
> {noformat}
> Constant pool has grown past JVM limit of 0x
> {noformat}
> When running a common select with all columns it's fine:
> {code}
> val allCols = df.columns.map(c => col(c).as(c + "_alias"))
> val newDf = df.select(allCols: _*)
> newDf.show()
> {code}
> But when I invoke the describe method:
> {code}
> newDf.describe(allCols: _*)
> {code}
> it fails with the following stack trace:
> {noformat}
> 	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889)
> 	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941)
> 	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938)
> 	at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> 	at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> 	... 30 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown past JVM limit of 0x
> 	at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402)
> 	at org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300)
> 	at org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307)
> 	at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868)
> 	at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346)
> 	at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185)
> 	at org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265)
> 	at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321)
> 	at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
> 	at org.codehaus.janino.UnitCompiler.fakeCompile(UnitCompiler.java:2605)
> 	at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4362)
> 	at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3975)
> 	at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185)
> 	at org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263)
> 	at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
> 	at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
> 	at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
> 	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2662)
> 	at org.codehaus.janino.UnitCompiler.access$4400(UnitCompiler.java:185)
> 	at org.codehaus.janino.UnitCompiler$7.visitMethodInvocation(UnitCompiler.java:2627)
> 	at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
> 	at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654)
> 	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643)
> {noformat}
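The constant-pool error above is a hard JVM class-file limit, not something specific to Spark: a class file's constant pool is indexed by a 16-bit value, so it can hold at most 65535 entries, and any generated class that needs more distinct constants cannot be compiled. A Spark-free sketch of the same limit using the JDK's own compiler (class and field names are made up for illustration; this needs a JDK, not just a JRE):

```scala
import java.nio.file.Files
import javax.tools.ToolProvider

object ConstantPoolLimit {
  def main(args: Array[String]): Unit = {
    // 40,000 fields, each with a distinct name and a distinct constant value,
    // need well over 65535 constant-pool entries, so javac refuses to
    // compile the class ("too many constants").
    val fields = (0 until 40000)
      .map(i => s"  static final int C$i = ${100000 + i};")
      .mkString("\n")
    val src = s"public class TooWide {\n$fields\n}\n"

    val dir  = Files.createTempDirectory("cplimit")
    val file = dir.resolve("TooWide.java")
    Files.write(file, src.getBytes("UTF-8"))

    val rc = ToolProvider.getSystemJavaCompiler.run(null, null, null, file.toString)
    println(if (rc == 0) "compiled fine" else "constant pool overflow")
  }
}
```

Catalyst hits the same wall when it spills thousands of column expressions into a single generated class.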
[jira] [Commented] (SPARK-17368) Scala value classes create encoder problems and break at runtime
[ https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15469564#comment-15469564 ]

Aris Vlasakakis commented on SPARK-17368:
-----------------------------------------

It goes from inconvenient to actually prohibitive in a practical sense. I have a Dataset[Something], and inside case class Something I have various other case classes, and somewhere inside there is a particular value class. It is so crazy to do manual unwrapping and rewrapping that at this point I just decided to eat the performance cost and use a regular class, not a value class (I removed the 'extends AnyVal').

More generally, specially accommodating value classes is *really hard* in a practical setting, because if I have a whole bunch of ADTs and other case classes I'm working with, how do I know whether anywhere in my domain I used a *value class*, so that I suddenly have to jump through a bunch of hoops just so Spark doesn't blow up? If I just had a Dataset[ThisIsAValueClass] with the top-level class being a value class, what you're saying is easy, but in practice the value class is one of many things somewhere deeper.
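The workaround this comment settles on (dropping `extends AnyVal` and eating the allocation cost) amounts to a one-line change; the sketch below only illustrates that change and is not code from the ticket:

```scala
// Before: a value class. Spark's encoder breaks on this at runtime,
// because the wrapper erases to a bare Int.
//   case class FeatureId(v: Int) extends AnyVal

// After: an ordinary case class. Each instance now costs a heap
// allocation, but Spark's product encoder handles it like any other
// single-field case class.
case class FeatureId(v: Int)

object Workaround {
  def main(args: Array[String]): Unit = {
    val ids = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
    // Plain case classes wrap and unwrap with no special handling.
    println(ids.map(_.v).sum) // 6
  }
}
```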
[jira] [Commented] (SPARK-17368) Scala value classes create encoder problems and break at runtime
[ https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15459649#comment-15459649 ]

Aris Vlasakakis commented on SPARK-17368:
-----------------------------------------

I actually had an identical first thought from my experience of value classes and how they disappear in the JVM bytecode. It would be very helpful if the documentation said somewhere that Scala value classes are explicitly not supported by Spark Datasets. The error messages are extremely cryptic and very confusing. Even better would be some kind of macro support or whatever else by Spark that would find the value class in your code and call it out, but that's wishful thinking.
[jira] [Commented] (SPARK-17367) Cannot define value classes in REPL
[ https://issues.apache.org/jira/browse/SPARK-17367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15457816#comment-15457816 ]

Aris Vlasakakis commented on SPARK-17367:
-----------------------------------------

No, value classes work in the regular Scala REPL, just not spark-shell.

> Cannot define value classes in REPL
> -----------------------------------
>
>                 Key: SPARK-17367
>                 URL: https://issues.apache.org/jira/browse/SPARK-17367
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell
>            Reporter: Jakob Odersky
>
> It is currently not possible to define a class extending `AnyVal` in the
> REPL. The underlying reason is the {{-Yrepl-class-based}} option used by
> Spark Shell.
> The report here is more of an FYI for anyone stumbling upon the problem; see
> the upstream issue [https://issues.scala-lang.org/browse/SI-9910] for any
> progress.
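The distinction drawn here, that compiled Scala and the stock REPL accept value classes while spark-shell's {{-Yrepl-class-based}} wrapping rejects them, can be checked with a tiny compiled example. The `Meters` name is made up for illustration; the point is only that a top-level value class compiles and runs fine:

```scala
// A top-level value class: legal in compiled code and in the stock Scala
// REPL. Under -Yrepl-class-based every definition is nested inside a
// wrapper class, and value classes may not be members of another class,
// which is why spark-shell rejects the same definition.
case class Meters(value: Double) extends AnyVal {
  def +(other: Meters): Meters = Meters(value + other.value)
}

object ReplCheck {
  def main(args: Array[String]): Unit = {
    println((Meters(1.5) + Meters(2.5)).value) // 4.0
  }
}
```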
[jira] [Commented] (SPARK-17368) Scala value classes create encoder problems and break at runtime
[ https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15456979#comment-15456979 ]

Aris Vlasakakis commented on SPARK-17368:
-----------------------------------------

Yes, agreed that this code cannot be pasted into {{spark-shell}}. My assumption was that this would be a compiled JAR and then passed into {{spark-submit}} -- a normal Spark application. When you do so, the code compiles but Spark breaks at runtime.
[jira] [Updated] (SPARK-17368) Scala value classes create encoder problems and break at runtime
[ https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aris Vlasakakis updated SPARK-17368:
------------------------------------
    Description:
Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 and 1.6.X.
This simple Spark 2 application demonstrates that the code will compile, but will break at runtime with the error. The value class is of course *FeatureId*, as it extends AnyVal.
{noformat}
Exception in thread "main" java.lang.RuntimeException: Error while encoding:
java.lang.RuntimeException: Couldn't find v on int
assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
+- assertnotnull(input[0, int, true], top level non-flat input object).v
   +- assertnotnull(input[0, int, true], top level non-flat input object)
      +- input[0, int, true]
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
	at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
	at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
{noformat}
Test code for Spark 2.0.0:
{noformat}
import org.apache.spark.sql.{Dataset, SparkSession}

object BreakSpark {
  case class FeatureId(v: Int) extends AnyVal

  def main(args: Array[String]): Unit = {
    val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._
    spark.sparkContext.setLogLevel("warn")
    val ds: Dataset[FeatureId] = spark.createDataset(seq)
    println(s"BREAK HERE: ${ds.count}")
  }
}
{noformat}

  was: the same description, except the test-code heading read "Test code:" rather than "Test code for Spark 2.0.0:".
[jira] [Updated] (SPARK-17368) Scala value classes create encoder problems and break at runtime
[ https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aris Vlasakakis updated SPARK-17368:
------------------------------------
    Environment:
JDK 8 on MacOS
Scala 2.11.8
Spark 2.0.0

  was: Java 8 on MacOS
[jira] [Created] (SPARK-17368) Scala value classes create encoder problems and break at runtime
Aris Vlasakakis created SPARK-17368:
---------------------------------------

             Summary: Scala value classes create encoder problems and break at runtime
                 Key: SPARK-17368
                 URL: https://issues.apache.org/jira/browse/SPARK-17368
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 2.0.0, 1.6.2
         Environment: Java 8 on MacOS
            Reporter: Aris Vlasakakis