[jira] [Issue Comment Deleted] (SPARK-19039) UDF ClosureCleaner bug when UDF, col applied in paste mode in REPL
[ https://issues.apache.org/jira/browse/SPARK-19039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jen-Ming Chung updated SPARK-19039: --- Comment: was deleted (was: It's weird..you will not get error messages if you paste the code line-by-line. {code} 17/10/31 09:37:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://ip-172-31-9-112.ap-northeast-1.compute.internal:4040 Spark context available as 'sc' (master = local[*], app id = local-1509442670084). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151) Type in expressions to have them evaluated. Type :help for more information. scala> import org.apache.spark.sql.functions._ import org.apache.spark.sql.functions._ scala> val df = spark.createDataFrame(Seq( | ("hi", 1), | ("there", 2), | ("the", 3), | ("end", 4) | )).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: string, b: int] scala> val myNumbers = Set(1,2,3) myNumbers: scala.collection.immutable.Set[Int] = Set(1, 2, 3) scala> val tmpUDF = udf { (n: Int) => myNumbers.contains(n) } tmpUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(,BooleanType,Some(List(IntegerType))) scala> val rowHasMyNumber = tmpUDF($"b") rowHasMyNumber: org.apache.spark.sql.Column = UDF(b) scala> df.where(rowHasMyNumber).show() +-+---+ |a| b| +-+---+ | hi| 1| |there| 2| | the| 3| +-+---+ {code} ) > UDF ClosureCleaner bug when UDF, col applied in paste mode in REPL > -- > > Key: SPARK-19039 > URL: https://issues.apache.org/jira/browse/SPARK-19039 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.3.0 >Reporter: Joseph K. Bradley > > When I try this: > * Define UDF > * Apply UDF to get Column > * Use Column in a DataFrame > I can find weird behavior in the spark-shell when using paste mode. 
> To reproduce this, paste this into the spark-shell: > {code} > import org.apache.spark.sql.functions._ > val df = spark.createDataFrame(Seq( > ("hi", 1), > ("there", 2), > ("the", 3), > ("end", 4) > )).toDF("a", "b") > val myNumbers = Set(1,2,3) > val tmpUDF = udf { (n: Int) => myNumbers.contains(n) } > val rowHasMyNumber = tmpUDF($"b") > df.where(rowHasMyNumber).show() > {code} > Stack trace for Spark 2.0 (similar for other versions): > {code} > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2057) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:817) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:816) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:358) > at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:816) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:364) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) > at > org.apache.spark.sql.Dataset$$a
[jira] [Commented] (SPARK-19039) UDF ClosureCleaner bug when UDF, col applied in paste mode in REPL
[ https://issues.apache.org/jira/browse/SPARK-19039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226528#comment-16226528 ] Jen-Ming Chung commented on SPARK-19039: It's weird..you will not get error messages if you paste the code line-by-line. {code} 17/10/31 09:37:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://ip-172-31-9-112.ap-northeast-1.compute.internal:4040 Spark context available as 'sc' (master = local[*], app id = local-1509442670084). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151) Type in expressions to have them evaluated. Type :help for more information. scala> import org.apache.spark.sql.functions._ import org.apache.spark.sql.functions._ scala> val df = spark.createDataFrame(Seq( | ("hi", 1), | ("there", 2), | ("the", 3), | ("end", 4) | )).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: string, b: int] scala> val myNumbers = Set(1,2,3) myNumbers: scala.collection.immutable.Set[Int] = Set(1, 2, 3) scala> val tmpUDF = udf { (n: Int) => myNumbers.contains(n) } tmpUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(,BooleanType,Some(List(IntegerType))) scala> val rowHasMyNumber = tmpUDF($"b") rowHasMyNumber: org.apache.spark.sql.Column = UDF(b) scala> df.where(rowHasMyNumber).show() +-+---+ |a| b| +-+---+ | hi| 1| |there| 2| | the| 3| +-+---+ {code} > UDF ClosureCleaner bug when UDF, col applied in paste mode in REPL > -- > > Key: SPARK-19039 > URL: https://issues.apache.org/jira/browse/SPARK-19039 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.3.0 >Reporter: Joseph K. Bradley > > When I try this: > * Define UDF > * Apply UDF to get Column > * Use Column in a DataFrame > I can find weird behavior in the spark-shell when using paste mode. 
> To reproduce this, paste this into the spark-shell: > {code} > import org.apache.spark.sql.functions._ > val df = spark.createDataFrame(Seq( > ("hi", 1), > ("there", 2), > ("the", 3), > ("end", 4) > )).toDF("a", "b") > val myNumbers = Set(1,2,3) > val tmpUDF = udf { (n: Int) => myNumbers.contains(n) } > val rowHasMyNumber = tmpUDF($"b") > df.where(rowHasMyNumber).show() > {code} > Stack trace for Spark 2.0 (similar for other versions): > {code} > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2057) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:817) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:816) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:358) > at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:816) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:364) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) > at > org.apa
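A minimal Scala sketch of one commonly suggested workaround (not verified against this exact REPL bug): copy the captured value into a local val before building the UDF, so the closure holds the Set itself rather than the whole REPL wrapper object that also contains the non-serializable Column and DataFrame. The {{df}} and {{myNumbers}} names refer to the reproduction above.

{code}
import org.apache.spark.sql.functions.udf

val tmpUDF = {
  val localNumbers = myNumbers                     // plain Set[Int], serializable on its own
  udf { (n: Int) => localNumbers.contains(n) }     // closure captures only localNumbers
}
df.where(tmpUDF($"b")).show()
{code}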
[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224248#comment-16224248 ] Jen-Ming Chung commented on SPARK-22291: Thank you all :) > Postgresql UUID[] to Cassandra: Conversion Error > > > Key: SPARK-22291 > URL: https://issues.apache.org/jira/browse/SPARK-22291 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, > Cassandra 3 >Reporter: Fabio J. Walter >Assignee: Jen-Ming Chung > Labels: patch, postgresql, sql > Fix For: 2.3.0 > > Attachments: > org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png > > > My job reads data from a PostgreSQL table that contains columns of user_ids > uuid[] type, so that I'm getting the error above when I'm trying to save data > on Cassandra. > However, the creation of this same table on Cassandra works fine! user_ids > list. > I can't change the type on the source table, because I'm reading data from a > legacy system. > I've been looking at point printed on log, on class > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala > Stacktrace on Spark: > {noformat} > Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to > [Ljava.lang.String; > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038) > at > 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$Task
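A hedged reader-side workaround sketch for versions without the JdbcUtils fix (the table, column, and connection details below are hypothetical): cast the {{uuid[]}} column to {{text[]}} in a pushdown subquery so the JDBC reader only ever sees String elements.

{code}
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "(SELECT id, user_ids::text[] AS user_ids FROM events) AS t")
  .option("user", "user")
  .option("password", "secret")
  .load()
{code}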
[jira] [Comment Edited] (SPARK-22283) withColumn should replace multiple instances with a single one
[ https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16206988#comment-16206988 ] Jen-Ming Chung edited comment on SPARK-22283 at 10/17/17 4:45 AM: -- Hi [~kitbellew], the {{withColumn}} method is already reimplemented in this [PR|https://github.com/apache/spark/pull/19229]. was (Author: jmchung): Hi [~kitbellew], I found the `withColumn` has already reimplemented in this [PR|https://github.com/apache/spark/pull/19229]. > withColumn should replace multiple instances with a single one > -- > > Key: SPARK-22283 > URL: https://issues.apache.org/jira/browse/SPARK-22283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Albert Meltzer > > Currently, {{withColumn}} claims to do the following: _"adding a column or > replacing the existing column that has the same name."_ > Unfortunately, if multiple existing columns have the same name (which is a > normal occurrence after a join), this results in multiple replaced -- and > retained -- > columns (with the same value), and messages about an ambiguous column. > The current implementation of {{withColumn}} contains this: > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val shouldReplace = output.exists(f => resolver(f.name, colName)) > if (shouldReplace) { > val columns = output.map { field => > if (resolver(field.name, colName)) { > col.as(colName) > } else { > Column(field) > } > } > select(columns : _*) > } else { > select(Column("*"), col.as(colName)) > } > } > {noformat} > Instead, suggest something like this (which replaces all matching fields with > a single instance of the new one): > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val existing = output.filterNot(f => resolver(f.name, colName)).map(new > Column(_)) > select(existing :+ col.as(colName): _*) > } > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22283) withColumn should replace multiple instances with a single one
[ https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16206988#comment-16206988 ] Jen-Ming Chung commented on SPARK-22283: Hi [~kitbellew], I found that {{withColumn}} has already been reimplemented in this [PR|https://github.com/apache/spark/pull/19229]. > withColumn should replace multiple instances with a single one > -- > > Key: SPARK-22283 > URL: https://issues.apache.org/jira/browse/SPARK-22283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Albert Meltzer > > Currently, {{withColumn}} claims to do the following: _"adding a column or > replacing the existing column that has the same name."_ > Unfortunately, if multiple existing columns have the same name (which is a > normal occurrence after a join), this results in multiple replaced -- and > retained -- > columns (with the same value), and messages about an ambiguous column. > The current implementation of {{withColumn}} contains this: > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val shouldReplace = output.exists(f => resolver(f.name, colName)) > if (shouldReplace) { > val columns = output.map { field => > if (resolver(field.name, colName)) { > col.as(colName) > } else { > Column(field) > } > } > select(columns : _*) > } else { > select(Column("*"), col.as(colName)) > } > } > {noformat} > Instead, suggest something like this (which replaces all matching fields with > a single instance of the new one): > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val existing = output.filterNot(f => resolver(f.name, colName)).map(new > Column(_)) > select(existing :+ col.as(colName): _*) > } > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
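A short Scala sketch (hypothetical column names) of the situation the proposed change addresses: after a join, two input columns share the name {{v}}, and {{withColumn("v", ...)}} in Spark 2.2 replaces and keeps both copies, so a later {{select("v")}} reports an ambiguous column.

{code}
import org.apache.spark.sql.functions.lit

val left   = Seq((1, "a")).toDF("id", "v")
val right  = Seq((1, "b")).toDF("id", "v")
val joined = left.join(right, Seq("id"))        // columns: id, v, v
val replaced = joined.withColumn("v", lit("c")) // both "v" columns are replaced and retained
replaced.select("v")                            // fails: ambiguous reference to 'v'
{code}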
[jira] [Comment Edited] (SPARK-22019) JavaBean int type property
[ https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167734#comment-16167734 ] Jen-Ming Chung edited comment on SPARK-22019 at 9/15/17 11:29 AM: -- The alternative is giving the explicit schema instead inferring that you don't need to change your pojo class in above test case. {code} StructType schema = new StructType() .add("id", IntegerType) .add("str", StringType); Dataset df = spark.read().schema(schema).json(stringdataset).as( org.apache.spark.sql.Encoders.bean(SampleData.class)); {code} was (Author: jmchung): The alternative is giving the explicit schema instead inferring, means you don't need to change your pojo class. {code} StructType schema = new StructType() .add("id", IntegerType) .add("str", StringType); Dataset df = spark.read().schema(schema).json(stringdataset).as( org.apache.spark.sql.Encoders.bean(SampleData.class)); {code} > JavaBean int type property > --- > > Key: SPARK-22019 > URL: https://issues.apache.org/jira/browse/SPARK-22019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > when the type of SampleData's id is int, following code generates errors. > when long, it's ok. > > {code:java} > @Test > public void testDataSet2() { > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\", \"id\": 1}"); > arr.add("{\"str\": \"Hello\", \"id\": 1}"); > //1.read array and change to string dataset. > JavaRDD data = sc.parallelize(arr); > Dataset stringdataset = sqc.createDataset(data.rdd(), > Encoders.STRING()); > stringdataset.show(); //PASS > //2. convert string dataset to sampledata dataset > Dataset df = > sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class)); > df.show();//PASS > df.printSchema();//PASS > Dataset fad = df.flatMap(SampleDataFlat::flatMap, > Encoders.bean(SampleDataFlat.class)); > fad.show(); //ERROR > fad.printSchema(); > } > public static class SampleData implements Serializable { > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public int getId() { > return id; > } > public void setId(int id) { > this.id = id; > } > String str; > int id; > } > public static class SampleDataFlat { > String str; > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public SampleDataFlat(String str, long id) { > this.str = str; > } > public static Iterator flatMap(SampleData data) { > ArrayList arr = new ArrayList<>(); > arr.add(new SampleDataFlat(data.getStr(), data.getId())); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+1)); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+2)); > return arr.iterator(); > } > } > {code} > ==Error message== > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 38, Column 16: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 38, Column 16: No applicable constructor/method found for actual parameters > "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)" > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new > SparkUnitTest$SampleData(); > /* 028 */ this.javaBean = value1; > /* 029 */ if (!false) { > /* 030 */ > /* 031 */ > /* 032 */ boolean isNull3 = i.isNullAt(0); > /* 033 */ long value3 = isNull3 ? 
-1L : (i.getLong(0)); > /* 034 */ > /* 035 */ if (isNull3) { > /* 036 */ throw new NullPointerException(((java.lang.String) > references[0])); > /* 037 */ } > /* 038 */ javaBean.setId(value3); -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22019) JavaBean int type property
[ https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167734#comment-16167734 ] Jen-Ming Chung edited comment on SPARK-22019 at 9/15/17 11:28 AM: -- The alternative is giving the explicit schema instead inferring, means you don't need to change your pojo class. {code} StructType schema = new StructType() .add("id", IntegerType) .add("str", StringType); Dataset df = spark.read().schema(schema).json(stringdataset).as( org.apache.spark.sql.Encoders.bean(SampleData.class)); {code} was (Author: jmchung): The alternative is giving the explicit schema instead inferring, means you don't need to change your pojo class. {code} StructType schema = new StructType().add("id", IntegerType).add("str", StringType); Dataset df = spark.read().schema(schema).json(stringdataset).as( org.apache.spark.sql.Encoders.bean(SampleData.class)); {code} > JavaBean int type property > --- > > Key: SPARK-22019 > URL: https://issues.apache.org/jira/browse/SPARK-22019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > when the type of SampleData's id is int, following code generates errors. > when long, it's ok. > > {code:java} > @Test > public void testDataSet2() { > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\", \"id\": 1}"); > arr.add("{\"str\": \"Hello\", \"id\": 1}"); > //1.read array and change to string dataset. > JavaRDD data = sc.parallelize(arr); > Dataset stringdataset = sqc.createDataset(data.rdd(), > Encoders.STRING()); > stringdataset.show(); //PASS > //2. convert string dataset to sampledata dataset > Dataset df = > sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class)); > df.show();//PASS > df.printSchema();//PASS > Dataset fad = df.flatMap(SampleDataFlat::flatMap, > Encoders.bean(SampleDataFlat.class)); > fad.show(); //ERROR > fad.printSchema(); > } > public static class SampleData implements Serializable { > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public int getId() { > return id; > } > public void setId(int id) { > this.id = id; > } > String str; > int id; > } > public static class SampleDataFlat { > String str; > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public SampleDataFlat(String str, long id) { > this.str = str; > } > public static Iterator flatMap(SampleData data) { > ArrayList arr = new ArrayList<>(); > arr.add(new SampleDataFlat(data.getStr(), data.getId())); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+1)); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+2)); > return arr.iterator(); > } > } > {code} > ==Error message== > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 38, Column 16: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 38, Column 16: No applicable constructor/method found for actual parameters > "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)" > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new > SparkUnitTest$SampleData(); > /* 028 */ this.javaBean = value1; > /* 029 */ if (!false) { > /* 030 */ > /* 031 */ > /* 032 */ boolean isNull3 = i.isNullAt(0); > /* 033 */ long value3 = isNull3 ? 
-1L : (i.getLong(0)); > /* 034 */ > /* 035 */ if (isNull3) { > /* 036 */ throw new NullPointerException(((java.lang.String) > references[0])); > /* 037 */ } > /* 038 */ javaBean.setId(value3); -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22019) JavaBean int type property
[ https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167734#comment-16167734 ] Jen-Ming Chung commented on SPARK-22019: The alternative is giving the explicit schema instead inferring, means you don't need to change your pojo class. {code} StructType schema = new StructType().add("id", IntegerType).add("str", StringType); Dataset df = spark.read().schema(schema).json(stringdataset).as( org.apache.spark.sql.Encoders.bean(SampleData.class)); {code} > JavaBean int type property > --- > > Key: SPARK-22019 > URL: https://issues.apache.org/jira/browse/SPARK-22019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > when the type of SampleData's id is int, following code generates errors. > when long, it's ok. > > {code:java} > @Test > public void testDataSet2() { > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\", \"id\": 1}"); > arr.add("{\"str\": \"Hello\", \"id\": 1}"); > //1.read array and change to string dataset. > JavaRDD data = sc.parallelize(arr); > Dataset stringdataset = sqc.createDataset(data.rdd(), > Encoders.STRING()); > stringdataset.show(); //PASS > //2. convert string dataset to sampledata dataset > Dataset df = > sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class)); > df.show();//PASS > df.printSchema();//PASS > Dataset fad = df.flatMap(SampleDataFlat::flatMap, > Encoders.bean(SampleDataFlat.class)); > fad.show(); //ERROR > fad.printSchema(); > } > public static class SampleData implements Serializable { > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public int getId() { > return id; > } > public void setId(int id) { > this.id = id; > } > String str; > int id; > } > public static class SampleDataFlat { > String str; > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public SampleDataFlat(String str, long id) { > this.str = str; > } > public static Iterator flatMap(SampleData data) { > ArrayList arr = new ArrayList<>(); > arr.add(new SampleDataFlat(data.getStr(), data.getId())); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+1)); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+2)); > return arr.iterator(); > } > } > {code} > ==Error message== > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 38, Column 16: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 38, Column 16: No applicable constructor/method found for actual parameters > "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)" > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new > SparkUnitTest$SampleData(); > /* 028 */ this.javaBean = value1; > /* 029 */ if (!false) { > /* 030 */ > /* 031 */ > /* 032 */ boolean isNull3 = i.isNullAt(0); > /* 033 */ long value3 = isNull3 ? -1L : (i.getLong(0)); > /* 034 */ > /* 035 */ if (isNull3) { > /* 036 */ throw new NullPointerException(((java.lang.String) > references[0])); > /* 037 */ } > /* 038 */ javaBean.setId(value3); -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22019) JavaBean int type property
[ https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167725#comment-16167725 ] Jen-Ming Chung edited comment on SPARK-22019 at 9/15/17 11:18 AM: -- Hi [~client.test], The schema inferred after {{sqc.read().json(stringdataset)}} as below, {code} root |-- id: long (nullable = true) |-- str: string (nullable = true) {code} However, the pojo class {{SampleData.class}} the member {{id}} is declared as {{int}} instead of {{long}}, this will cause the subsequent exception in your test case. So set the {{long}} type to {{id}} in {{SampleData.class}} then executing the test case again, you can expect the following results: {code} ++ | str| ++ |everyone| |everyone| |everyone| | Hello| | Hello| | Hello| ++ root |-- str: string (nullable = true) {code} As you can see, we missing the {{id}} in schema, we need to add the {{id}} and corresponding getter and setter, {code} class SampleDataFlat { ... long id; public long getId() { return id; } public void setId(long id) { this.id = id; } public SampleDataFlat(String str, long id) { this.str = str; this.id = id; } ... } {code} Then you will get the following results: {code} +---++ | id| str| +---++ | 1|everyone| | 2|everyone| | 3|everyone| | 1| Hello| | 2| Hello| | 3| Hello| +---++ root |-- id: long (nullable = true) |-- str: string (nullable = true) {code} was (Author: jmchung): Hi [~client.test], The schema inferred after {{sqc.read().json(stringdataset)}} as below, {code} root |-- id: long (nullable = true) |-- str: string (nullable = true) {code} However, the pojo class {{SampleData.class}} the member {{id}} is declared as {{int}} instead of {{long}}, this will cause the subsequent exception in your test case. So set the {{long}} type to {{id}} in {{SampleData.class}} then executing the test case again, you can expect the following results: {code} ++ | str| ++ |everyone| |everyone| |everyone| | Hello| | Hello| | Hello| ++ root |-- str: string (nullable = true) {code} As you can see, we missing the {{id}} in schema, we need to add the {{id}} and corresponding getter and setter, {code} class SampleDataFlat { long id; public long getId() { return id; } public void setId(long id) { this.id = id; } public SampleDataFlat(String str, long id) { this.str = str; this.id = id; } } {code} Then you will get the following results: {code} +---++ | id| str| +---++ | 1|everyone| | 2|everyone| | 3|everyone| | 1| Hello| | 2| Hello| | 3| Hello| +---++ root |-- id: long (nullable = true) |-- str: string (nullable = true) {code} > JavaBean int type property > --- > > Key: SPARK-22019 > URL: https://issues.apache.org/jira/browse/SPARK-22019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > when the type of SampleData's id is int, following code generates errors. > when long, it's ok. > > {code:java} > @Test > public void testDataSet2() { > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\", \"id\": 1}"); > arr.add("{\"str\": \"Hello\", \"id\": 1}"); > //1.read array and change to string dataset. > JavaRDD data = sc.parallelize(arr); > Dataset stringdataset = sqc.createDataset(data.rdd(), > Encoders.STRING()); > stringdataset.show(); //PASS > //2. 
convert string dataset to sampledata dataset > Dataset df = > sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class)); > df.show();//PASS > df.printSchema();//PASS > Dataset fad = df.flatMap(SampleDataFlat::flatMap, > Encoders.bean(SampleDataFlat.class)); > fad.show(); //ERROR > fad.printSchema(); > } > public static class SampleData implements Serializable { > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public int getId() { > return id; > } > public void setId(int id) { > this.id = id; > } > String str; > int id; > } > public static class SampleDataFlat { > String str; > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public SampleDataFlat(String str, long id) { > this.str = str; > } > public static Iterator flatMap(SampleData da
[jira] [Commented] (SPARK-22019) JavaBean int type property
[ https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167725#comment-16167725 ] Jen-Ming Chung commented on SPARK-22019: Hi [~client.test], The schema inferred after {{sqc.read().json(stringdataset)}} is as below: {code} root |-- id: long (nullable = true) |-- str: string (nullable = true) {code} However, in the POJO class {{SampleData.class}} the member {{id}} is declared as {{int}} instead of {{long}}, which causes the subsequent exception in your test case. So change {{id}} to {{long}} in {{SampleData.class}} and run the test case again; you can expect the following results: {code} ++ | str| ++ |everyone| |everyone| |everyone| | Hello| | Hello| | Hello| ++ root |-- str: string (nullable = true) {code} As you can see, {{id}} is missing from the schema; we need to add {{id}} and the corresponding getter and setter: {code} class SampleDataFlat { long id; public long getId() { return id; } public void setId(long id) { this.id = id; } public SampleDataFlat(String str, long id) { this.str = str; this.id = id; } } {code} Then you will get the following results: {code} +---++ | id| str| +---++ | 1|everyone| | 2|everyone| | 3|everyone| | 1| Hello| | 2| Hello| | 3| Hello| +---++ root |-- id: long (nullable = true) |-- str: string (nullable = true) {code} > JavaBean int type property > --- > > Key: SPARK-22019 > URL: https://issues.apache.org/jira/browse/SPARK-22019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > when the type of SampleData's id is int, following code generates errors. > when long, it's ok. > > {code:java} > @Test > public void testDataSet2() { > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\", \"id\": 1}"); > arr.add("{\"str\": \"Hello\", \"id\": 1}"); > //1.read array and change to string dataset. > JavaRDD data = sc.parallelize(arr); > Dataset stringdataset = sqc.createDataset(data.rdd(), > Encoders.STRING()); > stringdataset.show(); //PASS > //2. 
convert string dataset to sampledata dataset > Dataset df = > sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class)); > df.show();//PASS > df.printSchema();//PASS > Dataset fad = df.flatMap(SampleDataFlat::flatMap, > Encoders.bean(SampleDataFlat.class)); > fad.show(); //ERROR > fad.printSchema(); > } > public static class SampleData implements Serializable { > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public int getId() { > return id; > } > public void setId(int id) { > this.id = id; > } > String str; > int id; > } > public static class SampleDataFlat { > String str; > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public SampleDataFlat(String str, long id) { > this.str = str; > } > public static Iterator flatMap(SampleData data) { > ArrayList arr = new ArrayList<>(); > arr.add(new SampleDataFlat(data.getStr(), data.getId())); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+1)); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+2)); > return arr.iterator(); > } > } > {code} > ==Error message== > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 38, Column 16: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 38, Column 16: No applicable constructor/method found for actual parameters > "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)" > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new > SparkUnitTest$SampleData(); > /* 028 */ this.javaBean = value1; > /* 029 */ if (!false) { > /* 030 */ > /* 031 */ > /* 032 */ boolean isNull3 = i.isNullAt(0); > /* 033 */ long value3 = isNull3 ? -1L : (i.getLong(0)); > /* 034 */ > /* 035 */ if (isNull3) { > /* 036 */ throw new NullPointerException(((java.lang.String) > references[0])); > /* 037 */ } > /* 038 */ javaBean.setId(value3); -- This message was sent by Atla
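An untested Scala sketch of a related alternative that keeps the bean's {{int}} field: cast the JSON-inferred {{long}} column down to {{int}} before applying the bean encoder. Here {{spark}} and {{stringdataset}} stand in for the session and the Dataset of JSON strings from the test case above.

{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

val withIntId = spark.read
  .json(stringdataset)                            // "id" is inferred as long
  .withColumn("id", col("id").cast(IntegerType))  // now matches setId(int)
// withIntId.as(Encoders.bean(classOf[SampleData])) should then deserialize without the codegen error
{code}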
[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class
[ https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164152#comment-16164152 ] Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 6:16 AM: - Hi [~client.test], I wrote the above code in Scala and run in Spark 2.2.0 can show the schema and content you expected. {code:language=Scala} case class SampleData(str: String) ... import spark.implicits._ val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}") val rdd: RDD[SampleData] = spark .sparkContext .parallelize(arr) .map(v => new Gson().fromJson[SampleData](v, classOf[SampleData])) val ds = spark.createDataset(rdd) ds.printSchema() root |-- str: string (nullable = true) ds.show(false) ++ |str | ++ |everyone| |Hello | ++ {code} was (Author: jmchung): Hi [~client.test], I write the above code in Scala and run in Spark 2.2.0 can show the schema and content you expected. {code:language=Scala} case class SampleData(str: String) ... import spark.implicits._ val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}") val rdd: RDD[SampleData] = spark .sparkContext .parallelize(arr) .map(v => new Gson().fromJson[SampleData](v, classOf[SampleData])) val ds = spark.createDataset(rdd) ds.printSchema() root |-- str: string (nullable = true) ds.show(false) ++ |str | ++ |everyone| |Hello | ++ {code} > createDataset and the schema of encoder class > - > > Key: SPARK-21989 > URL: https://issues.apache.org/jira/browse/SPARK-21989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > Hello. > public class SampleData implements Serializable { > public String str; > } > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\"}"); > arr.add("{\"str\": \"Hello\"}"); > JavaRDD data2 = sc.parallelize(arr).map(v -> {return new > Gson().fromJson(v, SampleData.class);}); > Dataset df = sqc.createDataset(data2.rdd(), > Encoders.bean(SampleData.class)); > df.printSchema(); > expected result of printSchema is str field of sampleData class. > actual result is following. > root > and if i call df.show() it displays like following. > ++ > || > ++ > || > || > ++ > what i expected is , "hello", "everyone" will be displayed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class
[ https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164152#comment-16164152 ] Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 5:27 AM: - Hi [~client.test], I write the above code in Scala and run in Spark 2.2.0 can show the schema and content you expected. {code:language=Scala} case class SampleData(str: String) ... import spark.implicits._ val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}") val rdd: RDD[SampleData] = spark .sparkContext .parallelize(arr) .map(v => new Gson().fromJson[SampleData](v, classOf[SampleData])) val ds = spark.createDataset(rdd) ds.printSchema() root |-- str: string (nullable = true) ds.show(false) ++ |str | ++ |everyone| |Hello | ++ {code} was (Author: jmchung): Hi [~client.test], I write the above code in Scala and run in Spark 2.2.0 can show the schema and content you expected. {code:language=Scala} case class SimpleData(str: String) ... import spark.implicits._ val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}") val rdd: RDD[SimpleData] = spark .sparkContext .parallelize(arr) .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData])) val ds = spark.createDataset(rdd) ds.printSchema() root |-- str: string (nullable = true) ds.show(false) ++ |str | ++ |everyone| |Hello | ++ {code} > createDataset and the schema of encoder class > - > > Key: SPARK-21989 > URL: https://issues.apache.org/jira/browse/SPARK-21989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > Hello. > public class SampleData implements Serializable { > public String str; > } > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\"}"); > arr.add("{\"str\": \"Hello\"}"); > JavaRDD data2 = sc.parallelize(arr).map(v -> {return new > Gson().fromJson(v, SampleData.class);}); > Dataset df = sqc.createDataset(data2.rdd(), > Encoders.bean(SampleData.class)); > df.printSchema(); > expected result of printSchema is str field of sampleData class. > actual result is following. > root > and if i call df.show() it displays like following. > ++ > || > ++ > || > || > ++ > what i expected is , "hello", "everyone" will be displayed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class
[ https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164152#comment-16164152 ] Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 4:58 AM: - Hi [~client.test], I write the above code in Scala and run in Spark 2.2.0 can show the schema and content you expected. {code:language=Scala} case class SimpleData(str: String) ... import spark.implicits._ val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}") val rdd: RDD[SimpleData] = spark .sparkContext .parallelize(arr) .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData])) val ds = spark.createDataset(rdd) ds.printSchema() root |-- str: string (nullable = true) ds.show(false) ++ |str | ++ |everyone| |Hello | ++ {code} was (Author: jmchung): Hi [~client.test], I write the above code in Scala and run in Spark 2.2.0 can show the schema and content you expected. {code:Scala} case class SimpleData(str: String) ... import spark.implicits._ val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}") val rdd: RDD[SimpleData] = spark .sparkContext .parallelize(arr) .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData])) val ds = spark.createDataset(rdd) ds.printSchema() root |-- str: string (nullable = true) ds.show(false) ++ |str | ++ |everyone| |Hello | ++ {code} > createDataset and the schema of encoder class > - > > Key: SPARK-21989 > URL: https://issues.apache.org/jira/browse/SPARK-21989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > Hello. > public class SampleData implements Serializable { > public String str; > } > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\"}"); > arr.add("{\"str\": \"Hello\"}"); > JavaRDD data2 = sc.parallelize(arr).map(v -> {return new > Gson().fromJson(v, SampleData.class);}); > Dataset df = sqc.createDataset(data2.rdd(), > Encoders.bean(SampleData.class)); > df.printSchema(); > expected result of printSchema is str field of sampleData class. > actual result is following. > root > and if i call df.show() it displays like following. > ++ > || > ++ > || > || > ++ > what i expected is , "hello", "everyone" will be displayed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class
[ https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164152#comment-16164152 ] Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 4:56 AM: - Hi [~client.test], I write the above code in Scala and run in Spark 2.2.0 can show the schema and content you expected. {code:Scala} case class SimpleData(str: String) ... import spark.implicits._ val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}") val rdd: RDD[SimpleData] = spark .sparkContext .parallelize(arr) .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData])) val ds = spark.createDataset(rdd) ds.printSchema() root |-- str: string (nullable = true) ds.show(false) ++ |str | ++ |everyone| |Hello | ++ {code} was (Author: jmchung): Hi [~client.test], I write the above code in scala and run in Spark 2.2.0 can show the schema and content you expected. {code:java} case class SimpleData(str: String) ... import spark.implicits._ val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}") val rdd: RDD[SimpleData] = spark .sparkContext .parallelize(arr) .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData])) val ds = spark.createDataset(rdd) ds.printSchema() root |-- str: string (nullable = true) ds.show(false) ++ |str | ++ |everyone| |Hello | ++ {code} > createDataset and the schema of encoder class > - > > Key: SPARK-21989 > URL: https://issues.apache.org/jira/browse/SPARK-21989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > Hello. > public class SampleData implements Serializable { > public String str; > } > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\"}"); > arr.add("{\"str\": \"Hello\"}"); > JavaRDD data2 = sc.parallelize(arr).map(v -> {return new > Gson().fromJson(v, SampleData.class);}); > Dataset df = sqc.createDataset(data2.rdd(), > Encoders.bean(SampleData.class)); > df.printSchema(); > expected result of printSchema is str field of sampleData class. > actual result is following. > root > and if i call df.show() it displays like following. > ++ > || > ++ > || > || > ++ > what i expected is , "hello", "everyone" will be displayed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21989) createDataset and the schema of encoder class
[ https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164152#comment-16164152 ] Jen-Ming Chung commented on SPARK-21989: Hi [~client.test], I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema and content you expected. {code:scala} case class SimpleData(str: String) ... import spark.implicits._ val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}") val rdd: RDD[SimpleData] = spark .sparkContext .parallelize(arr) .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData])) val ds = spark.createDataset(rdd) ds.printSchema() root |-- str: string (nullable = true) ds.show(false) ++ |str | ++ |everyone| |Hello | ++ {code} > createDataset and the schema of encoder class > - > > Key: SPARK-21989 > URL: https://issues.apache.org/jira/browse/SPARK-21989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > Hello. > public class SampleData implements Serializable { > public String str; > } > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\"}"); > arr.add("{\"str\": \"Hello\"}"); > JavaRDD data2 = sc.parallelize(arr).map(v -> {return new > Gson().fromJson(v, SampleData.class);}); > Dataset df = sqc.createDataset(data2.rdd(), > Encoders.bean(SampleData.class)); > df.printSchema(); > expected result of printSchema is str field of sampleData class. > actual result is following. > root > and if i call df.show() it displays like following. > ++ > || > ++ > || > || > ++ > what i expected is , "hello", "everyone" will be displayed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class
[ https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164152#comment-16164152 ] Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 4:55 AM: - Hi [~client.test], I write the above code in scala and run in Spark 2.2.0 can show the schema and content you expected. {code:java} case class SimpleData(str: String) ... import spark.implicits._ val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}") val rdd: RDD[SimpleData] = spark .sparkContext .parallelize(arr) .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData])) val ds = spark.createDataset(rdd) ds.printSchema() root |-- str: string (nullable = true) ds.show(false) ++ |str | ++ |everyone| |Hello | ++ {code} was (Author: jmchung): Hi [~client.test], I write the above code in scala and run in Spark 2.2.0 can show the schema and content you expected. {code:scala} case class SimpleData(str: String) ... import spark.implicits._ val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}") val rdd: RDD[SimpleData] = spark .sparkContext .parallelize(arr) .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData])) val ds = spark.createDataset(rdd) ds.printSchema() root |-- str: string (nullable = true) ds.show(false) ++ |str | ++ |everyone| |Hello | ++ {code} > createDataset and the schema of encoder class > - > > Key: SPARK-21989 > URL: https://issues.apache.org/jira/browse/SPARK-21989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > Hello. > public class SampleData implements Serializable { > public String str; > } > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\"}"); > arr.add("{\"str\": \"Hello\"}"); > JavaRDD data2 = sc.parallelize(arr).map(v -> {return new > Gson().fromJson(v, SampleData.class);}); > Dataset df = sqc.createDataset(data2.rdd(), > Encoders.bean(SampleData.class)); > df.printSchema(); > expected result of printSchema is str field of sampleData class. > actual result is following. > root > and if i call df.show() it displays like following. > ++ > || > ++ > || > || > ++ > what i expected is , "hello", "everyone" will be displayed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-21684) df.write double escaping all the already escaped characters except the first one
[ https://issues.apache.org/jira/browse/SPARK-21684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jen-Ming Chung updated SPARK-21684: --- Comment: was deleted (was: Hi, [~taransaini43] If udf_comma with {{.option("escape", "\"")}} to save, the results will be {{ab\,cd\,ef\,gh}} without double escapes.) > df.write double escaping all the already escaped characters except the first > one > > > Key: SPARK-21684 > URL: https://issues.apache.org/jira/browse/SPARK-21684 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Taran Saini > Attachments: SparkQuotesTest2.scala > > > Hi, > If we have a dataframe with the column value as {noformat} ab\,cd\,ef\,gh > {noformat} > Then while writing it is being written as > {noformat} "ab\,cd\\,ef\\,gh" {noformat} > i.e it double escapes all the already escaped commas/delimiters but not the > first one. > This is weird behaviour considering either it should do for all or none. > If I do mention df.option("escape","") as empty then it solves this problem > but the double quotes inside the same value if any are preceded by a special > char i.e '\u00'. Why does it do so when the escape character is set as > ""(empty)? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21684) df.write double escaping all the already escaped characters except the first one
[ https://issues.apache.org/jira/browse/SPARK-21684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141790#comment-16141790 ] Jen-Ming Chung commented on SPARK-21684: Hi [~taransaini43], if you save udf_comma with {{.option("escape", "\"")}}, the result will be {{ab\,cd\,ef\,gh}}, without double escapes. > df.write double escaping all the already escaped characters except the first > one > > > Key: SPARK-21684 > URL: https://issues.apache.org/jira/browse/SPARK-21684 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Taran Saini > Attachments: SparkQuotesTest2.scala > > > Hi, > If we have a dataframe with the column value as {noformat} ab\,cd\,ef\,gh > {noformat} > Then while writing it is being written as > {noformat} "ab\,cd\\,ef\\,gh" {noformat} > i.e it double escapes all the already escaped commas/delimiters but not the > first one. > This is weird behaviour considering either it should do for all or none. > If I do mention df.option("escape","") as empty then it solves this problem > but the double quotes inside the same value if any are preceded by a special > char i.e '\u00'. Why does it do so when the escape character is set as > ""(empty)? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
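A minimal sketch of the setting referred to above, assuming a DataFrame {{df}} whose column already contains backslash-escaped commas; the output path is hypothetical.

{code}
df.write
  .option("escape", "\"")        // use the quote character as the escape character
  .csv("/tmp/escaped-output")
{code}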
[jira] [Commented] (SPARK-21804) json_tuple returns null values within repeated columns except the first one
[ https://issues.apache.org/jira/browse/SPARK-21804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136594#comment-16136594 ] Jen-Ming Chung commented on SPARK-21804: Submitted a PR at [https://github.com/apache/spark/pull/19017] > json_tuple returns null values within repeated columns except the first one > --- > > Key: SPARK-21804 > URL: https://issues.apache.org/jira/browse/SPARK-21804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jen-Ming Chung >Priority: Minor > Labels: starter > > I was testing json_tuple in extracting values from JSON but I found it could > actually returns null values within repeated columns except the first one as > below: > {code:language=scala} > scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', > 'a')""").show() > +---+---++ > | c0| c1| c2| > +---+---++ > | 1| 2|null| > +---+---++ > {code} > I think this should be consistent with Hive's implementation: > {code:language=scala} > hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a'); > ... > 11 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21804) json_tuple returns null values within repeated columns except the first one
[ https://issues.apache.org/jira/browse/SPARK-21804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136385#comment-16136385 ] Jen-Ming Chung commented on SPARK-21804: I’m working on this. > json_tuple returns null values within repeated columns except the first one > --- > > Key: SPARK-21804 > URL: https://issues.apache.org/jira/browse/SPARK-21804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jen-Ming Chung >Priority: Minor > Labels: starter > > I was testing json_tuple for extracting values from JSON and found that it > actually returns null values for repeated columns other than the first one, as > shown below: > {code:language=scala} > scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', > 'a')""").show() > +---+---+----+ > | c0| c1|  c2| > +---+---+----+ > |  1|  2|null| > +---+---+----+ > {code} > I think this should be consistent with Hive's implementation: > {code:language=scala} > hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a'); > ... > 1 1 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21804) json_tuple returns null values within repeated columns except the first one
Jen-Ming Chung created SPARK-21804: -- Summary: json_tuple returns null values within repeated columns except the first one Key: SPARK-21804 URL: https://issues.apache.org/jira/browse/SPARK-21804 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Jen-Ming Chung Priority: Minor I was testing json_tuple for extracting values from JSON and found that it actually returns null values for repeated columns other than the first one, as shown below: {code:language=scala} scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'a')""").show() +---+---+----+ | c0| c1|  c2| +---+---+----+ |  1|  2|null| +---+---+----+ {code} I think this should be consistent with Hive's implementation: {code:language=scala} hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a'); ... 1 1 {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
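One possible interim workaround, until the pull request above is merged: {{get_json_object}} evaluates each path independently, so a repeated field keeps its value in every column. A minimal spark-shell sketch against the same JSON literal (the subquery alias {{t}} is only illustrative):
{code:language=scala}
// Sketch only: get_json_object does not share state across columns, so asking
// for '$.a' twice yields the value both times instead of null.
spark.sql("""
  SELECT get_json_object(j, '$.a') AS c0,
         get_json_object(j, '$.b') AS c1,
         get_json_object(j, '$.a') AS c2
  FROM (SELECT '{"a":1, "b":2}' AS j) t
""").show()
// +---+---+---+
// | c0| c1| c2|
// +---+---+---+
// |  1|  2|  1|
// +---+---+---+
{code}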
[jira] [Commented] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.
[ https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121641#comment-16121641 ] Jen-Ming Chung commented on SPARK-21677: [~hyukjin.kwon], regarding the {{NULL}} return you mentioned: does it mean all fields should be null in json_tuple, or only the non-existent field, as shown in the following? Thanks! {code:language=scala|borderStyle=solid} e.g., spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'not_existing_fields')""").show() +---+---+----+ | c0| c1|  c2| +---+---+----+ |  1|  2|null| +---+---+----+ {code} > json_tuple throws NullPointException when column is null as string type. > > > Key: SPARK-21677 > URL: https://issues.apache.org/jira/browse/SPARK-21677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: Starter > > I was testing {{json_tuple}} before using this to extract values from JSONs > in my testing cluster but I found it could actually throw > {{NullPointerException}} as below sometimes: > {code} > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show() > +---+ > | c0| > +---+ > |224| > +---+ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show() > +----+ > |  c0| > +----+ > |null| > +----+ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show() > ... > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400) > {code} > It sounds like we should show an explicit error message or return {{NULL}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
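As a side note, one way to avoid the NPE today is to ensure the field-name argument can never evaluate to a literal NULL. The {{coalesce}} fallback to an empty string below is only an illustrative guard, not the fix under discussion; a minimal spark-shell sketch:
{code:language=scala}
// Sketch only: trim(null) evaluates to NULL, which currently makes json_tuple
// throw NullPointerException when it resolves the constant field names.
// Coalescing to '' keeps the field name non-null; no field matches '', so the
// column simply comes back as null instead of failing the whole query.
Seq("""{"Hyukjin": 224, "John": 1225}""").toDS
  .selectExpr("json_tuple(value, coalesce(trim(null), ''))")
  .show()
// +----+
// |  c0|
// +----+
// |null|
// +----+
{code}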
[jira] [Comment Edited] (SPARK-21610) Corrupt records are not handled properly when creating a dataframe from a file
[ https://issues.apache.org/jira/browse/SPARK-21610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116126#comment-16116126 ] Jen-Ming Chung edited comment on SPARK-21610 at 8/7/17 2:07 PM: I have created a pull request for this issue: [https://github.com/apache/spark/pull/18865] was (Author: cjm): User 'jmchung' has created a pull request for this issue: [https://github.com/apache/spark/pull/18865] > Corrupt records are not handled properly when creating a dataframe from a file > -- > > Key: SPARK-21610 > URL: https://issues.apache.org/jira/browse/SPARK-21610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.2.0 > Environment: macOs Sierra 10.12.5 >Reporter: dmtran > > Consider a jsonl file with 3 records. The third record has a value of type > string, instead of int. > {code} > echo '{"field": 1} > {"field": 2} > {"field": "3"}' >/tmp/sample.json > {code} > Create a dataframe from this file, with a schema that contains > "_corrupt_record" so that corrupt records are kept. > {code} > import org.apache.spark.sql.types._ > val schema = new StructType() > .add("field", ByteType) > .add("_corrupt_record", StringType) > val file = "/tmp/sample.json" > val dfFromFile = spark.read.schema(schema).json(file) > {code} > Run the following lines from a spark-shell: > {code} > scala> dfFromFile.show(false) > +-+---+ > |field|_corrupt_record| > +-+---+ > |1|null | > |2|null | > |null |{"field": "3"} | > +-+---+ > scala> dfFromFile.filter($"_corrupt_record".isNotNull).count() > res1: Long = 0 > scala> dfFromFile.filter($"_corrupt_record".isNull).count() > res2: Long = 3 > {code} > The expected result is 1 corrupt record and 2 valid records, but the actual > one is 0 corrupt record and 3 valid records. > The bug is not reproduced if we create a dataframe from a RDD: > {code} > scala> val rdd = sc.textFile(file) > rdd: org.apache.spark.rdd.RDD[String] = /tmp/sample.json MapPartitionsRDD[92] > at textFile at :28 > scala> val dfFromRdd = spark.read.schema(schema).json(rdd) > dfFromRdd: org.apache.spark.sql.DataFrame = [field: tinyint, _corrupt_record: > string] > scala> dfFromRdd.show(false) > +-+---+ > |field|_corrupt_record| > +-+---+ > |1|null | > |2|null | > |null |{"field": "3"} | > +-+---+ > scala> dfFromRdd.filter($"_corrupt_record".isNotNull).count() > res5: Long = 1 > scala> dfFromRdd.filter($"_corrupt_record".isNull).count() > res6: Long = 2 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21610) Corrupt records are not handled properly when creating a dataframe from a file
[ https://issues.apache.org/jira/browse/SPARK-21610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116126#comment-16116126 ] Jen-Ming Chung edited comment on SPARK-21610 at 8/7/17 6:33 AM: User 'jmchung' has created a pull request for this issue: [https://github.com/apache/spark/pull/18865] was (Author: cjm): User 'jmchung' has created a pull request for this issue: https://github.com/apache/spark/pull/18865[https://github.com/apache/spark/pull/18865] > Corrupt records are not handled properly when creating a dataframe from a file > -- > > Key: SPARK-21610 > URL: https://issues.apache.org/jira/browse/SPARK-21610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.2.0 > Environment: macOs Sierra 10.12.5 >Reporter: dmtran > > Consider a jsonl file with 3 records. The third record has a value of type > string, instead of int. > {code} > echo '{"field": 1} > {"field": 2} > {"field": "3"}' >/tmp/sample.json > {code} > Create a dataframe from this file, with a schema that contains > "_corrupt_record" so that corrupt records are kept. > {code} > import org.apache.spark.sql.types._ > val schema = new StructType() > .add("field", ByteType) > .add("_corrupt_record", StringType) > val file = "/tmp/sample.json" > val dfFromFile = spark.read.schema(schema).json(file) > {code} > Run the following lines from a spark-shell: > {code} > scala> dfFromFile.show(false) > +-+---+ > |field|_corrupt_record| > +-+---+ > |1|null | > |2|null | > |null |{"field": "3"} | > +-+---+ > scala> dfFromFile.filter($"_corrupt_record".isNotNull).count() > res1: Long = 0 > scala> dfFromFile.filter($"_corrupt_record".isNull).count() > res2: Long = 3 > {code} > The expected result is 1 corrupt record and 2 valid records, but the actual > one is 0 corrupt record and 3 valid records. > The bug is not reproduced if we create a dataframe from a RDD: > {code} > scala> val rdd = sc.textFile(file) > rdd: org.apache.spark.rdd.RDD[String] = /tmp/sample.json MapPartitionsRDD[92] > at textFile at :28 > scala> val dfFromRdd = spark.read.schema(schema).json(rdd) > dfFromRdd: org.apache.spark.sql.DataFrame = [field: tinyint, _corrupt_record: > string] > scala> dfFromRdd.show(false) > +-+---+ > |field|_corrupt_record| > +-+---+ > |1|null | > |2|null | > |null |{"field": "3"} | > +-+---+ > scala> dfFromRdd.filter($"_corrupt_record".isNotNull).count() > res5: Long = 1 > scala> dfFromRdd.filter($"_corrupt_record".isNull).count() > res6: Long = 2 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21610) Corrupt records are not handled properly when creating a dataframe from a file
[ https://issues.apache.org/jira/browse/SPARK-21610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116126#comment-16116126 ] Jen-Ming Chung edited comment on SPARK-21610 at 8/7/17 6:33 AM: User 'jmchung' has created a pull request for this issue: https://github.com/apache/spark/pull/18865[https://github.com/apache/spark/pull/18865] was (Author: cjm): User 'jmchung' has created a pull request for this issue: https://github.com/apache/spark/pull/18865 [https://github.com/apache/spark/pull/18865] > Corrupt records are not handled properly when creating a dataframe from a file > -- > > Key: SPARK-21610 > URL: https://issues.apache.org/jira/browse/SPARK-21610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.2.0 > Environment: macOs Sierra 10.12.5 >Reporter: dmtran > > Consider a jsonl file with 3 records. The third record has a value of type > string, instead of int. > {code} > echo '{"field": 1} > {"field": 2} > {"field": "3"}' >/tmp/sample.json > {code} > Create a dataframe from this file, with a schema that contains > "_corrupt_record" so that corrupt records are kept. > {code} > import org.apache.spark.sql.types._ > val schema = new StructType() > .add("field", ByteType) > .add("_corrupt_record", StringType) > val file = "/tmp/sample.json" > val dfFromFile = spark.read.schema(schema).json(file) > {code} > Run the following lines from a spark-shell: > {code} > scala> dfFromFile.show(false) > +-+---+ > |field|_corrupt_record| > +-+---+ > |1|null | > |2|null | > |null |{"field": "3"} | > +-+---+ > scala> dfFromFile.filter($"_corrupt_record".isNotNull).count() > res1: Long = 0 > scala> dfFromFile.filter($"_corrupt_record".isNull).count() > res2: Long = 3 > {code} > The expected result is 1 corrupt record and 2 valid records, but the actual > one is 0 corrupt record and 3 valid records. > The bug is not reproduced if we create a dataframe from a RDD: > {code} > scala> val rdd = sc.textFile(file) > rdd: org.apache.spark.rdd.RDD[String] = /tmp/sample.json MapPartitionsRDD[92] > at textFile at :28 > scala> val dfFromRdd = spark.read.schema(schema).json(rdd) > dfFromRdd: org.apache.spark.sql.DataFrame = [field: tinyint, _corrupt_record: > string] > scala> dfFromRdd.show(false) > +-+---+ > |field|_corrupt_record| > +-+---+ > |1|null | > |2|null | > |null |{"field": "3"} | > +-+---+ > scala> dfFromRdd.filter($"_corrupt_record".isNotNull).count() > res5: Long = 1 > scala> dfFromRdd.filter($"_corrupt_record".isNull).count() > res6: Long = 2 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21610) Corrupt records are not handled properly when creating a dataframe from a file
[ https://issues.apache.org/jira/browse/SPARK-21610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116126#comment-16116126 ] Jen-Ming Chung commented on SPARK-21610: User 'jmchung' has created a pull request for this issue: https://github.com/apache/spark/pull/18865 [https://github.com/apache/spark/pull/18865] > Corrupt records are not handled properly when creating a dataframe from a file > -- > > Key: SPARK-21610 > URL: https://issues.apache.org/jira/browse/SPARK-21610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.2.0 > Environment: macOs Sierra 10.12.5 >Reporter: dmtran > > Consider a jsonl file with 3 records. The third record has a value of type > string, instead of int. > {code} > echo '{"field": 1} > {"field": 2} > {"field": "3"}' >/tmp/sample.json > {code} > Create a dataframe from this file, with a schema that contains > "_corrupt_record" so that corrupt records are kept. > {code} > import org.apache.spark.sql.types._ > val schema = new StructType() > .add("field", ByteType) > .add("_corrupt_record", StringType) > val file = "/tmp/sample.json" > val dfFromFile = spark.read.schema(schema).json(file) > {code} > Run the following lines from a spark-shell: > {code} > scala> dfFromFile.show(false) > +-+---+ > |field|_corrupt_record| > +-+---+ > |1|null | > |2|null | > |null |{"field": "3"} | > +-+---+ > scala> dfFromFile.filter($"_corrupt_record".isNotNull).count() > res1: Long = 0 > scala> dfFromFile.filter($"_corrupt_record".isNull).count() > res2: Long = 3 > {code} > The expected result is 1 corrupt record and 2 valid records, but the actual > one is 0 corrupt record and 3 valid records. > The bug is not reproduced if we create a dataframe from a RDD: > {code} > scala> val rdd = sc.textFile(file) > rdd: org.apache.spark.rdd.RDD[String] = /tmp/sample.json MapPartitionsRDD[92] > at textFile at :28 > scala> val dfFromRdd = spark.read.schema(schema).json(rdd) > dfFromRdd: org.apache.spark.sql.DataFrame = [field: tinyint, _corrupt_record: > string] > scala> dfFromRdd.show(false) > +-+---+ > |field|_corrupt_record| > +-+---+ > |1|null | > |2|null | > |null |{"field": "3"} | > +-+---+ > scala> dfFromRdd.filter($"_corrupt_record".isNotNull).count() > res5: Long = 1 > scala> dfFromRdd.filter($"_corrupt_record".isNull).count() > res6: Long = 2 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
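Besides the RDD-based path shown in the description, caching the file-based DataFrame before filtering is a commonly suggested workaround, since it forces every column, including {{_corrupt_record}}, to be materialized rather than pruned by the file scan. A minimal spark-shell sketch that reuses the schema and {{/tmp/sample.json}} from the report:
{code:language=scala}
// Sketch only: same schema as the report; cache() materializes the fully parsed
// rows, so the corrupt-record column survives the column-pruned JSON scan.
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("field", ByteType)
  .add("_corrupt_record", StringType)

val dfFromFile = spark.read.schema(schema).json("/tmp/sample.json")
dfFromFile.cache()

dfFromFile.filter($"_corrupt_record".isNotNull).count()   // expected: 1
dfFromFile.filter($"_corrupt_record".isNull).count()      // expected: 2
{code}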