[jira] [Commented] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.

2017-08-10 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121641#comment-16121641
 ] 

Jen-Ming Chung commented on SPARK-21677:


[~hyukjin.kwon], regarding the {{NULL}} you mentioned returning: does that mean all fields 
should be null in json_tuple, or only the non-existent field, as shown in the 
following? Thanks!

{code:language=scala|borderStyle=solid}
e.g., spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 
'not_exising_fields')""").show()

+---+---++
| c0| c1|  c2|
+---+---++
|  1|  2|null|
+---+---++
{code}
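For reference, a minimal sketch of the other case in question, where the field name itself is {{NULL}} (this is the call that currently raises the {{NullPointerException}} described below); the table shows only the hypothetical output one might expect under the "return {{NULL}}" option, not today's behaviour:

{code:language=scala|borderStyle=solid}
// Hypothetical: if a NULL field name simply produced a NULL column,
// the remaining fields would still be extracted as usual.
spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', trim(null))""").show()

+---+----+
| c0|  c1|
+---+----+
|  1|null|
+---+----+
{code}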



> json_tuple throws NullPointException when column is null as string type.
> 
>
> Key: SPARK-21677
> URL: https://issues.apache.org/jira/browse/SPARK-21677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: Starter
>
> I was testing {{json_tuple}} before using this to extract values from JSONs 
> in my testing cluster, but I found it could actually throw a 
> {{NullPointerException}} as below sometimes:
> {code}
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show()
> +---+
> | c0|
> +---+
> |224|
> +---+
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show()
> +----+
> |  c0|
> +----+
> |null|
> +----+
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
> ...
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400)
> {code}
> It sounds like we should either show an explicit error message or return {{NULL}}.






[jira] [Comment Edited] (SPARK-21610) Corrupt records are not handled properly when creating a dataframe from a file

2017-08-07 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16116126#comment-16116126
 ] 

Jen-Ming Chung edited comment on SPARK-21610 at 8/7/17 2:07 PM:


I have created a pull request for this issue:
[https://github.com/apache/spark/pull/18865]


was (Author: cjm):
User 'jmchung' has created a pull request for this issue:
[https://github.com/apache/spark/pull/18865]

> Corrupt records are not handled properly when creating a dataframe from a file
> --
>
> Key: SPARK-21610
> URL: https://issues.apache.org/jira/browse/SPARK-21610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
> Environment: macOs Sierra 10.12.5
>Reporter: dmtran
>
> Consider a jsonl file with 3 records. The third record has a value of type 
> string, instead of int.
> {code}
> echo '{"field": 1}
> {"field": 2}
> {"field": "3"}' >/tmp/sample.json
> {code}
> Create a dataframe from this file, with a schema that contains 
> "_corrupt_record" so that corrupt records are kept.
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType()
>   .add("field", ByteType)
>   .add("_corrupt_record", StringType)
> val file = "/tmp/sample.json"
> val dfFromFile = spark.read.schema(schema).json(file)
> {code}
> Run the following lines from a spark-shell:
> {code}
> scala> dfFromFile.show(false)
> +-----+---------------+
> |field|_corrupt_record|
> +-----+---------------+
> |1    |null           |
> |2    |null           |
> |null |{"field": "3"} |
> +-----+---------------+
> scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
> res1: Long = 0
> scala> dfFromFile.filter($"_corrupt_record".isNull).count()
> res2: Long = 3
> {code}
> The expected result is 1 corrupt record and 2 valid records, but the actual 
> result is 0 corrupt records and 3 valid records.
> The bug is not reproduced if we create a dataframe from an RDD:
> {code}
> scala> val rdd = sc.textFile(file)
> rdd: org.apache.spark.rdd.RDD[String] = /tmp/sample.json MapPartitionsRDD[92] 
> at textFile at <console>:28
> scala> val dfFromRdd = spark.read.schema(schema).json(rdd)
> dfFromRdd: org.apache.spark.sql.DataFrame = [field: tinyint, _corrupt_record: 
> string]
> scala> dfFromRdd.show(false)
> +-----+---------------+
> |field|_corrupt_record|
> +-----+---------------+
> |1    |null           |
> |2    |null           |
> |null |{"field": "3"} |
> +-----+---------------+
> scala> dfFromRdd.filter($"_corrupt_record".isNotNull).count()
> res5: Long = 1
> scala> dfFromRdd.filter($"_corrupt_record".isNull).count()
> res6: Long = 2
> {code}






[jira] [Comment Edited] (SPARK-21610) Corrupt records are not handled properly when creating a dataframe from a file

2017-08-07 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16116126#comment-16116126
 ] 

Jen-Ming Chung edited comment on SPARK-21610 at 8/7/17 6:33 AM:


User 'jmchung' has created a pull request for this issue:
[https://github.com/apache/spark/pull/18865]


was (Author: cjm):
User 'jmchung' has created a pull request for this issue:
https://github.com/apache/spark/pull/18865[https://github.com/apache/spark/pull/18865]

> Corrupt records are not handled properly when creating a dataframe from a file
> --
>
> Key: SPARK-21610
> URL: https://issues.apache.org/jira/browse/SPARK-21610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
> Environment: macOs Sierra 10.12.5
>Reporter: dmtran
>
> Consider a jsonl file with 3 records. The third record has a value of type 
> string, instead of int.
> {code}
> echo '{"field": 1}
> {"field": 2}
> {"field": "3"}' >/tmp/sample.json
> {code}
> Create a dataframe from this file, with a schema that contains 
> "_corrupt_record" so that corrupt records are kept.
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType()
>   .add("field", ByteType)
>   .add("_corrupt_record", StringType)
> val file = "/tmp/sample.json"
> val dfFromFile = spark.read.schema(schema).json(file)
> {code}
> Run the following lines from a spark-shell:
> {code}
> scala> dfFromFile.show(false)
> +-----+---------------+
> |field|_corrupt_record|
> +-----+---------------+
> |1    |null           |
> |2    |null           |
> |null |{"field": "3"} |
> +-----+---------------+
> scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
> res1: Long = 0
> scala> dfFromFile.filter($"_corrupt_record".isNull).count()
> res2: Long = 3
> {code}
> The expected result is 1 corrupt record and 2 valid records, but the actual 
> result is 0 corrupt records and 3 valid records.
> The bug is not reproduced if we create a dataframe from an RDD:
> {code}
> scala> val rdd = sc.textFile(file)
> rdd: org.apache.spark.rdd.RDD[String] = /tmp/sample.json MapPartitionsRDD[92] 
> at textFile at <console>:28
> scala> val dfFromRdd = spark.read.schema(schema).json(rdd)
> dfFromRdd: org.apache.spark.sql.DataFrame = [field: tinyint, _corrupt_record: 
> string]
> scala> dfFromRdd.show(false)
> +-----+---------------+
> |field|_corrupt_record|
> +-----+---------------+
> |1    |null           |
> |2    |null           |
> |null |{"field": "3"} |
> +-----+---------------+
> scala> dfFromRdd.filter($"_corrupt_record".isNotNull).count()
> res5: Long = 1
> scala> dfFromRdd.filter($"_corrupt_record".isNull).count()
> res6: Long = 2
> {code}






[jira] [Commented] (SPARK-21610) Corrupt records are not handled properly when creating a dataframe from a file

2017-08-07 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16116126#comment-16116126
 ] 

Jen-Ming Chung commented on SPARK-21610:


User 'jmchung' has created a pull request for this issue:
https://github.com/apache/spark/pull/18865 
[https://github.com/apache/spark/pull/18865]

> Corrupt records are not handled properly when creating a dataframe from a file
> --
>
> Key: SPARK-21610
> URL: https://issues.apache.org/jira/browse/SPARK-21610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
> Environment: macOs Sierra 10.12.5
>Reporter: dmtran
>
> Consider a jsonl file with 3 records. The third record has a value of type 
> string, instead of int.
> {code}
> echo '{"field": 1}
> {"field": 2}
> {"field": "3"}' >/tmp/sample.json
> {code}
> Create a dataframe from this file, with a schema that contains 
> "_corrupt_record" so that corrupt records are kept.
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType()
>   .add("field", ByteType)
>   .add("_corrupt_record", StringType)
> val file = "/tmp/sample.json"
> val dfFromFile = spark.read.schema(schema).json(file)
> {code}
> Run the following lines from a spark-shell:
> {code}
> scala> dfFromFile.show(false)
> +-----+---------------+
> |field|_corrupt_record|
> +-----+---------------+
> |1    |null           |
> |2    |null           |
> |null |{"field": "3"} |
> +-----+---------------+
> scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
> res1: Long = 0
> scala> dfFromFile.filter($"_corrupt_record".isNull).count()
> res2: Long = 3
> {code}
> The expected result is 1 corrupt record and 2 valid records, but the actual 
> result is 0 corrupt records and 3 valid records.
> The bug is not reproduced if we create a dataframe from an RDD:
> {code}
> scala> val rdd = sc.textFile(file)
> rdd: org.apache.spark.rdd.RDD[String] = /tmp/sample.json MapPartitionsRDD[92] 
> at textFile at <console>:28
> scala> val dfFromRdd = spark.read.schema(schema).json(rdd)
> dfFromRdd: org.apache.spark.sql.DataFrame = [field: tinyint, _corrupt_record: 
> string]
> scala> dfFromRdd.show(false)
> +-----+---------------+
> |field|_corrupt_record|
> +-----+---------------+
> |1    |null           |
> |2    |null           |
> |null |{"field": "3"} |
> +-----+---------------+
> scala> dfFromRdd.filter($"_corrupt_record".isNotNull).count()
> res5: Long = 1
> scala> dfFromRdd.filter($"_corrupt_record".isNull).count()
> res6: Long = 2
> {code}






[jira] [Comment Edited] (SPARK-21610) Corrupt records are not handled properly when creating a dataframe from a file

2017-08-07 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16116126#comment-16116126
 ] 

Jen-Ming Chung edited comment on SPARK-21610 at 8/7/17 6:33 AM:


User 'jmchung' has created a pull request for this issue:
https://github.com/apache/spark/pull/18865[https://github.com/apache/spark/pull/18865]


was (Author: cjm):
User 'jmchung' has created a pull request for this issue:
https://github.com/apache/spark/pull/18865 
[https://github.com/apache/spark/pull/18865]

> Corrupt records are not handled properly when creating a dataframe from a file
> --
>
> Key: SPARK-21610
> URL: https://issues.apache.org/jira/browse/SPARK-21610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
> Environment: macOs Sierra 10.12.5
>Reporter: dmtran
>
> Consider a jsonl file with 3 records. The third record has a value of type 
> string, instead of int.
> {code}
> echo '{"field": 1}
> {"field": 2}
> {"field": "3"}' >/tmp/sample.json
> {code}
> Create a dataframe from this file, with a schema that contains 
> "_corrupt_record" so that corrupt records are kept.
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType()
>   .add("field", ByteType)
>   .add("_corrupt_record", StringType)
> val file = "/tmp/sample.json"
> val dfFromFile = spark.read.schema(schema).json(file)
> {code}
> Run the following lines from a spark-shell:
> {code}
> scala> dfFromFile.show(false)
> +-----+---------------+
> |field|_corrupt_record|
> +-----+---------------+
> |1    |null           |
> |2    |null           |
> |null |{"field": "3"} |
> +-----+---------------+
> scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
> res1: Long = 0
> scala> dfFromFile.filter($"_corrupt_record".isNull).count()
> res2: Long = 3
> {code}
> The expected result is 1 corrupt record and 2 valid records, but the actual 
> result is 0 corrupt records and 3 valid records.
> The bug is not reproduced if we create a dataframe from an RDD:
> {code}
> scala> val rdd = sc.textFile(file)
> rdd: org.apache.spark.rdd.RDD[String] = /tmp/sample.json MapPartitionsRDD[92] 
> at textFile at <console>:28
> scala> val dfFromRdd = spark.read.schema(schema).json(rdd)
> dfFromRdd: org.apache.spark.sql.DataFrame = [field: tinyint, _corrupt_record: 
> string]
> scala> dfFromRdd.show(false)
> +-----+---------------+
> |field|_corrupt_record|
> +-----+---------------+
> |1    |null           |
> |2    |null           |
> |null |{"field": "3"} |
> +-----+---------------+
> scala> dfFromRdd.filter($"_corrupt_record".isNotNull).count()
> res5: Long = 1
> scala> dfFromRdd.filter($"_corrupt_record".isNull).count()
> res6: Long = 2
> {code}






[jira] [Comment Edited] (SPARK-22019) JavaBean int type property

2017-09-15 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167734#comment-16167734
 ] 

Jen-Ming Chung edited comment on SPARK-22019 at 9/15/17 11:29 AM:
--

The alternative is to give an explicit schema instead of inferring it, so you 
don't need to change your POJO class in the above test case.

{code}
StructType schema = new StructType()
    .add("id", IntegerType)
    .add("str", StringType);
Dataset<SampleData> df = spark.read().schema(schema).json(stringdataset)
    .as(org.apache.spark.sql.Encoders.bean(SampleData.class));
{code}



was (Author: jmchung):
The alternative is giving the explicit schema instead inferring, means you 
don't need to change your pojo class.

{code}
StructType schema = new StructType()
.add("id", IntegerType)
.add("str", StringType);
Dataset df = spark.read().schema(schema).json(stringdataset).as(
org.apache.spark.sql.Encoders.bean(SampleData.class));
{code}


> JavaBean int type property 
> ---
>
> Key: SPARK-22019
> URL: https://issues.apache.org/jira/browse/SPARK-22019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> when the type of SampleData's id is int, following code generates errors.
> when long, it's ok.
>  
> {code:java}
> @Test
> public void testDataSet2() {
> ArrayList<String> arr = new ArrayList<>();
> arr.add("{\"str\": \"everyone\", \"id\": 1}");
> arr.add("{\"str\": \"Hello\", \"id\": 1}");
> //1.read array and change to string dataset.
> JavaRDD<String> data = sc.parallelize(arr);
> Dataset<String> stringdataset = sqc.createDataset(data.rdd(), 
> Encoders.STRING());
> stringdataset.show(); //PASS
> //2. convert string dataset to sampledata dataset
> Dataset<SampleData> df = 
> sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class));
> df.show();//PASS
> df.printSchema();//PASS
> Dataset<SampleDataFlat> fad = df.flatMap(SampleDataFlat::flatMap, 
> Encoders.bean(SampleDataFlat.class));
> fad.show(); //ERROR
> fad.printSchema();
> }
> public static class SampleData implements Serializable {
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public int getId() {
> return id;
> }
> public void setId(int id) {
> this.id = id;
> }
> String str;
> int id;
> }
> public static class SampleDataFlat {
> String str;
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public SampleDataFlat(String str, long id) {
> this.str = str;
> }
> public static Iterator<SampleDataFlat> flatMap(SampleData data) {
> ArrayList<SampleDataFlat> arr = new ArrayList<>();
> arr.add(new SampleDataFlat(data.getStr(), data.getId()));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+1));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+2));
> return arr.iterator();
> }
> }
> {code}
> ==Error message==
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 38, Column 16: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 38, Column 16: No applicable constructor/method found for actual parameters 
> "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)"
> /* 024 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 025 */ InternalRow i = (InternalRow) _i;
> /* 026 */
> /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new 
> SparkUnitTest$SampleData();
> /* 028 */ this.javaBean = value1;
> /* 029 */ if (!false) {
> /* 030 */
> /* 031 */
> /* 032 */   boolean isNull3 = i.isNullAt(0);
> /* 033 */   long value3 = isNull3 ? -1L : (i.getLong(0));
> /* 034 */
> /* 035 */   if (isNull3) {
> /* 036 */ throw new NullPointerException(((java.lang.String) 
> references[0]));
> /* 037 */   }
> /* 038 */   javaBean.setId(value3);






[jira] [Comment Edited] (SPARK-22019) JavaBean int type property

2017-09-15 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167725#comment-16167725
 ] 

Jen-Ming Chung edited comment on SPARK-22019 at 9/15/17 11:18 AM:
--

Hi [~client.test],

The schema inferred after {{sqc.read().json(stringdataset)}} is as follows:
{code}
root
 |-- id: long (nullable = true)
 |-- str: string (nullable = true)
{code}

However, in the POJO class {{SampleData.class}} the member {{id}} is declared as 
{{int}} instead of {{long}}, which causes the subsequent exception in your 
test case.
If you set the type of {{id}} in {{SampleData.class}} to {{long}} and run the 
test case again, you can expect the following results:
{code}
++
| str|
++
|everyone|
|everyone|
|everyone|
|   Hello|
|   Hello|
|   Hello|
++

root
 |-- str: string (nullable = true)
{code}


As you can see, {{id}} is missing from the schema, so we need to add {{id}} and 
the corresponding getter and setter:
{code}
class SampleDataFlat {
...
long id;
public long getId() {
return id;
}

public void setId(long id) {
this.id = id;
}

public SampleDataFlat(String str, long id) {
this.str = str;
this.id = id;
}
...
}
{code}

Then you will get the following results:
{code}
+---++
| id| str|
+---++
|  1|everyone|
|  2|everyone|
|  3|everyone|
|  1|   Hello|
|  2|   Hello|
|  3|   Hello|
+---++

root
 |-- id: long (nullable = true)
 |-- str: string (nullable = true)
{code}


was (Author: jmchung):
Hi [~client.test],

The schema inferred after {{sqc.read().json(stringdataset)}} as below,
{code}
root
 |-- id: long (nullable = true)
 |-- str: string (nullable = true)
{code}

However, the pojo class {{SampleData.class}} the member {{id}} is declared as 
{{int}} instead of {{long}}, this will cause the subsequent exception in your 
test case.
So set the {{long}} type to {{id}} in {{SampleData.class}} then executing the 
test case again, you can expect the following results:
{code}
++
| str|
++
|everyone|
|everyone|
|everyone|
|   Hello|
|   Hello|
|   Hello|
++

root
 |-- str: string (nullable = true)
{code}


As you can see, we missing the {{id}} in schema, we need to add the {{id}} and 
corresponding getter and setter,
{code}
class SampleDataFlat {
long id;
public long getId() {
return id;
}

public void setId(long id) {
this.id = id;
}

public SampleDataFlat(String str, long id) {
this.str = str;
this.id = id;
}
}
{code}

Then you will get the following results:
{code}
+---++
| id| str|
+---++
|  1|everyone|
|  2|everyone|
|  3|everyone|
|  1|   Hello|
|  2|   Hello|
|  3|   Hello|
+---++

root
 |-- id: long (nullable = true)
 |-- str: string (nullable = true)
{code}

> JavaBean int type property 
> ---
>
> Key: SPARK-22019
> URL: https://issues.apache.org/jira/browse/SPARK-22019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> when the type of SampleData's id is int, following code generates errors.
> when long, it's ok.
>  
> {code:java}
> @Test
> public void testDataSet2() {
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\", \"id\": 1}");
> arr.add("{\"str\": \"Hello\", \"id\": 1}");
> //1.read array and change to string dataset.
> JavaRDD data = sc.parallelize(arr);
> Dataset stringdataset = sqc.createDataset(data.rdd(), 
> Encoders.STRING());
> stringdataset.show(); //PASS
> //2. convert string dataset to sampledata dataset
> Dataset df = 
> sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class));
> df.show();//PASS
> df.printSchema();//PASS
> Dataset fad = df.flatMap(SampleDataFlat::flatMap, 
> Encoders.bean(SampleDataFlat.class));
> fad.show(); //ERROR
> fad.printSchema();
> }
> public static class SampleData implements Serializable {
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public int getId() {
> return id;
> }
> public void setId(int id) {
> this.id = id;
> }
> String str;
> int id;
> }
> public static class SampleDataFlat {
> String str;
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public SampleDataFlat(String str, long id) {
> this.str = str;
> }
> public static Iterator flatMap(SampleData data) {
>  

[jira] [Commented] (SPARK-22019) JavaBean int type property

2017-09-15 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167725#comment-16167725
 ] 

Jen-Ming Chung commented on SPARK-22019:


Hi [~client.test],

The schema inferred after {{sqc.read().json(stringdataset)}} is as follows:
{code}
root
 |-- id: long (nullable = true)
 |-- str: string (nullable = true)
{code}

However, in the POJO class {{SampleData.class}} the member {{id}} is declared as 
{{int}} instead of {{long}}, which causes the subsequent exception in your 
test case.
If you set the type of {{id}} in {{SampleData.class}} to {{long}} and run the 
test case again, you can expect the following results:
{code}
++
| str|
++
|everyone|
|everyone|
|everyone|
|   Hello|
|   Hello|
|   Hello|
++

root
 |-- str: string (nullable = true)
{code}


As you can see, {{id}} is missing from the schema, so we need to add {{id}} and 
the corresponding getter and setter:
{code}
class SampleDataFlat {
long id;
public long getId() {
return id;
}

public void setId(long id) {
this.id = id;
}

public SampleDataFlat(String str, long id) {
this.str = str;
this.id = id;
}
}
{code}

Then you will get the following results:
{code}
+---++
| id| str|
+---++
|  1|everyone|
|  2|everyone|
|  3|everyone|
|  1|   Hello|
|  2|   Hello|
|  3|   Hello|
+---++

root
 |-- id: long (nullable = true)
 |-- str: string (nullable = true)
{code}

> JavaBean int type property 
> ---
>
> Key: SPARK-22019
> URL: https://issues.apache.org/jira/browse/SPARK-22019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> when the type of SampleData's id is int, following code generates errors.
> when long, it's ok.
>  
> {code:java}
> @Test
> public void testDataSet2() {
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\", \"id\": 1}");
> arr.add("{\"str\": \"Hello\", \"id\": 1}");
> //1.read array and change to string dataset.
> JavaRDD data = sc.parallelize(arr);
> Dataset stringdataset = sqc.createDataset(data.rdd(), 
> Encoders.STRING());
> stringdataset.show(); //PASS
> //2. convert string dataset to sampledata dataset
> Dataset df = 
> sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class));
> df.show();//PASS
> df.printSchema();//PASS
> Dataset fad = df.flatMap(SampleDataFlat::flatMap, 
> Encoders.bean(SampleDataFlat.class));
> fad.show(); //ERROR
> fad.printSchema();
> }
> public static class SampleData implements Serializable {
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public int getId() {
> return id;
> }
> public void setId(int id) {
> this.id = id;
> }
> String str;
> int id;
> }
> public static class SampleDataFlat {
> String str;
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public SampleDataFlat(String str, long id) {
> this.str = str;
> }
> public static Iterator flatMap(SampleData data) {
> ArrayList arr = new ArrayList<>();
> arr.add(new SampleDataFlat(data.getStr(), data.getId()));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+1));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+2));
> return arr.iterator();
> }
> }
> {code}
> ==Error message==
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 38, Column 16: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 38, Column 16: No applicable constructor/method found for actual parameters 
> "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)"
> /* 024 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 025 */ InternalRow i = (InternalRow) _i;
> /* 026 */
> /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new 
> SparkUnitTest$SampleData();
> /* 028 */ this.javaBean = value1;
> /* 029 */ if (!false) {
> /* 030 */
> /* 031 */
> /* 032 */   boolean isNull3 = i.isNullAt(0);
> /* 033 */   long value3 = isNull3 ? -1L : (i.getLong(0));
> /* 034 */
> /* 035 */   if (isNull3) {
> /* 036 */ throw new NullPointerException(((java.lang.String) 
> references[0]));
> /* 037 */   }
> /* 038 */   javaBean.setId(value3);




[jira] [Commented] (SPARK-22019) JavaBean int type property

2017-09-15 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167734#comment-16167734
 ] 

Jen-Ming Chung commented on SPARK-22019:


The alternative is to give an explicit schema instead of inferring it, which means 
you don't need to change your POJO class.

{code}
StructType schema = new StructType().add("id", IntegerType).add("str", StringType);
Dataset<SampleData> df = spark.read().schema(schema).json(stringdataset)
    .as(org.apache.spark.sql.Encoders.bean(SampleData.class));
{code}


> JavaBean int type property 
> ---
>
> Key: SPARK-22019
> URL: https://issues.apache.org/jira/browse/SPARK-22019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> when the type of SampleData's id is int, following code generates errors.
> when long, it's ok.
>  
> {code:java}
> @Test
> public void testDataSet2() {
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\", \"id\": 1}");
> arr.add("{\"str\": \"Hello\", \"id\": 1}");
> //1.read array and change to string dataset.
> JavaRDD data = sc.parallelize(arr);
> Dataset stringdataset = sqc.createDataset(data.rdd(), 
> Encoders.STRING());
> stringdataset.show(); //PASS
> //2. convert string dataset to sampledata dataset
> Dataset df = 
> sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class));
> df.show();//PASS
> df.printSchema();//PASS
> Dataset fad = df.flatMap(SampleDataFlat::flatMap, 
> Encoders.bean(SampleDataFlat.class));
> fad.show(); //ERROR
> fad.printSchema();
> }
> public static class SampleData implements Serializable {
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public int getId() {
> return id;
> }
> public void setId(int id) {
> this.id = id;
> }
> String str;
> int id;
> }
> public static class SampleDataFlat {
> String str;
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public SampleDataFlat(String str, long id) {
> this.str = str;
> }
> public static Iterator flatMap(SampleData data) {
> ArrayList arr = new ArrayList<>();
> arr.add(new SampleDataFlat(data.getStr(), data.getId()));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+1));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+2));
> return arr.iterator();
> }
> }
> {code}
> ==Error message==
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 38, Column 16: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 38, Column 16: No applicable constructor/method found for actual parameters 
> "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)"
> /* 024 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 025 */ InternalRow i = (InternalRow) _i;
> /* 026 */
> /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new 
> SparkUnitTest$SampleData();
> /* 028 */ this.javaBean = value1;
> /* 029 */ if (!false) {
> /* 030 */
> /* 031 */
> /* 032 */   boolean isNull3 = i.isNullAt(0);
> /* 033 */   long value3 = isNull3 ? -1L : (i.getLong(0));
> /* 034 */
> /* 035 */   if (isNull3) {
> /* 036 */ throw new NullPointerException(((java.lang.String) 
> references[0]));
> /* 037 */   }
> /* 038 */   javaBean.setId(value3);






[jira] [Comment Edited] (SPARK-22019) JavaBean int type property

2017-09-15 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167734#comment-16167734
 ] 

Jen-Ming Chung edited comment on SPARK-22019 at 9/15/17 11:28 AM:
--

The alternative is to give an explicit schema instead of inferring it, which means 
you don't need to change your POJO class.

{code}
StructType schema = new StructType()
    .add("id", IntegerType)
    .add("str", StringType);
Dataset<SampleData> df = spark.read().schema(schema).json(stringdataset)
    .as(org.apache.spark.sql.Encoders.bean(SampleData.class));
{code}



was (Author: jmchung):
The alternative is giving the explicit schema instead inferring, means you 
don't need to change your pojo class.

{code}
StructType schema = new StructType().add("id", IntegerType).add("str", 
StringType);
Dataset df = spark.read().schema(schema).json(stringdataset).as(
org.apache.spark.sql.Encoders.bean(SampleData.class));
{code}


> JavaBean int type property 
> ---
>
> Key: SPARK-22019
> URL: https://issues.apache.org/jira/browse/SPARK-22019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> when the type of SampleData's id is int, following code generates errors.
> when long, it's ok.
>  
> {code:java}
> @Test
> public void testDataSet2() {
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\", \"id\": 1}");
> arr.add("{\"str\": \"Hello\", \"id\": 1}");
> //1.read array and change to string dataset.
> JavaRDD data = sc.parallelize(arr);
> Dataset stringdataset = sqc.createDataset(data.rdd(), 
> Encoders.STRING());
> stringdataset.show(); //PASS
> //2. convert string dataset to sampledata dataset
> Dataset df = 
> sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class));
> df.show();//PASS
> df.printSchema();//PASS
> Dataset fad = df.flatMap(SampleDataFlat::flatMap, 
> Encoders.bean(SampleDataFlat.class));
> fad.show(); //ERROR
> fad.printSchema();
> }
> public static class SampleData implements Serializable {
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public int getId() {
> return id;
> }
> public void setId(int id) {
> this.id = id;
> }
> String str;
> int id;
> }
> public static class SampleDataFlat {
> String str;
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public SampleDataFlat(String str, long id) {
> this.str = str;
> }
> public static Iterator flatMap(SampleData data) {
> ArrayList arr = new ArrayList<>();
> arr.add(new SampleDataFlat(data.getStr(), data.getId()));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+1));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+2));
> return arr.iterator();
> }
> }
> {code}
> ==Error message==
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 38, Column 16: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 38, Column 16: No applicable constructor/method found for actual parameters 
> "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)"
> /* 024 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 025 */ InternalRow i = (InternalRow) _i;
> /* 026 */
> /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new 
> SparkUnitTest$SampleData();
> /* 028 */ this.javaBean = value1;
> /* 029 */ if (!false) {
> /* 030 */
> /* 031 */
> /* 032 */   boolean isNull3 = i.isNullAt(0);
> /* 033 */   long value3 = isNull3 ? -1L : (i.getLong(0));
> /* 034 */
> /* 035 */   if (isNull3) {
> /* 036 */ throw new NullPointerException(((java.lang.String) 
> references[0]));
> /* 037 */   }
> /* 038 */   javaBean.setId(value3);






[jira] [Commented] (SPARK-22283) withColumn should replace multiple instances with a single one

2017-10-16 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206988#comment-16206988
 ] 

Jen-Ming Chung commented on SPARK-22283:


Hi [~kitbellew],
I found that {{withColumn}} has already been reimplemented in this 
[PR|https://github.com/apache/spark/pull/19229].

> withColumn should replace multiple instances with a single one
> --
>
> Key: SPARK-22283
> URL: https://issues.apache.org/jira/browse/SPARK-22283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Albert Meltzer
>
> Currently, {{withColumn}} claims to do the following: _"adding a column or 
> replacing the existing column that has the same name."_
> Unfortunately, if multiple existing columns have the same name (which is a 
> normal occurrence after a join), this results in multiple replaced -- and 
> retained --
>  columns (with the same value), and messages about an ambiguous column.
> The current implementation of {{withColumn}} contains this:
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val shouldReplace = output.exists(f => resolver(f.name, colName))
> if (shouldReplace) {
>   val columns = output.map { field =>
> if (resolver(field.name, colName)) {
>   col.as(colName)
> } else {
>   Column(field)
> }
>   }
>   select(columns : _*)
> } else {
>   select(Column("*"), col.as(colName))
> }
>   }
> {noformat}
> Instead, suggest something like this (which replaces all matching fields with 
> a single instance of the new one):
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val existing = output.filterNot(f => resolver(f.name, colName)).map(new 
> Column(_))
> select(existing :+ col.as(colName): _*)
>   }
> {noformat}






[jira] [Comment Edited] (SPARK-22283) withColumn should replace multiple instances with a single one

2017-10-16 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206988#comment-16206988
 ] 

Jen-Ming Chung edited comment on SPARK-22283 at 10/17/17 4:45 AM:
--

Hi [~kitbellew], the {{withColumn}} method has already been reimplemented in this 
[PR|https://github.com/apache/spark/pull/19229].


was (Author: jmchung):
Hi [~kitbellew],
I found the `withColumn` has already reimplemented in this 
[PR|https://github.com/apache/spark/pull/19229].

> withColumn should replace multiple instances with a single one
> --
>
> Key: SPARK-22283
> URL: https://issues.apache.org/jira/browse/SPARK-22283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Albert Meltzer
>
> Currently, {{withColumn}} claims to do the following: _"adding a column or 
> replacing the existing column that has the same name."_
> Unfortunately, if multiple existing columns have the same name (which is a 
> normal occurrence after a join), this results in multiple replaced -- and 
> retained --
>  columns (with the same value), and messages about an ambiguous column.
> The current implementation of {{withColumn}} contains this:
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val shouldReplace = output.exists(f => resolver(f.name, colName))
> if (shouldReplace) {
>   val columns = output.map { field =>
> if (resolver(field.name, colName)) {
>   col.as(colName)
> } else {
>   Column(field)
> }
>   }
>   select(columns : _*)
> } else {
>   select(Column("*"), col.as(colName))
> }
>   }
> {noformat}
> Instead, suggest something like this (which replaces all matching fields with 
> a single instance of the new one):
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val existing = output.filterNot(f => resolver(f.name, colName)).map(new 
> Column(_))
> select(existing :+ col.as(colName): _*)
>   }
> {noformat}






[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class

2017-09-12 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164152#comment-16164152
 ] 

Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 5:27 AM:
-

Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema and 
content you expected.

{code:language=Scala}
  case class SampleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SampleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SampleData](v, classOf[SampleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}
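For comparison, a minimal sketch of an alternative that avoids Gson altogether by letting Spark parse the JSON strings itself, assuming the same {{spark}} session and the {{SampleData}} case class above:

{code:language=scala}
import spark.implicits._

val arr = Seq("""{"str": "everyone"}""", """{"str": "Hello"}""")

// Let the JSON reader parse the strings directly, then map to the case class.
val ds = spark.read.json(arr.toDS()).as[SampleData]

ds.printSchema()
ds.show(false)
{code}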


was (Author: jmchung):
Hi [~client.test],
I write the above code in Scala and run in Spark 2.2.0 can show the schema and 
content you expected.

{code:language=Scala}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}

> createDataset and the schema of encoder class
> -
>
> Key: SPARK-21989
> URL: https://issues.apache.org/jira/browse/SPARK-21989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> Hello.
> public class SampleData implements Serializable {
> public String str;
> }
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\"}");
> arr.add("{\"str\": \"Hello\"}");
> JavaRDD data2 = sc.parallelize(arr).map(v -> {return new 
> Gson().fromJson(v, SampleData.class);});
> Dataset df = sqc.createDataset(data2.rdd(), 
> Encoders.bean(SampleData.class));
> df.printSchema();
> expected result of printSchema is str field of sampleData class.
> actual result is following.
> root
> and if i call df.show() it displays like following.
> ++
> ||
> ++
> ||
> ||
> ++
> what i expected is , "hello", "everyone" will be displayed.






[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class

2017-09-13 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164152#comment-16164152
 ] 

Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 6:16 AM:
-

Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema and 
content you expected.

{code:language=Scala}
  case class SampleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SampleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SampleData](v, classOf[SampleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}


was (Author: jmchung):
Hi [~client.test],
I write the above code in Scala and run in Spark 2.2.0 can show the schema and 
content you expected.

{code:language=Scala}
  case class SampleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SampleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SampleData](v, classOf[SampleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}

> createDataset and the schema of encoder class
> -
>
> Key: SPARK-21989
> URL: https://issues.apache.org/jira/browse/SPARK-21989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> Hello.
> public class SampleData implements Serializable {
> public String str;
> }
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\"}");
> arr.add("{\"str\": \"Hello\"}");
> JavaRDD data2 = sc.parallelize(arr).map(v -> {return new 
> Gson().fromJson(v, SampleData.class);});
> Dataset df = sqc.createDataset(data2.rdd(), 
> Encoders.bean(SampleData.class));
> df.printSchema();
> expected result of printSchema is str field of sampleData class.
> actual result is following.
> root
> and if i call df.show() it displays like following.
> ++
> ||
> ++
> ||
> ||
> ++
> what i expected is , "hello", "everyone" will be displayed.






[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class

2017-09-12 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164152#comment-16164152
 ] 

Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 4:58 AM:
-

Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema and 
content you expected.

{code:language=Scala}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}


was (Author: jmchung):
Hi [~client.test],
I write the above code in Scala and run in Spark 2.2.0 can show the schema and 
content you expected.

{code:Scala}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}

> createDataset and the schema of encoder class
> -
>
> Key: SPARK-21989
> URL: https://issues.apache.org/jira/browse/SPARK-21989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> Hello.
> public class SampleData implements Serializable {
> public String str;
> }
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\"}");
> arr.add("{\"str\": \"Hello\"}");
> JavaRDD data2 = sc.parallelize(arr).map(v -> {return new 
> Gson().fromJson(v, SampleData.class);});
> Dataset df = sqc.createDataset(data2.rdd(), 
> Encoders.bean(SampleData.class));
> df.printSchema();
> expected result of printSchema is str field of sampleData class.
> actual result is following.
> root
> and if i call df.show() it displays like following.
> ++
> ||
> ++
> ||
> ||
> ++
> what i expected is , "hello", "everyone" will be displayed.






[jira] [Commented] (SPARK-21989) createDataset and the schema of encoder class

2017-09-12 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164152#comment-16164152
 ] 

Jen-Ming Chung commented on SPARK-21989:


Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema and 
content you expected.

{code:scala}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}

> createDataset and the schema of encoder class
> -
>
> Key: SPARK-21989
> URL: https://issues.apache.org/jira/browse/SPARK-21989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> Hello.
> public class SampleData implements Serializable {
> public String str;
> }
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\"}");
> arr.add("{\"str\": \"Hello\"}");
> JavaRDD data2 = sc.parallelize(arr).map(v -> {return new 
> Gson().fromJson(v, SampleData.class);});
> Dataset df = sqc.createDataset(data2.rdd(), 
> Encoders.bean(SampleData.class));
> df.printSchema();
> expected result of printSchema is str field of sampleData class.
> actual result is following.
> root
> and if i call df.show() it displays like following.
> ++
> ||
> ++
> ||
> ||
> ++
> what i expected is , "hello", "everyone" will be displayed.






[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class

2017-09-12 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164152#comment-16164152
 ] 

Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 4:55 AM:
-

Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema and 
content you expected.

{code:java}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}


was (Author: jmchung):
Hi [~client.test],
I write the above code in scala and run in Spark 2.2.0 can show the schema and 
content you expected.

{code:scala}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}

> createDataset and the schema of encoder class
> -
>
> Key: SPARK-21989
> URL: https://issues.apache.org/jira/browse/SPARK-21989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> Hello.
> public class SampleData implements Serializable {
> public String str;
> }
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\"}");
> arr.add("{\"str\": \"Hello\"}");
> JavaRDD data2 = sc.parallelize(arr).map(v -> {return new 
> Gson().fromJson(v, SampleData.class);});
> Dataset df = sqc.createDataset(data2.rdd(), 
> Encoders.bean(SampleData.class));
> df.printSchema();
> expected result of printSchema is str field of sampleData class.
> actual result is following.
> root
> and if i call df.show() it displays like following.
> ++
> ||
> ++
> ||
> ||
> ++
> what i expected is , "hello", "everyone" will be displayed.






[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class

2017-09-12 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164152#comment-16164152
 ] 

Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 4:56 AM:
-

Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema and 
content you expected.

{code:Scala}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}


was (Author: jmchung):
Hi [~client.test],
I write the above code in scala and run in Spark 2.2.0 can show the schema and 
content you expected.

{code:java}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}

> createDataset and the schema of encoder class
> -
>
> Key: SPARK-21989
> URL: https://issues.apache.org/jira/browse/SPARK-21989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> Hello.
> public class SampleData implements Serializable {
> public String str;
> }
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\"}");
> arr.add("{\"str\": \"Hello\"}");
> JavaRDD data2 = sc.parallelize(arr).map(v -> {return new 
> Gson().fromJson(v, SampleData.class);});
> Dataset df = sqc.createDataset(data2.rdd(), 
> Encoders.bean(SampleData.class));
> df.printSchema();
> expected result of printSchema is str field of sampleData class.
> actual result is following.
> root
> and if i call df.show() it displays like following.
> ++
> ||
> ++
> ||
> ||
> ++
> what i expected is , "hello", "everyone" will be displayed.






[jira] [Commented] (SPARK-21804) json_tuple returns null values within repeated columns except the first one

2017-08-22 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136594#comment-16136594
 ] 

Jen-Ming Chung commented on SPARK-21804:


Submitted a PR at [https://github.com/apache/spark/pull/19017]

> json_tuple returns null values within repeated columns except the first one
> ---
>
> Key: SPARK-21804
> URL: https://issues.apache.org/jira/browse/SPARK-21804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jen-Ming Chung
>Priority: Minor
>  Labels: starter
>
> I was testing json_tuple in extracting values from JSON but I found it could 
> actually return null values within repeated columns except the first one, as 
> below:
> {code:language=scala}
> scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 
> 'a')""").show()
> +---+---+----+
> | c0| c1|  c2|
> +---+---+----+
> |  1|  2|null|
> +---+---+----+
> {code}
> I think this should be consistent with Hive's implementation:
> {code:language=scala}
> hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a');
> ...
> 1    1
> {code}






[jira] [Commented] (SPARK-21804) json_tuple returns null values within repeated columns except the first one

2017-08-22 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136385#comment-16136385
 ] 

Jen-Ming Chung commented on SPARK-21804:


I’m working on this

> json_tuple returns null values within repeated columns except the first one
> ---
>
> Key: SPARK-21804
> URL: https://issues.apache.org/jira/browse/SPARK-21804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jen-Ming Chung
>Priority: Minor
>  Labels: starter
>
> I was testing json_tuple for extracting values from JSON, but I found that it 
> actually returns null values for repeated columns, except for the first one, as 
> shown below:
> {code:language=scala}
> scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 
> 'a')""").show()
> +---+---+----+
> | c0| c1|  c2|
> +---+---+----+
> |  1|  2|null|
> +---+---+----+
> {code}
> I think this should be consistent with Hive's implementation:
> {code:language=scala}
> hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a');
> ...
> 1    1
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21804) json_tuple returns null values within repeated columns except the first one

2017-08-22 Thread Jen-Ming Chung (JIRA)
Jen-Ming Chung created SPARK-21804:
--

 Summary: json_tuple returns null values within repeated columns 
except the first one
 Key: SPARK-21804
 URL: https://issues.apache.org/jira/browse/SPARK-21804
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Jen-Ming Chung
Priority: Minor


I was testing json_tuple for extracting values from JSON, but I found that it 
actually returns null values for repeated columns, except for the first one, as 
shown below:

{code:language=scala}
scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 
'a')""").show()
+---+---+----+
| c0| c1|  c2|
+---+---+----+
|  1|  2|null|
+---+---+----+
{code}

I think this should be consistent with Hive's implementation:
{code:language=scala}
hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a');
...
1    1
{code}
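A possible workaround in the meantime (a sketch, assuming the goal is just to read the same field more than once) is {{get_json_object}}, which evaluates each JSON path independently, so repeated fields are not nulled out:

{code:language=scala}
// Each get_json_object call evaluates its own path, so repeating '$.a' still
// returns the value instead of null.
spark.sql("""SELECT get_json_object('{"a":1, "b":2}', '$.a') AS c0,
                    get_json_object('{"a":1, "b":2}', '$.b') AS c1,
                    get_json_object('{"a":1, "b":2}', '$.a') AS c2""").show()
// +---+---+---+
// | c0| c1| c2|
// +---+---+---+
// |  1|  2|  1|
// +---+---+---+
{code}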




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21684) df.write double escaping all the already escaped characters except the first one

2017-08-25 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141790#comment-16141790
 ] 

Jen-Ming Chung commented on SPARK-21684:


Hi, [~taransaini43]
If you use udf_comma and save with {{.option("escape", "\"")}}, the result will be 
{{ab\,cd\,ef\,gh}}, without double escaping.
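Below is a minimal sketch of what I mean (the column name {{value}} and the output path are only illustrative, and an existing {{spark}} session is assumed); overriding the CSV escape character, whose default is backslash, keeps the backslashes already present in the data from being escaped a second time:

{code:language=scala}
// Assumes an existing SparkSession named `spark`; column name and path are illustrative.
import spark.implicits._

val df = Seq("ab\\,cd\\,ef\\,gh").toDF("value")   // the value contains ab\,cd\,ef\,gh

df.write
  .option("escape", "\"")    // use the quote character as escape instead of the default backslash
  .csv("/tmp/escape-demo")   // hypothetical output path
{code}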

> df.write double escaping all the already escaped characters except the first 
> one
> 
>
> Key: SPARK-21684
> URL: https://issues.apache.org/jira/browse/SPARK-21684
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Taran Saini
> Attachments: SparkQuotesTest2.scala
>
>
> Hi,
> If we have a dataframe with the column value as {noformat} ab\,cd\,ef\,gh 
> {noformat}
> Then while writing it is being written as 
> {noformat} "ab\,cd\\,ef\\,gh" {noformat}
> i.e., it double-escapes all of the already-escaped commas/delimiters except the 
> first one.
> This is odd behaviour; it should either do this for all of them or for none.
> If I set df.option("escape", "") to empty, that solves this problem, but then any 
> double quotes inside the same value are preceded by a special char, i.e. '\u00'. 
> Why does it do so when the escape character is set to "" (empty)?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21684) df.write double escaping all the already escaped characters except the first one

2017-08-25 Thread Jen-Ming Chung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jen-Ming Chung updated SPARK-21684:
---
Comment: was deleted

(was: Hi, [~taransaini43]
If you use udf_comma and save with {{.option("escape", "\"")}}, the result will be 
{{ab\,cd\,ef\,gh}}, without double escaping.)

> df.write double escaping all the already escaped characters except the first 
> one
> 
>
> Key: SPARK-21684
> URL: https://issues.apache.org/jira/browse/SPARK-21684
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Taran Saini
> Attachments: SparkQuotesTest2.scala
>
>
> Hi,
> If we have a dataframe with the column value as {noformat} ab\,cd\,ef\,gh 
> {noformat}
> Then while writing it is being written as 
> {noformat} "ab\,cd\\,ef\\,gh" {noformat}
> i.e., it double-escapes all of the already-escaped commas/delimiters except the 
> first one.
> This is odd behaviour; it should either do this for all of them or for none.
> If I set df.option("escape", "") to empty, that solves this problem, but then any 
> double quotes inside the same value are preceded by a special char, i.e. '\u00'. 
> Why does it do so when the escape character is set to "" (empty)?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error

2017-10-29 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224248#comment-16224248
 ] 

Jen-Ming Chung commented on SPARK-22291:


Thank you all :)

> Postgresql UUID[] to Cassandra: Conversion Error
> 
>
> Key: SPARK-22291
> URL: https://issues.apache.org/jira/browse/SPARK-22291
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, 
> Cassandra 3
>Reporter: Fabio J. Walter
>Assignee: Jen-Ming Chung
>  Labels: patch, postgresql, sql
> Fix For: 2.3.0
>
> Attachments: 
> org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png
>
>
> My job reads data from a PostgreSQL table that contains a user_ids column of 
> uuid[] type, and I'm getting the error shown in the stacktrace below when I try 
> to save the data to Cassandra.
> However, creating this same table on Cassandra works fine, with user_ids as a 
> list.
> I can't change the type on the source table, because I'm reading data from a 
> legacy system.
> I've been looking at the point printed in the log, in the class 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala
> Stacktrace on Spark:
> {noformat}
> Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to 
> [Ljava.lang.String;
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 

[jira] [Issue Comment Deleted] (SPARK-19039) UDF ClosureCleaner bug when UDF, col applied in paste mode in REPL

2017-10-31 Thread Jen-Ming Chung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jen-Ming Chung updated SPARK-19039:
---
Comment: was deleted

(was: It's weird... you will not get error messages if you paste the code 
line by line.

{code}
17/10/31 09:37:42 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at 
http://ip-172-31-9-112.ap-northeast-1.compute.internal:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1509442670084).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = spark.createDataFrame(Seq(
 |   ("hi", 1),
 |   ("there", 2),
 |   ("the", 3),
 |   ("end", 4)
 | )).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: string, b: int]

scala> val myNumbers = Set(1,2,3)
myNumbers: scala.collection.immutable.Set[Int] = Set(1, 2, 3)

scala> val tmpUDF = udf { (n: Int) => myNumbers.contains(n) }
tmpUDF: org.apache.spark.sql.expressions.UserDefinedFunction = 
UserDefinedFunction(<function1>,BooleanType,Some(List(IntegerType)))

scala> val rowHasMyNumber = tmpUDF($"b")
rowHasMyNumber: org.apache.spark.sql.Column = UDF(b)

scala> df.where(rowHasMyNumber).show()
+-----+---+
|    a|  b|
+-----+---+
|   hi|  1|
|there|  2|
|  the|  3|
+-----+---+
{code} )

> UDF ClosureCleaner bug when UDF, col applied in paste mode in REPL
> --
>
> Key: SPARK-19039
> URL: https://issues.apache.org/jira/browse/SPARK-19039
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.3.0
>Reporter: Joseph K. Bradley
>
> When I try this:
> * Define UDF
> * Apply UDF to get Column
> * Use Column in a DataFrame
> I can find weird behavior in the spark-shell when using paste mode.
> To reproduce this, paste this into the spark-shell:
> {code}
> import org.apache.spark.sql.functions._
> val df = spark.createDataFrame(Seq(
>   ("hi", 1),
>   ("there", 2),
>   ("the", 3),
>   ("end", 4)
> )).toDF("a", "b")
> val myNumbers = Set(1,2,3)
> val tmpUDF = udf { (n: Int) => myNumbers.contains(n) }
> val rowHasMyNumber = tmpUDF($"b")
> df.where(rowHasMyNumber).show()
> {code}
> Stack trace for Spark 2.0 (similar for other versions):
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2057)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:817)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:816)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:364)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
>   at 
> 

[jira] [Commented] (SPARK-19039) UDF ClosureCleaner bug when UDF, col applied in paste mode in REPL

2017-10-31 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226528#comment-16226528
 ] 

Jen-Ming Chung commented on SPARK-19039:


It's weird... you will not get error messages if you paste the code line by line.

{code}
17/10/31 09:37:42 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at 
http://ip-172-31-9-112.ap-northeast-1.compute.internal:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1509442670084).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = spark.createDataFrame(Seq(
 |   ("hi", 1),
 |   ("there", 2),
 |   ("the", 3),
 |   ("end", 4)
 | )).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: string, b: int]

scala> val myNumbers = Set(1,2,3)
myNumbers: scala.collection.immutable.Set[Int] = Set(1, 2, 3)

scala> val tmpUDF = udf { (n: Int) => myNumbers.contains(n) }
tmpUDF: org.apache.spark.sql.expressions.UserDefinedFunction = 
UserDefinedFunction(<function1>,BooleanType,Some(List(IntegerType)))

scala> val rowHasMyNumber = tmpUDF($"b")
rowHasMyNumber: org.apache.spark.sql.Column = UDF(b)

scala> df.where(rowHasMyNumber).show()
+-----+---+
|    a|  b|
+-----+---+
|   hi|  1|
|there|  2|
|  the|  3|
+-----+---+
{code} 
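If the goal is only the membership filter, a paste-mode-friendly sketch (a workaround, not a fix for the ClosureCleaner issue) is {{Column.isin}}, which takes literal values, so the local {{Set}} is never captured in a UDF closure:

{code:language=scala}
// Assumes the spark-shell's `spark` session; isin() embeds the values as literals,
// so nothing from the REPL line object has to be serialized into the task.
import spark.implicits._

val df = spark.createDataFrame(Seq(
  ("hi", 1), ("there", 2), ("the", 3), ("end", 4)
)).toDF("a", "b")

val myNumbers = Set(1, 2, 3)
df.where($"b".isin(myNumbers.toSeq: _*)).show()
// +-----+---+
// |    a|  b|
// +-----+---+
// |   hi|  1|
// |there|  2|
// |  the|  3|
// +-----+---+
{code}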

> UDF ClosureCleaner bug when UDF, col applied in paste mode in REPL
> --
>
> Key: SPARK-19039
> URL: https://issues.apache.org/jira/browse/SPARK-19039
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.3.0
>Reporter: Joseph K. Bradley
>
> When I try this:
> * Define UDF
> * Apply UDF to get Column
> * Use Column in a DataFrame
> I can find weird behavior in the spark-shell when using paste mode.
> To reproduce this, paste this into the spark-shell:
> {code}
> import org.apache.spark.sql.functions._
> val df = spark.createDataFrame(Seq(
>   ("hi", 1),
>   ("there", 2),
>   ("the", 3),
>   ("end", 4)
> )).toDF("a", "b")
> val myNumbers = Set(1,2,3)
> val tmpUDF = udf { (n: Int) => myNumbers.contains(n) }
> val rowHasMyNumber = tmpUDF($"b")
> df.where(rowHasMyNumber).show()
> {code}
> Stack trace for Spark 2.0 (similar for other versions):
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2057)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:817)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:816)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:364)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
>   at 
>