[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-15 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651152#comment-16651152
 ] 

Gengliang Wang commented on SPARK-25588:


[~srowen] I tried the test case on branch-2.3, which uses Avro 1.7.7, and it can 
still be reproduced there, so it does not seem related to the upgrade from Avro 
1.7.7 to 1.8.2.


> SchemaParseException: Can't redefine: list when reading from Parquet
> 
>
> Key: SPARK-25588
> URL: https://issues.apache.org/jira/browse/SPARK-25588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark version 2.3.2
>Reporter: Michael Heuer
>Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema, 
> generate Java classes from the Avro schema using the avro-maven-plugin, and 
> generate Scala Products from the Avro schema using our own code generation 
> library.
> In the code path demonstrated by the following unit test, we write out to 
> Parquet and read back in using an RDD of Avro-generated Java classes and then 
> write out to Parquet and read back in using a Dataset of Avro-generated Scala 
> Products.
> {code:scala}
>   sparkTest("transform reads to variant rdd") {
> val reads = sc.loadAlignments(testFile("small.sam"))
> def checkSave(variants: VariantRDD) {
>   val tempPath = tmpLocation(".adam")
>   variants.saveAsParquet(tempPath)
>   assert(sc.loadVariants(tempPath).rdd.count === 20)
> }
> val variants: VariantRDD = reads.transmute[Variant, VariantProduct, 
> VariantRDD](
>   (rdd: RDD[AlignmentRecord]) => {
> rdd.map(AlignmentRecordRDDSuite.varFn)
>   })
> checkSave(variants)
> val sqlContext = SQLContext.getOrCreate(sc)
> import sqlContext.implicits._
> val variantsDs: VariantRDD = reads.transmuteDataset[Variant, 
> VariantProduct, VariantRDD](
>   (ds: Dataset[AlignmentRecordProduct]) => {
> ds.map(r => {
>   VariantProduct.fromAvro(
> AlignmentRecordRDDSuite.varFn(r.toAvro))
> })
>   })
> checkSave(variantsDs)
> }
> {code}
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540
> Note that the schemas in Parquet are different:
> RDD code path
> {noformat}
> $ parquet-tools schema 
> /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
> message org.bdgenomics.formats.avro.Variant {
>   optional binary contigName (UTF8);
>   optional int64 start;
>   optional int64 end;
>   required group names (LIST) {
> repeated binary array (UTF8);
>   }
>   optional boolean splitFromMultiAllelic;
>   optional binary referenceAllele (UTF8);
>   optional binary alternateAllele (UTF8);
>   optional double quality;
>   optional boolean filtersApplied;
>   optional boolean filtersPassed;
>   required group filtersFailed (LIST) {
> repeated binary array (UTF8);
>   }
>   optional group annotation {
> optional binary ancestralAllele (UTF8);
> optional int32 alleleCount;
> optional int32 readDepth;
> optional int32 forwardReadDepth;
> optional int32 reverseReadDepth;
> optional int32 referenceReadDepth;
> optional int32 referenceForwardReadDepth;
> optional int32 referenceReverseReadDepth;
> optional float alleleFrequency;
> optional binary cigar (UTF8);
> optional boolean dbSnp;
> optional boolean hapMap2;
> optional boolean hapMap3;
> optional boolean validated;
> optional boolean thousandGenomes;
> optional boolean somatic;
> required group transcriptEffects (LIST) {
>   repeated group array {
> optional binary alternateAllele (UTF8);
> required group effects (LIST) {
>   repeated binary array (UTF8);
> }
> optional binary geneName (UTF8);
> optional binary geneId (UTF8);
> optional binary featureType (UTF8);
> optional binary featureId (UTF8);
> optional binary biotype (UTF8);
> optional int32 rank;
> optional int32 total;
> optional binary genomicHgvs (UTF8);
> optional binary transcriptHgvs (UTF8);
> optional binary proteinHgvs (UTF8);
> optional int32 cdnaPosition;
> optional int32 cdnaLength;
> optional int32 cdsPosition;
> optional int32 cdsLength;
> optional int32 proteinPosition;
> optional int32 proteinLength;
> optional int32 distance;
> required group messages (LIST) {
>   repeated binary array (ENUM);
> }
>   }
> }
> required group attributes (MAP) {
>   repeated group map (MAP_KEY_VALUE) {
> required binary key (UTF8);
>   

[jira] [Commented] (SPARK-25071) BuildSide is coming not as expected with join queries

2018-10-15 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651113#comment-16651113
 ] 

Sujith commented on SPARK-25071:


cc [~ZenWzh] Please advise, as I would like to take up this JIRA.

> BuildSide is coming not as expected with join queries
> -
>
> Key: SPARK-25071
> URL: https://issues.apache.org/jira/browse/SPARK-25071
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 
> Hadoop 2.7.3
>Reporter: Ayush Anubhava
>Priority: Major
>
> *BuildSide is not chosen as expected.*
> Pre-requisites:
> *CBO is set to true (spark.sql.cbo.enabled=true) and spark.sql.cbo.joinReorder.enabled=true.*
> *import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec*
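> A minimal sketch of the configuration these steps assume (standard CBO config 
> keys; the statistics themselves come from the TBLPROPERTIES below):
> {code:scala}
> spark.conf.set("spark.sql.cbo.enabled", "true")
> spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
> {code}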
> *Steps:*
> *Scenario 1:*
> spark.sql("CREATE TABLE small3 (c1 bigint) TBLPROPERTIES ('numRows'='2', 
> 'rawDataSize'='600','totalSize'='800')")
>  spark.sql("CREATE TABLE big3 (c1 bigint) TBLPROPERTIES ('numRows'='2', 
> 'rawDataSize'='6000', 'totalSize'='800')")
>  val plan = spark.sql("select * from small3 t1 join big3 t2 on (t1.c1 = 
> t2.c1)").queryExecution.executedPlan
>  val buildSide = 
> plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide
>  println(buildSide)
>  
> *Result 1:*
> scala> val plan = spark.sql("select * from small3 t1 join big3 t2 on (t1.c1 = 
> t2.c1)").queryExecution.executedPlan
>  plan: org.apache.spark.sql.execution.SparkPlan =
>  *(2) BroadcastHashJoin [c1#0L|#0L], [c1#1L|#1L], Inner, BuildRight
>  :- *(2) Filter isnotnull(c1#0L)
>  : +- HiveTableScan [c1#0L|#0L], HiveTableRelation `default`.`small3`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#0L|#0L]
>  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]))
>  +- *(1) Filter isnotnull(c1#1L)
>  +- HiveTableScan [c1#1L|#1L], HiveTableRelation `default`.`big3`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1L|#1L]
> scala> val buildSide = 
> plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide
>  buildSide: org.apache.spark.sql.execution.joins.BuildSide = BuildRight
> scala> println(buildSide)
>  *BuildRight*
>  
> *Scenario 2:*
> spark.sql("CREATE TABLE small4 (c1 bigint) TBLPROPERTIES ('numRows'='2', 
> 'rawDataSize'='600','totalSize'='80')")
>  spark.sql("CREATE TABLE big4 (c1 bigint) TBLPROPERTIES ('numRows'='2', 
> 'rawDataSize'='6000', 'totalSize'='800')")
>  val plan = spark.sql("select * from small4 t1 join big4 t2 on (t1.c1 = 
> t2.c1)").queryExecution.executedPlan
>  val buildSide = 
> plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide
>  println(buildSide)
> *Result 2:*
> scala> val plan = spark.sql("select * from small4 t1 join big4 t2 on (t1.c1 = 
> t2.c1)").queryExecution.executedPlan
>  plan: org.apache.spark.sql.execution.SparkPlan =
>  *(2) BroadcastHashJoin [c1#4L|#4L], [c1#5L|#5L], Inner, BuildRight
>  :- *(2) Filter isnotnull(c1#4L)
>  : +- HiveTableScan [c1#4L|#4L], HiveTableRelation `default`.`small4`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#4L|#4L]
>  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]))
>  +- *(1) Filter isnotnull(c1#5L)
>  +- HiveTableScan [c1#5L|#5L], HiveTableRelation `default`.`big4`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5L|#5L]
> scala> val buildSide = 
> plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide
>  buildSide: org.apache.spark.sql.execution.joins.BuildSide = *BuildRight*
>  
>  






[jira] [Resolved] (SPARK-25629) ParquetFilterSuite: filter pushdown - decimal 16 sec

2018-10-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25629.
--
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 3.0.0

Fixed in https://github.com/apache/spark/pull/22636

> ParquetFilterSuite: filter pushdown - decimal 16 sec
> 
>
> Key: SPARK-25629
> URL: https://issues.apache.org/jira/browse/SPARK-25629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilterSuite.filter 
> pushdown - decimal
> Took 16 sec.






[jira] [Commented] (SPARK-22601) Data load is getting displayed successful on providing non existing hdfs file path

2018-10-15 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651097#comment-16651097
 ] 

Sujith commented on SPARK-22601:


cc [gatorsmile|https://github.com/gatorsmile] Please assign this JIRA to me, as 
the corresponding PR has already been merged. Thanks.

> Data load is getting displayed successful on providing non existing hdfs file 
> path
> --
>
> Key: SPARK-22601
> URL: https://issues.apache.org/jira/browse/SPARK-22601
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sujith
>Priority: Minor
> Fix For: 2.2.1
>
>
> Data load is reported as successful when a non-existent HDFS file path is 
> provided, whereas a proper error message is displayed for a local path.
> create table tb2 (a string, b int);
>  load data inpath 'hdfs://hacluster/data1.csv' into table tb2
> Note: data1.csv does not exist in HDFS.
> When a non-existent local file path is given, the error message
> "LOAD DATA input path does not exist" is displayed. Attached are snapshots of 
> the behaviour in Spark 2.1 and Spark 2.2.
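> A sketch of the two cases side by side (the local path is hypothetical):
> {code:scala}
> spark.sql("LOAD DATA INPATH 'hdfs://hacluster/data1.csv' INTO TABLE tb2")   // reported successful although the file is missing
> spark.sql("LOAD DATA LOCAL INPATH '/tmp/data1.csv' INTO TABLE tb2")         // fails: "LOAD DATA input path does not exist"
> {code}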






[jira] [Resolved] (SPARK-22601) Data load is getting displayed successful on providing non existing hdfs file path

2018-10-15 Thread Sujith (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith resolved SPARK-22601.

   Resolution: Fixed
Fix Version/s: 2.2.1

PR already merged.

> Data load is getting displayed successful on providing non existing hdfs file 
> path
> --
>
> Key: SPARK-22601
> URL: https://issues.apache.org/jira/browse/SPARK-22601
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sujith
>Priority: Minor
> Fix For: 2.2.1
>
>
> Data load is reported as successful when a non-existent HDFS file path is 
> provided, whereas a proper error message is displayed for a local path.
> create table tb2 (a string, b int);
>  load data inpath 'hdfs://hacluster/data1.csv' into table tb2
> Note: data1.csv does not exist in HDFS.
> When a non-existent local file path is given, the error message
> "LOAD DATA input path does not exist" is displayed. Attached are snapshots of 
> the behaviour in Spark 2.1 and Spark 2.2.






[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651081#comment-16651081
 ] 

Apache Spark commented on SPARK-25588:
--

User 'heuermh' has created a pull request for this issue:
https://github.com/apache/spark/pull/22742

> SchemaParseException: Can't redefine: list when reading from Parquet
> 
>
> Key: SPARK-25588
> URL: https://issues.apache.org/jira/browse/SPARK-25588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark version 2.3.2
>Reporter: Michael Heuer
>Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema, 
> generate Java classes from the Avro schema using the avro-maven-plugin, and 
> generate Scala Products from the Avro schema using our own code generation 
> library.
> In the code path demonstrated by the following unit test, we write out to 
> Parquet and read back in using an RDD of Avro-generated Java classes and then 
> write out to Parquet and read back in using a Dataset of Avro-generated Scala 
> Products.
> {code:scala}
>   sparkTest("transform reads to variant rdd") {
> val reads = sc.loadAlignments(testFile("small.sam"))
> def checkSave(variants: VariantRDD) {
>   val tempPath = tmpLocation(".adam")
>   variants.saveAsParquet(tempPath)
>   assert(sc.loadVariants(tempPath).rdd.count === 20)
> }
> val variants: VariantRDD = reads.transmute[Variant, VariantProduct, 
> VariantRDD](
>   (rdd: RDD[AlignmentRecord]) => {
> rdd.map(AlignmentRecordRDDSuite.varFn)
>   })
> checkSave(variants)
> val sqlContext = SQLContext.getOrCreate(sc)
> import sqlContext.implicits._
> val variantsDs: VariantRDD = reads.transmuteDataset[Variant, 
> VariantProduct, VariantRDD](
>   (ds: Dataset[AlignmentRecordProduct]) => {
> ds.map(r => {
>   VariantProduct.fromAvro(
> AlignmentRecordRDDSuite.varFn(r.toAvro))
> })
>   })
> checkSave(variantsDs)
> }
> {code}
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540
> Note that the schemas in Parquet are different:
> RDD code path
> {noformat}
> $ parquet-tools schema 
> /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
> message org.bdgenomics.formats.avro.Variant {
>   optional binary contigName (UTF8);
>   optional int64 start;
>   optional int64 end;
>   required group names (LIST) {
> repeated binary array (UTF8);
>   }
>   optional boolean splitFromMultiAllelic;
>   optional binary referenceAllele (UTF8);
>   optional binary alternateAllele (UTF8);
>   optional double quality;
>   optional boolean filtersApplied;
>   optional boolean filtersPassed;
>   required group filtersFailed (LIST) {
> repeated binary array (UTF8);
>   }
>   optional group annotation {
> optional binary ancestralAllele (UTF8);
> optional int32 alleleCount;
> optional int32 readDepth;
> optional int32 forwardReadDepth;
> optional int32 reverseReadDepth;
> optional int32 referenceReadDepth;
> optional int32 referenceForwardReadDepth;
> optional int32 referenceReverseReadDepth;
> optional float alleleFrequency;
> optional binary cigar (UTF8);
> optional boolean dbSnp;
> optional boolean hapMap2;
> optional boolean hapMap3;
> optional boolean validated;
> optional boolean thousandGenomes;
> optional boolean somatic;
> required group transcriptEffects (LIST) {
>   repeated group array {
> optional binary alternateAllele (UTF8);
> required group effects (LIST) {
>   repeated binary array (UTF8);
> }
> optional binary geneName (UTF8);
> optional binary geneId (UTF8);
> optional binary featureType (UTF8);
> optional binary featureId (UTF8);
> optional binary biotype (UTF8);
> optional int32 rank;
> optional int32 total;
> optional binary genomicHgvs (UTF8);
> optional binary transcriptHgvs (UTF8);
> optional binary proteinHgvs (UTF8);
> optional int32 cdnaPosition;
> optional int32 cdnaLength;
> optional int32 cdsPosition;
> optional int32 cdsLength;
> optional int32 proteinPosition;
> optional int32 proteinLength;
> optional int32 distance;
> required group messages (LIST) {
>   repeated binary array (ENUM);
> }
>   }
> }
> required group attributes (MAP) {
>   repeated group map (MAP_KEY_VALUE) {
> required binary key (UTF8);
> required binary value (UTF8);
>   }
> }
>   

[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-15 Thread Michael Heuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651080#comment-16651080
 ] 

Michael Heuer commented on SPARK-25588:
---

Created pull request [https://github.com/apache/spark/pull/22742] with failing 
unit test that demonstrates this issue.

> SchemaParseException: Can't redefine: list when reading from Parquet
> 
>
> Key: SPARK-25588
> URL: https://issues.apache.org/jira/browse/SPARK-25588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark version 2.3.2
>Reporter: Michael Heuer
>Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema, 
> generate Java classes from the Avro schema using the avro-maven-plugin, and 
> generate Scala Products from the Avro schema using our own code generation 
> library.
> In the code path demonstrated by the following unit test, we write out to 
> Parquet and read back in using an RDD of Avro-generated Java classes and then 
> write out to Parquet and read back in using a Dataset of Avro-generated Scala 
> Products.
> {code:scala}
>   sparkTest("transform reads to variant rdd") {
> val reads = sc.loadAlignments(testFile("small.sam"))
> def checkSave(variants: VariantRDD) {
>   val tempPath = tmpLocation(".adam")
>   variants.saveAsParquet(tempPath)
>   assert(sc.loadVariants(tempPath).rdd.count === 20)
> }
> val variants: VariantRDD = reads.transmute[Variant, VariantProduct, 
> VariantRDD](
>   (rdd: RDD[AlignmentRecord]) => {
> rdd.map(AlignmentRecordRDDSuite.varFn)
>   })
> checkSave(variants)
> val sqlContext = SQLContext.getOrCreate(sc)
> import sqlContext.implicits._
> val variantsDs: VariantRDD = reads.transmuteDataset[Variant, 
> VariantProduct, VariantRDD](
>   (ds: Dataset[AlignmentRecordProduct]) => {
> ds.map(r => {
>   VariantProduct.fromAvro(
> AlignmentRecordRDDSuite.varFn(r.toAvro))
> })
>   })
> checkSave(variantsDs)
> }
> {code}
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540
> Note that the schemas in Parquet are different:
> RDD code path
> {noformat}
> $ parquet-tools schema 
> /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
> message org.bdgenomics.formats.avro.Variant {
>   optional binary contigName (UTF8);
>   optional int64 start;
>   optional int64 end;
>   required group names (LIST) {
> repeated binary array (UTF8);
>   }
>   optional boolean splitFromMultiAllelic;
>   optional binary referenceAllele (UTF8);
>   optional binary alternateAllele (UTF8);
>   optional double quality;
>   optional boolean filtersApplied;
>   optional boolean filtersPassed;
>   required group filtersFailed (LIST) {
> repeated binary array (UTF8);
>   }
>   optional group annotation {
> optional binary ancestralAllele (UTF8);
> optional int32 alleleCount;
> optional int32 readDepth;
> optional int32 forwardReadDepth;
> optional int32 reverseReadDepth;
> optional int32 referenceReadDepth;
> optional int32 referenceForwardReadDepth;
> optional int32 referenceReverseReadDepth;
> optional float alleleFrequency;
> optional binary cigar (UTF8);
> optional boolean dbSnp;
> optional boolean hapMap2;
> optional boolean hapMap3;
> optional boolean validated;
> optional boolean thousandGenomes;
> optional boolean somatic;
> required group transcriptEffects (LIST) {
>   repeated group array {
> optional binary alternateAllele (UTF8);
> required group effects (LIST) {
>   repeated binary array (UTF8);
> }
> optional binary geneName (UTF8);
> optional binary geneId (UTF8);
> optional binary featureType (UTF8);
> optional binary featureId (UTF8);
> optional binary biotype (UTF8);
> optional int32 rank;
> optional int32 total;
> optional binary genomicHgvs (UTF8);
> optional binary transcriptHgvs (UTF8);
> optional binary proteinHgvs (UTF8);
> optional int32 cdnaPosition;
> optional int32 cdnaLength;
> optional int32 cdsPosition;
> optional int32 cdsLength;
> optional int32 proteinPosition;
> optional int32 proteinLength;
> optional int32 distance;
> required group messages (LIST) {
>   repeated binary array (ENUM);
> }
>   }
> }
> required group attributes (MAP) {
>   repeated group map (MAP_KEY_VALUE) {
> required binary key (UTF8);
> required binary value (UTF8);

[jira] [Assigned] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25588:


Assignee: (was: Apache Spark)

> SchemaParseException: Can't redefine: list when reading from Parquet
> 
>
> Key: SPARK-25588
> URL: https://issues.apache.org/jira/browse/SPARK-25588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark version 2.3.2
>Reporter: Michael Heuer
>Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema, 
> generate Java classes from the Avro schema using the avro-maven-plugin, and 
> generate Scala Products from the Avro schema using our own code generation 
> library.
> In the code path demonstrated by the following unit test, we write out to 
> Parquet and read back in using an RDD of Avro-generated Java classes and then 
> write out to Parquet and read back in using a Dataset of Avro-generated Scala 
> Products.
> {code:scala}
>   sparkTest("transform reads to variant rdd") {
> val reads = sc.loadAlignments(testFile("small.sam"))
> def checkSave(variants: VariantRDD) {
>   val tempPath = tmpLocation(".adam")
>   variants.saveAsParquet(tempPath)
>   assert(sc.loadVariants(tempPath).rdd.count === 20)
> }
> val variants: VariantRDD = reads.transmute[Variant, VariantProduct, 
> VariantRDD](
>   (rdd: RDD[AlignmentRecord]) => {
> rdd.map(AlignmentRecordRDDSuite.varFn)
>   })
> checkSave(variants)
> val sqlContext = SQLContext.getOrCreate(sc)
> import sqlContext.implicits._
> val variantsDs: VariantRDD = reads.transmuteDataset[Variant, 
> VariantProduct, VariantRDD](
>   (ds: Dataset[AlignmentRecordProduct]) => {
> ds.map(r => {
>   VariantProduct.fromAvro(
> AlignmentRecordRDDSuite.varFn(r.toAvro))
> })
>   })
> checkSave(variantsDs)
> }
> {code}
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540
> Note that the schemas in Parquet are different:
> RDD code path
> {noformat}
> $ parquet-tools schema 
> /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
> message org.bdgenomics.formats.avro.Variant {
>   optional binary contigName (UTF8);
>   optional int64 start;
>   optional int64 end;
>   required group names (LIST) {
> repeated binary array (UTF8);
>   }
>   optional boolean splitFromMultiAllelic;
>   optional binary referenceAllele (UTF8);
>   optional binary alternateAllele (UTF8);
>   optional double quality;
>   optional boolean filtersApplied;
>   optional boolean filtersPassed;
>   required group filtersFailed (LIST) {
> repeated binary array (UTF8);
>   }
>   optional group annotation {
> optional binary ancestralAllele (UTF8);
> optional int32 alleleCount;
> optional int32 readDepth;
> optional int32 forwardReadDepth;
> optional int32 reverseReadDepth;
> optional int32 referenceReadDepth;
> optional int32 referenceForwardReadDepth;
> optional int32 referenceReverseReadDepth;
> optional float alleleFrequency;
> optional binary cigar (UTF8);
> optional boolean dbSnp;
> optional boolean hapMap2;
> optional boolean hapMap3;
> optional boolean validated;
> optional boolean thousandGenomes;
> optional boolean somatic;
> required group transcriptEffects (LIST) {
>   repeated group array {
> optional binary alternateAllele (UTF8);
> required group effects (LIST) {
>   repeated binary array (UTF8);
> }
> optional binary geneName (UTF8);
> optional binary geneId (UTF8);
> optional binary featureType (UTF8);
> optional binary featureId (UTF8);
> optional binary biotype (UTF8);
> optional int32 rank;
> optional int32 total;
> optional binary genomicHgvs (UTF8);
> optional binary transcriptHgvs (UTF8);
> optional binary proteinHgvs (UTF8);
> optional int32 cdnaPosition;
> optional int32 cdnaLength;
> optional int32 cdsPosition;
> optional int32 cdsLength;
> optional int32 proteinPosition;
> optional int32 proteinLength;
> optional int32 distance;
> required group messages (LIST) {
>   repeated binary array (ENUM);
> }
>   }
> }
> required group attributes (MAP) {
>   repeated group map (MAP_KEY_VALUE) {
> required binary key (UTF8);
> required binary value (UTF8);
>   }
> }
>   }
> }
> {noformat}
> Dataset code path:
> {noformat}
> $ parquet-tools schema 
> 

[jira] [Assigned] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25588:


Assignee: Apache Spark

> SchemaParseException: Can't redefine: list when reading from Parquet
> 
>
> Key: SPARK-25588
> URL: https://issues.apache.org/jira/browse/SPARK-25588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark version 2.3.2
>Reporter: Michael Heuer
>Assignee: Apache Spark
>Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema, 
> generate Java classes from the Avro schema using the avro-maven-plugin, and 
> generate Scala Products from the Avro schema using our own code generation 
> library.
> In the code path demonstrated by the following unit test, we write out to 
> Parquet and read back in using an RDD of Avro-generated Java classes and then 
> write out to Parquet and read back in using a Dataset of Avro-generated Scala 
> Products.
> {code:scala}
>   sparkTest("transform reads to variant rdd") {
> val reads = sc.loadAlignments(testFile("small.sam"))
> def checkSave(variants: VariantRDD) {
>   val tempPath = tmpLocation(".adam")
>   variants.saveAsParquet(tempPath)
>   assert(sc.loadVariants(tempPath).rdd.count === 20)
> }
> val variants: VariantRDD = reads.transmute[Variant, VariantProduct, 
> VariantRDD](
>   (rdd: RDD[AlignmentRecord]) => {
> rdd.map(AlignmentRecordRDDSuite.varFn)
>   })
> checkSave(variants)
> val sqlContext = SQLContext.getOrCreate(sc)
> import sqlContext.implicits._
> val variantsDs: VariantRDD = reads.transmuteDataset[Variant, 
> VariantProduct, VariantRDD](
>   (ds: Dataset[AlignmentRecordProduct]) => {
> ds.map(r => {
>   VariantProduct.fromAvro(
> AlignmentRecordRDDSuite.varFn(r.toAvro))
> })
>   })
> checkSave(variantsDs)
> }
> {code}
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540
> Note that the schemas in Parquet are different:
> RDD code path
> {noformat}
> $ parquet-tools schema 
> /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
> message org.bdgenomics.formats.avro.Variant {
>   optional binary contigName (UTF8);
>   optional int64 start;
>   optional int64 end;
>   required group names (LIST) {
> repeated binary array (UTF8);
>   }
>   optional boolean splitFromMultiAllelic;
>   optional binary referenceAllele (UTF8);
>   optional binary alternateAllele (UTF8);
>   optional double quality;
>   optional boolean filtersApplied;
>   optional boolean filtersPassed;
>   required group filtersFailed (LIST) {
> repeated binary array (UTF8);
>   }
>   optional group annotation {
> optional binary ancestralAllele (UTF8);
> optional int32 alleleCount;
> optional int32 readDepth;
> optional int32 forwardReadDepth;
> optional int32 reverseReadDepth;
> optional int32 referenceReadDepth;
> optional int32 referenceForwardReadDepth;
> optional int32 referenceReverseReadDepth;
> optional float alleleFrequency;
> optional binary cigar (UTF8);
> optional boolean dbSnp;
> optional boolean hapMap2;
> optional boolean hapMap3;
> optional boolean validated;
> optional boolean thousandGenomes;
> optional boolean somatic;
> required group transcriptEffects (LIST) {
>   repeated group array {
> optional binary alternateAllele (UTF8);
> required group effects (LIST) {
>   repeated binary array (UTF8);
> }
> optional binary geneName (UTF8);
> optional binary geneId (UTF8);
> optional binary featureType (UTF8);
> optional binary featureId (UTF8);
> optional binary biotype (UTF8);
> optional int32 rank;
> optional int32 total;
> optional binary genomicHgvs (UTF8);
> optional binary transcriptHgvs (UTF8);
> optional binary proteinHgvs (UTF8);
> optional int32 cdnaPosition;
> optional int32 cdnaLength;
> optional int32 cdsPosition;
> optional int32 cdsLength;
> optional int32 proteinPosition;
> optional int32 proteinLength;
> optional int32 distance;
> required group messages (LIST) {
>   repeated binary array (ENUM);
> }
>   }
> }
> required group attributes (MAP) {
>   repeated group map (MAP_KEY_VALUE) {
> required binary key (UTF8);
> required binary value (UTF8);
>   }
> }
>   }
> }
> {noformat}
> Dataset code path:
> {noformat}
> $ parquet-tools 

[jira] [Updated] (SPARK-25714) Null Handling in the Optimizer rule BooleanSimplification

2018-10-15 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25714:

Fix Version/s: 2.3.3
   2.2.3

> Null Handling in the Optimizer rule BooleanSimplification
> -
>
> Key: SPARK-25714
> URL: https://issues.apache.org/jira/browse/SPARK-25714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.2.3, 2.3.3, 2.4.0
>
>
> {code}
> scala> val df = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
> df: org.apache.spark.sql.DataFrame = [col1: string, col2: int]
> scala> df.write.mode("overwrite").parquet("/tmp/test1")
>   
>   
> scala> val df2 = spark.read.parquet("/tmp/test1");
> df2: org.apache.spark.sql.DataFrame = [col1: string, col2: int]
> scala> df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
> +++
> |col1|col2|
> +++
> | abc|   1|
> |null|   3|
> +++
> {code}
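> A worked sketch of the standard SQL three-valued evaluation for the (null, 3) 
> row, which is why that row should not survive the filter:
> {noformat}
> col1 = 'abc'           -> NULL
> col1 != 'abc'          -> NULL
> NULL AND (col2 = 3)    -> NULL
> NULL OR NULL           -> NULL  => row should be filtered out under correct null handling
> {noformat}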






[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-10-15 Thread Jackey Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651061#comment-16651061
 ] 

Jackey Lee commented on SPARK-24630:


[~shijinkui]

Without the stream keyword, SQLStreaming cannot support non-action queries in 
pure SQL, such as 'select * from kafka_sql_test'. The user must define the sink 
with the Scala/Python/Java API.
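
A minimal sketch of what that means today (the broker address and topic name are 
assumed): a streaming Kafka source registered as a view can be queried with 
spark.sql, but the sink still has to be attached through the API, e.g. writeStream:

{code:scala}
// Assumed broker address and topic; illustrative only.
val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "kafka_sql_test")
  .load()
kafka.createOrReplaceTempView("kafka_sql_test")

// 'select * from kafka_sql_test' alone has no sink; one must be defined via the API:
spark.sql("select * from kafka_sql_test")
  .writeStream
  .format("console")
  .start()
{code}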

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface with which users 
> with little knowledge of streaming can easily develop a stream-processing 
> model. In Spark, we can also support a SQL API based on Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 






[jira] [Updated] (SPARK-25723) spark sql External DataSource question

2018-10-15 Thread huanghuai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huanghuai updated SPARK-25723:
--
Description: 
{code:java}
public class MyDatasourceRelation extends BaseRelation implements 
PrunedFilteredScan {
Map parameters;
SparkSession sparkSession;
CombinedReportHelper helper;

public MyDatasourceRelation() {

}

public MyDatasourceRelation (SQLContext sqlContext,Map 
parameters) {
this.parameters = parameters;
this.sparkSession = sqlContext.sparkSession();
this.helper = new CombinedReportHelper(parameters); //don't care 
this.helper.setRowsPerPage(1);
}


@Override
public SQLContext sqlContext() {
return this.sparkSession.sqlContext();
}

@Override
public StructType schema() {
StructType structType = transformSchema(helper.getFields(), helper.getFirst());
//helper.close();
System.out.println("get schema: "+structType);
return structType;
}



@Override
public RDD buildScan(String[] requiredColumns, Filter[] filters) {
System.out.println("build scan:");
int totalRow = helper.getTotalRow();
Partition[] partitions = getPartitions(totalRow, parameters);
System.out.println("get partition:"+partitions.length+" total row:"+totalRow);
return new SmartbixDatasourceRDD(sparkSession.sparkContext(), partitions, 
parameters);
}


private Partition[] getPartitions(int totalRow, Map parameters) 
{
int step = 100;
int numOfPartitions = (totalRow + step - 1) / step;

Partition[] partitions = new Partition[numOfPartitions];


for (int i = 0; i < numOfPartitions; i++) {
int start = i * step + 1;
partitions[i] = new MyPartition(null, i, start, start + step);
}
return partitions;

}
}
{code}
 

 

 

---- above is my code; some irrelevant details have been removed ----

trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}

If I implement this trait, I find that the requiredColumns parameter is different 
every time. Why is the order different?

You can use spark.read.jdbc to connect to your local MySQL DB and debug at

org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130)

to see this parameter.

The attachment is my screenshot.
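
A minimal illustrative sketch (the helper name is hypothetical) of resolving 
requiredColumns by name rather than by position, since the ordering of that array 
is not something an implementation should rely on:

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Project full rows down to the columns Spark asked for, in the order it asked
// for them, by looking each name up in the relation's own schema.
def projectByName(fullSchema: StructType, requiredColumns: Array[String], rows: RDD[Row]): RDD[Row] = {
  val indexByName = fullSchema.fieldNames.zipWithIndex.toMap
  val positions = requiredColumns.map(indexByName)
  rows.map(row => Row.fromSeq(positions.toSeq.map(row.get)))
}
{code}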

  was:
{code:java}
public class MyDatasourceRelation extends BaseRelation implements 
PrunedFilteredScan {
Map parameters;
SparkSession sparkSession;
CombinedReportHelper helper;

public MyDatasourceRelation() {

}

public SmartbixDatasourceRelation(SQLContext sqlContext,Map 
parameters) {
this.parameters = parameters;
this.sparkSession = sqlContext.sparkSession();
this.helper = new CombinedReportHelper(parameters); //don't care 
this.helper.setRowsPerPage(1);
}


@Override
public SQLContext sqlContext() {
return this.sparkSession.sqlContext();
}

@Override
public StructType schema() {
StructType structType = transformSchema(helper.getFields(), helper.getFirst());
//helper.close();
System.out.println("get schema: "+structType);
return structType;
}



@Override
public RDD buildScan(String[] requiredColumns, Filter[] filters) {
System.out.println("build scan:");
int totalRow = helper.getTotalRow();
Partition[] partitions = getPartitions(totalRow, parameters);
System.out.println("get partition:"+partitions.length+" total row:"+totalRow);
return new SmartbixDatasourceRDD(sparkSession.sparkContext(), partitions, 
parameters);
}


private Partition[] getPartitions(int totalRow, Map parameters) 
{
int step = 100;
int numOfPartitions = (totalRow + step - 1) / step;

Partition[] partitions = new Partition[numOfPartitions];


for (int i = 0; i < numOfPartitions; i++) {
int start = i * step + 1;
partitions[i] = new MyPartition(null, i, start, start + step);
}
return partitions;

}
}
{code}
 

 

 

-- above is my code,some useless information are removed 
---

 

 

trait PrunedFilteredScan

{ def buildScan(requiredColumns: Array[String], filters: Array[Filter]): 
RDD[Row] }

 

if i implement this trait, i find requiredColumns param is different 
everytime,Why are the order different

{color:#FF}you can use spark.read.jdbc  and connect to your local mysql DB, 
and debug at {color}

{color:#FF}org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130){color}

to show this param;

attachement is my screenshot 


> spark sql External DataSource question
> --
>
> Key: SPARK-25723
> URL: https://issues.apache.org/jira/browse/SPARK-25723
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: local mode
>Reporter: huanghuai
>Priority: Minor
> Attachments: QQ图片20181015182502.jpg
>
>
> {code:java}
> public class MyDatasourceRelation extends BaseRelation implements 
> PrunedFilteredScan {
> Map parameters;
> SparkSession 

[jira] [Updated] (SPARK-25723) spark sql External DataSource question

2018-10-15 Thread huanghuai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huanghuai updated SPARK-25723:
--
Description: 
{code:java}
public class MyDatasourceRelation extends BaseRelation implements 
PrunedFilteredScan {
Map parameters;
SparkSession sparkSession;
CombinedReportHelper helper;

public MyDatasourceRelation() {

}

public SmartbixDatasourceRelation(SQLContext sqlContext,Map 
parameters) {
this.parameters = parameters;
this.sparkSession = sqlContext.sparkSession();
this.helper = new CombinedReportHelper(parameters); //don't care 
this.helper.setRowsPerPage(1);
}


@Override
public SQLContext sqlContext() {
return this.sparkSession.sqlContext();
}

@Override
public StructType schema() {
StructType structType = transformSchema(helper.getFields(), helper.getFirst());
//helper.close();
System.out.println("get schema: "+structType);
return structType;
}



@Override
public RDD buildScan(String[] requiredColumns, Filter[] filters) {
System.out.println("build scan:");
int totalRow = helper.getTotalRow();
Partition[] partitions = getPartitions(totalRow, parameters);
System.out.println("get partition:"+partitions.length+" total row:"+totalRow);
return new SmartbixDatasourceRDD(sparkSession.sparkContext(), partitions, 
parameters);
}


private Partition[] getPartitions(int totalRow, Map parameters) 
{
int step = 100;
int numOfPartitions = (totalRow + step - 1) / step;

Partition[] partitions = new Partition[numOfPartitions];


for (int i = 0; i < numOfPartitions; i++) {
int start = i * step + 1;
partitions[i] = new MyPartition(null, i, start, start + step);
}
return partitions;

}
}
{code}
 

 

 

---- above is my code; some irrelevant details have been removed ----

trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}

If I implement this trait, I find that the requiredColumns parameter is different 
every time. Why is the order different?

You can use spark.read.jdbc to connect to your local MySQL DB and debug at

org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130)

to see this parameter.

The attachment is my screenshot.

  was:
trait PrunedFilteredScan {
 def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}

 

if i implement this trait, i find requiredColumns param is different 
everytime,Why are the order different

you can use spark.read.jdbc  and connect to your local mysql DB, and debug at 

org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130)

to show this param;

attachement is my screenshot 


> spark sql External DataSource question
> --
>
> Key: SPARK-25723
> URL: https://issues.apache.org/jira/browse/SPARK-25723
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: local mode
>Reporter: huanghuai
>Priority: Minor
> Attachments: QQ图片20181015182502.jpg
>
>
> {code:java}
> public class MyDatasourceRelation extends BaseRelation implements 
> PrunedFilteredScan {
> Map parameters;
> SparkSession sparkSession;
> CombinedReportHelper helper;
> public MyDatasourceRelation() {
> }
> public SmartbixDatasourceRelation(SQLContext sqlContext,Map 
> parameters) {
> this.parameters = parameters;
> this.sparkSession = sqlContext.sparkSession();
> this.helper = new CombinedReportHelper(parameters); //don't care 
> this.helper.setRowsPerPage(1);
> }
> @Override
> public SQLContext sqlContext() {
> return this.sparkSession.sqlContext();
> }
> @Override
> public StructType schema() {
> StructType structType = transformSchema(helper.getFields(), 
> helper.getFirst());
> //helper.close();
> System.out.println("get schema: "+structType);
> return structType;
> }
> @Override
> public RDD buildScan(String[] requiredColumns, Filter[] filters) {
> System.out.println("build scan:");
> int totalRow = helper.getTotalRow();
> Partition[] partitions = getPartitions(totalRow, parameters);
> System.out.println("get partition:"+partitions.length+" total row:"+totalRow);
> return new SmartbixDatasourceRDD(sparkSession.sparkContext(), partitions, 
> parameters);
> }
> private Partition[] getPartitions(int totalRow, Map 
> parameters) {
> int step = 100;
> int numOfPartitions = (totalRow + step - 1) / step;
> Partition[] partitions = new Partition[numOfPartitions];
> for (int i = 0; i < numOfPartitions; i++) {
> int start = i * step + 1;
> partitions[i] = new MyPartition(null, i, start, start + step);
> }
> return partitions;
> }
> }
> {code}
>  
>  
>  
> -- above is my code,some useless information are removed 
> ---
>  
>  
> trait PrunedFilteredScan
> { def buildScan(requiredColumns: Array[String], filters: 

[jira] [Resolved] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port

2018-10-15 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25738.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22733
[https://github.com/apache/spark/pull/22733]

> LOAD DATA INPATH doesn't work if hdfs conf includes port
> 
>
> Key: SPARK-25738
> URL: https://issues.apache.org/jira/browse/SPARK-25738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
> Fix For: 2.4.0
>
>
> LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address 
> at index 8}} if your hdfs conf includes a port for the namenode.
> This is because the value of the hdfs conf "fs.defaultFS" is passed to the URI 
> constructor as the host.  Note that the variable is called {{authority}}, 
> but the 4-arg URI constructor actually expects a host: 
> https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)
> {code}
> val defaultFSConf = 
> sparkSession.sessionState.newHadoopConf().get("fs.defaultFS")
> ...
> val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment)
> {code}
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386
> This was introduced by SPARK-23425.
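> A small sketch of the constructor behaviour (the namenode address is 
> hypothetical): a "host" string containing ':' gets bracketed as if it were an 
> IPv6 literal, which is what produces the error above.
> {code}
> // throws java.net.URISyntaxException: Malformed IPv6 address ...
> new java.net.URI("hdfs", "hdfs://nn.example.com:8020", "/some/path/on/hdfs", null)
> {code}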
> *Workaround*: specify a fully qualified path, eg. instead of 
> {noformat}
> LOAD DATA INPATH '/some/path/on/hdfs'
> {noformat}
> use
> {noformat}
> LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs'
> {noformat}






[jira] [Assigned] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port

2018-10-15 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25738:
--

Assignee: Imran Rashid

> LOAD DATA INPATH doesn't work if hdfs conf includes port
> 
>
> Key: SPARK-25738
> URL: https://issues.apache.org/jira/browse/SPARK-25738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
> Fix For: 2.4.0
>
>
> LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address 
> at index 8}} if your hdfs conf includes a port for the namenode.
> This is because the value of the hdfs conf "fs.defaultFS" is passed to the URI 
> constructor as the host.  Note that the variable is called {{authority}}, 
> but the 4-arg URI constructor actually expects a host: 
> https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)
> {code}
> val defaultFSConf = 
> sparkSession.sessionState.newHadoopConf().get("fs.defaultFS")
> ...
> val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment)
> {code}
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386
> This was introduced by SPARK-23425.
> *Workaround*: specify a fully qualified path, eg. instead of 
> {noformat}
> LOAD DATA INPATH '/some/path/on/hdfs'
> {noformat}
> use
> {noformat}
> LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs'
> {noformat}






[jira] [Assigned] (SPARK-23257) Implement Kerberos Support in Kubernetes resource manager

2018-10-15 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23257:
--

Assignee: Ilan Filonenko

> Implement Kerberos Support in Kubernetes resource manager
> -
>
> Key: SPARK-23257
> URL: https://issues.apache.org/jira/browse/SPARK-23257
> Project: Spark
>  Issue Type: Wish
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Rob Keevil
>Assignee: Ilan Filonenko
>Priority: Major
> Fix For: 3.0.0
>
>
> On the forked k8s branch of Spark at 
> [https://github.com/apache-spark-on-k8s/spark/pull/540] , Kerberos support 
> has been added to the Kubernetes resource manager.  The Kubernetes code 
> between these two repositories appears to have diverged, so this commit 
> cannot be merged in easily.  Are there any plans to re-implement this work on 
> the main Spark repository?
>  
> [ifilonenko|https://github.com/ifilonenko] [~liyinan926] I am happy to help 
> with the development and testing of this, but I wanted to confirm that this 
> isn't already in progress - I could not find any discussion about this 
> specific topic online.






[jira] [Resolved] (SPARK-23257) Implement Kerberos Support in Kubernetes resource manager

2018-10-15 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23257.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21669
[https://github.com/apache/spark/pull/21669]

> Implement Kerberos Support in Kubernetes resource manager
> -
>
> Key: SPARK-23257
> URL: https://issues.apache.org/jira/browse/SPARK-23257
> Project: Spark
>  Issue Type: Wish
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Rob Keevil
>Assignee: Ilan Filonenko
>Priority: Major
> Fix For: 3.0.0
>
>
> On the forked k8s branch of Spark at 
> [https://github.com/apache-spark-on-k8s/spark/pull/540] , Kerberos support 
> has been added to the Kubernetes resource manager.  The Kubernetes code 
> between these two repositories appears to have diverged, so this commit 
> cannot be merged in easily.  Are there any plans to re-implement this work on 
> the main Spark repository?
>  
> [ifilonenko|https://github.com/ifilonenko] [~liyinan926] I am happy to help 
> with the development and testing of this, but I wanted to confirm that this 
> isn't already in progress - I could not find any discussion about this 
> specific topic online.






[jira] [Resolved] (SPARK-25716) Project and Aggregate generate valid constraints with unnecessary operation

2018-10-15 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25716.
-
   Resolution: Fixed
 Assignee: SongYadong
Fix Version/s: 3.0.0

> Project and Aggregate generate valid constraints with unnecessary operation
> ---
>
> Key: SPARK-25716
> URL: https://issues.apache.org/jira/browse/SPARK-25716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: SongYadong
>Assignee: SongYadong
>Priority: Minor
> Fix For: 3.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Project logical operator generates valid constraints using two opposite 
> operations: it subtracts the child constraints from all constraints, then unions 
> the child constraints back in. I think this may not be necessary.
> The Aggregate operator has the same problem as Project.
> For example,
> LogicalPlan.getAliasedConstraints() returns:
> {code:java}
> allConstraints -- child.constraints{code}
> in Project.validConstraints():
> {code:java}
> child.constraints.union(getAliasedConstraints(projectList)){code}
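> Written as a set identity (a sketch using the names above), the subtraction 
> followed by the union just rebuilds the same set:
> {noformat}
> child.constraints union (allConstraints -- child.constraints)
>   == child.constraints union allConstraints
> {noformat}
> which is simply allConstraints whenever allConstraints already contains the 
> child constraints.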






[jira] [Comment Edited] (SPARK-25643) Performance issues querying wide rows

2018-10-15 Thread Bruce Robbins (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650866#comment-16650866
 ] 

Bruce Robbins edited comment on SPARK-25643 at 10/15/18 10:08 PM:
--

[~viirya] Yes, in the case where I said "matching rows are sprinkled fairly 
evenly throughout the table", I meant that predicate pushdown is working 
correctly, but that every data page for the filtered column contains at least 
one matching value. It's not about predicate push down so much as how records 
are distributed in the files.

For example, I have a 1 million record table with 6000 columns, where {{select 
* from table where id1 = 1}} returns 2330 rows. The table is not sorted by id1, 
so the matching records are arbitrarily distributed throughout the table. In 
this case, every data page for column id1 contains at least one entry with 
value 1. As a result, every row gets realized and passed to FilterExec.

So the "small subset of rows" in this case is 2330 rows. When I say "the 
returned result includes just a few rows", I meant a few rows relative to the 
size of the table. There has to be enough matching rows so that many data pages 
are involved.


was (Author: bersprockets):
[~viirya] Yes, in the case where I said "predicate push down is not helping", I 
meant that predicate pushdown is working correctly, but that every data page 
for the filtered column contains at least one matching value. It's not about 
predicate push down so much as how records are distributed in the files.

For example, I have a 1 million record table with 6000 columns, where {{select 
* from table where id1 = 1}} returns 2330 rows. The table is not sorted by id1, 
so the matching records are arbitrarily distributed throughout the table. In 
this case, every data page for column id1 contains at least one entry with 
value 1. As a result, every row gets realized and passed to FilterExec.

So the "small subset of rows" in this case is 2330 rows. When I say "the 
returned result includes just a few rows", I meant a few rows relative to the 
size of the table. There has to be enough matching rows so that many data pages 
are involved.

> Performance issues querying wide rows
> -
>
> Key: SPARK-25643
> URL: https://issues.apache.org/jira/browse/SPARK-25643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Querying a small subset of rows from a wide table (e.g., a table with 6000 
> columns) can be quite slow in the following case:
>  * the table has many rows (most of which will be filtered out)
>  * the projection includes every column of a wide table (i.e., select *)
>  * predicate push down is not helping: either matching rows are sprinkled 
> fairly evenly throughout the table, or predicate push down is switched off
> Even if the filter involves only a single column and the returned result 
> includes just a few rows, the query can run much longer compared to an 
> equivalent query against a similar table with fewer columns.
> According to initial profiling, it appears that most time is spent realizing 
> the entire row in the scan, just so the filter can look at a tiny subset of 
> columns and almost certainly throw the row away. The profiling shows 74% of 
> time is spent in FileSourceScanExec, and that time is spent across numerous 
> writeFields_0_xxx method calls.
> If Spark must realize the entire row just to check a tiny subset of columns, 
> this all sounds reasonable. However, I wonder if there is an optimization 
> here where we can avoid realizing the entire row until after the filter has 
> selected the row.






[jira] [Commented] (SPARK-25643) Performance issues querying wide rows

2018-10-15 Thread Bruce Robbins (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650866#comment-16650866
 ] 

Bruce Robbins commented on SPARK-25643:
---

[~viirya] Yes, in the case where I said "predicate push down is not helping", I 
meant that predicate pushdown is working correctly, but that every data page 
for the filtered column contains at least one matching value. It's not about 
predicate push down so much as how records are distributed in the files.

For example, I have a 1 million record table with 6000 columns, where {{select 
* from table where id1 = 1}} returns 2330 rows. The table is not sorted by id1, 
so the matching records are arbitrarily distributed throughout the table. In 
this case, every data page for column id1 contains at least one entry with 
value 1. As a result, every row gets realized and passed to FilterExec.

So the "small subset of rows" in this case is 2330 rows. When I say "the 
returned result includes just a few rows", I meant a few rows relative to the 
size of the table. There has to be enough matching rows so that many data pages 
are involved.

> Performance issues querying wide rows
> -
>
> Key: SPARK-25643
> URL: https://issues.apache.org/jira/browse/SPARK-25643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Querying a small subset of rows from a wide table (e.g., a table with 6000 
> columns) can be quite slow in the following case:
>  * the table has many rows (most of which will be filtered out)
>  * the projection includes every column of a wide table (i.e., select *)
>  * predicate push down is not helping: either matching rows are sprinkled 
> fairly evenly throughout the table, or predicate push down is switched off
> Even if the filter involves only a single column and the returned result 
> includes just a few rows, the query can run much longer compared to an 
> equivalent query against a similar table with fewer columns.
> According to initial profiling, it appears that most time is spent realizing 
> the entire row in the scan, just so the filter can look at a tiny subset of 
> columns and almost certainly throw the row away. The profiling shows 74% of 
> time is spent in FileSourceScanExec, and that time is spent across numerous 
> writeFields_0_xxx method calls.
> If Spark must realize the entire row just to check a tiny subset of columns, 
> this all sounds reasonable. However, I wonder if there is an optimization 
> here where we can avoid realizing the entire row until after the filter has 
> selected the row.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-15 Thread Brian Jones (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Jones updated SPARK-25739:

Description: 
 Example code - 
{code:java}
val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
df
.repartition(1)
.write
.mode("overwrite")
.option("nullValue", null)
.option("emptyValue", null)
.option("delimiter",",")
.option("quoteMode", "NONE")
.option("escape","\\")
.format("csv")
.save("/tmp/nullcsv/")

var out = dbutils.fs.ls("/tmp/nullcsv/")
var file = out(out.size - 1)
val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
println(x)
{code}
Output - 
{code:java}
1,""
3,hi
2,hello
4,
{code}
Expected output - 
{code:java}
1,
3,hi
2,hello
4,
{code}
 

[https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
 This commit is relevant to my issue.

"Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files."

I am on Spark version 2.3.1, so empty strings should be written as null.  Even 
so, I am passing the correct "emptyValue" option.  However, my empty values are 
still coming through as `""` in the written file.

 

I have tested the provided code in Databricks runtime environments 5.0 and 4.1, 
and it gives the expected output.  However, in Databricks runtimes 4.2 and 4.3 
(which run Spark 2.3.1), we get the incorrect output.
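
For what it's worth, the same check can be run without dbutils by reading the 
written part files back as plain text. A minimal sketch, reusing the 
illustrative /tmp/nullcsv/ path from the example above:

{code:scala}
// Sketch: inspect the raw CSV lines Spark wrote, without Databricks utilities.
// Assumes the write shown above has already produced files under /tmp/nullcsv/.
val rawLines = spark.read.textFile("/tmp/nullcsv/")   // each part file read as plain text
rawLines.collect().foreach(println)                   // shows whether "" or nothing was written
{code}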

  was:
 Example code - 
{code}
val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
df
.repartition(1)
.write
.mode("overwrite")
.option("nullValue", null)
.option("emptyValue", null)
.option("delimiter",",")
.option("quoteMode", "NONE")
.option("escape","\\")
.format("csv")
.save("/tmp/nullcsv/")

var out = dbutils.fs.ls("/tmp/nullcsv/")
var file = out(out.size - 1)
val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
println(x)
{code}
Output - 
{code:java}
1,""
3,hi
2,hello
4,
{code}
Expected output - 
{code:java}
1,
3,hi
2,hello
4,
{code}
 

[https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
 This commit is relevant to my issue.

"Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files."

I am on Spark version 2.3.1, so empty strings should be written as null.  Even 
so, I am passing the correct "emptyValue" option.  However, my empty values are 
still coming through as `""` in the written file.

 

I have tested the provided code in Databricks runtime environment 5.0 beta, and 
it is giving the expected output. 


> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be written as null.  Even 
> so, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming through as `""` in the written file.
>  
> I have tested the provided code in Databricks runtime environments 5.0 and 
> 4.1, and it gives the expected output.  However, in Databricks runtimes 
> 4.2 and 4.3 (which run Spark 2.3.1), we get the incorrect output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-15 Thread Brian Jones (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Jones updated SPARK-25739:

Description: 
 Example code - 
{code}
val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
df
.repartition(1)
.write
.mode("overwrite")
.option("nullValue", null)
.option("emptyValue", null)
.option("delimiter",",")
.option("quoteMode", "NONE")
.option("escape","\\")
.format("csv")
.save("/tmp/nullcsv/")

var out = dbutils.fs.ls("/tmp/nullcsv/")
var file = out(out.size - 1)
val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
println(x)
{code}
Output - 
{code:java}
1,""
3,hi
2,hello
4,
{code}
Expected output - 
{code:java}
1,
3,hi
2,hello
4,
{code}
 

[https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
 This commit is relevant to my issue.

"Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files."

I am on Spark version 2.3.1, so empty strings should be written as null.  Even 
so, I am passing the correct "emptyValue" option.  However, my empty values are 
still coming through as `""` in the written file.

 

I have tested the provided code in Databricks runtime environment 5.0 beta, and 
it is giving the expected output. 

  was:
 Example code - 
{code:scala}
val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
df
.repartition(1)
.write
.mode("overwrite")
.option("nullValue", null)
.option("emptyValue", null)
.option("delimiter",",")
.option("quoteMode", "NONE")
.option("escape","\\")
.format("csv")
.save("/tmp/nullcsv/")

var out = dbutils.fs.ls("/tmp/nullcsv/")
var file = out(out.size - 1)
val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
println(x)
{code}
Output - 
{code:java}
1,""
3,hi
2,hello
4,
{code}
Expected output - 
{code:java}
1,
3,hi
2,hello
4,
{code}
 

[https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
 This commit is relevant to my issue.

"Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files."

I am on Spark version 2.3.1, so empty strings should be written as null.  Even 
so, I am passing the correct "emptyValue" option.  However, my empty values are 
still coming through as `""` in the written file.


> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be written as null.  Even 
> so, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming through as `""` in the written file.
>  
> I have tested the provided code in Databricks runtime environment 5.0 beta, 
> and it is giving the expected output. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-15 Thread Brian Jones (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Jones updated SPARK-25739:

Description: 
 Example code - 
{code:scala}
val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
df
.repartition(1)
.write
.mode("overwrite")
.option("nullValue", null)
.option("emptyValue", null)
.option("delimiter",",")
.option("quoteMode", "NONE")
.option("escape","\\")
.format("csv")
.save("/tmp/nullcsv/")

var out = dbutils.fs.ls("/tmp/nullcsv/")
var file = out(out.size - 1)
val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
println(x)
{code}
Output - 
{code:java}
1,""
3,hi
2,hello
4,
{code}
Expected output - 
{code:java}
1,
3,hi
2,hello
4,
{code}
 

[https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
 This commit is relevant to my issue.

"Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files."

I am on Spark version 2.3.1, so empty strings should be written as null.  Even 
so, I am passing the correct "emptyValue" option.  However, my empty values are 
still coming through as `""` in the written file.

  was:
 Example code - 
{code:java}
val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
df
.repartition(1)
.write
.mode("overwrite")
.option("nullValue", null)
.option("emptyValue", null)
.option("delimiter",",")
.option("quoteMode", "NONE")
.option("escape","\\")
.format("csv")
.save("/tmp/nullcsv/")

var out = dbutils.fs.ls("/tmp/nullcsv/")
var file = out(out.size - 1)
val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
println(x)
{code}
Output - 
{code:java}
1,""
3,hi
2,hello
4,
{code}
Expected output - 
{code:java}
1,
3,hi
2,hello
4,
{code}
 

[https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
 This commit is relevant to my issue.

"Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files."

I am on Spark version 2.3.1, so empty strings should be written as null.  Even 
so, I am passing the correct "emptyValue" option.  However, my empty values are 
still coming through as `""` in the written file.


> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:scala}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be written as null.  Even 
> so, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming through as `""` in the written file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-15 Thread Brian Jones (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Jones updated SPARK-25739:

Affects Version/s: (was: 2.3.2)
   2.3.1

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be written as null.  Even 
> so, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming through as `""` in the written file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-15 Thread Brian Jones (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Jones updated SPARK-25739:

Description: 
 Example code - 
{code:java}
val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
df
.repartition(1)
.write
.mode("overwrite")
.option("nullValue", null)
.option("emptyValue", null)
.option("delimiter",",")
.option("quoteMode", "NONE")
.option("escape","\\")
.format("csv")
.save("/tmp/nullcsv/")

var out = dbutils.fs.ls("/tmp/nullcsv/")
var file = out(out.size - 1)
val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
println(x)
{code}
Output - 
{code:java}
1,""
3,hi
2,hello
4,
{code}
Expected output - 
{code:java}
1,
3,hi
2,hello
4,
{code}
 

[https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
 This commit is relevant to my issue.

"Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files."

I am on Spark version 2.3.1, so empty strings should be written as null.  Even 
so, I am passing the correct "emptyValue" option.  However, my empty values are 
still coming through as `""` in the written file.

  was:
 Example code - 
{code:java}
val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
df
.repartition(1)
.write
.mode("overwrite")
.option("nullValue", null)
.option("emptyValue", null)
.option("delimiter",",")
.option("quoteMode", "NONE")
.option("escape","\\")
.format("csv")
.save("/tmp/nullcsv/")

var out = dbutils.fs.ls("/tmp/nullcsv/")
var file = out(out.size - 1)
val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
println(x)
{code}
Output - 
{code:java}
1,""
3,hi
2,hello
4,
{code}
Expected output - 
{code:java}
1,
3,hi
2,hello
4,
{code}
 

[https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
 This commit is relevant to my issue.

"Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files."

I am on Spark version 2.3.2, so empty strings should be written as null.  Even 
so, I am passing the correct "emptyValue" option.  However, my empty values are 
still coming through as `""` in the written file.


> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be written as null.  Even 
> so, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming through as `""` in the written file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-15 Thread Brian Jones (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Jones updated SPARK-25739:

Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 2.11)   
(was:  

Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 2.11) )

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.2, so empty strings should be written as null.  Even 
> so, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming through as `""` in the written file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-15 Thread Brian Jones (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Jones updated SPARK-25739:

Description: 
 Example code - 
{code:java}
val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
df
.repartition(1)
.write
.mode("overwrite")
.option("nullValue", null)
.option("emptyValue", null)
.option("delimiter",",")
.option("quoteMode", "NONE")
.option("escape","\\")
.format("csv")
.save("/tmp/nullcsv/")

var out = dbutils.fs.ls("/tmp/nullcsv/")
var file = out(out.size - 1)
val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
println(x)
{code}
Output - 
{code:java}
1,""
3,hi
2,hello
4,
{code}
Expected output - 
{code:java}
1,
3,hi
2,hello
4,
{code}
 

[https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
 This commit is relevant to my issue.

"Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files."

I am on Spark version 2.3.2, so empty strings should be written as null.  Even 
so, I am passing the correct "emptyValue" option.  However, my empty values are 
still coming through as `""` in the written file.

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
> Environment:  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.2, so empty strings should be written as null.  Even 
> so, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming through as `""` in the written file.
>  
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.2, so empty strings should be written as null.  Even 
> so, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming through as `""` in the written file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-15 Thread Brian Jones (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Jones updated SPARK-25739:

Environment: 
 

Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 2.11) 

  was:
 Example code - 
{code:java}
val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
df
.repartition(1)
.write
.mode("overwrite")
.option("nullValue", null)
.option("emptyValue", null)
.option("delimiter",",")
.option("quoteMode", "NONE")
.option("escape","\\")
.format("csv")
.save("/tmp/nullcsv/")

var out = dbutils.fs.ls("/tmp/nullcsv/")
var file = out(out.size - 1)
val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
println(x)
{code}
Output - 
{code:java}
1,""
3,hi
2,hello
4,
{code}
Expected output - 
{code:java}
1,
3,hi
2,hello
4,
{code}
 

[https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
 This commit is relevant to my issue.

"Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files."

I am on Spark version 2.3.2, so empty strings should be written as null.  Even 
so, I am passing the correct "emptyValue" option.  However, my empty values are 
still coming through as `""` in the written file.

 


> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
> Environment:  
> Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.2, so empty strings should be written as null.  Even 
> so, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming through as `""` in the written file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-15 Thread Brian Jones (JIRA)
Brian Jones created SPARK-25739:
---

 Summary: Double quote coming in as empty value even when 
emptyValue set as null
 Key: SPARK-25739
 URL: https://issues.apache.org/jira/browse/SPARK-25739
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.2
 Environment:  Example code - 
{code:java}
val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
df
.repartition(1)
.write
.mode("overwrite")
.option("nullValue", null)
.option("emptyValue", null)
.option("delimiter",",")
.option("quoteMode", "NONE")
.option("escape","\\")
.format("csv")
.save("/tmp/nullcsv/")

var out = dbutils.fs.ls("/tmp/nullcsv/")
var file = out(out.size - 1)
val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
println(x)
{code}
Output - 
{code:java}
1,""
3,hi
2,hello
4,
{code}
Expected output - 
{code:java}
1,
3,hi
2,hello
4,
{code}
 

[https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
 This commit is relevant to my issue.

"Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files."

I am on Spark version 2.3.2, so empty strings should be written as null.  Even 
so, I am passing the correct "emptyValue" option.  However, my empty values are 
still coming through as `""` in the written file.

 
Reporter: Brian Jones






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-10-15 Thread Daniel Darabos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650817#comment-16650817
 ] 

Daniel Darabos commented on SPARK-20144:


Thanks, those are good questions.

# The global option is not great, but it's the simplest. The code is already 
controlled by two global options. ({{spark.sql.files.maxPartitionBytes}} and 
{{spark.sql.files.openCostInBytes}}.) Why not one more?
 # I'm not sure what {{LOAD DATA INPATH}} does. (Sorry...) But sure, users can 
put random-name files in the directory and mess stuff up. Best protection 
against that is not putting random-name files in the directory. :D
 # The whole problem is not Parquet-specific. It affects all file types. The 
{{part-1}} naming comes from Hadoop's 
[FileOutputFormat|https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java#L270].
 It's been like this forever and will never change. (I'd say it's more than a 
convention.)
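
For reference, those existing file-source knobs are ordinary SQL confs, so a new 
one would presumably be set the same way. A sketch; note that 
{{spark.sql.files.allowReordering}} is only the name proposed in this thread, 
not an option that exists in any Spark release:

{code:scala}
// The two existing global file-source options mentioned above:
spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024) // target split size
spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)     // estimated cost of opening a file

// The option proposed in this thread -- hypothetical, shown only to illustrate the idea:
// spark.conf.set("spark.sql.files.allowReordering", "false")
{code}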

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>Priority: Major
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reordered them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25738:


Assignee: (was: Apache Spark)

> LOAD DATA INPATH doesn't work if hdfs conf includes port
> 
>
> Key: SPARK-25738
> URL: https://issues.apache.org/jira/browse/SPARK-25738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Blocker
>
> LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address 
> at index 8}} if your hdfs conf includes a port for the namenode.
> This is because the value of the hdfs conf "fs.defaultFS" is passed in as the 
> host.  Note that the variable is called {{authority}}, but the 4-arg URI 
> constructor actually expects a host: 
> https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)
> {code}
> val defaultFSConf = 
> sparkSession.sessionState.newHadoopConf().get("fs.defaultFS")
> ...
> val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment)
> {code}
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386
> This was introduced by SPARK-23425.
> *Workaround*: specify a fully qualified path, e.g., instead of 
> {noformat}
> LOAD DATA INPATH '/some/path/on/hdfs'
> {noformat}
> use
> {noformat}
> LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25738:


Assignee: Apache Spark

> LOAD DATA INPATH doesn't work if hdfs conf includes port
> 
>
> Key: SPARK-25738
> URL: https://issues.apache.org/jira/browse/SPARK-25738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Blocker
>
> LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address 
> at index 8}} if your hdfs conf includes a port for the namenode.
> This is because the value of the hdfs conf "fs.defaultFS" is passed in as the 
> host.  Note that the variable is called {{authority}}, but the 4-arg URI 
> constructor actually expects a host: 
> https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)
> {code}
> val defaultFSConf = 
> sparkSession.sessionState.newHadoopConf().get("fs.defaultFS")
> ...
> val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment)
> {code}
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386
> This was introduced by SPARK-23425.
> *Workaround*: specify a fully qualified path, e.g., instead of 
> {noformat}
> LOAD DATA INPATH '/some/path/on/hdfs'
> {noformat}
> use
> {noformat}
> LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650807#comment-16650807
 ] 

Apache Spark commented on SPARK-25738:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/22733

> LOAD DATA INPATH doesn't work if hdfs conf includes port
> 
>
> Key: SPARK-25738
> URL: https://issues.apache.org/jira/browse/SPARK-25738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Blocker
>
> LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address 
> at index 8}} if your hdfs conf includes a port for the namenode.
> This is because the value of the hdfs conf "fs.defaultFS" is passed in as the 
> host.  Note that the variable is called {{authority}}, but the 4-arg URI 
> constructor actually expects a host: 
> https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)
> {code}
> val defaultFSConf = 
> sparkSession.sessionState.newHadoopConf().get("fs.defaultFS")
> ...
> val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment)
> {code}
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386
> This was introduced by SPARK-23425.
> *Workaround*: specify a fully qualified path, e.g., instead of 
> {noformat}
> LOAD DATA INPATH '/some/path/on/hdfs'
> {noformat}
> use
> {noformat}
> LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port

2018-10-15 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650799#comment-16650799
 ] 

Imran Rashid commented on SPARK-25738:
--

The fix is pretty trivial; I'm posting a PR now.

> LOAD DATA INPATH doesn't work if hdfs conf includes port
> 
>
> Key: SPARK-25738
> URL: https://issues.apache.org/jira/browse/SPARK-25738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Blocker
>
> LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address 
> at index 8}} if your hdfs conf includes a port for the namenode.
> This is because the value of the hdfs conf "fs.defaultFS" is passed in as the 
> host.  Note that the variable is called {{authority}}, but the 4-arg URI 
> constructor actually expects a host: 
> https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)
> {code}
> val defaultFSConf = 
> sparkSession.sessionState.newHadoopConf().get("fs.defaultFS")
> ...
> val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment)
> {code}
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386
> This was introduced by SPARK-23425.
> *Workaround*: specify a fully qualified path, e.g., instead of 
> {noformat}
> LOAD DATA INPATH '/some/path/on/hdfs'
> {noformat}
> use
> {noformat}
> LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port

2018-10-15 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650796#comment-16650796
 ] 

Shixiong Zhu commented on SPARK-25738:
--

Marked as a blocker since this is a regression

> LOAD DATA INPATH doesn't work if hdfs conf includes port
> 
>
> Key: SPARK-25738
> URL: https://issues.apache.org/jira/browse/SPARK-25738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Blocker
>
> LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address 
> at index 8}} if your hdfs conf includes a port for the namenode.
> This is because the value of the hdfs conf "fs.defaultFS" is passed in as the 
> host.  Note that the variable is called {{authority}}, but the 4-arg URI 
> constructor actually expects a host: 
> https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)
> {code}
> val defaultFSConf = 
> sparkSession.sessionState.newHadoopConf().get("fs.defaultFS")
> ...
> val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment)
> {code}
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386
> This was introduced by SPARK-23425.
> *Workaround*: specify a fully qualified path, e.g., instead of 
> {noformat}
> LOAD DATA INPATH '/some/path/on/hdfs'
> {noformat}
> use
> {noformat}
> LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port

2018-10-15 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25738:
-
Priority: Blocker  (was: Critical)

> LOAD DATA INPATH doesn't work if hdfs conf includes port
> 
>
> Key: SPARK-25738
> URL: https://issues.apache.org/jira/browse/SPARK-25738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Blocker
>
> LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address 
> at index 8}} if your hdfs conf includes a port for the namenode.
> This is because the value of the hdfs conf "fs.defaultFS" is passed in as the 
> host.  Note that the variable is called {{authority}}, but the 4-arg URI 
> constructor actually expects a host: 
> https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)
> {code}
> val defaultFSConf = 
> sparkSession.sessionState.newHadoopConf().get("fs.defaultFS")
> ...
> val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment)
> {code}
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386
> This was introduced by SPARK-23425.
> *Workaround*: specify a fully qualified path, e.g., instead of 
> {noformat}
> LOAD DATA INPATH '/some/path/on/hdfs'
> {noformat}
> use
> {noformat}
> LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port

2018-10-15 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-25738:


 Summary: LOAD DATA INPATH doesn't work if hdfs conf includes port
 Key: SPARK-25738
 URL: https://issues.apache.org/jira/browse/SPARK-25738
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Imran Rashid


LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address 
at index 8}} if your hdfs conf includes a port for the namenode.

This is because the value of the hdfs conf "fs.defaultFS" is passed in as the 
host.  Note that the variable is called {{authority}}, but the 4-arg URI 
constructor actually expects a host: 
https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)

{code}
val defaultFSConf = 
sparkSession.sessionState.newHadoopConf().get("fs.defaultFS")
...
val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment)
{code}

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386

This was introduced by SPARK-23425.

*Workaround*: specify a fully qualified path, e.g., instead of 

{noformat}
LOAD DATA INPATH '/some/path/on/hdfs'
{noformat}

use

{noformat}
LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs'
{noformat}
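
A sketch of the kind of fix the description points at: let Hadoop qualify the 
path against fs.defaultFS instead of hand-splitting it into (scheme, authority) 
for the host-based 4-arg URI constructor. This is illustrative only, not the 
patch in the linked PR.

{code:scala}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative sketch: Hadoop's own qualification handles an fs.defaultFS value
// that includes a port, which is exactly what trips up the 4-arg URI constructor.
def qualify(rawPath: String, hadoopConf: Configuration): URI = {
  val path = new Path(rawPath)
  val fs = path.getFileSystem(hadoopConf)
  fs.makeQualified(path).toUri   // e.g. hdfs://namenode.example.com:8020/some/path/on/hdfs
}
{code}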




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-10-15 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650779#comment-16650779
 ] 

Dongjoon Hyun commented on SPARK-20144:
---

[~silvermast] and [~darabos]. 

1. The proposed `spark.sql.files.allowReordering` is a global option for all 
data sources.
2. In Parquet tables, can we prevent some users from executing 'LOAD DATA INPATH' 
into that folder with a random-name file?
3. Is there any way for Hive/Spark/Parquet to keep that naming convention in 
that table always?

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>Priority: Major
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reordered them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-10-15 Thread Daniel Darabos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650722#comment-16650722
 ] 

Daniel Darabos commented on SPARK-20144:


Yeah, I'm not too happy about the alphabetical ordering either. I thought I 
could simply not sort, and get the "original" order. But at the point where I 
made my change, the files are already in a jumbled order. Maybe it's the file 
system listing order, which could be anything.

99% of the time I'm just reading back a single partitioned Parquet file. In 
this case the alphabetical ordering is the right ordering. ({{part-1}}, 
{{part-2}}, ...) The rows of the resulting DataFrame will be in the same 
order as originally. So I think this issue is satisfied by the change. (The 
test also demonstrates this.)

The 1% case (for me) is when I'm reading back multiple Parquet files with a 
glob in a single {{spark.read.parquet("dir-\{0,5,10}")}} call. In this case it 
would be nice to respect the order given by the user ({{dir-0}}, {{dir-5}}, 
{{dir-10}}). My PR messes this up. ({{dir-0}}, {{dir-10}}, {{dir-5}}) But at 
least the partitions within each Parquet file will be contiguous. That's still 
an improvement.
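
If the user-specified directory order matters for that 1% case, one workaround 
is to read the directories one by one and union the results. A sketch, under the 
assumption that union concatenates the partitions of its inputs in order:

{code:scala}
// Sketch: preserve the caller's directory order instead of relying on how a
// single globbed read lists and sorts the underlying files.
val dirs = Seq("dir-0", "dir-5", "dir-10")               // illustrative paths from above
val ordered = dirs.map(spark.read.parquet(_)).reduce(_ union _)
{code}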

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>Priority: Major
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reordered them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-10-15 Thread Victor Tso (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650704#comment-16650704
 ] 

Victor Tso commented on SPARK-20144:


It should, because by convention the parquet file names are 0-padded, so they 
are numerically ordered.

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>Priority: Major
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was reproduced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reordered them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25424) Window duration and slide duration with negative values should fail fast

2018-10-15 Thread Raghav Kumar Gautam (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raghav Kumar Gautam resolved SPARK-25424.
-
Resolution: Not A Bug

> Window duration and slide duration with negative values should fail fast
> 
>
> Key: SPARK-25424
> URL: https://issues.apache.org/jira/browse/SPARK-25424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Raghav Kumar Gautam
>Priority: Major
>
> In the TimeWindow class, window duration and slide duration should not be 
> allowed to take negative values.
> Currently this behaviour is enforced by catalyst. It can also be enforced by 
> the constructor of TimeWindow, allowing it to fail fast.
> For example, the code below throws the following error. Note that the error is 
> produced at the time of the count() call instead of the window() call.
> {code:java}
> val df = spark.readStream
>   .format("rate")
>   .option("numPartitions", "2")
>   .option("rowsPerSecond", "10")
>   .load()
>   .filter("value % 20 == 0")
>   .withWatermark("timestamp", "10 seconds")
>   .groupBy(window($"timestamp", "-10 seconds", "5 seconds"))
>   .count()
> {code}
> Error:
> {code:java}
> cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data 
> type mismatch: The window duration (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
> org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 
> -1000, 500, 0)' due to data type mismatch: The window duration 
> (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122)
>   at 
> 
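A minimal fail-fast sketch of what the proposal asks for. It uses a simplified 
stand-in (TimeWindowSpec is a hypothetical name; the real TimeWindow is a 
Catalyst expression, so the actual change would live there) and only shows the 
require-style checks that would surface the error when the window is 
constructed rather than at count():

{code:scala}
// Hypothetical constructor-time validation, mirroring the analysis-time messages.
case class TimeWindowSpec(windowDuration: Long, slideDuration: Long, startTime: Long) {
  require(windowDuration > 0,
    s"The window duration ($windowDuration) must be greater than 0.")
  require(slideDuration > 0,
    s"The slide duration ($slideDuration) must be greater than 0.")
}

// TimeWindowSpec(-1000L, 500L, 0L) now throws IllegalArgumentException
// immediately, rather than surfacing an AnalysisException later at count().
{code}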

[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-10-15 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650642#comment-16650642
 ] 

Dongjoon Hyun commented on SPARK-20144:
---

For me, I don't think that PR resolves this issue, `spark.read.parquet no long 
maintains ordering of the data`.

This issue asks about data ordering. The alphabetical file path order doesn't 
guarantee the order of the data, does it?

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> that when we read parquet files in 2.0.2, the ordering of rows in the 
> resulting dataframe is not the same as the ordering of rows in the dataframe 
> that the parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is an 
> issue in 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25044) Address translation of LMF closure primitive args to Object in Scala 2.12

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650608#comment-16650608
 ] 

Apache Spark commented on SPARK-25044:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/22732

> Address translation of LMF closure primitive args to Object in Scala 2.12
> -
>
> Key: SPARK-25044
> URL: https://issues.apache.org/jira/browse/SPARK-25044
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 2.4.0
>
>
> A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 
> Fix HandleNullInputsForUDF rule":
> {code:java}
> - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED ***
> Results do not match for query:
> ...
> == Results ==
> == Results ==
> !== Correct Answer - 3 == == Spark Answer - 3 ==
> !struct<> struct
> ![0,10,null] [0,10,0]
> ![1,12,null] [1,12,1]
> ![2,14,null] [2,14,2] (QueryTest.scala:163){code}
> You can kind of get what's going on reading the test:
> {code:java}
> test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
> // assume(!ClosureCleanerSuite2.supportsLMFs)
> // This test won't test what it intends to in 2.12, as lambda metafactory 
> closures
> // have arg types that are not primitive, but Object
> val udf1 = udf({(x: Int, y: Int) => x + y})
> val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10
> .withColumn("c", udf1($"a", lit(null)))
> val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed
> comparePlans(df.logicalPlan, plan)
> checkAnswer(
> df,
> Seq(
> Row(0, 10, null),
> Row(1, 12, null),
> Row(2, 14, null)))
> }{code}
>  
> It seems that the closure that is fed in as a UDF changes behavior, in a way 
> that primitive-type arguments are handled differently. For example an Int 
> argument, when fed 'null', acts like 0.
> I'm sure it's a difference in the LMF closure and how its types are 
> understood, but not exactly sure of the cause yet.
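A tiny illustration of the boxing behaviour at play (my own sketch, not from 
the ticket): when an Int-typed closure is invoked through an erased, 
Object-typed view, a null argument is unboxed to the primitive default 0, which 
matches the [0,10,0]-style answers above.

{code:scala}
// Simulate calling an Int closure through an erased view, the way an
// LMF-generated lambda with Object-typed arguments would be invoked.
val f: (Int, Int) => Int = (x, y) => x + y

val erased = f.asInstanceOf[(Any, Any) => Any]
erased(null, 10)   // yields 10: the null is unboxed to 0 before x + y runs
{code}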



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-15 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650591#comment-16650591
 ] 

Marcelo Vanzin commented on SPARK-25732:


In fact, do you even need proxy user + keytab at all?

If you just submit the application using principal / keytab from the command 
line, as is available today, isn't that the exact same thing? Other than the 
keytab needing to be available on the submitting node, it seems like it does 
all you need.

The only thing missing would be to download the keytab from remote storage if 
needed, and then clean it up.


> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with proxy-user fail after 2 weeks due to 
> the lack of token renewal. In order to enable it, we need the 
> keytab/principal of the impersonated user to be specified, so that they are 
> available for the token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, with the latter also allowing the keytab to be 
> specified on a distributed FS, so that applications can be submitted by 
> servers (e.g. Livy, Zeppelin) without needing all users' principals to be on 
> that machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-15 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650560#comment-16650560
 ] 

Thomas Graves commented on SPARK-25732:
---

I would much rather see Spark start to push tokens and distribute them.

I'm not fond of pushing keytabs; many security folks/companies won't allow it. 
If you do this it means that all the users' keytabs are in HDFS all the time, 
which in my opinion is even worse than our existing keytab/principal options, 
where the keytab can be picked up locally and is only in HDFS temporarily. Just 
more chances that permissions are messed up and people compromise keytabs, 
which are valid indefinitely and much harder to revoke than things like tokens.

Pushing tokens is definitely more work, but I think we should go that way long 
term. Having an RPC connection between client and driver can be useful for 
other things as well.

> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with proxy-user fail after 2 weeks due to 
> the lack of token renewal. In order to enable it, we need the 
> keytab/principal of the impersonated user to be specified, so that they are 
> available for the token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, with the latter also allowing the keytab to be 
> specified on a distributed FS, so that applications can be submitted by 
> servers (e.g. Livy, Zeppelin) without needing all users' principals to be on 
> that machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25547) Pluggable jdbc connection factory

2018-10-15 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25547:

Target Version/s: 3.0.0

> Pluggable jdbc connection factory
> -
>
> Key: SPARK-25547
> URL: https://issues.apache.org/jira/browse/SPARK-25547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Frank Sauer
>Priority: Major
>
> The ability to provide a custom connectionFactoryProvider via JDBCOptions, so 
> that JdbcUtils.createConnectionFactory can produce a custom connection 
> factory, would be very useful. In our case we needed the ability to 
> load-balance connections to an AWS Aurora Postgres cluster by round-robining 
> through the endpoints of the read replicas, since their own load balancing 
> was insufficient. We got away with it by copying most of the Spark jdbc 
> package, providing this feature there, and changing the format from jdbc to 
> our new package. However, it would be nice if this were supported out of the 
> box via a new option in JDBCOptions providing the classname for a 
> ConnectionFactoryProvider. I'm creating this Jira in order to submit a PR 
> which I have ready to go.
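A rough sketch of the kind of pluggable factory being proposed; the trait name 
and wiring below are illustrative assumptions, not Spark's actual API:

{code:scala}
import java.sql.{Connection, DriverManager}
import java.util.Properties
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical SPI that a JDBCOptions entry could point at by class name.
trait ConnectionFactoryProvider {
  def createConnectionFactory(props: Properties): () => Connection
}

// Round-robin over read-replica endpoints, as described above.
class RoundRobinProvider(endpoints: Seq[String]) extends ConnectionFactoryProvider {
  private val next = new AtomicInteger(0)
  override def createConnectionFactory(props: Properties): () => Connection =
    () => {
      val i = math.abs(next.getAndIncrement() % endpoints.size)
      DriverManager.getConnection(endpoints(i), props)
    }
}
{code}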



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25674) If the records are incremented by more than 1 at a time,the number of bytes might rarely ever get updated

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650541#comment-16650541
 ] 

Apache Spark commented on SPARK-25674:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22731

> If the records are incremented by more than 1 at a time,the number of bytes 
> might rarely ever get updated
> -
>
> Key: SPARK-25674
> URL: https://issues.apache.org/jira/browse/SPARK-25674
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> If the records are incremented by more than 1 at a time, the number of bytes 
> might rarely ever get updated in `FileScanRDD.scala`, because the check might 
> skip over a count that is an exact multiple of 
> UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
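A small sketch of the arithmetic (a simplified stand-in, not the actual 
FileScanRDD code): with a modulo check, counting in steps larger than 1 usually 
jumps over the exact multiples, while an accumulate-and-compare check cannot 
miss.

{code:scala}
val interval = 1000L  // stand-in for UPDATE_INPUT_METRICS_INTERVAL_RECORDS

// Modulo check: counting in steps of 3 only fires on multiples of 3000.
val moduloHits = (3L to 100000L by 3L).count(_ % interval == 0)       // 33

// Accumulate-and-compare: fires roughly every `interval` records, any step size.
var sinceLastUpdate = 0L
def shouldUpdate(delta: Long): Boolean = {
  sinceLastUpdate += delta
  if (sinceLastUpdate >= interval) { sinceLastUpdate = 0L; true } else false
}
val thresholdHits = (1L to 100000L / 3L).count(_ => shouldUpdate(3L)) // ~99
{code}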



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25674) If the records are incremented by more than 1 at a time,the number of bytes might rarely ever get updated

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650539#comment-16650539
 ] 

Apache Spark commented on SPARK-25674:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22731

> If the records are incremented by more than 1 at a time,the number of bytes 
> might rarely ever get updated
> -
>
> Key: SPARK-25674
> URL: https://issues.apache.org/jira/browse/SPARK-25674
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> If the records are incremented by more than 1 at a time, the number of bytes 
> might rarely ever get updated in `FileScanRDD.scala`, because the check might 
> skip over a count that is an exact multiple of 
> UPDATE_INPUT_METRICS_INTERVAL_RECORDS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-15 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650535#comment-16650535
 ] 

Marcelo Vanzin commented on SPARK-25732:


I'd have preferred a system where Livy handles this for the users by 
periodically creating new delegation tokens for them and sending them to Spark. 
Saisai looked at something like this in the past, but distributing the tokens 
to Spark was the main issue.

With Livy you already have an RPC channel to the Spark context, so maybe that 
could be done? But it would probably still require some new API in Spark 
itself...

If those paths don't work, then this would be no worse than what Spark already 
has. The main issue is that you seem to be mixing keytab/principal with proxy 
user, and that doesn't work - Spark explicitly disallows that combination.

> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with proxy-user fail after 2 weeks due to 
> the lack of token renewal. In order to enable it, we need the 
> keytab/principal of the impersonated user to be specified, so that they are 
> available for the token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, with the latter also allowing the keytab to be 
> specified on a distributed FS, so that applications can be submitted by 
> servers (e.g. Livy, Zeppelin) without needing all users' principals to be on 
> that machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) SPIP: Support native submission of spark jobs to a kubernetes cluster

2018-10-15 Thread Matt Cheah (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650524#comment-16650524
 ] 

Matt Cheah commented on SPARK-18278:


The fork is no longer being maintained, because Kubernetes support is now built 
into and available in mainline Spark.

Spark 2.3 introduced basic support for Kubernetes. Spark 2.4 will have Python 
support. Dynamic allocation is being reworked, see [my comment 
above|https://issues.apache.org/jira/browse/SPARK-18278?focusedCommentId=16646282=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16646282].

> SPIP: Support native submission of spark jobs to a kubernetes cluster
> -
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Kubernetes, Scheduler, 
> Spark Core
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Assignee: Anirudh Ramanathan
>Priority: Major
>  Labels: SPIP
> Attachments: SPARK-18278 Spark on Kubernetes Design Proposal Revision 
> 2 (1).pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a Kubernetes cluster. The submitted application runs 
> in a driver executing on a Kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) SPIP: Support native submission of spark jobs to a kubernetes cluster

2018-10-15 Thread Oleg Frenkel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650519#comment-16650519
 ] 

Oleg Frenkel commented on SPARK-18278:
--

When is this fork expected to make its way into the official Spark 2.3 release? 
My company is specifically looking for Spark Python support with Kubernetes, 
and the current implementation in this fork is good enough for us. Thanks.

> SPIP: Support native submission of spark jobs to a kubernetes cluster
> -
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Kubernetes, Scheduler, 
> Spark Core
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Assignee: Anirudh Ramanathan
>Priority: Major
>  Labels: SPIP
> Attachments: SPARK-18278 Spark on Kubernetes Design Proposal Revision 
> 2 (1).pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a Kubernetes cluster. The submitted application runs 
> in a driver executing on a Kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-10-15 Thread Daniel Darabos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650492#comment-16650492
 ] 

Daniel Darabos commented on SPARK-20144:


Thanks Victor! I've expanded the test with a case where reordering is allowed, 
and I've added some explanatory comments in the test.

[~dongjoon], what do you think? Should I try to foster more discussion? Or what 
could be a next step?

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> that when we read parquet files in 2.0.2, the ordering of rows in the 
> resulting dataframe is not the same as the ordering of rows in the dataframe 
> that the parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is an 
> issue in 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16775) Remove deprecated accumulator v1 APIs

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16775:


Assignee: (was: Apache Spark)

> Remove deprecated accumulator v1 APIs
> -
>
> Key: SPARK-16775
> URL: https://issues.apache.org/jira/browse/SPARK-16775
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Major
>
> Deprecating the old accumulator API added a large number of warnings - many 
> of these could be fixed with a bit of refactoring to offer a non-deprecated 
> internal class while still preserving the external deprecation warnings.
> ... but we can also just remove the old deprecated API entirely in Spark 
> 3.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16775) Remove deprecated accumulator v1 APIs

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650472#comment-16650472
 ] 

Apache Spark commented on SPARK-16775:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22730

> Remove deprecated accumulator v1 APIs
> -
>
> Key: SPARK-16775
> URL: https://issues.apache.org/jira/browse/SPARK-16775
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Major
>
> Deprecating the old accumulator API added a large number of warnings - many 
> of these could be fixed with a bit of refactoring to offer a non-deprecated 
> internal class while still preserving the external deprecation warnings.
> ... but we can also just remove the old deprecated API entirely in Spark 
> 3.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16775) Remove deprecated accumulator v1 APIs

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16775:


Assignee: Apache Spark

> Remove deprecated accumulator v1 APIs
> -
>
> Key: SPARK-16775
> URL: https://issues.apache.org/jira/browse/SPARK-16775
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Major
>
> Deprecating the old accumulator API added a large number of warnings - many 
> of these could be fixed with a bit of refactoring to offer a non-deprecated 
> internal class while still preserving the external deprecation warnings.
> ... but we can also just remove the old deprecated API entirely in Spark 
> 3.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13478) Fetching delegation tokens for Hive fails when using proxy users

2018-10-15 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650451#comment-16650451
 ] 

Marcelo Vanzin commented on SPARK-13478:


You don't need a keytab to log in to kerberos...

> Fetching delegation tokens for Hive fails when using proxy users
> 
>
> Key: SPARK-13478
> URL: https://issues.apache.org/jira/browse/SPARK-13478
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.6.4, 2.0.0
>
>
> If you use spark-submit's proxy user support, the code that fetches 
> delegation tokens for the Hive Metastore fails. It seems like the Hive 
> library tries to connect to the Metastore as the proxy user, and it doesn't 
> have a Kerberos TGT for that user, so it fails.
> I don't know whether the same issue exists in the HBase code, but I'll make a 
> similar change so that both behave similarly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25737:


Assignee: Sean Owen  (was: Apache Spark)

> Remove JavaSparkContextVarargsWorkaround and standardize union() methods
> 
>
> Key: SPARK-25737
> URL: https://issues.apache.org/jira/browse/SPARK-25737
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> In ancient times in 2013, JavaSparkContext got a superclass 
> JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: 
> [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772]
> I believe this was really resolved by the {{@varags}} annotation in Scala 
> 2.9. 
> I believe we can now remove this workaround. Along the way, I think we can 
> also avoid the duplicated definitions of {{union()}}. Where we should be able 
> to just have one varargs method, we have up to 3 forms:
>  - {{union(RDD, Seq/List)}}
>  - {{union(RDD*)}}
>  - {{union(RDD, RDD*)}}
> While this pattern is sometimes used to avoid type collision due to erasure, 
> I don't think it applies here.
> After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods 
> (for the 3 Java RDD types), not 11 methods.
> The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, 
> rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or 
> {{sc.union(Seq(rdd1, rdd2): _*)}}
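A small sketch of the {{@varargs}} mechanics this relies on (plain Seq[Int] 
stands in for RDD; this is an illustration, not the proposed Spark change): a 
single varargs definition serves Scala callers directly and, via the generated 
array overload, Java callers too.

{code:scala}
import scala.annotation.varargs

class UnionContext {
  // One varargs definition; @varargs also emits a Java-friendly array overload.
  @varargs
  def union(parts: Seq[Int]*): Seq[Int] = parts.flatten
}

val ctx = new UnionContext
ctx.union(Seq(1, 2), Seq(3))            // direct varargs call
ctx.union(List(Seq(1, 2), Seq(3)): _*)  // expanding an existing collection
{code}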



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25737:


Assignee: Apache Spark  (was: Sean Owen)

> Remove JavaSparkContextVarargsWorkaround and standardize union() methods
> 
>
> Key: SPARK-25737
> URL: https://issues.apache.org/jira/browse/SPARK-25737
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> In ancient times in 2013, JavaSparkContext got a superclass 
> JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: 
> [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772]
> I believe this was really resolved by the {{@varags}} annotation in Scala 
> 2.9. 
> I believe we can now remove this workaround. Along the way, I think we can 
> also avoid the duplicated definitions of {{union()}}. Where we should be able 
> to just have one varargs method, we have up to 3 forms:
>  - {{union(RDD, Seq/List)}}
>  - {{union(RDD*)}}
>  - {{union(RDD, RDD*)}}
> While this pattern is sometimes used to avoid type collision due to erasure, 
> I don't think it applies here.
> After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods 
> (for the 3 Java RDD types), not 11 methods.
> The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, 
> rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or 
> {{sc.union(Seq(rdd1, rdd2): _*)}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650425#comment-16650425
 ] 

Apache Spark commented on SPARK-25737:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22729

> Remove JavaSparkContextVarargsWorkaround and standardize union() methods
> 
>
> Key: SPARK-25737
> URL: https://issues.apache.org/jira/browse/SPARK-25737
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> In ancient times in 2013, JavaSparkContext got a superclass 
> JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: 
> [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772]
> I believe this was really resolved by the {{@varags}} annotation in Scala 
> 2.9. 
> I believe we can now remove this workaround. Along the way, I think we can 
> also avoid the duplicated definitions of {{union()}}. Where we should be able 
> to just have one varargs method, we have up to 3 forms:
>  - {{union(RDD, Seq/List)}}
>  - {{union(RDD*)}}
>  - {{union(RDD, RDD*)}}
> While this pattern is sometimes used to avoid type collision due to erasure, 
> I don't think it applies here.
> After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods 
> (for the 3 Java RDD types), not 11 methods.
> The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, 
> rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or 
> {{sc.union(Seq(rdd1, rdd2): _*)}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16775) Remove deprecated accumulator v1 APIs

2018-10-15 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16775:
--
Affects Version/s: (was: 2.1.0)
   3.0.0
 Target Version/s: 3.0.0
  Description: 
Deprecating the old accumulator API added a large number of warnings - many of 
these could be fixed with a bit of refactoring to offer a non-deprecated 
internal class while still preserving the external deprecation warnings.

... but we can also just remove the old deprecated API entirely in Spark 3.0.0.

  was:Deprecating the old accumulator API added a large number of warnings - 
many of these could be fixed with a bit of refactoring to offer a 
non-deprecated internal class while still preserving the external deprecation 
warnings.

  Summary: Remove deprecated accumulator v1 APIs  (was: Reduce 
internal warnings from deprecated accumulator API)

I'm hijacking this older Jira to be about just removing the old deprecated API 
in Spark 3. At this point that's more likely to happen than cleaning up use of 
the deprecated API.

> Remove deprecated accumulator v1 APIs
> -
>
> Key: SPARK-16775
> URL: https://issues.apache.org/jira/browse/SPARK-16775
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Major
>
> Deprecating the old accumulator API added a large number of warnings - many 
> of these could be fixed with a bit of refactoring to offer a non-deprecated 
> internal class while still preserving the external deprecation warnings.
> ... but we can also just remove the old deprecated API entirely in Spark 
> 3.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24154) AccumulatorV2 loses type information during serialization

2018-10-15 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24154.
---
Resolution: Won't Fix

> AccumulatorV2 loses type information during serialization
> -
>
> Key: SPARK-24154
> URL: https://issues.apache.org/jira/browse/SPARK-24154
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.3.1
> Environment: Scala 2.11
> Spark 2.2.0
>Reporter: Sergey Zhemzhitsky
>Priority: Major
>
> AccumulatorV2 loses type information during serialization.
> It happens 
> [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L164]
>  during *writeReplace* call
> {code:scala}
> final protected def writeReplace(): Any = {
>   if (atDriverSide) {
> if (!isRegistered) {
>   throw new UnsupportedOperationException(
> "Accumulator must be registered before send to executor")
> }
> val copyAcc = copyAndReset()
> assert(copyAcc.isZero, "copyAndReset must return a zero value copy")
> val isInternalAcc = name.isDefined && 
> name.get.startsWith(InternalAccumulator.METRICS_PREFIX)
> if (isInternalAcc) {
>   // Do not serialize the name of internal accumulator and send it to 
> executor.
>   copyAcc.metadata = metadata.copy(name = None)
> } else {
>   // For non-internal accumulators, we still need to send the name 
> because users may need to
>   // access the accumulator name at executor side, or they may keep the 
> accumulators sent from
>   // executors and access the name when the registered accumulator is 
> already garbage
>   // collected(e.g. SQLMetrics).
>   copyAcc.metadata = metadata
> }
> copyAcc
>   } else {
> this
>   }
> }
> {code}
> It means that it is hardly possible to create new accumulators easily by 
> adding new behaviour to existing ones by means of mix-ins or inheritance 
> (without overriding *copy*).
> For example the following snippet ...
> {code:scala}
> trait TripleCount {
>   self: LongAccumulator =>
>   abstract override def add(v: jl.Long): Unit = {
> self.add(v * 3)
>   }
> }
> val acc = new LongAccumulator with TripleCount
> sc.register(acc)
> val data = 1 to 10
> val rdd = sc.makeRDD(data, 5)
> rdd.foreach(acc.add(_))
> acc.value shouldBe 3 * data.sum
> {code}
> ... fails with
> {code:none}
> org.scalatest.exceptions.TestFailedException: 55 was not equal to 165
>   at org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
>   at org.scalatest.Matchers$AnyShouldWrapper.shouldBe(Matchers.scala:6864)
> {code}
> Also, such behaviour seems error-prone and confusing, because an 
> implementor does not get the same thing as he/she sees in the code.
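As the description hints, one workaround (my own sketch, assuming the standard 
LongAccumulator API rather than anything from this ticket) is to subclass and 
override copy()/copyAndReset() so that writeReplace() ships the subtype to the 
executors:

{code:scala}
import java.{lang => jl}
import org.apache.spark.util.LongAccumulator

// Sketch: keep the tripling behaviour across serialization by making the
// copies returned from copy()/copyAndReset() be of the subtype as well.
class TripleCountAccumulator extends LongAccumulator {
  override def add(v: jl.Long): Unit = super.add(v * 3L)
  override def copyAndReset(): TripleCountAccumulator = new TripleCountAccumulator
  override def copy(): TripleCountAccumulator = {
    val acc = new TripleCountAccumulator
    acc.merge(this)  // LongAccumulator.merge copies sum and count over
    acc
  }
}
{code}

With this, the executors deserialize a TripleCountAccumulator, so add() keeps 
tripling there as well.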



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods

2018-10-15 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25737:
--
Environment: (was: In ancient times in 2013, JavaSparkContext got a 
superclass JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: 
[http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772]

I believe this was really resolved by the {{@varags}} annotation in Scala 2.9. 

I believe we can now remove this workaround. Along the way, I think we can also 
avoid the duplicated definitions of {{union()}}. Where we should be able to 
just have one varargs method, we have up to 3 forms:

- {{union(RDD, Seq/List)}}

- {{union(RDD*)}}

- {{union(RDD, RDD*)}}

While this pattern is sometimes used to avoid type collision due to erasure, I 
don't think it applies here.

After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods 
(for the 3 Java RDD types), not 11 methods.

The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, 
rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or 
{{sc.union(Seq(rdd1, rdd2): _*)}})
Description: 
In ancient times in 2013, JavaSparkContext got a superclass 
JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: 
[http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772]

I believe this was really resolved by the {{@varags}} annotation in Scala 2.9. 

I believe we can now remove this workaround. Along the way, I think we can also 
avoid the duplicated definitions of {{union()}}. Where we should be able to 
just have one varargs method, we have up to 3 forms:
 - {{union(RDD, Seq/List)}}

 - {{union(RDD*)}}

 - {{union(RDD, RDD*)}}

While this pattern is sometimes used to avoid type collision due to erasure, I 
don't think it applies here.

After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods 
(for the 3 Java RDD types), not 11 methods.

The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, 
rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or 
{{sc.union(Seq(rdd1, rdd2): _*)}}

> Remove JavaSparkContextVarargsWorkaround and standardize union() methods
> 
>
> Key: SPARK-25737
> URL: https://issues.apache.org/jira/browse/SPARK-25737
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> In ancient times in 2013, JavaSparkContext got a superclass 
> JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: 
> [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772]
> I believe this was really resolved by the {{@varags}} annotation in Scala 
> 2.9. 
> I believe we can now remove this workaround. Along the way, I think we can 
> also avoid the duplicated definitions of {{union()}}. Where we should be able 
> to just have one varargs method, we have up to 3 forms:
>  - {{union(RDD, Seq/List)}}
>  - {{union(RDD*)}}
>  - {{union(RDD, RDD*)}}
> While this pattern is sometimes used to avoid type collision due to erasure, 
> I don't think it applies here.
> After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods 
> (for the 3 Java RDD types), not 11 methods.
> The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, 
> rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or 
> {{sc.union(Seq(rdd1, rdd2): _*)}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods

2018-10-15 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25737:
--
Target Version/s: 3.0.0

> Remove JavaSparkContextVarargsWorkaround and standardize union() methods
> 
>
> Key: SPARK-25737
> URL: https://issues.apache.org/jira/browse/SPARK-25737
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> In ancient times in 2013, JavaSparkContext got a superclass 
> JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: 
> [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772]
> I believe this was really resolved by the {{@varags}} annotation in Scala 
> 2.9. 
> I believe we can now remove this workaround. Along the way, I think we can 
> also avoid the duplicated definitions of {{union()}}. Where we should be able 
> to just have one varargs method, we have up to 3 forms:
>  - {{union(RDD, Seq/List)}}
>  - {{union(RDD*)}}
>  - {{union(RDD, RDD*)}}
> While this pattern is sometimes used to avoid type collision due to erasure, 
> I don't think it applies here.
> After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods 
> (for the 3 Java RDD types), not 11 methods.
> The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, 
> rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or 
> {{sc.union(Seq(rdd1, rdd2): _*)}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods

2018-10-15 Thread Sean Owen (JIRA)
Sean Owen created SPARK-25737:
-

 Summary: Remove JavaSparkContextVarargsWorkaround and standardize 
union() methods
 Key: SPARK-25737
 URL: https://issues.apache.org/jira/browse/SPARK-25737
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 3.0.0
 Environment: In ancient times in 2013, JavaSparkContext got a 
superclass JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: 
[http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772]

I believe this was really resolved by the {{@varags}} annotation in Scala 2.9. 

I believe we can now remove this workaround. Along the way, I think we can also 
avoid the duplicated definitions of {{union()}}. Where we should be able to 
just have one varargs method, we have up to 3 forms:

- {{union(RDD, Seq/List)}}

- {{union(RDD*)}}

- {{union(RDD, RDD*)}}

While this pattern is sometimes used to avoid type collision due to erasure, I 
don't think it applies here.

After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods 
(for the 3 Java RDD types), not 11 methods.

The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, 
rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or 
{{sc.union(Seq(rdd1, rdd2): _*)}}
Reporter: Sean Owen
Assignee: Sean Owen






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25736) add tests to verify the behavior of multi-column count

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25736:


Assignee: Apache Spark  (was: Wenchen Fan)

> add tests to verify the behavior of multi-column count
> --
>
> Key: SPARK-25736
> URL: https://issues.apache.org/jira/browse/SPARK-25736
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25736) add tests to verify the behavior of multi-column count

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650375#comment-16650375
 ] 

Apache Spark commented on SPARK-25736:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22728

> add tests to verify the behavior of multi-column count
> --
>
> Key: SPARK-25736
> URL: https://issues.apache.org/jira/browse/SPARK-25736
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25736) add tests to verify the behavior of multi-column count

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25736:


Assignee: Wenchen Fan  (was: Apache Spark)

> add tests to verify the behavior of multi-column count
> --
>
> Key: SPARK-25736
> URL: https://issues.apache.org/jira/browse/SPARK-25736
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25736) add tests to verify the behavior of multi-column count

2018-10-15 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25736:
---

 Summary: add tests to verify the behavior of multi-column count
 Key: SPARK-25736
 URL: https://issues.apache.org/jira/browse/SPARK-25736
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25369) Replace Java shim functional interfaces like java.api.Function with Java 8 equivalents

2018-10-15 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650356#comment-16650356
 ] 

Sean Owen commented on SPARK-25369:
---

Example:
{code:java}
@FunctionalInterface
public interface Function<T1, R> extends Serializable {
  R call(T1 v1) throws Exception;
}{code}
becomes:
{code:java}
@FunctionalInterface
public interface Function<T1, R> extends Serializable,
    java.util.function.Function<T1, R> {
  R call(T1 v1) throws Exception;
  default R apply(T1 v1) {
    try {
      return call(v1);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}{code}
The upside, as far as I can tell, is that you'd be able to reuse a single 
function implementation in calls to Spark APIs and JDK APIs like in 
Collections. Then again... you can already write a lambda that calls a function 
easily.

I end up neutral on it.

> Replace Java shim functional interfaces like java.api.Function with Java 8 
> equivalents
> --
>
> Key: SPARK-25369
> URL: https://issues.apache.org/jira/browse/SPARK-25369
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Minor
>  Labels: release-notes
>
> In Spark 3, we should remove interfaces like 
> org.apache.spark.api.java.function.Function and replace with 
> java.util.function equivalents, for better compatibility with Java 8. This 
> would let callers pass, in more cases, an existing functional object in Java 
> rather than wrap in a lambda.
> It's possible to have the functional interfaces in Spark just extend Java 8 
> functional interfaces to interoperate better with existing code, but might be 
> as well to remove them in Spark 3 to clean up.
> A partial list of transitions from Spark to Java interfaces:
>  * Function<T1, R> -> Function<T1, R>
>  * Function0<R> -> Supplier<R>
>  * Function2<T1, T2, R> -> BiFunction<T1, T2, R>
>  * VoidFunction<T> -> Consumer<T>
>  * FlatMapFunction<T, R> etc -> extends Function<T, Iterator<R>> etc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-10-15 Thread Victor Tso (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650303#comment-16650303
 ] 

Victor Tso commented on SPARK-20144:


I looked at the PR and liked what I saw. I would only suggest the test be 
duplicated for true and/or the PR comment be added to the test code itself.

We have a customer where the records do not have a declared ordering of values 
but the records' appearance order is significant. Turning on our feature flag 
to use Spark's parquet reader violates their expectation of ordering so we are 
unable to shepherd them to this feature flag. Any help along the lines of this 
Jira and PR would be appreciated.

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> that when we read parquet files in 2.0.2, the ordering of rows in the 
> resulting dataframe is not the same as the ordering of rows in the dataframe 
> that the parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is an 
> issue in 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13478) Fetching delegation tokens for Hive fails when using proxy users

2018-10-15 Thread Sunayan Saikia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650302#comment-16650302
 ] 

Sunayan Saikia commented on SPARK-13478:


[~vanzin] As we know, the 'spark-submit' command by itself doesn't officially 
allow the '\-\-proxy\-user' and '\-\-principal/\-\-keytab' options to be used 
together. So how is this patch working?

> Fetching delegation tokens for Hive fails when using proxy users
> 
>
> Key: SPARK-13478
> URL: https://issues.apache.org/jira/browse/SPARK-13478
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.6.4, 2.0.0
>
>
> If you use spark-submit's proxy user support, the code that fetches 
> delegation tokens for the Hive Metastore fails. It seems like the Hive 
> library tries to connect to the Metastore as the proxy user, and it doesn't 
> have a Kerberos TGT for that user, so it fails.
> I don't know whether the same issue exists in the HBase code, but I'll make a 
> similar change so that both behave similarly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25735) Improve start-thriftserver.sh: print clean usage and exit with code 1

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650247#comment-16650247
 ] 

Apache Spark commented on SPARK-25735:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22727

> Improve start-thriftserver.sh: print clean usage and exit with code 1
> -
>
> Key: SPARK-25735
> URL: https://issues.apache.org/jira/browse/SPARK-25735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Currently if we run 
> sh start-thriftserver.sh -h
> we get 
> ...
> Thrift server options:
> 2018-10-15 21:45:39 INFO  HiveThriftServer2:54 - Starting SparkContext
> 2018-10-15 21:45:40 INFO  SparkContext:54 - Running Spark version 2.3.2
> 2018-10-15 21:45:40 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2018-10-15 21:45:40 ERROR SparkContext:91 - Error initializing SparkContext.
> org.apache.spark.SparkException: A master URL must be set in your 
> configuration
>   at org.apache.spark.SparkContext.(SparkContext.scala:367)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> 2018-10-15 21:45:40 ERROR Utils:91 - Uncaught exception in thread main
> After fix, the usage output is clean:
> Thrift server options:
> --hiveconf <property=value>  Use value for given property
> Also exit with code 1, to follow other scripts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25735) Improve start-thriftserver.sh: print clean usage and exit with code 1

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650244#comment-16650244
 ] 

Apache Spark commented on SPARK-25735:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22727

> Improve start-thriftserver.sh: print clean usage and exit with code 1
> -
>
> Key: SPARK-25735
> URL: https://issues.apache.org/jira/browse/SPARK-25735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Currently if we run 
> sh start-thriftserver.sh -h
> we get 
> ...
> Thrift server options:
> 2018-10-15 21:45:39 INFO  HiveThriftServer2:54 - Starting SparkContext
> 2018-10-15 21:45:40 INFO  SparkContext:54 - Running Spark version 2.3.2
> 2018-10-15 21:45:40 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2018-10-15 21:45:40 ERROR SparkContext:91 - Error initializing SparkContext.
> org.apache.spark.SparkException: A master URL must be set in your 
> configuration
>   at org.apache.spark.SparkContext.(SparkContext.scala:367)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> 2018-10-15 21:45:40 ERROR Utils:91 - Uncaught exception in thread main
> After fix, the usage output is clean:
> Thrift server options:
> --hiveconf <property=value>  Use value for given property
> Also exit with code 1, to follow other scripts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25735) Improve start-thriftserver.sh: print clean usage and exit with code 1

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25735:


Assignee: (was: Apache Spark)

> Improve start-thriftserver.sh: print clean usage and exit with code 1
> -
>
> Key: SPARK-25735
> URL: https://issues.apache.org/jira/browse/SPARK-25735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Currently if we run 
> sh start-thriftserver.sh -h
> we get 
> ...
> Thrift server options:
> 2018-10-15 21:45:39 INFO  HiveThriftServer2:54 - Starting SparkContext
> 2018-10-15 21:45:40 INFO  SparkContext:54 - Running Spark version 2.3.2
> 2018-10-15 21:45:40 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2018-10-15 21:45:40 ERROR SparkContext:91 - Error initializing SparkContext.
> org.apache.spark.SparkException: A master URL must be set in your 
> configuration
>   at org.apache.spark.SparkContext.(SparkContext.scala:367)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> 2018-10-15 21:45:40 ERROR Utils:91 - Uncaught exception in thread main
> After the fix, the usage output is clean:
> Thrift server options:
> --hiveconf    Use value for given property
> Also exit with code 1, to be consistent with the other scripts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25735) Improve start-thriftserver.sh: print clean usage and exit with code 1

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25735:


Assignee: Apache Spark

> Improve start-thriftserver.sh: print clean usage and exit with code 1
> -
>
> Key: SPARK-25735
> URL: https://issues.apache.org/jira/browse/SPARK-25735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Currently if we run 
> sh start-thriftserver.sh -h
> we get 
> ...
> Thrift server options:
> 2018-10-15 21:45:39 INFO  HiveThriftServer2:54 - Starting SparkContext
> 2018-10-15 21:45:40 INFO  SparkContext:54 - Running Spark version 2.3.2
> 2018-10-15 21:45:40 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2018-10-15 21:45:40 ERROR SparkContext:91 - Error initializing SparkContext.
> org.apache.spark.SparkException: A master URL must be set in your 
> configuration
>   at org.apache.spark.SparkContext.(SparkContext.scala:367)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> 2018-10-15 21:45:40 ERROR Utils:91 - Uncaught exception in thread main
> After the fix, the usage output is clean:
> Thrift server options:
> --hiveconf    Use value for given property
> Also exit with code 1, to be consistent with the other scripts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25735) Improve start-thriftserver.sh: print clean usage and exit with code 1

2018-10-15 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-25735:
--

 Summary: Improve start-thriftserver.sh: print clean usage and exit 
with code 1
 Key: SPARK-25735
 URL: https://issues.apache.org/jira/browse/SPARK-25735
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Gengliang Wang


Currently if we run 
sh start-thriftserver.sh -h

we get 

...
Thrift server options:
2018-10-15 21:45:39 INFO  HiveThriftServer2:54 - Starting SparkContext
2018-10-15 21:45:40 INFO  SparkContext:54 - Running Spark version 2.3.2
2018-10-15 21:45:40 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
2018-10-15 21:45:40 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.(SparkContext.scala:367)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
2018-10-15 21:45:40 ERROR Utils:91 - Uncaught exception in thread main

After the fix, the usage output is clean:
Thrift server options:
--hiveconf    Use value for given property

Also exit with code 1, to be consistent with the other scripts.
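For reference, the stack trace above is the generic failure of constructing a SparkContext without a master URL. A minimal Scala sketch (a standalone illustration, not part of the thrift server scripts) that raises the same exception:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object NoMasterRepro {
  def main(args: Array[String]): Unit = {
    // No setMaster(...) call and no --master flag supplied: SparkContext
    // construction throws
    // org.apache.spark.SparkException: A master URL must be set in your configuration
    val conf = new SparkConf().setAppName("no-master-repro")
    val sc = new SparkContext(conf)
    sc.stop()
  }
}
{code}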



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25723) spark sql External DataSource question

2018-10-15 Thread huanghuai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huanghuai updated SPARK-25723:
--
Priority: Minor  (was: Major)

> spark sql External DataSource question
> --
>
> Key: SPARK-25723
> URL: https://issues.apache.org/jira/browse/SPARK-25723
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: local mode
>Reporter: huanghuai
>Priority: Minor
> Attachments: QQ图片20181015182502.jpg
>
>
> trait PrunedFilteredScan {
>  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): 
> RDD[Row]
> }
>  
> If I implement this trait, I find that the requiredColumns parameter is 
> different every time. Why does the order differ?
> You can use spark.read.jdbc to connect to your local MySQL DB and debug at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130)
> to inspect this parameter;
> the attachment is my screenshot. 
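As a side note for anyone hitting this: the order of requiredColumns is not guaranteed, so an implementation should project by column name rather than by position. Below is a minimal sketch (a hypothetical in-memory relation, not the reporter's code) of a PrunedFilteredScan that does not depend on the order of requiredColumns:

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical relation over an in-memory Seq of records, shown only to
// illustrate handling requiredColumns independently of its order.
class MapBackedRelation(data: Seq[Map[String, Any]],
                        override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType)))

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // Project by column name, in whatever order Spark requested;
    // never assume requiredColumns matches the schema order.
    sqlContext.sparkContext.parallelize(data).map { record =>
      Row.fromSeq(requiredColumns.map(col => record.getOrElse(col, null)))
    }
  }
}
{code}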



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25727) makeCopy failed in InMemoryRelation

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650153#comment-16650153
 ] 

Apache Spark commented on SPARK-25727:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22726

> makeCopy failed in InMemoryRelation
> ---
>
> Key: SPARK-25727
> URL: https://issues.apache.org/jira/browse/SPARK-25727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> {code}
> val data = Seq(100).toDF("count").cache()
> data.queryExecution.optimizedPlan.toJSON
> {code}
> The above code can generate the following error:
> {code}
> assertion failed: InMemoryRelation fields: output, cacheBuilder, 
> statsOfPlanToCache, outputOrdering, values: List(count#178), 
> CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
> replicas),*(1) Project [value#176 AS count#178]
> +- LocalTableScan [value#176]
> ,None), Statistics(sizeInBytes=12.0 B, hints=none)
> java.lang.AssertionError: assertion failed: InMemoryRelation fields: output, 
> cacheBuilder, statsOfPlanToCache, outputOrdering, values: List(count#178), 
> CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
> replicas),*(1) Project [value#176 AS count#178]
> +- LocalTableScan [value#176]
> ,None), Statistics(sizeInBytes=12.0 B, hints=none)
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.jsonFields(TreeNode.scala:611)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$collectJsonValue$1(TreeNode.scala:599)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.jsonValue(TreeNode.scala:604)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:590)
> {code}
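Until the linked fix lands, one way to inspect the cached plan without hitting the assertion is to render it as text instead of JSON. A workaround sketch, assuming a spark-shell session with implicits in scope:

{code:scala}
// Same cached plan as above, but rendered with treeString instead of toJSON,
// which does not go through the failing jsonFields assertion.
val data = Seq(100).toDF("count").cache()
println(data.queryExecution.optimizedPlan.treeString(verbose = true))
{code}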



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24610) wholeTextFiles broken for small files

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650109#comment-16650109
 ] 

Apache Spark commented on SPARK-24610:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/22725

> wholeTextFiles broken for small files
> -
>
> Key: SPARK-24610
> URL: https://issues.apache.org/jira/browse/SPARK-24610
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Dhruve Ashar
>Assignee: Dhruve Ashar
>Priority: Minor
> Fix For: 2.4.0
>
>
> Spark is unable to read small files using the wholeTextFiles method when 
> split size related configs are specified - either explicitly or if they are 
> contained in other config files like hive-site.xml.
> For small files, the maxSplitSize computed by `WholeTextFileInputFormat` is 
> far smaller than the default or commonly used split size of 64/128M, and Spark 
> throws an exception while trying to read them.  
>  
> To reproduce the issue: 
> {code:java}
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --conf 
> "spark.hadoop.mapreduce.input.fileinputformat.split.minsize.per.node=123456"
> scala> sc.wholeTextFiles("file:///etc/passwd").count
> java.io.IOException: Minimum split size pernode 123456 cannot be larger than 
> maximum split size 9962
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
> at 
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
> at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
> ... 48 elided
> // For hdfs
> sc.wholeTextFiles("smallFile").count
> java.io.IOException: Minimum split size pernode 123456 cannot be larger than 
> maximum split size 15
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
> at 
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
> at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
> ... 48 elided{code}
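A possible workaround sketch (untested, assuming the offending settings were inherited via spark.hadoop.* or hive-site.xml) is to drop them from the Hadoop configuration before calling wholeTextFiles:

{code:scala}
// Remove the inherited minimum-split-size settings so that
// WholeTextFileInputFormat can use the small maxSplitSize it computes itself.
sc.hadoopConfiguration.unset("mapreduce.input.fileinputformat.split.minsize.per.node")
sc.hadoopConfiguration.unset("mapreduce.input.fileinputformat.split.minsize.per.rack")

sc.wholeTextFiles("file:///etc/passwd").count
{code}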



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24610) wholeTextFiles broken for small files

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650106#comment-16650106
 ] 

Apache Spark commented on SPARK-24610:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/22725

> wholeTextFiles broken for small files
> -
>
> Key: SPARK-24610
> URL: https://issues.apache.org/jira/browse/SPARK-24610
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Dhruve Ashar
>Assignee: Dhruve Ashar
>Priority: Minor
> Fix For: 2.4.0
>
>
> Spark is unable to read small files using the wholeTextFiles method when 
> split size related configs are specified - either explicitly or if they are 
> contained in other config files like hive-site.xml.
> For small files, the maxSplitSize computed by `WholeTextFileInputFormat` is 
> far smaller than the default or commonly used split size of 64/128M, and Spark 
> throws an exception while trying to read them.  
>  
> To reproduce the issue: 
> {code:java}
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --conf 
> "spark.hadoop.mapreduce.input.fileinputformat.split.minsize.per.node=123456"
> scala> sc.wholeTextFiles("file:///etc/passwd").count
> java.io.IOException: Minimum split size pernode 123456 cannot be larger than 
> maximum split size 9962
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
> at 
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
> at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
> ... 48 elided
> // For hdfs
> sc.wholeTextFiles("smallFile").count
> java.io.IOException: Minimum split size pernode 123456 cannot be larger than 
> maximum split size 15
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
> at 
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
> at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
> ... 48 elided{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25734) Literal should have a value corresponding to dataType

2018-10-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650059#comment-16650059
 ] 

Apache Spark commented on SPARK-25734:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22724

> Literal should have a value corresponding to dataType
> -
>
> Key: SPARK-25734
> URL: https://issues.apache.org/jira/browse/SPARK-25734
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Takeshi Yamamuro
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25734) Literal should have a value corresponding to dataType

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25734:


Assignee: Apache Spark

> Literal should have a value corresponding to dataType
> -
>
> Key: SPARK-25734
> URL: https://issues.apache.org/jira/browse/SPARK-25734
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25734) Literal should have a value corresponding to dataType

2018-10-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25734:


Assignee: (was: Apache Spark)

> Literal should have a value corresponding to dataType
> -
>
> Key: SPARK-25734
> URL: https://issues.apache.org/jira/browse/SPARK-25734
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Takeshi Yamamuro
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25734) Literal should have a value corresponding to dataType

2018-10-15 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-25734:


 Summary: Literal should have a value corresponding to dataType
 Key: SPARK-25734
 URL: https://issues.apache.org/jira/browse/SPARK-25734
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.2
Reporter: Takeshi Yamamuro






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25723) spark sql External DataSource question

2018-10-15 Thread huanghuai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huanghuai reopened SPARK-25723:
---

reproduce

> spark sql External DataSource question
> --
>
> Key: SPARK-25723
> URL: https://issues.apache.org/jira/browse/SPARK-25723
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: local mode
>Reporter: huanghuai
>Priority: Major
> Attachments: QQ图片20181015182502.jpg
>
>
> trait PrunedFilteredScan {
>  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): 
> RDD[Row]
> }
>  
> If I implement this trait, I find that the requiredColumns parameter is 
> different every time. Why does the order differ?
> You can use spark.read.jdbc to connect to your local MySQL DB and debug at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130)
> to inspect this parameter;
> the attachment is my screenshot. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25723) spark sql External DataSource question

2018-10-15 Thread huanghuai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huanghuai updated SPARK-25723:
--
Description: 
trait PrunedFilteredScan {
 def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}

 

If I implement this trait, I find that the requiredColumns parameter is 
different every time. Why does the order differ?

You can use spark.read.jdbc to connect to your local MySQL DB and debug at 

org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130)

to inspect this parameter;

the attachment is my screenshot. 
Summary: spark sql External DataSource question  (was: spark sql 
External DataSource)

> spark sql External DataSource question
> --
>
> Key: SPARK-25723
> URL: https://issues.apache.org/jira/browse/SPARK-25723
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: local mode
>Reporter: huanghuai
>Priority: Major
> Attachments: QQ图片20181015182502.jpg
>
>
> trait PrunedFilteredScan {
>  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): 
> RDD[Row]
> }
>  
> If I implement this trait, I find that the requiredColumns parameter is 
> different every time. Why does the order differ?
> You can use spark.read.jdbc to connect to your local MySQL DB and debug at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130)
> to inspect this parameter;
> the attachment is my screenshot. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25723) spark sql External DataSource

2018-10-15 Thread huanghuai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huanghuai updated SPARK-25723:
--
Attachment: QQ图片20181015182502.jpg

> spark sql External DataSource
> -
>
> Key: SPARK-25723
> URL: https://issues.apache.org/jira/browse/SPARK-25723
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: local mode
>Reporter: huanghuai
>Priority: Major
> Attachments: QQ图片20181015182502.jpg
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25723) spark sql External DataSource

2018-10-15 Thread huanghuai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huanghuai updated SPARK-25723:
--
Description: (was: {color:#33}*spark.read()*{color}

{color:#33}*.format("com.myself.datasource")*{color}

{color:#33}*.option("ur","")*{color}

{color:#33}*.load()*{color}

{color:#FF}*.show()*{color}

 

{color:#FF}*Driver stacktrace:*{color}
{color:#FF} *at*{color} 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
 ~[spark-core_2.11-2.3.0.jar:2.3.0]
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
 ~[spark-core_2.11-2.3.0.jar:2.3.0]
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
 ~[spark-core_2.11-2.3.0.jar:2.3.0]
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
~[scala-library-2.11.8.jar:?]
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) 
~[scala-library-2.11.8.jar:?]
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586) 
~[spark-core_2.11-2.3.0.jar:2.3.0]
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
 ~[spark-core_2.11-2.3.0.jar:2.3.0]
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
 ~[spark-core_2.11-2.3.0.jar:2.3.0]
 at scala.Option.foreach(Option.scala:257) ~[scala-library-2.11.8.jar:?]
 at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
 ~[spark-core_2.11-2.3.0.jar:2.3.0]
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
 ~[spark-core_2.11-2.3.0.jar:2.3.0]
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
 ~[spark-core_2.11-2.3.0.jar:2.3.0]
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
 ~[spark-core_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
~[spark-core_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) 
~[spark-core_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027) 
~[spark-core_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2048) 
~[spark-core_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2067) 
~[spark-core_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363) 
~[spark-sql_2.11-2.3.0.jar:2.3.0]
 at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) 
~[spark-sql_2.11-2.3.0.jar:2.3.0]
 at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
 ~[spark-sql_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484) 
~[spark-sql_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484) 
~[spark-sql_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) 
~[spark-sql_2.11-2.3.0.jar:2.3.0]
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
 ~[spark-sql_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) 
~[spark-sql_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.sql.Dataset.head(Dataset.scala:2484) 
~[spark-sql_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.sql.Dataset.take(Dataset.scala:2698) 
~[spark-sql_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.sql.Dataset.showString(Dataset.scala:254) 
~[spark-sql_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.sql.Dataset.show(Dataset.scala:723) 
~[spark-sql_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.sql.Dataset.show(Dataset.scala:682) 
~[spark-sql_2.11-2.3.0.jar:2.3.0]
 at org.apache.spark.sql.Dataset.show(Dataset.scala:691) 
~[spark-sql_2.11-2.3.0.jar:2.3.0]


{color:#FF}*Caused by: scala.MatchError: 23.25 (of class 
java.lang.Double)*{color}
{color:#FF} *at*{color} 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
 ~[spark-catalyst_2.11-2.3.0.jar:2.3.0]
 at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
 ~[spark-catalyst_2.11-2.3.0.jar:2.3.0]
 at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
 ~[spark-catalyst_2.11-2.3.0.jar:2.3.0]
 at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
 ~[spark-catalyst_2.11-2.3.0.jar:2.3.0]
 at 

[jira] [Commented] (SPARK-25527) Job stuck waiting for last stage to start

2018-10-15 Thread Ran Haim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649990#comment-16649990
 ] 

Ran Haim commented on SPARK-25527:
--

Any update?

> Job stuck waiting for last stage to start
> -
>
> Key: SPARK-25527
> URL: https://issues.apache.org/jira/browse/SPARK-25527
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Ran Haim
>Priority: Major
> Attachments: threaddumpjob.txt
>
>
> Sometimes it can somehow happen that a job is stuck waiting for the last 
> stage to start.
> There are no Tasks waiting for completion, and the job just hangs.
> There are available Executors for the job to run.
> I do not know how to reproduce this; all I know is that it happens randomly 
> after a couple of days of heavy load.
> Another thing that might help: it seems to happen when some tasks fail 
> because one or more executors are killed (due to memory issues or similar).
> Those tasks eventually do get finished by other executors because of retries, 
> but the next stage hangs.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work

2018-10-15 Thread Bihui Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bihui Jin updated SPARK-25733:
--
Environment: 
Spark in standalone mode, and 48 cores are available.

spark-defaults.conf is as below:
spark.pyshark.python /usr/bin/python3.6
spark.driver.memory 4g
spark.executor.memory 8g

 

other configurations are at default.

  was: Spark in standalone mode, and 48 cores are available.


> The method toLocalIterator() with dataframe doesn't work
> 
>
> Key: SPARK-25733
> URL: https://issues.apache.org/jira/browse/SPARK-25733
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Spark in standalone mode, and 48 cores are available.
> spark-defaults.conf is as below:
> spark.pyshark.python /usr/bin/python3.6
> spark.driver.memory 4g
> spark.executor.memory 8g
>  
> other configurations are at default.
>Reporter: Bihui Jin
>Priority: Major
> Attachments: report_dataset.zip.001, report_dataset.zip.002
>
>
> {color:#FF}The dataset which I used is attached.{color}
>  
> First I loaded a dataframe from local disk:
> df = spark.read.load('report_dataset')
> there are about 200 partitions stored in s3, and the max size of partitions 
> is 28.37MB.
>  
> after the data loaded, I executed "df.take(1)" to test the dataframe, and the 
> expected output was printed: 
> "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html',
>  sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, 
> 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, 
> 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 
> 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 
> 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 
> 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], 
> next_word=575, line_num=12)]" 
>  
> Then I tried to convert the dataframe to a local iterator and print one row 
> for testing, using the code below:
> for row in df.toLocalIterator():
>     print(row)
>     break
> {color:#ff}*But there is no output printed after that code 
> executed.*{color}
>  
> Then I execute "df.take(1)" and blew error is reported:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 985, in send_command
> response = connection.send_command(command)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
> server (127.0.0.1:37735)
> Traceback (most recent call last):
> File 
> "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
> line 2963, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
> File "", line 1, in 
> df.take(1)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 504, in take
> return self.limit(num).collect()
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 493, in limit
> jdf = self._jdf.limit(num)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, 
> in deco
> return f(*a, **kw)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in 
> get_return_value
> format(target_id, ".", name))
> py4j.protocol.Py4JError: An error occurred while calling o29.limit
> During handling of the above 

[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-10-15 Thread shijinkui (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649972#comment-16649972
 ] 

shijinkui commented on SPARK-24630:
---

I prefer it without the stream keyword, because in the future the batch and stream 
APIs must be unified. 

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface, with which users 
> with little knowledge about streaming can easily develop a stream processing 
> model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL streaming, there are two key points: 
> 1. The analysis phase should be able to parse streaming-type SQL. 
> 2. The analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work

2018-10-15 Thread Bihui Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bihui Jin updated SPARK-25733:
--
Attachment: report_dataset.zip.002

> The method toLocalIterator() with dataframe doesn't work
> 
>
> Key: SPARK-25733
> URL: https://issues.apache.org/jira/browse/SPARK-25733
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Spark in standalone mode, and 48 cores are available.
>Reporter: Bihui Jin
>Priority: Major
> Attachments: report_dataset.zip.001, report_dataset.zip.002
>
>
> {color:#FF}The dataset which I used is attached.{color}
>  
> First I loaded a dataframe from local disk:
> df = spark.read.load('report_dataset')
> there are about 200 partitions stored in s3, and the max size of partitions 
> is 28.37MB.
>  
> after the data loaded, I executed "df.take(1)" to test the dataframe, and the 
> expected output was printed: 
> "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html',
>  sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, 
> 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, 
> 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 
> 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 
> 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 
> 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], 
> next_word=575, line_num=12)]" 
>  
> Then I tried to convert the dataframe to a local iterator and print one row 
> for testing, using the code below:
> for row in df.toLocalIterator():
>     print(row)
>     break
> {color:#ff}*But there is no output printed after that code 
> executed.*{color}
>  
> Then I execute "df.take(1)" and blew error is reported:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 985, in send_command
> response = connection.send_command(command)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
> server (127.0.0.1:37735)
> Traceback (most recent call last):
> File 
> "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
> line 2963, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
> File "", line 1, in 
> df.take(1)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 504, in take
> return self.limit(num).collect()
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 493, in limit
> jdf = self._jdf.limit(num)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, 
> in deco
> return f(*a, **kw)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in 
> get_return_value
> format(target_id, ".", name))
> py4j.protocol.Py4JError: An error occurred while calling o29.limit
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File 
> "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
> line 1863, in showtraceback
> stb = value._render_traceback_()
> AttributeError: 'Py4JError' object has no attribute '_render_traceback_'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File 

[jira] [Updated] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work

2018-10-15 Thread Bihui Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bihui Jin updated SPARK-25733:
--
Attachment: report_dataset.zip.001

> The method toLocalIterator() with dataframe doesn't work
> 
>
> Key: SPARK-25733
> URL: https://issues.apache.org/jira/browse/SPARK-25733
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Spark in standalone mode, and 48 cores are available.
>Reporter: Bihui Jin
>Priority: Major
> Attachments: report_dataset.zip.001, report_dataset.zip.002
>
>
> {color:#FF}The dataset which I used is attached.{color}
>  
> First I loaded a dataframe from local disk:
> df = spark.read.load('report_dataset')
> there are about 200 partitions stored in s3, and the max size of partitions 
> is 28.37MB.
>  
> after the data loaded, I executed "df.take(1)" to test the dataframe, and the 
> expected output was printed: 
> "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html',
>  sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, 
> 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, 
> 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 
> 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 
> 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 
> 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], 
> next_word=575, line_num=12)]" 
>  
> Then I tried to convert the dataframe to a local iterator and print one row 
> for testing, using the code below:
> for row in df.toLocalIterator():
>     print(row)
>     break
> {color:#ff}*But there is no output printed after that code 
> executed.*{color}
>  
> Then I execute "df.take(1)" and blew error is reported:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 985, in send_command
> response = connection.send_command(command)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
> server (127.0.0.1:37735)
> Traceback (most recent call last):
> File 
> "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
> line 2963, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
> File "", line 1, in 
> df.take(1)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 504, in take
> return self.limit(num).collect()
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 493, in limit
> jdf = self._jdf.limit(num)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, 
> in deco
> return f(*a, **kw)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in 
> get_return_value
> format(target_id, ".", name))
> py4j.protocol.Py4JError: An error occurred while calling o29.limit
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File 
> "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
> line 1863, in showtraceback
> stb = value._render_traceback_()
> AttributeError: 'Py4JError' object has no attribute '_render_traceback_'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File 

[jira] [Updated] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work

2018-10-15 Thread Bihui Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bihui Jin updated SPARK-25733:
--
Description: 
{color:#FF}The dataset which I used is attached.{color}

 

First I loaded a dataframe from local disk:

df = spark.read.load('report_dataset')

there are about 200 partitions stored in s3, and the max size of partitions is 
28.37MB.

 

after the data loaded, I executed "df.take(1)" to test the dataframe, and the expected 
output was printed: 
"[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html',
 sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, 49, 
389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, 125, 
85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, 29, 
179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, 29, 179, 
180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 
575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 
88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389, 
575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], next_word=575, 
line_num=12)]" 

 

Then I tried to convert the dataframe to a local iterator and print one row 
for testing, using the code below:

for row in df.toLocalIterator():

    print(row)

    break

{color:#ff}*But there is no output printed after that code executed.*{color}

 

Then I execute "df.take(1)" and blew error is reported:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1159, 
in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1159, 
in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 985, 
in send_command
response = connection.send_command(command)
File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1164, 
in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
server (127.0.0.1:37735)
Traceback (most recent call last):
File 
"/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in 
df.take(1)
File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
504, in take
return self.limit(num).collect()
File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
493, in limit
jdf = self._jdf.limit(num)
File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, 
in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, 
in deco
return f(*a, **kw)
File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in 
get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o29.limit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File 
"/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
line 1863, in showtraceback
stb = value._render_traceback_()
AttributeError: 'Py4JError' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 929, 
in _get_connection
connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1067, 
in start
self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
server (127.0.0.1:37735)
Traceback (most recent call last):
File 
"/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
line 2963, in run_code
exec(code_obj, self.user_global_ns, 

[jira] [Created] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work

2018-10-15 Thread Bihui Jin (JIRA)
Bihui Jin created SPARK-25733:
-

 Summary: The method toLocalIterator() with dataframe doesn't work
 Key: SPARK-25733
 URL: https://issues.apache.org/jira/browse/SPARK-25733
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.1
 Environment: Spark in standalone mode, and 48 cores are available.
Reporter: Bihui Jin


First I loaded a dataframe from local disk:

df = spark.read.load('report_dataset')

there are about 200 partitions stored in s3, and the max size of partitions is 
28.37MB.

 

after the data loaded, I executed "df.take(1)" to test the dataframe, and the expected 
output was printed: 
"[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html',
 sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, 49, 
389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, 125, 
85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, 29, 
179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, 29, 179, 
180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 
575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 
88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389, 
575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], next_word=575, 
line_num=12)]" 

 

Then I tried to convert the dataframe to a local iterator and print one row 
for testing, using the code below:

for row in df.toLocalIterator():

    print(row)

    break

{color:#FF}*But there is no output printed after that code executed.*{color}

 

Then I execute "df.take(1)" and blew error is reported:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
985, in send_command
response = connection.send_command(command)
  File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
server (127.0.0.1:37735)
Traceback (most recent call last):
  File 
"/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in 
df.take(1)
  File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
504, in take
return self.limit(num).collect()
  File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
493, in limit
jdf = self._jdf.limit(num)
  File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, 
in deco
return f(*a, **kw)
  File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in 
get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o29.limit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
line 1863, in showtraceback
stb = value._render_traceback_()
AttributeError: 'Py4JError' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
929, in _get_connection
connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
1067, in start
self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused
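
For comparison, the Scala-side equivalent of the failing loop looks like the sketch below. toLocalIterator pulls one partition at a time to the driver, so the driver needs enough memory for the largest partition; the path and session are assumed from the report.

{code:scala}
import scala.collection.JavaConverters._

// Scala analogue of the PySpark loop above: Dataset.toLocalIterator returns
// a java.util.Iterator[Row] that fetches one partition at a time.
val df = spark.read.load("report_dataset")
val it = df.toLocalIterator().asScala
it.take(1).foreach(println)
{code}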

[jira] [Updated] (SPARK-25731) Spark Structured Streaming Support for Kafka 2.0

2018-10-15 Thread Chandan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandan updated SPARK-25731:

Labels: beginner features  (was: )

> Spark Structured Streaming Support for Kafka 2.0
> 
>
> Key: SPARK-25731
> URL: https://issues.apache.org/jira/browse/SPARK-25731
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Chandan
>Priority: Major
>  Labels: beginner, features
>
> [https://github.com/apache/spark/tree/master/external]
> As far as I can see, this doesn't have support for the newly released 
> *Kafka 2.0*; support is available only up to *kafka-0-10*.
> If we use
> "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0"
> for Kafka 2.0, the error below is what I get:
> 11:46:18.061 [stream execution thread for [id = 
> e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
> bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
> org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
> groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
>  *Completed connection to node -1. Fetching API versions.*
>  11:46:18.061 [stream execution thread for [id = 
> e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
> bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
> org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
> groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
>  *Initiating API versions fetch from node -1.*
>  11:46:18.452 [stream execution thread for [id = 
> e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
> bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
> org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-2, 
> groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
>  Connection with *kafka-muhammad-45e0.aivencloud.com/18.203.67.147 
> disconnected*
>  *java.io.EOFException: null*
>  at 
> org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119)
>  at 
> org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335)
>  at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296)
>  
>  
> I might be wrong, but opening an issue seemed the best option. 
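
For context, Structured Streaming consumes Kafka through the spark-sql-kafka-0-10 connector rather than the DStream-based spark-streaming-kafka-0-10 artifact quoted above. A minimal sketch of such a source, with placeholder broker address and topic name, might look like this:

{code:python}
# Hypothetical minimal Structured Streaming read from Kafka; the broker
# address and topic name are placeholders, not values from this report.
# Requires the Structured Streaming connector, e.g.
#   --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker-host:9092")
          .option("subscribe", "example-topic")
          .load())

# Print raw key/value bytes as strings to the console for inspection.
query = (stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())

query.awaitTermination()
{code}

Whether that connector works against a given broker version still depends on the Kafka client libraries it ships with, which is the question this issue raises.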



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-15 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649917#comment-16649917
 ] 

Marco Gaido commented on SPARK-25732:
-

cc [~vanzin] [~tgraves] [~jerryshao] [~mridul]. Sorry for pinging you; could you 
please share your view on this? Do you have any concerns or comments? Thanks.

> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with a proxy user fail after 2 weeks due to 
> the lack of token renewal. In order to enable it, we need the 
> keytab/principal of the impersonated user to be specified, so that they are 
> available for token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, the latter allowing a keytab to be specified also 
> on a distributed FS, so that applications can be submitted by servers (e.g. 
> Livy, Zeppelin) without needing all users' principals on that machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25731) Spark Structured Streaming Support for Kafka 2.0

2018-10-15 Thread Chandan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandan updated SPARK-25731:

Description: 
[https://github.com/apache/spark/tree/master/external]

As far as I can see, this doesn't have support for the newly released 
*Kafka 2.0*; support is available only up to *kafka-0-10*.

If we use

"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0"

for Kafka 2.0, the error below is what I get:

11:46:18.061 [stream execution thread for [id = 
e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
 *Completed connection to node -1. Fetching API versions.*
 11:46:18.061 [stream execution thread for [id = 
e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
 *Initiating API versions fetch from node -1.*
 11:46:18.452 [stream execution thread for [id = 
e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-2, 
groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
 Connection with *kafka-muhammad-45e0.aivencloud.com/18.203.67.147 disconnected*
 *java.io.EOFException: null*
 at 
org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119)
 at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335)
 at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296)

 

 

I might be wrong, but opening an issue seemed the best option. 

  was:
[https://github.com/apache/spark/tree/master/external]

As far as I can see, 
 This doesn't have support for newly release *kafka2.0,*
 support is available only till *kafka-0-10.*

If we use the 

"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0"

for kafka2.0, below is the error I get

11:46:18.061 [stream execution thread for [id = 
e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
 *Completed connection to node -1. Fetching API versions.*
 11:46:18.061 [stream execution thread for [id = 
e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
 *Initiating API versions fetch from node -1.*
 11:46:18.452 [stream execution thread for [id = 
e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-2, 
groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
 Connection with *kafka-muhammad-45e0.aivencloud.com/18.203.67.147 disconnected*
 *java.io.EOFException: null*
 at 
org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119)
 at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335)
 at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296)


> Spark Structured Streaming Support for Kafka 2.0
> 
>
> Key: SPARK-25731
> URL: https://issues.apache.org/jira/browse/SPARK-25731
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Chandan
>Priority: Major
>
> [https://github.com/apache/spark/tree/master/external]
> As far as I can see, this doesn't have support for the newly released 
> *Kafka 2.0*; support is available only up to *kafka-0-10*.
> If we use
> "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0"
> for Kafka 2.0, the error below is what I get:
> 11:46:18.061 [stream execution thread for [id = 
> e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
> bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
> org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
> groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
>  *Completed connection to node -1. Fetching API versions.*
>  11:46:18.061 [stream execution thread for [id = 
> e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
> bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
> org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
> 

[jira] [Created] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-15 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-25732:
---

 Summary: Allow specifying a keytab/principal for proxy user for 
token renewal 
 Key: SPARK-25732
 URL: https://issues.apache.org/jira/browse/SPARK-25732
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 2.4.0
Reporter: Marco Gaido


As of now, applications submitted with a proxy user fail after 2 weeks due to 
the lack of token renewal. In order to enable it, we need the keytab/principal 
of the impersonated user to be specified, so that they are available for token 
renewal.

This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
{{--proxy-user-keytab}}, the latter allowing a keytab to be specified also on 
a distributed FS, so that applications can be submitted by servers (e.g. Livy, 
Zeppelin) without needing all users' principals on that machine.
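
To illustrate the proposal, a hypothetical submission from a server such as Livy might pass the new parameters as sketched below; {{--proxy-user-principal}} and {{--proxy-user-keytab}} are the flags proposed here and do not exist in Spark yet, so this is only an illustration of the intended usage:

{code:python}
# Hypothetical illustration of the proposed flags; --proxy-user-principal and
# --proxy-user-keytab are not implemented in Spark and are shown only to
# sketch how a submission server might pass them.
import subprocess

cmd = [
    "spark-submit",
    "--master", "yarn",
    "--deploy-mode", "cluster",
    "--proxy-user", "alice",                                 # impersonated user (existing flag)
    "--proxy-user-principal", "alice@EXAMPLE.COM",           # proposed flag
    "--proxy-user-keytab", "hdfs:///keytabs/alice.keytab",   # proposed flag; keytab on a distributed FS
    "app.py",
]
subprocess.run(cmd, check=True)
{code}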



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25731) Spark Structured Streaming Support for Kafka 2.0

2018-10-15 Thread Chandan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandan updated SPARK-25731:

Description: 
[https://github.com/apache/spark/tree/master/external]

As far as I can see, this doesn't have support for the newly released 
*Kafka 2.0*; support is available only up to *kafka-0-10*.

If we use

"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0"

for Kafka 2.0, the error below is what I get:

11:46:18.061 [stream execution thread for [id = 
e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
 *Completed connection to node -1. Fetching API versions.*
 11:46:18.061 [stream execution thread for [id = 
e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
 *Initiating API versions fetch from node -1.*
 11:46:18.452 [stream execution thread for [id = 
e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-2, 
groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
 Connection with *kafka-muhammad-45e0.aivencloud.com/18.203.67.147 disconnected*
 *java.io.EOFException: null*
 at 
org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119)
 at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335)
 at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296)

  was:
[https://github.com/apache/spark/tree/master/external]

As far as I can see, 
This doesn't have support for newly release kafka2.0,
support is available only till kafka-0-10.

If we use the 

"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0"

for kafka2.0, below is the error I get

11:46:18.061 [stream execution thread for [id = 
e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
 *Completed connection to node -1. Fetching API versions.*
11:46:18.061 [stream execution thread for [id = 
e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
 *Initiating API versions fetch from node -1.*
11:46:18.452 [stream execution thread for [id = 
e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-2, 
groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
 Connection with *kafka-muhammad-45e0.aivencloud.com/18.203.67.147 disconnected*
*java.io.EOFException: null*
 at 
org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119)
 at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335)
 at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296)


> Spark Structured Streaming Support for Kafka 2.0
> 
>
> Key: SPARK-25731
> URL: https://issues.apache.org/jira/browse/SPARK-25731
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Chandan
>Priority: Major
>
> [https://github.com/apache/spark/tree/master/external]
> As far as I can see, this doesn't have support for the newly released 
> *Kafka 2.0*; support is available only up to *kafka-0-10*.
> If we use
> "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0"
> for Kafka 2.0, the error below is what I get:
> 11:46:18.061 [stream execution thread for [id = 
> e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
> bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
> org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
> groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
>  *Completed connection to node -1. Fetching API versions.*
>  11:46:18.061 [stream execution thread for [id = 
> e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
> bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
> org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
> groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
>  *Initiating API versions fetch from node -1.*
>  
