[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651152#comment-16651152 ] Gengliang Wang commented on SPARK-25588: [~srowen] I tried the test case in branch-2.3, which uses avro 1.7.7. It can be reproduced. It seems not related to the upgrade of Avro 1.7.7 to 1.8.2. > SchemaParseException: Can't redefine: list when reading from Parquet > > > Key: SPARK-25588 > URL: https://issues.apache.org/jira/browse/SPARK-25588 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 > Environment: Spark version 2.3.2 >Reporter: Michael Heuer >Priority: Major > > In ADAM, a library downstream of Spark, we use Avro to define a schema, > generate Java classes from the Avro schema using the avro-maven-plugin, and > generate Scala Products from the Avro schema using our own code generation > library. > In the code path demonstrated by the following unit test, we write out to > Parquet and read back in using an RDD of Avro-generated Java classes and then > write out to Parquet and read back in using a Dataset of Avro-generated Scala > Products. > {code:scala} > sparkTest("transform reads to variant rdd") { > val reads = sc.loadAlignments(testFile("small.sam")) > def checkSave(variants: VariantRDD) { > val tempPath = tmpLocation(".adam") > variants.saveAsParquet(tempPath) > assert(sc.loadVariants(tempPath).rdd.count === 20) > } > val variants: VariantRDD = reads.transmute[Variant, VariantProduct, > VariantRDD]( > (rdd: RDD[AlignmentRecord]) => { > rdd.map(AlignmentRecordRDDSuite.varFn) > }) > checkSave(variants) > val sqlContext = SQLContext.getOrCreate(sc) > import sqlContext.implicits._ > val variantsDs: VariantRDD = reads.transmuteDataset[Variant, > VariantProduct, VariantRDD]( > (ds: Dataset[AlignmentRecordProduct]) => { > ds.map(r => { > VariantProduct.fromAvro( > AlignmentRecordRDDSuite.varFn(r.toAvro)) > }) > }) > checkSave(variantsDs) > } > {code} > https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540 > Note the schema in Parquet are different: > RDD code path > {noformat} > $ parquet-tools schema > /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet > message org.bdgenomics.formats.avro.Variant { > optional binary contigName (UTF8); > optional int64 start; > optional int64 end; > required group names (LIST) { > repeated binary array (UTF8); > } > optional boolean splitFromMultiAllelic; > optional binary referenceAllele (UTF8); > optional binary alternateAllele (UTF8); > optional double quality; > optional boolean filtersApplied; > optional boolean filtersPassed; > required group filtersFailed (LIST) { > repeated binary array (UTF8); > } > optional group annotation { > optional binary ancestralAllele (UTF8); > optional int32 alleleCount; > optional int32 readDepth; > optional int32 forwardReadDepth; > optional int32 reverseReadDepth; > optional int32 referenceReadDepth; > optional int32 referenceForwardReadDepth; > optional int32 referenceReverseReadDepth; > optional float alleleFrequency; > optional binary cigar (UTF8); > optional boolean dbSnp; > optional boolean hapMap2; > optional boolean hapMap3; > optional boolean validated; > optional boolean thousandGenomes; > optional boolean somatic; > required group transcriptEffects (LIST) { > repeated group array { > optional binary alternateAllele (UTF8); > required group effects (LIST) { > repeated binary array 
(UTF8); > } > optional binary geneName (UTF8); > optional binary geneId (UTF8); > optional binary featureType (UTF8); > optional binary featureId (UTF8); > optional binary biotype (UTF8); > optional int32 rank; > optional int32 total; > optional binary genomicHgvs (UTF8); > optional binary transcriptHgvs (UTF8); > optional binary proteinHgvs (UTF8); > optional int32 cdnaPosition; > optional int32 cdnaLength; > optional int32 cdsPosition; > optional int32 cdsLength; > optional int32 proteinPosition; > optional int32 proteinLength; > optional int32 distance; > required group messages (LIST) { > repeated binary array (ENUM); > } > } > } > required group attributes (MAP) { > repeated group map (MAP_KEY_VALUE) { > required binary key (UTF8); >
[jira] [Commented] (SPARK-25071) BuildSide is coming not as expected with join queries
[ https://issues.apache.org/jira/browse/SPARK-25071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651113#comment-16651113 ] Sujith commented on SPARK-25071: cc [~ZenWzh] Please suggest as i want to take up this JIRA > BuildSide is coming not as expected with join queries > - > > Key: SPARK-25071 > URL: https://issues.apache.org/jira/browse/SPARK-25071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 > Environment: Spark 2.3.1 > Hadoop 2.7.3 >Reporter: Ayush Anubhava >Priority: Major > > *BuildSide is not coming as expected.* > Pre-requisites: > *CBO is set as true & spark.sql.cbo.joinReorder.enabled= true.* > *import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec* > *Steps:* > *Scenario 1:* > spark.sql("CREATE TABLE small3 (c1 bigint) TBLPROPERTIES ('numRows'='2', > 'rawDataSize'='600','totalSize'='800')") > spark.sql("CREATE TABLE big3 (c1 bigint) TBLPROPERTIES ('numRows'='2', > 'rawDataSize'='6000', 'totalSize'='800')") > val plan = spark.sql("select * from small3 t1 join big3 t2 on (t1.c1 = > t2.c1)").queryExecution.executedPlan > val buildSide = > plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide > println(buildSide) > > *Result 1:* > scala> val plan = spark.sql("select * from small3 t1 join big3 t2 on (t1.c1 = > t2.c1)").queryExecution.executedPlan > plan: org.apache.spark.sql.execution.SparkPlan = > *(2) BroadcastHashJoin [c1#0L|#0L], [c1#1L|#1L], Inner, BuildRight > :- *(2) Filter isnotnull(c1#0L) > : +- HiveTableScan [c1#0L|#0L], HiveTableRelation `default`.`small3`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#0L|#0L] > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false])) > +- *(1) Filter isnotnull(c1#1L) > +- HiveTableScan [c1#1L|#1L], HiveTableRelation `default`.`big3`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1L|#1L] > scala> val buildSide = > plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide > buildSide: org.apache.spark.sql.execution.joins.BuildSide = BuildRight > scala> println(buildSide) > *BuildRight* > > *Scenario 2:* > spark.sql("CREATE TABLE small4 (c1 bigint) TBLPROPERTIES ('numRows'='2', > 'rawDataSize'='600','totalSize'='80')") > spark.sql("CREATE TABLE big4 (c1 bigint) TBLPROPERTIES ('numRows'='2', > 'rawDataSize'='6000', 'totalSize'='800')") > val plan = spark.sql("select * from small4 t1 join big4 t2 on (t1.c1 = > t2.c1)").queryExecution.executedPlan > val buildSide = > plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide > println(buildSide) > *Result 2:* > scala> val plan = spark.sql("select * from small4 t1 join big4 t2 on (t1.c1 = > t2.c1)").queryExecution.executedPlan > plan: org.apache.spark.sql.execution.SparkPlan = > *(2) BroadcastHashJoin [c1#4L|#4L], [c1#5L|#5L], Inner, BuildRight > :- *(2) Filter isnotnull(c1#4L) > : +- HiveTableScan [c1#4L|#4L], HiveTableRelation `default`.`small4`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#4L|#4L] > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false])) > +- *(1) Filter isnotnull(c1#5L) > +- HiveTableScan [c1#5L|#5L], HiveTableRelation `default`.`big4`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5L|#5L] > scala> val buildSide = > plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide > buildSide: org.apache.spark.sql.execution.joins.BuildSide = *BuildRight* > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org 
For additional commands, e-mail: issues-h...@spark.apache.org
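[Editor's note on SPARK-25071] The pre-requisites above name the CBO settings but not how they are applied. A minimal sketch, assuming a spark-shell session where the Scenario 1 tables already exist; the helper function is an illustration added here, not code from the ticket:
{code:scala}
// Prerequisite configuration from the ticket: enable CBO and join reordering.
spark.sql("SET spark.sql.cbo.enabled=true")
spark.sql("SET spark.sql.cbo.joinReorder.enabled=true")

import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

// Same inspection as in the scenarios: which side does the planner broadcast?
def buildSideOf(query: String) =
  spark.sql(query).queryExecution.executedPlan
    .children.head.asInstanceOf[BroadcastHashJoinExec].buildSide

println(buildSideOf("select * from small3 t1 join big3 t2 on (t1.c1 = t2.c1)"))
{code}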
[jira] [Resolved] (SPARK-25629) ParquetFilterSuite: filter pushdown - decimal 16 sec
[ https://issues.apache.org/jira/browse/SPARK-25629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25629. -- Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 3.0.0 Fixed in https://github.com/apache/spark/pull/22636 > ParquetFilterSuite: filter pushdown - decimal 16 sec > > > Key: SPARK-25629 > URL: https://issues.apache.org/jira/browse/SPARK-25629 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > org.apache.spark.sql.execution.datasources.parquet.ParquetFilterSuite.filter > pushdown - decimal > Took 16 sec. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22601) Data load is getting displayed successful on providing non existing hdfs file path
[ https://issues.apache.org/jira/browse/SPARK-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651097#comment-16651097 ] Sujith commented on SPARK-22601: cc *[gatorsmile|https://github.com/gatorsmile] please assign this JIRA to me as already this PR is been merged. Thanks* > Data load is getting displayed successful on providing non existing hdfs file > path > -- > > Key: SPARK-22601 > URL: https://issues.apache.org/jira/browse/SPARK-22601 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sujith >Priority: Minor > Fix For: 2.2.1 > > > Data load is getting displayed successful on providing non existing hdfs file > path where as in local path proper error message is getting displayed > create table tb2 (a string, b int); > load data inpath 'hdfs://hacluster/data1.csv' into table tb2 > Note: data1.csv does not exist in HDFS > when local non existing file path is given below error message will be > displayed > "LOAD DATA input path does not exist". attached snapshots of behaviour in > spark 2.1 and spark 2.2 version -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22601) Data load is getting displayed successful on providing non existing hdfs file path
[ https://issues.apache.org/jira/browse/SPARK-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujith resolved SPARK-22601. Resolution: Fixed Fix Version/s: 2.2.1 PR already merged. > Data load is getting displayed successful on providing non existing hdfs file > path > -- > > Key: SPARK-22601 > URL: https://issues.apache.org/jira/browse/SPARK-22601 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sujith >Priority: Minor > Fix For: 2.2.1 > > > Data load is getting displayed successful on providing non existing hdfs file > path where as in local path proper error message is getting displayed > create table tb2 (a string, b int); > load data inpath 'hdfs://hacluster/data1.csv' into table tb2 > Note: data1.csv does not exist in HDFS > when local non existing file path is given below error message will be > displayed > "LOAD DATA input path does not exist". attached snapshots of behaviour in > spark 2.1 and spark 2.2 version -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651081#comment-16651081 ] Apache Spark commented on SPARK-25588: -- User 'heuermh' has created a pull request for this issue: https://github.com/apache/spark/pull/22742 > SchemaParseException: Can't redefine: list when reading from Parquet > > > Key: SPARK-25588 > URL: https://issues.apache.org/jira/browse/SPARK-25588 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 > Environment: Spark version 2.3.2 >Reporter: Michael Heuer >Priority: Major > > In ADAM, a library downstream of Spark, we use Avro to define a schema, > generate Java classes from the Avro schema using the avro-maven-plugin, and > generate Scala Products from the Avro schema using our own code generation > library. > In the code path demonstrated by the following unit test, we write out to > Parquet and read back in using an RDD of Avro-generated Java classes and then > write out to Parquet and read back in using a Dataset of Avro-generated Scala > Products. > {code:scala} > sparkTest("transform reads to variant rdd") { > val reads = sc.loadAlignments(testFile("small.sam")) > def checkSave(variants: VariantRDD) { > val tempPath = tmpLocation(".adam") > variants.saveAsParquet(tempPath) > assert(sc.loadVariants(tempPath).rdd.count === 20) > } > val variants: VariantRDD = reads.transmute[Variant, VariantProduct, > VariantRDD]( > (rdd: RDD[AlignmentRecord]) => { > rdd.map(AlignmentRecordRDDSuite.varFn) > }) > checkSave(variants) > val sqlContext = SQLContext.getOrCreate(sc) > import sqlContext.implicits._ > val variantsDs: VariantRDD = reads.transmuteDataset[Variant, > VariantProduct, VariantRDD]( > (ds: Dataset[AlignmentRecordProduct]) => { > ds.map(r => { > VariantProduct.fromAvro( > AlignmentRecordRDDSuite.varFn(r.toAvro)) > }) > }) > checkSave(variantsDs) > } > {code} > https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540 > Note the schema in Parquet are different: > RDD code path > {noformat} > $ parquet-tools schema > /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet > message org.bdgenomics.formats.avro.Variant { > optional binary contigName (UTF8); > optional int64 start; > optional int64 end; > required group names (LIST) { > repeated binary array (UTF8); > } > optional boolean splitFromMultiAllelic; > optional binary referenceAllele (UTF8); > optional binary alternateAllele (UTF8); > optional double quality; > optional boolean filtersApplied; > optional boolean filtersPassed; > required group filtersFailed (LIST) { > repeated binary array (UTF8); > } > optional group annotation { > optional binary ancestralAllele (UTF8); > optional int32 alleleCount; > optional int32 readDepth; > optional int32 forwardReadDepth; > optional int32 reverseReadDepth; > optional int32 referenceReadDepth; > optional int32 referenceForwardReadDepth; > optional int32 referenceReverseReadDepth; > optional float alleleFrequency; > optional binary cigar (UTF8); > optional boolean dbSnp; > optional boolean hapMap2; > optional boolean hapMap3; > optional boolean validated; > optional boolean thousandGenomes; > optional boolean somatic; > required group transcriptEffects (LIST) { > repeated group array { > optional binary alternateAllele (UTF8); > required group effects (LIST) { > repeated binary array (UTF8); > } > optional binary geneName (UTF8); > 
optional binary geneId (UTF8); > optional binary featureType (UTF8); > optional binary featureId (UTF8); > optional binary biotype (UTF8); > optional int32 rank; > optional int32 total; > optional binary genomicHgvs (UTF8); > optional binary transcriptHgvs (UTF8); > optional binary proteinHgvs (UTF8); > optional int32 cdnaPosition; > optional int32 cdnaLength; > optional int32 cdsPosition; > optional int32 cdsLength; > optional int32 proteinPosition; > optional int32 proteinLength; > optional int32 distance; > required group messages (LIST) { > repeated binary array (ENUM); > } > } > } > required group attributes (MAP) { > repeated group map (MAP_KEY_VALUE) { > required binary key (UTF8); > required binary value (UTF8); > } > } >
[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651080#comment-16651080 ] Michael Heuer commented on SPARK-25588: --- Created pull request [https://github.com/apache/spark/pull/22742] with failing unit test that demonstrates this issue. > SchemaParseException: Can't redefine: list when reading from Parquet > > > Key: SPARK-25588 > URL: https://issues.apache.org/jira/browse/SPARK-25588 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 > Environment: Spark version 2.3.2 >Reporter: Michael Heuer >Priority: Major > > In ADAM, a library downstream of Spark, we use Avro to define a schema, > generate Java classes from the Avro schema using the avro-maven-plugin, and > generate Scala Products from the Avro schema using our own code generation > library. > In the code path demonstrated by the following unit test, we write out to > Parquet and read back in using an RDD of Avro-generated Java classes and then > write out to Parquet and read back in using a Dataset of Avro-generated Scala > Products. > {code:scala} > sparkTest("transform reads to variant rdd") { > val reads = sc.loadAlignments(testFile("small.sam")) > def checkSave(variants: VariantRDD) { > val tempPath = tmpLocation(".adam") > variants.saveAsParquet(tempPath) > assert(sc.loadVariants(tempPath).rdd.count === 20) > } > val variants: VariantRDD = reads.transmute[Variant, VariantProduct, > VariantRDD]( > (rdd: RDD[AlignmentRecord]) => { > rdd.map(AlignmentRecordRDDSuite.varFn) > }) > checkSave(variants) > val sqlContext = SQLContext.getOrCreate(sc) > import sqlContext.implicits._ > val variantsDs: VariantRDD = reads.transmuteDataset[Variant, > VariantProduct, VariantRDD]( > (ds: Dataset[AlignmentRecordProduct]) => { > ds.map(r => { > VariantProduct.fromAvro( > AlignmentRecordRDDSuite.varFn(r.toAvro)) > }) > }) > checkSave(variantsDs) > } > {code} > https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540 > Note the schema in Parquet are different: > RDD code path > {noformat} > $ parquet-tools schema > /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet > message org.bdgenomics.formats.avro.Variant { > optional binary contigName (UTF8); > optional int64 start; > optional int64 end; > required group names (LIST) { > repeated binary array (UTF8); > } > optional boolean splitFromMultiAllelic; > optional binary referenceAllele (UTF8); > optional binary alternateAllele (UTF8); > optional double quality; > optional boolean filtersApplied; > optional boolean filtersPassed; > required group filtersFailed (LIST) { > repeated binary array (UTF8); > } > optional group annotation { > optional binary ancestralAllele (UTF8); > optional int32 alleleCount; > optional int32 readDepth; > optional int32 forwardReadDepth; > optional int32 reverseReadDepth; > optional int32 referenceReadDepth; > optional int32 referenceForwardReadDepth; > optional int32 referenceReverseReadDepth; > optional float alleleFrequency; > optional binary cigar (UTF8); > optional boolean dbSnp; > optional boolean hapMap2; > optional boolean hapMap3; > optional boolean validated; > optional boolean thousandGenomes; > optional boolean somatic; > required group transcriptEffects (LIST) { > repeated group array { > optional binary alternateAllele (UTF8); > required group effects (LIST) { > repeated binary array (UTF8); > } > optional binary 
geneName (UTF8); > optional binary geneId (UTF8); > optional binary featureType (UTF8); > optional binary featureId (UTF8); > optional binary biotype (UTF8); > optional int32 rank; > optional int32 total; > optional binary genomicHgvs (UTF8); > optional binary transcriptHgvs (UTF8); > optional binary proteinHgvs (UTF8); > optional int32 cdnaPosition; > optional int32 cdnaLength; > optional int32 cdsPosition; > optional int32 cdsLength; > optional int32 proteinPosition; > optional int32 proteinLength; > optional int32 distance; > required group messages (LIST) { > repeated binary array (ENUM); > } > } > } > required group attributes (MAP) { > repeated group map (MAP_KEY_VALUE) { > required binary key (UTF8); > required binary value (UTF8);
[jira] [Assigned] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25588: Assignee: (was: Apache Spark) > SchemaParseException: Can't redefine: list when reading from Parquet > > > Key: SPARK-25588 > URL: https://issues.apache.org/jira/browse/SPARK-25588 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 > Environment: Spark version 2.3.2 >Reporter: Michael Heuer >Priority: Major > > In ADAM, a library downstream of Spark, we use Avro to define a schema, > generate Java classes from the Avro schema using the avro-maven-plugin, and > generate Scala Products from the Avro schema using our own code generation > library. > In the code path demonstrated by the following unit test, we write out to > Parquet and read back in using an RDD of Avro-generated Java classes and then > write out to Parquet and read back in using a Dataset of Avro-generated Scala > Products. > {code:scala} > sparkTest("transform reads to variant rdd") { > val reads = sc.loadAlignments(testFile("small.sam")) > def checkSave(variants: VariantRDD) { > val tempPath = tmpLocation(".adam") > variants.saveAsParquet(tempPath) > assert(sc.loadVariants(tempPath).rdd.count === 20) > } > val variants: VariantRDD = reads.transmute[Variant, VariantProduct, > VariantRDD]( > (rdd: RDD[AlignmentRecord]) => { > rdd.map(AlignmentRecordRDDSuite.varFn) > }) > checkSave(variants) > val sqlContext = SQLContext.getOrCreate(sc) > import sqlContext.implicits._ > val variantsDs: VariantRDD = reads.transmuteDataset[Variant, > VariantProduct, VariantRDD]( > (ds: Dataset[AlignmentRecordProduct]) => { > ds.map(r => { > VariantProduct.fromAvro( > AlignmentRecordRDDSuite.varFn(r.toAvro)) > }) > }) > checkSave(variantsDs) > } > {code} > https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540 > Note the schema in Parquet are different: > RDD code path > {noformat} > $ parquet-tools schema > /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet > message org.bdgenomics.formats.avro.Variant { > optional binary contigName (UTF8); > optional int64 start; > optional int64 end; > required group names (LIST) { > repeated binary array (UTF8); > } > optional boolean splitFromMultiAllelic; > optional binary referenceAllele (UTF8); > optional binary alternateAllele (UTF8); > optional double quality; > optional boolean filtersApplied; > optional boolean filtersPassed; > required group filtersFailed (LIST) { > repeated binary array (UTF8); > } > optional group annotation { > optional binary ancestralAllele (UTF8); > optional int32 alleleCount; > optional int32 readDepth; > optional int32 forwardReadDepth; > optional int32 reverseReadDepth; > optional int32 referenceReadDepth; > optional int32 referenceForwardReadDepth; > optional int32 referenceReverseReadDepth; > optional float alleleFrequency; > optional binary cigar (UTF8); > optional boolean dbSnp; > optional boolean hapMap2; > optional boolean hapMap3; > optional boolean validated; > optional boolean thousandGenomes; > optional boolean somatic; > required group transcriptEffects (LIST) { > repeated group array { > optional binary alternateAllele (UTF8); > required group effects (LIST) { > repeated binary array (UTF8); > } > optional binary geneName (UTF8); > optional binary geneId (UTF8); > optional binary featureType (UTF8); > optional binary featureId (UTF8); > 
optional binary biotype (UTF8); > optional int32 rank; > optional int32 total; > optional binary genomicHgvs (UTF8); > optional binary transcriptHgvs (UTF8); > optional binary proteinHgvs (UTF8); > optional int32 cdnaPosition; > optional int32 cdnaLength; > optional int32 cdsPosition; > optional int32 cdsLength; > optional int32 proteinPosition; > optional int32 proteinLength; > optional int32 distance; > required group messages (LIST) { > repeated binary array (ENUM); > } > } > } > required group attributes (MAP) { > repeated group map (MAP_KEY_VALUE) { > required binary key (UTF8); > required binary value (UTF8); > } > } > } > } > {noformat} > Dataset code path: > {noformat} > $ parquet-tools schema >
[jira] [Assigned] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25588: Assignee: Apache Spark > SchemaParseException: Can't redefine: list when reading from Parquet > > > Key: SPARK-25588 > URL: https://issues.apache.org/jira/browse/SPARK-25588 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 > Environment: Spark version 2.3.2 >Reporter: Michael Heuer >Assignee: Apache Spark >Priority: Major > > In ADAM, a library downstream of Spark, we use Avro to define a schema, > generate Java classes from the Avro schema using the avro-maven-plugin, and > generate Scala Products from the Avro schema using our own code generation > library. > In the code path demonstrated by the following unit test, we write out to > Parquet and read back in using an RDD of Avro-generated Java classes and then > write out to Parquet and read back in using a Dataset of Avro-generated Scala > Products. > {code:scala} > sparkTest("transform reads to variant rdd") { > val reads = sc.loadAlignments(testFile("small.sam")) > def checkSave(variants: VariantRDD) { > val tempPath = tmpLocation(".adam") > variants.saveAsParquet(tempPath) > assert(sc.loadVariants(tempPath).rdd.count === 20) > } > val variants: VariantRDD = reads.transmute[Variant, VariantProduct, > VariantRDD]( > (rdd: RDD[AlignmentRecord]) => { > rdd.map(AlignmentRecordRDDSuite.varFn) > }) > checkSave(variants) > val sqlContext = SQLContext.getOrCreate(sc) > import sqlContext.implicits._ > val variantsDs: VariantRDD = reads.transmuteDataset[Variant, > VariantProduct, VariantRDD]( > (ds: Dataset[AlignmentRecordProduct]) => { > ds.map(r => { > VariantProduct.fromAvro( > AlignmentRecordRDDSuite.varFn(r.toAvro)) > }) > }) > checkSave(variantsDs) > } > {code} > https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540 > Note the schema in Parquet are different: > RDD code path > {noformat} > $ parquet-tools schema > /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet > message org.bdgenomics.formats.avro.Variant { > optional binary contigName (UTF8); > optional int64 start; > optional int64 end; > required group names (LIST) { > repeated binary array (UTF8); > } > optional boolean splitFromMultiAllelic; > optional binary referenceAllele (UTF8); > optional binary alternateAllele (UTF8); > optional double quality; > optional boolean filtersApplied; > optional boolean filtersPassed; > required group filtersFailed (LIST) { > repeated binary array (UTF8); > } > optional group annotation { > optional binary ancestralAllele (UTF8); > optional int32 alleleCount; > optional int32 readDepth; > optional int32 forwardReadDepth; > optional int32 reverseReadDepth; > optional int32 referenceReadDepth; > optional int32 referenceForwardReadDepth; > optional int32 referenceReverseReadDepth; > optional float alleleFrequency; > optional binary cigar (UTF8); > optional boolean dbSnp; > optional boolean hapMap2; > optional boolean hapMap3; > optional boolean validated; > optional boolean thousandGenomes; > optional boolean somatic; > required group transcriptEffects (LIST) { > repeated group array { > optional binary alternateAllele (UTF8); > required group effects (LIST) { > repeated binary array (UTF8); > } > optional binary geneName (UTF8); > optional binary geneId (UTF8); > optional binary featureType (UTF8); > optional binary featureId 
(UTF8); > optional binary biotype (UTF8); > optional int32 rank; > optional int32 total; > optional binary genomicHgvs (UTF8); > optional binary transcriptHgvs (UTF8); > optional binary proteinHgvs (UTF8); > optional int32 cdnaPosition; > optional int32 cdnaLength; > optional int32 cdsPosition; > optional int32 cdsLength; > optional int32 proteinPosition; > optional int32 proteinLength; > optional int32 distance; > required group messages (LIST) { > repeated binary array (ENUM); > } > } > } > required group attributes (MAP) { > repeated group map (MAP_KEY_VALUE) { > required binary key (UTF8); > required binary value (UTF8); > } > } > } > } > {noformat} > Dataset code path: > {noformat} > $ parquet-tools
[jira] [Updated] (SPARK-25714) Null Handling in the Optimizer rule BooleanSimplification
[ https://issues.apache.org/jira/browse/SPARK-25714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25714: Fix Version/s: 2.3.3 2.2.3 > Null Handling in the Optimizer rule BooleanSimplification > - > > Key: SPARK-25714 > URL: https://issues.apache.org/jira/browse/SPARK-25714 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > Labels: correctness > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > {code} > scala> val df = Seq(("abc", 1), (null, 3)).toDF("col1", "col2") > df: org.apache.spark.sql.DataFrame = [col1: string, col2: int] > scala> df.write.mode("overwrite").parquet("/tmp/test1") > > > scala> val df2 = spark.read.parquet("/tmp/test1"); > df2: org.apache.spark.sql.DataFrame = [col1: string, col2: int] > scala> df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show() > +++ > |col1|col2| > +++ > | abc| 1| > |null| 3| > +++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
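[Editor's note on SPARK-25714] For context: under SQL three-valued logic, both {{col1 = 'abc'}} and {{col1 != 'abc'}} evaluate to NULL when {{col1}} is NULL, so the whole predicate is NULL for the second row and that row should be filtered out. The expected result of the query in the description is therefore only the first row, which is why the incorrect output above is tagged as a correctness bug:
{code}
+----+----+
|col1|col2|
+----+----+
| abc|   1|
+----+----+
{code}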
[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651061#comment-16651061 ] Jackey Lee commented on SPARK-24630: [~shijinkui] Without stream keyword means SQLStreaming cannot support none action queries with pure-sql, such as 'select * from kafka_sql_test'. User must define sink with Scala/Python/Java API. > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL(which is actually based on Calcite), > SQLStream, StormSQL all provide a stream type SQL interface, with which users > with little knowledge about streaming, can easily develop a flow system > processing model. In Spark, we can also support SQL API based on > StructStreamig. > To support for SQL Streaming, there are two key points: > 1, Analysis should be able to parse streaming type SQL. > 2, Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
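[Editor's note on SPARK-24630] A sketch of what the comment refers to: without a stream keyword in SQL, the sink has to be started through the programmatic API. The broker address and topic name below are placeholders, not values from the SPIP:
{code:scala}
// The source can be expressed in SQL/DataFrame terms, but starting the query
// (i.e. defining the sink) still requires the DataStream API today.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "kafka_sql_test")                // placeholder topic
  .load()

val query = stream.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .outputMode("append")
  .start()
{code}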
[jira] [Updated] (SPARK-25723) spark sql External DataSource question
[ https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huanghuai updated SPARK-25723: -- Description: {code:java} public class MyDatasourceRelation extends BaseRelation implements PrunedFilteredScan { Map parameters; SparkSession sparkSession; CombinedReportHelper helper; public MyDatasourceRelation() { } public MyDatasourceRelation (SQLContext sqlContext,Map parameters) { this.parameters = parameters; this.sparkSession = sqlContext.sparkSession(); this.helper = new CombinedReportHelper(parameters); //don't care this.helper.setRowsPerPage(1); } @Override public SQLContext sqlContext() { return this.sparkSession.sqlContext(); } @Override public StructType schema() { StructType structType = transformSchema(helper.getFields(), helper.getFirst()); //helper.close(); System.out.println("get schema: "+structType); return structType; } @Override public RDD buildScan(String[] requiredColumns, Filter[] filters) { System.out.println("build scan:"); int totalRow = helper.getTotalRow(); Partition[] partitions = getPartitions(totalRow, parameters); System.out.println("get partition:"+partitions.length+" total row:"+totalRow); return new SmartbixDatasourceRDD(sparkSession.sparkContext(), partitions, parameters); } private Partition[] getPartitions(int totalRow, Map parameters) { int step = 100; int numOfPartitions = (totalRow + step - 1) / step; Partition[] partitions = new Partition[numOfPartitions]; for (int i = 0; i < numOfPartitions; i++) { int start = i * step + 1; partitions[i] = new MyPartition(null, i, start, start + step); } return partitions; } } {code} -- above is my code,some useless information are removed --- trait PrunedFilteredScan { def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] } if i implement this trait, i find requiredColumns param is different everytime,Why are the order different {color:#ff}you can use spark.read.jdbc and connect to your local mysql DB, and debug at {color} {color:#ff}org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130){color} to show this param; attachement is my screenshot was: {code:java} public class MyDatasourceRelation extends BaseRelation implements PrunedFilteredScan { Map parameters; SparkSession sparkSession; CombinedReportHelper helper; public MyDatasourceRelation() { } public SmartbixDatasourceRelation(SQLContext sqlContext,Map parameters) { this.parameters = parameters; this.sparkSession = sqlContext.sparkSession(); this.helper = new CombinedReportHelper(parameters); //don't care this.helper.setRowsPerPage(1); } @Override public SQLContext sqlContext() { return this.sparkSession.sqlContext(); } @Override public StructType schema() { StructType structType = transformSchema(helper.getFields(), helper.getFirst()); //helper.close(); System.out.println("get schema: "+structType); return structType; } @Override public RDD buildScan(String[] requiredColumns, Filter[] filters) { System.out.println("build scan:"); int totalRow = helper.getTotalRow(); Partition[] partitions = getPartitions(totalRow, parameters); System.out.println("get partition:"+partitions.length+" total row:"+totalRow); return new SmartbixDatasourceRDD(sparkSession.sparkContext(), partitions, parameters); } private Partition[] getPartitions(int totalRow, Map parameters) { int step = 100; int numOfPartitions = (totalRow + step - 1) / step; Partition[] partitions = new Partition[numOfPartitions]; for (int i = 0; i < numOfPartitions; i++) { int start = i * step + 1; 
partitions[i] = new MyPartition(null, i, start, start + step); } return partitions; } } {code} -- above is my code,some useless information are removed --- trait PrunedFilteredScan { def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] } if i implement this trait, i find requiredColumns param is different everytime,Why are the order different {color:#FF}you can use spark.read.jdbc and connect to your local mysql DB, and debug at {color} {color:#FF}org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130){color} to show this param; attachement is my screenshot > spark sql External DataSource question > -- > > Key: SPARK-25723 > URL: https://issues.apache.org/jira/browse/SPARK-25723 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.2 > Environment: local mode >Reporter: huanghuai >Priority: Minor > Attachments: QQ图片20181015182502.jpg > > > {code:java} > public class MyDatasourceRelation extends BaseRelation implements > PrunedFilteredScan { > Map parameters; > SparkSession
[jira] [Updated] (SPARK-25723) spark sql External DataSource question
[ https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huanghuai updated SPARK-25723: -- Description: {code:java} public class MyDatasourceRelation extends BaseRelation implements PrunedFilteredScan { Map parameters; SparkSession sparkSession; CombinedReportHelper helper; public MyDatasourceRelation() { } public SmartbixDatasourceRelation(SQLContext sqlContext,Map parameters) { this.parameters = parameters; this.sparkSession = sqlContext.sparkSession(); this.helper = new CombinedReportHelper(parameters); //don't care this.helper.setRowsPerPage(1); } @Override public SQLContext sqlContext() { return this.sparkSession.sqlContext(); } @Override public StructType schema() { StructType structType = transformSchema(helper.getFields(), helper.getFirst()); //helper.close(); System.out.println("get schema: "+structType); return structType; } @Override public RDD buildScan(String[] requiredColumns, Filter[] filters) { System.out.println("build scan:"); int totalRow = helper.getTotalRow(); Partition[] partitions = getPartitions(totalRow, parameters); System.out.println("get partition:"+partitions.length+" total row:"+totalRow); return new SmartbixDatasourceRDD(sparkSession.sparkContext(), partitions, parameters); } private Partition[] getPartitions(int totalRow, Map parameters) { int step = 100; int numOfPartitions = (totalRow + step - 1) / step; Partition[] partitions = new Partition[numOfPartitions]; for (int i = 0; i < numOfPartitions; i++) { int start = i * step + 1; partitions[i] = new MyPartition(null, i, start, start + step); } return partitions; } } {code} -- above is my code,some useless information are removed --- trait PrunedFilteredScan { def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] } if i implement this trait, i find requiredColumns param is different everytime,Why are the order different {color:#FF}you can use spark.read.jdbc and connect to your local mysql DB, and debug at {color} {color:#FF}org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130){color} to show this param; attachement is my screenshot was: trait PrunedFilteredScan { def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] } if i implement this trait, i find requiredColumns param is different everytime,Why are the order different you can use spark.read.jdbc and connect to your local mysql DB, and debug at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130) to show this param; attachement is my screenshot > spark sql External DataSource question > -- > > Key: SPARK-25723 > URL: https://issues.apache.org/jira/browse/SPARK-25723 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.2 > Environment: local mode >Reporter: huanghuai >Priority: Minor > Attachments: QQ图片20181015182502.jpg > > > {code:java} > public class MyDatasourceRelation extends BaseRelation implements > PrunedFilteredScan { > Map parameters; > SparkSession sparkSession; > CombinedReportHelper helper; > public MyDatasourceRelation() { > } > public SmartbixDatasourceRelation(SQLContext sqlContext,Map > parameters) { > this.parameters = parameters; > this.sparkSession = sqlContext.sparkSession(); > this.helper = new CombinedReportHelper(parameters); //don't care > this.helper.setRowsPerPage(1); > } > @Override > public SQLContext sqlContext() { > return this.sparkSession.sqlContext(); > } > @Override > public StructType schema() { > StructType structType = 
transformSchema(helper.getFields(), > helper.getFirst()); > //helper.close(); > System.out.println("get schema: "+structType); > return structType; > } > @Override > public RDD buildScan(String[] requiredColumns, Filter[] filters) { > System.out.println("build scan:"); > int totalRow = helper.getTotalRow(); > Partition[] partitions = getPartitions(totalRow, parameters); > System.out.println("get partition:"+partitions.length+" total row:"+totalRow); > return new SmartbixDatasourceRDD(sparkSession.sparkContext(), partitions, > parameters); > } > private Partition[] getPartitions(int totalRow, Map > parameters) { > int step = 100; > int numOfPartitions = (totalRow + step - 1) / step; > Partition[] partitions = new Partition[numOfPartitions]; > for (int i = 0; i < numOfPartitions; i++) { > int start = i * step + 1; > partitions[i] = new MyPartition(null, i, start, start + step); > } > return partitions; > } > } > {code} > > > > -- above is my code,some useless information are removed > --- > > > trait PrunedFilteredScan > { def buildScan(requiredColumns: Array[String], filters:
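[Editor's note on SPARK-25723] The {{PrunedFilteredScan}} contract does not guarantee any particular ordering of {{requiredColumns}}; an implementation should emit each {{Row}} in exactly the order the array is passed in rather than in schema order. A minimal sketch (spark-shell session assumed; the map-based source records and the null default are illustrative assumptions, not code from the ticket):
{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter

// Hypothetical source: each record is a column-name -> value map.
val records: Seq[Map[String, Any]] =
  Seq(Map("key" -> 1, "value" -> "a"), Map("key" -> 2, "value" -> "b"))

// Build each Row following the order of requiredColumns, whatever it happens to be.
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
  spark.sparkContext
    .parallelize(records)
    .map(r => Row.fromSeq(requiredColumns.map(c => r.getOrElse(c, null))))
{code}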
[jira] [Resolved] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port
[ https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25738. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22733 [https://github.com/apache/spark/pull/22733] > LOAD DATA INPATH doesn't work if hdfs conf includes port > > > Key: SPARK-25738 > URL: https://issues.apache.org/jira/browse/SPARK-25738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Blocker > Fix For: 2.4.0 > > > LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address > at index 8}} if your hdfs conf includes a port for the namenode. > This is because the URI is passing in the value of the hdfs conf > "fs.defaultFS" in for the host. Note that variable is called {{authority}}, > but the 4-arg URI constructor actually expects a host: > https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String) > {code} > val defaultFSConf = > sparkSession.sessionState.newHadoopConf().get("fs.defaultFS") > ... > val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment) > {code} > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386 > This was introduced by SPARK-23425. > *Workaround*: specify a fully qualified path, eg. instead of > {noformat} > LOAD DATA INPATH '/some/path/on/hdfs' > {noformat} > use > {noformat} > LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
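[Editor's note on SPARK-25738] A small standalone illustration of the constructor mismatch described above; the host and port are placeholders, and this is not code from the pull request:
{code:scala}
import java.net.URI

// 4-arg constructor: URI(scheme, host, path, fragment).
// A "host:port" string in the host position is treated as an (invalid)
// IPv6 literal, which is the URISyntaxException reported in this ticket.
try {
  new URI("hdfs", "namenode.example.com:8020", "/some/path/on/hdfs", null)
} catch {
  case e: java.net.URISyntaxException =>
    println("4-arg constructor failed: " + e.getMessage)
}

// The multi-arg constructor takes host and port separately and parses fine.
val ok = new URI("hdfs", null, "namenode.example.com", 8020,
  "/some/path/on/hdfs", null, null)
println(ok)
{code}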
[jira] [Assigned] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port
[ https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25738: -- Assignee: Imran Rashid > LOAD DATA INPATH doesn't work if hdfs conf includes port > > > Key: SPARK-25738 > URL: https://issues.apache.org/jira/browse/SPARK-25738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Blocker > Fix For: 2.4.0 > > > LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address > at index 8}} if your hdfs conf includes a port for the namenode. > This is because the URI is passing in the value of the hdfs conf > "fs.defaultFS" in for the host. Note that variable is called {{authority}}, > but the 4-arg URI constructor actually expects a host: > https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String) > {code} > val defaultFSConf = > sparkSession.sessionState.newHadoopConf().get("fs.defaultFS") > ... > val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment) > {code} > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386 > This was introduced by SPARK-23425. > *Workaround*: specify a fully qualified path, eg. instead of > {noformat} > LOAD DATA INPATH '/some/path/on/hdfs' > {noformat} > use > {noformat} > LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23257) Implement Kerberos Support in Kubernetes resource manager
[ https://issues.apache.org/jira/browse/SPARK-23257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23257: -- Assignee: Ilan Filonenko > Implement Kerberos Support in Kubernetes resource manager > - > > Key: SPARK-23257 > URL: https://issues.apache.org/jira/browse/SPARK-23257 > Project: Spark > Issue Type: Wish > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Rob Keevil >Assignee: Ilan Filonenko >Priority: Major > Fix For: 3.0.0 > > > On the forked k8s branch of Spark at > [https://github.com/apache-spark-on-k8s/spark/pull/540] , Kerberos support > has been added to the Kubernetes resource manager. The Kubernetes code > between these two repositories appears to have diverged, so this commit > cannot be merged in easily. Are there any plans to re-implement this work on > the main Spark repository? > > [ifilonenko|https://github.com/ifilonenko] [~liyinan926] I am happy to help > with the development and testing of this, but i wanted to confirm that this > isn't already in progress - I could not find any discussion about this > specific topic online. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23257) Implement Kerberos Support in Kubernetes resource manager
[ https://issues.apache.org/jira/browse/SPARK-23257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23257. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21669 [https://github.com/apache/spark/pull/21669] > Implement Kerberos Support in Kubernetes resource manager > - > > Key: SPARK-23257 > URL: https://issues.apache.org/jira/browse/SPARK-23257 > Project: Spark > Issue Type: Wish > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Rob Keevil >Assignee: Ilan Filonenko >Priority: Major > Fix For: 3.0.0 > > > On the forked k8s branch of Spark at > [https://github.com/apache-spark-on-k8s/spark/pull/540] , Kerberos support > has been added to the Kubernetes resource manager. The Kubernetes code > between these two repositories appears to have diverged, so this commit > cannot be merged in easily. Are there any plans to re-implement this work on > the main Spark repository? > > [ifilonenko|https://github.com/ifilonenko] [~liyinan926] I am happy to help > with the development and testing of this, but i wanted to confirm that this > isn't already in progress - I could not find any discussion about this > specific topic online. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25716) Project and Aggregate generate valid constraints with unnecessary operation
[ https://issues.apache.org/jira/browse/SPARK-25716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25716. - Resolution: Fixed Assignee: SongYadong Fix Version/s: 3.0.0 > Project and Aggregate generate valid constraints with unnecessary operation > --- > > Key: SPARK-25716 > URL: https://issues.apache.org/jira/browse/SPARK-25716 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: SongYadong >Assignee: SongYadong >Priority: Minor > Fix For: 3.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Project logical operator generates valid constraints using two opposite > operations. It substracts child constraints from all constraints, than union > child constraints again. I think it may be not necessary. > Aggregate operator has the same problem with Project. > for example: > in LogicalPlan.getAliasedConstraints(), return: > {code:java} > allConstraints -- child.constraints{code} > in Project.validConstraints(): > {code:java} > child.constraints.union(getAliasedConstraints(projectList)){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
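[Editor's note on SPARK-25716] The redundancy can be seen with plain Scala sets (a standalone illustration, not Catalyst code): subtracting a set and then unioning it back is equivalent to a single union.
{code:scala}
// Stand-ins for the expression constraints named in the description.
val childConstraints = Set("isnotnull(a)", "a > 0")
val allConstraints   = Set("isnotnull(a)", "a > 0", "b = a + 1")

// What Project.validConstraints effectively computes today:
// child.constraints.union(allConstraints -- child.constraints)
val current = childConstraints.union(allConstraints -- childConstraints)

// The subtraction adds nothing; a single union gives the same result.
val simpler = childConstraints.union(allConstraints)

assert(current == simpler)
{code}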
[jira] [Comment Edited] (SPARK-25643) Performance issues querying wide rows
[ https://issues.apache.org/jira/browse/SPARK-25643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650866#comment-16650866 ] Bruce Robbins edited comment on SPARK-25643 at 10/15/18 10:08 PM: -- [~viirya] Yes, in the case where I said "matching rows are sprinkled fairly evenly throughout the table", I meant that predicate pushdown is working correctly, but that every data page for the filtered column contains at least one matching value. It's not about predicate push down so much as how records are distributed in the files. For example, I have a 1 million record table with 6000 columns, where {{select * from table where id1 = 1}} returns 2330 rows. The table is not sorted by id1, so the matching records are arbitrarily distributed throughout the table. In this case, every data page for column id1 contains at least one entry with value 1. As a result, every row gets realized and passed to FilterExec. So the "small subset of rows" in this case is 2330 rows. When I say "the returned result includes just a few rows", I meant a few rows relative to the size of the table. There has to be enough matching rows so that many data pages are involved. was (Author: bersprockets): [~viirya] Yes, in the case where I said "predicate push down is not helping", I meant that predicate pushdown is working correctly, but that every data page for the filtered column contains at least one matching value. It's not about predicate push down so much as how records are distributed in the files. For example, I have a 1 million record table with 6000 columns, where {{select * from table where id1 = 1}} returns 2330 rows. The table is not sorted by id1, so the matching records are arbitrarily distributed throughout the table. In this case, every data page for column id1 contains at least one entry with value 1. As a result, every row gets realized and passed to FilterExec. So the "small subset of rows" in this case is 2330 rows. When I say "the returned result includes just a few rows", I meant a few rows relative to the size of the table. There has to be enough matching rows so that many data pages are involved. > Performance issues querying wide rows > - > > Key: SPARK-25643 > URL: https://issues.apache.org/jira/browse/SPARK-25643 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Major > > Querying a small subset of rows from a wide table (e.g., a table with 6000 > columns) can be quite slow in the following case: > * the table has many rows (most of which will be filtered out) > * the projection includes every column of a wide table (i.e., select *) > * predicate push down is not helping: either matching rows are sprinkled > fairly evenly throughout the table, or predicate push down is switched off > Even if the filter involves only a single column and the returned result > includes just a few rows, the query can run much longer compared to an > equivalent query against a similar table with fewer columns. > According to initial profiling, it appears that most time is spent realizing > the entire row in the scan, just so the filter can look at a tiny subset of > columns and almost certainly throw the row away. The profiling shows 74% of > time is spent in FileSourceScanExec, and that time is spent across numerous > writeFields_0_xxx method calls. > If Spark must realize the entire row just to check a tiny subset of columns, > this all sounds reasonable. 
However, I wonder if there is an optimization > here where we can avoid realizing the entire row until after the filter has > selected the row. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25643) Performance issues querying wide rows
[ https://issues.apache.org/jira/browse/SPARK-25643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650866#comment-16650866 ] Bruce Robbins commented on SPARK-25643: --- [~viirya] Yes, in the case where I said "predicate push down is not helping", I meant that predicate pushdown is working correctly, but that every data page for the filtered column contains at least one matching value. It's not about predicate push down so much as how records are distributed in the files. For example, I have a 1 million record table with 6000 columns, where {{select * from table where id1 = 1}} returns 2330 rows. The table is not sorted by id1, so the matching records are arbitrarily distributed throughout the table. In this case, every data page for column id1 contains at least one entry with value 1. As a result, every row gets realized and passed to FilterExec. So the "small subset of rows" in this case is 2330 rows. When I say "the returned result includes just a few rows", I meant a few rows relative to the size of the table. There has to be enough matching rows so that many data pages are involved. > Performance issues querying wide rows > - > > Key: SPARK-25643 > URL: https://issues.apache.org/jira/browse/SPARK-25643 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Major > > Querying a small subset of rows from a wide table (e.g., a table with 6000 > columns) can be quite slow in the following case: > * the table has many rows (most of which will be filtered out) > * the projection includes every column of a wide table (i.e., select *) > * predicate push down is not helping: either matching rows are sprinkled > fairly evenly throughout the table, or predicate push down is switched off > Even if the filter involves only a single column and the returned result > includes just a few rows, the query can run much longer compared to an > equivalent query against a similar table with fewer columns. > According to initial profiling, it appears that most time is spent realizing > the entire row in the scan, just so the filter can look at a tiny subset of > columns and almost certainly throw the row away. The profiling shows 74% of > time is spent in FileSourceScanExec, and that time is spent across numerous > writeFields_0_xxx method calls. > If Spark must realize the entire row just to check a tiny subset of columns, > this all sounds reasonable. However, I wonder if there is an optimization > here where we can avoid realizing the entire row until after the filter has > selected the row. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
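[Editor's note on SPARK-25643] A rough reproduction sketch along the lines described in the comments. The column count, row count, and output path are placeholders chosen here for speed; the original report uses ADAM-generated data with 6000 columns:
{code:scala}
import org.apache.spark.sql.functions._

// Build a wide table whose matching id1 values are spread across the file,
// so every data page of id1 contains at least one match.
val numCols = 2000
val extraCols = (1 to numCols).map(i => (lit(i) * col("id")).as(s"c$i"))
val wide = spark.range(1000000L)
  .withColumn("id1", col("id") % 500)   // ~2000 matching rows for any given id1
  .select(col("id1") +: extraCols: _*)

wide.write.mode("overwrite").parquet("/tmp/wide_table")

// select * with a selective single-column filter: most time goes into
// materializing all columns of rows that the filter then discards.
spark.time {
  spark.read.parquet("/tmp/wide_table").filter("id1 = 1").count()
}
{code}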
[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null
[ https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Jones updated SPARK-25739: Description: Example code - {code:java} val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") df .repartition(1) .write .mode("overwrite") .option("nullValue", null) .option("emptyValue", null) .option("delimiter",",") .option("quoteMode", "NONE") .option("escape","\\") .format("csv") .save("/tmp/nullcsv/") var out = dbutils.fs.ls("/tmp/nullcsv/") var file = out(out.size - 1) val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) println(x) {code} Output - {code:java} 1,"" 3,hi 2,hello 4, {code} Expected output - {code:java} 1, 3,hi 2,hello 4, {code} [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] This commit is relevant to my issue. "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files." I am on Spark version 2.3.1, so empty strings should be coming as null. Even then, I am passing the correct "emptyValue" option. However, my empty values are stilling coming as `""` in the written file. I have tested the provided code in Databricks runtime environment 5.0 and 4.1, and it is giving the expected output. However in Databricks runtime 4.2 and 4.3 (which are running spark 2.3.1) we get the incorrect output. was: Example code - {code} val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") df .repartition(1) .write .mode("overwrite") .option("nullValue", null) .option("emptyValue", null) .option("delimiter",",") .option("quoteMode", "NONE") .option("escape","\\") .format("csv") .save("/tmp/nullcsv/") var out = dbutils.fs.ls("/tmp/nullcsv/") var file = out(out.size - 1) val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) println(x) {code} Output - {code:java} 1,"" 3,hi 2,hello 4, {code} Expected output - {code:java} 1, 3,hi 2,hello 4, {code} [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] This commit is relevant to my issue. "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files." I am on Spark version 2.3.1, so empty strings should be coming as null. Even then, I am passing the correct "emptyValue" option. However, my empty values are stilling coming as `""` in the written file. I have tested the provided code in Databricks runtime environment 5.0 beta, and it is giving the expected output. 
> Double quote coming in as empty value even when emptyValue set as null > -- > > Key: SPARK-25739 > URL: https://issues.apache.org/jira/browse/SPARK-25739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Databricks - 4.2 (includes Apache Spark 2.3.1, Scala > 2.11) >Reporter: Brian Jones >Priority: Major > > Example code - > {code:java} > val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") > df > .repartition(1) > .write > .mode("overwrite") > .option("nullValue", null) > .option("emptyValue", null) > .option("delimiter",",") > .option("quoteMode", "NONE") > .option("escape","\\") > .format("csv") > .save("/tmp/nullcsv/") > var out = dbutils.fs.ls("/tmp/nullcsv/") > var file = out(out.size - 1) > val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) > println(x) > {code} > Output - > {code:java} > 1,"" > 3,hi > 2,hello > 4, > {code} > Expected output - > {code:java} > 1, > 3,hi > 2,hello > 4, > {code} > > [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] > This commit is relevant to my issue. > "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In > version 2.3 and earlier, empty strings are equal to `null` values and do not > reflect to any characters in saved CSV files." > I am on Spark version 2.3.1, so empty strings should be coming as null. Even > then, I am passing the correct "emptyValue" option. However, my empty values > are stilling coming as `""` in the written file. > > I have tested the provided code in Databricks runtime environment 5.0 and > 4.1, and it is giving the expected output. However in Databricks runtime > 4.2 and 4.3 (which are running spark 2.3.1) we get the incorrect output. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
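As a side note for later versions: the commit linked in the description adds an {{emptyValue}} write option which, on Spark 2.4+, lets the writer emit nothing for empty strings. A hedged sketch of that usage is below (hypothetical output path); on 2.3.x, which predates that commit, the CSV writer presumably just ignores the option, which would be consistent with the behavior reported here.

{code:scala}
import spark.implicits._

val df = List((1, ""), (2, "hello"), (3, "hi"), (4, null)).toDF("key", "value")
df.repartition(1)
  .write
  .mode("overwrite")
  .option("nullValue", null)
  .option("emptyValue", "")   // on 2.4+, write empty strings with no characters instead of ""
  .option("delimiter", ",")
  .option("escape", "\\")
  .format("csv")
  .save("/tmp/nullcsv/")
{code}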
[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null
[ https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Jones updated SPARK-25739: Description: Example code - {code} val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") df .repartition(1) .write .mode("overwrite") .option("nullValue", null) .option("emptyValue", null) .option("delimiter",",") .option("quoteMode", "NONE") .option("escape","\\") .format("csv") .save("/tmp/nullcsv/") var out = dbutils.fs.ls("/tmp/nullcsv/") var file = out(out.size - 1) val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) println(x) {code} Output - {code:java} 1,"" 3,hi 2,hello 4, {code} Expected output - {code:java} 1, 3,hi 2,hello 4, {code} [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] This commit is relevant to my issue. "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files." I am on Spark version 2.3.1, so empty strings should be coming as null. Even then, I am passing the correct "emptyValue" option. However, my empty values are stilling coming as `""` in the written file. I have tested the provided code in Databricks runtime environment 5.0 beta, and it is giving the expected output. was: Example code - {code:scala} val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") df .repartition(1) .write .mode("overwrite") .option("nullValue", null) .option("emptyValue", null) .option("delimiter",",") .option("quoteMode", "NONE") .option("escape","\\") .format("csv") .save("/tmp/nullcsv/") var out = dbutils.fs.ls("/tmp/nullcsv/") var file = out(out.size - 1) val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) println(x) {code} Output - {code:java} 1,"" 3,hi 2,hello 4, {code} Expected output - {code:java} 1, 3,hi 2,hello 4, {code} [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] This commit is relevant to my issue. "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files." I am on Spark version 2.3.1, so empty strings should be coming as null. Even then, I am passing the correct "emptyValue" option. However, my empty values are stilling coming as `""` in the written file. > Double quote coming in as empty value even when emptyValue set as null > -- > > Key: SPARK-25739 > URL: https://issues.apache.org/jira/browse/SPARK-25739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Databricks - 4.2 (includes Apache Spark 2.3.1, Scala > 2.11) >Reporter: Brian Jones >Priority: Major > > Example code - > {code} > val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") > df > .repartition(1) > .write > .mode("overwrite") > .option("nullValue", null) > .option("emptyValue", null) > .option("delimiter",",") > .option("quoteMode", "NONE") > .option("escape","\\") > .format("csv") > .save("/tmp/nullcsv/") > var out = dbutils.fs.ls("/tmp/nullcsv/") > var file = out(out.size - 1) > val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) > println(x) > {code} > Output - > {code:java} > 1,"" > 3,hi > 2,hello > 4, > {code} > Expected output - > {code:java} > 1, > 3,hi > 2,hello > 4, > {code} > > [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] > This commit is relevant to my issue. 
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In > version 2.3 and earlier, empty strings are equal to `null` values and do not > reflect to any characters in saved CSV files." > I am on Spark version 2.3.1, so empty strings should be coming as null. Even > then, I am passing the correct "emptyValue" option. However, my empty values > are stilling coming as `""` in the written file. > > I have tested the provided code in Databricks runtime environment 5.0 beta, > and it is giving the expected output. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null
[ https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Jones updated SPARK-25739: Description: Example code - {code:scala} val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") df .repartition(1) .write .mode("overwrite") .option("nullValue", null) .option("emptyValue", null) .option("delimiter",",") .option("quoteMode", "NONE") .option("escape","\\") .format("csv") .save("/tmp/nullcsv/") var out = dbutils.fs.ls("/tmp/nullcsv/") var file = out(out.size - 1) val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) println(x) {code} Output - {code:java} 1,"" 3,hi 2,hello 4, {code} Expected output - {code:java} 1, 3,hi 2,hello 4, {code} [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] This commit is relevant to my issue. "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files." I am on Spark version 2.3.1, so empty strings should be coming as null. Even then, I am passing the correct "emptyValue" option. However, my empty values are stilling coming as `""` in the written file. was: Example code - {code:java} val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") df .repartition(1) .write .mode("overwrite") .option("nullValue", null) .option("emptyValue", null) .option("delimiter",",") .option("quoteMode", "NONE") .option("escape","\\") .format("csv") .save("/tmp/nullcsv/") var out = dbutils.fs.ls("/tmp/nullcsv/") var file = out(out.size - 1) val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) println(x) {code} Output - {code:java} 1,"" 3,hi 2,hello 4, {code} Expected output - {code:java} 1, 3,hi 2,hello 4, {code} [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] This commit is relevant to my issue. "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files." I am on Spark version 2.3.1, so empty strings should be coming as null. Even then, I am passing the correct "emptyValue" option. However, my empty values are stilling coming as `""` in the written file. > Double quote coming in as empty value even when emptyValue set as null > -- > > Key: SPARK-25739 > URL: https://issues.apache.org/jira/browse/SPARK-25739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Databricks - 4.2 (includes Apache Spark 2.3.1, Scala > 2.11) >Reporter: Brian Jones >Priority: Major > > Example code - > {code:scala} > val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") > df > .repartition(1) > .write > .mode("overwrite") > .option("nullValue", null) > .option("emptyValue", null) > .option("delimiter",",") > .option("quoteMode", "NONE") > .option("escape","\\") > .format("csv") > .save("/tmp/nullcsv/") > var out = dbutils.fs.ls("/tmp/nullcsv/") > var file = out(out.size - 1) > val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) > println(x) > {code} > Output - > {code:java} > 1,"" > 3,hi > 2,hello > 4, > {code} > Expected output - > {code:java} > 1, > 3,hi > 2,hello > 4, > {code} > > [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] > This commit is relevant to my issue. > "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In > version 2.3 and earlier, empty strings are equal to `null` values and do not > reflect to any characters in saved CSV files." > I am on Spark version 2.3.1, so empty strings should be coming as null. Even > then, I am passing the correct "emptyValue" option. However, my empty values > are stilling coming as `""` in the written file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null
[ https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Jones updated SPARK-25739: Affects Version/s: (was: 2.3.2) 2.3.1 > Double quote coming in as empty value even when emptyValue set as null > -- > > Key: SPARK-25739 > URL: https://issues.apache.org/jira/browse/SPARK-25739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Databricks - 4.2 (includes Apache Spark 2.3.1, Scala > 2.11) >Reporter: Brian Jones >Priority: Major > > Example code - > {code:java} > val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") > df > .repartition(1) > .write > .mode("overwrite") > .option("nullValue", null) > .option("emptyValue", null) > .option("delimiter",",") > .option("quoteMode", "NONE") > .option("escape","\\") > .format("csv") > .save("/tmp/nullcsv/") > var out = dbutils.fs.ls("/tmp/nullcsv/") > var file = out(out.size - 1) > val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) > println(x) > {code} > Output - > {code:java} > 1,"" > 3,hi > 2,hello > 4, > {code} > Expected output - > {code:java} > 1, > 3,hi > 2,hello > 4, > {code} > > [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] > This commit is relevant to my issue. > "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In > version 2.3 and earlier, empty strings are equal to `null` values and do not > reflect to any characters in saved CSV files." > I am on Spark version 2.3.1, so empty strings should be coming as null. Even > then, I am passing the correct "emptyValue" option. However, my empty values > are stilling coming as `""` in the written file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null
[ https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Jones updated SPARK-25739: Description: Example code - {code:java} val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") df .repartition(1) .write .mode("overwrite") .option("nullValue", null) .option("emptyValue", null) .option("delimiter",",") .option("quoteMode", "NONE") .option("escape","\\") .format("csv") .save("/tmp/nullcsv/") var out = dbutils.fs.ls("/tmp/nullcsv/") var file = out(out.size - 1) val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) println(x) {code} Output - {code:java} 1,"" 3,hi 2,hello 4, {code} Expected output - {code:java} 1, 3,hi 2,hello 4, {code} [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] This commit is relevant to my issue. "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files." I am on Spark version 2.3.1, so empty strings should be coming as null. Even then, I am passing the correct "emptyValue" option. However, my empty values are stilling coming as `""` in the written file. was: Example code - {code:java} val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") df .repartition(1) .write .mode("overwrite") .option("nullValue", null) .option("emptyValue", null) .option("delimiter",",") .option("quoteMode", "NONE") .option("escape","\\") .format("csv") .save("/tmp/nullcsv/") var out = dbutils.fs.ls("/tmp/nullcsv/") var file = out(out.size - 1) val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) println(x) {code} Output - {code:java} 1,"" 3,hi 2,hello 4, {code} Expected output - {code:java} 1, 3,hi 2,hello 4, {code} [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] This commit is relevant to my issue. "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files." I am on Spark version 2.3.2, so empty strings should be coming as null. Even then, I am passing the correct "emptyValue" option. However, my empty values are stilling coming as `""` in the written file. > Double quote coming in as empty value even when emptyValue set as null > -- > > Key: SPARK-25739 > URL: https://issues.apache.org/jira/browse/SPARK-25739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 > Environment: Databricks - 4.2 (includes Apache Spark 2.3.1, Scala > 2.11) >Reporter: Brian Jones >Priority: Major > > Example code - > {code:java} > val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") > df > .repartition(1) > .write > .mode("overwrite") > .option("nullValue", null) > .option("emptyValue", null) > .option("delimiter",",") > .option("quoteMode", "NONE") > .option("escape","\\") > .format("csv") > .save("/tmp/nullcsv/") > var out = dbutils.fs.ls("/tmp/nullcsv/") > var file = out(out.size - 1) > val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) > println(x) > {code} > Output - > {code:java} > 1,"" > 3,hi > 2,hello > 4, > {code} > Expected output - > {code:java} > 1, > 3,hi > 2,hello > 4, > {code} > > [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] > This commit is relevant to my issue. > "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In > version 2.3 and earlier, empty strings are equal to `null` values and do not > reflect to any characters in saved CSV files." > I am on Spark version 2.3.1, so empty strings should be coming as null. Even > then, I am passing the correct "emptyValue" option. However, my empty values > are stilling coming as `""` in the written file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null
[ https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Jones updated SPARK-25739: Environment: Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 2.11) (was: Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 2.11) ) > Double quote coming in as empty value even when emptyValue set as null > -- > > Key: SPARK-25739 > URL: https://issues.apache.org/jira/browse/SPARK-25739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 > Environment: Databricks - 4.2 (includes Apache Spark 2.3.1, Scala > 2.11) >Reporter: Brian Jones >Priority: Major > > Example code - > {code:java} > val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") > df > .repartition(1) > .write > .mode("overwrite") > .option("nullValue", null) > .option("emptyValue", null) > .option("delimiter",",") > .option("quoteMode", "NONE") > .option("escape","\\") > .format("csv") > .save("/tmp/nullcsv/") > var out = dbutils.fs.ls("/tmp/nullcsv/") > var file = out(out.size - 1) > val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) > println(x) > {code} > Output - > {code:java} > 1,"" > 3,hi > 2,hello > 4, > {code} > Expected output - > {code:java} > 1, > 3,hi > 2,hello > 4, > {code} > > [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] > This commit is relevant to my issue. > "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In > version 2.3 and earlier, empty strings are equal to `null` values and do not > reflect to any characters in saved CSV files." > I am on Spark version 2.3.2, so empty strings should be coming as null. Even > then, I am passing the correct "emptyValue" option. However, my empty values > are stilling coming as `""` in the written file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null
[ https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Jones updated SPARK-25739: Description: Example code - {code:java} val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") df .repartition(1) .write .mode("overwrite") .option("nullValue", null) .option("emptyValue", null) .option("delimiter",",") .option("quoteMode", "NONE") .option("escape","\\") .format("csv") .save("/tmp/nullcsv/") var out = dbutils.fs.ls("/tmp/nullcsv/") var file = out(out.size - 1) val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) println(x) {code} Output - {code:java} 1,"" 3,hi 2,hello 4, {code} Expected output - {code:java} 1, 3,hi 2,hello 4, {code} [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] This commit is relevant to my issue. "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files." I am on Spark version 2.3.2, so empty strings should be coming as null. Even then, I am passing the correct "emptyValue" option. However, my empty values are stilling coming as `""` in the written file. > Double quote coming in as empty value even when emptyValue set as null > -- > > Key: SPARK-25739 > URL: https://issues.apache.org/jira/browse/SPARK-25739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 > Environment: Example code - > {code:java} > val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") > df > .repartition(1) > .write > .mode("overwrite") > .option("nullValue", null) > .option("emptyValue", null) > .option("delimiter",",") > .option("quoteMode", "NONE") > .option("escape","\\") > .format("csv") > .save("/tmp/nullcsv/") > var out = dbutils.fs.ls("/tmp/nullcsv/") > var file = out(out.size - 1) > val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) > println(x) > {code} > Output - > {code:java} > 1,"" > 3,hi > 2,hello > 4, > {code} > Expected output - > {code:java} > 1, > 3,hi > 2,hello > 4, > {code} > > [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] > This commit is relevant to my issue. > "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In > version 2.3 and earlier, empty strings are equal to `null` values and do not > reflect to any characters in saved CSV files." > I am on Spark version 2.3.2, so empty strings should be coming as null. Even > then, I am passing the correct "emptyValue" option. However, my empty values > are stilling coming as `""` in the written file. > >Reporter: Brian Jones >Priority: Major > > Example code - > {code:java} > val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") > df > .repartition(1) > .write > .mode("overwrite") > .option("nullValue", null) > .option("emptyValue", null) > .option("delimiter",",") > .option("quoteMode", "NONE") > .option("escape","\\") > .format("csv") > .save("/tmp/nullcsv/") > var out = dbutils.fs.ls("/tmp/nullcsv/") > var file = out(out.size - 1) > val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) > println(x) > {code} > Output - > {code:java} > 1,"" > 3,hi > 2,hello > 4, > {code} > Expected output - > {code:java} > 1, > 3,hi > 2,hello > 4, > {code} > > [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] > This commit is relevant to my issue. > "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In > version 2.3 and earlier, empty strings are equal to `null` values and do not > reflect to any characters in saved CSV files." > I am on Spark version 2.3.2, so empty strings should be coming as null. Even > then, I am passing the correct "emptyValue" option. However, my empty values > are stilling coming as `""` in the written file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null
[ https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Jones updated SPARK-25739: Environment: Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 2.11) was: Example code - {code:java} val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") df .repartition(1) .write .mode("overwrite") .option("nullValue", null) .option("emptyValue", null) .option("delimiter",",") .option("quoteMode", "NONE") .option("escape","\\") .format("csv") .save("/tmp/nullcsv/") var out = dbutils.fs.ls("/tmp/nullcsv/") var file = out(out.size - 1) val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) println(x) {code} Output - {code:java} 1,"" 3,hi 2,hello 4, {code} Expected output - {code:java} 1, 3,hi 2,hello 4, {code} [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] This commit is relevant to my issue. "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files." I am on Spark version 2.3.2, so empty strings should be coming as null. Even then, I am passing the correct "emptyValue" option. However, my empty values are stilling coming as `""` in the written file. > Double quote coming in as empty value even when emptyValue set as null > -- > > Key: SPARK-25739 > URL: https://issues.apache.org/jira/browse/SPARK-25739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 > Environment: > Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 2.11) >Reporter: Brian Jones >Priority: Major > > Example code - > {code:java} > val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") > df > .repartition(1) > .write > .mode("overwrite") > .option("nullValue", null) > .option("emptyValue", null) > .option("delimiter",",") > .option("quoteMode", "NONE") > .option("escape","\\") > .format("csv") > .save("/tmp/nullcsv/") > var out = dbutils.fs.ls("/tmp/nullcsv/") > var file = out(out.size - 1) > val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) > println(x) > {code} > Output - > {code:java} > 1,"" > 3,hi > 2,hello > 4, > {code} > Expected output - > {code:java} > 1, > 3,hi > 2,hello > 4, > {code} > > [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] > This commit is relevant to my issue. > "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In > version 2.3 and earlier, empty strings are equal to `null` values and do not > reflect to any characters in saved CSV files." > I am on Spark version 2.3.2, so empty strings should be coming as null. Even > then, I am passing the correct "emptyValue" option. However, my empty values > are stilling coming as `""` in the written file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null
Brian Jones created SPARK-25739: --- Summary: Double quote coming in as empty value even when emptyValue set as null Key: SPARK-25739 URL: https://issues.apache.org/jira/browse/SPARK-25739 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.2 Environment: Example code - {code:java} val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") df .repartition(1) .write .mode("overwrite") .option("nullValue", null) .option("emptyValue", null) .option("delimiter",",") .option("quoteMode", "NONE") .option("escape","\\") .format("csv") .save("/tmp/nullcsv/") var out = dbutils.fs.ls("/tmp/nullcsv/") var file = out(out.size - 1) val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) println(x) {code} Output - {code:java} 1,"" 3,hi 2,hello 4, {code} Expected output - {code:java} 1, 3,hi 2,hello 4, {code} [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] This commit is relevant to my issue. "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files." I am on Spark version 2.3.2, so empty strings should be coming as null. Even then, I am passing the correct "emptyValue" option. However, my empty values are stilling coming as `""` in the written file. Reporter: Brian Jones -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650817#comment-16650817 ] Daniel Darabos commented on SPARK-20144: Thanks, those are good questions. # The global option is not great, but it's the simplest. The code is already controlled by two global options. ({{spark.sql.files.maxPartitionBytes}} and {{spark.sql.files.openCostInBytes}}.) Why not one more? # I'm not sure what {{LOAD DATA INPATH}} does. (Sorry...) But sure, users can put random-name files in the directory and mess stuff up. Best protection against that is not putting random-name files in the directory. :D # The whole problem is not Parquet-specific. It affects all file types. The {{part-1}} naming comes from Hadoop's [FileOutputFormat|https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java#L270]. It's been like this forever and will never change. (I'd say it's more than a convention.) > spark.read.parquet no long maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin >Priority: Major > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was reproduced with. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reordered them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
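Until the ordering question is settled, a common workaround is to make the ordering explicit rather than relying on file or partition order, which the scan may rearrange when it coalesces files into fewer splits. The sketch below is not from the ticket; {{df}} and the path are hypothetical.

{code:scala}
import org.apache.spark.sql.functions.monotonically_increasing_id

// Persist an explicit ordering column before writing...
val indexed = df.withColumn("row_idx", monotonically_increasing_id())
indexed.write.mode("overwrite").parquet("/tmp/ordered_table")

// ...and restore the intended order explicitly on read.
val restored = spark.read.parquet("/tmp/ordered_table").orderBy("row_idx")
{code}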
[jira] [Assigned] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port
[ https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25738: Assignee: (was: Apache Spark) > LOAD DATA INPATH doesn't work if hdfs conf includes port > > > Key: SPARK-25738 > URL: https://issues.apache.org/jira/browse/SPARK-25738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Blocker > > LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address > at index 8}} if your hdfs conf includes a port for the namenode. > This is because the URI is passing in the value of the hdfs conf > "fs.defaultFS" in for the host. Note that variable is called {{authority}}, > but the 4-arg URI constructor actually expects a host: > https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String) > {code} > val defaultFSConf = > sparkSession.sessionState.newHadoopConf().get("fs.defaultFS") > ... > val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment) > {code} > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386 > This was introduced by SPARK-23425. > *Workaround*: specify a fully qualified path, eg. instead of > {noformat} > LOAD DATA INPATH '/some/path/on/hdfs' > {noformat} > use > {noformat} > LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
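The constructor mismatch described in the issue can be illustrated with a small sketch (not the actual Spark patch; the authority value is hypothetical). The 4-arg {{java.net.URI}} constructor treats its second argument as a host, so an authority carrying a port gets wrapped as if it were an IPv6 literal and fails to parse, while the 5-arg constructor accepts a full authority.

{code:scala}
import java.net.URI

val scheme = "hdfs"
val authority = "namenode.example.com:8020"   // hypothetical fs.defaultFS authority
val path = "/some/path/on/hdfs"

// new URI(scheme, authority, path, null)                    // (scheme, host, path, fragment): URISyntaxException
val newUri = new URI(scheme, authority, path, null, null)    // (scheme, authority, path, query, fragment): parses fine
{code}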
[jira] [Assigned] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port
[ https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25738: Assignee: Apache Spark > LOAD DATA INPATH doesn't work if hdfs conf includes port > > > Key: SPARK-25738 > URL: https://issues.apache.org/jira/browse/SPARK-25738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Blocker > > LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address > at index 8}} if your hdfs conf includes a port for the namenode. > This is because the URI is passing in the value of the hdfs conf > "fs.defaultFS" in for the host. Note that variable is called {{authority}}, > but the 4-arg URI constructor actually expects a host: > https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String) > {code} > val defaultFSConf = > sparkSession.sessionState.newHadoopConf().get("fs.defaultFS") > ... > val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment) > {code} > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386 > This was introduced by SPARK-23425. > *Workaround*: specify a fully qualified path, eg. instead of > {noformat} > LOAD DATA INPATH '/some/path/on/hdfs' > {noformat} > use > {noformat} > LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port
[ https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650807#comment-16650807 ] Apache Spark commented on SPARK-25738: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/22733 > LOAD DATA INPATH doesn't work if hdfs conf includes port > > > Key: SPARK-25738 > URL: https://issues.apache.org/jira/browse/SPARK-25738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Blocker > > LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address > at index 8}} if your hdfs conf includes a port for the namenode. > This is because the URI is passing in the value of the hdfs conf > "fs.defaultFS" in for the host. Note that variable is called {{authority}}, > but the 4-arg URI constructor actually expects a host: > https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String) > {code} > val defaultFSConf = > sparkSession.sessionState.newHadoopConf().get("fs.defaultFS") > ... > val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment) > {code} > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386 > This was introduced by SPARK-23425. > *Workaround*: specify a fully qualified path, eg. instead of > {noformat} > LOAD DATA INPATH '/some/path/on/hdfs' > {noformat} > use > {noformat} > LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port
[ https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650799#comment-16650799 ] Imran Rashid commented on SPARK-25738: -- the fix is pretty trivial, I'm posting a pr now > LOAD DATA INPATH doesn't work if hdfs conf includes port > > > Key: SPARK-25738 > URL: https://issues.apache.org/jira/browse/SPARK-25738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Blocker > > LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address > at index 8}} if your hdfs conf includes a port for the namenode. > This is because the URI is passing in the value of the hdfs conf > "fs.defaultFS" in for the host. Note that variable is called {{authority}}, > but the 4-arg URI constructor actually expects a host: > https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String) > {code} > val defaultFSConf = > sparkSession.sessionState.newHadoopConf().get("fs.defaultFS") > ... > val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment) > {code} > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386 > This was introduced by SPARK-23425. > *Workaround*: specify a fully qualified path, eg. instead of > {noformat} > LOAD DATA INPATH '/some/path/on/hdfs' > {noformat} > use > {noformat} > LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port
[ https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650796#comment-16650796 ] Shixiong Zhu commented on SPARK-25738: -- Marked as a blocker since this is a regression > LOAD DATA INPATH doesn't work if hdfs conf includes port > > > Key: SPARK-25738 > URL: https://issues.apache.org/jira/browse/SPARK-25738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Blocker > > LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address > at index 8}} if your hdfs conf includes a port for the namenode. > This is because the URI is passing in the value of the hdfs conf > "fs.defaultFS" in for the host. Note that variable is called {{authority}}, > but the 4-arg URI constructor actually expects a host: > https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String) > {code} > val defaultFSConf = > sparkSession.sessionState.newHadoopConf().get("fs.defaultFS") > ... > val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment) > {code} > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386 > This was introduced by SPARK-23425. > *Workaround*: specify a fully qualified path, eg. instead of > {noformat} > LOAD DATA INPATH '/some/path/on/hdfs' > {noformat} > use > {noformat} > LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port
[ https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-25738: - Priority: Blocker (was: Critical) > LOAD DATA INPATH doesn't work if hdfs conf includes port > > > Key: SPARK-25738 > URL: https://issues.apache.org/jira/browse/SPARK-25738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Blocker > > LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address > at index 8}} if your hdfs conf includes a port for the namenode. > This is because the URI is passing in the value of the hdfs conf > "fs.defaultFS" in for the host. Note that variable is called {{authority}}, > but the 4-arg URI constructor actually expects a host: > https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String) > {code} > val defaultFSConf = > sparkSession.sessionState.newHadoopConf().get("fs.defaultFS") > ... > val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment) > {code} > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386 > This was introduced by SPARK-23425. > *Workaround*: specify a fully qualified path, eg. instead of > {noformat} > LOAD DATA INPATH '/some/path/on/hdfs' > {noformat} > use > {noformat} > LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port
Imran Rashid created SPARK-25738: Summary: LOAD DATA INPATH doesn't work if hdfs conf includes port Key: SPARK-25738 URL: https://issues.apache.org/jira/browse/SPARK-25738 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Imran Rashid LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address at index 8}} if your hdfs conf includes a port for the namenode. This is because the URI is passing in the value of the hdfs conf "fs.defaultFS" in for the host. Note that variable is called {{authority}}, but the 4-arg URI constructor actually expects a host: https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String) {code} val defaultFSConf = sparkSession.sessionState.newHadoopConf().get("fs.defaultFS") ... val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment) {code} https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386 This was introduced by SPARK-23425. *Workaround*: specify a fully qualified path, eg. instead of {noformat} LOAD DATA INPATH '/some/path/on/hdfs' {noformat} use {noformat} LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs' {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650779#comment-16650779 ] Dongjoon Hyun commented on SPARK-20144: --- [~silvermast] and [~darabos]. 1. The proposed `spark.sql.files.allowReordering` is a global option for all data sources. 2. In parquet tables, can we prevent some users execute 'LOAD DATA INPATH' to that folder with a random-name file? 3. Is there any way for Hive/Spark/Parquet to keep that naming convention in that table always? > spark.read.parquet no long maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin >Priority: Major > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was reproduced with. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reordered them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650722#comment-16650722 ] Daniel Darabos commented on SPARK-20144: Yeah, I'm not too happy about the alphabetical ordering either. I thought I could simply not sort, and get the "original" order. But at the point where I made my change, the files are already in a jumbled order. Maybe it's the file system listing order, which could be anything. 99% of the time I'm just reading back a single partitioned Parquet file. In this case the alphabetical ordering is the right ordering. ({{part-1}}, {{part-2}}, ...) The rows of the resulting DataFrame will be in the same order as originally. So I think this issue is satisfied by the change. (The test also demonstrates this.) The 1% case (for me) is when I'm reading back multiple Parquet files with a glob in a single {{spark.read.parquet("dir-\{0,5,10}")}} call. In this case it would be nice to respect the order given by the user ({{dir-0}}, {{dir-5}}, {{dir-10}}). My PR messes this up. ({{dir-0}}, {{dir-10}}, {{dir-5}}) But at least the partitions within each Parquet file will be contiguous. That's still an improvement. > spark.read.parquet no long maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin >Priority: Major > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was reproduced with. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reordered them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
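For the multi-directory case mentioned above, one possible workaround (not from the ticket; paths are hypothetical) is to read each path separately and union the results in the caller's order, rather than relying on how a single glob read sorts the matched files:

{code:scala}
val paths = Seq("/data/dir-0", "/data/dir-5", "/data/dir-10")
val inCallerOrder = paths.map(p => spark.read.parquet(p)).reduce(_ union _)
{code}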
[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650704#comment-16650704 ] Victor Tso commented on SPARK-20144: It should, because by convention the parquet files are 0-padded numerically ordered. > spark.read.parquet no long maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin >Priority: Major > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was reproduced with. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reordered them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25424) Window duration and slide duration with negative values should fail fast
[ https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghav Kumar Gautam resolved SPARK-25424. - Resolution: Not A Bug > Window duration and slide duration with negative values should fail fast > > > Key: SPARK-25424 > URL: https://issues.apache.org/jira/browse/SPARK-25424 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Raghav Kumar Gautam >Priority: Major > > In TimeWindow class window duration and slide duration should not be allowed > to take negative values. > Currently this behaviour enforced by catalyst. It can be enforced by > constructor of TimeWindow allowing it to fail fast. > For e.g. the code below throws following error. Note that the error is > produced at the time of count() call instead of window() call. > {code:java} > val df = spark.readStream > .format("rate") > .option("numPartitions", "2") > .option("rowsPerSecond", "10") > .load() > .filter("value % 20 == 0") > .withWatermark("timestamp", "10 seconds") > .groupBy(window($"timestamp", "-10 seconds", "5 seconds")) > .count() > {code} > Error: > {code:java} > cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data > type mismatch: The window duration (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, > -1000, 500, 0)' due to data type mismatch: The window duration > (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122) > at >
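The fail-fast behavior the reporter asks for amounts to validating the durations when the window specification is constructed, rather than when the plan is analyzed. A sketch of that idea, using a hypothetical class rather than Spark's own TimeWindow:

{code:scala}
// Hypothetical spec class; not Spark's TimeWindow.
case class TimeWindowSpec(windowDuration: Long, slideDuration: Long, startTime: Long) {
  require(windowDuration > 0,
    s"The window duration ($windowDuration) must be greater than 0.")
  require(slideDuration > 0,
    s"The slide duration ($slideDuration) must be greater than 0.")
  require(slideDuration <= windowDuration,
    s"The slide duration ($slideDuration) must be less than or equal to the window duration ($windowDuration).")
}

// TimeWindowSpec(-10000000L, 5000000L, 0L)   // would throw IllegalArgumentException immediately
{code}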
[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650642#comment-16650642 ] Dongjoon Hyun commented on SPARK-20144: --- For me, I don't think that PR resolve this issue, `spark.read.parquet no long maintains ordering of the data`. This issue asked 'data ordering'. The alphabetical file path order doesn't guarantee the order of the data, does it? > spark.read.parquet no long maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin >Priority: Major > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was reproduced with. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reordered them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25044) Address translation of LMF closure primitive args to Object in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-25044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650608#comment-16650608 ] Apache Spark commented on SPARK-25044: -- User 'maryannxue' has created a pull request for this issue: https://github.com/apache/spark/pull/22732 > Address translation of LMF closure primitive args to Object in Scala 2.12 > - > > Key: SPARK-25044 > URL: https://issues.apache.org/jira/browse/SPARK-25044 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Fix For: 2.4.0 > > > A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 > Fix HandleNullInputsForUDF rule": > {code:java} > - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED *** > Results do not match for query: > ... > == Results == > == Results == > !== Correct Answer - 3 == == Spark Answer - 3 == > !struct<> struct > ![0,10,null] [0,10,0] > ![1,12,null] [1,12,1] > ![2,14,null] [2,14,2] (QueryTest.scala:163){code} > You can kind of get what's going on reading the test: > {code:java} > test("SPARK-24891 Fix HandleNullInputsForUDF rule") { > // assume(!ClosureCleanerSuite2.supportsLMFs) > // This test won't test what it intends to in 2.12, as lambda metafactory > closures > // have arg types that are not primitive, but Object > val udf1 = udf({(x: Int, y: Int) => x + y}) > val df = spark.range(0, 3).toDF("a") > .withColumn("b", udf1($"a", udf1($"a", lit(10 > .withColumn("c", udf1($"a", lit(null))) > val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed > comparePlans(df.logicalPlan, plan) > checkAnswer( > df, > Seq( > Row(0, 10, null), > Row(1, 12, null), > Row(2, 14, null))) > }{code} > > It seems that the closure that is fed in as a UDF changes behavior, in a way > that primitive-type arguments are handled differently. For example an Int > argument, when fed 'null', acts like 0. > I'm sure it's a difference in the LMF closure and how its types are > understood, but not exactly sure of the cause yet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
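The "null acts like 0" behavior mentioned in the description can be reproduced outside Spark as ordinary unboxing of a boxed null; the snippet below (not from the ticket) only illustrates that unboxing, not the 2.11-versus-2.12 analysis difference itself.

{code:scala}
// A primitive-typed closure reached through an Object-typed view:
val add: (Int, Int) => Int = (x, y) => x + y
val erased = add.asInstanceOf[(Any, Any) => Int]
val r = erased(null, 10)   // 10: the null Int argument unboxes to 0 instead of surfacing as null
{code}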
[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650591#comment-16650591 ] Marcelo Vanzin commented on SPARK-25732: In fact, do you even need proxy user + keytab at all? If you just submit the application using principal / keytab from the command line, as is available today, isn't that the exact same thing? Other than the keytab needing to be available on the submitting node, seems like it does all you need? Only thing missing would be to download the keytab from remote storage if needed, and then cleaning it up. > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > As of now, application submitted with proxy-user fail after 2 week due to the > lack of token renewal. In order to enable it, we need the the > keytab/principal of the impersonated user to be specified, in order to have > them available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, and the last letting a keytab being specified also > in a distributed FS, so that applications can be submitted by servers (eg. > Livy, Zeppelin) without needing all users' principals being on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
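For reference, the existing command-line form being referred to looks roughly like the following (principal, keytab path, and application are hypothetical):

{noformat}
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal alice@EXAMPLE.COM \
  --keytab /path/to/alice.keytab \
  --class com.example.MyApp \
  my-app.jar
{noformat}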
[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650560#comment-16650560 ] Thomas Graves commented on SPARK-25732: --- I would much rather see Spark start pushing tokens and distributing them. I'm not fond of pushing keytabs; many security folks/companies won't allow it. If you do this it means that all the users' keytabs are in HDFS all the time, which in my opinion is even worse than our existing keytab/principal options, where the keytab can be picked up locally and is only in HDFS temporarily. Just more chances that permissions are messed up and that people compromise keytabs, which last indefinitely and are much harder to revoke than things like tokens. Pushing tokens is definitely more work, but I think we should go that way long term. Having an RPC connection between client and driver can be useful for other things as well. > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > As of now, applications submitted with proxy-user fail after 2 weeks due to the > lack of token renewal. In order to enable it, we need the > keytab/principal of the impersonated user to be specified, so that they are > available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, with the latter also allowing the keytab to be specified > in a distributed FS, so that applications can be submitted by servers (e.g. > Livy, Zeppelin) without needing all users' principals to be on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25547) Pluggable jdbc connection factory
[ https://issues.apache.org/jira/browse/SPARK-25547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25547: Target Version/s: 3.0.0 > Pluggable jdbc connection factory > - > > Key: SPARK-25547 > URL: https://issues.apache.org/jira/browse/SPARK-25547 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Frank Sauer >Priority: Major > > The ability to provide a custom connectionFactoryProvider via JDBCOptions so > that JdbcUtils.createConnectionFactory can produce a custom connection > factory would be very useful. In our case we needed to have the ability to > load balance connections to an AWS Aurora Postgres cluster by round-robining > through the endpoints of the read replicas, since their own load balancing was > insufficient. We worked around it by copying most of the spark jdbc package, > providing this feature there, and changing the format from jdbc to our new > package. However it would be nice if this were supported out of the box via > a new option in JDBCOptions providing the classname for a > ConnectionFactoryProvider. I'm creating this Jira in order to submit a PR > which I have ready to go. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
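A rough Scala sketch of the kind of plug-in point the description asks for; the trait name, option keys, and wiring below are assumptions for illustration, not the API from the pending PR.
{code:scala}
import java.sql.{Connection, DriverManager}
import java.util.concurrent.atomic.AtomicLong

// Hypothetical SPI: given the JDBC options, return a factory that creates connections.
trait ConnectionFactoryProvider extends Serializable {
  def createConnectionFactory(options: Map[String, String]): () => Connection
}

// Round-robin over Aurora read-replica endpoints, as the reporter describes doing by hand today.
class RoundRobinConnectionFactoryProvider extends ConnectionFactoryProvider {
  private val next = new AtomicLong(0L)
  override def createConnectionFactory(options: Map[String, String]): () => Connection = {
    val endpoints = options("readReplicaUrls").split(",").map(_.trim) // assumed option key
    () => {
      val url = endpoints((next.getAndIncrement() % endpoints.length).toInt)
      DriverManager.getConnection(url, options("user"), options("password"))
    }
  }
}
{code}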
[jira] [Commented] (SPARK-25674) If the records are incremented by more than 1 at a time,the number of bytes might rarely ever get updated
[ https://issues.apache.org/jira/browse/SPARK-25674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650541#comment-16650541 ] Apache Spark commented on SPARK-25674: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/22731 > If the records are incremented by more than 1 at a time,the number of bytes > might rarely ever get updated > - > > Key: SPARK-25674 > URL: https://issues.apache.org/jira/browse/SPARK-25674 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: liuxian >Assignee: liuxian >Priority: Minor > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > If the records are incremented by more than 1 at a time,the number of bytes > might rarely ever get updated in `FileScanRDD.scala`,because it might skip > over the count that is an exact multiple of > UPDATE_INPUT_METRICS_INTERVAL_RECORDS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
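A minimal sketch (not the actual FileScanRDD code) of the pattern the description refers to, with INTERVAL standing in for UPDATE_INPUT_METRICS_INTERVAL_RECORDS: a plain modulus check can be jumped over when the counter advances by more than 1, while comparing the interval bucket before and after the increment cannot.
{code:scala}
object MetricsIntervalDemo {
  val INTERVAL = 1000L  // stands in for UPDATE_INPUT_METRICS_INTERVAL_RECORDS
  var recordsRead = 0L

  // Buggy pattern: if the counter jumps from 999 to 1001, the exact multiple 1000 is never hit.
  def incModulo(n: Long)(updateBytesRead: () => Unit): Unit = {
    recordsRead += n
    if (recordsRead % INTERVAL == 0) updateBytesRead()
  }

  // Fixed pattern: fire whenever the increment crosses an interval boundary.
  def incCrossing(n: Long)(updateBytesRead: () => Unit): Unit = {
    val before = recordsRead
    recordsRead += n
    if (before / INTERVAL != recordsRead / INTERVAL) updateBytesRead()
  }
}
{code}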
[jira] [Commented] (SPARK-25674) If the records are incremented by more than 1 at a time,the number of bytes might rarely ever get updated
[ https://issues.apache.org/jira/browse/SPARK-25674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650539#comment-16650539 ] Apache Spark commented on SPARK-25674: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/22731 > If the records are incremented by more than 1 at a time,the number of bytes > might rarely ever get updated > - > > Key: SPARK-25674 > URL: https://issues.apache.org/jira/browse/SPARK-25674 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: liuxian >Assignee: liuxian >Priority: Minor > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > If the records are incremented by more than 1 at a time,the number of bytes > might rarely ever get updated in `FileScanRDD.scala`,because it might skip > over the count that is an exact multiple of > UPDATE_INPUT_METRICS_INTERVAL_RECORDS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650535#comment-16650535 ] Marcelo Vanzin commented on SPARK-25732: I'd have preferred a system where Livy handles this for the users by periodically creating new delegation tokens for them and sending them to Spark. Saisai looked at something like this in the past, but distributing the tokens to Spark was the main issue. With Livy you already have an RPC channel to the Spark context, so maybe that could be done? But it would probably still require some new API in Spark itself... If those paths don't work, then this would be no worse than what Spark already has. The main issue is that you seem to be mixing keytab/principal with proxy user, and that doesn't work - Spark explicitly disallows that combination. > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > As of now, applications submitted with proxy-user fail after 2 weeks due to the > lack of token renewal. In order to enable it, we need the > keytab/principal of the impersonated user to be specified, so that they are > available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, with the latter also allowing the keytab to be specified > in a distributed FS, so that applications can be submitted by servers (e.g. > Livy, Zeppelin) without needing all users' principals to be on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18278) SPIP: Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650524#comment-16650524 ] Matt Cheah commented on SPARK-18278: The fork is no longer being maintained, because Kubernetes support is now being built and available on mainline. Spark 2.3 introduced basic support for Kubernetes. Spark 2.4 will have Python support. Dynamic allocation is being reworked, see [my comment above|https://issues.apache.org/jira/browse/SPARK-18278?focusedCommentId=16646282=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16646282]. > SPIP: Support native submission of spark jobs to a kubernetes cluster > - > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Kubernetes, Scheduler, > Spark Core >Affects Versions: 2.3.0 >Reporter: Erik Erlandson >Assignee: Anirudh Ramanathan >Priority: Major > Labels: SPIP > Attachments: SPARK-18278 Spark on Kubernetes Design Proposal Revision > 2 (1).pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18278) SPIP: Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650519#comment-16650519 ] Oleg Frenkel commented on SPARK-18278: -- When is this fork expected to make its way to the official Spark 2.3 release? My company is specifically looking for Spark Python support with kubernetes, and the current implementation in this fork is good enough for us. Thanks. > SPIP: Support native submission of spark jobs to a kubernetes cluster > - > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Kubernetes, Scheduler, > Spark Core >Affects Versions: 2.3.0 >Reporter: Erik Erlandson >Assignee: Anirudh Ramanathan >Priority: Major > Labels: SPIP > Attachments: SPARK-18278 Spark on Kubernetes Design Proposal Revision > 2 (1).pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650492#comment-16650492 ] Daniel Darabos commented on SPARK-20144: Thanks Victor! I've expanded the test with a case where reordering is allowed, and I've added some explanatory comments in the test. [~dongjoon], what do you think? Should I try to foster more discussion? Or what could be a next step? > spark.read.parquet no long maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin >Priority: Major > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was reproduced with. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reordered them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16775) Remove deprecated accumulator v1 APIs
[ https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16775: Assignee: (was: Apache Spark) > Remove deprecated accumulator v1 APIs > - > > Key: SPARK-16775 > URL: https://issues.apache.org/jira/browse/SPARK-16775 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: holdenk >Priority: Major > > Deprecating the old accumulator API added a large number of warnings - many > of these could be fixed with a bit of refactoring to offer a non-deprecated > internal class while still preserving the external deprecation warnings. > ... but we can also just remove the old deprecated API entirely in Spark > 3.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16775) Remove deprecated accumulator v1 APIs
[ https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650472#comment-16650472 ] Apache Spark commented on SPARK-16775: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/22730 > Remove deprecated accumulator v1 APIs > - > > Key: SPARK-16775 > URL: https://issues.apache.org/jira/browse/SPARK-16775 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: holdenk >Priority: Major > > Deprecating the old accumulator API added a large number of warnings - many > of these could be fixed with a bit of refactoring to offer a non-deprecated > internal class while still preserving the external deprecation warnings. > ... but we can also just remove the old deprecated API entirely in Spark > 3.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
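For callers hitting those deprecation warnings (or the eventual removal), a brief migration sketch; it assumes an existing SparkContext {{sc}} and an RDD {{rdd}}, and the accumulator name is made up.
{code:scala}
// Old v1 API, deprecated since 2.0 and targeted for removal here:
//   val counter = sc.accumulator(0L, "records")
//   rdd.foreach(_ => counter += 1L)

// AccumulatorV2 replacement that has been available since Spark 2.0:
val counter = sc.longAccumulator("records")
rdd.foreach(_ => counter.add(1))
println(counter.value)
{code}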
[jira] [Assigned] (SPARK-16775) Remove deprecated accumulator v1 APIs
[ https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16775: Assignee: Apache Spark > Remove deprecated accumulator v1 APIs > - > > Key: SPARK-16775 > URL: https://issues.apache.org/jira/browse/SPARK-16775 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: holdenk >Assignee: Apache Spark >Priority: Major > > Deprecating the old accumulator API added a large number of warnings - many > of these could be fixed with a bit of refactoring to offer a non-deprecated > internal class while still preserving the external deprecation warnings. > ... but we can also just remove the old deprecated API entirely in Spark > 3.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13478) Fetching delegation tokens for Hive fails when using proxy users
[ https://issues.apache.org/jira/browse/SPARK-13478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650451#comment-16650451 ] Marcelo Vanzin commented on SPARK-13478: You don't need a keytab to log in to kerberos... > Fetching delegation tokens for Hive fails when using proxy users > > > Key: SPARK-13478 > URL: https://issues.apache.org/jira/browse/SPARK-13478 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0, 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 1.6.4, 2.0.0 > > > If you use spark-submit's proxy user support, the code that fetches > delegation tokens for the Hive Metastore fails. It seems like the Hive > library tries to connect to the Metastore as the proxy user, and it doesn't > have a Kerberos TGT for that user, so it fails. > I don't know whether the same issue exists in the HBase code, but I'll make a > similar change so that both behave similarly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods
[ https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25737: Assignee: Sean Owen (was: Apache Spark) > Remove JavaSparkContextVarargsWorkaround and standardize union() methods > > > Key: SPARK-25737 > URL: https://issues.apache.org/jira/browse/SPARK-25737 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > In ancient times in 2013, JavaSparkContext got a superclass > JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: > [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772] > I believe this was really resolved by the {{@varags}} annotation in Scala > 2.9. > I believe we can now remove this workaround. Along the way, I think we can > also avoid the duplicated definitions of {{union()}}. Where we should be able > to just have one varargs method, we have up to 3 forms: > - {{union(RDD, Seq/List)}} > - {{union(RDD*)}} > - {{union(RDD, RDD*)}} > While this pattern is sometimes used to avoid type collision due to erasure, > I don't think it applies here. > After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods > (for the 3 Java RDD types), not 11 methods. > The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, > rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or > {{sc.union(Seq(rdd1, rdd2): _*)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods
[ https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25737: Assignee: Apache Spark (was: Sean Owen) > Remove JavaSparkContextVarargsWorkaround and standardize union() methods > > > Key: SPARK-25737 > URL: https://issues.apache.org/jira/browse/SPARK-25737 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Minor > > In ancient times in 2013, JavaSparkContext got a superclass > JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: > [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772] > I believe this was really resolved by the {{@varags}} annotation in Scala > 2.9. > I believe we can now remove this workaround. Along the way, I think we can > also avoid the duplicated definitions of {{union()}}. Where we should be able > to just have one varargs method, we have up to 3 forms: > - {{union(RDD, Seq/List)}} > - {{union(RDD*)}} > - {{union(RDD, RDD*)}} > While this pattern is sometimes used to avoid type collision due to erasure, > I don't think it applies here. > After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods > (for the 3 Java RDD types), not 11 methods. > The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, > rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or > {{sc.union(Seq(rdd1, rdd2): _*)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods
[ https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650425#comment-16650425 ] Apache Spark commented on SPARK-25737: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/22729 > Remove JavaSparkContextVarargsWorkaround and standardize union() methods > > > Key: SPARK-25737 > URL: https://issues.apache.org/jira/browse/SPARK-25737 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > In ancient times in 2013, JavaSparkContext got a superclass > JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: > [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772] > I believe this was really resolved by the {{@varags}} annotation in Scala > 2.9. > I believe we can now remove this workaround. Along the way, I think we can > also avoid the duplicated definitions of {{union()}}. Where we should be able > to just have one varargs method, we have up to 3 forms: > - {{union(RDD, Seq/List)}} > - {{union(RDD*)}} > - {{union(RDD, RDD*)}} > While this pattern is sometimes used to avoid type collision due to erasure, > I don't think it applies here. > After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods > (for the 3 Java RDD types), not 11 methods. > The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, > rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or > {{sc.union(Seq(rdd1, rdd2): _*)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
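To make the caller-side impact in the description concrete, a small hedged example; it assumes an existing SparkContext {{sc}} and two RDDs of the same element type.
{code:scala}
val rdd1 = sc.parallelize(Seq(1, 2, 3))
val rdd2 = sc.parallelize(Seq(4, 5, 6))

// Today both forms exist:
sc.union(Seq(rdd1, rdd2))   // union(Seq[RDD[T]])
sc.union(rdd1, rdd2)        // union(first, rest*)

// After the proposed cleanup, per the description above, callers would write either:
sc.union(rdd1, rdd2)
// or expand an existing Seq explicitly:
sc.union(Seq(rdd1, rdd2): _*)
{code}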
[jira] [Updated] (SPARK-16775) Remove deprecated accumulator v1 APIs
[ https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16775: -- Affects Version/s: (was: 2.1.0) 3.0.0 Target Version/s: 3.0.0 Description: Deprecating the old accumulator API added a large number of warnings - many of these could be fixed with a bit of refactoring to offer a non-deprecated internal class while still preserving the external deprecation warnings. ... but we can also just remove the old deprecated API entirely in Spark 3.0.0. was:Deprecating the old accumulator API added a large number of warnings - many of these could be fixed with a bit of refactoring to offer a non-deprecated internal class while still preserving the external deprecation warnings. Summary: Remove deprecated accumulator v1 APIs (was: Reduce internal warnings from deprecated accumulator API) I'm hijacking this older Jira to be about just removing the old deprecated API in Spark 3. At this point that's more likely to happen than cleaning up use of the deprecated API. > Remove deprecated accumulator v1 APIs > - > > Key: SPARK-16775 > URL: https://issues.apache.org/jira/browse/SPARK-16775 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: holdenk >Priority: Major > > Deprecating the old accumulator API added a large number of warnings - many > of these could be fixed with a bit of refactoring to offer a non-deprecated > internal class while still preserving the external deprecation warnings. > ... but we can also just remove the old deprecated API entirely in Spark > 3.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24154) AccumulatorV2 loses type information during serialization
[ https://issues.apache.org/jira/browse/SPARK-24154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24154. --- Resolution: Won't Fix > AccumulatorV2 loses type information during serialization > - > > Key: SPARK-24154 > URL: https://issues.apache.org/jira/browse/SPARK-24154 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.3.1 > Environment: Scala 2.11 > Spark 2.2.0 >Reporter: Sergey Zhemzhitsky >Priority: Major > > AccumulatorV2 loses type information during serialization. > It happens > [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L164] > during *writeReplace* call > {code:scala} > final protected def writeReplace(): Any = { > if (atDriverSide) { > if (!isRegistered) { > throw new UnsupportedOperationException( > "Accumulator must be registered before send to executor") > } > val copyAcc = copyAndReset() > assert(copyAcc.isZero, "copyAndReset must return a zero value copy") > val isInternalAcc = name.isDefined && > name.get.startsWith(InternalAccumulator.METRICS_PREFIX) > if (isInternalAcc) { > // Do not serialize the name of internal accumulator and send it to > executor. > copyAcc.metadata = metadata.copy(name = None) > } else { > // For non-internal accumulators, we still need to send the name > because users may need to > // access the accumulator name at executor side, or they may keep the > accumulators sent from > // executors and access the name when the registered accumulator is > already garbage > // collected(e.g. SQLMetrics). > copyAcc.metadata = metadata > } > copyAcc > } else { > this > } > } > {code} > It means that it is hardly possible to create new accumulators easily by > adding new behaviour to existing ones by means of mix-ins or inheritance > (without overriding *copy*). > For example the following snippet ... > {code:scala} > trait TripleCount { > self: LongAccumulator => > abstract override def add(v: jl.Long): Unit = { > self.add(v * 3) > } > } > val acc = new LongAccumulator with TripleCount > sc.register(acc) > val data = 1 to 10 > val rdd = sc.makeRDD(data, 5) > rdd.foreach(acc.add(_)) > acc.value shouldBe 3 * data.sum > {code} > ... fails with > {code:none} > org.scalatest.exceptions.TestFailedException: 55 was not equal to 165 > at org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at org.scalatest.Matchers$AnyShouldWrapper.shouldBe(Matchers.scala:6864) > {code} > Also such a behaviour seems to be error prone and confusing because an > implementor gets not the same thing as he/she sees in the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
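A hedged sketch of the workaround implied by the parenthetical about overriding *copy*: instead of mixing behaviour into LongAccumulator, define a small AccumulatorV2 subclass whose copy()/copyAndReset() return the subtype, so the writeReplace path above keeps the custom behaviour. The names below are made up for illustration.
{code:scala}
import java.{lang => jl}
import org.apache.spark.util.AccumulatorV2

// Counts each added value three times, like the TripleCount trait above, but survives
// writeReplace because copy()/copyAndReset() return the same concrete type.
class TripleCountAccumulator extends AccumulatorV2[jl.Long, jl.Long] {
  private var sum = 0L
  override def isZero: Boolean = sum == 0L
  override def copy(): TripleCountAccumulator = {
    val acc = new TripleCountAccumulator
    acc.sum = this.sum
    acc
  }
  override def copyAndReset(): TripleCountAccumulator = new TripleCountAccumulator
  override def reset(): Unit = sum = 0L
  override def add(v: jl.Long): Unit = sum += 3 * v.longValue()
  override def merge(other: AccumulatorV2[jl.Long, jl.Long]): Unit = sum += other.value.longValue()
  override def value: jl.Long = jl.Long.valueOf(sum)
}
{code}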
[jira] [Updated] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods
[ https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25737: -- Environment: (was: In ancient times in 2013, JavaSparkContext got a superclass JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772] I believe this was really resolved by the {{@varags}} annotation in Scala 2.9. I believe we can now remove this workaround. Along the way, I think we can also avoid the duplicated definitions of {{union()}}. Where we should be able to just have one varargs method, we have up to 3 forms: - {{union(RDD, Seq/List)}} - {{union(RDD*)}} - {{union(RDD, RDD*)}} While this pattern is sometimes used to avoid type collision due to erasure, I don't think it applies here. After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods (for the 3 Java RDD types), not 11 methods. The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or {{sc.union(Seq(rdd1, rdd2): _*)}}) Description: In ancient times in 2013, JavaSparkContext got a superclass JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772] I believe this was really resolved by the {{@varags}} annotation in Scala 2.9. I believe we can now remove this workaround. Along the way, I think we can also avoid the duplicated definitions of {{union()}}. Where we should be able to just have one varargs method, we have up to 3 forms: - {{union(RDD, Seq/List)}} - {{union(RDD*)}} - {{union(RDD, RDD*)}} While this pattern is sometimes used to avoid type collision due to erasure, I don't think it applies here. After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods (for the 3 Java RDD types), not 11 methods. The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or {{sc.union(Seq(rdd1, rdd2): _*)}} > Remove JavaSparkContextVarargsWorkaround and standardize union() methods > > > Key: SPARK-25737 > URL: https://issues.apache.org/jira/browse/SPARK-25737 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > In ancient times in 2013, JavaSparkContext got a superclass > JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: > [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772] > I believe this was really resolved by the {{@varags}} annotation in Scala > 2.9. > I believe we can now remove this workaround. Along the way, I think we can > also avoid the duplicated definitions of {{union()}}. Where we should be able > to just have one varargs method, we have up to 3 forms: > - {{union(RDD, Seq/List)}} > - {{union(RDD*)}} > - {{union(RDD, RDD*)}} > While this pattern is sometimes used to avoid type collision due to erasure, > I don't think it applies here. > After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods > (for the 3 Java RDD types), not 11 methods. 
> The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, > rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or > {{sc.union(Seq(rdd1, rdd2): _*)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods
[ https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25737: -- Target Version/s: 3.0.0 > Remove JavaSparkContextVarargsWorkaround and standardize union() methods > > > Key: SPARK-25737 > URL: https://issues.apache.org/jira/browse/SPARK-25737 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > In ancient times in 2013, JavaSparkContext got a superclass > JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: > [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772] > I believe this was really resolved by the {{@varags}} annotation in Scala > 2.9. > I believe we can now remove this workaround. Along the way, I think we can > also avoid the duplicated definitions of {{union()}}. Where we should be able > to just have one varargs method, we have up to 3 forms: > - {{union(RDD, Seq/List)}} > - {{union(RDD*)}} > - {{union(RDD, RDD*)}} > While this pattern is sometimes used to avoid type collision due to erasure, > I don't think it applies here. > After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods > (for the 3 Java RDD types), not 11 methods. > The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, > rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or > {{sc.union(Seq(rdd1, rdd2): _*)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods
Sean Owen created SPARK-25737: - Summary: Remove JavaSparkContextVarargsWorkaround and standardize union() methods Key: SPARK-25737 URL: https://issues.apache.org/jira/browse/SPARK-25737 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 3.0.0 Environment: In ancient times in 2013, JavaSparkContext got a superclass JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772] I believe this was really resolved by the {{@varags}} annotation in Scala 2.9. I believe we can now remove this workaround. Along the way, I think we can also avoid the duplicated definitions of {{union()}}. Where we should be able to just have one varargs method, we have up to 3 forms: - {{union(RDD, Seq/List)}} - {{union(RDD*)}} - {{union(RDD, RDD*)}} While this pattern is sometimes used to avoid type collision due to erasure, I don't think it applies here. After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods (for the 3 Java RDD types), not 11 methods. The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or {{sc.union(Seq(rdd1, rdd2): _*)}} Reporter: Sean Owen Assignee: Sean Owen -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25736) add tests to verify the behavior of multi-column count
[ https://issues.apache.org/jira/browse/SPARK-25736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25736: Assignee: Apache Spark (was: Wenchen Fan) > add tests to verify the behavior of multi-column count > -- > > Key: SPARK-25736 > URL: https://issues.apache.org/jira/browse/SPARK-25736 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25736) add tests to verify the behavior of multi-column count
[ https://issues.apache.org/jira/browse/SPARK-25736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650375#comment-16650375 ] Apache Spark commented on SPARK-25736: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22728 > add tests to verify the behavior of multi-column count > -- > > Key: SPARK-25736 > URL: https://issues.apache.org/jira/browse/SPARK-25736 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25736) add tests to verify the behavior of multi-column count
[ https://issues.apache.org/jira/browse/SPARK-25736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25736: Assignee: Wenchen Fan (was: Apache Spark) > add tests to verify the behavior of multi-column count > -- > > Key: SPARK-25736 > URL: https://issues.apache.org/jira/browse/SPARK-25736 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25736) add tests to verify the behavior of multi-column count
Wenchen Fan created SPARK-25736: --- Summary: add tests to verify the behavior of multi-column count Key: SPARK-25736 URL: https://issues.apache.org/jira/browse/SPARK-25736 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25369) Replace Java shim functional interfaces like java.api.Function with Java 8 equivalents
[ https://issues.apache.org/jira/browse/SPARK-25369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650356#comment-16650356 ] Sean Owen commented on SPARK-25369: --- Example: {code:java} @FunctionalInterface public interface Function<T1, R> extends Serializable { R call(T1 v1) throws Exception; }{code} becomes: {code:java} @FunctionalInterface public interface Function<T1, R> extends Serializable, java.util.function.Function<T1, R> { R call(T1 v1) throws Exception; default R apply(T1 v1) { try { return call(v1); } catch (Exception e) { throw new RuntimeException(e); } } }{code} The upside, as far as I can tell, is that you'd be able to reuse a single function implementation in calls to Spark APIs and JDK APIs like in Collections. Then again... you can already write a lambda that calls a function easily. I end up neutral on it. > Replace Java shim functional interfaces like java.api.Function with Java 8 > equivalents > -- > > Key: SPARK-25369 > URL: https://issues.apache.org/jira/browse/SPARK-25369 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Minor > Labels: release-notes > > In Spark 3, we should remove interfaces like > org.apache.spark.api.java.function.Function and replace with > java.util.function equivalents, for better compatibility with Java 8. This > would let callers pass, in more cases, an existing functional object in Java > rather than wrap in a lambda. > It's possible to have the functional interfaces in Spark just extend Java 8 > functional interfaces to interoperate better with existing code, but might be > as well to remove them in Spark 3 to clean up. > A partial list of transitions from Spark to Java interfaces: > * Function<T1, R> -> Function<T1, R> > * Function0<R> -> Supplier<R> > * Function2<T1, T2, R> -> BiFunction<T1, T2, R> > * VoidFunction<T> -> Consumer<T> > * FlatMapFunction<T, R> etc -> extends Function<T, Iterator<R>> etc -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650303#comment-16650303 ] Victor Tso commented on SPARK-20144: I looked at the PR and liked what I saw. I would only suggest the test be duplicated for true and/or the PR comment be added to the test code itself. We have a customer where the records do not have a declared ordering of values but the records' appearance order is significant. Turning on our feature flag to use Spark's parquet reader violates their expectation of ordering so we are unable to shepherd them to this feature flag. Any help along the lines of this Jira and PR would be appreciated. > spark.read.parquet no long maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin >Priority: Major > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was reproduced with. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reordered them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
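As a stopgap for callers in this situation, one commonly used pattern (a hedged sketch, not a fix for this issue) is to persist an explicit ordering column and sort on read; it assumes a SparkSession {{spark}}, and the path is made up.
{code:scala}
import org.apache.spark.sql.functions.monotonically_increasing_id

// Record the original row order before writing...
val df = spark.range(0, 100).toDF("value")
  .withColumn("row_order", monotonically_increasing_id())
df.write.mode("overwrite").parquet("/tmp/ordered_example")

// ...and restore it on read, since partition coalescing/reordering may change the natural order.
val restored = spark.read.parquet("/tmp/ordered_example").orderBy("row_order")
{code}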
[jira] [Commented] (SPARK-13478) Fetching delegation tokens for Hive fails when using proxy users
[ https://issues.apache.org/jira/browse/SPARK-13478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650302#comment-16650302 ] Sunayan Saikia commented on SPARK-13478: [~vanzin] As we know, the 'spark-submit' command by itself doesn't officially allow the '\-\-proxy\-user' and '\-\-principal'/'\-\-keytab' options to be used together. So, how is this patch working? > Fetching delegation tokens for Hive fails when using proxy users > > > Key: SPARK-13478 > URL: https://issues.apache.org/jira/browse/SPARK-13478 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0, 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 1.6.4, 2.0.0 > > > If you use spark-submit's proxy user support, the code that fetches > delegation tokens for the Hive Metastore fails. It seems like the Hive > library tries to connect to the Metastore as the proxy user, and it doesn't > have a Kerberos TGT for that user, so it fails. > I don't know whether the same issue exists in the HBase code, but I'll make a > similar change so that both behave similarly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25735) Improve start-thriftserver.sh: print clean usage and exit with code 1
[ https://issues.apache.org/jira/browse/SPARK-25735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650247#comment-16650247 ] Apache Spark commented on SPARK-25735: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22727 > Improve start-thriftserver.sh: print clean usage and exit with code 1 > - > > Key: SPARK-25735 > URL: https://issues.apache.org/jira/browse/SPARK-25735 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Minor > > Currently if we run > sh start-thriftserver.sh -h > we get > ... > Thrift server options: > 2018-10-15 21:45:39 INFO HiveThriftServer2:54 - Starting SparkContext > 2018-10-15 21:45:40 INFO SparkContext:54 - Running Spark version 2.3.2 > 2018-10-15 21:45:40 WARN NativeCodeLoader:62 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 2018-10-15 21:45:40 ERROR SparkContext:91 - Error initializing SparkContext. > org.apache.spark.SparkException: A master URL must be set in your > configuration > at org.apache.spark.SparkContext.(SparkContext.scala:367) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) > 2018-10-15 21:45:40 ERROR Utils:91 - Uncaught exception in thread main > After fix, the usage output is clean: > Thrift server options: > --hiveconfUse value for given property > Also exit with code 1, to follow other scripts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25735) Improve start-thriftserver.sh: print clean usage and exit with code 1
[ https://issues.apache.org/jira/browse/SPARK-25735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650244#comment-16650244 ] Apache Spark commented on SPARK-25735: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22727 > Improve start-thriftserver.sh: print clean usage and exit with code 1 > - > > Key: SPARK-25735 > URL: https://issues.apache.org/jira/browse/SPARK-25735 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Minor > > Currently if we run > sh start-thriftserver.sh -h > we get > ... > Thrift server options: > 2018-10-15 21:45:39 INFO HiveThriftServer2:54 - Starting SparkContext > 2018-10-15 21:45:40 INFO SparkContext:54 - Running Spark version 2.3.2 > 2018-10-15 21:45:40 WARN NativeCodeLoader:62 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 2018-10-15 21:45:40 ERROR SparkContext:91 - Error initializing SparkContext. > org.apache.spark.SparkException: A master URL must be set in your > configuration > at org.apache.spark.SparkContext.(SparkContext.scala:367) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) > 2018-10-15 21:45:40 ERROR Utils:91 - Uncaught exception in thread main > After fix, the usage output is clean: > Thrift server options: > --hiveconfUse value for given property > Also exit with code 1, to follow other scripts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25735) Improve start-thriftserver.sh: print clean usage and exit with code 1
[ https://issues.apache.org/jira/browse/SPARK-25735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25735: Assignee: (was: Apache Spark) > Improve start-thriftserver.sh: print clean usage and exit with code 1 > - > > Key: SPARK-25735 > URL: https://issues.apache.org/jira/browse/SPARK-25735 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Minor > > Currently if we run > sh start-thriftserver.sh -h > we get > ... > Thrift server options: > 2018-10-15 21:45:39 INFO HiveThriftServer2:54 - Starting SparkContext > 2018-10-15 21:45:40 INFO SparkContext:54 - Running Spark version 2.3.2 > 2018-10-15 21:45:40 WARN NativeCodeLoader:62 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 2018-10-15 21:45:40 ERROR SparkContext:91 - Error initializing SparkContext. > org.apache.spark.SparkException: A master URL must be set in your > configuration > at org.apache.spark.SparkContext.(SparkContext.scala:367) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) > 2018-10-15 21:45:40 ERROR Utils:91 - Uncaught exception in thread main > After fix, the usage output is clean: > Thrift server options: > --hiveconfUse value for given property > Also exit with code 1, to follow other scripts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25735) Improve start-thriftserver.sh: print clean usage and exit with code 1
[ https://issues.apache.org/jira/browse/SPARK-25735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25735: Assignee: Apache Spark > Improve start-thriftserver.sh: print clean usage and exit with code 1 > - > > Key: SPARK-25735 > URL: https://issues.apache.org/jira/browse/SPARK-25735 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > > Currently if we run > sh start-thriftserver.sh -h > we get > ... > Thrift server options: > 2018-10-15 21:45:39 INFO HiveThriftServer2:54 - Starting SparkContext > 2018-10-15 21:45:40 INFO SparkContext:54 - Running Spark version 2.3.2 > 2018-10-15 21:45:40 WARN NativeCodeLoader:62 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 2018-10-15 21:45:40 ERROR SparkContext:91 - Error initializing SparkContext. > org.apache.spark.SparkException: A master URL must be set in your > configuration > at org.apache.spark.SparkContext.(SparkContext.scala:367) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) > 2018-10-15 21:45:40 ERROR Utils:91 - Uncaught exception in thread main > After fix, the usage output is clean: > Thrift server options: > --hiveconfUse value for given property > Also exit with code 1, to follow other scripts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25735) Improve start-thriftserver.sh: print clean usage and exit with code 1
Gengliang Wang created SPARK-25735: -- Summary: Improve start-thriftserver.sh: print clean usage and exit with code 1 Key: SPARK-25735 URL: https://issues.apache.org/jira/browse/SPARK-25735 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Gengliang Wang Currently if we run sh start-thriftserver.sh -h we get ... Thrift server options: 2018-10-15 21:45:39 INFO HiveThriftServer2:54 - Starting SparkContext 2018-10-15 21:45:40 INFO SparkContext:54 - Running Spark version 2.3.2 2018-10-15 21:45:40 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2018-10-15 21:45:40 ERROR SparkContext:91 - Error initializing SparkContext. org.apache.spark.SparkException: A master URL must be set in your configuration at org.apache.spark.SparkContext.(SparkContext.scala:367) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) 2018-10-15 21:45:40 ERROR Utils:91 - Uncaught exception in thread main After fix, the usage output is clean: Thrift server options: --hiveconfUse value for given property Also exit with code 1, to follow other scripts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25723) spark sql External DataSource question
[ https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huanghuai updated SPARK-25723: -- Priority: Minor (was: Major) > spark sql External DataSource question > -- > > Key: SPARK-25723 > URL: https://issues.apache.org/jira/browse/SPARK-25723 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.2 > Environment: local mode >Reporter: huanghuai >Priority: Minor > Attachments: QQ图片20181015182502.jpg > > > trait PrunedFilteredScan { > def buildScan(requiredColumns: Array[String], filters: Array[Filter]): > RDD[Row] > } > > If I implement this trait, I find that the requiredColumns param is different > every time. Why is the order different? > You can use spark.read.jdbc to connect to your local MySQL DB, and debug at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130) > to see this param; > the attachment is my screenshot. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
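A hedged answer sketch: the order of {{requiredColumns}} follows whatever projection the optimized plan asks for, so an implementation should not rely on a stable order and should simply project the columns in the order given. The relation below is a made-up skeleton, not the JDBC source's actual code.
{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}

// Skeleton data source relation: build the scan using the column order Spark requested,
// whatever it happens to be for this particular query.
abstract class ExampleRelation extends BaseRelation with PrunedFilteredScan {
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Fetch rows with columns in exactly the requested order; do not assume schema order.
    fetchRows(requiredColumns.toSeq, filters.toSeq)
  }
  protected def fetchRows(columns: Seq[String], filters: Seq[Filter]): RDD[Row]
}
{code}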
[jira] [Commented] (SPARK-25727) makeCopy failed in InMemoryRelation
[ https://issues.apache.org/jira/browse/SPARK-25727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650153#comment-16650153 ] Apache Spark commented on SPARK-25727: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/22726 > makeCopy failed in InMemoryRelation > --- > > Key: SPARK-25727 > URL: https://issues.apache.org/jira/browse/SPARK-25727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.4.0 > > > {code} > val data = Seq(100).toDF("count").cache() > data.queryExecution.optimizedPlan.toJSON > {code} > The above code can generate the following error: > {code} > assertion failed: InMemoryRelation fields: output, cacheBuilder, > statsOfPlanToCache, outputOrdering, values: List(count#178), > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) Project [value#176 AS count#178] > +- LocalTableScan [value#176] > ,None), Statistics(sizeInBytes=12.0 B, hints=none) > java.lang.AssertionError: assertion failed: InMemoryRelation fields: output, > cacheBuilder, statsOfPlanToCache, outputOrdering, values: List(count#178), > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) Project [value#176 AS count#178] > +- LocalTableScan [value#176] > ,None), Statistics(sizeInBytes=12.0 B, hints=none) > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.trees.TreeNode.jsonFields(TreeNode.scala:611) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$collectJsonValue$1(TreeNode.scala:599) > at > org.apache.spark.sql.catalyst.trees.TreeNode.jsonValue(TreeNode.scala:604) > at > org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:590) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24610) wholeTextFiles broken for small files
[ https://issues.apache.org/jira/browse/SPARK-24610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650109#comment-16650109 ] Apache Spark commented on SPARK-24610: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/22725 > wholeTextFiles broken for small files > - > > Key: SPARK-24610 > URL: https://issues.apache.org/jira/browse/SPARK-24610 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.2.1, 2.3.1 >Reporter: Dhruve Ashar >Assignee: Dhruve Ashar >Priority: Minor > Fix For: 2.4.0 > > > Spark is unable to read small files using the wholeTextFiles method when > split size related configs are specified - either explicitly or if they are > contained in other config files like hive-site.xml. > For small sized files, the computed maxSplitSize by > `WholeTextFileInputFormat` is way smaller than the default or commonly used > split size of 64/128M and spark throws an exception while trying to read > them. > > To reproduce the issue: > {code:java} > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --conf > "spark.hadoop.mapreduce.input.fileinputformat.split.minsize.per.node=123456" > scala> sc.wholeTextFiles("file:///etc/passwd").count > java.io.IOException: Minimum split size pernode 123456 cannot be larger than > maximum split size 9962 > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200) > at > org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096) > at org.apache.spark.rdd.RDD.count(RDD.scala:1158) > ... 48 elided > // For hdfs > sc.wholeTextFiles("smallFile").count > java.io.IOException: Minimum split size pernode 123456 cannot be larger than > maximum split size 15 > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200) > at > org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096) > at org.apache.spark.rdd.RDD.count(RDD.scala:1158) > ... 48 elided{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
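Editor's note: until the fix lands, one possible workaround (a sketch, assuming the offending split-size settings come in via hive-site.xml or an explicit --conf, as in the report) is to clear them on the SparkContext's Hadoop configuration before calling wholeTextFiles:

{code:scala}
// Hypothetical workaround sketch, not from the issue: unset the per-node/per-rack
// minimum split sizes so CombineFileInputFormat's consistency check cannot fail
// for small files. `sc` is an existing SparkContext, e.g. the spark-shell one.
sc.hadoopConfiguration.unset("mapreduce.input.fileinputformat.split.minsize.per.node")
sc.hadoopConfiguration.unset("mapreduce.input.fileinputformat.split.minsize.per.rack")

sc.wholeTextFiles("file:///etc/passwd").count
{code}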
[jira] [Commented] (SPARK-25734) Literal should have a value corresponding to dataType
[ https://issues.apache.org/jira/browse/SPARK-25734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650059#comment-16650059 ] Apache Spark commented on SPARK-25734: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/22724 > Literal should have a value corresponding to dataType > - > > Key: SPARK-25734 > URL: https://issues.apache.org/jira/browse/SPARK-25734 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: Takeshi Yamamuro >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25734) Literal should have a value corresponding to dataType
[ https://issues.apache.org/jira/browse/SPARK-25734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25734: Assignee: Apache Spark > Literal should have a value corresponding to dataType > - > > Key: SPARK-25734 > URL: https://issues.apache.org/jira/browse/SPARK-25734 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25734) Literal should have a value corresponding to dataType
[ https://issues.apache.org/jira/browse/SPARK-25734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25734: Assignee: (was: Apache Spark) > Literal should have a value corresponding to dataType > - > > Key: SPARK-25734 > URL: https://issues.apache.org/jira/browse/SPARK-25734 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: Takeshi Yamamuro >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25734) Literal should have a value corresponding to dataType
Takeshi Yamamuro created SPARK-25734: Summary: Literal should have a value corresponding to dataType Key: SPARK-25734 URL: https://issues.apache.org/jira/browse/SPARK-25734 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.2 Reporter: Takeshi Yamamuro -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-25723) spark sql External DataSource question
[ https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huanghuai reopened SPARK-25723: --- reproduce > spark sql External DataSource question > -- > > Key: SPARK-25723 > URL: https://issues.apache.org/jira/browse/SPARK-25723 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.2 > Environment: local mode >Reporter: huanghuai >Priority: Major > Attachments: QQ图片20181015182502.jpg > > > trait PrunedFilteredScan { > def buildScan(requiredColumns: Array[String], filters: Array[Filter]): > RDD[Row] > } > > If I implement this trait, I find the requiredColumns param is different > every time. Why is the order different? > You can use spark.read.jdbc to connect to your local MySQL DB, and debug at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130) > to see this param; > the attachment is my screenshot -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25723) spark sql External DataSource question
[ https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huanghuai updated SPARK-25723: -- Description: trait PrunedFilteredScan { def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] } If I implement this trait, I find the requiredColumns param is different every time. Why is the order different? You can use spark.read.jdbc to connect to your local MySQL DB, and debug at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130) to see this param; the attachment is my screenshot Summary: spark sql External DataSource question (was: spark sql External DataSource) > spark sql External DataSource question > -- > > Key: SPARK-25723 > URL: https://issues.apache.org/jira/browse/SPARK-25723 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.2 > Environment: local mode >Reporter: huanghuai >Priority: Major > Attachments: QQ图片20181015182502.jpg > > > trait PrunedFilteredScan { > def buildScan(requiredColumns: Array[String], filters: Array[Filter]): > RDD[Row] > } > > If I implement this trait, I find the requiredColumns param is different > every time. Why is the order different? > You can use spark.read.jdbc to connect to your local MySQL DB, and debug at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation#buildScan(scala:130) > to see this param; > the attachment is my screenshot -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25723) spark sql External DataSource
[ https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huanghuai updated SPARK-25723: -- Attachment: QQ图片20181015182502.jpg > spark sql External DataSource > - > > Key: SPARK-25723 > URL: https://issues.apache.org/jira/browse/SPARK-25723 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.2 > Environment: local mode >Reporter: huanghuai >Priority: Major > Attachments: QQ图片20181015182502.jpg > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25723) spark sql External DataSource
[ https://issues.apache.org/jira/browse/SPARK-25723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huanghuai updated SPARK-25723: -- Description: (was: {color:#33}*spark.read()*{color} {color:#33}*.format("com.myself.datasource")*{color} {color:#33}*.option("ur","")*{color} {color:#33}*.load()*{color} {color:#FF}*.show()*{color} {color:#FF}*Driver stacktrace:*{color} {color:#FF} *at*{color} org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599) ~[spark-core_2.11-2.3.0.jar:2.3.0] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587) ~[spark-core_2.11-2.3.0.jar:2.3.0] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586) ~[spark-core_2.11-2.3.0.jar:2.3.0] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) ~[scala-library-2.11.8.jar:?] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) ~[scala-library-2.11.8.jar:?] at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586) ~[spark-core_2.11-2.3.0.jar:2.3.0] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) ~[spark-core_2.11-2.3.0.jar:2.3.0] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) ~[spark-core_2.11-2.3.0.jar:2.3.0] at scala.Option.foreach(Option.scala:257) ~[scala-library-2.11.8.jar:?] at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) ~[spark-core_2.11-2.3.0.jar:2.3.0] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820) ~[spark-core_2.11-2.3.0.jar:2.3.0] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769) ~[spark-core_2.11-2.3.0.jar:2.3.0] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758) ~[spark-core_2.11-2.3.0.jar:2.3.0] at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) ~[spark-core_2.11-2.3.0.jar:2.3.0] at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) ~[spark-core_2.11-2.3.0.jar:2.3.0] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027) ~[spark-core_2.11-2.3.0.jar:2.3.0] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2048) ~[spark-core_2.11-2.3.0.jar:2.3.0] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2067) ~[spark-core_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363) ~[spark-sql_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) ~[spark-sql_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272) ~[spark-sql_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484) ~[spark-sql_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484) ~[spark-sql_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) ~[spark-sql_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) ~[spark-sql_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) ~[spark-sql_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.Dataset.head(Dataset.scala:2484) ~[spark-sql_2.11-2.3.0.jar:2.3.0] at 
org.apache.spark.sql.Dataset.take(Dataset.scala:2698) ~[spark-sql_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.Dataset.showString(Dataset.scala:254) ~[spark-sql_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.Dataset.show(Dataset.scala:723) ~[spark-sql_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.Dataset.show(Dataset.scala:682) ~[spark-sql_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.Dataset.show(Dataset.scala:691) ~[spark-sql_2.11-2.3.0.jar:2.3.0] {color:#FF}*Caused by: scala.MatchError: 23.25 (of class java.lang.Double)*{color} {color:#FF} *at*{color} org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276) ~[spark-catalyst_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275) ~[spark-catalyst_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) ~[spark-catalyst_2.11-2.3.0.jar:2.3.0] at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379) ~[spark-catalyst_2.11-2.3.0.jar:2.3.0] at
[jira] [Commented] (SPARK-25527) Job stuck waiting for last stage to start
[ https://issues.apache.org/jira/browse/SPARK-25527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649990#comment-16649990 ] Ran Haim commented on SPARK-25527: -- Any update? > Job stuck waiting for last stage to start > - > > Key: SPARK-25527 > URL: https://issues.apache.org/jira/browse/SPARK-25527 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Ran Haim >Priority: Major > Attachments: threaddumpjob.txt > > > Sometimes a job somehow gets stuck waiting for the last > stage to start. > There are no tasks waiting for completion, and the job just hangs. > There are available executors for the job to run. > I do not know how to reproduce this; all I know is that it happens randomly > after a couple of days of heavy load. > Another thing that might help is that it seems to happen when some tasks fail > because one or more executors were killed (due to memory issues or something). > Those tasks eventually do get finished by other executors because of retries, > but the next stage hangs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work
[ https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bihui Jin updated SPARK-25733: -- Environment: Spark in standalone mode, and 48 cores are available. spark-defaults.conf as blew: spark.pyshark.python /usr/bin/python3.6 spark.driver.memory 4g spark.executor.memory 8g other configurations are at default. was:Spark in standalone mode, and 48 cores are avaiable. > The method toLocalIterator() with dataframe doesn't work > > > Key: SPARK-25733 > URL: https://issues.apache.org/jira/browse/SPARK-25733 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: Spark in standalone mode, and 48 cores are available. > spark-defaults.conf as blew: > spark.pyshark.python /usr/bin/python3.6 > spark.driver.memory 4g > spark.executor.memory 8g > > other configurations are at default. >Reporter: Bihui Jin >Priority: Major > Attachments: report_dataset.zip.001, report_dataset.zip.002 > > > {color:#FF}The dataset which I used attached.{color} > > First I loaded a dataframe from local disk: > df = spark.read.load('report_dataset') > there are about 200 partitions stored in s3, and the max size of partitions > is 28.37MB. > > after data loaded, I execute "df.take(1)" to test the dataframe, and > expected output printed > "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html', > sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, > 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, > 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, > 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, > 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, > 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, > 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, > 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], > next_word=575, line_num=12)]" > > Then I try to convert dataframe to the local iterator and want to print one > row in dataframe for testing, and blew code is used: > for row in df.toLocalIterator(): > print(row) > break > {color:#ff}*But there is no output printed after that code > executed.*{color} > > Then I execute "df.take(1)" and blew error is reported: > ERROR:root:Exception while sending command. > Traceback (most recent call last): > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > ERROR:root:Exception while sending command. 
> Traceback (most recent call last): > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 985, in send_command > response = connection.send_command(command) > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1164, in send_command > "Error while receiving", e, proto.ERROR_ON_RECEIVE) > py4j.protocol.Py4JNetworkError: Error while receiving > ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java > server (127.0.0.1:37735) > Traceback (most recent call last): > File > "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", > line 2963, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 1, in > df.take(1) > File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line > 504, in take > return self.limit(num).collect() > File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line > 493, in limit > jdf = self._jdf.limit(num) > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, > in deco > return f(*a, **kw) > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in > get_return_value > format(target_id, ".", name)) > py4j.protocol.Py4JError: An error occurred while calling o29.limit > During handling of the above
[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649972#comment-16649972 ] shijinkui commented on SPARK-24630: --- I prefer it without the stream keyword, because in the future the batch and stream APIs must unify. > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL (which is actually based on Calcite), > SQLStream, and StormSQL all provide a streaming SQL interface, with which users > with little knowledge about streaming can easily develop a stream processing model. In Spark, we can also support a SQL API based on > Structured Streaming. > To support SQL Streaming, there are two key points: > 1. Analysis should be able to parse streaming-type SQL. > 2. The Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work
[ https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bihui Jin updated SPARK-25733: -- Attachment: report_dataset.zip.002 > The method toLocalIterator() with dataframe doesn't work > > > Key: SPARK-25733 > URL: https://issues.apache.org/jira/browse/SPARK-25733 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: Spark in standalone mode, and 48 cores are avaiable. >Reporter: Bihui Jin >Priority: Major > Attachments: report_dataset.zip.001, report_dataset.zip.002 > > > {color:#FF}The dataset which I used attached.{color} > > First I loaded a dataframe from local disk: > df = spark.read.load('report_dataset') > there are about 200 partitions stored in s3, and the max size of partitions > is 28.37MB. > > after data loaded, I execute "df.take(1)" to test the dataframe, and > expected output printed > "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html', > sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, > 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, > 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, > 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, > 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, > 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, > 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, > 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], > next_word=575, line_num=12)]" > > Then I try to convert dataframe to the local iterator and want to print one > row in dataframe for testing, and blew code is used: > for row in df.toLocalIterator(): > print(row) > break > {color:#ff}*But there is no output printed after that code > executed.*{color} > > Then I execute "df.take(1)" and blew error is reported: > ERROR:root:Exception while sending command. > Traceback (most recent call last): > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > ERROR:root:Exception while sending command. 
> Traceback (most recent call last): > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 985, in send_command > response = connection.send_command(command) > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1164, in send_command > "Error while receiving", e, proto.ERROR_ON_RECEIVE) > py4j.protocol.Py4JNetworkError: Error while receiving > ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java > server (127.0.0.1:37735) > Traceback (most recent call last): > File > "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", > line 2963, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 1, in > df.take(1) > File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line > 504, in take > return self.limit(num).collect() > File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line > 493, in limit > jdf = self._jdf.limit(num) > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, > in deco > return f(*a, **kw) > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in > get_return_value > format(target_id, ".", name)) > py4j.protocol.Py4JError: An error occurred while calling o29.limit > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", > line 1863, in showtraceback > stb = value._render_traceback_() > AttributeError: 'Py4JError' object has no attribute '_render_traceback_' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File
[jira] [Updated] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work
[ https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bihui Jin updated SPARK-25733: -- Attachment: report_dataset.zip.001 > The method toLocalIterator() with dataframe doesn't work > > > Key: SPARK-25733 > URL: https://issues.apache.org/jira/browse/SPARK-25733 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: Spark in standalone mode, and 48 cores are avaiable. >Reporter: Bihui Jin >Priority: Major > Attachments: report_dataset.zip.001, report_dataset.zip.002 > > > {color:#FF}The dataset which I used attached.{color} > > First I loaded a dataframe from local disk: > df = spark.read.load('report_dataset') > there are about 200 partitions stored in s3, and the max size of partitions > is 28.37MB. > > after data loaded, I execute "df.take(1)" to test the dataframe, and > expected output printed > "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html', > sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, > 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, > 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, > 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, > 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, > 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, > 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, > 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], > next_word=575, line_num=12)]" > > Then I try to convert dataframe to the local iterator and want to print one > row in dataframe for testing, and blew code is used: > for row in df.toLocalIterator(): > print(row) > break > {color:#ff}*But there is no output printed after that code > executed.*{color} > > Then I execute "df.take(1)" and blew error is reported: > ERROR:root:Exception while sending command. > Traceback (most recent call last): > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > ERROR:root:Exception while sending command. 
> Traceback (most recent call last): > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 985, in send_command > response = connection.send_command(command) > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1164, in send_command > "Error while receiving", e, proto.ERROR_ON_RECEIVE) > py4j.protocol.Py4JNetworkError: Error while receiving > ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java > server (127.0.0.1:37735) > Traceback (most recent call last): > File > "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", > line 2963, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 1, in > df.take(1) > File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line > 504, in take > return self.limit(num).collect() > File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line > 493, in limit > jdf = self._jdf.limit(num) > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, > in deco > return f(*a, **kw) > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in > get_return_value > format(target_id, ".", name)) > py4j.protocol.Py4JError: An error occurred while calling o29.limit > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", > line 1863, in showtraceback > stb = value._render_traceback_() > AttributeError: 'Py4JError' object has no attribute '_render_traceback_' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File
[jira] [Updated] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work
[ https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bihui Jin updated SPARK-25733: -- Description: {color:#FF}The dataset which I used attached.{color} First I loaded a dataframe from local disk: df = spark.read.load('report_dataset') there are about 200 partitions stored in s3, and the max size of partitions is 28.37MB. after data loaded, I execute "df.take(1)" to test the dataframe, and expected output printed "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html', sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], next_word=575, line_num=12)]" Then I try to convert dataframe to the local iterator and want to print one row in dataframe for testing, and blew code is used: for row in df.toLocalIterator(): print(row) break {color:#ff}*But there is no output printed after that code executed.*{color} Then I execute "df.take(1)" and blew error is reported: ERROR:root:Exception while sending command. Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1159, in send_command raise Py4JNetworkError("Answer from Java side is empty") py4j.protocol.Py4JNetworkError: Answer from Java side is empty During handling of the above exception, another exception occurred: ERROR:root:Exception while sending command. 
Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1159, in send_command raise Py4JNetworkError("Answer from Java side is empty") py4j.protocol.Py4JNetworkError: Answer from Java side is empty During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 985, in send_command response = connection.send_command(command) File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1164, in send_command "Error while receiving", e, proto.ERROR_ON_RECEIVE) py4j.protocol.Py4JNetworkError: Error while receiving ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37735) Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code exec(code_obj, self.user_global_ns, self.user_ns) File "", line 1, in df.take(1) File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 504, in take return self.limit(num).collect() File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 493, in limit jdf = self._jdf.limit(num) File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in get_return_value format(target_id, ".", name)) py4j.protocol.Py4JError: An error occurred while calling o29.limit During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 1863, in showtraceback stb = value._render_traceback_() AttributeError: 'Py4JError' object has no attribute '_render_traceback_' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 929, in _get_connection connection = self.deque.pop() IndexError: pop from an empty deque During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1067, in start self.socket.connect((self.address, self.port)) ConnectionRefusedError: [Errno 111] Connection refused ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37735) Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code exec(code_obj, self.user_global_ns,
[jira] [Created] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work
Bihui Jin created SPARK-25733: - Summary: The method toLocalIterator() with dataframe doesn't work Key: SPARK-25733 URL: https://issues.apache.org/jira/browse/SPARK-25733 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.1 Environment: Spark in standalone mode, and 48 cores are avaiable. Reporter: Bihui Jin First I loaded a dataframe from local disk: df = spark.read.load('report_dataset') there are about 200 partitions stored in s3, and the max size of partitions is 28.37MB. after data loaded, I execute "df.take(1)" to test the dataframe, and expected output printed "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html', sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], next_word=575, line_num=12)]" Then I try to convert dataframe to the local iterator and want to print one row in dataframe for testing, and blew code is used: for row in df.toLocalIterator(): print(row) break {color:#FF}*But there is no output printed after that code executed.*{color} Then I execute "df.take(1)" and blew error is reported: ERROR:root:Exception while sending command. Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1159, in send_command raise Py4JNetworkError("Answer from Java side is empty") py4j.protocol.Py4JNetworkError: Answer from Java side is empty During handling of the above exception, another exception occurred: ERROR:root:Exception while sending command. 
Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1159, in send_command raise Py4JNetworkError("Answer from Java side is empty") py4j.protocol.Py4JNetworkError: Answer from Java side is empty During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 985, in send_command response = connection.send_command(command) File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1164, in send_command "Error while receiving", e, proto.ERROR_ON_RECEIVE) py4j.protocol.Py4JNetworkError: Error while receiving ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37735) Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code exec(code_obj, self.user_global_ns, self.user_ns) File "", line 1, in df.take(1) File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 504, in take return self.limit(num).collect() File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 493, in limit jdf = self._jdf.limit(num) File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in get_return_value format(target_id, ".", name)) py4j.protocol.Py4JError: An error occurred while calling o29.limit During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 1863, in showtraceback stb = value._render_traceback_() AttributeError: 'Py4JError' object has no attribute '_render_traceback_' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 929, in _get_connection connection = self.deque.pop() IndexError: pop from an empty deque During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1067, in start self.socket.connect((self.address, self.port)) ConnectionRefusedError: [Errno 111] Connection refused
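Editor's note on the report above: toLocalIterator is meant to pull partitions to the driver one at a time, which is why it behaves differently from take(). For reference, a minimal sketch of the equivalent call through the Scala API follows (the path is the one from the report; whether the hang reproduces there is not established by this issue):

{code:scala}
// Editor's sketch, not from the issue: the Scala/Java API exposes
// Dataset.toLocalIterator(), which returns a java.util.Iterator[Row] and pulls
// partitions to the driver lazily. `spark` is the active SparkSession,
// e.g. the one provided by spark-shell.
import scala.collection.JavaConverters._

val df = spark.read.load("report_dataset")   // path taken from the report
df.toLocalIterator().asScala.take(1).foreach(println)
{code}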
[jira] [Updated] (SPARK-25731) Spark Structured Streaming Support for Kafka 2.0
[ https://issues.apache.org/jira/browse/SPARK-25731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-25731: Labels: beginner features (was: ) > Spark Structured Streaming Support for Kafka 2.0 > > > Key: SPARK-25731 > URL: https://issues.apache.org/jira/browse/SPARK-25731 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Chandan >Priority: Major > Labels: beginner, features > > [https://github.com/apache/spark/tree/master/external] > As far as I can see, > This doesn't have support for newly release *kafka2.0,* > support is available only till *kafka-0-10.* > If we use the > "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0" > for kafka2.0, below is the error I get > 11:46:18.061 [stream execution thread for [id = > e393ea37-8009-4ce0-b996-94f767994fb8, runId = > bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG > org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, > groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] > *Completed connection to node -1. Fetching API versions.* > 11:46:18.061 [stream execution thread for [id = > e393ea37-8009-4ce0-b996-94f767994fb8, runId = > bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG > org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, > groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] > *Initiating API versions fetch from node -1.* > 11:46:18.452 [stream execution thread for [id = > e393ea37-8009-4ce0-b996-94f767994fb8, runId = > bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG > org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-2, > groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] > Connection with *kafka-muhammad-45e0.aivencloud.com/18.203.67.147 > disconnected* > *java.io.EOFException: null* > at > org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119) > at > org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335) > at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296) > > > I might be wrong, but this is the best option I thought to open an issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
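Editor's note on the report above: for Structured Streaming the Kafka connector is the spark-sql-kafka-0-10 artifact rather than spark-streaming-kafka-0-10 (which is the DStream API), and the "0-10" suffix denotes the minimum supported broker version (0.10.0 or higher), not an upper bound, so it is expected to work against newer brokers as well. A minimal sketch of the Structured Streaming usage is below; the broker host, topic, and the 2.3.2 version are placeholders.

{code:scala}
// Editor's sketch, not from the issue. sbt dependency (version is an example):
//   libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.2"
// `spark` is the active SparkSession.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-host:9092")   // placeholder host
  .option("subscribe", "my-topic")                          // placeholder topic
  .load()

// Kafka rows expose key/value as binary; cast them for a quick console check.
val query = stream
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()
{code}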
[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649917#comment-16649917 ] Marco Gaido commented on SPARK-25732: - cc [~vanzin] [~tgraves] [~jerryshao] [~mridul]. Sorry for pinging you, could you please provide your view on this? Do you have any concerns/comments? Thanks. > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > As of now, applications submitted with proxy-user fail after 2 weeks due to the > lack of token renewal. In order to enable it, we need the > keytab/principal of the impersonated user to be specified, in order to have > them available for the token renewal. > This JIRA proposes to add two parameters, {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, with the latter allowing a keytab to be specified also > in a distributed FS, so that applications can be submitted by servers (e.g. > Livy, Zeppelin) without needing all users' principals to be on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25731) Spark Structured Streaming Support for Kafka 2.0
[ https://issues.apache.org/jira/browse/SPARK-25731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-25731: Description: [https://github.com/apache/spark/tree/master/external] As far as I can see, This doesn't have support for newly release *kafka2.0,* support is available only till *kafka-0-10.* If we use the "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0" for kafka2.0, below is the error I get 11:46:18.061 [stream execution thread for [id = e393ea37-8009-4ce0-b996-94f767994fb8, runId = bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] *Completed connection to node -1. Fetching API versions.* 11:46:18.061 [stream execution thread for [id = e393ea37-8009-4ce0-b996-94f767994fb8, runId = bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] *Initiating API versions fetch from node -1.* 11:46:18.452 [stream execution thread for [id = e393ea37-8009-4ce0-b996-94f767994fb8, runId = bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-2, groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] Connection with *kafka-muhammad-45e0.aivencloud.com/18.203.67.147 disconnected* *java.io.EOFException: null* at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119) at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335) at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296) I might be wrong, but this is the best option I thought to open an issue. was: [https://github.com/apache/spark/tree/master/external] As far as I can see, This doesn't have support for newly release *kafka2.0,* support is available only till *kafka-0-10.* If we use the "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0" for kafka2.0, below is the error I get 11:46:18.061 [stream execution thread for [id = e393ea37-8009-4ce0-b996-94f767994fb8, runId = bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] *Completed connection to node -1. 
Fetching API versions.* 11:46:18.061 [stream execution thread for [id = e393ea37-8009-4ce0-b996-94f767994fb8, runId = bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] *Initiating API versions fetch from node -1.* 11:46:18.452 [stream execution thread for [id = e393ea37-8009-4ce0-b996-94f767994fb8, runId = bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-2, groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] Connection with *kafka-muhammad-45e0.aivencloud.com/18.203.67.147 disconnected* *java.io.EOFException: null* at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119) at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335) at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296) > Spark Structured Streaming Support for Kafka 2.0 > > > Key: SPARK-25731 > URL: https://issues.apache.org/jira/browse/SPARK-25731 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Chandan >Priority: Major > > [https://github.com/apache/spark/tree/master/external] > As far as I can see, > This doesn't have support for newly release *kafka2.0,* > support is available only till *kafka-0-10.* > If we use the > "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0" > for kafka2.0, below is the error I get > 11:46:18.061 [stream execution thread for [id = > e393ea37-8009-4ce0-b996-94f767994fb8, runId = > bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG > org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, > groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] > *Completed connection to node -1. Fetching API versions.* > 11:46:18.061 [stream execution thread for [id = > e393ea37-8009-4ce0-b996-94f767994fb8, runId = > bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG > org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, >
[jira] [Created] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
Marco Gaido created SPARK-25732: --- Summary: Allow specifying a keytab/principal for proxy user for token renewal Key: SPARK-25732 URL: https://issues.apache.org/jira/browse/SPARK-25732 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 2.4.0 Reporter: Marco Gaido As of now, applications submitted with proxy-user fail after 2 weeks due to the lack of token renewal. In order to enable it, we need the keytab/principal of the impersonated user to be specified, in order to have them available for the token renewal. This JIRA proposes to add two parameters, {{--proxy-user-principal}} and {{--proxy-user-keytab}}, with the latter allowing a keytab to be specified also in a distributed FS, so that applications can be submitted by servers (e.g. Livy, Zeppelin) without needing all users' principals to be on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25731) Spark Structured Streaming Support for Kafka 2.0
[ https://issues.apache.org/jira/browse/SPARK-25731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-25731: Description: [https://github.com/apache/spark/tree/master/external] As far as I can see, This doesn't have support for newly release *kafka2.0,* support is available only till *kafka-0-10.* If we use the "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0" for kafka2.0, below is the error I get 11:46:18.061 [stream execution thread for [id = e393ea37-8009-4ce0-b996-94f767994fb8, runId = bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] *Completed connection to node -1. Fetching API versions.* 11:46:18.061 [stream execution thread for [id = e393ea37-8009-4ce0-b996-94f767994fb8, runId = bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] *Initiating API versions fetch from node -1.* 11:46:18.452 [stream execution thread for [id = e393ea37-8009-4ce0-b996-94f767994fb8, runId = bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-2, groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] Connection with *kafka-muhammad-45e0.aivencloud.com/18.203.67.147 disconnected* *java.io.EOFException: null* at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119) at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335) at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296) was: [https://github.com/apache/spark/tree/master/external] As far as I can see, This doesn't have support for newly release kafka2.0, support is available only till kafka-0-10. If we use the "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0" for kafka2.0, below is the error I get 11:46:18.061 [stream execution thread for [id = e393ea37-8009-4ce0-b996-94f767994fb8, runId = bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] *Completed connection to node -1. 
Fetching API versions.* 11:46:18.061 [stream execution thread for [id = e393ea37-8009-4ce0-b996-94f767994fb8, runId = bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] *Initiating API versions fetch from node -1.* 11:46:18.452 [stream execution thread for [id = e393ea37-8009-4ce0-b996-94f767994fb8, runId = bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-2, groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] Connection with *kafka-muhammad-45e0.aivencloud.com/18.203.67.147 disconnected* *java.io.EOFException: null* at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119) at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335) at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296) > Spark Structured Streaming Support for Kafka 2.0 > > > Key: SPARK-25731 > URL: https://issues.apache.org/jira/browse/SPARK-25731 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Chandan >Priority: Major > > [https://github.com/apache/spark/tree/master/external] > As far as I can see, > This doesn't have support for newly release *kafka2.0,* > support is available only till *kafka-0-10.* > If we use the > "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0" > for kafka2.0, below is the error I get > 11:46:18.061 [stream execution thread for [id = > e393ea37-8009-4ce0-b996-94f767994fb8, runId = > bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG > org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, > groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] > *Completed connection to node -1. Fetching API versions.* > 11:46:18.061 [stream execution thread for [id = > e393ea37-8009-4ce0-b996-94f767994fb8, runId = > bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG > org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, > groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0] > *Initiating API versions fetch from node -1.* >