[jira] [Created] (SPARK-24269) Infer nullability rather than declaring all columns as nullable
Maxim Gekk created SPARK-24269:
--
Summary: Infer nullability rather than declaring all columns as nullable
Key: SPARK-24269
URL: https://issues.apache.org/jira/browse/SPARK-24269
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk

Currently, the CSV and JSON datasources set the *nullable* flag to true regardless of the data itself during schema inference.

JSON: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
CSV: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51

For example, the source dataset has the schema:
{code}
root
 |-- item_id: integer (nullable = false)
 |-- country: string (nullable = false)
 |-- state: string (nullable = false)
{code}
If we save it and read it back, the inferred schema is:
{code}
root
 |-- item_id: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- state: string (nullable = true)
{code}
This ticket aims to set the nullable flag more precisely during schema inference, based on the data that is actually read.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
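The round trip above can be reproduced with a minimal sketch (assumes a running SparkSession `spark` with its implicits imported; `path` is a placeholder location; only `item_id` starts out non-nullable because a Scala `Int` maps to a non-nullable column):

{code:scala}
// Sketch of the round trip described above; `path` is a placeholder.
import spark.implicits._

val src = Seq((1, "NL", "NH"), (2, "US", "CA")).toDF("item_id", "country", "state")
src.printSchema()  // item_id: nullable = false (primitive Int)

src.write.mode("overwrite").option("header", "true").csv(path)

// After reading back with schema inference, every column is declared
// nullable = true, although the written data contains no nulls.
val inferred = spark.read.option("header", "true").option("inferSchema", "true").csv(path)
inferred.printSchema()
{code}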
[jira] [Created] (SPARK-24276) semanticHash() returns different values for semantically the same IS IN
Maxim Gekk created SPARK-24276:
--
Summary: semanticHash() returns different values for semantically the same IS IN
Key: SPARK-24276
URL: https://issues.apache.org/jira/browse/SPARK-24276
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk

When a plan is canonicalized, any set-based operation, such as IS IN, should have its expressions ordered, since the order of expressions does not matter in the evaluation of the operator. For instance:
{code:scala}
val df = spark.createDataFrame(Seq((1, 2)))
val p1 = df.where('_1.isin(1, 2)).queryExecution.logical.canonicalized
val p2 = df.where('_1.isin(2, 1)).queryExecution.logical.canonicalized
val h1 = p1.semanticHash
val h2 = p2.semanticHash
{code}
{code}
df: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
p1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Filter '_1 IN (1,2)
+- LocalRelation [_1#0, _2#1]
p2: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Filter '_1 IN (2,1)
+- LocalRelation [_1#0, _2#1]
h1: Int = -1384236508
h2: Int = 939549189
{code}
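The underlying idea of a fix, putting set-like children into a canonical order before hashing so the hash becomes order-insensitive, can be illustrated in plain Scala (not actual Catalyst code; `canonicalHash` is a made-up name):

{code:scala}
// Plain-Scala illustration: sort the IN-list children before hashing so
// that semantically equal sets produce equal hashes.
def canonicalHash(inList: Seq[Int]): Int = inList.sorted.hashCode

// Seq(1, 2).hashCode and Seq(2, 1).hashCode generally differ (order-sensitive),
// while the canonical form does not:
assert(canonicalHash(Seq(1, 2)) == canonicalHash(Seq(2, 1)))
{code}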
[jira] [Created] (SPARK-24244) Parse only required columns of CSV file
Maxim Gekk created SPARK-24244:
--
Summary: Parse only required columns of CSV file
Key: SPARK-24244
URL: https://issues.apache.org/jira/browse/SPARK-24244
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk

The uniVocity parser allows specifying only the required column names or indexes for parsing, for example:
{code}
// Here we select only the columns by their indexes.
// The parser just skips the values in other columns
parserSettings.selectIndexes(4, 0, 1);
CsvParser parser = new CsvParser(parserSettings);
{code}
*UnivocityParser* needs to be modified to extract only the columns needed by requiredSchema.
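From Scala, the same uniVocity feature could be exercised roughly like this (a sketch assuming the uniVocity dependency is on the classpath; column names here are illustrative):

{code:scala}
// Sketch of uniVocity column selection from Scala; the parser itself
// skips the fields that were not selected.
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
settings.setHeaderExtractionEnabled(true)
settings.selectFields("item_id", "country")  // select by header name
val parser = new CsvParser(settings)
{code}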
[jira] [Created] (SPARK-24190) lineSep shouldn't be required in JSON write
Maxim Gekk created SPARK-24190:
--
Summary: lineSep shouldn't be required in JSON write
Key: SPARK-24190
URL: https://issues.apache.org/jira/browse/SPARK-24190
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk

Currently, the lineSep option is required by the JSON datasource on write if the encoding is different from UTF-8. For example, the code:
{code:scala}
df.write.option("encoding", "UTF-32BE").json(file)
{code}
throws the exception:
{code}
requirement failed: The lineSep option must be specified for the UTF-32BE encoding
java.lang.IllegalArgumentException: requirement failed: The lineSep option must be specified for the UTF-32BE encoding
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.catalyst.json.JSONOptions$$anonfun$32.apply(JSONOptions.scala:118)
at org.apache.spark.sql.catalyst.json.JSONOptions$$anonfun$32.apply(JSONOptions.scala:103)
at scala.Option.map(Option.scala:146)
{code}
This restriction should NOT be applied to writing.
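Until the check is relaxed, the write succeeds when lineSep is passed explicitly; a sketch (assuming a DataFrame `df` and a placeholder output path `file`):

{code:scala}
// Workaround sketch: supplying lineSep explicitly satisfies the current
// requirement on the write path. `df` and `file` are placeholders.
df.write
  .option("encoding", "UTF-32BE")
  .option("lineSep", "\n")  // should not be mandatory for writing
  .json(file)
{code}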
[jira] [Created] (SPARK-24325) Tests for Hadoop's LinesReader
Maxim Gekk created SPARK-24325:
--
Summary: Tests for Hadoop's LinesReader
Key: SPARK-24325
URL: https://issues.apache.org/jira/browse/SPARK-24325
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk

Currently, there are no tests for [Hadoop LinesReader|https://github.com/apache/spark/blob/8d79113b812a91073d2c24a3a9ad94cc3b90b24a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala#L42]. To enable refactoring or rewriting of the class, tests need to be added that cover its basic functionality, such as:
* A split's boundaries slice lines
* A split slices delimiters - user-specified or default
* No duplicates appear if splits slice delimiters or lines
* Constant limits, such as the maximum line length, are checked
* The case when the internal buffer size is less than the line size is handled
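One of the cases above might be shaped like the following sketch (helper names such as `withSQLConf`/`withTempPath` follow Spark's SQL test-suite conventions; the config value is illustrative only):

{code:scala}
// Sketch of the "no duplicates when a split slices a line" case.
test("no duplicate lines when a split boundary falls inside a line") {
  withSQLConf("spark.sql.files.maxPartitionBytes" -> "64") {
    withTempPath { path =>
      val line = "a" * 100  // longer than one split
      java.nio.file.Files.write(path.toPath, s"$line\n$line\n".getBytes)
      val count = spark.read.text(path.getAbsolutePath).count()
      assert(count == 2)  // each line must be read exactly once
    }
  }
}
{code}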
[jira] [Updated] (SPARK-24325) Tests for Hadoop's LinesReader
[ https://issues.apache.org/jira/browse/SPARK-24325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-24325: --- Description: Currently, there are no tests for [Hadoop LinesReader|https://github.com/apache/spark/blob/8d79113b812a91073d2c24a3a9ad94cc3b90b24a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala#L42]. For refactoring or rewriting of the class, need to add tests that cover basic functionality of the class like: * Split's boundaries slice lines * A split slices delimiters - user's specified or defaults * No duplicates if splits slice delimiters or lines * Checking constant limits like maximum line length * Handling a case when internal buffers size is less than line size was: Currently, there are no tests for [Hadoop LineReader|https://github.com/apache/spark/blob/8d79113b812a91073d2c24a3a9ad94cc3b90b24a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala#L42]. For refactoring or rewriting of the class, need to add tests that cover basic functionality of the class like: * Split's boundaries slice lines * A split slices delimiters - user's specified or defaults * No duplicates if splits slice delimiters or lines * Checking constant limits like maximum line length * Handling a case when internal buffers size is less than line size > Tests for Hadoop's LinesReader > -- > > Key: SPARK-24325 > URL: https://issues.apache.org/jira/browse/SPARK-24325 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, there are no tests for [Hadoop > LinesReader|https://github.com/apache/spark/blob/8d79113b812a91073d2c24a3a9ad94cc3b90b24a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala#L42]. 
> For refactoring or rewriting of the class, need to add tests that cover
> basic functionality of the class like:
> * Split's boundaries slice lines
> * A split slices delimiters - user's specified or defaults
> * No duplicates if splits slice delimiters or lines
> * Checking constant limits like maximum line length
> * Handling a case when internal buffers size is less than line size
[jira] [Reopened] (SPARK-24244) Parse only required columns of CSV file
[ https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Gekk reopened SPARK-24244:

The previous PR was reverted due to the flaky UnivocityParserSuite.

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Minor
> Fix For: 2.4.0
>
> uniVocity parser allows to specify only required column names or indexes for
> parsing like:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in other columns
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> Need to modify *UnivocityParser* to extract only needed columns from
> requiredSchema
[jira] [Updated] (SPARK-24366) Improve error message for Catalyst type converters
[ https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-24366: --- Summary: Improve error message for Catalyst type converters (was: Improve error message for type converting) > Improve error message for Catalyst type converters > -- > > Key: SPARK-24366 > URL: https://issues.apache.org/jira/browse/SPARK-24366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.3.0 >Reporter: Maxim Gekk >Priority: Minor > > User have no way to drill down to understand which of the hundreds of fields > in millions records feeding into the job are causing the problem. We should > to show where in the schema the error is happening. > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: > start (of class java.lang.String) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at >
[jira] [Updated] (SPARK-24366) Improve error message for type converting
[ https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-24366: --- Summary: Improve error message for type converting (was: Improve error message for type conversions) > Improve error message for type converting > - > > Key: SPARK-24366 > URL: https://issues.apache.org/jira/browse/SPARK-24366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.3.0 >Reporter: Maxim Gekk >Priority: Minor > > User have no way to drill down to understand which of the hundreds of fields > in millions records feeding into the job are causing the problem. We should > to show where in the schema the error is happening. > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: > start (of class java.lang.String) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at >
[jira] [Created] (SPARK-24366) Improve error message for type conversions
Maxim Gekk created SPARK-24366:
--
Summary: Improve error message for type conversions
Key: SPARK-24366
URL: https://issues.apache.org/jira/browse/SPARK-24366
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.0, 1.6.3
Reporter: Maxim Gekk

Users have no way to drill down to understand which of the hundreds of fields in millions of records feeding into the job is causing the problem. We should show where in the schema the error is happening.
{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: start (of class java.lang.String)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at
[jira] [Resolved] (SPARK-15125) CSV data source recognizes empty quoted strings in the input as null.
[ https://issues.apache.org/jira/browse/SPARK-15125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Gekk resolved SPARK-15125.
Resolution: Fixed
Fix Version/s: 2.4.0

The issue has been fixed by https://github.com/apache/spark/commit/7a2d4895c75d4c232c377876b61c05a083eab3c8

> CSV data source recognizes empty quoted strings in the input as null.
> --
>
> Key: SPARK-15125
> URL: https://issues.apache.org/jira/browse/SPARK-15125
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Suresh Thalamati
> Priority: Major
> Fix For: 2.4.0
>
> The CSV data source does not differentiate between empty quoted strings and empty fields; both are read as null. In some scenarios users would want to differentiate between these values, especially in the context of SQL, where NULL and the empty string have different meanings. If the input data happens to be a dump from a traditional relational data source, users will see different results for the SQL queries.
> {code}
> Repro:
> Test Data: (test.csv)
> year,make,model,comment,price
> 2017,Tesla,Mode 3,looks nice.,35000.99
> 2016,Chevy,Bolt,"",29000.00
> 2015,Porsche,"",,
>
> scala> val df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("nullValue", null).load("/tmp/test.csv")
> df: org.apache.spark.sql.DataFrame = [year: int, make: string ... 3 more fields]
>
> scala> df.show
> +----+-------+------+-----------+--------+
> |year|   make| model|    comment|   price|
> +----+-------+------+-----------+--------+
> |2017|  Tesla|Mode 3|looks nice.|35000.99|
> |2016|  Chevy|  Bolt|       null| 29000.0|
> |2015|Porsche|  null|       null|    null|
> +----+-------+------+-----------+--------+
>
> Expected:
> +----+-------+------+-----------+--------+
> |year|   make| model|    comment|   price|
> +----+-------+------+-----------+--------+
> |2017|  Tesla|Mode 3|looks nice.|35000.99|
> |2016|  Chevy|  Bolt|           | 29000.0|
> |2015|Porsche|      |       null|    null|
> +----+-------+------+-----------+--------+
> {code}
> I am testing a fix for this issue and will give a shot at submitting a PR for it soon.
[jira] [Resolved] (SPARK-24004) Tests of from_json for MapType
[ https://issues.apache.org/jira/browse/SPARK-24004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Gekk resolved SPARK-24004.
Resolution: Won't Fix

> Tests of from_json for MapType
> --
>
> Key: SPARK-24004
> URL: https://issues.apache.org/jira/browse/SPARK-24004
> Project: Spark
> Issue Type: Test
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Maxim Gekk
> Priority: Trivial
>
> There are no tests for *from_json* that check *MapType* as a value type of
> struct fields. MapType should be supported as a non-root type according to the
> current implementation of JacksonParser, but the functionality is not checked.
[jira] [Created] (SPARK-24329) Remove comments filtering before parsing of CSV files
Maxim Gekk created SPARK-24329:
--
Summary: Remove comments filtering before parsing of CSV files
Key: SPARK-24329
URL: https://issues.apache.org/jira/browse/SPARK-24329
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk

Comment and whitespace filtering is already performed by the uniVocity parser according to the parser settings: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180

It is not necessary to do the same before parsing. All places where the filterCommentAndEmpty method is called need to be inspected, and the filtering removed where it duplicates what the uniVocity parser already does.
[jira] [Created] (SPARK-24571) Support literals with values of the Char type
Maxim Gekk created SPARK-24571:
--
Summary: Support literals with values of the Char type
Key: SPARK-24571
URL: https://issues.apache.org/jira/browse/SPARK-24571
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk

Currently, Spark doesn't support literals with the Char (java.lang.Character) type. For example, the following code throws an exception:
{code:scala}
val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
df.where($"city".contains('o')).show(false)
{code}
It fails with the exception:
{code}
Unsupported literal type class java.lang.Character o
java.lang.RuntimeException: Unsupported literal type class java.lang.Character o
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
{code}
One of the possible solutions could be automatic conversion of a Char literal to a String literal of length 1.
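A current workaround is to turn the Char into a one-character String before it reaches the literal (a sketch, assuming `spark.implicits._` is in scope):

{code:scala}
// Workaround sketch: convert the Char to a String of length 1 manually.
import spark.implicits._

val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
df.where($"city".contains('o'.toString)).show(false)
// With the proposed improvement, contains('o') would behave the same.
{code}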
[jira] [Commented] (SPARK-24571) Support literals with values of the Char type
[ https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514665#comment-16514665 ]

Maxim Gekk commented on SPARK-24571:

I am working on the improvement.

> Support literals with values of the Char type
> -
>
> Key: SPARK-24571
> URL: https://issues.apache.org/jira/browse/SPARK-24571
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Maxim Gekk
> Priority: Minor
>
> Currently, Spark doesn't support literals with the Char (java.lang.Character)
> type. For example, the following code throws an exception:
> {code:scala}
> val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
> df.where($"city".contains('o')).show(false)
> {code}
> It fails with the exception:
> {code}
> Unsupported literal type class java.lang.Character o
> java.lang.RuntimeException: Unsupported literal type class
> java.lang.Character o
> at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
> {code}
> One of the possible solutions can be automatic conversion of Char literal to
> String literal of length 1.
[jira] [Updated] (SPARK-24571) Support literals with values of the Char type
[ https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-24571: --- Description: Currently, Spark doesn't support literals with the Char (java.lang.Character) type. For example, the following code throws an exception: {code} val df = Seq("Amsterdam", "San Francisco", "London").toDF("city") df.where($"city".contains('o')).show(false) {code} It fails with the exception: {code:java} Unsupported literal type class java.lang.Character o java.lang.RuntimeException: Unsupported literal type class java.lang.Character o at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) {code} One of the possible solutions can be automatic conversion of Char literal to String literal of length 1. was: Currently, Spark doesn't support literals with the Char (java.lang.Character) type. For example, the following code throws an exception: {code:scala} val df = Seq("Amsterdam", "San Francisco", "London").toDF("city") df.where($"city".contains('o')).show(false) {code} It fails with the exception: {code} Unsupported literal type class java.lang.Character o java.lang.RuntimeException: Unsupported literal type class java.lang.Character p at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) {code} One of the possible solutions can be automatic conversion of Char literal to String literal of length 1. > Support literals with values of the Char type > - > > Key: SPARK-24571 > URL: https://issues.apache.org/jira/browse/SPARK-24571 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Currently, Spark doesn't support literals with the Char (java.lang.Character) > type. 
For example, the following code throws an exception: > {code} > val df = Seq("Amsterdam", "San Francisco", "London").toDF("city") > df.where($"city".contains('o')).show(false) > {code} > It fails with the exception: > {code:java} > Unsupported literal type class java.lang.Character o > java.lang.RuntimeException: Unsupported literal type class > java.lang.Character o > at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) > {code} > One of the possible solutions can be automatic conversion of Char literal to > String literal of length 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24591) Number of cores and executors in the cluster
Maxim Gekk created SPARK-24591: -- Summary: Number of cores and executors in the cluster Key: SPARK-24591 URL: https://issues.apache.org/jira/browse/SPARK-24591 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.1 Reporter: Maxim Gekk We need to add two new methods. The first one should return the total number of CPU cores of all executors in the cluster. The second one should return the current number of executors registered in the cluster. The main motivations for adding these methods: 1. It is best practice to manage job parallelism relative to available cores, e.g., df.repartition(5 * sc.coreCount). In particular, it is an anti-pattern to leave a bunch of cores on large clusters twiddling their thumbs doing nothing. Usually users pass predefined constants to _repartition()_ and _coalesce()_. The constant is chosen based on the current cluster size. If the code runs on another cluster and/or on a resized cluster, they need to modify the constant each time. This happens frequently when a job that normally runs on, say, an hour of data on a small cluster needs to run on a week of data on a much larger cluster. 2. *spark.default.parallelism* can be used to get the total number of cores in the cluster, but it can be redefined by the user. The info can also be obtained by registering a listener, but repeating the same boilerplate everywhere looks ugly. We should follow the DRY principle. 3. Regarding executorsCount(): some jobs, e.g., local node ML training, use a lot of parallelism. It's common practice to distribute such jobs so that there is one partition per executor. 4. In some places users collect this info, as well as other settings, together with job timings (at the app level) for analysis. E.g., you can use ML to determine the optimal cluster size given different objectives, e.g., fastest throughput vs. lowest cost per unit of processing. 5. The simpler argument is that basic cluster properties should be easily discoverable via APIs. 
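To make the proposal concrete, here is a hedged sketch of the intended usage. The method names coreCount and executorCount are the proposed (not yet existing) API, and the standalone helper below only illustrates the arithmetic of scaling parallelism with cluster size:

```scala
// Hypothetical usage of the proposed methods (sc.coreCount / sc.executorCount
// do not exist yet). Instead of a constant tuned for one cluster size:
//   df.repartition(640)
// the job would scale with the cluster it actually runs on:
//   df.repartition(5 * sc.coreCount)
// and executor-aligned jobs (e.g. local-node ML training) could use:
//   df.repartition(sc.executorCount)

// Standalone stand-in showing just the arithmetic:
def repartitionFactor(coreCount: Int, perCoreFactor: Int = 5): Int =
  coreCount * perCoreFactor

println(repartitionFactor(128))  // 640
```

The point of the helper is only that the partition count becomes a function of discovered cluster properties rather than a hard-coded literal.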
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24543) Support any DataType as DDL string for from_json's schema
Maxim Gekk created SPARK-24543: -- Summary: Support any DataType as DDL string for from_json's schema Key: SPARK-24543 URL: https://issues.apache.org/jira/browse/SPARK-24543 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Currently, the schema for from_json can be specified as a DataType or as a string in the following formats: * in SQL, as a sequence of fields like _INT a, STRING b_ * in Scala, Python, etc., in JSON format or as in SQL The ticket aims to support an arbitrary DataType as a DDL string for from_json. For example: {code:sql} select from_json('{"a":1, "b":2}', 'map') {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24543) Support any DataType as DDL string for from_json's schema
[ https://issues.apache.org/jira/browse/SPARK-24543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510753#comment-16510753 ] Maxim Gekk commented on SPARK-24543: I am working on the feature at the moment. > Support any DataType as DDL string for from_json's schema > - > > Key: SPARK-24543 > URL: https://issues.apache.org/jira/browse/SPARK-24543 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Currently, schema for from_json can be specified as DataType or a string in > the following formats: > * in SQL, as sequence of fields like _INT a, STRING b > * in Scala, Python and etc, in JSON format or as in SQL > The ticket aims to support arbitrary DataType as DDL string for from_json. > For example: > {code:sql} > select from_json('{"a":1, "b":2}', 'map') > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24005) Remove usage of Scala’s parallel collection
[ https://issues.apache.org/jira/browse/SPARK-24005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509667#comment-16509667 ] Maxim Gekk commented on SPARK-24005: [~smilegator] I am trying to reproduce the issue, but so far without luck. The following test passes: {code:scala}
test("canceling of parallel collections") {
  val conf = new SparkConf()
  sc = new SparkContext("local[1]", "par col", conf)
  val f = sc.parallelize(0 to 1, 1).map { i =>
    val par = (1 to 100).par
    val pool = ThreadUtils.newForkJoinPool("test pool", 2)
    par.tasksupport = new ForkJoinTaskSupport(pool)
    try {
      par.flatMap { j =>
        Thread.sleep(1000)
        1 to 100
      }.seq
    } finally {
      pool.shutdown()
    }
  }.takeAsync(100)
  val sem = new Semaphore(0)
  sc.addSparkListener(new SparkListener {
    override def onTaskStart(taskStart: SparkListenerTaskStart) {
      sem.release()
    }
  })
  // Wait until some tasks were launched before we cancel the job.
  sem.acquire()
  // Wait until a task executes parallel collection.
  Thread.sleep(1)
  f.cancel()
  val e = intercept[SparkException] {
    f.get()
  }.getCause
  assert(e.getMessage.contains("cancelled") || e.getMessage.contains("killed"))
}
{code} > Remove usage of Scala’s parallel collection > --- > > Key: SPARK-24005 > URL: https://issues.apache.org/jira/browse/SPARK-24005 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > Labels: starter > > {noformat} > val par = (1 to 100).par.flatMap { i => > Thread.sleep(1000) > 1 to 1000 > }.toSeq > {noformat} > We are unable to interrupt the execution of parallel collections. We need to > create a common utility function to do it, instead of using Scala parallel > collections -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14034) Converting to Dataset causes wrong order and values in nested array of documents
[ https://issues.apache.org/jira/browse/SPARK-14034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-14034. Resolution: Fixed Fix Version/s: 2.3.0 > Converting to Dataset causes wrong order and values in nested array of > documents > > > Key: SPARK-14034 > URL: https://issues.apache.org/jira/browse/SPARK-14034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Steven She >Priority: Major > Fix For: 2.3.0 > > > I'm deserializing the following JSON document into a Dataset with Spark 1.6.1 > in the console: > {noformat} > {"arr": [{"c": 1, "b": 2, "a": 3}]} > {noformat} > I have the following case classes: > {noformat} > case class X(arr: Seq[Y]) > case class Y(c: Int, b: Int, a: Int) > {noformat} > I run the following in the console to retrieve the value of `c` in the array, > which should have a value of 1 in the data file, but I get the value 3 > instead: > {noformat} > scala> sqlContext.read.json("../test.json").as[X].collect().head.arr.head.c > res19: Int = 3 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24445) Schema in json format for from_json in SQL
[ https://issues.apache.org/jira/browse/SPARK-24445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497075#comment-16497075 ] Maxim Gekk commented on SPARK-24445: I am working on the ticket at the moment. > Schema in json format for from_json in SQL > -- > > Key: SPARK-24445 > URL: https://issues.apache.org/jira/browse/SPARK-24445 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Minor > > In Spark 2.3, schema for the from_json function can be specified in JSON > format in Scala and Python but not in SQL. In SQL it is impossible to specify map > type for example because SQL DDL parser can handle struct type only. Need to > support schemas in JSON format as it has been already implemented > [there|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3225-L3229]: > {code:scala} > val dataType = try { > DataType.fromJson(schema) > } catch { > case NonFatal(_) => StructType.fromDDL(schema) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24445) Schema in json format for from_json in SQL
Maxim Gekk created SPARK-24445: -- Summary: Schema in json format for from_json in SQL Key: SPARK-24445 URL: https://issues.apache.org/jira/browse/SPARK-24445 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Maxim Gekk In Spark 2.3, the schema for the from_json function can be specified in JSON format in Scala and Python but not in SQL. In SQL it is impossible to specify, for example, a map type, because the SQL DDL parser can handle only struct types. Need to support schemas in JSON format, as has already been implemented [there|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3225-L3229]: {code:scala}
val dataType = try {
  DataType.fromJson(schema)
} catch {
  case NonFatal(_) => StructType.fromDDL(schema)
}
{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14034) Converting to Dataset causes wrong order and values in nested array of documents
[ https://issues.apache.org/jira/browse/SPARK-14034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492078#comment-16492078 ] Maxim Gekk commented on SPARK-14034: I checked on Spark 2.3: {code:scala}
case class X(arr: Seq[Y])
case class Y(c: Long, b: Long, a: Long)
spark.read.json("test.json").as[X].collect().head.arr.head.c
{code} {code} res0: Long = 1 {code} Changing the order of parameters in class Y doesn't affect the result. It seems the issue doesn't exist anymore. > Converting to Dataset causes wrong order and values in nested array of > documents > > > Key: SPARK-14034 > URL: https://issues.apache.org/jira/browse/SPARK-14034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Steven She >Priority: Major > > I'm deserializing the following JSON document into a Dataset with Spark 1.6.1 > in the console: > {noformat} > {"arr": [{"c": 1, "b": 2, "a": 3}]} > {noformat} > I have the following case classes: > {noformat} > case class X(arr: Seq[Y]) > case class Y(c: Int, b: Int, a: Int) > {noformat} > I run the following in the console to retrieve the value of `c` in the array, > which should have a value of 1 in the data file, but I get the value 3 > instead: > {noformat} > scala> sqlContext.read.json("../test.json").as[X].collect().head.arr.head.c > res19: Int = 3 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23725) Improve Hadoop's LineReader to support charsets different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528691#comment-16528691 ] Maxim Gekk commented on SPARK-23725: [~hyukjin.kwon] I am working on the implementation and have run into the problem that I cannot identify lineSep uniquely if the encoding is not specified. For example, if a partitioned file contains: {code} 65 00 31 00 0a 00 6c 00 69 {code} I cannot say definitively what the lineSep is here. It could be *0x0a 0x00* if the encoding is UTF-16LE in: {code}
6c 00 69 00 6e 00 65 00 31 00 0a 00 6c 00 69 00 |l.i.n.e.1...l.i.|
0010 6e 00 65 00 32 00 |n.e.2.|
0016
{code} or *0x00 0x0a* in UTF-16BE encoding in the text: {code}
00 6c 00 69 00 6e 00 65 00 31 00 0a 00 6c 00 69 |.l.i.n.e.1...l.i|
0010 00 6e 00 65 00 32 |.n.e.2|
0016
{code} So, to detect lineSep automatically, we should require the encoding to be specified. > Improve Hadoop's LineReader to support charsets different from UTF-8 > > > Key: SPARK-23725 > URL: https://issues.apache.org/jira/browse/SPARK-23725 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > If the record delimiter is not specified, Hadoop LineReader splits > lines/records by '\n', '\r' or/and '\r\n' in UTF-8 encoding: > [https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L173-L177] > . The implementation should be improved to support any charset. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
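The ambiguity described in the comment above can be reproduced with plain JVM charsets (a standalone sketch, independent of Spark): the same byte sequence contains a line feed under UTF-16LE but not under UTF-16BE.

```scala
import java.nio.charset.StandardCharsets

// The bytes 31 00 0a 00 6c 00 from the hexdump above.
val bytes = Array[Byte](0x31, 0x00, 0x0a, 0x00, 0x6c, 0x00)

// Decoded as UTF-16LE this is "1\nl" -- it contains a record separator.
val le = new String(bytes, StandardCharsets.UTF_16LE)

// Decoded as UTF-16BE the same bytes are three unrelated characters
// (U+3100, U+0A00, U+6C00) with no '\n' at all.
val be = new String(bytes, StandardCharsets.UTF_16BE)

println(le.contains('\n'))  // true
println(be.contains('\n'))  // false
```

So byte-level content alone cannot tell a reader which two-byte sequence is the record separator; the encoding has to be known first.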
[jira] [Resolved] (SPARK-24642) Add a function which infers schema from a JSON column
[ https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-24642. Resolution: Won't Fix > Add a function which infers schema from a JSON column > - > > Key: SPARK-24642 > URL: https://issues.apache.org/jira/browse/SPARK-24642 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new aggregate function - *infer_schema()*. The function should > infer schema for set of JSON strings. The result of the function is a schema > in DDL format (or JSON format). > One of the use cases is passing output of *infer_schema()* to *from_json()*. > Currently, the from_json() function requires a schema as a mandatory > argument. It is possible to infer schema programmatically in Scala/Python and > pass it as the second argument but in SQL it is not possible. An user has to > pass schema as string literal in SQL. The new function should allow to use it > in SQL like in the example: > {code:sql} > select from_json(json_col, infer_schema(json_col)) > from json_table; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24709) Inferring schema from JSON string literal
Maxim Gekk created SPARK-24709: -- Summary: Inferring schema from JSON string literal Key: SPARK-24709 URL: https://issues.apache.org/jira/browse/SPARK-24709 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Need to add a new function - *schema_of_json()*. The function should infer the schema of a JSON string literal. The result of the function is a schema in DDL format. One of the use cases is passing the output of _schema_of_json()_ to *from_json()*. Currently, the _from_json()_ function requires a schema as a mandatory argument. A user has to pass the schema as a string literal in SQL. The new function should allow inferring the schema from an example. Let's say json_col is a column containing JSON strings that all share the same schema. It should be possible to pass one example JSON string with that schema to _schema_of_json()_, which infers the schema from the particular example. {code:sql} select from_json(json_col, schema_of_json('{"f1": 0, "f2": [0], "f3": "a"}')) from json_table; {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24642) Add a function which infers schema from a JSON column
[ https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529030#comment-16529030 ] Maxim Gekk commented on SPARK-24642: I created a new ticket, SPARK-24709, which aims to add a simpler function. > Add a function which infers schema from a JSON column > - > > Key: SPARK-24642 > URL: https://issues.apache.org/jira/browse/SPARK-24642 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new aggregate function - *infer_schema()*. The function should > infer schema for set of JSON strings. The result of the function is a schema > in DDL format (or JSON format). > One of the use cases is passing output of *infer_schema()* to *from_json()*. > Currently, the from_json() function requires a schema as a mandatory > argument. It is possible to infer schema programmatically in Scala/Python and > pass it as the second argument but in SQL it is not possible. An user has to > pass schema as string literal in SQL. The new function should allow to use it > in SQL like in the example: > {code:sql} > select from_json(json_col, infer_schema(json_col)) > from json_table; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24642) Add a function which infers schema from a JSON column
[ https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529030#comment-16529030 ] Maxim Gekk edited comment on SPARK-24642 at 7/1/18 10:05 AM: - [~rxin] I created a new ticket, SPARK-24709, which aims to add a simpler function. Here is the PR for it: https://github.com/apache/spark/pull/21686. was (Author: maxgekk): I created new ticket SPARK-24709 which aims to add simpler function. > Add a function which infers schema from a JSON column > - > > Key: SPARK-24642 > URL: https://issues.apache.org/jira/browse/SPARK-24642 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new aggregate function - *infer_schema()*. The function should > infer schema for set of JSON strings. The result of the function is a schema > in DDL format (or JSON format). > One of the use cases is passing output of *infer_schema()* to *from_json()*. > Currently, the from_json() function requires a schema as a mandatory > argument. It is possible to infer schema programmatically in Scala/Python and > pass it as the second argument but in SQL it is not possible. An user has to > pass schema as string literal in SQL. The new function should allow to use it > in SQL like in the example: > {code:sql} > select from_json(json_col, infer_schema(json_col)) > from json_table; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24643) from_json should accept an aggregate function as schema
[ https://issues.apache.org/jira/browse/SPARK-24643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-24643. Resolution: Won't Fix > from_json should accept an aggregate function as schema > --- > > Key: SPARK-24643 > URL: https://issues.apache.org/jira/browse/SPARK-24643 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Currently, the *from_json()* function accepts only string literals as schema: > - Checking of schema argument inside of JsonToStructs: > [https://github.com/apache/spark/blob/b8f27ae3b34134a01998b77db4b7935e7f82a4fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L530] > - Accepting only string literal: > [https://github.com/apache/spark/blob/b8f27ae3b34134a01998b77db4b7935e7f82a4fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L749-L752] > JsonToStructs should be modified to accept results of aggregate functions > like *infer_schema* (see SPARK-24642). It should be possible to write SQL > like: > {code:sql} > select from_json(json_col, infer_schema(json_col)) from json_table > {code} > Here is a test case with existing aggregate function - *first()*: > {code:sql} > create temporary view schemas(schema) as select * from values > ('struct'), > ('map'); > select from_json('{"a":1}', first(schema)) from schemas; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24445) Schema in json format for from_json in SQL
[ https://issues.apache.org/jira/browse/SPARK-24445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-24445. Resolution: Won't Fix > Schema in json format for from_json in SQL > -- > > Key: SPARK-24445 > URL: https://issues.apache.org/jira/browse/SPARK-24445 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Minor > > In Spark 2.3, schema for the from_json function can be specified in JSON > format in Scala and Python but not in SQL. In SQL it is impossible to specify map > type for example because SQL DDL parser can handle struct type only. Need to > support schemas in JSON format as it has been already implemented > [there|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3225-L3229]: > {code:scala} > val dataType = try { > DataType.fromJson(schema) > } catch { > case NonFatal(_) => StructType.fromDDL(schema) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9775) Query Mesos for number of CPUs to set default parallelism
[ https://issues.apache.org/jira/browse/SPARK-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522064#comment-16522064 ] Maxim Gekk commented on SPARK-9775: --- Please change other related methods as proposed in the PR: https://github.com/apache/spark/pull/21589 > Query Mesos for number of CPUs to set default parallelism > - > > Key: SPARK-9775 > URL: https://issues.apache.org/jira/browse/SPARK-9775 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 1.4.1 >Reporter: Peder Ås >Priority: Minor > > As highlighted in a TODO on line 400 of MesosSchedulerBackend.scala (at least > on 3ca995b7) we should query the Mesos master and set the default parallelism > based on the number of CPUs available in the cluster (and multiply by two or > three?) > See code in question [here > (gitweb)|https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala;h=3f63ec1c;hb=HEAD#l400]. > This task should also update the documentation [here > (gitweb)|https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=docs/configuration.md;h=c60dd1;hb=HEAD#l789] > to highlight the fact. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24642) Add a function which infers schema from a JSON column
[ https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524802#comment-16524802 ] Maxim Gekk commented on SPARK-24642: > Do we want this as an aggregate function? I thought of something similar to the inferSchema flag, with which the CSV datasource triggers a separate job to infer the schema, but for JSON files. > I'm thinking it's better to just take a string and infers the schema on the > string. In general that looks much cheaper than scanning the full input with an aggregate function, but we have the opportunity to minimize the number of rows touched by the aggregate via sampling or by using just the first few rows of each partition. And what happens if some JSON strings are not complete, like: {code}
{"a": 1}
{"b": [1,2,3]}
{"a": 3, "b": [10, 11, 12]}
{code} In that case, each parsed JSON string will have a different inferred schema, right? Which schema should we assign to the parsed JSON column? > How would the query you provide compile if it is an aggregate function? I am going to assign the from_json name to the FromJson case class, and write the following rule to trigger a job that replaces the aggregate with a string literal, as in the code snippet (thank you [~hvanhovell] for the code): {code}
case class FromJson(child: Expression) extends Expression {
  ...
}

class SchemaInferringRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    plan transform { case node =>
      node.transformExpressions {
        case FromJson(e) =>
          // Kick off inference
          val query = new QueryExecution(
            session, Project(Seq(Alias(InferSchema(e), "schema")()), node))
          val Array(row) = query.executedPlan.executeCollect()
          val schema = Literal(row.getUTF8String(0), StringType)
          new JsonToStructs(e, schema)
      }
    }
  }
}
{code} > Add a function which infers schema from a JSON column > - > > Key: SPARK-24642 > URL: https://issues.apache.org/jira/browse/SPARK-24642 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new aggregate function - *infer_schema()*. The function should > infer schema for set of JSON strings. The result of the function is a schema > in DDL format (or JSON format). > One of the use cases is passing output of *infer_schema()* to *from_json()*. > Currently, the from_json() function requires a schema as a mandatory > argument. It is possible to infer schema programmatically in Scala/Python and > pass it as the second argument but in SQL it is not possible. An user has to > pass schema as string literal in SQL. The new function should allow to use it > in SQL like in the example: > {code:sql} > select from_json(json_col, infer_schema(json_col)) > from json_table; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24164) Support column list as the pivot column in Pivot
[ https://issues.apache.org/jira/browse/SPARK-24164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530448#comment-16530448 ] Maxim Gekk commented on SPARK-24164: [~maryannxue] Are you working on this feature, or do you plan to work on it? > Support column list as the pivot column in Pivot > > > Key: SPARK-24164 > URL: https://issues.apache.org/jira/browse/SPARK-24164 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Major > > This is part of a functionality extension to Pivot SQL support as SPARK-24035. > Currently, we only support a single column as the pivot column, while a > column list as the pivot column would look like: > {code:java} > SELECT * FROM ( > SELECT year, course, earnings FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR (course, year) IN (('dotNET', 2012), ('Java', 2013)) > );{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24722) Column-based API for pivoting
Maxim Gekk created SPARK-24722: -- Summary: Column-based API for pivoting Key: SPARK-24722 URL: https://issues.apache.org/jira/browse/SPARK-24722 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Currently, the pivot() function accepts the pivot column only as a string. This is inconsistent with the groupBy API and makes it problematic to use nested columns as the pivot column. `Column` support is needed for (a) API consistency, (b) user productivity and (c) performance. In general, we should follow the principle of least astonishment (https://en.wikipedia.org/wiki/Principle_of_least_astonishment) in designing the API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24118) Support lineSep format independent from encoding
Maxim Gekk created SPARK-24118: -- Summary: Support lineSep format independent from encoding Key: SPARK-24118 URL: https://issues.apache.org/jira/browse/SPARK-24118 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Maxim Gekk Currently, the lineSep option of the JSON datasource depends on the encoding. For example, it is impossible to define a correct lineSep for JSON files with a BOM in UTF-16 or UTF-32 encodings. Need to propose a format for lineSep which represents a sequence of octets (bytes) and is independent of the encoding. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
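One possible direction for the ticket, sketched standalone (splitOnByteSep is a hypothetical helper, not Spark API): if lineSep is given as raw bytes, the input can be split before any charset decoding happens, so the separator works the same under every encoding.

```scala
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: split a byte stream on a byte-level record separator
// *before* decoding, so the separator is encoding-independent.
def splitOnByteSep(data: Array[Byte], sep: Seq[Byte]): Seq[Array[Byte]] = {
  val d = data.toSeq  // Seq view of the bytes for indexOfSlice
  val out = ArrayBuffer.empty[Array[Byte]]
  var start = 0
  var i = d.indexOfSlice(sep, start)
  while (i >= 0) {
    out += data.slice(start, i)
    start = i + sep.length
    i = d.indexOfSlice(sep, start)
  }
  out += data.slice(start, data.length)
  out.toSeq
}

// "line1\nline2" in UTF-16LE: the newline is the byte pair 0x0a 0x00.
val utf16le = "line1\nline2".getBytes(StandardCharsets.UTF_16LE)
val records = splitOnByteSep(utf16le, Seq[Byte](0x0a, 0x00))
  .map(new String(_, StandardCharsets.UTF_16LE))
// records now contains "line1" and "line2"
```

A real implementation would additionally need to ensure the separator match is aligned to code-unit boundaries (a naive byte search could match across two adjacent characters).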
[jira] [Created] (SPARK-24171) Update comments for non-deterministic functions
Maxim Gekk created SPARK-24171: -- Summary: Update comments for non-deterministic functions Key: SPARK-24171 URL: https://issues.apache.org/jira/browse/SPARK-24171 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 2.3.0 Reporter: Maxim Gekk The descriptions of non-deterministic functions like _collect_list()_ and _first()_ don't mention their non-determinism. Need to add a notice about this behavior to the user-facing docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23410) Unable to read jsons in charset different from UTF-8
Maxim Gekk created SPARK-23410: -- Summary: Unable to read jsons in charset different from UTF-8 Key: SPARK-23410 URL: https://issues.apache.org/jira/browse/SPARK-23410 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.3.0 Reporter: Maxim Gekk Currently the JSON parser is forced to read JSON files in UTF-8. This behavior breaks backward compatibility with Spark 2.2.1 and previous versions, which can read JSON files in UTF-16, UTF-32 and other encodings thanks to the auto-detection mechanism of the Jackson library. Need to give users back the ability to read JSON files in a specified charset and/or to detect the charset automatically, as before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
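For reference, the JDK's built-in "UTF-16" charset already performs this kind of BOM-based auto-detection (a standalone sketch, independent of Spark and Jackson): it consumes the byte-order mark, infers the byte order, and drops the BOM from the decoded text.

```scala
import java.nio.charset.StandardCharsets

val json = """{"city": "Amsterdam"}"""

// Encoding with the "UTF-16" charset writes a BOM followed by big-endian
// code units (the JDK default for this charset).
val bytes = json.getBytes(StandardCharsets.UTF_16)

// Decoding with "UTF-16" reads the BOM, infers the byte order, and strips
// the BOM from the result -- a round trip preserves the original text.
val decoded = new String(bytes, StandardCharsets.UTF_16)
println(decoded == json)  // true
```

This is roughly the behavior the ticket asks to restore: a reader that either honors an explicitly specified charset or detects it from the BOM.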
[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-23410: --- Shepherd: Herman van Hovell > Unable to read jsons in charset different from UTF-8 > > > Key: SPARK-23410 > URL: https://issues.apache.org/jira/browse/SPARK-23410 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > > Currently the Json Parser is forced to read json files in UTF-8. Such > behavior breaks backward compatibility with Spark 2.2.1 and previous versions > that can read json files in UTF-16, UTF-32 and other encodings due to using > of the auto detection mechanism of the jackson library. Need to give back to > users possibility to read json files in specified charset and/or detect > charset automatically as it was before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364849#comment-16364849 ] Maxim Gekk commented on SPARK-23410: [~bersprockets] does your json contain BOM in the first 2 bytes? > Unable to read jsons in charset different from UTF-8 > > > Key: SPARK-23410 > URL: https://issues.apache.org/jira/browse/SPARK-23410 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > > Currently the Json Parser is forced to read json files in UTF-8. Such > behavior breaks backward compatibility with Spark 2.2.1 and previous versions > that can read json files in UTF-16, UTF-32 and other encodings due to using > of the auto detection mechanism of the jackson library. Need to give back to > users possibility to read json files in specified charset and/or detect > charset automatically as it was before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364849#comment-16364849 ] Maxim Gekk edited comment on SPARK-23410 at 2/14/18 10:20 PM: -- [~bersprockets] does your json contain BOM in the first 2 bytes? By using the BOM, jackson detects encoding: https://github.com/FasterXML/jackson-core/blob/2.6/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L110-L173 was (Author: maxgekk): [~bersprockets] does your json contain BOM in the first 2 bytes? > Unable to read jsons in charset different from UTF-8 > > > Key: SPARK-23410 > URL: https://issues.apache.org/jira/browse/SPARK-23410 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > > Currently the Json Parser is forced to read json files in UTF-8. Such > behavior breaks backward compatibility with Spark 2.2.1 and previous versions > that can read json files in UTF-16, UTF-32 and other encodings due to using > of the auto detection mechanism of the jackson library. Need to give back to > users possibility to read json files in specified charset and/or detect > charset automatically as it was before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
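For context on the BOM-driven detection referenced above, here is a simplified standalone sketch (Python, not the actual jackson code) of how a BOM can select the charset. Note that the 4-byte BOMs must be checked before the 2-byte ones, since BOM_UTF32_LE begins with the same bytes as BOM_UTF16_LE; the real jackson bootstrapper additionally handles BOM-less input by inspecting zero-byte patterns.

```python
import codecs
import json

def detect_and_decode(raw: bytes) -> str:
    # Check 4-byte BOMs first: BOM_UTF32_LE (ff fe 00 00) starts with
    # BOM_UTF16_LE (ff fe), so order matters.
    for bom, enc in [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF8, "utf-8"),
    ]:
        if raw.startswith(bom):
            return raw[len(bom):].decode(enc)
    return raw.decode("utf-8")  # fall back to the JSON default

raw = codecs.BOM_UTF16_BE + '{"firstName": "Chris"}'.encode("utf-16-be")
print(json.loads(detect_and_decode(raw)))
```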
[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-23410: --- Attachment: utf16WithBOM.json > Unable to read jsons in charset different from UTF-8 > > > Key: SPARK-23410 > URL: https://issues.apache.org/jira/browse/SPARK-23410 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Attachments: utf16WithBOM.json > > > Currently the Json Parser is forced to read json files in UTF-8. Such > behavior breaks backward compatibility with Spark 2.2.1 and previous versions > that can read json files in UTF-16, UTF-32 and other encodings due to using > of the auto detection mechanism of the jackson library. Need to give back to > users possibility to read json files in specified charset and/or detect > charset automatically as it was before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364889#comment-16364889 ] Maxim Gekk commented on SPARK-23410: I am working on a fix, just in case > Unable to read jsons in charset different from UTF-8 > > > Key: SPARK-23410 > URL: https://issues.apache.org/jira/browse/SPARK-23410 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Attachments: utf16WithBOM.json > > > Currently the Json Parser is forced to read json files in UTF-8. Such > behavior breaks backward compatibility with Spark 2.2.1 and previous versions > that can read json files in UTF-16, UTF-32 and other encodings due to using > of the auto detection mechanism of the jackson library. Need to give back to > users possibility to read json files in specified charset and/or detect > charset automatically as it was before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364875#comment-16364875 ] Maxim Gekk commented on SPARK-23410: I attached the file on which I tested on 2.2.1:
{code:scala}
import org.apache.spark.sql.types._
val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
spark.read.schema(schema).json("utf16WithBOM.json").show
{code}
{code}
+---------+--------+
|firstName|lastName|
+---------+--------+
|    Chris|   Baird|
|     null|    null|
|     Doug|    Rood|
|     null|    null|
|     null|    null|
+---------+--------+
{code}
> Unable to read jsons in charset different from UTF-8 > > > Key: SPARK-23410 > URL: https://issues.apache.org/jira/browse/SPARK-23410 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Attachments: utf16WithBOM.json > > > Currently the Json Parser is forced to read json files in UTF-8. Such > behavior breaks backward compatibility with Spark 2.2.1 and previous versions > that can read json files in UTF-16, UTF-32 and other encodings due to using > of the auto detection mechanism of the jackson library. Need to give back to > users possibility to read json files in specified charset and/or detect > charset automatically as it was before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366547#comment-16366547 ] Maxim Gekk commented on SPARK-23410: [~sameerag] It is not blocker anymore. I unset the blocker flag. > Unable to read jsons in charset different from UTF-8 > > > Key: SPARK-23410 > URL: https://issues.apache.org/jira/browse/SPARK-23410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Attachments: utf16WithBOM.json > > > Currently the Json Parser is forced to read json files in UTF-8. Such > behavior breaks backward compatibility with Spark 2.2.1 and previous versions > that can read json files in UTF-16, UTF-32 and other encodings due to using > of the auto detection mechanism of the jackson library. Need to give back to > users possibility to read json files in specified charset and/or detect > charset automatically as it was before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-23410: --- Priority: Major (was: Blocker) > Unable to read jsons in charset different from UTF-8 > > > Key: SPARK-23410 > URL: https://issues.apache.org/jira/browse/SPARK-23410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Attachments: utf16WithBOM.json > > > Currently the Json Parser is forced to read json files in UTF-8. Such > behavior breaks backward compatibility with Spark 2.2.1 and previous versions > that can read json files in UTF-16, UTF-32 and other encodings due to using > of the auto detection mechanism of the jackson library. Need to give back to > users possibility to read json files in specified charset and/or detect > charset automatically as it was before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24777) Refactor AVRO read/write benchmark
[ https://issues.apache.org/jira/browse/SPARK-24777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560896#comment-16560896 ] Maxim Gekk commented on SPARK-24777: [~Gengliang.Wang] Which benchmarks are you going to add? Just in case, I can gather typical use cases from our users. > Refactor AVRO read/write benchmark > -- > > Key: SPARK-24777 > URL: https://issues.apache.org/jira/browse/SPARK-24777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24959) Do not invoke the CSV/JSON parser for empty schema
Maxim Gekk created SPARK-24959: -- Summary: Do not invoke the CSV/JSON parser for empty schema Key: SPARK-24959 URL: https://issues.apache.org/jira/browse/SPARK-24959 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Currently, the JSON and CSV parsers are invoked even if the required schema is empty. Invoking the parser for each line has non-zero overhead, and this work can be skipped. Such an optimization should speed up count(), for example. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
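A minimal standalone sketch of the idea (not Spark's actual implementation): when no columns are required, counting rows does not need the parser at all.

```python
import json

def count_rows(lines, required_fields):
    # With an empty required schema (e.g. for count()), skip the parser
    # entirely and just count the input lines.
    if not required_fields:
        return sum(1 for _ in lines)
    # Otherwise each line must be parsed to project the required fields.
    return sum(1 for line in lines if json.loads(line) is not None)

lines = ['{"a": 1}', '{"a": 2}', '{"a": 3}']
print(count_rows(lines, []))     # no parsing happens on this path
print(count_rows(lines, ["a"]))
```

One caveat the sketch ignores: in the real datasource, malformed-record handling (parse modes) still has to be honored, so the optimization must not change observable semantics for corrupt input.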
[jira] [Created] (SPARK-24945) Switch to unoVocity 2.7.2
Maxim Gekk created SPARK-24945: -- Summary: Switch to unoVocity 2.7.2 Key: SPARK-24945 URL: https://issues.apache.org/jira/browse/SPARK-24945 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk The recent version 2.7.2 of the uniVocity parser includes the fix: https://github.com/uniVocity/univocity-parsers/issues/250 . We no longer need the workaround from https://github.com/apache/spark/pull/21631 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24945) Switch to uniVocity 2.7.2
[ https://issues.apache.org/jira/browse/SPARK-24945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-24945: --- Summary: Switch to uniVocity 2.7.2 (was: Switch to unoVocity 2.7.2) > Switch to uniVocity 2.7.2 > - > > Key: SPARK-24945 > URL: https://issues.apache.org/jira/browse/SPARK-24945 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > The recent version 2.7.2 of uniVocity parser includes the fix: > https://github.com/uniVocity/univocity-parsers/issues/250 . We don't need the > fix https://github.com/apache/spark/pull/21631 anymore -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24945) Switch to uniVocity >= 2.7.2
[ https://issues.apache.org/jira/browse/SPARK-24945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-24945: --- Summary: Switch to uniVocity >= 2.7.2 (was: Switch to uniVocity 2.7.2) > Switch to uniVocity >= 2.7.2 > > > Key: SPARK-24945 > URL: https://issues.apache.org/jira/browse/SPARK-24945 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > The recent version 2.7.2 of uniVocity parser includes the fix: > https://github.com/uniVocity/univocity-parsers/issues/250 . We don't need the > fix https://github.com/apache/spark/pull/21631 anymore -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24945) Switch to uniVocity >= 2.7.2
[ https://issues.apache.org/jira/browse/SPARK-24945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-24945: --- Description: The recent version 2.7.2 of uniVocity parser includes the fix: https://github.com/uniVocity/univocity-parsers/issues/250 . And the recent version has better performance. (was: The recent version 2.7.2 of uniVocity parser includes the fix: https://github.com/uniVocity/univocity-parsers/issues/250 . We don't need the fix https://github.com/apache/spark/pull/21631 anymore) > Switch to uniVocity >= 2.7.2 > > > Key: SPARK-24945 > URL: https://issues.apache.org/jira/browse/SPARK-24945 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > The recent version 2.7.2 of uniVocity parser includes the fix: > https://github.com/uniVocity/univocity-parsers/issues/250 . And the recent > version has better performance. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24952) Support LZMA2 compression by Avro datasource
Maxim Gekk created SPARK-24952: -- Summary: Support LZMA2 compression by Avro datasource Key: SPARK-24952 URL: https://issues.apache.org/jira/browse/SPARK-24952 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk LZMA2 (XZ) has a much better compression ratio compared to the currently supported snappy and deflate codecs. The underlying Avro library already supports this compression codec. We need to set parameters for the codec and allow users to specify "xz" compression via AvroOptions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
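To get a feel for the trade-off, here is a standalone Python sketch using the stdlib's lzma module (which implements LZMA2 in the .xz container) against zlib's deflate; actual ratios depend heavily on the data, and xz typically pays for its ratio with extra CPU time.

```python
import lzma
import zlib

# Repetitive, highly compressible sample data (illustration only).
data = ("spark avro codec benchmark " * 4000).encode("utf-8")

deflated = zlib.compress(data, 9)      # deflate, one of Avro's current codecs
xz = lzma.compress(data, preset=9)     # LZMA2 in the .xz container

print(len(data), len(deflated), len(xz))
```

Both outputs are far smaller than the input; comparing the two sizes (and timing the calls) on representative data is the honest way to pick a codec.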
[jira] [Created] (SPARK-25048) Pivoting by multiple columns
Maxim Gekk created SPARK-25048: -- Summary: Pivoting by multiple columns Key: SPARK-25048 URL: https://issues.apache.org/jira/browse/SPARK-25048 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk We need to change or extend the existing API to make pivoting by multiple columns possible. Users should be able to use multiple columns and values, as in the example:
{code:scala}
trainingSales
  .groupBy($"sales.year")
  .pivot(struct(lower($"sales.course"), $"training"), Seq(
    struct(lit("dotnet"), lit("Experts")),
    struct(lit("java"), lit("Dummies")))
  ).agg(sum($"sales.earnings"))
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25048) Pivoting by multiple columns in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-25048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-25048: --- Summary: Pivoting by multiple columns in Scala/Java (was: Pivoting by multiple columns) > Pivoting by multiple columns in Scala/Java > -- > > Key: SPARK-25048 > URL: https://issues.apache.org/jira/browse/SPARK-25048 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to change or extend existing API to make pivoting by multiple columns > possible. Users should be able to use many columns and values like in the > example: > {code:scala} > trainingSales > .groupBy($"sales.year") > .pivot(struct(lower($"sales.course"), $"training"), Seq( > struct(lit("dotnet"), lit("Experts")), > struct(lit("java"), lit("Dummies"))) > ).agg(sum($"sales.earnings")) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25195) Extending from_json function
[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589335#comment-16589335 ] Maxim Gekk commented on SPARK-25195: > Problem number 1: The from_json function accepts as a schema only StructType > or ArrayType(StructType), but not an ArrayType of primitives. This was fixed recently: https://github.com/apache/spark/pull/21439 > Extending from_json function > > > Key: SPARK-25195 > URL: https://issues.apache.org/jira/browse/SPARK-25195 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > Dear Spark and PySpark maintainers, > I hope, that opening a JIRA issue is the correct way to request an > improvement. If it's not, please forgive me and kindly instruct me on how to > do it instead. > At our company, we are currently rewriting a lot of old MapReduce code with > SPARK, and the following use-case is quite frequent: Some string-valued > dataframe columns are JSON-arrays, and we want to parse them into array-typed > columns. > Problem number 1: The from_json function accepts as a schema only > StructType or ArrayType(StructType), but not an ArrayType of primitives. > Submitting the schema in a string form like > {noformat}{"containsNull":true,"elementType":"string","type":"array"}{noformat} > does not work either, the error message says, among other things, > {noformat}data type mismatch: Input schema array must be a struct or > an array of structs.{noformat} > Problem number 2: Sometimes, in our JSON arrays we have elements of > different types. 
For example, we might have some JSON array like > {noformat}["string_value", 0, true, null]{noformat} which is JSON-valid with > schema > {noformat}{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}{noformat} > (and, for instance the Python json.loads function has no problem parsing > this), but such a schema is not recognized, at all. The error message gets > quite unreadable after the words {noformat}ParseException: u'\nmismatched > input{noformat}
> Here is some simple Python code to reproduce the problems (using pyspark 2.3.1 and pandas 0.23.4):
> {noformat}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> from pyspark.sql.types import StringType, ArrayType
> spark = SparkSession.builder.appName('test').getOrCreate()
> data = {'id' : [1,2,3], 'data' : ['["string1", true, null]', '["string2", false, null]', '["string3", true, "another_string3"]']}
> pdf = pd.DataFrame.from_dict(data)
> df = spark.createDataFrame(pdf)
> df.show()
> df = df.withColumn("parsed_data", F.from_json(F.col('data'), ArrayType(StringType())))  # Does not work, because not a struct or an array of structs
> df = df.withColumn("parsed_data", F.from_json(F.col('data'), '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}'))  # Does not work at all
> {noformat}
> For now, we have to use a UDF function, which calls python's json.loads, but this is, for obvious reasons, suboptimal. If you could extend the functionality of the Spark from_json function in the next release, this would be really helpful. Thank you in advance! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25240) A deadlock in ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/SPARK-25240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-25240: --- Summary: A deadlock in ALTER TABLE RECOVER PARTITIONS (was: Dead-lock in ALTER TABLE RECOVER PARTITIONS) > A deadlock in ALTER TABLE RECOVER PARTITIONS > > > Key: SPARK-25240 > URL: https://issues.apache.org/jira/browse/SPARK-25240 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Recover Partitions in ALTER TABLE is performed recursively by calling > the scanPartitions() method. scanPartitions() lists files sequentially or in > parallel if the > [condition|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L685] > is true: > {code:scala} > partitionNames.length > 1 && statuses.length > threshold || > partitionNames.length > 2 > {code} > Parallel listing is executed on [the fixed thread > pool|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L622] > which can have 8 threads in total. A deadlock occurs when all 8 threads are > already occupied and a recursive call of scanPartitions() submits new > parallel file listings. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25240) Dead-lock in ALTER TABLE RECOVER PARTITIONS
Maxim Gekk created SPARK-25240: -- Summary: Dead-lock in ALTER TABLE RECOVER PARTITIONS Key: SPARK-25240 URL: https://issues.apache.org/jira/browse/SPARK-25240 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Recover Partitions in ALTER TABLE is performed recursively by calling the scanPartitions() method. scanPartitions() lists files sequentially or in parallel if the [condition|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L685] is true: {code:scala} partitionNames.length > 1 && statuses.length > threshold || partitionNames.length > 2 {code} Parallel listing is executed on [the fixed thread pool|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L622] which can have 8 threads in total. A deadlock occurs when all 8 threads are already occupied and a recursive call of scanPartitions() submits new parallel file listings. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
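The failure mode — a fixed-size pool exhausted by tasks that themselves submit work to the same pool and wait on it — can be reproduced with a standalone sketch (Python here for brevity, not the Spark code; a timeout stands in for the real hang, which would block forever):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

pool = ThreadPoolExecutor(max_workers=1)  # stands in for the 8-thread pool

def child():
    return "listed"

def parent():
    # Recursive submission: parent holds the only worker thread while
    # waiting for child, which can never be scheduled on the full pool.
    inner = pool.submit(child)
    try:
        # A real caller would wait forever; the timeout keeps the demo finite.
        return inner.result(timeout=2)
    except TimeoutError:
        return "deadlocked"

result = pool.submit(parent).result()
print(result)
```

The usual fixes are to run nested work inline once the pool is saturated, or to use a scheduler that grows the pool for blocked tasks instead of queueing behind them.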
[jira] [Commented] (SPARK-25199) InferSchema "all Strings" if one of many CSVs is empty
[ https://issues.apache.org/jira/browse/SPARK-25199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592704#comment-16592704 ] Maxim Gekk commented on SPARK-25199: I wasn't able to reproduce the issue on the current master:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
      /_/

Using Python version 2.7.15 (default, Aug 22 2018 16:36:18)
>>> df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("tmp/csv/*.csv")
>>> df.printSchema()
root
 |-- a: integer (nullable = true)
 |-- b: integer (nullable = true)
{code}
for two csv files, one of which is empty:
{code:java}
tree -h ./csv
./csv
├── [   8]  1.csv
└── [   0]  2.csv
{code}
> InferSchema "all Strings" if one of many CSVs is empty > -- > > Key: SPARK-25199 > URL: https://issues.apache.org/jira/browse/SPARK-25199 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.2.1 > Environment: I discovered this on AWS Glue, which uses Spark 2.2.1 >Reporter: Neil McGuigan >Priority: Minor > Labels: newbie > > Spark can load multiple CSV files in one read: > df = spark.read.format("csv").option("header", "true").option("inferSchema", > "true").load("/*.csv") > However, if one of these files is empty (though it has a header), Spark will > set all column types to "String" > Spark should skip a file for inference if it contains no (non-header) rows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25199) InferSchema "all Strings" if one of many CSVs is empty
[ https://issues.apache.org/jira/browse/SPARK-25199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-25199. Resolution: Cannot Reproduce > InferSchema "all Strings" if one of many CSVs is empty > -- > > Key: SPARK-25199 > URL: https://issues.apache.org/jira/browse/SPARK-25199 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.2.1 > Environment: I discovered this on AWS Glue, which uses Spark 2.2.1 >Reporter: Neil McGuigan >Priority: Minor > Labels: newbie > > Spark can load multiple CSV files in one read: > df = spark.read.format("csv").option("header", "true").option("inferSchema", > "true").load("/*.csv") > However, if one of these files is empty (though it has a header), Spark will > set all column types to "String" > Spark should skip a file for inference if it contains no (non-header) rows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25243) Use FailureSafeParser in from_json
Maxim Gekk created SPARK-25243: -- Summary: Use FailureSafeParser in from_json Key: SPARK-25243 URL: https://issues.apache.org/jira/browse/SPARK-25243 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk The [FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28] is used when parsing JSON and CSV files and datasets of strings. It supports the [PERMISSIVE, DROPMALFORMED and FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44] modes. The ticket aims to make the from_json function consistent with regular parsing via FailureSafeParser and to support the above modes -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
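The three modes can be sketched in a standalone way (simplified; Spark's FailureSafeParser additionally preserves the raw text in a configurable corrupt-record column rather than a fixed key):

```python
import json

def failure_safe_parse(lines, mode="PERMISSIVE"):
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except ValueError:
            if mode == "FAILFAST":
                raise                                    # fail on first bad record
            if mode == "PERMISSIVE":
                rows.append({"_corrupt_record": line})   # keep a placeholder row
            # DROPMALFORMED: silently skip the bad record
    return rows

lines = ['{"a": 1}', 'oops', '{"a": 2}']
print(len(failure_safe_parse(lines, "PERMISSIVE")))      # 3
print(len(failure_safe_parse(lines, "DROPMALFORMED")))   # 2
```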
[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584357#comment-16584357 ] Maxim Gekk commented on SPARK-17916: > The default behavior in 2.3.x for csv format is that when i write out null >value, it comes back in as null. when i write out empty string, it also comes >back in as null. [~koert] Please, have a look at the added test: [https://github.com/apache/spark/pull/21273/files#diff-219ac8201e443435499123f96e94d29fR1355] . It checks exactly what you described. If you have something different, please, leave the code here. > CSV data source treats empty string as null no matter what nullValue option is > -- > > Key: SPARK-17916 > URL: https://issues.apache.org/jira/browse/SPARK-17916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.0 > > > When user configures {{nullValue}} in CSV data source, in addition to those > values, all empty string values are also converted to null. > {code} > data: > col1,col2 > 1,"-" > 2,"" > {code} > {code} > spark.read.format("csv").option("nullValue", "-") > {code} > We will find a null in both rows. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25227) Extend functionality of to_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593368#comment-16593368 ] Maxim Gekk commented on SPARK-25227: > I don't know about to_json. Maybe Maxim Gekk can comment more on that. Here is the PR for that: https://github.com/apache/spark/pull/6 . Please, review it. > Extend functionality of to_json to support arrays of differently-typed > elements > --- > > Key: SPARK-25227 > URL: https://issues.apache.org/jira/browse/SPARK-25227 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, the 'to_json' function only supports a STRUCT or an ARRAY of > STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with > Spark 2.4, but it will only support arrays of elements of same data type. It > will not, for example, support JSON-arrays like > {noformat} > ["string_value", 0, true, null] > {noformat} > which is JSON-valid with schema > {noformat} > {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"} > {noformat} > We would like to kindly ask you to add support for different-typed element > arrays in the 'to_json' function. This will necessitate extending the > functionality of ArrayType or maybe adding a new type (refer to > [[SPARK-25225]]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25252) Support arrays of any types in to_json
Maxim Gekk created SPARK-25252: -- Summary: Support arrays of any types in to_json Key: SPARK-25252 URL: https://issues.apache.org/jira/browse/SPARK-25252 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk We need to improve the to_json function and make it more consistent with from_json by supporting arrays of any type (as the root type). For now, it supports only arrays of structs and arrays of maps. After the changes, the following code should work:
{code:scala}
select to_json(array('1','2','3'))
> ["1","2","3"]

select to_json(array(array(1,2,3),array(4)))
> [[1,2,3],[4]]
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25195) Extending from_json function
[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589850#comment-16589850 ] Maxim Gekk commented on SPARK-25195: > 1. Does this patch also solve problem 2, as described above? No, it doesn't. > 2. Do you know when it will be released? It should be in the upcoming release 2.4. > Extending from_json function > > > Key: SPARK-25195 > URL: https://issues.apache.org/jira/browse/SPARK-25195 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > Dear Spark and PySpark maintainers, > I hope, that opening a JIRA issue is the correct way to request an > improvement. If it's not, please forgive me and kindly instruct me on how to > do it instead. > At our company, we are currently rewriting a lot of old MapReduce code with > SPARK, and the following use-case is quite frequent: Some string-valued > dataframe columns are JSON-arrays, and we want to parse them into array-typed > columns. > Problem number 1: The from_json function accepts as a schema only > StructType or ArrayType(StructType), but not an ArrayType of primitives. > Submitting the schema in a string form like > {noformat}{"containsNull":true,"elementType":"string","type":"array"}{noformat} > does not work either, the error message says, among other things, > {noformat}data type mismatch: Input schema array must be a struct or > an array of structs.{noformat} > Problem number 2: Sometimes, in our JSON arrays we have elements of > different types. For example, we might have some JSON array like > {noformat}["string_value", 0, true, null]{noformat} which is JSON-valid with > schema > {noformat}{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}{noformat} > (and, for instance the Python json.loads function has no problem parsing > this), but such a schema is not recognized, at all. 
The error message gets > quite unreadable after the words {noformat}ParseException: u'\nmismatched > input{noformat} > Here is some simple Python code to reproduce the problems (using pyspark > 2.3.1 and pandas 0.23.4): > {noformat} > import pandas as pd > from pyspark.sql import SparkSession > import pyspark.sql.functions as F > from pyspark.sql.types import StringType, ArrayType > spark = SparkSession.builder.appName('test').getOrCreate() > data = {'id' : [1,2,3], 'data' : ['["string1", true, null]', '["string2", > false, null]', '["string3", true, "another_string3"]']} > pdf = pd.DataFrame.from_dict(data) > df = spark.createDataFrame(pdf) > df.show() > df = df.withColumn("parsed_data", F.from_json(F.col('data'), > ArrayType(StringType( # Does not work, because not a struct of array > of structs > df = df.withColumn("parsed_data", F.from_json(F.col('data'), > > '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}')) > # Does not work at all > {noformat} > For now, we have to use a UDF function, which calls python's json.loads, > but this is, for obvious reasons, suboptimal. If you could extend the > functionality of the Spark from_json function in the next release, this would > be really helpful. Thank you in advance! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25195) Extending from_json function
[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590140#comment-16590140 ] Maxim Gekk commented on SPARK-25195: This is the ticket which combines both from_json/to_json: https://issues.apache.org/jira/browse/SPARK-24391 . It was closed with the PR [https://github.com/apache/spark/pull/21439]. It would be nice to have a separate ticket specifically for to_json. > Extending from_json function > > > Key: SPARK-25195 > URL: https://issues.apache.org/jira/browse/SPARK-25195 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > Dear Spark and PySpark maintainers, > I hope, that opening a JIRA issue is the correct way to request an > improvement. If it's not, please forgive me and kindly instruct me on how to > do it instead. > At our company, we are currently rewriting a lot of old MapReduce code with > SPARK, and the following use-case is quite frequent: Some string-valued > dataframe columns are JSON-arrays, and we want to parse them into array-typed > columns. > Problem number 1: The from_json function accepts as a schema only > StructType or ArrayType(StructType), but not an ArrayType of primitives. > Submitting the schema in a string form like > {noformat}{"containsNull":true,"elementType":"string","type":"array"}{noformat} > does not work either, the error message says, among other things, > {noformat}data type mismatch: Input schema array must be a struct or > an array of structs.{noformat} > Problem number 2: Sometimes, in our JSON arrays we have elements of > different types. 
For example, we might have some JSON array like > {noformat}["string_value", 0, true, null]{noformat} which is JSON-valid with > schema > {noformat}{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}{noformat} > (and, for instance the Python json.loads function has no problem parsing > this), but such a schema is not recognized, at all. The error message gets > quite unreadable after the words {noformat}ParseException: u'\nmismatched > input{noformat} > Here is some simple Python code to reproduce the problems (using pyspark > 2.3.1 and pandas 0.23.4): > {noformat} > import pandas as pd > from pyspark.sql import SparkSession > import pyspark.sql.functions as F > from pyspark.sql.types import StringType, ArrayType > spark = SparkSession.builder.appName('test').getOrCreate() > data = {'id' : [1,2,3], 'data' : ['["string1", true, null]', '["string2", > false, null]', '["string3", true, "another_string3"]']} > pdf = pd.DataFrame.from_dict(data) > df = spark.createDataFrame(pdf) > df.show() > df = df.withColumn("parsed_data", F.from_json(F.col('data'), > ArrayType(StringType( # Does not work, because not a struct of array > of structs > df = df.withColumn("parsed_data", F.from_json(F.col('data'), > > '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}')) > # Does not work at all > {noformat} > For now, we have to use a UDF function, which calls python's json.loads, > but this is, for obvious reasons, suboptimal. If you could extend the > functionality of the Spark from_json function in the next release, this would > be really helpful. Thank you in advance! > == > UPDATE: By the way, apparently the to_json function has the same problems: it > cannot convert an array-typed column to a JSON-string. It would be nice for > it to support arrays, as well. And, speaking of problem 2, an array column of > different types cannot be even created in the first place. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24854) Gather all options into AvroOptions
Maxim Gekk created SPARK-24854: -- Summary: Gather all options into AvroOptions Key: SPARK-24854 URL: https://issues.apache.org/jira/browse/SPARK-24854 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk We need to gather all Avro options into a class, as in other datasources (JSONOptions and CSVOptions). The map inside the class should be case-insensitive.
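A minimal sketch of the case-insensitive key lookup the proposed options class needs (the class name and API here are hypothetical stand-ins, not Spark's CaseInsensitiveMap):

```python
class CaseInsensitiveOptions:
    # Stores keys lower-cased so that lookups ignore case,
    # as required for the proposed AvroOptions class.
    def __init__(self, options):
        self._options = {k.lower(): v for k, v in options.items()}

    def get(self, key, default=None):
        return self._options.get(key.lower(), default)

opts = CaseInsensitiveOptions({"ignoreExtension": "true"})
same = opts.get("IGNOREEXTENSION")  # found despite different casing
```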
[jira] [Created] (SPARK-24836) New option - ignoreExtension
Maxim Gekk created SPARK-24836: -- Summary: New option - ignoreExtension Key: SPARK-24836 URL: https://issues.apache.org/jira/browse/SPARK-24836 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk We need to add a new option for the Avro datasource - *ignoreExtension*. It should control whether the .avro extension requirement is ignored. If it is set to *true* (the default), files both with and without the .avro extension are loaded. Example of usage:
{code:scala}
spark
  .read
  .option("ignoreExtension", false)
  .avro("path to avro files")
{code}
The option duplicates Hadoop's config avro.mapred.ignore.inputs.without.extension, which the Avro datasource currently takes into account and which can be set like:
{code:scala}
spark
  .sqlContext
  .sparkContext
  .hadoopConfiguration
  .set("avro.mapred.ignore.inputs.without.extension", "true")
{code}
The ignoreExtension option must override avro.mapred.ignore.inputs.without.extension.
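The described precedence between the option and the Hadoop config can be sketched as a small resolution function (the helper name is hypothetical; note the two flags have inverted senses, which is an assumption based on the wording of this ticket):

```python
def resolve_ignore_extension(options, hadoop_conf):
    # The per-read option wins when set. Inverted senses:
    # ignoreExtension=true means *load* files without the .avro extension,
    # while the Hadoop flag set to "true" means *skip* such files.
    if "ignoreExtension" in options:
        return options["ignoreExtension"].lower() == "true"
    if "avro.mapred.ignore.inputs.without.extension" in hadoop_conf:
        return hadoop_conf["avro.mapred.ignore.inputs.without.extension"].lower() != "true"
    return True  # ticket default: load files with and without .avro

# The option overrides the Hadoop config even when both are set:
loads_all = resolve_ignore_extension(
    {"ignoreExtension": "false"},
    {"avro.mapred.ignore.inputs.without.extension": "false"})
```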
[jira] [Commented] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547876#comment-16547876 ] Maxim Gekk commented on SPARK-24849: I am working on the ticket. > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new methods which should convert a value of StructType to a > schema in DDL format . It should be possible to use the former string in new > table creation by just copy-pasting of new method results. The existing > methods simpleString(), catalogString() and sql() put ':' between top level > field name and its type, and wrap by the *struct* word > {code} > ds.schema.catalogString > struct {code} > Output of new method should be > {code} > metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24849) Convert StructType to DDL string
Maxim Gekk created SPARK-24849: -- Summary: Convert StructType to DDL string Key: SPARK-24849 URL: https://issues.apache.org/jira/browse/SPARK-24849 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk We need to add new methods that convert a value of StructType to a schema in DDL format. It should be possible to use the resulting string in new table creation by simply copy-pasting the new method's output. The existing methods simpleString(), catalogString() and sql() put ':' between a top-level field name and its type, and wrap the result in the *struct* word {code} ds.schema.catalogString struct
[jira] [Created] (SPARK-24810) Fix paths to resource files in AvroSuite
Maxim Gekk created SPARK-24810: -- Summary: Fix paths to resource files in AvroSuite Key: SPARK-24810 URL: https://issues.apache.org/jira/browse/SPARK-24810 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Currently, paths to test files from the resource folder are relative in AvroSuite. This causes problems such as the inability to run the tests from an IDE. We need to wrap test files with:
{code:scala}
def testFile(fileName: String): String = {
  Thread.currentThread().getContextClassLoader.getResource(fileName).toString
}
{code}
[jira] [Updated] (SPARK-24810) Fix paths to resource files in AvroSuite
[ https://issues.apache.org/jira/browse/SPARK-24810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-24810: --- Attachment: Screen Shot 2018-07-15 at 15.28.13.png > Fix paths to resource files in AvroSuite > > > Key: SPARK-24810 > URL: https://issues.apache.org/jira/browse/SPARK-24810 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > Attachments: Screen Shot 2018-07-15 at 15.28.13.png > > > Currently paths to tests files from resource folder are relative in > AvroSuite. It causes problems like impossibility for running tests from IDE. > Need to wrap test files by: > {code:scala} > def testFile(fileName: String): String = { > > Thread.currentThread().getContextClassLoader.getResource(fileName).toString > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names
Maxim Gekk created SPARK-24911: -- Summary: SHOW CREATE TABLE drops escaping of nested column names Key: SPARK-24911 URL: https://issues.apache.org/jira/browse/SPARK-24911 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Create a table with a quoted nested column, *`b`*:
{code:sql}
create table `test` (`a` STRUCT<`b`:STRING>);
{code}
and show how the table was created:
{code:sql}
SHOW CREATE TABLE `test`
{code}
{code}
CREATE TABLE `test`(`a` struct<b:string>)
{code}
The column *b* has become unquoted.
[jira] [Created] (SPARK-24881) New options - compression and compressionLevel
Maxim Gekk created SPARK-24881: -- Summary: New options - compression and compressionLevel Key: SPARK-24881 URL: https://issues.apache.org/jira/browse/SPARK-24881 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Currently, the Avro datasource takes the compression codec name from a SQL config (the config key is hard-coded in AvroFileFormat): https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L121-L125 . The obvious con is that modifying the global config can impact multiple writes. The purpose of the ticket is to add a new Avro option - "compression" - the same as we already have for other datasources like JSON and CSV. If the new option is not set by a user, we take the setting from the SQL config spark.sql.avro.compression.codec. If that is not set either, the default compression codec will be snappy (the current behavior in the master branch). Besides the compression option, we need to add another option - compressionLevel - which should reflect another SQL config used by Avro: https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L122
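The proposed lookup order - per-write option first, then the session-wide SQL config, then the snappy default - can be sketched as a small resolution function (the helper name is hypothetical):

```python
def resolve_compression(options, sql_conf):
    # Per-write "compression" option first, then the session-wide
    # spark.sql.avro.compression.codec config, then the hard default
    # used in the master branch.
    if "compression" in options:
        return options["compression"]
    return sql_conf.get("spark.sql.avro.compression.codec", "snappy")

# The option shields this write from the global config:
codec = resolve_compression({"compression": "deflate"},
                            {"spark.sql.avro.compression.codec": "uncompressed"})
```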
[jira] [Commented] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548878#comment-16548878 ] Maxim Gekk commented on SPARK-24849: [~maropu] This is a part of my work on customer's issue. There are multiple folders of AVRO files with pretty wide and nested schemas. I need programmatically create tables on top of each folder. To do that I read a file in a folder via Scala API, take schema, convert it to DDL string (here I need the changes) and put the string to SQL CREATE TABLE. > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new methods which should convert a value of StructType to a > schema in DDL format . It should be possible to use the former string in new > table creation by just copy-pasting of new method results. The existing > methods simpleString(), catalogString() and sql() put ':' between top level > field name and its type, and wrap by the *struct* word > {code} > ds.schema.catalogString > struct {code} > Output of new method should be > {code} > metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
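The schema-to-DDL conversion being requested can be illustrated with a toy sketch (not Spark's implementation): top-level fields are rendered as `name TYPE` pairs that can be pasted into CREATE TABLE, instead of the `struct<name:type>` wrapping produced by catalogString():

```python
def to_ddl(fields):
    # fields: list of (name, type) pairs standing in for StructType fields.
    # Returns both the catalogString-like form and the proposed DDL form.
    catalog = "struct<" + ",".join(f"{n}:{t}" for n, t in fields) + ">"
    ddl = ", ".join(f"{n} {t.upper()}" for n, t in fields)
    return catalog, ddl

catalog, ddl = to_ddl([("item_id", "int"), ("country", "string")])
# catalog -> struct<item_id:int,country:string>
# ddl     -> item_id INT, country STRING
```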
[jira] [Created] (SPARK-24807) Adding files/jars twice: output a warning and add a note
Maxim Gekk created SPARK-24807: -- Summary: Adding files/jars twice: output a warning and add a note Key: SPARK-24807 URL: https://issues.apache.org/jira/browse/SPARK-24807 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.1 Reporter: Maxim Gekk In the current version of Spark (2.3.x), a file/jar can be added only once. Subsequent additions of the same path are silently ignored. This behavior is not properly documented: https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.SparkContext This confuses our users and support teams in our company. The ticket aims to output a warning which clearly states that a second addition of the same path is not currently supported.
[jira] [Created] (SPARK-24805) Don't ignore files without .avro extension by default
Maxim Gekk created SPARK-24805: -- Summary: Don't ignore files without .avro extension by default Key: SPARK-24805 URL: https://issues.apache.org/jira/browse/SPARK-24805 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Currently, to read files without the .avro extension, users have to set the flag *avro.mapred.ignore.inputs.without.extension* to *false* (by default it is *true*). The ticket aims to change the default value to *false*. The reasons are: - Other systems can create avro files without extensions. When users try to read such files, they silently get only partial results. This behaviour may confuse users. - The current behavior differs from other supported datasources such as CSV and JSON.
[jira] [Created] (SPARK-25286) Remove dangerous parmap
Maxim Gekk created SPARK-25286: -- Summary: Remove dangerous parmap Key: SPARK-25286 URL: https://issues.apache.org/jira/browse/SPARK-25286 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: Maxim Gekk One of the parmap methods accepts an execution context created outside of parmap. If the parmap method is called recursively on a thread pool limited in size, it can lead to deadlocks. See the JIRA tickets SPARK-25240 and SPARK-25283. To eliminate such problems in the future, we need to remove the parmap() overload with the signature:
{code:scala}
def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]]
    (in: Col[I])
    (f: I => O)
    (implicit
      cbf: CanBuildFrom[Col[I], Future[O], Col[Future[O]]], // For in.map
      cbf2: CanBuildFrom[Col[Future[O]], O, Col[O]], // for Future.sequence
      ec: ExecutionContext
    ): Col[O]
{code}
[jira] [Resolved] (SPARK-25283) A deadlock in UnionRDD
[ https://issues.apache.org/jira/browse/SPARK-25283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-25283. Resolution: Fixed Fix Version/s: 2.4.0 It is fixed by the PR: https://github.com/apache/spark/pull/22292 > A deadlock in UnionRDD > -- > > Key: SPARK-25283 > URL: https://issues.apache.org/jira/browse/SPARK-25283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > Fix For: 2.4.0 > > > The PR https://github.com/apache/spark/pull/21913 replaced Scala parallel > collections in UnionRDD by new parmap function. This changes cause a deadlock > in the partitions method. The code demonstrates the problem: > {code:scala} > val wide = 20 > def unionRDD(num: Int): UnionRDD[Int] = { > val rdds = (0 until num).map(_ => sc.parallelize(1 to 10, 1)) > new UnionRDD(sc, rdds) > } > val level0 = (0 until wide).map { _ => > val level1 = (0 until wide).map(_ => unionRDD(wide)) > new UnionRDD(sc, level1) > } > val rdd = new UnionRDD(sc, level0) > rdd.partitions.length > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
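The direction of the fix can be illustrated in plain Python: giving each recursive call its own short-lived pool, instead of sharing one bounded executor, means a nested call can never block waiting for workers that are all held by its own ancestors. This is only a sketch of the hazard-free pattern, not Spark's parmap:

```python
from concurrent.futures import ThreadPoolExecutor

def parmap(xs, f):
    # Each call owns its pool, so recursive calls cannot exhaust a
    # shared pool and deadlock waiting on tasks that never get scheduled.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(f, xs))

# Nested use terminates because the pools are independent:
nested = parmap(range(3), lambda i: parmap(range(3), lambda j: i * 3 + j))
```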
[jira] [Created] (SPARK-25384) Removing spark.sql.fromJsonForceNullableSchema
Maxim Gekk created SPARK-25384: -- Summary: Removing spark.sql.fromJsonForceNullableSchema Key: SPARK-25384 URL: https://issues.apache.org/jira/browse/SPARK-25384 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Disabling the spark.sql.fromJsonForceNullableSchema flag is error-prone. We should not allow users to do that, since it can lead to corrupted output. The flag should also be removed for simplicity.
[jira] [Created] (SPARK-25381) Stratified sampling by Column argument
Maxim Gekk created SPARK-25381: -- Summary: Stratified sampling by Column argument Key: SPARK-25381 URL: https://issues.apache.org/jira/browse/SPARK-25381 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Currently, the sampleBy method accepts a first argument of string type only. We need to provide an overloaded method that accepts the Column type too. This will allow sampling by multiple columns, for example:
{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.struct

val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10))).toDF("name", "age")
val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()

+-----+---+
| name|age|
+-----+---+
| Nico|  8|
|Alice| 10|
+-----+---+
{code}
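The semantics of sampling keyed on multiple columns can be sketched in plain Python (a toy stand-in for sampleBy, reusing the data from the example above; the key function plays the role of struct($"name", $"age")):

```python
import random

def sample_by(rows, key, fractions, seed):
    # Rows whose key is absent from `fractions` get fraction 0.0
    # and are always dropped; a fraction of 1.0 always keeps the row.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(key(r), 0.0)]

rows = [("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10)]
fractions = {("Alice", 10): 0.3, ("Nico", 8): 1.0}
sampled = sample_by(rows, lambda r: (r[0], r[1]), fractions, seed=36)
```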
[jira] [Created] (SPARK-25387) Malformed CSV causes NPE
Maxim Gekk created SPARK-25387: -- Summary: Malformed CSV causes NPE Key: SPARK-25387 URL: https://issues.apache.org/jira/browse/SPARK-25387 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Loading a malformed CSV files or a dataset can cause NullPointerException, for example the code: {code:scala} val schema = StructType(StructField("a", IntegerType) :: Nil) val input = spark.createDataset(Seq("\u\u\u0001234")) spark.read.schema(schema).csv(input).collect() {code} crashes with the exception: {code:java} Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210) at org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523) at org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523) at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68) {code} If schema is not specified, the following exception is thrown: {code:java} java.lang.NullPointerException at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192) at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192) at scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99) at scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186) at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109) at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25393) Parsing CSV strings in a column
Maxim Gekk created SPARK-25393: -- Summary: Parsing CSV strings in a column Key: SPARK-25393 URL: https://issues.apache.org/jira/browse/SPARK-25393 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk There are use cases when content in CSV format is dumped into an external storage as one of columns. For example, CSV records are stored together with other meta-info to Kafka. Current Spark API doesn't allow to parse such columns directly. The existing method [csv()|https://github.com/apache/spark/blob/e754887182304ad0d622754e33192ebcdd515965/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L487] requires a dataset with one string column. The API is inconvenient in parsing CSV column in dataset with many columns. The ticket aims to add new function similar to [from_json()|https://github.com/apache/spark/blob/d749d034a80f528932f613ac97f13cfb99acd207/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3456] with the following signatures in Scala: {code:scala} def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column {code} and for using from Python, R and Java: {code:scala} def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
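The proposed behavior can be sketched outside Spark with Python's csv module (the helper and its type-converter schema are hypothetical illustrations of the signatures above, not the eventual implementation): one CSV string is parsed against a schema into a structured record.

```python
import csv
import io

def from_csv(line, schema, options=None):
    # schema: list of (column name, converter) pairs standing in for a
    # StructType; options mirrors the Map[String, String] argument.
    opts = options or {}
    reader = csv.reader(io.StringIO(line), delimiter=opts.get("sep", ","))
    values = next(reader)
    return {name: conv(v) for (name, conv), v in zip(schema, values)}

schema = [("item_id", int), ("country", str), ("state", str)]
row = from_csv("1,US,CA", schema)
# row -> {'item_id': 1, 'country': 'US', 'state': 'CA'}
```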
[jira] [Updated] (SPARK-25396) Read array of JSON objects via an Iterator
[ https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-25396: --- Description: If a JSON file has a structure like below: {code} [ { "time":"2018-08-13T18:00:44.086Z", "resourceId":"some-text", "category":"A", "level":2, "operationName":"Error", "properties":{...} }, { "time":"2018-08-14T18:00:44.086Z", "resourceId":"some-text2", "category":"B", "level":3, "properties":{...} }, ... ] {code} it should be read in the `multiLine` mode. In this mode, Spark read whole array into memory in both cases when schema is `ArrayType` and `StructType`. It can lead to unnecessary memory consumption and even to OOM for big JSON files. In general, there is no need to materialize all parsed JSON record in memory there: https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95 . So, JSON objects of an array can be read via an Iterator. was: If a JSON file has a structure like below: {code} [ { "time":"2018-08-13T18:00:44.086Z", "resourceId":"some-text", "category":"A", "level":2, "operationName":"Error", "properties":{...} }, { "time":"2018-08-14T18:00:44.086Z", "resourceId":"some-text2", "category":"B", "level":3, "properties":{...} }, ] {code} it should be read in the `multiLine` mode. In this mode, Spark read whole array into memory in both cases when schema is `ArrayType` and `StructType`. It can lead to unnecessary memory consumption and even to OOM for big JSON files. In general, there is no need to materialize all parsed JSON record in memory there: https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95 . So, JSON objects of an array can be read via an Iterator. 
> Read array of JSON objects via an Iterator > -- > > Key: SPARK-25396 > URL: https://issues.apache.org/jira/browse/SPARK-25396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > If a JSON file has a structure like below: > {code} > [ > { > "time":"2018-08-13T18:00:44.086Z", > "resourceId":"some-text", > "category":"A", > "level":2, > "operationName":"Error", > "properties":{...} > }, > { > "time":"2018-08-14T18:00:44.086Z", > "resourceId":"some-text2", > "category":"B", > "level":3, > "properties":{...} > }, > ... > ] > {code} > it should be read in the `multiLine` mode. In this mode, Spark read whole > array into memory in both cases when schema is `ArrayType` and `StructType`. > It can lead to unnecessary memory consumption and even to OOM for big JSON > files. > In general, there is no need to materialize all parsed JSON record in memory > there: > https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95 > . So, JSON objects of an array can be read via an Iterator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25396) Read array of JSON objects via an Iterator
Maxim Gekk created SPARK-25396: -- Summary: Read array of JSON objects via an Iterator Key: SPARK-25396 URL: https://issues.apache.org/jira/browse/SPARK-25396 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk If a JSON file has a structure like below:
{code}
[
  {
    "time":"2018-08-13T18:00:44.086Z",
    "resourceId":"some-text",
    "category":"A",
    "level":2,
    "operationName":"Error",
    "properties":{...}
  },
  {
    "time":"2018-08-14T18:00:44.086Z",
    "resourceId":"some-text2",
    "category":"B",
    "level":3,
    "properties":{...}
  },
]
{code}
it should be read in the `multiLine` mode. In this mode, Spark reads the whole array into memory, both when the schema is `ArrayType` and when it is `StructType`. This can lead to unnecessary memory consumption and even to OOM for big JSON files. In general, there is no need to materialize all parsed JSON records in memory there: https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95 . So, the JSON objects of an array can be read via an Iterator.
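The iterator-based reading proposed here can be sketched with Python's json.JSONDecoder.raw_decode, which parses one value at a time from a given offset instead of materializing the whole array (a stand-in for the Jackson-based parser; a generator plays the role of the Iterator):

```python
import json

def iter_json_array(text):
    # Yield top-level array elements one at a time; only the element
    # currently being decoded is materialized as a Python object.
    decoder = json.JSONDecoder()
    i = text.index("[") + 1
    while True:
        # Skip whitespace and separators up to the next element or the end.
        while i < len(text) and text[i] in " \t\r\n,":
            i += 1
        if i >= len(text) or text[i] == "]":
            return
        obj, i = decoder.raw_decode(text, i)
        yield obj

records = list(iter_json_array('[{"level":2},{"level":3}]'))
```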
[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator
[ https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609469#comment-16609469 ] Maxim Gekk commented on SPARK-25396: [~hyukjin.kwon] WDYT > Read array of JSON objects via an Iterator > -- > > Key: SPARK-25396 > URL: https://issues.apache.org/jira/browse/SPARK-25396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > If a JSON file has a structure like below: > {code} > [ > { > "time":"2018-08-13T18:00:44.086Z", > "resourceId":"some-text", > "category":"A", > "level":2, > "operationName":"Error", > "properties":{...} > }, > { > "time":"2018-08-14T18:00:44.086Z", > "resourceId":"some-text2", > "category":"B", > "level":3, > "properties":{...} > }, > ] > {code} > it should be read in the `multiLine` mode. In this mode, Spark read whole > array into memory in both cases when schema is `ArrayType` and `StructType`. > It can lead to unnecessary memory consumption and even to OOM for big JSON > files. > In general, there is no need to materialize all parsed JSON record in memory > there: > https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95 > . So, JSON objects of an array can be read via an Iterator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator
[ https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609605#comment-16609605 ] Maxim Gekk commented on SPARK-25396:

I have a concern regarding when I should close the Jackson parser. For now it is closed before returning the result from the parse method here: [https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L394-L404] . If I return an *Iterator[InternalRow]* instead of *Seq[InternalRow]*, I have to postpone closing the Jackson parser at least until the end of the current task, right? ... but that is bad for per-line mode because it could produce a lot of open JSON parsers. It seems the implementations for multiLine mode and for per-line mode should be different.
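The lifecycle concern above has a common shape. A small Python sketch (illustrative, not Spark's code): wrap the resource in a generator whose `finally` block runs when the iterator is exhausted or closed, so the resource stays open exactly as long as the iterator lives, which is also why many such iterators in flight mean many open parsers.

```python
import io
from typing import IO, Iterator

def parse_records(source: IO[str]) -> Iterator[str]:
    """Yield records lazily; close the source only when iteration ends.

    Illustrative sketch: 'source' stands in for an open Jackson parser.
    The finally block runs when the generator is exhausted, explicitly
    closed, or garbage-collected, so the resource does not leak -- but
    it also stays open for the whole lifetime of the iterator, which is
    the concern for per-line mode with many parsers in flight.
    """
    try:
        for line in source:
            yield line.strip()
    finally:
        source.close()

src = io.StringIO('{"a":1}\n{"b":2}\n')
records = parse_records(src)
first = next(records)   # the source is still open at this point
rest = list(records)    # exhausting the iterator closes the source
```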
[jira] [Updated] (SPARK-25273) How to install testthat v1.0.2
[ https://issues.apache.org/jira/browse/SPARK-25273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-25273: --- Summary: How to install testthat v1.0.2 (was: How to install testthat = 1.0.2)
[jira] [Created] (SPARK-25273) How to install testthat = 1.0.2
Maxim Gekk created SPARK-25273:
--
Summary: How to install testthat = 1.0.2
Key: SPARK-25273
URL: https://issues.apache.org/jira/browse/SPARK-25273
Project: Spark
Issue Type: Documentation
Components: Documentation
Affects Versions: 2.3.1
Reporter: Maxim Gekk

The command below installs testthat v2.0.x:
{code:R}
R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
{code}
which prevents running the R tests. We need to update the section http://spark.apache.org/docs/latest/building-spark.html#running-r-tests according to https://github.com/apache/spark/pull/20003
[jira] [Created] (SPARK-24757) Improve error message for broadcast timeouts
Maxim Gekk created SPARK-24757:
--
Summary: Improve error message for broadcast timeouts
Key: SPARK-24757
URL: https://issues.apache.org/jira/browse/SPARK-24757
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk

Currently, the TimeoutException that is thrown on broadcast joins doesn't give the user any clues about how to resolve the issue. We should provide such help by pointing out two config parameters: *spark.sql.broadcastTimeout* and *spark.sql.autoBroadcastJoinThreshold*. The ticket aims to handle the TimeoutException here: https://github.com/apache/spark/blob/b7a036b75b8a1d287ac014b85e90d555753064c9/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L143
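The shape of the proposed improvement can be sketched generically (Python here, not Spark's Scala code; the wrapper function is hypothetical, only the two config names come from the ticket): catch the bare timeout and rethrow it with a message that names the knobs the user can turn.

```python
import concurrent.futures
import threading

# The two real Spark config parameters the ticket wants mentioned.
BROADCAST_TIMEOUT_HINT = (
    "Could not execute broadcast in {timeout} secs. You can increase the "
    "timeout for broadcasts via spark.sql.broadcastTimeout or disable "
    "broadcast joins by setting spark.sql.autoBroadcastJoinThreshold to -1"
)

def await_broadcast(future: concurrent.futures.Future, timeout: float):
    """Wait for a 'broadcast' future, rethrowing timeouts with a hint.

    Hypothetical sketch of the improvement's shape: the bare
    TimeoutError is wrapped in an error that tells the user which
    configuration parameters to adjust.
    """
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError as e:
        raise RuntimeError(BROADCAST_TIMEOUT_HINT.format(timeout=timeout)) from e

# Demo: a task that blocks until released triggers the improved message.
release = threading.Event()
pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
stuck = pool.submit(release.wait)
try:
    await_broadcast(stuck, timeout=0.05)
    message = None
except RuntimeError as err:
    message = str(err)
release.set()        # unblock the worker so the pool can shut down
pool.shutdown()
```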
[jira] [Created] (SPARK-24761) Check modifiability of config parameters
Maxim Gekk created SPARK-24761:
--
Summary: Check modifiability of config parameters
Key: SPARK-24761
URL: https://issues.apache.org/jira/browse/SPARK-24761
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk

Our customers and support team continuously face the situation where setting a config parameter via *spark.conf.set()* does not have any effect. It is not clear from a parameter's name whether it is a static parameter or one that can be set at runtime for the current session state. It would be useful to have a method of *RuntimeConfig* which could tell a user whether changing the given parameter in the spark-shell or a running notebook may affect the current behavior. The method can have the following signature:
{code:scala}
def isModifiable(key: String): Boolean
{code}
Any config parameter can be checked by using syntax like this:
{code:scala}
scala> spark.conf.isModifiable("spark.sql.sources.schemaStringLengthThreshold")
res0: Boolean = false
{code}
or for a Spark Core parameter:
{code:scala}
scala> spark.conf.isModifiable("spark.task.cpus")
res1: Boolean = false
{code}
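A minimal sketch of how such a check can work (Python, not Spark's implementation): runtime-settable entries live in a registry, and `isModifiable` is just membership in it. The two non-modifiable keys are the ones from the ticket; the runtime keys listed are real Spark SQL confs used here as examples.

```python
# Runtime (session-level) SQL confs: changing these takes effect.
RUNTIME_SQL_CONFS = {
    "spark.sql.shuffle.partitions",
    "spark.sql.autoBroadcastJoinThreshold",
}

# Static SQL confs and Spark Core parameters: fixed once the
# session/application starts, so setting them at runtime is a no-op.
STATIC_OR_CORE_CONFS = {
    "spark.sql.sources.schemaStringLengthThreshold",  # static SQL conf
    "spark.task.cpus",                                # Spark Core conf
}

def is_modifiable(key: str) -> bool:
    """Return True only for parameters that take effect when set on a
    running session -- mirroring the proposed RuntimeConfig.isModifiable."""
    return key in RUNTIME_SQL_CONFS
```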
[jira] [Created] (SPARK-23620) Split thread dump lines by using the br tag
Maxim Gekk created SPARK-23620:
--
Summary: Split thread dump lines by using the br tag
Key: SPARK-23620
URL: https://issues.apache.org/jira/browse/SPARK-23620
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 2.3.0
Reporter: Maxim Gekk

The '\n' line separator should be replaced by the <br/> tag in the generated HTML of the thread dump UI to guarantee that each class name is on a separate line. There are some cases when the HTML is proxied and the '\n' could be replaced by other whitespace (see the screenshot - https://drive.google.com/file/d/18t6yf-jnr072b-hPq4LhMZeRrw1PjpT5/view?usp=sharing).
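The fix's idea fits in a few lines. A sketch in Python (not Spark's actual UI code; the function name is hypothetical): escape the stack trace first, then replace each '\n' with an explicit `<br/>` tag so line breaks survive even if a proxy rewrites whitespace.

```python
import html

def stack_trace_to_html(stack_trace: str) -> str:
    """Render a thread-dump stack trace for HTML output.

    Escaping happens before the <br/> substitution so the inserted
    tags themselves are not escaped away.
    """
    return html.escape(stack_trace).replace("\n", "<br/>")

rendered = stack_trace_to_html(
    "java.lang.Object.wait(Native Method)\n"
    "java.lang.Object.wait(Object.java:502)"
)
```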
[jira] [Updated] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars
[ https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-23649: --- Attachment: utf8xFF.csv
[jira] [Created] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars
Maxim Gekk created SPARK-23649:
--
Summary: CSV schema inferring fails on some UTF-8 chars
Key: SPARK-23649
URL: https://issues.apache.org/jira/browse/SPARK-23649
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk

Schema inferring of CSV files fails if the file contains a character whose first byte is *0xFF*:
{code:java}
spark.read.option("header", "true").csv("utf8xFF.csv")
{code}
{code:java}
java.lang.ArrayIndexOutOfBoundsException: 63
at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
{code}
Here is the content of the file:
{code:java}
hexdump -C ~/tmp/utf8xFF.csv
0000  63 68 61 6e 6e 65 6c 2c 63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
0010  69 74 65 64 2c 31 32 33 0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
0020  2c 34 35 36 0d                                   |,456.|
0025
{code}
Schema inferring doesn't fail in multiline mode:
{code}
spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")
{code}
{code:java}
+-------+----+
|channel|code|
+-------+----+
| United| 123|
| ABGUN�| 456|
+-------+----+
{code}
and Spark is able to read the csv file if the schema is specified:
{code}
import org.apache.spark.sql.types._
val schema = new StructType().add("channel", StringType).add("code", StringType)
spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
{code}
{code:java}
+-------+----+
|channel|code|
+-------+----+
| United| 123|
| ABGUN�| 456|
+-------+----+
{code}
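The index 63 in the exception is suggestive: 0xFF - 0xC0 = 63. A Python sketch of the failure mode (mirroring the shape of a first-byte lookup table like the one in UTF8String.numBytesForFirstByte, not the exact Java source): the codepoint-length table covers only first bytes 0xC0..0xFD, so a 0xFF first byte indexes past its end.

```python
# Length of a UTF-8 codepoint, indexed by (first_byte - 0xC0).
BYTES_OF_CODE_POINT_IN_UTF8 = [
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  # 0xC0..0xDF
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,  # 0xE0..0xEF
    4, 4, 4, 4, 4, 4, 4, 4,                           # 0xF0..0xF7
    5, 5, 5, 5,                                       # 0xF8..0xFB
    6, 6,                                             # 0xFC..0xFD
]  # 62 entries: index 63 (first byte 0xFF) falls out of bounds

def num_bytes_for_first_byte(b: int) -> int:
    """Sketch of the lookup with a bounds check added.

    Without the `0 <= offset < len(...)` guard, b = 0xFF would compute
    offset 63 and raise an IndexError, like the AIOOBE in the ticket.
    Treating an invalid first byte as a 1-byte character is one
    possible defensive fix.
    """
    if b < 0x80:
        return 1                      # plain ASCII
    offset = b - 0xC0
    if 0 <= offset < len(BYTES_OF_CODE_POINT_IN_UTF8):
        return BYTES_OF_CODE_POINT_IN_UTF8[offset]
    return 1                          # invalid first byte: don't crash
```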
[jira] [Updated] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars
[ https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-23649: --- Shepherd: Herman van Hovell
[jira] [Updated] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory
[ https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-23643: --- Summary: XORShiftRandom.hashSeed allocates unnecessary memory (was: XORShiftRandom.setSeed allocates unnecessary memory)
[jira] [Created] (SPARK-23643) XORShiftRandom.setSeed allocates unnecessary memory
Maxim Gekk created SPARK-23643:
--
Summary: XORShiftRandom.setSeed allocates unnecessary memory
Key: SPARK-23643
URL: https://issues.apache.org/jira/browse/SPARK-23643
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.3.0
Reporter: Maxim Gekk

The setSeed method allocates a 64-byte buffer and puts only 8 bytes of the seed parameter into it. The other bytes are always zero and could easily be excluded from the hash calculation.
[jira] [Updated] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory
[ https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-23643: --- Description: The hashSeed method allocates a 64-byte buffer and puts only 8 bytes of the seed parameter into it. The other bytes are always zero and could easily be excluded from the hash calculation. (was: The setSeed method allocates 64 bytes buffer and puts only 8 bytes of the seed parameter into it. Other bytes are always zero and could be easily excluded from hash calculation.)
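To see why the extra bytes are dead weight, here is a small Python sketch (illustrative, not Spark's MurmurHash3-based code; the two helper names are hypothetical): packing a long seed yields exactly 8 meaningful bytes, and padding them out to 64 bytes only appends zeros for the hash to churn through.

```python
import struct

def seed_bytes_padded(seed: int) -> bytes:
    """What the old hashSeed effectively hashed: 8 seed bytes + 56 zeros."""
    return struct.pack(">q", seed) + b"\x00" * 56

def seed_bytes_compact(seed: int) -> bytes:
    """The 8 bytes that actually carry information about the seed."""
    return struct.pack(">q", seed)

padded = seed_bytes_padded(0x0123456789ABCDEF)
compact = seed_bytes_compact(0x0123456789ABCDEF)
```

Hashing only the compact representation does 1/8 of the byte-mixing work while losing no information.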
[jira] [Created] (SPARK-24068) CSV schema inferring doesn't work for compressed files
Maxim Gekk created SPARK-24068:
--
Summary: CSV schema inferring doesn't work for compressed files
Key: SPARK-24068
URL: https://issues.apache.org/jira/browse/SPARK-24068
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk

Here is a simple csv file compressed with lzo:
{code}
$ cat ./test.csv
col1,col2
a,1
$ lzop ./test.csv
$ ls
test.csv  test.csv.lzo
{code}
Reading test.csv.lzo with the LZO codec (see https://github.com/twitter/hadoop-lzo, for example):
{code:scala}
scala> val ds = spark.read.option("header", true).option("inferSchema", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
ds: org.apache.spark.sql.DataFrame = [�LZO?: string]
scala> ds.printSchema
root
 |-- �LZO: string (nullable = true)
scala> ds.show
+----+
|�LZO|
+----+
|   a|
+----+
{code}
but the file can be read if the schema is specified:
{code}
scala> import org.apache.spark.sql.types._
scala> val schema = new StructType().add("col1", StringType).add("col2", IntegerType)
scala> val ds = spark.read.schema(schema).option("header", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")
scala> ds.show
+----+----+
|col1|col2|
+----+----+
|   a|   1|
+----+----+
{code}
Just in case, schema inferring works for the original uncompressed file:
{code:scala}
scala> spark.read.option("header", true).option("inferSchema", true).csv("test.csv").printSchema
root
 |-- col1: string (nullable = true)
 |-- col2: integer (nullable = true)
{code}
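The mangled `�LZO` column name hints at the failure mode: the inference path reads raw bytes while the main read path decompresses, so inference sees the codec's magic header instead of CSV. A self-contained Python sketch of that mismatch, using gzip as a stand-in for LZO (which needs an external library); both helper functions are hypothetical analogues, not Spark code:

```python
import csv
import gzip
import os
import tempfile

# Write a small CSV and a gzip-compressed copy of it.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "test.csv.gz")
with gzip.open(path, "wt", newline="") as f:
    f.write("col1,col2\na,1\n")

def infer_header_raw(p: str) -> str:
    """Buggy analogue: read the file without the codec.

    The 'header' comes out as the compression format's magic bytes,
    just like the garbage '�LZO' column name in the ticket."""
    with open(p, "rb") as f:
        return f.readline()[:4].decode("latin-1")

def infer_header_with_codec(p: str) -> list:
    """Correct analogue: decompress first, then look at the header line."""
    with gzip.open(p, "rt", newline="") as f:
        return next(csv.reader(f))
```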
[jira] [Created] (SPARK-24004) Tests of from_json for MapType
Maxim Gekk created SPARK-24004:
--
Summary: Tests of from_json for MapType
Key: SPARK-24004
URL: https://issues.apache.org/jira/browse/SPARK-24004
Project: Spark
Issue Type: Test
Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk

There are no tests for *from_json* that check *MapType* as a value type of struct fields. MapType should be supported as a non-root type according to the current implementation of JacksonParser, but the functionality is not checked.
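The shape of the missing test can be sketched with the Python json stdlib rather than Spark itself (the input and the check are hypothetical examples): parse a struct whose field *a* has a map type, roughly `from_json(col, "struct<a: map<string, int>>")`, and verify the non-root map survives with the expected key and value types.

```python
import json

# A struct with a map-typed field "a" (string -> int values).
raw = '{"a": {"b": 1, "c": 2}}'
parsed = json.loads(raw)

def check_map_as_struct_field(row: dict) -> bool:
    """Return True if the non-root map parsed with the expected types."""
    field = row["a"]
    return (isinstance(field, dict)
            and all(isinstance(k, str) and isinstance(v, int)
                    for k, v in field.items()))
```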