[jira] [Updated] (SPARK-24045) Create base class for file data source v2

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-24045: --- Issue Type: Task (was: Sub-task) Parent: (was: SPARK-23817) > Create base

[jira] [Updated] (SPARK-26673) File source V2 write: create framework and migrate ORC to it

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-26673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-26673: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-27589 > File source V2

[jira] [Updated] (SPARK-26871) File Source V2: avoid creating unnecessary FileIndex in the write path

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-26871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-26871: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > File Source V2: avoid

[jira] [Updated] (SPARK-26744) Support schema validation in File Source V2

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-26744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-26744: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-27589 > Support schema

[jira] [Updated] (SPARK-27049) Support handling partition values in the abstraction of file source V2

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27049: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > Support handling partition

[jira] [Updated] (SPARK-23817) Create file source V2 framework and migrate ORC read path

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-23817: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-27589 > Create file source

[jira] [Updated] (SPARK-23817) Create file source V2 framework and migrate ORC read path

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-23817: --- Summary: Create file source V2 framework and migrate ORC read path (was: Migrate ORC file

[jira] [Resolved] (SPARK-27113) remove CHECK_FILES_EXIST_KEY option in file source

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-27113. Resolution: Duplicate > remove CHECK_FILES_EXIST_KEY option in file source >

[jira] [Updated] (SPARK-27113) remove CHECK_FILES_EXIST_KEY option in file source

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27113: --- Issue Type: Task (was: Sub-task) Parent: (was: SPARK-23817) > remove

[jira] [Updated] (SPARK-23817) Create file source V2 framework and migrate ORC read path

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-23817: --- Affects Version/s: (was: 2.3.1) 3.0.0 > Create file source V2

[jira] [Updated] (SPARK-27085) Migrate CSV to File Data Source V2

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27085: --- Issue Type: Task (was: Sub-task) Parent: (was: SPARK-23507) > Migrate CSV to

[jira] [Updated] (SPARK-27136) Remove data source option check_files_exist

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27136: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > Remove data source option

[jira] [Updated] (SPARK-27589) Spark file source V2

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27589: --- Issue Type: Umbrella (was: New Feature) > Spark file source V2 > > >

[jira] [Updated] (SPARK-27085) Migrate CSV to File Data Source V2

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27085: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > Migrate CSV to File Data

[jira] [Updated] (SPARK-26447) Allow OrcColumnarBatchReader to return less partition columns

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-26447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-26447: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > Allow

[jira] [Updated] (SPARK-26447) Allow OrcColumnarBatchReader to return less partition columns

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-26447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-26447: --- Issue Type: Task (was: Sub-task) Parent: (was: SPARK-23817) > Allow

[jira] [Updated] (SPARK-27128) Migrate JSON to File Data Source V2

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27128: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > Migrate JSON to File Data

[jira] [Resolved] (SPARK-24045) Create base class for file data source v2

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-24045. Resolution: Fixed It is fixed in https://github.com/apache/spark/pull/23383 > Create

[jira] [Updated] (SPARK-27286) Handles exceptions on proceeding to next record in FilePartitionReader

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27286: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-27589 > Handles exceptions on

[jira] [Updated] (SPARK-27269) File source v2 should validate data schema only

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27269: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-27589 > File source v2 should

[jira] [Updated] (SPARK-27271) Migrate Text to File Data Source V2

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27271: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > Migrate Text to File Data

[jira] [Updated] (SPARK-27132) Improve file source V2 framework

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27132: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > Improve file source V2

[jira] [Updated] (SPARK-27291) File source V2: Ignore empty files in load

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27291: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-27589 > File source V2: Ignore empty

[jira] [Updated] (SPARK-27448) File source V2 table provider should be compatible with V1 provider

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27448: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > File source V2 table

[jira] [Updated] (SPARK-27326) Fall back all v2 file sources in `InsertIntoTable` to V1 FileFormat

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27326: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-27589 > Fall back all v2

[jira] [Updated] (SPARK-27384) File source V2: Prune unnecessary partition columns

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27384: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-27589 > File source V2:

[jira] [Updated] (SPARK-27356) File source V2: return actual schema in method `FileScan.readSchema`

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27356: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-27589 > File source V2:

[jira] [Updated] (SPARK-27435) Support schema pruning in Orc V2

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27435: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > Support schema pruning in

[jira] [Updated] (SPARK-27407) File source V2: Invalidate cache data on overwrite/append

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27407: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-27589 > File source V2:

[jira] [Updated] (SPARK-27443) Support UDF input_file_name in file source V2

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27443: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > Support UDF input_file_name

[jira] [Updated] (SPARK-27418) Migrate Parquet to File Data Source V2

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27418: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > Migrate Parquet to File

[jira] [Updated] (SPARK-27504) File source V2: support refreshing metadata cache

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27504: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > File source V2: support

[jira] [Updated] (SPARK-27459) Revise the exception message of schema inference failure in file source V2

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27459: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > Revise the exception

[jira] [Updated] (SPARK-27490) File source V2: return correct result for Dataset.inputFiles()

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27490: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > File source V2: return

[jira] [Updated] (SPARK-27580) Implement `doCanonicalize` in BatchScanExec for comparing query plan results

2019-04-28 Thread Gengliang Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27580: --- Issue Type: Sub-task (was: Task) Parent: SPARK-27589 > Implement `doCanonicalize`

[jira] [Created] (SPARK-27589) Spark file source V2

2019-04-28 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-27589: -- Summary: Spark file source V2 Key: SPARK-27589 URL: https://issues.apache.org/jira/browse/SPARK-27589 Project: Spark Issue Type: New Feature

[jira] [Assigned] (SPARK-27472) Document binary file data source in Spark user guide

2019-04-28 Thread Xiangrui Meng (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27472: - Assignee: Xiangrui Meng > Document binary file data source in Spark user guide >

[jira] [Updated] (SPARK-27227) Spark Runtime Filter

2019-04-28 Thread Song Jun (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-27227: - Summary: Spark Runtime Filter (was: Dynamic Partition Prune in Spark) > Spark Runtime Filter >

[jira] [Commented] (SPARK-27587) No such method error (sun.nio.ch.DirectBuffer.cleaner()) when reading big table from JDBC (with one slow query)

2019-04-28 Thread Yuming Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16828875#comment-16828875 ] Yuming Wang commented on SPARK-27587: - Sorry [~MohsenTaheri], it may have been fixed by SPARK-24421.

[jira] [Updated] (SPARK-27519) Pandas udf corrupting data

2019-04-28 Thread Hyukjin Kwon (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27519: - Affects Version/s: 3.0.0 > Pandas udf corrupting data > -- > >

[jira] [Assigned] (SPARK-27588) Fail fast if binary file data source will load a file that is bigger than 2GB

2019-04-28 Thread Xiangrui Meng (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27588: - Assignee: Xiangrui Meng > Fail fast if binary file data source will load a file that

[jira] [Created] (SPARK-27588) Fail fast if binary file data source will load a file that is bigger than 2GB

2019-04-28 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-27588: - Summary: Fail fast if binary file data source will load a file that is bigger than 2GB Key: SPARK-27588 URL: https://issues.apache.org/jira/browse/SPARK-27588

[jira] [Commented] (SPARK-27519) Pandas udf corrupting data

2019-04-28 Thread Jeff gold (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16828851#comment-16828851 ] Jeff gold commented on SPARK-27519: --- [^Pandas UDF Bug.py] Hello! [~hyukjin.kwon] here is my

[jira] [Updated] (SPARK-27519) Pandas udf corrupting data

2019-04-28 Thread Jeff gold (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff gold updated SPARK-27519: -- Attachment: Pandas UDF Bug.py > Pandas udf corrupting data > -- > >

[jira] [Updated] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2019-04-28 Thread Neil Alexander McQuarrie (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Alexander McQuarrie updated SPARK-21727: - Description: Previously

[jira] [Commented] (SPARK-27530) FetchFailedException: Received a zero-size buffer for block shuffle

2019-04-28 Thread Adrian Muraru (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16828783#comment-16828783 ] Adrian Muraru commented on SPARK-27530: --- I can confirm SPARK-27216 fixes this issue. >

[jira] [Assigned] (SPARK-27581) DataFrame countDistinct("*") fails with AnalysisException: "Invalid usage of '*' in expression 'count'"

2019-04-28 Thread Apache Spark (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27581: Assignee: Apache Spark > DataFrame countDistinct("*") fails with AnalysisException:

[jira] [Assigned] (SPARK-27581) DataFrame countDistinct("*") fails with AnalysisException: "Invalid usage of '*' in expression 'count'"

2019-04-28 Thread Apache Spark (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27581: Assignee: (was: Apache Spark) > DataFrame countDistinct("*") fails with

[jira] [Resolved] (SPARK-27534) Do not load `content` column in binary data source if it is not selected

2019-04-28 Thread Xiangrui Meng (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-27534. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24473

[jira] [Created] (SPARK-27587) No such method error (sun.nio.ch.DirectBuffer.cleaner()) when reading big table from JDBC (with one slow query)

2019-04-28 Thread Mohsen Taheri (JIRA)
Mohsen Taheri created SPARK-27587: - Summary: No such method error (sun.nio.ch.DirectBuffer.cleaner()) when reading big table from JDBC (with one slow query) Key: SPARK-27587 URL:

[jira] [Commented] (SPARK-27585) No such method error (sun.nio.ch.DirectBuffer.cleaner())

2019-04-28 Thread Mohsen Taheri (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16828025#comment-16828025 ] Mohsen Taheri commented on SPARK-27585: --- Thanks. It was added to the sub-tasks of the related

[jira] [Commented] (SPARK-27587) No such method error (sun.nio.ch.DirectBuffer.cleaner()) when reading big table from JDBC (with one slow query)

2019-04-28 Thread Mohsen Taheri (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16828024#comment-16828024 ] Mohsen Taheri commented on SPARK-27587: --- On OpenJDK 11.0.2 > No such method error

[jira] [Updated] (SPARK-27586) Improve binary comparison: replace Scala's for-comprehension if statements with while loop

2019-04-28 Thread WoudyGao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] WoudyGao updated SPARK-27586: - Description: I found the CPU cost of TypeUtils.compareBinary is noticeable when handling some big

[jira] [Created] (SPARK-27586) Improve binary comparison: replace Scala's for-comprehension if statements with while loop

2019-04-28 Thread WoudyGao (JIRA)
WoudyGao created SPARK-27586: Summary: Improve binary comparison: replace Scala's for-comprehension if statements with while loop Key: SPARK-27586 URL: https://issues.apache.org/jira/browse/SPARK-27586