[jira] [Commented] (SPARK-26225) Scan: track decoding time for row-based data sources
[ https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16741315#comment-16741315 ]

Yuanjian Li commented on SPARK-26225:
-

Thanks for your reply, Wenchen. As we discussed, the decoding-time metric for file formats should be held off until the data source v2 implementation is done, so I closed [GitHub Pull Request #23378|https://github.com/apache/spark/pull/23378]. For `RowDataSourceScanExec`, I posted a preview PR here: [GitHub Pull Request #23528|https://github.com/apache/spark/pull/23528]. While working on it, I found the projection does not take much time, so please take a look and judge whether this metric is worth adding to `RowDataSourceScanExec`. Thanks.

> Scan: track decoding time for row-based data sources
>
> Key: SPARK-26225
> URL: https://issues.apache.org/jira/browse/SPARK-26225
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Reynold Xin
> Priority: Major
>
> Scan node should report decoding time for each record, if it is not too much
> overhead.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735954#comment-16735954 ]

Wenchen Fan commented on SPARK-26225:
-

I think it's hard to define the decoding time, as every data source may have its own definition. For data source v1, I think we just need to update `RowDataSourceScanExec` and track the time of the unsafe projection that turns a Row into an InternalRow. For data source v2, it's totally different: Spark needs to ask the data source to report the decoding time (or any other metrics). I'd like to defer this until after the data source v2 metrics API is introduced.
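The v1 suggestion above (charging the Row-to-InternalRow unsafe projection time to a scan metric in `RowDataSourceScanExec`) can be sketched with a plain timing iterator. This is a simplified, Spark-free sketch: `DecodeTimeMetric` stands in for Spark's `SQLMetric`, and `TimedDecodeIterator` for the wrapping the scan node would do; neither name exists in Spark.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for Spark's SQLMetric: accumulates nanoseconds.
final class DecodeTimeMetric {
  private val totalNanos = new AtomicLong(0L)
  def add(nanos: Long): Unit = totalNanos.addAndGet(nanos)
  def valueNanos: Long = totalNanos.get()
}

// Wraps a row iterator and charges the time spent in `decode`
// (e.g. the Row -> InternalRow unsafe projection) to the metric.
final class TimedDecodeIterator[A, B](
    underlying: Iterator[A],
    decode: A => B,
    metric: DecodeTimeMetric) extends Iterator[B] {

  override def hasNext: Boolean = underlying.hasNext

  override def next(): B = {
    val raw = underlying.next()
    val start = System.nanoTime()
    val decoded = decode(raw)
    metric.add(System.nanoTime() - start)
    decoded
  }
}
```

In the real scan node, `decode` would be the generated `UnsafeProjection` and the accumulated value would feed a metric shown in the SQL UI. Note the per-record `System.nanoTime()` calls are themselves a small cost, which is why the ticket hedges with "if it is not too much overhead".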
[ https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728620#comment-16728620 ]

Yuanjian Li commented on SPARK-26225:
-

We define decoding time here as the time the system spends converting data from the storage format into Spark's `InternalRow`. I list the decoding source code below, divided into two parts.

1. Row-based data sources

All decoding work happens in the `buildReader` function of row-based data sources, which overrides `FileFormat.buildReader`.

||Data Source||Decode Logic||Code Link||
|Json-TextInputJsonDataSource|FailureSafeParser.parse|[JsonDataSource.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L231-L232]|
|Json-MultiLineJsonDataSource|FailureSafeParser.parse|[JsonDataSource.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L145]|
|CSV-TextInputCSVDataSource|UnivocityParser.parseIterator|[CSVDataSource.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L105]|
|CSV-MultiLineCSVDataSource|UnivocityParser.parseStream|[CSVDataSource.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L178-L182]|
|Avro|AvroDeserializer.deserialize|[AvroFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L238]|
|Text|UnsafeRowWriter.write|[TextFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L128-L134]|
|ORC-hive|OrcFileFormat.unwrapOrcStructs|[hive/orc/OrcFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala#L174-L179]|
|Image|RowEncoder.toRow|[ImageFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala#L95]|
|LibSVM|RowEncoder.toRow|[LibSVMRelation.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala#L175-L179]|

Instead of handling each of these scenarios separately, we could handle them uniformly by timing `FileFormat.buildReader`, if we can accept that initialization work (reader initialization, schema preparation, etc.) counts toward decoding time. That would keep the code and logic cleaner and minimize overhead.

2. Column-based data sources

All decoding work is triggered in `buildReaderWithPartitionValues`, which overrides `FileFormat`. It should be discussed separately depending on whether batch read mode is enabled.

||Data Source||Batch Read||Decode Logic||Code Link||
|ORC-native|false|OrcDeserializer.deserialize|[datasources/orc/OrcFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L229-L234]|
|ORC-native|true|Fill column vectors in OrcColumnarBatchReader.nextBatch|[OrcColumnarBatchReader.java\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java#L252-L259]|
|Parquet|false|InternalParquetRecordReader|This code is not in Spark; the decoding work is done in RecordMaterializer|
|Parquet|true|Fill column vectors in VectorizedColumnReader.readBatch|[VectorizedParquetRecordReader.java\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L259-L262]|

I list the decoding logic of the column-based data sources here in case further work is needed on them later.
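The uniform option floated above for row-based sources (timing everything behind `FileFormat.buildReader`, with initialization counted toward decoding time) could be sketched roughly as follows. This is a Spark-free sketch: `F` and `R` are generic stand-ins for `PartitionedFile` and `InternalRow`, and `timedReader` is a hypothetical helper, not Spark API.

```scala
import java.util.concurrent.atomic.AtomicLong

// Wraps a buildReader-style factory (F => Iterator[R]) so that both
// reader construction and per-record decoding are charged to the metric.
def timedReader[F, R](
    buildReader: F => Iterator[R],
    totalNanos: AtomicLong): F => Iterator[R] = { file =>
  val initStart = System.nanoTime()
  // Initialization (reader setup, schema preparation) is counted,
  // as the comment above proposes accepting.
  val underlying = buildReader(file)
  totalNanos.addAndGet(System.nanoTime() - initStart)

  new Iterator[R] {
    // Note: some parsers do real work in hasNext; a production
    // version might need to time hasNext as well.
    def hasNext: Boolean = underlying.hasNext
    def next(): R = {
      val t0 = System.nanoTime()
      val row = underlying.next()
      totalNanos.addAndGet(System.nanoTime() - t0)
      row
    }
  }
}
```

Because the wrapper only sees `F => Iterator[R]`, the same code would cover every row-based format in the first table without per-format changes, which is the "more code and logic clean" point above.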
[ https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705846#comment-16705846 ]

Sean Owen commented on SPARK-26225:
---

[~Thincrs] let's not automatically add comments from your tool to JIRA. It generates emails and a bit of clutter.
[ https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705373#comment-16705373 ]

Thincrs commented on SPARK-26225:
-

A user of thincrs has selected this issue. Deadline: Fri, Dec 7, 2018 10:42 PM