[jira] [Commented] (SPARK-26225) Scan: track decoding time for row-based data sources

2019-01-12 Thread Yuanjian Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741315#comment-16741315
 ] 

Yuanjian Li commented on SPARK-26225:
-

Thanks for your reply, Wenchen. As discussed, the decoding-time metric for file 
formats should be put on hold until the data source v2 implementation is done, 
so I closed [GitHub Pull Request #23378|https://github.com/apache/spark/pull/23378].

For `RowDataSourceScanExec`, I posted a preview PR here: [GitHub Pull Request 
#23528|https://github.com/apache/spark/pull/23528]. While working on it, though, 
I found that the conversion does not take much time, so please take a look and 
consider whether this metric is worth adding to `RowDataSourceScanExec`. Thanks.

> Scan: track decoding time for row-based data sources
> 
>
> Key: SPARK-26225
> URL: https://issues.apache.org/jira/browse/SPARK-26225
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Priority: Major
>
> Scan node should report decoding time for each record, if it is not too much 
> overhead.
>  




[jira] [Commented] (SPARK-26225) Scan: track decoding time for row-based data sources

2019-01-07 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735954#comment-16735954
 ] 

Wenchen Fan commented on SPARK-26225:
-

I think it's hard to define the decoding time, as every data source may have 
its own definition.

For data source v1, I think we just need to update `RowDataSourceScanExec` and 
track the time of the unsafe projection that turns each Row into an InternalRow.
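
For illustration, a minimal sketch of what that could look like, mirroring the
existing `doExecute` of `RowDataSourceScanExec` (`decodingTime` is a
hypothetical metric name, not part of any actual patch):

{code:scala}
// Sketch only: inside RowDataSourceScanExec, time the UnsafeProjection that
// converts each external Row into an InternalRow (UnsafeRow).
override lazy val metrics = Map(
  "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
  "decodingTime" -> SQLMetrics.createTimingMetric(sparkContext, "decoding time"))

protected override def doExecute(): RDD[InternalRow] = {
  val numOutputRows = longMetric("numOutputRows")
  val decodingTime = longMetric("decodingTime")
  rdd.mapPartitionsWithIndexInternal { (index, iter) =>
    val proj = UnsafeProjection.create(schema)
    proj.initialize(index)
    iter.map { r =>
      val startNs = System.nanoTime()
      val row = proj(r)  // the Row -> InternalRow conversion being measured
      decodingTime += (System.nanoTime() - startNs) / 1000000  // milliseconds
      numOutputRows += 1
      row
    }
  }
}
{code}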

For data source v2, it's totally different: Spark needs to ask the data source 
to report the decoding time (or any other metrics). I'd like to defer this 
until the data source v2 metrics API is introduced.





[jira] [Commented] (SPARK-26225) Scan: track decoding time for row-based data sources

2018-12-25 Thread Yuanjian Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728620#comment-16728620
 ] 

Yuanjian Li commented on SPARK-26225:
-

We define decoding time here as the time the system spends converting data 
from the storage format to Spark's 'InternalRow'. I list the decoding source 
code here, divided into two parts.
1. Row-based data sources
For row-based data sources, all decoding work happens in the 'buildReader' 
function, which overrides FileFormat.buildReader.
||Data Source||Decode Logic||Code Link||
|Json-TextInputJsonDataSource|FailureSafeParser.parse|[JsonDataSource.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L231-L232]|
|Json-MultiLineJsonDataSource|FailureSafeParser.parse|[JsonDataSource.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L145]|
|CSV-TextInputCSVDataSource|UnivocityParser.parseIterator|[CSVDataSource.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L105]|
|CSV-MultiLineCSVDataSource|UnivocityParser.parseStream|[CSVDataSource.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L178-L182]|
|Avro|AvroDeserializer.deserialize|[AvroFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L238]|
|Text|UnsafeRowWriter.write|[TextFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L128-L134]|
|ORC-hive|OrcFileFormat.unwrapOrcStructs|[hive/orc/OrcFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala#L174-L179]|
|Image|RowEncoder.toRow|[ImageFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala#L95]|
|LibSVM|RowEncoder.toRow|[LibSVMRelation.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala#L175-L179]|

Instead of dealing with each scenario separately, we can handle them uniformly 
by timing FileFormat.buildReader, if we can accept that initialization work 
(reader initialization, schema preparation, etc.) counts toward decoding time. 
That keeps the code and logic cleaner and minimizes the overhead.
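
For illustration, a minimal sketch of that uniform approach (`timedReader` and 
`decodingTime` are hypothetical names, not existing Spark APIs):

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.execution.metric.SQLMetric

// Sketch only: wrap the read function returned by FileFormat.buildReader so
// that reader initialization and every next() call are charged to the metric.
def timedReader(
    readFile: PartitionedFile => Iterator[InternalRow],
    decodingTime: SQLMetric): PartitionedFile => Iterator[InternalRow] = file => {
  var startNs = System.nanoTime()
  val underlying = readFile(file)  // setup (reader, schema, ...) counted too
  decodingTime += (System.nanoTime() - startNs) / 1000000
  new Iterator[InternalRow] {
    override def hasNext: Boolean = underlying.hasNext
    override def next(): InternalRow = {
      startNs = System.nanoTime()
      val row = underlying.next()  // per-record decoding for row-based sources
      decodingTime += (System.nanoTime() - startNs) / 1000000
      row
    }
  }
}
{code}

One caveat: sources whose parsers advance the underlying reader inside 
hasNext would still leave some decoding time unmeasured by this wrapper.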

2. Column-based data sources

All decoding work is triggered in buildReaderWithPartitionValues, which 
overrides FileFormat; it should be discussed separately depending on whether 
batch read mode is enabled or disabled.
||Data Source||Batch Read||Decode Logic||Code Link||
|ORC-native|false|OrcDeserializer.deserialize|[datasources/orc/OrcFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L229-L234]|
|ORC-native|true|Fill the column vectors in 
OrcColumnarBatchReader.nextBatch|[OrcColumnarBatchReader.java\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java#L252-L259]|
|Parquet|false|InternalParquetRecordReader|This code is not in Spark; the 
decoding work is done in RecordMaterializer|
|Parquet|true|Fill the column vectors in 
VectorizedColumnReader.readBatch|[VectorizedParquetRecordReader.java\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L259-L262]|

The decoding logic of column-based data sources is listed above in case 
further work is needed later.
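
If that further work happens, timing once per batch should keep the overhead 
low, since the clock calls are amortized over all rows in a ColumnarBatch. A 
hypothetical sketch (`timedBatches` and `decodingTime` are made-up names):

{code:scala}
import org.apache.spark.sql.execution.metric.SQLMetric
import org.apache.spark.sql.vectorized.ColumnarBatch

// Sketch only: time each pull of a ColumnarBatch, which is where
// nextBatch()/readBatch() fill the column vectors for ORC/Parquet.
def timedBatches(
    batches: Iterator[ColumnarBatch],
    decodingTime: SQLMetric): Iterator[ColumnarBatch] =
  new Iterator[ColumnarBatch] {
    override def hasNext: Boolean = {
      // RecordReaderIterator-style iterators advance the reader in hasNext,
      // so decoding can happen here as well as in next().
      val startNs = System.nanoTime()
      val has = batches.hasNext
      decodingTime += (System.nanoTime() - startNs) / 1000000
      has
    }
    override def next(): ColumnarBatch = {
      val startNs = System.nanoTime()
      val batch = batches.next()
      decodingTime += (System.nanoTime() - startNs) / 1000000
      batch
    }
  }
{code}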






[jira] [Commented] (SPARK-26225) Scan: track decoding time for row-based data sources

2018-12-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705846#comment-16705846
 ] 

Sean Owen commented on SPARK-26225:
---

[~Thincrs] let's not automatically add comments from your tool to JIRA. It 
generates emails and a bit of clutter.





[jira] [Commented] (SPARK-26225) Scan: track decoding time for row-based data sources

2018-11-30 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705373#comment-16705373
 ] 

Thincrs commented on SPARK-26225:
-

A user of thincrs has selected this issue. Deadline: Fri, Dec 7, 2018 10:42 PM



