[jira] [Comment Edited] (HUDI-829) Efficiently reading hudi tables through spark-shell

2020-04-22 Thread Udit Mehrotra (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090098#comment-17090098
 ] 

Udit Mehrotra edited comment on HUDI-829 at 4/23/20, 12:07 AM:
---

You may also want to look at my implementation of a custom relation in Spark for 
reading bootstrapped tables 
[https://github.com/apache/incubator-hudi/pull/1475/files#diff-f14ac7b3cff88313d650b01a56a2b8f8R191]
 . Here I build my own file index using Spark's InMemoryFileIndex, but the 
filtering step is now a single operation: once I have all the files, the Hudi 
filesystem view is created just once to resolve the latest files. It is still a 
work in progress and I have yet to see how fast it will be. We could consider 
moving to a model where our reading in Spark happens through our own relations, 
with the native readers used underneath.
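As a rough illustration of the listing/filtering flow described above, a minimal sketch against Spark 2.4 and the Hudi 0.5.x APIs could look like the following. Exact constructor and method signatures vary between versions, and the helper name latestBaseFiles is mine, not something from the PR:

{code:scala}
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import org.apache.hudi.common.table.HoodieTableMetaClient
import org.apache.hudi.common.table.view.HoodieTableFileSystemView

import scala.collection.JavaConverters._

// List every file under the table once via Spark's InMemoryFileIndex,
// then build the Hudi filesystem view a single time to keep only the
// latest base file per file group.
def latestBaseFiles(spark: SparkSession, basePath: String): Seq[FileStatus] = {
  val fileIndex = new InMemoryFileIndex(
    spark, Seq(new Path(basePath)), Map.empty, userSpecifiedSchema = None)

  // One bulk listing of the table; no per-partition listing afterwards.
  val allFiles: Seq[FileStatus] = fileIndex.allFiles()

  // Build the filesystem view once from the already-listed statuses
  // (constructor shapes differ slightly across Hudi versions).
  val metaClient = new HoodieTableMetaClient(
    spark.sparkContext.hadoopConfiguration, basePath)
  val fsView = new HoodieTableFileSystemView(
    metaClient,
    metaClient.getActiveTimeline.getCommitsTimeline.filterCompletedInstants(),
    allFiles.toArray)

  // Keep only the latest base file per file group.
  fsView.getLatestBaseFiles.iterator().asScala
    .map(_.getFileStatus)
    .toSeq
}
{code}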

In practice, the data source implementation I am working on would work for both 
bootstrapped and non-bootstrapped tables. So once it is working and stable, it 
would be nice to benchmark regular (non-bootstrapped) tables as well and compare 
its read performance against the standard Parquet data source with filters, as 
sketched below.
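For that comparison, something along these lines could serve as a starting point; spark.time just wall-clocks the action, and the table path, path globbing, and filter column are placeholders rather than the final API:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-read-benchmark").getOrCreate()

// Placeholder table location; globbing depth depends on the partition layout.
val basePath = "s3://bucket/path/to/table"

// Read through the Hudi data source (custom relation underneath).
spark.time {
  spark.read.format("org.apache.hudi")
    .load(basePath + "/*")
    .filter("partition_col = '2020-04-22'")   // hypothetical filter column
    .count()
}

// Read the same data through the plain Parquet data source with filters.
spark.time {
  spark.read.parquet(basePath)
    .filter("partition_col = '2020-04-22'")
    .count()
}
{code}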


was (Author: uditme):
You may also want to look at my implementation of custom relation in Spark to 
read bootstrapped tables 
[https://github.com/apache/incubator-hudi/pull/1475/files#diff-f14ac7b3cff88313d650b01a56a2b8f8R191]
 . Here I am building my own file index using spark's InMemoryFileIndex, but 
the filtering part is just one operation now because once I have all the files, 
the hudi filesystem view is created just once to get latest files. Its still 
work in progress and I am yet to see how fast this is going to be. We can 
consider moving to a place where our reading in spark happens through our 
relations, and underneath we use the native readers.

> Efficiently reading hudi tables through spark-shell
> ---
>
> Key: HUDI-829
> URL: https://issues.apache.org/jira/browse/HUDI-829
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>
> [~uditme] I created this ticket to track the discussion on the read/query path 
> of Spark with Hudi tables. 
> My understanding is that when you read Hudi tables through spark-shell, some 
> of your queries are slower due to some sequential activity performed by Spark 
> when interacting with Hudi tables (even with 
> spark.sql.hive.convertMetastoreParquet, which should give you the same data 
> reading speed and all the vectorization benefits). Is this slowness observed 
> during Spark query planning? Can you please elaborate on this? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

