[
https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090098#comment-17090098
]
Udit Mehrotra edited comment on HUDI-829 at 4/23/20, 12:07 AM:
---
You may also want to look at my implementation of a custom relation in Spark for
reading bootstrapped tables:
[https://github.com/apache/incubator-hudi/pull/1475/files#diff-f14ac7b3cff88313d650b01a56a2b8f8R191]
. Here I am building my own file index using Spark's InMemoryFileIndex, but the
filtering is now a single operation: once I have all the files, the Hudi
filesystem view is created just once to get the latest files. It is still a work
in progress, and I have yet to see how fast it will be. We could consider moving
to a model where our reads in Spark happen through our own relations, with the
native readers used underneath.
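To make the idea concrete, here is a minimal sketch of the approach described above: list all files once with Spark's InMemoryFileIndex, then build the Hudi filesystem view a single time over that listing and ask it for the latest files. This assumes Spark and Hudi are on the classpath; InMemoryFileIndex is Spark-internal API, and the Hudi builder/method names here are approximations for illustration, not the exact code in the PR.

```scala
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import org.apache.hudi.common.table.HoodieTableMetaClient
import org.apache.hudi.common.table.view.HoodieTableFileSystemView

import scala.collection.JavaConverters._

object LatestFilesSketch {
  def latestFiles(spark: SparkSession, basePath: String): Seq[FileStatus] = {
    // 1. List all files under the table once, using Spark's parallel,
    //    built-in file index (no per-partition sequential listing).
    val fileIndex = new InMemoryFileIndex(
      spark, Seq(new Path(basePath)), Map.empty, userSpecifiedSchema = None)
    val allFiles: Seq[FileStatus] = fileIndex.allFiles()

    // 2. Build the Hudi filesystem view exactly once over the listed files.
    //    (Builder/method names are approximate and may differ by Hudi version.)
    val metaClient = HoodieTableMetaClient.builder()
      .setConf(spark.sparkContext.hadoopConfiguration)
      .setBasePath(basePath)
      .build()
    val fsView = new HoodieTableFileSystemView(
      metaClient,
      metaClient.getActiveTimeline.getCommitsTimeline.filterCompletedInstants,
      allFiles.toArray)

    // 3. A single filtering pass: ask the view for the latest base file
    //    per file group.
    fsView.getLatestBaseFiles.iterator().asScala
      .map(_.getFileStatus)
      .toSeq
  }
}
```

The point of the sketch is the ordering: one bulk listing, one view construction, one filter, rather than re-listing or re-filtering per partition.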
In practice, the data source implementation I am working on would work for both
bootstrapped and non-bootstrapped tables. So, once it is working and stable, it
would be nice to do some benchmarking on regular (non-bootstrapped) tables as
well and compare its read performance against the standard Parquet data source
with filters.
> Efficiently reading hudi tables through spark-shell
> ---
>
> Key: HUDI-829
> URL: https://issues.apache.org/jira/browse/HUDI-829
> Project: Apache Hudi (incubating)
> Issue Type: Task
> Components: Spark Integration
> Reporter: Nishith Agarwal
> Assignee: Nishith Agarwal
> Priority: Major
>
> [~uditme] Created this ticket to track some discussion on the read/query path
> of Spark with Hudi tables.
> My understanding is that when you read Hudi tables through spark-shell, some
> of your queries are slower due to some sequential activity performed by Spark
> when interacting with Hudi tables (even with
> spark.sql.hive.convertMetastoreParquet, which should give you the same data
> reading speed and all the vectorization benefits). Is this slowness observed
> during Spark query planning? Can you please elaborate on this?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)