Paul,

> What it basically does is run a local-mode (in-process) Spark to read
the log.

That's unfortunately not particularly scalable unless I'm missing
something. I think the easiest path to accomplish this would be to build a
NiFi flow that generates Parquet files and uploads them to an S3 bucket,
then periodically run a Spark job that reads the entire bucket and merges
the files into the Delta table. I've done something similar in a personal
experiment. It's particularly easy now that we have Record API components
that will directly generate Parquet output files from an Avro schema and a
record set.
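
Something along these lines could serve as the periodic merge job. This is
only a rough sketch: the bucket paths, table location, and delta-core
version are placeholders, and the Spark environment is assumed to already
have S3 access (hadoop-aws) configured.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("merge-nifi-parquet-into-delta")
    # Pick the delta-core artifact that matches your Spark/Scala version.
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.5.0")
    .getOrCreate()
)

# Read the Parquet files NiFi has uploaded to the staging prefix.
incoming = spark.read.parquet("s3a://my-bucket/nifi-staging/")

# Append them to the Delta table; each append is one atomic commit in the
# table's transaction log.
(incoming.write
    .format("delta")
    .mode("append")
    .save("s3a://my-bucket/delta/my_table"))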

On Tue, Mar 31, 2020 at 11:08 AM Paul Parker <nifi.sur...@gmail.com> wrote:

> Let me share answers from the Delta community:
>
> Answer to Q1:
> Structured Streaming queries can do commits every minute, even every 20-30
> seconds. This definitely creates small files, but that is okay, because it
> is expected that people will periodically compact the files. The same
> timing should work fine for NiFi and any other streaming engine. It does
> create 1000-ish versions per day, but that is okay.
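>
> For reference, a compaction pass in open-source Delta can be as simple as
> rewriting the table in place with fewer files. This is only a sketch: the
> path and target file count are placeholders, and a SparkSession `spark`
> configured with the delta-core package (as in the merge job above) is
> assumed.
>
> path = "s3a://my-bucket/delta/my_table"
> (spark.read.format("delta").load(path)
>     .repartition(16)                  # target number of output files
>     .write
>     .option("dataChange", "false")    # commit only rearranges existing data
>     .format("delta")
>     .mode("overwrite")
>     .save(path))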
>
> Answer to Q2:
> That is up to the sink implementation. Both are okay. In fact, it probably
> can be a combination of both, as long as we don't commit every second. That
> may not scale well.
>
> Answer to Q3:
> You need a primary node which is responsible for managing the Delta table.
> That node would be responsible for reading the log, parsing it, updating
> it, etc. Unfortunately, we have no good non-Spark way to read the log,
> much less to write to it.
> There is an experimental uber jar that tries to package Delta + Spark
> together into a single jar, with which you could read the log. It is
> available here: https://github.com/delta-io/connectors/
>
> What it basically does is run a local-mode (in-process) Spark to read
> the log. This is what we are using to build a Hive connector that will
> allow Hive to read Delta files. The goal of that was only to read; your
> goal is to write, which is definitely more complicated, because for that
> you have to do much more. That uber jar has all the necessary code to do
> the writing, which you could use, but there has to be a driver node that
> collects all the Parquet files written by the other nodes and atomically
> commits them to the Delta log to make them visible to all readers.
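>
> To illustrate the read side, a minimal sketch of an in-process, local-mode
> Spark session reading a Delta table might look like this (the table path
> and delta-core version are placeholders; this uses plain PySpark rather
> than the uber jar itself):
>
> from pyspark.sql import SparkSession
>
> spark = (
>     SparkSession.builder
>     .master("local[*]")   # in-process Spark, no external cluster
>     .appName("read-delta-log")
>     .config("spark.jars.packages", "io.delta:delta-core_2.12:0.5.0")
>     .getOrCreate()
> )
>
> # Loading the table resolves the current snapshot from the transaction log.
> df = spark.read.format("delta").load("s3a://my-bucket/delta/my_table")
> df.show()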
>
> Does this initial orientation help?
>
>
> On Sun, Mar 29, 2020 at 8:29 PM Mike Thomsen <mikerthom...@gmail.com>
> wrote:
>
>> It looks like a lot of their connectors rely on external management by
>> Spark. That is true of Hive, and also of Athena/Presto unless I misread the
>> documentation. Read some of the fine print near the bottom of this to get
>> an idea of what I mean:
>>
>> https://github.com/delta-io/connectors
>>
>> Hypothetically, we could build a connector for NiFi, but there are some
>> things about the design of Delta Lake that I am not sure about, based on
>> my own research. Roughly, they are the following:
>>
>> 1. What is a good timing strategy for doing commits to Delta Lake?
>> 2. What would trigger a commit in the first place?
>> 3. Is there a good way to trigger commits that would work within the
>> masterless cluster design of clustered NiFi instead of requiring a special
>> "primary node only" processor for executing commits?
>>
>> Based on my experimentation, one of the biggest questions around the
>> first point is whether you really want potentially thousands or tens of
>> thousands of time-shift events to be created throughout the day. A record
>> processor that reads in a ton of small record sets and injects them into
>> Delta Lake would create a ton of these checkpoints, and they'd be
>> largely meaningless to people trying to make sense of them when moving
>> back and forth in time between versions.
>>
>> Do we trigger a commit per record set or set a timer?
>>
>> Most of us on the NiFi dev side have no real experience here. It would be
>> helpful to get some ideas from the community to form use cases, because
>> there are some big gaps in how we'd even start to shape the
>> requirements.
>>
>> On Sun, Mar 29, 2020 at 1:28 PM Paul Parker <nifi.sur...@gmail.com>
>> wrote:
>>
>>> Hi Mike,
>>> your alternative suggestion sounds good. But how does it work if I want
>>> to keep this running continuously? In other words, the Delta table should
>>> be continuously updated. After all, this is one of the biggest advantages
>>> of Delta: you can ingest batch and streaming data into one table.
>>>
>>> I am also thinking about workarounds (using Athena, Presto, or Redshift
>>> with NiFi):
>>> "Here is the list of integrations that enable you to access Delta
>>> tables from external data processing engines.
>>>
>>>    - Presto and Athena to Delta Lake Integration
>>>    <https://docs.delta.io/latest/presto-integration.html>
>>>    - Redshift Spectrum to Delta Lake Integration
>>>    <https://docs.delta.io/latest/redshift-spectrum-integration.html>
>>>    - Snowflake to Delta Lake Integration
>>>    <https://docs.delta.io/latest/snowflake-integration.html>
>>>    - Apache Hive to Delta Lake Integration
>>>    <https://docs.delta.io/latest/hive-integration.html>"
>>>
>>> Source:
>>> https://docs.delta.io/latest/integrations.html
>>>
>>> I am looking forward to further ideas from the community.
>>>
>>> On Sun, Mar 29, 2020 at 5:23 PM Mike Thomsen <mikerthom...@gmail.com>
>>> wrote:
>>>
>>>> I think there is a connector for Hive that works with Delta. You could
>>>> try setting up Hive to work with Delta and then using NiFi to feed Hive.
>>>> Alternatively, you can simply convert the results of the JDBC query into
>>>> a Parquet file, push it to a desired location, and run a Spark job to
>>>> convert from Parquet to Delta (that should be pretty fast, because Delta
>>>> stores its data as ordinary Parquet files plus a transaction log).
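>>>>
>>>> That conversion step could be as simple as the following sketch (the
>>>> paths and delta-core version are placeholders):
>>>>
>>>> from pyspark.sql import SparkSession
>>>>
>>>> spark = (
>>>>     SparkSession.builder
>>>>     .appName("parquet-to-delta")
>>>>     .config("spark.jars.packages", "io.delta:delta-core_2.12:0.5.0")
>>>>     .getOrCreate()
>>>> )
>>>>
>>>> # Rewrite the staged Parquet output as a Delta table.
>>>> (spark.read.parquet("s3a://my-bucket/jdbc-extract/")
>>>>     .write
>>>>     .format("delta")
>>>>     .mode("overwrite")
>>>>     .save("s3a://my-bucket/delta/my_table"))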
>>>>
>>>> On Fri, Mar 27, 2020 at 2:13 PM Paul Parker <nifi.sur...@gmail.com>
>>>> wrote:
>>>>
>>>>> We read data via JDBC from a database and want to save the results as
>>>>> a Delta table and then read them again. How can I accomplish this with
>>>>> NiFi and Hive or the Glue Metastore?
>>>>>
>>>>
