Hi,

"spark.read.<format>" is a "shorthand" for "built-in" data sources, not for
external data sources. spark.read.format() is still an official way to use
it. Delta Lake is not included in Apache Spark so that is indeed not
possible for Spark to refer to.
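
For example in PySpark (paths below are just illustrative), the shorthand and
format() calls are equivalent for a built-in source, while an external source
like Delta Lake always goes through format():

    # built-in source: shorthand and format() do the same thing
    df = spark.read.parquet("/data/events")
    df = spark.read.format("parquet").load("/data/events")

    # external source: Spark itself has no shorthand, so use format()
    df = spark.read.format("delta").load("/data/events_delta")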

Starting with Spark 3.0, the concept of a "catalog" was introduced: you can
simply refer to a table through the catalog (if the external data source
provides a catalog implementation) without specifying the format explicitly,
since the catalog already knows it.
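
As a rough sketch (the catalog name "mycatalog" and the plugin class are
placeholders; each connector documents its own implementation class),
registering a DataSource V2 catalog and reading a table through it looks like
this in PySpark:

    from pyspark.sql import SparkSession

    # register the connector's catalog plugin under the name "mycatalog"
    spark = (SparkSession.builder
             .config("spark.sql.catalog.mycatalog",
                     "<the connector's catalog implementation class>")
             .getOrCreate())

    # the table is resolved through the catalog; no format() needed
    df = spark.table("mycatalog.some_db.some_table")
    # or equivalently via SQL
    df = spark.sql("SELECT * FROM mycatalog.some_db.some_table")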

This session explains the catalog and how the Cassandra connector leverages
it. I see some external data sources starting to support catalogs, and in
Spark itself there is an ongoing effort to support a catalog for JDBC.
https://databricks.com/fr/session_na20/datasource-v2-and-cassandra-a-whole-new-world

Hope this helps.

Thanks,
Jungtaek Lim (HeartSaVioR)


On Mon, Oct 5, 2020 at 8:53 PM Moser, Michael <
michael.mo...@siemens-healthineers.com> wrote:

> Hi there,
>
>
>
> I’m just wondering if there is any incentive to implement read/write
> methods in the DataFrameReader/DataFrameWriter for delta similar to e.g.
> parquet?
>
>
>
> For example, using PySpark, “spark.read.parquet” is available, but
> “spark.read.delta” is not (same for write).
>
> In my opinion, “spark.read.delta” feels cleaner and more pythonic compared
> to “spark.read.format(‘delta’).load()”, especially if more options are
> set, like “mode”.
>
>
>
> Can anyone explain the reasoning behind this, is this due to the Java
> nature of Spark?
>
> From a pythonic point of view, I could also imagine a single read/write
> method, with the format as an arg and kwargs related to the different file
> format options.
>
>
>
> Best,
>
> Michael
>
>
>
>
>
