Though spark.read.<format> refers to "built-in" data sources, there is
nothing that prevents 3rd-party libraries from "extending" spark.read in
Scala or Python. Since users already know the Spark way of reading
built-in data sources, it feels natural to hook 3rd-party data sources
into the same scheme, to give users a holistic and integrated feel.
One Scala example
(https://github.com/G-Research/spark-dgraph-connector#spark-dgraph-connector):
import uk.co.gresearch.spark.dgraph.connector._
val triples = spark.read.dgraph.triples("localhost:9080")
and in Python:
from gresearch.spark.dgraph.connector import *
triples = spark.read.dgraph.triples("localhost:9080")
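Under the hood such a shorthand just delegates to the official format()
API. A minimal Python sketch of the idea, using a hypothetical format
name "myformat" (not what any particular connector actually ships):

from pyspark.sql.readwriter import DataFrameReader

def _read_myformat(self, path):
    # delegate to the official format() entry point, so the normal
    # data source resolution is reused
    return self.format("myformat").load(path)

# attach the shorthand so that spark.read.myformat("...") works
DataFrameReader.myformat = _read_myformat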
I agree that 3rd parties should also support the official
spark.read.format() and the new catalog approaches.
Enrico
On 05.10.20 at 14:03, Jungtaek Lim wrote:
Hi,
"spark.read.<format>" is a "shorthand" for "built-in" data sources,
not for external data sources. spark.read.format() is still an
official way to use it. Delta Lake is not included in Apache Spark so
that is indeed not possible for Spark to refer to.
Starting with Spark 3.0, the concept of a "catalog" was introduced: you
can simply refer to a table via the catalog (if the external data
source provides a catalog implementation) without specifying the
format explicitly, since the catalog already knows about it.
This session explains the catalog and how the Cassandra connector
leverages it. I see some external data sources starting to support
catalogs, and within Spark itself there is an effort to support a
catalog for JDBC.
https://databricks.com/fr/session_na20/datasource-v2-and-cassandra-a-whole-new-world
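For illustration, a minimal PySpark sketch of the catalog approach; the
catalog name and implementation class below are placeholders for
whatever the external connector actually ships:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("catalog-example")
         # the catalog class is provided by the external connector (placeholder name)
         .config("spark.sql.catalog.mycat", "com.example.connector.MyCatalog")
         .getOrCreate())

# the catalog resolves the data source, so no explicit format() call is needed
df = spark.table("mycat.some_namespace.some_table")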
Hope this helps.
Thanks,
Jungtaek Lim (HeartSaVioR)
On Mon, Oct 5, 2020 at 8:53 PM Moser, Michael
<michael.mo...@siemens-healthineers.com> wrote:
Hi there,
I’m just wondering if there is any incentive to implement
read/write methods in the DataFrameReader/DataFrameWriter for
delta similar to e.g. parquet?
For example, using PySpark, “spark.read.parquet” is available, but
“spark.read.delta” is not (same for write).
In my opinion, “spark.read.delta” feels cleaner and more pythonic
compared to “spark.read.format(‘delta’).load()”, especially when
more options, like “mode”, come into play.
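Concretely, this is what I mean (the shorthand lines are what I would
like to write; they are not an existing API):

# today's official way, via format(), assuming an active SparkSession
# `spark` with Delta Lake on the classpath
df = spark.read.format("delta").load("/data/events")
df.write.format("delta").mode("overwrite").save("/data/events_copy")

# the shorthand I have in mind (not available today):
# df = spark.read.delta("/data/events")
# df.write.delta("/data/events_copy", mode="overwrite")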
Can anyone explain the reasoning behind this? Is it due to the
Java nature of Spark?
From a pythonic point of view, I could also imagine a single
read/write method, with the format as an arg and kwargs related to
the different file format options.
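Something like this sketch (the helper name read_any is just made up to
illustrate the idea):

def read_any(spark, path, format, **options):
    # single entry point: the format is an argument, kwargs are passed
    # through as reader options
    return spark.read.format(format).options(**options).load(path)

df = read_any(spark, "/data/people.csv", "csv", header="true", inferSchema="true")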
Best,
Michael