Though spark.read.<format> refers to "built-in" data sources, there is
nothing that prevents 3rd-party libraries from "extending" spark.read in
Scala or Python. Since users already know the Spark way of reading
built-in data sources, it feels natural to hook 3rd-party data sources
into the same scheme, to give users a holistic and integrated feel.
One Scala example
(https://github.com/G-Research/spark-dgraph-connector#spark-dgraph-connector):
import uk.co.gresearch.spark.dgraph.connector._
val triples = spark.read.dgraph.triples("localhost:9080")
and in Python:
from gresearch.spark.dgraph.connector import *
triples = spark.read.dgraph.triples("localhost:9080")
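Under the hood such a shorthand just delegates to the official format()
API. A minimal Python sketch of the idea, using a hypothetical format
name "myformat" (not what any particular connector actually ships):

from pyspark.sql.readwriter import DataFrameReader

def _read_myformat(self, path):
    # delegate to the official format() entry point, so the normal
    # data source resolution is reused
    return self.format("myformat").load(path)

# attach the shorthand so that spark.read.myformat("...") works
DataFrameReader.myformat = _read_myformat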
I agree that 3rd parties should also support the official
spark.read.format() and the new catalog approaches.
Enrico
On 05.10.20 at 14:03, Jungtaek Lim wrote:
Hi,
"spark.read.<format>" is a "shorthand" for "built-in" data sources,
not for external data sources. spark.read.format() is still an
official way to use it. Delta Lake is not included in Apache Spark so
that is indeed not possible for Spark to refer to.
Starting with Spark 3.0, the concept of a "catalog" was introduced: you
can simply refer to a table via the catalog (if the external data
source provides a catalog implementation) without specifying the
format explicitly, since the catalog already knows about it.
This session explains the catalog and how the Cassandra connector
leverages it. I see some external data sources starting to support
catalogs, and within Spark itself there is an effort to support a
catalog for JDBC.
https://databricks.com/fr/session_na20/datasource-v2-and-cassandra-a-whole-new-world
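For illustration, a minimal PySpark sketch of the catalog approach; the
catalog name and implementation class below are placeholders for
whatever the external connector actually ships:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("catalog-example")
         # the catalog class is provided by the external connector (placeholder name)
         .config("spark.sql.catalog.mycat", "com.example.connector.MyCatalog")
         .getOrCreate())

# the catalog resolves the data source, so no explicit format() call is needed
df = spark.table("mycat.some_namespace.some_table")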
Hope this helps.
Thanks,
Jungtaek Lim (HeartSaVioR)
On Mon, Oct 5, 2020 at 8:53 PM Moser, Michael
<michael.mo...@siemens-healthineers.com> wrote:
Hi there,
I’m just wondering if there is any incentive to implement
read/write methods in the DataFrameReader/DataFrameWriter for
delta similar to e.g. parquet?
For example, using PySpark, “spark.read.parquet” is available, but
“spark.read.delta” is not (same for write).
In my opinion, “spark.read.delta” feels cleaner and more pythonic
compared to “spark.read.format(‘delta’).load()”, especially when
more options, like “mode”, come into play.
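Concretely, this is what I mean (the shorthand lines are what I would
like to write; they are not an existing API):

# today's official way, via format(), assuming an active SparkSession
# `spark` with Delta Lake on the classpath
df = spark.read.format("delta").load("/data/events")
df.write.format("delta").mode("overwrite").save("/data/events_copy")

# the shorthand I have in mind (not available today):
# df = spark.read.delta("/data/events")
# df.write.delta("/data/events_copy", mode="overwrite")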
Can anyone explain the reasoning behind this? Is it due to the
Java nature of Spark?
From a pythonic point of view, I could also imagine a single
read/write method, with the format as an arg and kwargs related to
the different file format options.
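Something like this sketch (the helper name read_any is just made up to
illustrate the idea):

def read_any(spark, path, format, **options):
    # single entry point: the format is an argument, kwargs are passed
    # through as reader options
    return spark.read.format(format).options(**options).load(path)

df = read_any(spark, "/data/people.csv", "csv", header="true", inferSchema="true")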
Best,
Michael