DataSourceV2 sync notes - 29 May 2019

Ryan Blue Thu, 30 May 2019 15:19:31 -0700

Here are my notes from last night’s sync. I had to leave early, so there
may be more discussion. Others can fill in the details for those topics.


*Attendees*:

John Zhuge
Ryan Blue
Yifei Huang
Matt Cheah
Yuanjian Li
Russell Spitzer
Kevin Yu

*Topics*:

   - Atomic extensions for the TableCatalog API
   - Moving DSv2 to Catalyst - should this include package renames?
   - Catalogs and table resolution: proposal to prefer default v2 catalog
   when defined

*Notes*:

   - Skipping discussion of open PRs
   - Atomic table catalogs:
      - Matt: the proposal in the SPIP makes sense. When should Spark use
      the atomic API? Is there a way for a user to signal that Spark should use
      the staging calls? Spark could use SQL transaction statements for this.
      - Ryan: the atomic operations that we are currently targeting with
      the TableCatalog extensions are single statements, like CREATE TABLE AS
      SELECT. Transaction statements (e.g., BEGIN) are for multi-statement
      transactions and are out of scope.
      - Ryan: Because the expected behavior of the commands (CTAS, RTAS) is
      that they are atomic, Spark should always use atomic implementations
      if they are available. No need for a user to opt in.
      - Matt: What should REPLACE TABLE do if transactions are not
      supported? If the write fails, the table would be deleted.
      - Ryan: REPLACE is a combination of DROP TABLE and CREATE TABLE AS
      SELECT. By using it, the user is signaling that if a combined operation
      is possible, Spark should use it. So REPLACE TABLE signals intent to
      drop, and it is the right thing to drop the table if an atomic replace
      is not supported.
      - There was also some confusion about whether IF EXISTS should be
      supported. The consensus was that REPLACE TABLE AS SELECT is expected
      to be idempotent and should not fail if the target table does not
      exist.
   - Moving DSv2 to Catalyst - skipped because Wenchen did not attend
   - Catalogs and table resolution:
      - Ryan: Table resolution with catalogs is getting complicated when
      namespaces overlap. If an identifier has a catalog, then it is easy to
      use a v2 catalog. But when the identifier does not have a catalog,
      there is a namespace overlap between session catalog tables and the
      default v2 catalog tables. It would be much easier to understand and
      document if we used a simple rule for precedence. We suggest using the
      session catalog unless a default v2 catalog is defined, in which case
      the v2 catalog is used by default.
      - This makes the behavior easy to document and reason about, with few
      special cases. To guarantee compatibility, we will need a v2
      implementation that delegates to the session catalog.
      - Ryan: If there aren’t objections, I’ll raise this on the dev list.
      We should make a decision there.
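
To make the REPLACE TABLE discussion above concrete, here is a minimal
sketch of the two code paths, written as a toy Python model rather than
against the real TableCatalog API. Every name in it (InMemoryCatalog,
StagingCatalog, StagedTable, stage_replace, replace_table_as_select) is
hypothetical and only illustrates the behavior described in the notes: use
the atomic path whenever the catalog offers it, otherwise fall back to drop
followed by create.

```python
class InMemoryCatalog:
    """Hypothetical catalog with no atomic (staging) support."""

    def __init__(self):
        self.tables = {}

    def drop_table(self, name):
        # Returns False instead of failing when the table does not exist.
        return self.tables.pop(name, None) is not None

    def create_table(self, name, rows):
        self.tables[name] = list(rows)


class StagedTable:
    """Hypothetical staged replacement; invisible until commit()."""

    def __init__(self, catalog, name):
        self.catalog, self.name, self.rows = catalog, name, []

    def write(self, rows):
        self.rows.extend(rows)

    def commit(self):
        # Swap in the new contents in one step.
        self.catalog.tables[self.name] = self.rows


class StagingCatalog(InMemoryCatalog):
    """Hypothetical catalog that can stage a replacement atomically."""

    def stage_replace(self, name):
        return StagedTable(self, name)


def replace_table_as_select(catalog, name, query):
    """query is a callable that returns the rows to write; it may raise."""
    if hasattr(catalog, "stage_replace"):
        # Atomic path: no user opt-in. A failed query or write leaves the
        # existing table untouched because nothing was committed.
        staged = catalog.stage_replace(name)
        staged.write(query())
        staged.commit()
    else:
        # Non-atomic path: REPLACE signals intent to drop, so drop first.
        # If the query or write fails after this point, the table is gone.
        catalog.drop_table(name)
        catalog.create_table(name, query())
```

In this model the replace succeeds when the target table does not exist on
either path, which matches the idempotency consensus above.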

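The precedence rule proposed under "Catalogs and table resolution" can be
sketched as a small function. This is only an illustration of the rule as
stated in the notes, not Spark's resolver: resolve_catalog, the "session"
label, and the idea of passing the registered catalog names as a set are all
assumptions made for the example.

```python
SESSION_CATALOG = "session"  # hypothetical label for the session catalog


def resolve_catalog(parts, catalogs, default_v2_catalog=None):
    """Pick a catalog for a multi-part identifier.

    parts: identifier split into name parts, e.g. ["prod", "db", "t"].
    catalogs: set of registered v2 catalog names (assumed input).
    default_v2_catalog: name of the default v2 catalog, or None if unset.
    Returns (catalog_name, remaining_parts).
    """
    # An identifier that names a catalog explicitly is the easy case.
    if len(parts) > 1 and parts[0] in catalogs:
        return parts[0], parts[1:]
    # No catalog in the identifier: prefer the default v2 catalog if defined.
    if default_v2_catalog is not None:
        return default_v2_catalog, parts
    # Otherwise fall back to the session catalog.
    return SESSION_CATALOG, parts
```

For example, with catalogs {"prod"} and no default set, ["db", "t"] resolves
to the session catalog; once a default v2 catalog is defined, the same
identifier resolves to that catalog instead. Keeping existing queries
working in that second case is why a v2 implementation that delegates to
the session catalog is needed.
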
-- 
Ryan Blue
Software Engineer
Netflix
