This sounds similar to the Confluent Schema Registry and Kafka Connect.
The Schema Registry and Kafka Connect themselves are open source, but some of
the datasource-specific adapters, and the GUIs to manage it all, are not
(see the Confluent Enterprise Edition).
Note that the Schema Registry and Kafka Connect are generic tools, not
Spark-specific.
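
For reference, the Schema Registry exposes a plain REST API, so you don't need
any Confluent client library to look up schemas. A minimal sketch in Python
(the registry URL and subject name are illustrative, not from any real setup):

```python
import json
import urllib.request

REGISTRY = "http://localhost:8081"  # assumed address of a local Schema Registry

def latest_schema_url(subject: str) -> str:
    # Documented endpoint: GET /subjects/{subject}/versions/latest
    return f"{REGISTRY}/subjects/{subject}/versions/latest"

def fetch_latest_schema(subject: str) -> dict:
    # Response is JSON containing "subject", "version", "id", and "schema".
    with urllib.request.urlopen(latest_schema_url(subject)) as resp:
        return json.loads(resp.read())
```

Consumers typically call this once per subject and cache the result, so schema
lookups don't sit on the hot path of the stream.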
Regards, Simon
> On 08.07.2017 at 19:49, Benjamin Kim wrote:
>
> Has anyone seen AWS Glue? I was wondering if something similar is going to be
> built into Spark Structured Streaming. I like the Data Catalog idea to store
> and track any data source/destination. It profiles the data to derive the
> schema and data types, and it also does a sort of automated schema evolution
> if the schema changes. That leaves only the transformation logic to the ETL
> developer. I think some of this could enhance or simplify Structured
> Streaming. For example, AWS S3 could be catalogued as a Data Source; in
> Structured Streaming, the Input DataFrame would be created like a SQL view
> based off the S3 Data Source; lastly, the transform logic, if any, just
> manipulates the data going from the Input DataFrame to the Result DataFrame,
> which is another view based off a catalogued Data Destination. This would
> relieve the ETL developer from caring about any Data Source or Destination.
> All server information, access credentials, data schemas, folder directory
> structures, file formats, and any other properties could be securely stored
> away, accessible to only a select few.
>
> I'm just curious to know if anyone has thought the same thing.
>
> Cheers,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
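The catalog workflow Ben describes could be sketched roughly like this, with a
hypothetical in-memory catalog standing in for something like Glue's Data
Catalog (names such as `CatalogEntry`, `register`, and `run_etl` are
illustrative, not any real API):

```python
from dataclasses import dataclass, field

# Hypothetical in-memory stand-in for a data catalog. Each entry keeps the
# connection details and schema out of the ETL developer's code.
@dataclass
class CatalogEntry:
    location: str                               # e.g. an S3 path; opaque to the ETL code
    file_format: str                            # e.g. "json", "parquet"
    schema: dict = field(default_factory=dict)  # column name -> type name
    credentials_ref: str = ""                   # pointer into a secret store, never inline

catalog: dict[str, CatalogEntry] = {}

def register(name: str, entry: CatalogEntry) -> None:
    catalog[name] = entry

def evolve_schema(name: str, observed: dict) -> None:
    """Naive automated schema evolution: newly observed columns are merged in,
    while already-catalogued columns keep their existing type."""
    entry = catalog[name]
    for col, typ in observed.items():
        entry.schema.setdefault(col, typ)

def run_etl(source: str, dest: str, transform, rows):
    # All I/O details (location, format, credentials) come from the catalog;
    # the ETL developer supplies only the transform between the two views.
    src, dst = catalog[source], catalog[dest]
    return [transform(row) for row in rows]

register("s3_events", CatalogEntry("s3://bucket/events/", "json",
                                   {"user": "string", "ts": "timestamp"}))
register("s3_results", CatalogEntry("s3://bucket/results/", "parquet"))

# A new "country" column shows up in the data; the catalog absorbs it.
evolve_schema("s3_events", {"user": "string", "ts": "timestamp",
                            "country": "string"})

out = run_etl("s3_events", "s3_results",
              lambda r: {**r, "user": r["user"].lower()},
              [{"user": "Ben", "ts": 1}])
```

In a real system the catalog would of course be a shared service with access
control, and `run_etl` would hand the catalogued locations and schemas to the
streaming engine rather than iterate over rows itself.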