Mark DataFrame/Dataset APIs stable

Reynold Xin Wed, 12 Oct 2016 21:26:55 -0700

I took a look at all the public APIs we expose in o.a.spark.sql tonight,
and realized we still have a large number of APIs that are marked
experimental. Most of these haven't really changed, except in 2.0 we merged
DataFrame and Dataset. I think it's long overdue to mark them stable.


I'm tracking this via ticket:
https://issues.apache.org/jira/browse/SPARK-17900

*The list I've come up with to graduate are*:

Dataset/DataFrame
- functions, since 1.3
- ColumnName, since 1.3
- DataFrameNaFunctions, since 1.3.1
- DataFrameStatFunctions, since 1.4
- UserDefinedFunction, since 1.3
- UserDefinedAggregateFunction, since 1.5
- Window and WindowSpec, since 1.4

Data sources:
- DataSourceRegister, since 1.5
- RelationProvider, since 1.3
- SchemaRelationProvider, since 1.3
- CreatableRelationProvider, since 1.3
- BaseRelation, since 1.3
- TableScan, since 1.3
- PrunedScan, since 1.3
- PrunedFilteredScan, since 1.3
- InsertableRelation, since 1.3


*The list I think we should definitely keep experimental are*:

- CatalystScan in data source (tied to internal logical plans so it is not
stable by definition)
- all classes related to Structured streaming (introduced new in 2.0 and
will likely change)


*The ones that I'm not sure whether we should graduate are:*

Typed operations for Datasets, including:
- all typed methods on Dataset class
- KeyValueGroupedDataset
- o.a.s.sql.expressions.javalang.typed
- o.a.s.sql.expressions.scalalang.typed
- methods that return typed Dataset in SparkSession

Most of these were introduced in 1.6 and had gone through drastic changes
in 2.0. I think we should try very hard not to break them any more, but we
might still run into issues in the future that require changing these.


Let me know what you think.

Mark DataFrame/Dataset APIs stable

Reply via email to