Re: [VOTE] Spark 2.1.2 (RC2)

2017-09-27 Thread Reynold Xin
+1


On Tue, Sep 26, 2017 at 9:47 PM, Holden Karau  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.1.2. The vote is open until Wednesday October 4th at 23:59 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.2
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.1.2-rc2
>  (fabbb7f59e47590114366d14e15fbbff8c88593c)
>
> List of JIRA tickets resolved in this release can be found with this
> filter.
> 
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~holden/spark-2.1.2-rc2-bin/
>
> Release artifacts are signed with a key from:
> https://people.apache.org/~holden/holdens_keys.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1251
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~holden/spark-2.1.2-rc2-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env, install the
> current RC, and see if anything important breaks. In Java/Scala, you can
> add the staging repository to your project's resolvers and test with the
> RC (make sure to clean up the artifact cache before/after so you don't
> end up building with an out-of-date RC going forward).
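
For the Java/Scala route, a minimal build.sbt sketch along these lines should work (the resolver is the staging repository listed above; the coordinates are the standard Spark artifacts, which are staged under the final version number — add whichever modules your project actually uses):

```
// build.sbt — minimal sketch for testing an existing project against the RC.
scalaVersion := "2.11.8"                   // Spark 2.1.x is built for Scala 2.11

// The staging repository from this vote thread.
resolvers += "Spark 2.1.2 RC2 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1251"

// Standard Spark coordinates; the RC artifacts are staged as version 2.1.2.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.2" % "provided"
)
```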
>
> *What should happen to JIRA tickets still targeting 2.1.2?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.3.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1. That being said, if
> there is something which is a regression from 2.1.1 that has not been
> correctly targeted, please ping a committer to help target the issue (you
> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
> 
> )
>
> *What are the unresolved* issues targeted for 2.1.2
> 
> ?
>
> At this time there are no open unresolved issues.
>
> *Is there anything different about this release?*
>
> This is the first release in a while not built on the AMPLAB Jenkins. This
> is good because it means future releases can more easily be built and
> signed securely (and I've been updating the documentation in
> https://github.com/apache/spark-website/pull/66 as I progress); however,
> the chances of a mistake are higher with any change like this. If there is
> something you normally take for granted as correct when checking a release,
> please double-check this time :)
>
> *Should I be committing code to branch-2.1?*
>
> Thanks for asking! Please treat this stage in the RC process as "code
> freeze" so bug fixes only. If you're uncertain if something should be back
> ported please reach out. If you do commit to branch-2.1 please tag your
> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
> 2.1.3 fixed into 2.1.2 as appropriate.
>
> *Why the longer voting window?*
>
> Since there is a large industry big data conference this week I figured
> I'd add a little bit of extra buffer time just to make sure everyone has a
> chance to take a look.
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: [VOTE] Spark 2.1.2 (RC2)

2017-09-27 Thread Kazuaki Ishizaki
+1 (non-binding)

I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for 
core/sql-core/sql-catalyst/mllib/mllib-local have passed.

$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

% build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 -T 24 clean package install
% build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core -pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local
...
Run completed in 12 minutes, 42 seconds.
Total number of tests run: 1035
Suites: completed 166, aborted 0
Tests: succeeded 1035, failed 0, canceled 0, ignored 5, pending 0
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Core ................................. SUCCESS [17:14 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [  4.067 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [08:23 min]
[INFO] Spark Project SQL .................................. SUCCESS [10:50 min]
[INFO] Spark Project ML Library ........................... SUCCESS [15:45 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 52:20 min
[INFO] Finished at: 2017-09-28T12:16:46+09:00
[INFO] Final Memory: 103M/309M
[INFO] ------------------------------------------------------------------------
[WARNING] The requested profile "hive" could not be activated because it 
does not exist.

Kazuaki Ishizaki



From:   Dongjoon Hyun 
To: Denny Lee 
Cc: Sean Owen , Holden Karau 
, "dev@spark.apache.org" 
Date:   2017/09/28 07:57
Subject:Re: [VOTE] Spark 2.1.2 (RC2)



+1 (non-binding)

Bests,
Dongjoon.


On Wed, Sep 27, 2017 at 7:54 AM, Denny Lee  wrote:
+1 (non-binding)


On Wed, Sep 27, 2017 at 6:54 AM Sean Owen  wrote:
+1

I tested the source release.
Hashes and signature (your signature) check out, project builds and tests 
pass with -Phadoop-2.7 -Pyarn -Phive -Pmesos on Debian 9.
List of issues look good and there are no open issues at all for 2.1.2.

Great work on improving the build process and docs.


On Wed, Sep 27, 2017 at 5:47 AM Holden Karau  wrote:
Please vote on releasing the following candidate as Apache Spark 
version 2.1.2. The vote is open until Wednesday October 4th at 23:59 
PST and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.2
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.1.2-rc2 (
fabbb7f59e47590114366d14e15fbbff8c88593c)

List of JIRA tickets resolved in this release can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~holden/spark-2.1.2-rc2-bin/

Release artifacts are signed with a key from:
https://people.apache.org/~holden/holdens_keys.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1251

The documentation corresponding to this release can be found at:
https://people.apache.org/~holden/spark-2.1.2-rc2-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions.

If you're working in PySpark you can set up a virtual env, install the
current RC, and see if anything important breaks. In Java/Scala, you can
add the staging repository to your project's resolvers and test with
the RC (make sure to clean up the artifact cache before/after so you don't
end up building with an out-of-date RC going forward).

What should happen to JIRA tickets still targeting 2.1.2?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked 
on immediately. Everything else please retarget to 2.1.3.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release 
unless the bug in question is a regression from 2.1.1. That being said, if
there is something which is a regression from 2.1.1 that has not been
correctly targeted, please ping a committer to help target the issue (you
can see the open issues listed as impacting Spark 2.1.1 & 2.1.2)

What are the unresolved issues targeted for 2.1.2?

At this time there are no open unresolved issues.

Is there anything different about this release?

This 

Inclusion of Spark on SDKMAN

2017-09-27 Thread Marco Vermeulen
Hi all,

My name is Marco and I am the project lead of SDKMAN. For those of you who are 
not familiar with the project, it is a FLOSS SDK management tool which allows 
you to install and switch seamlessly between multiple versions of the same SDK 
when using UNIX shells. You can read more about it on our website[1].

I’ve started using Spark myself on a project and was thinking that it would be
a very good candidate to be hosted on SDKMAN. The benefit becomes especially apparent
when you need to switch between versions of Spark while developing.

The reason I’m writing here is that our tool has an API that allows SDK
providers to push their own releases to our service. We don’t host the actual
binaries; the API merely enables our tool to point to your new release archives
and allows for super easy installation. This can be done either with a few simple
REST calls as part of your release process or automatically using our Maven
release plugin.

Would the Spark dev community be open to something like this? A recent poll on
Twitter shows a good appetite for Spark on SDKMAN among our users[2]. Also, we
already have many teams pushing to us in this manner, including Groovy, Kotlin,
Ceylon, OpenJDK, Gradle, and SBT, to name a few. Having Spark included would be
really great.

Apologies in advance if this is not the correct forum for release-related posts.
Many thanks,
Marco.

[1] http://sdkman.io
[2] https://twitter.com/sdkman_/status/907698363877003264


Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-27 Thread Joseph Bradley
This vote passes with 11 +1s (4 binding) and no +0s or -1s.

+1:
Sean Owen (binding)
Holden Karau
Denny Lee
Reynold Xin (binding)
Joseph Bradley (binding)
Noman Khan
Weichen Xu
Yanbo Liang
Dongjoon Hyun
Matei Zaharia (binding)
Vaquar Khan

Thanks everyone!
Joseph

On Sat, Sep 23, 2017 at 4:23 PM, vaquar khan  wrote:

> +1 looks good,
>
> Regards,
> Vaquar khan
>
> On Sat, Sep 23, 2017 at 12:22 PM, Matei Zaharia 
> wrote:
>
>> +1; we should consider something similar for multi-dimensional tensors
>> too.
>>
>> Matei
>>
>> > On Sep 23, 2017, at 7:27 AM, Yanbo Liang  wrote:
>> >
>> > +1
>> >
>> > On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan 
>> wrote:
>> > +1
>> >
>> > Regards
>> > Noman
>> > From: Denny Lee 
>> > Sent: Friday, September 22, 2017 2:59:33 AM
>> > To: Apache Spark Dev; Sean Owen; Tim Hunter
>> > Cc: Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
>> > Subject: Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
>> >
>> > +1
>> >
>> > On Thu, Sep 21, 2017 at 11:15 Sean Owen  wrote:
>> > Am I right that this doesn't mean other packages would use this
>> representation, but that they could?
>> >
>> > The representation looked fine to me w.r.t. what DL frameworks need.
>> >
>> > My previous comment was that this is actually quite lightweight. It's
>> kind of like how I/O support is provided for CSV and JSON, so it makes enough
>> sense to add to Spark. It doesn't really preclude other solutions.
>> >
>> > For those reasons I think it's fine. +1
>> >
>> > On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter 
>> wrote:
>> > Hello community,
>> >
>> > I would like to call for a vote on SPARK-21866. It is a short proposal
>> that has important applications for image processing and deep learning.
>> Joseph Bradley has offered to be the shepherd.
>> >
>> > JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
>> > PDF version: https://issues.apache.org/jira/secure/attachment/12884792/
>> SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
>> >
>> > Background and motivation
>> > As Apache Spark is being used more and more in the industry, some new
>> use cases are emerging for different data formats beyond the traditional
>> SQL types or the numerical types (vectors and matrices). Deep Learning
>> applications commonly deal with image processing. A number of projects add
>> some Deep Learning capabilities to Spark (see list below), but they
>> struggle to communicate with each other or with MLlib pipelines because
>> there is no standard way to represent an image in Spark DataFrames. We
>> propose to federate efforts for representing images in Spark by defining a
>> representation that caters to the most common needs of users and library
>> developers.
>> > This SPIP proposes a specification to represent images in Spark
>> DataFrames and Datasets (based on existing industrial standards), and an
>> interface for loading sources of images. It is not meant to be a
>> full-fledged image processing library, but rather the core description that
>> other libraries and users can rely on. Several packages already offer
>> various processing facilities for transforming images or doing more complex
>> operations, and each has various design tradeoffs that make them better as
>> standalone solutions.
>> > This project is a joint collaboration between Microsoft and Databricks,
>> which have been testing this design in two open source packages: MMLSpark
>> and Deep Learning Pipelines.
>> > The proposed image format is an in-memory, decompressed representation
>> that targets low-level applications. It is significantly more liberal in
>> memory usage than compressed image representations such as JPEG, PNG, etc.,
>> but it allows easy communication with popular image processing libraries
>> and has no decoding overhead.
>> > Target users and personas:
>> > Data scientists, data engineers, library developers.
>> > The following libraries define primitives for loading and representing
>> images, and will gain from a common interchange format (in alphabetical
>> order):
>> >   • BigDL
>> >   • DeepLearning4J
>> >   • Deep Learning Pipelines
>> >   • MMLSpark
>> >   • TensorFlow (Spark connector)
>> >   • TensorFlowOnSpark
>> >   • TensorFrames
>> >   • Thunder
>> > Goals:
>> >   • Simple representation of images in Spark DataFrames, based on
>> pre-existing industrial standards (OpenCV)
>> >   • This format should eventually allow the development of
>> high-performance integration points with image processing libraries such as
>> libOpenCV, Google TensorFlow, CNTK, and other C libraries.
>> >   • The reader should be able to read popular formats of images
>> from distributed sources.
>> > Non-Goals:
>> > Images are a versatile medium and encompass a very wide range of
>> formats and 
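
To ground the representation question discussed above, here is a rough sketch of the kind of single-column image struct the SPIP describes (the field names are illustrative only; the SPIP PDF linked above is the authoritative definition):

```
// Illustrative sketch only — see the SPIP PDF for the actual proposed schema.
import org.apache.spark.sql.types._

val imageStruct = StructType(Seq(
  StructField("origin", StringType),      // URI the image was loaded from
  StructField("height", IntegerType),
  StructField("width", IntegerType),
  StructField("nChannels", IntegerType),  // e.g. 1 (grayscale), 3 (BGR), 4 (BGRA)
  StructField("mode", IntegerType),       // OpenCV pixel type, e.g. CV_8UC3
  StructField("data", BinaryType)         // decompressed pixel bytes
))
```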

Re: [VOTE] Spark 2.1.2 (RC2)

2017-09-27 Thread Dongjoon Hyun
+1 (non-binding)

Bests,
Dongjoon.


On Wed, Sep 27, 2017 at 7:54 AM, Denny Lee  wrote:

> +1 (non-binding)
>
>
> On Wed, Sep 27, 2017 at 6:54 AM Sean Owen  wrote:
>
>> +1
>>
>> I tested the source release.
>> Hashes and signature (your signature) check out, project builds and tests
>> pass with -Phadoop-2.7 -Pyarn -Phive -Pmesos on Debian 9.
>> List of issues look good and there are no open issues at all for 2.1.2.
>>
>> Great work on improving the build process and docs.
>>
>>
>> On Wed, Sep 27, 2017 at 5:47 AM Holden Karau 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.2. The vote is open until Wednesday October 4th at 23:59
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.2-rc2
>>>  (fabbb7f59e47590114366d14e15fbbff8c88593c)
>>>
>>> List of JIRA tickets resolved in this release can be found with this
>>> filter.
>>> 
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~holden/spark-2.1.2-rc2-bin/
>>>
>>> Release artifacts are signed with a key from:
>>> https://people.apache.org/~holden/holdens_keys.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1251
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~holden/spark-2.1.2-rc2-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env, install
>>> the current RC, and see if anything important breaks. In Java/Scala,
>>> you can add the staging repository to your project's resolvers and test with
>>> the RC (make sure to clean up the artifact cache before/after so you
>>> don't end up building with an out-of-date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.3.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1. That being said,
>>> if there is something which is a regression from 2.1.1 that has not
>>> been correctly targeted, please ping a committer to help target the issue
>>> (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>>> 
>>> )
>>>
>>> *What are the unresolved* issues targeted for 2.1.2
>>> 
>>> ?
>>>
>>> At this time there are no open unresolved issues.
>>>
>>> *Is there anything different about this release?*
>>>
>>> This is the first release in a while not built on the AMPLAB Jenkins.
>>> This is good because it means future releases can more easily be built and
>>> signed securely (and I've been updating the documentation in
>>> https://github.com/apache/spark-website/pull/66 as I progress); however,
>>> the chances of a mistake are higher with any change like this. If there is
>>> something you normally take for granted as correct when checking a release,
>>> please double-check this time :)
>>>
>>> *Should I be committing code to branch-2.1?*
>>>
>>> Thanks for asking! Please treat this stage in the RC process as "code
>>> freeze" so bug fixes only. If you're uncertain if something should be back
>>> ported please reach out. If you do commit to branch-2.1 please tag your
>>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
>>> 2.1.3 fixed into 2.1.2 as appropriate.
>>>
>>> *Why the longer voting window?*
>>>
>>> Since there is a large industry big data conference this week I figured
>>> I'd add a little bit of extra buffer time just to make sure everyone has a
>>> chance to take a look.
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>


Re: [discuss] Data Source V2 write path

2017-09-27 Thread Russell Spitzer
On an unrelated note, is there any appetite for making the write path also
include an option to return elements that could not be processed for some
reason?

Usage might be like

saveAndIgnoreFailures() : Dataset

So that if some records cannot be parsed by the datasource for writing, or
violate some contract with the datasource, the records can be returned for
further processing or dealt with by an alternate system.
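
A purely hypothetical sketch of how that might look from the caller's side (neither the method nor the source name below exists today; they only illustrate the shape of the proposal):

```
// Hypothetical only — saveAndIgnoreFailures() is a proposal, not an existing API.
import org.apache.spark.sql.{Dataset, Row}

// Given some existing DataFrame `df` and a sink that may reject individual records:
val rejected: Dataset[Row] =
  df.write
    .format("some.datasource")        // placeholder source name
    .saveAndIgnoreFailures()          // proposed: hand back the unprocessable rows

// Route the rejected rows to an alternate system for inspection or replay.
rejected.write.json("/tmp/rejected-rows")
```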

On Wed, Sep 27, 2017 at 12:40 PM Ryan Blue 
wrote:

> Comments inline. I've written up what I'm proposing with a bit more detail.
>
> On Tue, Sep 26, 2017 at 11:17 AM, Wenchen Fan  wrote:
>
>> I'm trying to give a summary:
>>
>> Ideally the data source API should only deal with data, not metadata. But one
>> key problem is, Spark still needs to support data sources without a metastore,
>> e.g. file format data sources.
>>
>> For this kind of data source, users have to pass metadata
>> information like partitioning/bucketing to every write action of a
>> "table" (or other identifiers like the path of a file format data source), and
>> it's the user's responsibility to make sure this metadata information is
>> consistent. If it's inconsistent, the behavior is undefined; different data
>> sources may have different behaviors.
>>
>
> Agreed so far. One minor point is that we currently throw an exception if
> you try to configure, for example, partitioning and also use `insertInto`.
>
>
>> If we agree on this, then the data source write API should have a way to pass
>> this metadata information, and I think using data source options is a good
>> choice because it's the most implicit way and doesn't require new APIs.
>>
>
> What I don't understand is why we "can't avoid this problem" unless you
> mean the last point, that we have to support this. I don't think that using
> data source options is a good choice, but maybe I don't understand the
> alternatives. Here's a straw-man version of what I'm proposing so you can
> tell me what's wrong with it or why options are a better choice.
>
> I'm assuming we start with a query like this:
> ```
>
> df.write.partitionBy("utc_date").bucketBy("primary_key").format("parquet").saveAsTable("s3://bucket/path/")
> ```
>
> That creates a logical node, `CreateTableAsSelect`, with some options. It
> would contain a `Relation` (or `CatalogTable` definition?) that corresponds
> to the user's table name and `partitionBy`, `format`, etc. calls. It should
> also have a write mode and the logical plan for `df`.
>
> When this CTAS logical node is turned into a physical plan, the relation
> gets turned into a `DataSourceV2` instance and then Spark gets a writer and
> configures it with the proposed API. The main point of this is to pass the
> logical relation (with all of the user's options) through to the data
> source, not the writer. The data source creates the writer and can tell the
> writer what to do. Another benefit of this approach is that the relation
> gets resolved during analysis, when it is easy to add sorts and other
> requirements to the logical plan.
>
> If we were to implement what I'm suggesting, then we could handle metadata
> conflicts outside of the `DataSourceV2Writer`, in the data source. That
> eliminates problems about defining behavior when there are conflicts (the
> next point) and prepares implementations for a catalog API that would
> standardize how those conflicts are handled. In the short term, this
> doesn't have to be in a public API yet. It can be special handling for
> HadoopFS relations that we can eventually use underneath a public API.
>
> Please let me know if I've misunderstood something. Now that I've written
> out how we could actually implement conflict handling outside of the
> writer, I can see that it isn't as obvious of a change as I thought. But, I
> think in the long term this would be a better way to go.
>
>
>> But then we have another problem: how to define the behavior for data
>> sources with a metastore when the given options contain metadata information?
>> A typical case is `DataFrameWriter.saveAsTable`: when a user calls it with
>> partition columns, he doesn't know what will happen. The table may not
>> exist and he may create the table successfully with the specified partition
>> columns, or the table may already exist with inconsistent partition columns
>> and Spark throws an exception. Besides, save mode doesn't play well in this
>> case, as we may need different save modes for data and metadata.
>>
>> My proposal: the data source API should only focus on data, but concrete data
>> sources can implement some dirty features via options. E.g. file format
>> data sources can take partitioning/bucketing from options, and a data source with
>> a metastore can use a special flag in options to indicate a create-table
>> command (without writing data).
>>
>
> I can see how this would make changes smaller, but I don't think it is a
> good thing to do. If we do this, then I think we will not really 

Re: [discuss] Data Source V2 write path

2017-09-27 Thread Ryan Blue
Comments inline. I've written up what I'm proposing with a bit more detail.

On Tue, Sep 26, 2017 at 11:17 AM, Wenchen Fan  wrote:

> I'm trying to give a summary:
>
> Ideally the data source API should only deal with data, not metadata. But one
> key problem is, Spark still needs to support data sources without a metastore,
> e.g. file format data sources.
>
> For this kind of data source, users have to pass metadata information
> like partitioning/bucketing to every write action of a "table" (or other
> identifiers like the path of a file format data source), and it's the user's
> responsibility to make sure this metadata information is consistent. If
> it's inconsistent, the behavior is undefined; different data sources may
> have different behaviors.
>

Agreed so far. One minor point is that we currently throw an exception if
you try to configure, for example, partitioning and also use `insertInto`.


> If we agree on this, then the data source write API should have a way to pass
> this metadata information, and I think using data source options is a good
> choice because it's the most implicit way and doesn't require new APIs.
>

What I don't understand is why we "can't avoid this problem" unless you
mean the last point, that we have to support this. I don't think that using
data source options is a good choice, but maybe I don't understand the
alternatives. Here's a straw-man version of what I'm proposing so you can
tell me what's wrong with it or why options are a better choice.

I'm assuming we start with a query like this:
```
df.write.partitionBy("utc_date").bucketBy("primary_key").format("parquet").saveAsTable("s3://bucket/path/")
```

That creates a logical node, `CreateTableAsSelect`, with some options. It
would contain a `Relation` (or `CatalogTable` definition?) that corresponds
to the user's table name and `partitionBy`, `format`, etc. calls. It should
also have a write mode and the logical plan for `df`.

When this CTAS logical node is turned into a physical plan, the relation
gets turned into a `DataSourceV2` instance and then Spark gets a writer and
configures it with the proposed API. The main point of this is to pass the
logical relation (with all of the user's options) through to the data
source, not the writer. The data source creates the writer and can tell the
writer what to do. Another benefit of this approach is that the relation
gets resolved during analysis, when it is easy to add sorts and other
requirements to the logical plan.
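
As a purely illustrative sketch of that flow (none of these names are real Spark interfaces), the data source rather than Spark would receive the analyzed relation's metadata and decide how to construct and configure the writer:

```
// Illustrative only — not an existing or officially proposed interface.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.types.StructType

trait MetadataAwareDataSourceV2 {   // hypothetical name
  // Invoked during physical planning. `table` carries what the user declared on
  // the logical plan (partitioning, bucketing, format options), so metadata
  // conflicts are resolved in the source, not in the writer it hands back.
  def createWriter(table: CatalogTable, schema: StructType, mode: SaveMode): AnyRef
}
```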

If we were to implement what I'm suggesting, then we could handle metadata
conflicts outside of the `DataSourceV2Writer`, in the data source. That
eliminates problems about defining behavior when there are conflicts (the
next point) and prepares implementations for a catalog API that would
standardize how those conflicts are handled. In the short term, this
doesn't have to be in a public API yet. It can be special handling for
HadoopFS relations that we can eventually use underneath a public API.

Please let me know if I've misunderstood something. Now that I've written
out how we could actually implement conflict handling outside of the
writer, I can see that it isn't as obvious of a change as I thought. But, I
think in the long term this would be a better way to go.


> But then we have another problem: how to define the behavior for data
> sources with a metastore when the given options contain metadata information?
> A typical case is `DataFrameWriter.saveAsTable`: when a user calls it with
> partition columns, he doesn't know what will happen. The table may not
> exist and he may create the table successfully with the specified partition
> columns, or the table may already exist with inconsistent partition columns
> and Spark throws an exception. Besides, save mode doesn't play well in this
> case, as we may need different save modes for data and metadata.
>
> My proposal: the data source API should only focus on data, but concrete data
> sources can implement some dirty features via options. E.g. file format
> data sources can take partitioning/bucketing from options, and a data source with
> a metastore can use a special flag in options to indicate a create-table
> command (without writing data).
>

I can see how this would make changes smaller, but I don't think it is a
good thing to do. If we do this, then I think we will not really accomplish
what we want to with this (a clean write API).


> In other words, Spark connects users to data sources with a clean protocol
> that focuses only on data, but this protocol has a backdoor: the data source
> options. Concrete data sources are free to define how to deal with
> metadata, e.g. the Cassandra data source can ask users to create the table on the
> Cassandra side first, then write data on the Spark side, or ask users to
> provide more details in options and do CTAS on the Spark side. These can be
> done via options.
>
> After catalog federation, hopefully only file format data sources still
> use this backdoor.
>

Why would 

Re: [VOTE] Spark 2.1.2 (RC2)

2017-09-27 Thread Denny Lee
+1 (non-binding)


On Wed, Sep 27, 2017 at 6:54 AM Sean Owen  wrote:

> +1
>
> I tested the source release.
> Hashes and signature (your signature) check out, project builds and tests
> pass with -Phadoop-2.7 -Pyarn -Phive -Pmesos on Debian 9.
> List of issues look good and there are no open issues at all for 2.1.2.
>
> Great work on improving the build process and docs.
>
>
> On Wed, Sep 27, 2017 at 5:47 AM Holden Karau  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.2. The vote is open until Wednesday October 4th at 23:59 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v2.1.2-rc2
>>  (
>> fabbb7f59e47590114366d14e15fbbff8c88593c)
>>
>> List of JIRA tickets resolved in this release can be found with this
>> filter.
>> 
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://home.apache.org/~holden/spark-2.1.2-rc2-bin/
>>
>> Release artifacts are signed with a key from:
>> https://people.apache.org/~holden/holdens_keys.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1251
>>
>> The documentation corresponding to this release can be found at:
>> https://people.apache.org/~holden/spark-2.1.2-rc2-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env, install the
>> current RC, and see if anything important breaks. In Java/Scala, you can
>> add the staging repository to your project's resolvers and test with the
>> RC (make sure to clean up the artifact cache before/after so you don't
>> end up building with an out-of-date RC going forward).
>>
>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.3.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1. That being said,
>> if there is something which is a regression from 2.1.1 that has not been
>> correctly targeted, please ping a committer to help target the issue (you
>> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>> 
>> )
>>
>> *What are the unresolved* issues targeted for 2.1.2
>> 
>> ?
>>
>> At this time there are no open unresolved issues.
>>
>> *Is there anything different about this release?*
>>
>> This is the first release in a while not built on the AMPLAB Jenkins. This
>> is good because it means future releases can more easily be built and
>> signed securely (and I've been updating the documentation in
>> https://github.com/apache/spark-website/pull/66 as I progress); however,
>> the chances of a mistake are higher with any change like this. If there is
>> something you normally take for granted as correct when checking a release,
>> please double-check this time :)
>>
>> *Should I be committing code to branch-2.1?*
>>
>> Thanks for asking! Please treat this stage in the RC process as "code
>> freeze" so bug fixes only. If you're uncertain if something should be back
>> ported please reach out. If you do commit to branch-2.1 please tag your
>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
>> 2.1.3 fixed into 2.1.2 as appropriate.
>>
>> *Why the longer voting window?*
>>
>> Since there is a large industry big data conference this week I figured
>> I'd add a little bit of extra buffer time just to make sure everyone has a
>> chance to take a look.
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>


Re: [VOTE] Spark 2.1.2 (RC2)

2017-09-27 Thread Sean Owen
+1

I tested the source release.
Hashes and signature (your signature) check out, project builds and tests
pass with -Phadoop-2.7 -Pyarn -Phive -Pmesos on Debian 9.
List of issues look good and there are no open issues at all for 2.1.2.

Great work on improving the build process and docs.


On Wed, Sep 27, 2017 at 5:47 AM Holden Karau  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.1.2. The vote is open until Wednesday October 4th at 23:59 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.2
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.1.2-rc2
>  (
> fabbb7f59e47590114366d14e15fbbff8c88593c)
>
> List of JIRA tickets resolved in this release can be found with this
> filter.
> 
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~holden/spark-2.1.2-rc2-bin/
>
> Release artifacts are signed with a key from:
> https://people.apache.org/~holden/holdens_keys.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1251
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~holden/spark-2.1.2-rc2-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env, install the
> current RC, and see if anything important breaks. In Java/Scala, you can
> add the staging repository to your project's resolvers and test with the
> RC (make sure to clean up the artifact cache before/after so you don't
> end up building with an out-of-date RC going forward).
>
> *What should happen to JIRA tickets still targeting 2.1.2?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.3.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1. That being said, if
> there is something which is a regression from 2.1.1 that has not been
> correctly targeted, please ping a committer to help target the issue (you
> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
> 
> )
>
> *What are the unresolved* issues targeted for 2.1.2
> 
> ?
>
> At this time there are no open unresolved issues.
>
> *Is there anything different about this release?*
>
> This is the first release in a while not built on the AMPLAB Jenkins. This
> is good because it means future releases can more easily be built and
> signed securely (and I've been updating the documentation in
> https://github.com/apache/spark-website/pull/66 as I progress); however,
> the chances of a mistake are higher with any change like this. If there is
> something you normally take for granted as correct when checking a release,
> please double-check this time :)
>
> *Should I be committing code to branch-2.1?*
>
> Thanks for asking! Please treat this stage in the RC process as "code
> freeze" so bug fixes only. If you're uncertain if something should be back
> ported please reach out. If you do commit to branch-2.1 please tag your
> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
> 2.1.3 fixed into 2.1.2 as appropriate.
>
> *Why the longer voting window?*
>
> Since there is a large industry big data conference this week I figured
> I'd add a little bit of extra buffer time just to make sure everyone has a
> chance to take a look.
>
> --
> Twitter: https://twitter.com/holdenkarau
>