Re: [vote] Apache Spark 3.0 RC3

2020-06-14 Thread Dongjoon Hyun
Hi, Reynold.

Is there any progress on 3.0.0 release since the vote was finalized 5 days
ago?

Apparently, the `v3.0.0` tag has not been created yet, the binaries and docs
are still sitting in the voting location, Maven Central doesn't have the
artifacts, and the PySpark/SparkR uploads have not started yet.

https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/

Just as Apache Spark 2.0.1 shipped 316 fixes after 2.0.0, we already have 35
patches on top of `v3.0.0-rc3` and are expecting more.

Although we can have Apache Spark 3.0.1 very soon, before Spark+AI Summit,
Apache Spark 3.0.0 should be available in the Apache Spark distribution
channels because it passed the vote.

Releasing Apache Spark 3.0.0 itself helps the community adopt the 3.0-line
codebase and keeps the codebase healthy.

Please let us know if you need any help from the community for 3.0.0
release.

Thanks,
Dongjoon.


On Tue, Jun 9, 2020 at 9:41 PM Matei Zaharia 
wrote:

> Congrats! Excited to see the release posted soon.
>
> On Jun 9, 2020, at 6:39 PM, Reynold Xin  wrote:
>
> 
> I waited another day to account for the weekend. This vote passes with the
> following +1 votes and no -1 votes!
>
> I'll start the release prep later this week.
>
> +1:
> Reynold Xin (binding)
> Prashant Sharma (binding)
> Gengliang Wang
> Sean Owen (binding)
> Mridul Muralidharan (binding)
> Takeshi Yamamuro
> Maxim Gekk
> Matei Zaharia (binding)
> Jungtaek Lim
> Denny Lee
> Russell Spitzer
> Dongjoon Hyun (binding)
> DB Tsai (binding)
> Michael Armbrust (binding)
> Tom Graves (binding)
> Bryan Cutler
> Huaxin Gao
> Jiaxin Shan
> Xingbo Jiang
> Xiao Li (binding)
> Hyukjin Kwon (binding)
> Kent Yao
> Wenchen Fan (binding)
> Shixiong Zhu (binding)
> Burak Yavuz
> Tathagata Das (binding)
> Ryan Blue
>
> -1: None
>
>
>
> On Sat, Jun 06, 2020 at 1:08 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.0.0.
>>
>> The vote is open until [DUE DAY] and passes if a majority of +1 PMC votes
>> are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.0.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.0.0-rc3 (commit
>> 3fdfce3120f307147244e5eaf46d61419a723d50):
>> https://github.com/apache/spark/tree/v3.0.0-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1350/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
>>
>> The list of bug fixes going into 3.0.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>
>> This release is using the release script of the tag v3.0.0-rc3.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark, you can set up a virtual env, install
>> the current RC, and see if anything important breaks. In Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
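As a sketch, the PySpark check and the cache cleanup described above might look like the following. The tarball name under the staging directory is an assumption based on the RC layout, and the install step needs network access, so it is left commented out:

```shell
# Create an isolated virtual env so the RC doesn't touch any system install.
python3 -m venv /tmp/spark-3.0.0-rc3-env
. /tmp/spark-3.0.0-rc3-env/bin/activate

# Install the RC's PySpark tarball from the staging area (filename assumed):
# pip install "https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz"

# ... run your existing workload against the RC here ...

deactivate

# For Java/Scala tests, clear the stale RC artifacts from the local caches
# afterwards so later builds don't resolve against an out-of-date RC:
rm -rf ~/.ivy2/cache/org.apache.spark ~/.m2/repository/org/apache/spark
```

For sbt projects, the staging repository can be added as a resolver, e.g. `resolvers += "Spark RC staging" at "https://repository.apache.org/content/repositories/orgapachespark-1350/"`.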
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.0.0?
>> ===
>>
>> The current list of open tickets targeted at 3.0.0 can be found at
>> https://issues.apache.org/jira/projects/SPARK by searching for "Target
>> Version/s" = 3.0.0.
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is a regression that has not been
>> correctly targeted, please ping me or a committer to help target the
>> issue.
>>
>
>


Re: Handling user-facing metadata issues on file stream source & sink

2020-06-14 Thread Jungtaek Lim
Bumping again - I hope to get some traction, because these PRs either fix
long-standing problems or deliver noticeable improvements (each PR has
numbers/UI graphs to show the improvement).

Fixed long-standing problems:

* [SPARK-17604][SS] FileStreamSource: provide a new option to have
retention on input files [1]
* [SPARK-27188][SS] FileStreamSink: provide a new option to have retention
on output files [2]

There's no logic to control the size of the metadata for the file stream
source & file stream sink, which affects end users who run streaming queries
with many input/output files over the long run. Both patches address metadata
that grows without bound over time. As the low issue number of SPARK-17604
suggests, it's a fairly old problem, and at least three related issues have
been reported against SPARK-27188.
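To illustrate why unbounded metadata hurts, here is a toy model (not Spark's actual classes; the interval and file counts are made up): the sink's metadata log compacts every N batches by rewriting every entry seen so far into one compact file, so without retention each compaction grows linearly with the query's lifetime:

```python
# Toy model of a compacting metadata log (not Spark's implementation).
# Every `compact_interval` batches, the log rewrites ALL entries seen so
# far into one compact file; without retention, that file only grows.

def compact_sizes(total_batches, files_per_batch, compact_interval=10):
    """Return the entry count of each compact file over the query's life."""
    sizes = []
    entries = 0
    for batch in range(total_batches):
        entries += files_per_batch
        if (batch + 1) % compact_interval == 0:
            sizes.append(entries)  # compaction rewrites every entry so far
    return sizes

sizes = compact_sizes(total_batches=100, files_per_batch=5)
print(sizes[0], sizes[-1])  # 50 500
```

After 100 micro-batches the compact file is already 10x its initial size, and a long-running query keeps paying ever-larger rewrite costs; the retention options in the two PRs above are meant to cap this growth.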

Improvements:

* [SPARK-30866][SS] FileStreamSource: Cache fetched list of files beyond
maxFilesPerTrigger as unread files [3]
* [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log
twice if the query restarts from compact batch [4]
* [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with
LZ4 compression on FileStream(Source/Sink)Log [5]

The above patches provide better performance under the conditions described
in each PR. Notably, SPARK-30946 yields much faster compaction (~10x) on
every compact batch, and also shrinks the compact batch log file (to ~30% of
its current size).
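A rough sketch of the idea behind SPARK-30946, using zlib as a stand-in for LZ4 (the lz4 codec is not in the Python standard library) and an invented entry format; Spark's real log format and numbers differ. Pushing the serialized entries through a compressed binary stream shrinks the compact file substantially:

```python
import json
import zlib

# Hypothetical log entries: (path, size) pairs for committed files.
entries = [(f"/data/part-{i:05d}.parquet", 1024 * i) for i in range(1000)]

# Plain serde: one JSON object per line, as a text-based log might store them.
plain = "\n".join(
    json.dumps({"path": p, "size": s}) for p, s in entries
).encode()

# Compressed serde: the same bytes pushed through a compressed stream
# (zlib here; SPARK-30946 uses LZ4, which favors speed over ratio).
compressed = zlib.compress(plain)

print(len(compressed) < len(plain))  # True
```

Since metadata entries are highly repetitive (shared path prefixes, similar field names), even a fast codec compresses them well, which is why the PR can cut both compaction time and file size.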

1. https://github.com/apache/spark/pull/28422
2. https://github.com/apache/spark/pull/28363
3. https://github.com/apache/spark/pull/27620
4. https://github.com/apache/spark/pull/27649
5. https://github.com/apache/spark/pull/27694


On Fri, May 22, 2020 at 12:50 PM Jungtaek Lim 
wrote:

> Worth noting that I've received similar questions in my local community as
> well. These reporters didn't hit an edge case; they ran into the critical
> issue during normal runs of a streaming query.
>
> On Fri, May 8, 2020 at 4:49 PM Jungtaek Lim 
> wrote:
>
>> (bump to expose the discussion to more readers)
>>
>> On Mon, May 4, 2020 at 5:45 PM Jungtaek Lim 
>> wrote:
>>
>>> Hi devs,
>>>
>>> I'm seeing more and more structured streaming end users run into the
>>> metadata issues on the file stream source and sink. These have been
>>> known issues, with long-standing JIRA tickets filed before, and end
>>> users reported them again on the user@ mailing list in April.
>>>
>>> * Spark Structure Streaming | FileStreamSourceLog not deleting list of
>>> input files | Spark -2.4.0 [1]
>>> * [Structured Streaming] Checkpoint file compact file grows big [2]
>>>
>>> I've proposed various improvements in this area (see my PRs [3]) but
>>> they have suffered from a lack of interest/reviews. I feel the issue is
>>> critical (and under-estimated) because...
>>>
>>> 1. The file source/sink is one of the "built-in" data sources maintained
>>> by the Spark community. (End users may judge the state of the
>>> project/area by the quality of the built-in data sources, because that's
>>> what they start with.)
>>> 2. It's the only built-in data source that provides "end-to-end
>>> exactly-once" semantics in structured streaming.
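That exactly-once property rests on the sink's metadata log: readers trust the log of committed files rather than a raw directory listing, so files left behind by failed attempts are never read twice. A toy illustration of the contract (not Spark's code; file and log names are invented):

```python
import os
import tempfile

# Toy model of the file sink's exactly-once contract: data files land in
# the output directory, but only files recorded in the metadata log
# ("committed", standing in for _spark_metadata) are visible to readers.
outdir = tempfile.mkdtemp()
committed = []

def write_batch(batch_id, attempt, data, fail=False):
    path = os.path.join(outdir, f"part-{batch_id}-{attempt}")
    with open(path, "w") as f:
        f.write(data)
    if fail:
        return  # task died before commit: the file exists but is never logged
    committed.append(path)

write_batch(0, 0, "a")
write_batch(1, 0, "b", fail=True)  # failed attempt leaves an orphan file
write_batch(1, 1, "b")             # the retry commits

# A naive directory listing double-counts batch 1; the log does not.
print(len(os.listdir(outdir)), len(committed))  # 3 2
```

This is also why the metadata log cannot simply be truncated: dropping entries silently changes what downstream readers consider committed, which is what makes retention a design decision rather than a trivial cleanup.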
>>>
>>> I'd hope to see us address these issues so that end users can live with
>>> the built-in data source. (It doesn't need to be perfect, but it should
>>> at least be reasonable for long-running streaming workloads.) I know
>>> there are a couple of alternatives, but I don't think a newcomer would
>>> start there. End users may simply look for alternatives - not an
>>> alternative data source, but an alternative stream processing framework.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> 1.
>>> https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>>> 2.
>>> https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>>> 3. https://github.com/apache/spark/pulls/HeartSaVioR
>>>
>>