Re: Implementation for approx_count_distinct_sketch and associated functions

2023-01-20 Thread Ryan Berti
Hello,

I wanted to follow up and link to the Spark PR associated with these
changes; I'm excited to open up the implementation for community review!

For reference, I worked with @Daniel Tenedorio and the Databricks team on
a pre-review, which yielded some interesting
discussions about the existing HyperLogLogPlusPlus implementation. I think
we're in agreement that having a cross-compatible HLL++ implementation
would be super valuable, though I didn't attempt to take that work on in
this PR. I've included a format identifier in this implementation's HLL++
sketches to set us up for migrating to a cross-compatible sketch format /
HLL++ implementation in the future.
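
As a rough illustration of why persisted sketches allow re-aggregation
(a toy Python sketch, not the code in the PR and not Spark's
HyperLogLogPlusPlus): an HLL-style sketch is an array of registers, and two
sketches built with the same hash and precision merge by taking the
element-wise maximum. The names build_sketch and merge_sketches, the use of
SHA-256, and the precision of 10 bits are all assumptions chosen for this
example.

```python
import hashlib

P = 10          # precision bits (assumed for illustration): 2**P registers
M = 1 << P      # number of registers


def _hash64(value):
    """Hypothetical helper: derive a 64-bit hash from the value's string form."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_sketch(values):
    """Build a toy HLL-style register array from an iterable of values."""
    registers = [0] * M
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - P)                       # top P bits select a register
        rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1 bit
        registers[idx] = max(registers[idx], rank)
    return registers


def merge_sketches(a, b):
    """Re-aggregate two sketches: element-wise max of their registers."""
    return [max(x, y) for x, y in zip(a, b)]
```

Merging the sketches of two partitions gives the same registers as sketching
the combined data, which is what makes re-aggregation of stored sketches
possible without rescanning the raw values.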

Thanks

Ryan Berti

Senior Data Engineer  |  Ads DE

M 7023217573

5808 W Sunset Blvd  |  Los Angeles, CA 90028



On Wed, Jan 11, 2023 at 3:23 PM Ryan Berti wrote:

> Hello!
>
> I've recently wanted to write the sketches associated with the
> approx_count_distinct function to allow for distinct count re-aggregation.
> This 2019 Databricks post proposes the use of spark-alchemy, and I've also
> seen other discussions which propose building custom UDAFs/UDFs to achieve
> the desired effect. Trino supports re-aggregating HLL sketches natively,
> and I figured Spark should also provide this functionality natively.
>
> After searching the Spark JIRA and this dev list, I found a few requests
> for this functionality:
>
>    - Here's a ticket that was closed (due to inactivity?) in 2019, where
>      there seemed to be agreement that adding the requested implementation
>      would be simple
>    - Here's a discussion in this dev list from 2015, which focuses on
>      implementing the functionality via legacy(?) APIs, and
>      interoperability between HLL implementations.
>
> I've implemented two new agg functions and a new misc function that
> handle HLL++ sketches, and I'd like to open my implementation up for
> review. Can someone help me re-open SPARK-16484, and then I'll move
> forward with opening a PR against the main spark repo?
>
> Thanks!
>
> Ryan Berti


Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-20 Thread Jungtaek Lim
Heads-up: this is addressed via
https://issues.apache.org/jira/browse/SPARK-42075. We only marked the
entry point of DStream, StreamingContext, as deprecated. Marking every
class in the DStream module is not pragmatic, and users would see the
warning message anyway.

On Mon, Jan 16, 2023 at 8:26 AM Jungtaek Lim wrote:

> Given that I got positive votes from more than 3 PMC members as well as
> from several active contributors, I will proceed with the actual work.
> (It may take a couple more days, as folks in the US will help me and
> there's a holiday in the US.)
>
> Please let me know if we want to have an official vote thread before
> moving forward.
>
> Thanks all for providing your voices on this!
>
> On Sat, Jan 14, 2023 at 3:56 AM Anish Shrigondekar <
> anish.shrigonde...@databricks.com> wrote:
>
>> +1 on the DStreams deprecation proposal
>>
>> On Fri, Jan 13, 2023 at 10:47 AM Jerry Peng wrote:
>>
>>> +1 in general for marking the DStreams API as deprecated
>>>
>>> Jungtaek, can you please elaborate on the concrete actions you intend
>>> to take for the deprecation process?
>>>
>>> Best,
>>>
>>> Jerry
>>>
>>> On Thu, Jan 12, 2023 at 11:16 PM L. C. Hsieh wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Jan 12, 2023 at 10:39 PM Jungtaek Lim wrote:
>>>> >
>>>> > Yes, exactly. I'm sorry to bring confusion - should have clarified
>>>> > action items on the proposal.
>>>> >
>>>> > On Fri, Jan 13, 2023 at 3:31 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>> >>
>>>> >> Then, could you elaborate `the proposed code change` specifically?
>>>> >> Maybe, usual deprecation warning logs and annotation on the API?
>>>> >>
>>>> >>
>>>> >> On Thu, Jan 12, 2023 at 10:05 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>> >>>
>>>> >>> Maybe I need to clarify - my proposal is "explicitly" deprecating
>>>> >>> it, which incurs code change for sure. Guidance on the Spark
>>>> >>> website is done already as I mentioned - we updated the DStream
>>>> >>> doc page to mention that DStream is a "legacy" project and users
>>>> >>> should move to SS. I don't feel this is sufficient to refrain
>>>> >>> users from using it, hence initiating this proposal.
>>>> >>>
>>>> >>> Sorry to make confusion. I just wanted to make sure the goal of
>>>> >>> the proposal is not "removing" the API. The discussion on the
>>>> >>> removal of API doesn't tend to go well, so I wanted to make sure
>>>> >>> I don't mean that.
>>>> >>>
>>>> >>> On Fri, Jan 13, 2023 at 2:46 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>> >>>>
>>>> >>>> +1 for the proposal (guiding only without any code change).
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> Dongjoon.
>>>> >>>>
>>>> >>>> On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu wrote:
>>>> >>>>>
>>>> >>>>> +1
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Thu, Jan 12, 2023 at 5:08 PM Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>> +1
>>>> >>>>>>
>>>> >>>>>> On Thu, Jan 12, 2023 at 7:46 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>> +1
>>>> >>>>>>>
>>>> >>>>>>> On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> bump for more visibility.
>>>> >>>>>>>>
>>>> >>>>>>>> On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>> Hi dev,
>>>> >>>>>>>>>
>>>> >>>>>>>>> I'd like to propose the deprecation of DStream in Spark 3.4,
>>>> >>>>>>>>> in favor of promoting Structured Streaming.
>>>> >>>>>>>>> (Sorry for the late proposal, if we don't make the change in
>>>> >>>>>>>>> 3.4, we will have to wait for another 6 months.)
>>>> >>>>>>>>>
>>>> >>>>>>>>> We have been focusing on Structured Streaming for years
>>>> >>>>>>>>> (across multiple major and minor versions), and during the
>>>> >>>>>>>>> time we haven't made any improvements for DStream.
>>>> >>>>>>>>> Furthermore, recently we updated the DStream doc to
>>>> >>>>>>>>> explicitly say DStream is a legacy project.
>>>> >>>>>>>>>
>>>> >>>>>>>>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>>>> >>>>>>>>>
>>>> >>>>>>>>> The baseline of deprecation is that we don't see a particular
>>>> >>>>>>>>> use case which only DStream solves. This is a different story
>>>> >>>>>>>>> with GraphX and MLLIB, as we don't have replacements for that.
>>>> >>>>>>>>>
>>>> >>>>>>>>> The proposal does not mean we will remove the API soon, as
>>>> >>>>>>>>> the Spark project has been making deprecation against public
>>>> >>>>>>>>> API. I don't intend to propose the target version for
>>>> >>>>>>>>> removal. The goal is to guide users to refrain from
>>>> >>>>>>>>> constructing a new workload with DStream. We might want to go
>>>> >>>>>>>>> with this in future, but it would require a new discussion
>>>> >>>>>>>>> thread at that time.
>>>> >>>>>>>>>
>>>> >>>>>>>>> What do you think?
>>>> >>>>>>>>>
>>>> >>>>>>>>> Thanks,
>>>> >>>>>>>>> Jungtaek Lim (HeartSaVioR)
