Re: [VOTE] Release Apache Druid 29.0.0 [RC1]

2024-02-16 Thread Gian Merlino
Here's a patch with the validation idea: 
https://github.com/apache/druid/pull/15920

It adds validation for the most problematic case (mixing strings and arrays), 
provides a way to override the validation, and makes the warning log on the 
controller task when arrayIngestMode is 'mvd' more friendly and explanatory.

Depending on which direction you're going in, the errors look like either:

  Cannot write into field[flags] using type[VARCHAR ARRAY] and 
arrayIngestMode[mvd], since the existing type is[VARCHAR ARRAY]. Try setting 
arrayIngestMode to[array] to retain the SQL type[VARCHAR ARRAY]

Or:

  Cannot write into field[flags] using type[VARCHAR ARRAY] and 
arrayIngestMode[array], since the existing type is[VARCHAR]. Try wrapping this 
field using ARRAY_TO_MV(...) AS "flags"

The "try" language pushes people towards the behavior we'd like for the future: 
using arrayIngestMode[array] and wrapping MVDs in ARRAY_TO_MV.
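
To make that concrete, here is a minimal sketch of the kind of SQL the second error message nudges people toward. It is not taken from the patch: the table and column names are made up, and it assumes the source column "flags" is a VARCHAR ARRAY, the existing target column is a classic multi-value VARCHAR, and the query context sets arrayIngestMode to 'array'.

  -- Hypothetical sketch only; names and schema are illustrative.
  -- ARRAY_TO_MV converts the VARCHAR ARRAY expression back into a
  -- multi-value string, so the existing MVD column type is preserved
  -- even with arrayIngestMode set to 'array'.
  INSERT INTO "example_table"
  SELECT
    "__time",
    ARRAY_TO_MV("flags") AS "flags"
  FROM "example_source"
  PARTITIONED BY DAY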

I'm changing my vote to a plain 0, given that _most_ of the changes related to 
arrayIngestMode went out in Druid 28. However I do think it would be nice to 
get a patch like this in, given that the Druid 29 web console is pushing more 
people to change their arrayIngestMode.

Gian

On 2024/02/16 22:24:23 Gian Merlino wrote:
> I just learned that arrayIngestMode is not actually new, just
> https://github.com/apache/druid/pull/15588 is. However this will still make
> it more likely that people accidentally break their tables, so I am still
> -0. Just, slightly less so. I still think it would be a good idea, for
> Druid 29, to add string-to-array type validation to Druid 29's INSERT /
> REPLACE handling to compensate for the new web console support for
> arrayIngestMode, and the UI changes to push people towards setting it to
> "array".
> 
> I could be convinced that it's ok to do that in a 29.0.1. I don't think it
> should wait for 30, given the impact that can happen if people end up with
> mixed types without planning for it.
> 
> On Fri, Feb 16, 2024 at 2:16 PM Gian Merlino  wrote:
> 
> > Thanks for managing this release!
> >
> > My vote is -0, let me explain why. I am concerned about usability issues
> > with the new arrayIngestMode feature. There are various issues when mixing
> > MVD strings and string arrays in the same column: as soon as arrays show up
> > in a column, various "classic MVD-style" queries will fail at validation or
> > at runtime. I believe that the new feature, and especially the changes to
> > the web console in https://github.com/apache/druid/pull/15588, will make
> > it more likely that people will do this by accident and experience
> > brokenness.
> >
> > When this occurs, there is not an easy way to fix it; data needs to be
> > reingested or queries need to be adjusted. I believe that in some cases,
> > queries can't be adjusted without suffering from performance loss.
> >
> > This is something that could happen even before Druid 29, if you did a
> > Kafka ingest with auto types or useSchemaDiscovery, followed by a SQL
> > REPLACE from that table into itself. In that case, arrays written by Kafka
> > ingest would get rewritten as MVDs. But with this Druid 29 RC, there are
> > additional pathways created that enable people to get into this problematic
> > scenario with increased likelihood. I suggest we make some adjustments that
> > prevent it, such as:
> >
> > - Validating that SQL INSERT and SQL REPLACE do not insert an array type
> > into a column that previously contained strings, or vice versa.
> > - Making it obvious in the web console whether a tab is in
> > "arrayIngestMode: array" or "arrayIngestMode: mvd" or "server default".
> >
> > Thank you for considering this viewpoint.
> >
> > Gian
> >
> > On Tue, Feb 13, 2024 at 4:34 AM Laksh Singla 
> > wrote:
> >
> >> Hi all,
> >>
> >> I have created a build for Apache Druid 29.0.0, release
> >> candidate 1.
> >>
> >> Thanks to everyone who has helped contribute to the release! You can read
> >> the proposed release notes here:
> >> https://github.com/apache/druid/issues/15896
> >>
> >> The release candidate has been tagged in GitHub as
> >> druid-29.0.0-rc1 (869bd3978f0c835ef8eb7c1f25c468e23472a81b),
> >> available here:
> >> https://github.com/apache/druid/tree/druid-29.0.0-rc1
> >>
> >> The artifacts to be voted on are located here:
> >> https://dist.apache.org/repos/dist/dev/druid/29.0.0-rc1/
> >>
> >> A staged Maven repository is available for review at:
> >> https://repository.apache.org/cont

Re: [VOTE] Release Apache Druid 29.0.0 [RC1]

2024-02-16 Thread Gian Merlino
I just learned that arrayIngestMode is not actually new, just
https://github.com/apache/druid/pull/15588 is. However this will still make
it more likely that people accidentally break their tables, so I am still
-0. Just, slightly less so. I still think it would be a good idea, for
Druid 29, to add string-to-array type validation to Druid 29's INSERT /
REPLACE handling to compensate for the new web console support for
arrayIngestMode, and the UI changes to push people towards setting it to
"array".

I could be convinced that it's ok to do that in a 29.0.1. I don't think it
should wait for 30, given the impact that can happen if people end up with
mixed types without planning for it.

On Fri, Feb 16, 2024 at 2:16 PM Gian Merlino  wrote:

> Thanks for managing this release!
>
> My vote is -0, let me explain why. I am concerned about usability issues
> with the new arrayIngestMode feature. There are various issues when mixing
> MVD strings and string arrays in the same column: as soon as arrays show up
> in a column, various "classic MVD-style" queries will fail at validation or
> at runtime. I believe that the new feature, and especially the changes to
> the web console in https://github.com/apache/druid/pull/15588, will make
> it more likely that people will do this by accident and experience
> brokenness.
>
> When this occurs, there is not an easy way to fix it; data needs to be
> reingested or queries need to be adjusted. I believe that in some cases,
> queries can't be adjusted without suffering from performance loss.
>
> This is something that could happen even before Druid 29, if you did a
> Kafka ingest with auto types or useSchemaDiscovery, followed by a SQL
> REPLACE from that table into itself. In that case, arrays written by Kafka
> ingest would get rewritten as MVDs. But with this Druid 29 RC, there are
> additional pathways created that enable people to get into this problematic
> scenario with increased likelihood. I suggest we make some adjustments that
> prevent it, such as:
>
> - Validating that SQL INSERT and SQL REPLACE do not insert an array type
> into a column that previously contained strings, or vice versa.
> - Making it obvious in the web console whether a tab is in
> "arrayIngestMode: array" or "arrayIngestMode: mvd" or "server default".
>
> Thank you for considering this viewpoint.
>
> Gian
>
> On Tue, Feb 13, 2024 at 4:34 AM Laksh Singla 
> wrote:
>
>> Hi all,
>>
>> I have created a build for Apache Druid 29.0.0, release
>> candidate 1.
>>
>> Thanks to everyone who has helped contribute to the release! You can read
>> the proposed release notes here:
>> https://github.com/apache/druid/issues/15896
>>
>> The release candidate has been tagged in GitHub as
>> druid-29.0.0-rc1 (869bd3978f0c835ef8eb7c1f25c468e23472a81b),
>> available here:
>> https://github.com/apache/druid/tree/druid-29.0.0-rc1
>>
>> The artifacts to be voted on are located here:
>> https://dist.apache.org/repos/dist/dev/druid/29.0.0-rc1/
>>
>> A staged Maven repository is available for review at:
>> https://repository.apache.org/content/repositories/orgapachedruid-1061/
>>
>> Staged druid.apache.org website documentation is available here:
>> https://druid.staged.apache.org/docs/29.0.0/design/
>>
>> A Docker image containing the binary of the release candidate can be
>> retrieved via:
>> docker pull apache/druid:29.0.0-rc1
>>
>> artifact checksums
>> src:
>>
>> 1948fab4500f3571591f887a638631b6a05f040b88b35004406ca852e16884e5d2269a74e8d07c79c477d8b02a2489efec85776f8ef46f4e8defedce4efb9931
>> bin:
>>
>> f2a11ddc71b59a648d01d7c220a6fae527c34d702d4f1e7aed954803a3576543c4cd7149f34e8948eccf89240435f09a8512db83b46991ac6e354eeba8cbada4
>> docker: 2bddcf692f2137dc4094908b63d2043423fca895ba0f479f0db34ec8016c4472
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/lakshsingla.asc
>>
>> This key and the key of other committers can also be found in the
>> project's
>> KEYS file here:
>> https://dist.apache.org/repos/dist/release/druid/KEYS
>>
>> (If you are a committer, please feel free to add your own key to that file
>> by following the instructions in the file's header.)
>>
>>
>> Verify checksums:
>> diff <(shasum -a512 apache-druid-29.0.0-src.tar.gz | \
>> cut -d ' ' -f1) \
>> <(cat apache-druid-29.0.0-src.tar.gz.sha512 ; echo)
>>
>> diff <(shasum -a512 apache-druid-29.0.0-bin.tar.gz | \
>> cut -d ' ' -f1) \
>> <(cat apache-druid-29.0.0-bin.tar.gz.sha512 ; echo)
>>

Re: [VOTE] Release Apache Druid 29.0.0 [RC1]

2024-02-16 Thread Gian Merlino
Thanks for managing this release!

My vote is -0, let me explain why. I am concerned about usability issues
with the new arrayIngestMode feature. There are various issues when mixing
MVD strings and string arrays in the same column: as soon as arrays show up
in a column, various "classic MVD-style" queries will fail at validation or
at runtime. I believe that the new feature, and especially the changes to
the web console in https://github.com/apache/druid/pull/15588, will make it
more likely that people will do this by accident and experience brokenness.

When this occurs, there is not an easy way to fix it; data needs to be
reingested or queries need to be adjusted. I believe that in some cases,
queries can't be adjusted without suffering from performance loss.

This is something that could happen even before Druid 29, if you did a
Kafka ingest with auto types or useSchemaDiscovery, followed by a SQL
REPLACE from that table into itself. In that case, arrays written by Kafka
ingest would get rewritten as MVDs. But with this Druid 29 RC, there are
additional pathways created that enable people to get into this problematic
scenario with increased likelihood. I suggest we make some adjustments that
prevent it, such as:

- Validating that SQL INSERT and SQL REPLACE do not insert an array type
into a column that previously contained strings, or vice versa.
- Making it obvious in the web console whether a tab is in
"arrayIngestMode: array" or "arrayIngestMode: mvd" or "server default".

Thank you for considering this viewpoint.

Gian

On Tue, Feb 13, 2024 at 4:34 AM Laksh Singla  wrote:

> Hi all,
>
> I have created a build for Apache Druid 29.0.0, release
> candidate 1.
>
> Thanks to everyone who has helped contribute to the release! You can read
> the proposed release notes here:
> https://github.com/apache/druid/issues/15896
>
> The release candidate has been tagged in GitHub as
> druid-29.0.0-rc1 (869bd3978f0c835ef8eb7c1f25c468e23472a81b),
> available here:
> https://github.com/apache/druid/tree/druid-29.0.0-rc1
>
> The artifacts to be voted on are located here:
> https://dist.apache.org/repos/dist/dev/druid/29.0.0-rc1/
>
> A staged Maven repository is available for review at:
> https://repository.apache.org/content/repositories/orgapachedruid-1061/
>
> Staged druid.apache.org website documentation is available here:
> https://druid.staged.apache.org/docs/29.0.0/design/
>
> A Docker image containing the binary of the release candidate can be
> retrieved via:
> docker pull apache/druid:29.0.0-rc1
>
> artifact checksums
> src:
>
> 1948fab4500f3571591f887a638631b6a05f040b88b35004406ca852e16884e5d2269a74e8d07c79c477d8b02a2489efec85776f8ef46f4e8defedce4efb9931
> bin:
>
> f2a11ddc71b59a648d01d7c220a6fae527c34d702d4f1e7aed954803a3576543c4cd7149f34e8948eccf89240435f09a8512db83b46991ac6e354eeba8cbada4
> docker: 2bddcf692f2137dc4094908b63d2043423fca895ba0f479f0db34ec8016c4472
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/lakshsingla.asc
>
> This key and the key of other committers can also be found in the project's
> KEYS file here:
> https://dist.apache.org/repos/dist/release/druid/KEYS
>
> (If you are a committer, please feel free to add your own key to that file
> by following the instructions in the file's header.)
>
>
> Verify checksums:
> diff <(shasum -a512 apache-druid-29.0.0-src.tar.gz | \
> cut -d ' ' -f1) \
> <(cat apache-druid-29.0.0-src.tar.gz.sha512 ; echo)
>
> diff <(shasum -a512 apache-druid-29.0.0-bin.tar.gz | \
> cut -d ' ' -f1) \
> <(cat apache-druid-29.0.0-bin.tar.gz.sha512 ; echo)
>
> Verify signatures:
> gpg --verify apache-druid-29.0.0-src.tar.gz.asc \
> apache-druid-29.0.0-src.tar.gz
>
> gpg --verify apache-druid-29.0.0-bin.tar.gz.asc \
> apache-druid-29.0.0-bin.tar.gz
>
> Please review the proposed artifacts and vote. Note that Apache has
> specific requirements that must be met before +1 binding votes can be cast
> by PMC members. Please refer to the policy at
> http://www.apache.org/legal/release-policy.html#policy for more details.
>
> As part of the validation process, the release artifacts can be generated
> from source by running:
> mvn clean install -Papache-release,dist -Dgpg.skip
>
> The RAT license check can be run from source by:
> mvn apache-rat:check -Prat
>
> This vote will be open for at least 72 hours. The vote will pass if a
> majority of at least three +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Druid 29.0.0
> [ ] 0 I don't feel strongly about it, but I'm okay with the release
> [ ] -1 Do not release this package because...
>
> Thanks!
>


Re: on removing 'auto' strategy from native search query

2023-11-20 Thread Gian Merlino
We don't have usage data, but my sense is that the search query is not
commonly used, and among people that use the search query, it's not common
to rely on "druid.query.search.searchStrategy: auto". So I think it would
be ok to remove the feature and have "auto" be an alias for "useIndexes",
especially if as part of that, the "useIndexes" approach gets smarter.

On Wed, Nov 15, 2023 at 8:52 PM Clint Wylie  wrote:

> Hi all, just wanted to start a thread to discuss removing the 'auto'
> strategy from the native search query, which is the only thing that
> uses the 'estimateSelectivity' method of
>
> https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/index/BitmapColumnIndex.java#L33
> ,
> which is what I would actually like to remove to make implementing
> index suppliers a bit easier and also tidy some stuff up for some
> additional changes I would like to make.
>
> The 'auto' strategy uses these selectivity estimation methods to try
> to determine if it should actually use the indexes or not when
> performing a search query, however I've been making some improvements
> on making index usage itself automatically determine if it should use
> indexes or value matchers, which I think is the replacement. The first
> part is the changes in https://github.com/apache/druid/pull/13977,
> which currently only applies to 'auto' and 'json' columns, but the
> strategy could easily be applied to traditional string columns as
> well, and the future work I'd like to do would allow cursor creation
> to adaptively skip computing the remaining bitmaps of an AND filter
> once the intersection is selective enough to make further operations
> not worth the cost.
>
> Once these changes are in place, the 'useIndexes' strategy effectively
> becomes the 'auto' strategy because all index evaluation is automatic,
> and also much cheaper than the 'auto' strategy which currently has to
> repeat some work if it actually does decide to use indexes.
>
> I suppose the part worth discussing is if this 'auto' strategy is used
> enough for search queries that we need the complete replacement in
> place before we can remove it, or if I can just go ahead and remove it
> now, or if I should wire up the partial improvements used today by
> 'auto' and 'json' columns to regular string columns so at least a
> partial improvement is available. I think it would make it a bit
> easier for me to finish the remaining refactor for the improvements I
> have in mind if I can get this method out of my way entirely, but I
> can also probably work around it if necessary (e.g. if anyone is
> depending on it), so removing it probably isn't a blocker for
> finishing this work.
>
> Thoughts?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Druid Summit 2023 — call for speakers!

2023-09-11 Thread Gian Merlino
Hey Druids,

I am excited to write to you about this year's Druid Summit (
https://druidsummit.org/), an event being held virtually on December 5–6,
2023. The call for speakers is open here:
https://docs.google.com/forms/d/e/1FAIpQLSfoBZNh_IpSCT59fsYdTSSK92hYa7Rxf_7Fu0yBRCbK8ZwJdg/viewform

A title and a short abstract are OK for a proposal. It's a good opportunity to
share your story with the Druid community. If you have any questions about
putting together a proposal, I'd be happy to answer them. Looking forward
to seeing many of you there!

Gian


Re: CVEs in contrib extensions

2023-09-05 Thread Gian Merlino
I think it would be OK to have a policy that contrib extension dependencies
are not proactively screened for CVEs. If we adopt such a policy, we do
need to make it clear to people that they should do their own screening of
any contrib extensions they use.

However, we can't extend that policy to saying we don't take responsibility
for security of contrib extensions at all. If there is a vulnerability in
the code of a contrib extension itself, then we are obligated to fix it. If
we receive a vulnerability report about a contrib extension, including a
report about an issue with one of its dependencies (via the process at
https://www.apache.org/security/#reporting-a-vulnerability) then we should
take it seriously and investigate. This is the cost of having the code
exist at all and be part of our source releases. We can only avoid _those_
costs by removing an extension completely.

On Mon, Sep 4, 2023 at 3:02 AM Abhishek Agarwal  wrote:

> Hello all
> What is our current policy about addressing CVEs in contrib extensions if
> we have one? As of now, before the release, the release manager will either
> try to fix the CVEs or add a suppression if applicable. Unless any
> developer has done that same work before the release process begins. This,
> however, is a tedious exercise for the release manager and for us
> maintainers. With contrib extensions added to the mix, there is a huge
> surface area for us to cover when it comes to managing CVEs in
> dependencies.
>
> I propose excluding contrib extensions from our CVE checks so that RM can
> ignore those CVEs during the release. We don't ship the contrib extensions
> in distribution anyway, so it seems like a reasonable stance to me.
>


Re: New Committer : Soumyava Das

2023-08-23 Thread Gian Merlino
Congratulations!!

On Mon, Aug 21, 2023 at 9:13 AM Karan Kumar  wrote:

> Hello everyone,
>
> The Project Management Committee (PMC) for Apache Druid has invited
> Soumyava to become a committer and we are pleased to announce that
> Soumyava has accepted.
>
> Soumyava has been a consistent contributor for over a year now. He has over
> 29 commits in Druid. The majority of his commits are in the Calcite layer and
> the query processing layer. His major contributions are:
> 1. Developing the unnest feature for array typed columns.
> 2. Fixing various bugs in the druid query planning and query processing
> area.
> 3. Vectorizing druid aggregators for faster processing.
> 4. Calcite 1.35 upgrade.
>
>
> Congratulations Soumyava.
>


Re: New Committer : Adarsh Sanjeev

2023-08-23 Thread Gian Merlino
Congratulations!!

On Mon, Aug 21, 2023 at 8:14 AM Karan Kumar  wrote:

> Hello everyone,
>
> The Project Management Committee (PMC) for Apache Druid has invited
> Adarsh to become a committer and we are pleased to announce that
> Adarsh has accepted.
>
> Adarsh has been a consistent contributor for over a year now. He has over
> 49 commits in Druid. The majority of his commits are in the Calcite layer and MSQ.
> Some of his most notable contributions are:
> 1. Adding the replace SQL syntax for MSQ.
> 2. Adding the sequential merge feature in MSQ which helps in generating
> better segment sizes.
> 3. Changes in load rules and brokers to enable the query from deep storage
> feature.
>
>
> Congratulations Adarsh.
>


Re: [DISCUSS] Druid 28 dropping support for Hadoop 2

2023-07-19 Thread Gian Merlino
Given the replies on this thread, I think it's appropriate that we do the 
following now:

1) Announce in the upcoming Druid 27 release that Hadoop 2 support is 
deprecated and will be removed in Druid 28

2) Remove Hadoop 2 support from the master branch, since Druid 27 has been 
branched off already, and the next release (28) is meant to not have it.

Does anyone have some spare cycles to do (2)?

Gian

On 2023/06/28 06:42:08 Gian Merlino wrote:
> I'd like to propose dropping support for Hadoop 2 in Druid 28. Not the very
> next release (which I assume will be Druid 27) but the one after that,
> likely late 2023 timeframe.
> 
> In 2021, we had a discussion about moving away from Hadoop 2:
> https://lists.apache.org/thread/zmc389trnkh6x444so8mdb2h0x0noqq4. For
> various reasons, it didn't seem like the right time. However, I believe now
> is the right time:
> 
> 1) We didn't support Hadoop 3 in 2021, but we support it now. There is now
> a Hadoop 3 build profile, as well as convenience binaries on
> https://druid.apache.org/downloads.html.
> 
> 2) We have SQL-based ingest with MSQ tasks, which provides a built-in /
> scalable / robust alternative to using Hadoop at all.
> 
> 3) It has been an additional two years. Hadoop 2 is that much older, that
> much more time has passed since it was superseded by Hadoop 3, and people
> have had that much more time to migrate.
> 
> 4) The original main reason for wanting to move away from Hadoop 2 is still
> relevant. It keeps us on various old dependencies, including an ancient
> version of Guava, which in turn has been keeping us on an ancient version
> of Calcite. The Calcite community has graciously decided to support this
> old version of Guava for at least one release, but plans to drop support by
> Calcite 1.36, leaving us back in the same position. Managing this situation
> is time-consuming for both Druid and Calcite maintainers.
> 
> 5) Other solutions beyond dropping Hadoop 2 support were proposed in 2021,
> such as reworking Hadoop support to be purely extension based, and
> reworking extensions to be more isolated from each other. However, these
> are both substantially more complex than dropping support, and in the two
> years since the original thread, these more complex solutions have not been
> implemented. So, I think we need to move on with the simpler solution of
> dropping support.
> 
> Gian
> 

-
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org



Re: About maintaining the Helm's Chart of Apache Druid

2023-07-17 Thread Gian Merlino
Thank you. This kind of work can be thankless so I just wanted to
explicitly say TY.

On Tue, Jul 11, 2023 at 10:48 PM Abhishek Agarwal 
wrote:

> Since no one else has volunteered, I will take on the 1). It's possible
> that we don't get IP clearance and if we don't, we will just remove the
> code.
>
> On Wed, Mar 1, 2023 at 7:14 AM Gian Merlino  wrote:
>
> > Not as far as I _know_, I mean.
> >
> > On 2023/03/01 01:43:43 Gian Merlino wrote:
> > > Not as far as I do. I think we're stuck since nobody has volunteered to
> > do one of the two necessary things:
> > >
> > > 1) shepherd this code through the IP clearance process, or
> > > 2) analyze its provenance enough to determine that IP clearance isn't
> > necessary.
> > >
> > > If anyone is willing to do one of the above it would be greatly
> > appreciated.
> > >
> > > At some point, if it seems like nobody will volunteer to do one of the
> > above things, we'll need to remove the code, and whoever's interested in
> > maintaining it would need to fork it off into another repo.
> > >
> > > Gian
> > >
> > > On 2023/02/15 18:37:57 Xavier Léauté wrote:
> > > > Did we ever get to a conclusion here on IP clearance? Clint, it looks
> > like
> > > > the helm charts are still in the Druid repo and being contributed to.
> > > >
> > > > Thanks,
> > > > Xavier
> > > >
> > > > On Tue, Sep 14, 2021 at 4:19 AM Clint Wylie 
> wrote:
> > > >
> > > > > My understanding of that thread suggests that
> > > > >
> https://incubator.apache.org/ip-clearance/ip-clearance-template.html
> > > > > is the process that our PMC does before the IPMC can continue the
> > > > > clearance process, meaning someone on our PMC fill out i think this
> > > > > form
> > > > >
> >
> http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/ip-clearance-template.xml
> > > > > ,
> > > > > and check the things in and send the emails and such listed out in
> > the
> > > > > "process" section.
> > > > >
> > > > > I'm going to remove the helm chart from the 0.22.0 release for now
> > > > > since I can't find any record of this having been done yet (I can
> add
> > > > > it back if I missed it and we haven't yet released).
> > > > >
> > > > > On Sun, Jul 4, 2021 at 11:07 PM Benedict Jin 
> > wrote:
> > > > > >
> > > > > > Hi Jihoon,
> > > > > >
> > > > > > Last week I asked the ASF, and according to the reply, it seems
> > that
> > > > > only our IPMC has the authority to launch the IP Clearance process.
> > FYI,
> > > > >
> >
> https://lists.apache.org/thread.html/rfbfc5951c4524c0e68223e4fbe05a7d7ee26c185ab557d6f77a4989d%40%3Cgeneral.incubator.apache.org%3E
> > > > > >
> > > > > > Regards,
> > > > > > Benedict Jin
> > > > > >
> > > > > > On 2021/07/02 22:56:08, Jihoon Son  wrote:
> > > > > > > Hey Benedict,
> > > > > > >
> > > > > > > Any updates on this issue? I think we are going to start the
> > release
> > > > > > > process for 0.22.0 soon.
> > > > > > >
> > > > > > > On Fri, Jul 2, 2021 at 1:19 AM Benedict Jin <
> asdf2...@apache.org
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Xavier,
> > > > > > > >
> > > > > > > > I'm so happy to hear that and look forward to your changes
> > will be
> > > > > > > > contributed to upstream. In fact, Helm and Operator are not
> in
> > > > > conflict,
> > > > > > > > their relationship is kind of like RPM and Systemd. You can even
> > > > > convert Helm
> > > > > > > > into Operator, or build Operator based on Helm. And I agree
> > with you
> > > > > that
> > > > > > > > it would be better if we can define user scenarios.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Benedict Jin
> > > > > > > >
> > > > > > > > On 2021/06/25 22:42:32, Xavier Léauté
> > 

Re: group-by v1

2023-07-17 Thread Gian Merlino
+1 to removing it.

The only benefit I am aware of is the same one that you mentioned. But I
don't think this needs to block removing the old v1 algo.

On Wed, Jul 12, 2023 at 4:07 AM Clint Wylie  wrote:

> Is anyone opposed to removing group-by v1? I think it would allow us
> to simplify quite a lot of stuff. While it would very nice to
> implement 'growable' buffer aggregators so that the v2 algorithm could
> be a bit more flexible and finally cover the only potential reason I
> can imagine people might still be using v1, I don't think that this
> needs to block removal.
>
> So, is anyone out there still using group-by v1 and would be sad about it
> going away?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: request to join dev group

2023-07-06 Thread Gian Merlino
Hi Tanya,

Welcome! You can subscribe by sending an email to 
dev-subscr...@druid.apache.org.

Gian

On 2023/07/04 06:41:02 Tanya Mary wrote:
> request to join dev group
> 

-
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org



Re: [DISCUSS] Druid 28 dropping support for Hadoop 2

2023-06-29 Thread Gian Merlino
Yes, I think it would make sense to deprecate it in Druid 27 if we're planning 
to remove the support in Druid 28.

I haven't looked into what it would take to make the Hadoop integration into an 
optional extension. That would be really nice though. Has anyone on this list 
looked into it?

Gian

On 2023/06/29 19:49:42 Xavier Léauté wrote:
> +1, does this mean we would mark Hadoop 2 deprecated in Druid 27?
> 
> Also, do we have a broader plan to remove Hadoop in general from core
> dependencies and make it an optional extension?
> 
> On Tue, Jun 27, 2023 at 11:53 PM Karan Kumar 
> wrote:
> 
> > In favour of dropping hadoop 2 support . Another point is the lack of
> > security and vulnerability fixes in hadoop2.
> >
> >
> >
> > On Wed, Jun 28, 2023 at 12:17 PM Clint Wylie  wrote:
> >
> > > obvious +1 from me
> > >
> > > On Tue, Jun 27, 2023 at 11:42 PM Gian Merlino  wrote:
> > > >
> > > > I'd like to propose dropping support for Hadoop 2 in Druid 28. Not the
> > > very
> > > > next release (which I assume will be Druid 27) but the one after that,
> > > > likely late 2023 timeframe.
> > > >
> > > > In 2021, we had a discussion about moving away from Hadoop 2:
> > > > https://lists.apache.org/thread/zmc389trnkh6x444so8mdb2h0x0noqq4. For
> > > > various reasons, it didn't seem like the right time. However, I believe
> > > now
> > > > is the right time:
> > > >
> > > > 1) We didn't support Hadoop 3 in 2021, but we support it now. There is
> > > now
> > > > a Hadoop 3 build profile, as well as convenience binaries on
> > > > https://druid.apache.org/downloads.html.
> > > >
> > > > 2) We have SQL-based ingest with MSQ tasks, which provides a built-in /
> > > > scalable / robust alternative to using Hadoop at all.
> > > >
> > > > 3) It has been an additional two years. Hadoop 2 is that much older,
> > that
> > > > much more time has passed since it was superseded by Hadoop 3, and
> > people
> > > > have had that much more time to migrate.
> > > >
> > > > 4) The original main reason for wanting to move away from Hadoop 2 is
> > > still
> > > > relevant. It keeps us on various old dependencies, including an ancient
> > > > version of Guava, which in turn has been keeping us on an ancient
> > version
> > > > of Calcite. The Calcite community has graciously decided to support
> > this
> > > > old version of Guava for at least one release, but plans to drop
> > support
> > > by
> > > > Calcite 1.36, leaving us back in the same position. Managing this
> > > situation
> > > > is time-consuming for both Druid and Calcite maintainers.
> > > >
> > > > 5) Other solutions beyond dropping Hadoop 2 support were proposed in
> > > 2021,
> > > > such as reworking Hadoop support to be purely extension based, and
> > > > reworking extensions to be more isolated from each other. However,
> > these
> > > > are both substantially more complex than dropping support, and in the
> > two
> > > > years since the original thread, these more complex solutions have not
> > > been
> > > > implemented. So, I think we need to move on with the simpler solution
> > of
> > > > dropping support.
> > > >
> > > > Gian
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > > For additional commands, e-mail: dev-h...@druid.apache.org
> > >
> > >
> >
> > --
> > Thanks
> > Karan
> >
> 

-
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org



[DISCUSS] Druid 28 dropping support for Hadoop 2

2023-06-28 Thread Gian Merlino
I'd like to propose dropping support for Hadoop 2 in Druid 28. Not the very
next release (which I assume will be Druid 27) but the one after that,
likely late 2023 timeframe.

In 2021, we had a discussion about moving away from Hadoop 2:
https://lists.apache.org/thread/zmc389trnkh6x444so8mdb2h0x0noqq4. For
various reasons, it didn't seem like the right time. However, I believe now
is the right time:

1) We didn't support Hadoop 3 in 2021, but we support it now. There is now
a Hadoop 3 build profile, as well as convenience binaries on
https://druid.apache.org/downloads.html.

2) We have SQL-based ingest with MSQ tasks, which provides a built-in /
scalable / robust alternative to using Hadoop at all.

3) It has been an additional two years. Hadoop 2 is that much older, that
much more time has passed since it was superseded by Hadoop 3, and people
have had that much more time to migrate.

4) The original main reason for wanting to move away from Hadoop 2 is still
relevant. It keeps us on various old dependencies, including an ancient
version of Guava, which in turn has been keeping us on an ancient version
of Calcite. The Calcite community has graciously decided to support this
old version of Guava for at least one release, but plans to drop support by
Calcite 1.36, leaving us back in the same position. Managing this situation
is time-consuming for both Druid and Calcite maintainers.

5) Other solutions beyond dropping Hadoop 2 support were proposed in 2021,
such as reworking Hadoop support to be purely extension based, and
reworking extensions to be more isolated from each other. However, these
are both substantially more complex than dropping support, and in the two
years since the original thread, these more complex solutions have not been
implemented. So, I think we need to move on with the simpler solution of
dropping support.

Gian


Re: Requirements for relaxing restrictions on github actions usage

2023-06-02 Thread Gian Merlino
+1, allowing CI to run without an explicit button push by committers will help 
encourage new contributors.

The requirements seem OK. I looked through our repo and I don't see any 
external actions (they are all in "github" or "actions").

We do have ".github/workflows/labeler.yml" that fires on pull_request_target 
and does use GITHUB_TOKEN. However, that action doesn't run any code from the 
PR itself, so I think it is fine. (The risk to me seems to be if the action 
exports GITHUB_TOKEN, and runs code from the PR, then the PR can steal 
GITHUB_TOKEN.)

Gian

On 2023/05/31 08:10:18 Abhishek Agarwal wrote:
> Hello,
> I raised an INFRA ticket (https://issues.apache.org/jira/browse/INFRA-24657)
> for the druid project so the contributors don't need a committer to trigger
> PR build/test. Infra has agreed to relax the restrictions enough that a
> contributor will need the approval only for their first contribution.
> 
> However, as a project, we need to follow certain requirements that are
> called out here - https://infra.apache.org/github-actions-policy.html
> 
> They all seem fine to me. We are using `pull_request_target` for the
> labeler action but that action doesn't export any confidential variables.
> If others agree as well, I will just link this thread to the INFRA ticket.
> 
> As a follow-up item, I can add a README.md in the .github folder that warns
> contributors and committers to keep these requirements in mind as they
> change GitHub workflows in future.
> 

-
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org



Roadmap event: call for speakers

2023-05-30 Thread Gian Merlino
Hi Druids,

We are looking to put on a virtual event called "Druid.NEXT" in June
highlighting things that people in the community are working on. This is a
call for speakers for that event!

Date is TBD, but likely late June. The event will be on the shorter side,
about meetup-length (an hour or two). If you are interested in giving a
5–10 minute talk about something you plan to contribute this year, then let
us know!

1) If the thing you're working on isn't yet on the "community roadmap"
GitHub issue at https://github.com/apache/druid/issues/14157, post there so
we can add it.

2) Reply to this email, or post on that GitHub issue, with a note that
you're interested in speaking.

Looking forward to what's next,

Gian


Re: Error message: "Error: Resource limit exceeded

2023-05-15 Thread Gian Merlino
Hi Alaka,

There's a bit of text cut off in the error message. The full one is
something like:

  "Time ordering is not supported for a Scan query with %,d
segments per time chunk and a row limit of %,d. "
  + "Try reducing your query limit below
maxRowsQueuedForOrdering (currently %,d), or using compaction to "
  + "reduce the number of segments per time chunk, or raising
maxSegmentPartitionsOrderedInMemory "
  + "(currently %,d) above the number of segments you have per
time chunk.",

So, in general you could try:

- Reduce the number of segments in each time chunk, perhaps using
compaction (
https://druid.apache.org/docs/latest/data-management/compaction.html)
- Or, increase druid.query.scan.maxSegmentPartitionsOrderedInMemory
- Reduce the limit of your Scan query
- Or, increase druid.query.scan.maxRowsQueuedForOrdering

Note that the "increase" options will lead to more memory usage, so you'd
want to keep an eye on that.
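
To illustrate the two "increase" options, here is a hedged sketch of the corresponding runtime properties. The values are purely illustrative, not recommendations; check the defaults for your Druid version and your available heap before changing them.

  # Illustrative values only: both settings trade memory for the ability
  # to time-order larger Scan query results.
  druid.query.scan.maxSegmentPartitionsOrderedInMemory=100
  druid.query.scan.maxRowsQueuedForOrdering=200000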

On Tue, May 9, 2023 at 1:03 PM Alaka Thorat  wrote:

> Hi,
>
> I am getting the following  druid error message in Superset.
>
> Error message: "Error: Resource limit
> exceeded (org.apache.druid.query.ResourceLimitExceededException): Time
> ordering is not supported for a Scan query"
>
> This happens on one tab with a chart set up (in Superset) without server
> pagination & row limit <= 10.
>
> But when I use a different tab with the same configuration (without
> pagination / row limit = 10), it works.
>
> However, when the sort by "__time" is removed, it works again with server
> pagination unticked & row limit = 30,
> but then there is the issue of the descending ordering again.
>
>
> Could you tell me what changes I have to make to avoid the above Druid
> error message given by Superset?
>
>
> Thanks
> --Alaka
>


Re: Question regarding new development

2023-03-28 Thread Gian Merlino
Looks like the conversation is now in
https://github.com/apache/druid/issues/13948.

On Sat, Mar 18, 2023 at 8:00 AM Sergiu Ungureanu 
wrote:

> Hi Team,
>
> Yesterday I raised a question in #dev channel in slack
>
> https://apachedruidworkspace.slack.com/archives/C030CMF6B70/p1679085073683509
>
> I would like to know if JUnit 4 migration is in the project scope and if it
> is a viable proposal for the project. Thank you in advance!!
>
> Kind regards,
> Sergiu
>


CI requiring approval for external contributors

2023-03-28 Thread Gian Merlino
Recently, ASF GitHub repos had their defaults for GitHub Actions changed to
"always require approval for external contributors". In Slack, Karan
pointed out that Airflow has recently submitted a ticket to have that
changed back: https://issues.apache.org/jira/browse/INFRA-24200. IMO, we
should do the same. I don't think we have a problem with fake PRs, but we
can always improve our responsiveness to contributors from outside the
project! Every little bit helps, including running CI automatically.

If others have opinions on this, let me know. I'd like to raise our own
ticket to change our default.

Gian


Re: About maintaining the Helm's Chart of Apache Druid

2023-02-28 Thread Gian Merlino
Not as far as I do. I think we're stuck since nobody has volunteered to do one 
of the two necessary things:

1) shepherd this code through the IP clearance process, or
2) analyze its provenance enough to determine that IP clearance isn't necessary.

If anyone is willing to do one of the above it would be greatly appreciated.

At some point, if it seems like nobody will volunteer to do one of the above 
things, we'll need to remove the code, and whoever's interested in maintaining 
it would need to fork it off into another repo.

Gian

On 2023/02/15 18:37:57 Xavier Léauté wrote:
> Did we ever get to a conclusion here on IP clearance? Clint, it looks like
> the helm charts are still in the Druid repo and being contributed to.
> 
> Thanks,
> Xavier
> 
> On Tue, Sep 14, 2021 at 4:19 AM Clint Wylie  wrote:
> 
> > My understanding of that thread suggests that
> > https://incubator.apache.org/ip-clearance/ip-clearance-template.html
> > is the process that our PMC does before the IPMC can continue the
> > clearance process, meaning someone on our PMC fill out i think this
> > form
> > http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/ip-clearance-template.xml
> > ,
> > and check the things in and send the emails and such listed out in the
> > "process" section.
> >
> > I'm going to remove the helm chart from the 0.22.0 release for now
> > since I can't find any record of this having been done yet (I can add
> > it back if I missed it and we haven't yet released).
> >
> > On Sun, Jul 4, 2021 at 11:07 PM Benedict Jin  wrote:
> > >
> > > Hi Jihoon,
> > >
> > > Last week I asked the ASF, and according to the reply, it seems that
> > only our IPMC has the authority to launch the IP Clearance process. FYI,
> > https://lists.apache.org/thread.html/rfbfc5951c4524c0e68223e4fbe05a7d7ee26c185ab557d6f77a4989d%40%3Cgeneral.incubator.apache.org%3E
> > >
> > > Regards,
> > > Benedict Jin
> > >
> > > On 2021/07/02 22:56:08, Jihoon Son  wrote:
> > > > Hey Benedict,
> > > >
> > > > Any updates on this issue? I think we are going to start the release
> > > > process for 0.22.0 soon.
> > > >
> > > > On Fri, Jul 2, 2021 at 1:19 AM Benedict Jin 
> > wrote:
> > > >
> > > > > Hi Xavier,
> > > > >
> > > > > I'm so happy to hear that and look forward to your changes will be
> > > > > contributed to upstream. In fact, Helm and Operator are not in
> > conflict,
> > > > > > their relationship is kind of like RPM and Systemd. You can even
> > convert Helm
> > > > > into Operator, or build Operator based on Helm. And I agree with you
> > that
> > > > > it would be better if we can define user scenarios.
> > > > >
> > > > > Regards,
> > > > > Benedict Jin
> > > > >
> > > > > On 2021/06/25 22:42:32, Xavier Léauté 
> > > > > wrote:
> > > > > > For what it's worth, we have been using a heavily modified version
> > of
> > > > > this
> > > > > > helm chart at Confluent.
> > > > > >
> > > > > > I would say it is good to get a Druid cluster up and running
> > quickly, but
> > > > > > we had to make some significant changes to make it easier to
> > operate a
> > > > > > Druid cluster.
> > > > > > It's great for initial deployment and getting all the required
> > > > > dependencies
> > > > > > in place, but operations are somewhat painful and require a lot of
> > > > > internal
> > > > > > Druid knowledge to not shoot yourself in the foot.
> > > > > > Our original intention was to contribute back those changes
> > upstream, but
> > > > > > we have not had the time to put it in a shape that would allow
> > others to
> > > > > > use it.
> > > > > >
> > > > > > We should try to define what we want this chart to be used for,
> > since I
> > > > > > think the Druid k8s operator is probably a better choice for
> > someone to
> > > > > run
> > > > > > and upgrade a meaningful cluster.
> > > > > > Another option would be to focus our effort on the Druid operator
> > and
> > > > > maybe
> > > > > > build a helm chart to get that and our external dependencies in
> > place, I
> > > > > > think we can provide a better experience that way.
> > > > > > One concern with the pure helm chart is that we'll get a lot of
> > questions
> > > > > > on how to operate it that will likely take a lot of time to answer.
> > > > > > Considering we'd have helm, k8s operator, and docker-compose, I
> > think we
> > > > > > should be conscious of the time it would take to maintain all
> > those ways
> > > > > of
> > > > > > running Druid in containers and what purpose each of them serves.
> > > > > >
> > > > > > Just my 2¢,
> > > > > > Xavier
> > > > > >
> > > > > > On Tue, Jun 22, 2021 at 8:04 AM Benedict Jin 
> > > > > wrote:
> > > > > >
> > > > > > > Hi Jihoon Son,
> > > > > > >
> > > > > > > Cool, thanks a lot 
> > > > > > >
> > > > > > > Regards,
> > > > > > > Benedict Jin
> > > > > > >
> > > > > > > On 2021/06/21 17:10:13, Jihoon Son  wrote:
> > > > > > > > Thanks Benedict.
> > > > > > > > You can find another example of the IP clearance process here:
> > > > > 

Re: [Discuss] S3 buckets or IT tests

2023-02-22 Thread Gian Merlino
I think the ticket you're referring to is 
https://issues.apache.org/jira/browse/INFRA-23952.

It would definitely be valuable to run S3 integration tests as part of the 
automated test suite in GitHub Actions. If Infra is willing to provide a bucket 
for this purpose then we would certainly be able to use that. I bet we could 
also use Minio (https://min.io/) or something similar. 

Gian

On 2023/02/15 21:38:47 Karan Kumar wrote:
> Hey Folks
> S3 read write tests currently are not executed in github actions since we
> do not have public creds to read/write from s3.
> Have raised an ASF infra ticket
> https://apachedruidworkspace.slack.com/archives/C030CMF6B70/p1675916945323839
> so that they can give us a bucket.
> They require a PMC approval.
> As discussed on this slack thread
> https://apachedruidworkspace.slack.com/archives/C030CMF6B70/p1675916945323839,
> starting a formal discussion around it.
> 
> The objective would be to integrate our s3 IT's tests to run on each PR.
> 
> 
> 
> -- 
> Thanks
> Karan
> 

-
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org



Re: moving druid-core, extendedset, druid-hll into druid-processing

2023-02-06 Thread Gian Merlino
I support this. I don't feel like the separation between core and
processing is buying us very much.

On Mon, Jan 23, 2023 at 5:12 PM Clint Wylie  wrote:

> Hi all,
>
> I want to discuss moving druid-core, extendedset, and druid-hll into
> druid-processing to simplify our code structure and dependencies a
> bit. We've been discussing doing something like this off and on for
> quite a lot of years now, re
> https://github.com/apache/druid/issues/4312, and we've done parts of
> it, but .. we just haven't got back to it yet.
>
> I've opened a PR https://github.com/apache/druid/pull/13698 and have
> done some testing so I think it should be minimally disruptive (see PR
> for details), but wanted to raise this on the list to try to get a
> wider audience and see if anyone has any concerns I missed. There are
> still a couple of CI issues to work out.
>
> Beyond this, I also think it would be super nice to move druid-sql and
> druid-indexing-service into druid-server, though that should be done
> in a separate PR.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: [DISCUSS] Release 24.0.1

2022-10-18 Thread Gian Merlino
Thank you for volunteering!

On Mon, Oct 17, 2022 at 7:00 AM Kashif Faraz  wrote:

> Hi Abhishek
>
> If you haven't started with the release process already, I would like to
> volunteer to perform this release so that we can expedite it.
> Please let me know if that works for you.
>
> Regards
> Kashif
>
> On Mon, Sep 26, 2022 at 3:41 PM Abhishek Agarwal <
> abhishek.agar...@imply.io>
> wrote:
>
> > Hi All,
> > Recently we discovered a regression (
> > https://github.com/apache/druid/pull/13138) in the 24.0.0 release.
> Because
> > of this regression, Hadoop ingestion will not work if the user has
> > overridden any of the `druid.extensions.*` config. Some examples below
> > - If a custom load list is specified, Hadoop ingestion task will still
> pick
> > up all the extensions in "extension" directory.
> > - If a custom Hadoop dependency directory is being used, those jars will
> no
> > longer be available for Hadoop Ingestion.
> >
> > I have created a branch https://github.com/apache/druid/tree/24.0.1 to
> > backport the hotfix. We can also choose to backport some other bug fixes
> > that we missed in 24.0.0. Though, of course, we want to backport only
> > critical bug fixes.
> >
>


Druid Summit on the road

2022-09-06 Thread Gian Merlino
Hey Druids,

I am excited to write to you about upcoming events in this year's edition
of Druid Summit, which is being conducted as a series of more local
in-person events. I hope it gives you a chance to meet people near you in
the Druid community. Attendance is free of charge.

I personally will be attending the SF event, next week on 9/14, and hope to
see many of you there!

You can see a list of upcoming events at: https://druidsummit.org/#locations

This includes:

- Sept. 8, Berlin
- Sept. 9, Singapore
- Sept. 14, San Francisco
- Sept. 15, London
- Sept. 29, Paris
- Oct. 20, Seoul
- Oct. 27, Tel Aviv


Gian


Re: Intermediate segment persistence

2022-09-06 Thread Gian Merlino
Hey Pramod,

If it's a minor change I recommend raising a PR. Generally raising an issue
first is a good idea for bigger changes, where it is helpful to have some
discussion prior to the code showing up. But for smaller changes, we can go
directly to the code.

You can post the PR here too, or in Slack (see
https://druid.apache.org/community/) to make sure it gets prompt attention.

Gian

On Tue, Sep 6, 2022 at 3:50 PM Pramod Immaneni  wrote:

> Hi,
>
> I have a contribution that I wanted to submit, to speed up persistence of
> intermediate segments in the MiddleManager in certain cases. What is the best
> way to go about doing this? It's a minor change; would creating an issue
> followed by a PR be a good way of going about this?
>
> Thanks
>


Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

2022-08-08 Thread Gian Merlino
It's always good to deprecate things for some time prior to removing them,
so we don't need to (nor should we) remove Hadoop 2 support right now. My
vote is that in this upcoming release, we should deprecate it. The main
problem in my eyes is the one Abhishek brought up: the dependency
management situation with Hadoop 2 is really messy, and I'm not sure
there's a good way to handle them given the limited classloader isolation.
This situation becomes tougher to manage with each release, and we haven't
had people volunteering to find and build comprehensive solutions. It is
time to move on.

The concern Samarth raised, that people may end up stuck on older Druid
versions because they aren't able to upgrade to Hadoop 3, is valid. I can
see two good solutions to this. First: we can improve native ingest to the
point where people feel broadly comfortable moving Hadoop 2 workloads to
native. The work planned as part of doing ingest via multi-stage
distributed query  is going
to be useful here, by improving the speed and scalability of native ingest.
Second: it would also be great to have something similar that runs on
Spark, for people that have made investments in Spark. I suspect that most
people that used Hadoop 2 have moved on to Hadoop 3 or Spark, so supporting
both of those would ease a lot of the potential pain of dropping Hadoop 2
support.

On Spark: I'm not familiar with the current state of the Spark work. Is it
stuck? If so could something be done to unstick it? I agree with Abhishek
that I wouldn't want to block moving off Hadoop 2 on this. However, it'd be
great if we could get it done before actually removing Hadoop 2 support
from the code base.


On Wed, Aug 3, 2022 at 6:17 AM Abhishek Agarwal 
wrote:

> I was thinking that moving from Hadoop 2 to Hadoop 3 will be a
> low-resistance path than moving from Hadoop to Spark. even if we get that
> PR merged, it will take good time for spark integration to reach the same
> level of maturity as Hadoop or Native ingestion. BTW I am not making an
> argument against spark integration. it will certainly be nice to have Spark
> as an option. Just that spark integration doesn't become a blocker for us
> to get off Hadoop.
>
> btw are you using Hadoop 2 right now with the latest druid version? If so,
> did you run into similar errors that I posted in my last email?
>
> On Wed, Jul 27, 2022 at 12:02 AM Samarth Jain 
> wrote:
>
> > I am sure there are other companies out there who are still on Hadoop 2.x
> > with migration to Hadoop 3.x being a no-go.
> > If Druid were to drop support for Hadoop 2.x completely, I am afraid it
> > would prevent users from updating to newer versions of Druid, which would
> be
> > a shame.
> >
> > FWIW, we have found in practice for high volume use cases that compaction
> > based on Druid's Hadoop based batch ingestion is a lot more scale-able
> than
> > the native compaction.
> >
> > Having said that, as an alternative, if we can merge Julian's Spark-based
> > ingestion PRs in Druid, that
> > might provide an alternate way for users to get rid of the Hadoop
> > dependency.
> >
> > On Tue, Jul 26, 2022 at 3:19 AM Abhishek Agarwal <
> > abhishek.agar...@imply.io>
> > wrote:
> >
> > > Reviving this conversation again.
> > > @Will - Do you still have concerns about HDFS stability? Hadoop 3 has
> > been
> > > around for some time now and is very stable as far as I know.
> > >
> > > The dependencies coming from Hadoop 2 are also old enough that they
> cause
> > > dependency scans to fail. E.g. Log4j 1.x dependencies that are coming
> > from
> > > Hadoop 2, get flagged during these scans. We have also seen issues when
> > > customers try to use Hadoop ingestion with the latest log4j2 library.
> > >
> > > Exception in thread "main" java.lang.NoSuchMethodError:
> > >
> > >
> >
> org.apache.log4j.helpers.OptionConverter.convertLevel(Ljava/lang/String;Lorg/apache/logging/log4j/Level;)Lorg/apache/logging/log4j/Level;
> > > at
> > >
> > >
> >
> org.apache.log4j.config.PropertiesConfiguration.parseLogger(PropertiesConfiguration.java:393)
> > > at
> > >
> > >
> >
> org.apache.log4j.config.PropertiesConfiguration.configureRoot(PropertiesConfiguration.java:326)
> > > at
> > >
> > >
> >
> org.apache.log4j.config.PropertiesConfiguration.doConfigure(PropertiesConfiguration.java:303)
> > >
> > >
> > > Instead of fixing these point issues, we would be better served by
> > > moving to Hadoop 3 entirely. Hadoop 3 does get more frequent
> > > releases and dependencies are well isolated.
> > >
> > > On Tue, Oct 12, 2021 at 12:05 PM Karan Kumar  >
> > > wrote:
> > >
> > > > Hello
> > > > We can also use maven profiles. We keep hadoop2 support by default
> and
> > > add
> > > > a new maven profile with hadoop3. This will allow the user to choose
> > the
> > > > profile which is best suited for the use case.
> > > > Agreed, it will not help in the Hadoop 

Re: Next Druid release version scheme

2022-07-06 Thread Gian Merlino
I'd say yes, in a way that's similar to today. Today we treat increments of
the version after the 0 as potentially allowing breaking changes. We also
try to avoid them whenever feasible, because we know they're painful for
users. I'm not suggesting we immediately get any more, or less, eager about
making breaking changes as part of dropping the "0.". Over time, though,
I'd like to see us get less eager about making breaking changes.

On Wed, Jul 6, 2022 at 9:47 AM Julian Hyde  wrote:

> Would 24.0 and 25.0 each be regarded as major versions for the purposes of
> semantic versioning?
>
> If so, under the rules of semantic versioning, we *can* make breaking API
> changes but that doesn’t mean that we *should*. (For an example of a
> project that followed the letter of semantic versioning but still
> undermined the trust of their users by making too many API changes, look no
> further than Guava.)
>
> Julian
>
>
> On Jul 6, 2022, at 1:53 AM, Gian Merlino  wrote:
>
> My proposal for the next release is that we merely drop the leading "0."
> and don't change anything else about our dev process. We'd start the next
> release at 24.0, and then likely do 25.0 shortly after. Same as today, just
> no leading '0.".
>
> Separately, I'd like to craft a better versioning story around extension
> API, query API, etc. But I don't think we need to connect these two things.
> The dropping of the leading "0." is mainly about reflecting the reality
> that the project is way more stable than a random member of the public
> would expect for a "0." release. The better versioning story is an effort
> that is independent from that.
>
> On Tue, Jun 7, 2022 at 11:50 AM Xavier Léauté  >
> wrote:
>
> > Extension API: do extensions written for version X run as expected with
> > version Y?
>
> One thing I'd like to see us do before we declare to 1.0 and provide
> backwards compatibility for extensions APIs is
> to remove some of the crufty Hadoop 2.x and Guava 16 dependency constraints
> we have (or at least isolate them so
> extensions and core are not constrained by old versions). Removing those
> will likely be a breaking change for extensions.
>
> I'm also fine declaring 1.0, but that might mean we can't deprecate things
> until 2.0, and then remove those in 3.0 depending on
> what our backwards compatibility guarantees are. What I'd like us to avoid
> is to be further entrenched and bogged down in
> moving away from those dependencies by declaring a stable API.
>
> Xavier
>
> On Mon, Jun 6, 2022 at 2:45 PM rahul gidwani 
> wrote:
>
> Hi Gian, this is great.
>
> For me what is most important is (2) and (4)
> Does my current extension work with new releases?
> Can I do a rolling upgrade of druid to the next version?
>
> The more things that are versioned the better, but (2) and (4) have been
> the things that have been most important to me in the past.
>
> Anyone in the community have any thoughts on this?
> Thank you
> rahul
>
>
>
> On Fri, May 27, 2022 at 11:22 AM Gian Merlino  wrote:
>
> Yeah, I'd say the next one after 24.0 would be 25.0. The idea is really
> just to remove the leading zero and thereby communicate the accurate
>
> state
>
> of the project: it has been stable and production-ready for a long
>
> time.
>
> Some people see the leading zero and interpret that as a sign of an
> immature or non-production-ready system. So I think this change is
>
> worth
>
> doing and beneficial.
>
> I do think we can do better at communicating compatibility, but IMO
> semantic versioning for the whole system isn't the best way to do it.
> Semantic versioning is good for libraries, where people need one kind
>
> of
>
> assurance: that they can update to the latest version of the library
> without needing to make changes in their program. But Druid is
> infrastructure software with many varied senses of compatibility, such
>
> as:
>
>
> 1) Query API: do user queries written for version X return compatible
> responses when run against version Y?
> 2) Extension API: do extensions written for version X run as expected
>
> with
>
> version Y?
> 3) Storage format: can servers at version X read segments written by
> servers at version Y?
> 4) Intracluster protocol: can a server at version X communicate
>
> properly
>
> with a server at version Y?
> 5) Server configuration: do server configurations (runtime properties,
>
> jvm
>
> configs) written for version X work as expected for version Y?
> 6) Ecosystem: does version Y drop support for older versions of
>
> ZooKeeper,
>
> Kafka, Hadoop, etc, which were supported by version X?
>

Re: Next Druid release version scheme

2022-07-06 Thread Gian Merlino
My proposal for the next release is that we merely drop the leading "0."
and don't change anything else about our dev process. We'd start the next
release at 24.0, and then likely do 25.0 shortly after. Same as today, just
no leading "0.".

Separately, I'd like to craft a better versioning story around extension
API, query API, etc. But I don't think we need to connect these two things.
The dropping of the leading "0." is mainly about reflecting the reality
that the project is way more stable than a random member of the public
would expect for a "0." release. The better versioning story is an effort
that is independent from that.

On Tue, Jun 7, 2022 at 11:50 AM Xavier Léauté 
wrote:

> > Extension API: do extensions written for version X run as expected with
> version Y?
>
> One thing I'd like to see us do before we declare to 1.0 and provide
> backwards compatibility for extensions APIs is
> to remove some of the crufty Hadoop 2.x and Guava 16 dependency constraints
> we have (or at least isolate them so
> extensions and core are not constrained by old versions). Removing those
> will likely be a breaking change for extensions.
>
> I'm also fine declaring 1.0, but that might mean we can't deprecate things
> until 2.0, and then remove those in 3.0 depending on
> what our backwards compatibility guarantees are. What I'd like us to avoid
> is to be further entrenched and bogged down in
> moving away from those dependencies by declaring a stable API.
>
> Xavier
>
> On Mon, Jun 6, 2022 at 2:45 PM rahul gidwani 
> wrote:
>
> > Hi Gian, this is great.
> >
> > For me what is most important is (2) and (4)
> > Does my current extension work with new releases?
> > Can I do a rolling upgrade of druid to the next version?
> >
> > The more things that are versioned the better, but (2) and (4) have been
> > the things that have been most important to me in the past.
> >
> > Anyone in the community have any thoughts on this?
> > Thank you
> > rahul
> >
> >
> >
> > On Fri, May 27, 2022 at 11:22 AM Gian Merlino  wrote:
> >
> > > Yeah, I'd say the next one after 24.0 would be 25.0. The idea is really
> > > just to remove the leading zero and thereby communicate the accurate
> > state
> > > of the project: it has been stable and production-ready for a long
> time.
> > > Some people see the leading zero and interpret that as a sign of an
> > > immature or non-production-ready system. So I think this change is
> worth
> > > doing and beneficial.
> > >
> > > I do think we can do better at communicating compatibility, but IMO
> > > semantic versioning for the whole system isn't the best way to do it.
> > > Semantic versioning is good for libraries, where people need one kind
> of
> > > assurance: that they can update to the latest version of the library
> > > without needing to make changes in their program. But Druid is
> > > infrastructure software with many varied senses of compatibility, such
> > as:
> > >
> > > 1) Query API: do user queries written for version X return compatible
> > > responses when run against version Y?
> > > 2) Extension API: do extensions written for version X run as expected
> > with
> > > version Y?
> > > 3) Storage format: can servers at version X read segments written by
> > > servers at version Y?
> > > 4) Intracluster protocol: can a server at version X communicate
> properly
> > > with a server at version Y?
> > > 5) Server configuration: do server configurations (runtime properties,
> > jvm
> > > configs) written for version X work as expected for version Y?
> > > 6) Ecosystem: does version Y drop support for older versions of
> > ZooKeeper,
> > > Kafka, Hadoop, etc, which were supported by version X?
> > >
> > > In practice we do find good reasons to make such changes in one or more
> > of
> > > these areas in many of our releases. We try to maximize compatibility
> > > between releases, but it is balanced against the effort to improve the
> > > system while keeping the code maintainable. So if we considered all of
> > > these areas in semantic versioning, we'd be incrementing the major
> > version
> > > often anyway. The effect would be similar to having a "meaningless"
> > version
> > > number but with more steps.
> > >
> > > IMO a better approach would be to introduce more kinds of version
> > numbers.
> > > In my experience the two most important kinds of compatibility to most
> >

Re: [DISCUSS] Removing code related to `FireHose`

2022-07-06 Thread Gian Merlino
I am in favor of immediately removing FiniteFirehoseFactory and marking
EventReceiverFirehoseFactory deprecated. Then, later on we can remove
InputRowParser and EventReceiverFirehoseFactory.

On Fri, Jun 24, 2022 at 4:41 AM Abhishek Agarwal 
wrote:

> I didn’t include them (RealtimeIndexTask and
> AppenderatorDriverRealtimeIndexTask) in my previous email because they have
> not been marked deprecated yet. We should mark them deprecated officially
> in the next release and remove them in the release after that.
>
> So looks like the classes that we can definitely remove are implementations
> of `FiniteFirehoseFactory` and mark the `Firehose` interface deprecated.
>
> On Fri, 24 Jun 2022 at 4:36 AM, Clint Wylie  wrote:
>
> > If we remove RealtimeIndexTask and AppenderatorDriverRealtimeIndexTask
> > then we can remove EventReceiverFirehoseFactory. The former was
> > primarily used by tranquility which has been sunset, the latter I'm
> > not sure was ever used for anything. I'm personally in favor of
> > removing both of them since push based ingestion is very fragile in my
> > experience, but I think some of the oldest integration tests use
> > RealtimeIndexTask and so would need to be removed/updated/rewritten to
> > use something else as appropriate.
> >
> > I don't think we can completely remove InputRowParser until we drop
> > Hadoop support (or modify Hadoop ingestion to use
> > InputSource/InputFormat?), since it still relies on using the older
> > spec. As far as I know, Thrift is the only data format that has not
> > been fully migrated to use InputFormat, though there is an old PR that
> > is mostly done  here https://github.com/apache/druid/pull/11360.
> >
> > On Thu, Jun 23, 2022 at 5:11 AM Abhishek Agarwal
> >  wrote:
> > >
> > > Hello,
> > > The `FiniteFirehoseFactory` and `InputRowParser` classes were
> deprecated
> > in
> > > 0.17.0 (https://github.com/apache/druid/pull/8823) in favour of
> > > `InputSource`.  0.17.0 was released more than 2 years ago in Jan 2020.
> > >
> > > I think it is about time that we remove this code entirely. Removing
> > > `InputRowParser` may not be as trivial, since
> `EventReceiverFirehoseFactory`
> > > depends on it. I didn't find any alternatives for
> > > `EventReceiverFirehoseFactory` and it is not marked deprecated as well.
> > >
> > > But we can still remove `FiniteFirehoseFactory` and the implementations
> > > safely as there are alternatives available.
> > >
> > > Thoughts/Suggestions?
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > For additional commands, e-mail: dev-h...@druid.apache.org
> >
> >
>


Re: Vulnerability Report [Misconfigured DMARC Record Flag]

2022-06-21 Thread Gian Merlino
Hey Zeus,

You should have received a response to this report from the Apache Security
Team (secur...@apache.org). In the future, please note that security
reports should be sent to secur...@apache.org, not the dev list.

On Tue, Jun 21, 2022 at 1:04 PM Cyber Zeus  wrote:

> Hi team
> kindly update me with the bug that I've reported.
> -Zeus
>
> On Fri, May 20, 2022 at 11:34 PM Cyber Zeus 
> wrote:
>
>> Hi Team,
>> I am an independent security researcher and I have found a bug in your
>> website
>> The details of it are as follows:-
>>
>> Description: This report is about a misconfigured Dmarc record flag,
>> which can be used for malicious purposes as it allows for fake mailing on
>> behalf of respected organizations.
>>
>> About the Issue:
>> As I have seen, the DMARC record for
>>
>> *druid.apache.org*
>>
>> which is:
>> DMARC Policy Not Enabled
>> DMARC Not Found
>>
>> As you can see from your DMARC record above, a valid record should look like:
>>
>> DMARC Policy Enabled
>> What's the issue:
>> A DMARC record is a type of Domain Name Service (DNS) record that
>> identifies which mail servers are permitted to send an email on behalf of
>> your domain. The purpose of a DMARC record is to prevent spammers from
>> sending messages on the behalf of your organization.
>>
>> Attack Scenario: An attacker will send phishing mail or anything
>> malicious mail to the victim via mail:
>>
>> commits-h...@druid.apache.org
>>
>>
>> even if the victim is aware of a phishing attack, he will check the
>> origin email which came from your genuine mail id
>> commits-h...@druid.apache.org
>>
>>
>> so he will think that it is genuine mail and get trapped by the attacker.
>> The attack can be done using any PHP mailer tool like this:-
>>
>> > $to = "vic...@example.com";
>> $subject = "Password Change";
>> $txt = "Change your password by visiting here - [VIRUS LINK HERE]l";
>> $headers = "From:
>>
>> commits-h...@druid.apache.org
>>
>>
>> ";mail($to,$subject,$txt,$headers);
>> ?>
>>
>> You can also check your DMARC/SPF record via MXTOOLBOX.
>>
>> Reference:
>> https://support.google.com/a/answer/2466580?hl=en
>> have a look at the GOOGLE article for a better understanding!
>>
>
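
For context on the report above: DMARC policy is published as a DNS TXT record
at _dmarc.<domain>; a hypothetical strict record looks like
"v=DMARC1; p=reject; rua=mailto:dmarc-reports@example.org". The record actually
published (if any) can be checked with:

  dig +short TXT _dmarc.druid.apache.org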


New PMC member: Abhishek Agarwal

2022-06-07 Thread Gian Merlino
Hey Druids,

The Druid PMC has invited Abhishek Agarwal (asf id abhishek, github id
abhishekagarwal87) to become a PMC member, and we are pleased to announce
that he has accepted. Abhishek has authored dozens of commits, participated
in nearly 200 code reviews, and is release manager for the upcoming 0.23.0
release.

Congratulations, Abhishek!


Re: EJB interceptor binding API is not available

2022-06-04 Thread Gian Merlino
Hi Maithri,

I haven't encountered something like this before so I'm not sure what's
causing it. Is it reproducible? If you could provide some steps for someone
else to see the same thing you're seeing — maybe it relies on a particular
Java version, or particular Druid version, or something — then that would
be helpful.

On Thu, Jun 2, 2022 at 11:16 AM Maithri Vemula  wrote:

> Hello concerned,
> I am doing druid kafka integration and at times the coordinator and
> overlord pods are crashing with the error EJB interceptor binding API is
> not available and JAX-RS EJB support is disabled. I saw a similar discussion
> but it's closed now. https://github.com/apache/druid/issues/8030. Is it
> something someone from your team can help me with?
>
> Thanks,
> Maithri Vemula.
>


Re: Next Druid release version scheme

2022-05-27 Thread Gian Merlino
Yeah, I'd say the next one after 24.0 would be 25.0. The idea is really
just to remove the leading zero and thereby communicate the accurate state
of the project: it has been stable and production-ready for a long time.
Some people see the leading zero and interpret that as a sign of an
immature or non-production-ready system. So I think this change is worth
doing and beneficial.

I do think we can do better at communicating compatibility, but IMO
semantic versioning for the whole system isn't the best way to do it.
Semantic versioning is good for libraries, where people need one kind of
assurance: that they can update to the latest version of the library
without needing to make changes in their program. But Druid is
infrastructure software with many varied senses of compatibility, such as:

1) Query API: do user queries written for version X return compatible
responses when run against version Y?
2) Extension API: do extensions written for version X run as expected with
version Y?
3) Storage format: can servers at version X read segments written by
servers at version Y?
4) Intracluster protocol: can a server at version X communicate properly
with a server at version Y?
5) Server configuration: do server configurations (runtime properties, jvm
configs) written for version X work as expected for version Y?
6) Ecosystem: does version Y drop support for older versions of ZooKeeper,
Kafka, Hadoop, etc, which were supported by version X?

In practice we do find good reasons to make such changes in one or more of
these areas in many of our releases. We try to maximize compatibility
between releases, but it is balanced against the effort to improve the
system while keeping the code maintainable. So if we considered all of
these areas in semantic versioning, we'd be incrementing the major version
often anyway. The effect would be similar to having a "meaningless" version
number but with more steps.

IMO a better approach would be to introduce more kinds of version numbers.
In my experience the two most important kinds of compatibility to most
users are "Query API" and "Extension API". So if we had a "Query API
version" or "Extension API version" then we could semantically version the
Query and Extension API versions, separately from the main Druid version.
(Each Druid release would have an associated Extension API version, and a
list of supported Query API versions that users could choose between on a
per-query basis.)

Rahul, I wonder what you think about this idea? What kinds of compatibility
are most important to you?

On Fri, May 27, 2022 at 9:39 AM rahul gidwani  wrote:

> I would say that semantic versioning for me is very important for
> determining compatibility between releases.  Minor versions should always
> adhere to being compatible with each other and a major version bump is
> where you can potentially break it.
>
> Right now calling it 24.0 is fine, but what would the next release be
> called?  25.0? If that is the case, then the number means nothing, every
> release is a major version and nothing has changed from what it is today
> except moving a decimal point.
>
> Personally I think we should focus on what we are going to do going forward
> for druid users such that they can be assured that compatibility is met
> between releases.  Right now it is release notes, but if we start using
> minor versioning like it is intended - that would be much more clear.
>
>
>
>
>
>
>
>
>
>
> On Fri, May 27, 2022 at 9:25 AM suneet Saldanha  wrote:
>
> > Hi Druids,
> >
> > I'd like to propose we bump the version of Druid to 24.0 for the next
> > release.
> > I think this would be beneficial because it better reflects the maturity
> of
> > the Druid
> > project that is actively used in many production use cases. This was
> > discussed briefly
> > in the Druid 0.23.0 release thread [1].
> >
> > Other ideas that were proposed
> > * Use a year / month in the release
> > * Make the next release 1.xx
> >
> > I think the year month is interesting, but since we do not have a planned
> > release schedule,
> > it is hard to pick the version that should be in the `master` branch
> while
> > active dev is happening.
> >
> > Labeling the next release as 1.xx makes it appear as if the current
> version
> > of Druid isn't very
> > stable since the current version is 0.xx which isn't the case.
> >
> > Happy to hear more opinions on this so we can get to consensus before it
> is
> > time for the next code freeze + release.
> >
> > [1]
> >
> >
> https://lists.apache.org/list?dev@druid.apache.org:2022-5:[DISCUSS]%20Druid%200.23%20release
> >
>


Re: [DISCUSS] Druid 0.23 release

2022-05-26 Thread Gian Merlino
I'm supportive of changing the versioning to something without the leading
zero in the next release where this is practical. If it's the one after
0.23.0, then I would go with 24.0. IMO, going with 1.0 would send a message
that this is the first mature release. But that isn't the case: we have
been doing mature releases for a long time now. Going with 24.0 is clearer
in that regard.

Happy to repeat this opinion on a new thread too :)

On Thu, May 26, 2022 at 6:49 PM Frank Chen  wrote:

> For 0.23, I don't think we need to make changes because I think it may take
> us some time to reach an agreement on the naming.
>
> We can start a new thread to discuss the versioning schema.
>
>
> On Thu, May 26, 2022 at 8:19 PM Abhishek Agarwal <
> abhishek.agar...@imply.io>
> wrote:
>
> > We should definitely move away from the `0.xx` versioning scheme we have
> > been using. However, the next version that we pick up is debatable.
> `23.x`
> > seems an odd jump from `0.23`. Can we increment the version to `1.x`
> maybe?
> > I also like the idea of using Year and Month that Frank has suggested.
> >
> > I don't think that 0.23 is the right release to make this change though.
> > 0.23 has already been delayed because of CVE investigations and bug
> fixes.
> > I would like to get this release out of the door as soon as possible.
> >
> > On Thu, May 26, 2022 at 2:40 PM Frank Chen  wrote:
> >
> > > I agree.
> > >
> > > This is also a question I have wanted to ask: why is the version still
> > > 0.xx? It gives many people the impression that Druid is still
> > > immature.
> > >
> > > There are many versioning schemas. One popular way is combining the
> > release
> > > year and month in the version.
> > > For example, if we're going to release a version in May this year, the
> > main
> > > version can be 22.5.
> > >
> > >
> > > Versioning is one thing, LTS strategy should also be clear.
> > > Since we're going to release several versions a year, we should plan in
> > > advance which one should be scheduled as a LTS version and maintain it
> > for
> > > a period of time if there are some vital bugs and security issues.
> > >
> > >
> > >
> > >
> > > On Thu, May 26, 2022 at 1:21 PM Suneet Saldanha 
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I've been thinking that we should consider re-branding this release
> as
> > > > the Druid 23.0 instead of 0.23 release. I think this is appropriate
> > > because
> > > > typically a `0.XX` software version implies that the software is in
> > its
> > > > infancy.
> > > >
> > > > Druid is quite mature, and we've been putting good guardrails in
> place
> > to
> > > > detect and prevent breaking API changes in each release. Druid has
> also
> > > > been running in production clusters for many different use cases for
> > > quite
> > > > some
> > > > time now. I think version 23.0 is more in line with the maturity of
> the
> > > > project.
> > > >
> > > > Is there a reason not to change the version for the next release? Any
> > > > other thoughts?
> > > >
> > > > On 2022/04/11 10:21:11 Abhishek Agarwal wrote:
> > > > > Thank you for creating that PR, Frank. In the last release, we
> > excluded
> > > > > helm charts since we were not sure about IP clearance. From
> > > > > https://incubator.apache.org/ip-clearance/, we should decide on IP
> > > > > clearance whether we include helm charts in artifacts or not. Any
> > > > thoughts?
> > > > >
> > > > > On Wed, Mar 30, 2022 at 4:44 PM Frank Chen 
> > > wrote:
> > > > >
> > > > > > Hi Abhishek,
> > > > > >
> > > > > > Thank you for starting the release work.
> > > > > >
> > > > > > This PR should be merged to address a problem caused by a
> previous
> > > PR:
> > > > > > https://github.com/apache/druid/pull/12067
> > > > > > I've added it to the 0.23 milestone.
> > > > > >
> > > > > >
> > > > > > Thank you.
> > > > > >
> > > > > >
> > > > > > On Wed, Mar 30, 2022 at 2:15 PM Abhishek Agarwal <
> > > > > > abhishek.agar...@imply.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello everyone,
> > > > > > > It's time to kick-off the process for druid 0.23 release. I
> will
> > > need
> > > > > > help
> > > > > > > from the community in surfacing any important issues that need
> to
> > > be
> > > > > > > addressed before 0.23 release. We can use this thread to
> discuss
> > > > those
> > > > > > > issues and take a call on how to unblock the release.
> > > > > > >
> > > > > > > I have also created 0.23 milestone (
> > > > > > > https://github.com/apache/druid/milestone/45). Any issues that
> > we
> > > > must
> > > > > > > want
> > > > > > > to fix in the 0.23 release, can be tagged with this milestone.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > > -
> > > > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > > > For additional commands, e-mail: dev-h...@druid.apache.org
> > > >
> > > >
> > >
> >
>


Re: Limitations of automated unused segment kill logic (Issue #10876 and PR #10877)

2022-05-05 Thread Gian Merlino
I just took a look, and it looks like a few other people did too. Sorry it
took so long!

I do think that "review for a review" is a good way to go, I think! Thanks
for volunteering.

On Mon, May 2, 2022 at 12:12 PM Lucas Capistrant 
wrote:

> Hi all,
>
> I'm writing in regards to my enhancement proposal, #10876
> , and subsequent PR, #10877
> . The issue and PR are related
> to what unused segments the Druid coordinator is able to find and kill with
> machine generated kill tasks. Currently, only segments whose interval end
> date are in the past (relative to the time the Coordinator is looking for
> segments) are able to be killed automatically. My solution allows unused
> segments to be killed whose interval end date is in the future (relative to
> when the Coordinator searches for segments to kill)
>
> My team has found the existing functionality to introduce waste in
> deepstore and metastore when our users are using Druid to build datasources
> that span into the future. These data sources are then being refreshed
> iteratively as future projections change, resulting in unused segments due
> to overshadowing (a common occurrence at my org). Before we applied my
> proposed change internally, we had built up a lot of unused data in
> deepstore and metastore. After using this new feature, we are able to keep
> our deepstore and metastore much more clean. I think this would be a great
> thing for others in the community to have access to, to avoid similar data
> storage pain points.
>
> Unfortunately, it has been quite some time since the PR was created, and
> the only code review I've been able to land was from a non-committer
> colleague of mine. I fear it may never be taken up without a little extra
> push now that it is so far down the open PRs list. My hope is that bringing
> up the topic in the dev list catches the eye of a neutral party who may
> want to give it a look.
>
> I'm going to be able to spend a decent amount of time these next few weeks
> reviewing open PRs in the Druid project, so I'm more than happy to set up a
> "review for a review" type of agreement with someone who is also working on
> a new change. Feel free to reach out directly via email or a comment on my
> PR if you have something you are working to get reviewed.
>
> Thank you,
> Lucas Capistrant
>


Re: [GitHub] [druid] cryptoe commented on a diff in pull request #12339: Make AWS WebIdentityToken actually working and usable from inside EKS.

2022-04-04 Thread Gian Merlino
I thought these emails were supposed to go to comm...@druid.apache.org? I
do see a bunch on that list from today, so maybe this was a weird gitbox
snafu.

On Sun, Apr 3, 2022 at 10:53 PM GitBox  wrote:

>
> cryptoe commented on code in PR #12339:
> URL: https://github.com/apache/druid/pull/12339#discussion_r841385835
>
>
> ##
>
> extensions-core/s3-extensions/src/main/java/org/apache/druid/data/input/s3/S3InputSource.java:
> ##
> @@ -166,15 +175,21 @@ private void applyAssumeRole(
>AWSCredentialsProvider awsCredentialsProvider
>)
>{
> -String assumeRoleArn = s3InputSourceConfig.getAssumeRoleArn();
> -if (assumeRoleArn != null) {
> +// Do not run if WebIdentityToken file and assumeRole ARN are
> detected from the environment variable,
> +// we want the default s3ClientBuilder behavior for ServiceAccount +
> eks.amazonaws.com/role-arn annotation to work.
>
> Review Comment:
>Based on reading:
> https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRoleWithWebIdentity.html
> IMHO `AWS_WEB_IDENTITY_TOKEN_FILE` should be the lowest priority of
> authentication that we should support as it looks like its more supported
> for short duration access to AWS services.
>However, I would somehow first check why AWS_ROLE_ARN got picked up.
> Are you specifying it in the ingestion spec somewhere?
>
>
>
>
>
> --
> This is an automated message from the Apache Git Service.
> To respond to the message, please log on to GitHub and use the
> URL above to go to the specific comment.
>
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
>
> For queries about this service, please contact Infrastructure at:
> us...@infra.apache.org
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: 0.23

2022-03-24 Thread Gian Merlino
I agree it's a good time to do a release. Most of the release-manager steps
involve having commit privileges, but nevertheless, you might find it
interesting to read about the process:
https://github.com/apache/druid/blob/master/distribution/asf-release-process-guide.md

You've actually already done the first step: start a thread on the dev
list. The next part is we have a discussion & see if there's anything
critical we should get into this release before we branch it off.

Anyone have any comments on that?

On Wed, Mar 23, 2022 at 9:54 PM Eyal Yurman 
wrote:

> Hi,
>
> Anyone plan to work on releasing 0.23?
> I'll be really glad to manage the 0.23 release but I'm not a committer.
> Assuming this requires committer privileges, I'd be glad if anyone can
> volunteer.
>
> BTW, have you noticed that we shifted away from the official quarterly
> releases? Perhaps we should discuss our release process. Especially since
> we also don't release minor versions (except for hotfixes immediately after
> a major release).
>


Multi-stage queries

2022-02-25 Thread Gian Merlino
Hey Druids,

I recently posted a proposal on GitHub about adding multi-stage distributed
queries to Druid: https://github.com/apache/druid/issues/12262

I think it'll be a powerful advancement in what Druid is capable of, and
I'm interested in what people think. It's also going to be a lot of work so
it'll need to be sequenced into smaller pieces. I included some thoughts
about how we can get there gradually. If anyone has thoughts or questions I
invite you to write them on the issue.

Gian


Re: Apache Druid Slack

2022-01-21 Thread Gian Merlino
It sounds like a good idea to me. It's not ideal that the current Slack
workspace is hard for new people to join.

On Thu, Jan 20, 2022 at 10:15 AM Vadim Ogievetsky 
wrote:

> I think that the PMC should create a new Slack channel for Apache Druid and
> shift the community towards using it away from the ASF Slack. I volunteer
> to do this on the PMCs behalf and do all the setup/admin work. I would
> share admin access with any PMC member.
>
> What is the motivation for this?
>
> As you may have heard, it’s become increasingly difficult for new users
> without an @apache.org email address to join the ASF #druid Slack channel.
> ASF Infra disabled the option to publicly provide a link to the workspace
> to anyone who wanted it, after encountering issues with spammers.
>
> Per Infra’s guidance (https://infra.apache.org/slack.html), new community
> members should only be invited as single-channel guests. Unfortunately,
> single-channel guests are unable to extend invitations to new members,
> including their colleagues who are using Druid. Only someone with full
> member privileges is able to extend an invitation to new members. This lack
> of consistency doesn’t make the community feel inclusive.
>
> There is a workaround in place (
> https://github.com/apache/druid-website-src/pull/278) – users can send an
> email to druid-u...@googlegroups.com to request an invite to the Slack
> channel from an existing member – but this still poses a barrier to entry,
> and isn’t a viable permanent solution. It also creates potential privacy
> issues as not everyone is at liberty to announce they’re using Druid nor
> wishes to display their email address in a public forum.
>
> I propose we make our own free Slack channel for Apache Druid and encourage
> people to migrate to it. Then we can have our own policy on Slack
> invitations - I would like to restore the ability for anyone on the web to
> join our Slack.
>
> This is not a 100% original idea, in fact this is what other Apache
> projects have done, notably Apache Pinot (see "join our slack" on
> https://pinot.apache.org/). I propose we do the same.
>
> Vadim
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: [E] [DISCUSS] Patch to fix new vulnerabilities in log4j

2021-12-20 Thread Gian Merlino
I think doing a 0.22.2 would be worth it for users' peace of mind, even if
Druid isn't vulnerable by default. Just because people are on edge about
log4j-related stuff right now. In case other people agree, I created an
0.22.2 branch just now. Is anyone able to release-manage this one?

Btw, John and Rahul, assuming we do 0.22.2, I'm not sure what the timing
will be. I don't think we'll do it on the same emergency schedule that we
did for 0.22.1, since this doesn't seem to affect Druid unless you're
explicitly enabling those context patterns mentioned in the log4j advisory.
And there is an easy mitigation: just don't use those context patterns. So
if you are in a rush due to your own internal schedules, you might need to
build your own versions temporarily anyway.
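
For reference, the context patterns in question are the ones called out in the
Log4j advisories for CVE-2021-45046/45105: non-default PatternLayouts that pull
Thread Context (MDC) data into the log line, e.g. a context lookup such as
${ctx:userId} (key name hypothetical) or the %X / %mdc / %MDC converters. A
rough way to scan a config tree for them (the conf/ path is illustrative):

  grep -rn --include='log4j2.xml' -E '\$\{ctx:|%X|%mdc|%MDC' conf/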

On Mon, Dec 20, 2021 at 11:42 AM rahul gidwani 
wrote:

> agreed, at most companies they do a vulnerability scan of the libs to see
> if you have the right version.
>
> On Mon, Dec 20, 2021 at 11:36 AM Pries, John E
>  wrote:
>
> > Can I humbly recommend a quick patch to log4j 2.17.0? The reason is that
> > security organizations don't know, from application to application, which
> > will be impacted or not; it will force us to update ourselves, creating a
> > deviation from core.
> >
> > It just makes things more complicated for everyone if we don't have a
> > recognized safe deployment.
> >
> > On Mon, Dec 20, 2021 at 11:28 AM Frank Chen 
> wrote:
> >
> > > Hi Devs,
> > >
> > > Last week, there were many people leaving comments in the issue/PR
> listed
> > > as follows to enquire that
> > > if there's a newer Druid patch release such as 0.22.2 that fixes the
> new
> > > vulnerabilities (CVE-2021-45046
> > > <https://nvd.nist.gov/vuln/detail/CVE-2021-45046> and
> > > CVE-2021-45105 <https://nvd.nist.gov/vuln/detail/CVE-2021-45105>)
> > > which affect log4j 2.15.0 and 2.16.0
> > >
> > > https://github.com/apache/druid/issues/12054
> > > https://github.com/apache/druid/pull/12061
> > > https://github.com/apache/druid/pull/12051
> > >
> > > So, I bring up this topic here to discuss so that all of us can get a
> > clear
> > > message whether we should do a patch release.
> > >
> > > Following is my personal opinion:
> > >
> > > From the description of these two CVE announcements, we can see that
> > > these two problems only affect log4j pattern layouts which involve the
> > > thread context map (MDC).
> > >
> > > Since Druid's default pattern layout DOES NOT use such a pattern layout,
> > > I think it's safe to say that it's not affected by these vulnerabilities.
> > > So we don't need to release another patch release to address these two
> > > problems.
> > >
> > > We can address these two in the upcoming major release 0.23 which is
> > going
> > > to release next month if everything goes well as scheduled.
> > >
> > >
> > > Frank
> > >
> >
> >
> > --
> > John Pries
> > Verizon
> > 614 560 2132
> >
>


Re: Apache Druid security advisory: critical vulnerability CVE-2021-44228 in Apache Log4j

2021-12-13 Thread Gian Merlino
To clarify about the mitigations: the "-Dlog4j2.formatMsgNoLookups=true"
mitigation that has been floating around the Internet is *not effective*
for log4j 2.8.2, which was used by Druid 0.22.0 and other recent versions.
If you are going to stay on an older version of Druid, do not use this
mitigation. Instead, use one of the two that we mention in our advisory.

(But upgrading is best!)
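
For anyone staying on an affected version for now, here is a rough sketch of
checking which Log4j the installation bundles and applying mitigation (2) from
the advisory quoted below (the jar name and paths are illustrative; check your
own lib/ directory first):

  # from the Druid installation directory
  ls lib/ | grep log4j-core
  # mitigation (2): strip the JNDI classes out of the log4j-core jar
  zip -q -d lib/log4j-core-2.8.2.jar \
    'org/apache/logging/log4j/core/lookup/JndiLookup.class' \
    'org/apache/logging/log4j/core/net/JndiManager*.class'
  # restart the cluster afterwards for this to take effect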

On Sat, Dec 11, 2021 at 1:50 AM Jihoon Son  wrote:

> Severity: critical
>
>
> Description:
>
> Apache Druid uses the Java logging library Apache Log4j, which has
> recently been identified to have a critical vulnerability that could
> lead to remote code execution (RCE). This vulnerability is triggered
> when an attacker can control any part of a log message. Due to the
> wide attack surface, it is critical that all Druid users patch or
> mitigate this vulnerability as soon as possible.
>
> The Log4j advisory is available at
> https://nvd.nist.gov/vuln/detail/CVE-2021-44228.
>
>
> Affected versions:
>
> Druid 0.22.0 and earlier are affected.
>
>
> Mitigation:
>
> We recommend that all users upgrade to Druid 0.22.1, which contains
> Apache Log4j 2.15.0. This version of Log4j has a fix for the
> vulnerability.
>
> If you are unable to upgrade Druid at this time, we recommend
> deploying a mitigation. Please refer to the Log4j announcement for
> details on possible mitigations:
> https://lists.apache.org/thread/bfnl1stql187jytr0t5k0hv0go6b76g4.
>
> Different Log4j versions have different mitigation options. Check the
> "lib" directory of your Druid installation for the "log4j-core" jar to
> see what version of Log4j you have. Recent versions of Druid use Log4j
> 2.8.2. Two possible mitigations for Log4j 2.8.2 are:
>
> 1) Specify "%m{nolookups}" in the PatternLayout configuration of your
> log4j2.xml file. Druid installations may have multiple log4j2.xml
> files; be sure to update all of them.
>
> 2) Remove the JndiLookup and JndiManager classes from the log4j-core jar.
>
> These mitigations require a cluster restart to take effect.
>
>
> References:
>
> https://nvd.nist.gov/vuln/detail/CVE-2021-44228
> https://lists.apache.org/thread/bfnl1stql187jytr0t5k0hv0go6b76g4
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: Need Help Benchmarking Druid

2021-12-11 Thread Gian Merlino
Hey Abdel,

Feel free to DM me on ASF Slack. The info to join is here:
https://druid.apache.org/community/

On Fri, Dec 3, 2021 at 9:11 AM Abdelouahab Khelifati 
wrote:

> Hello,
>
> I am Abdel, a researcher of Computer Science and  I am working on a
> benchmarking paper on time series database systems.
>
> I am interested in your Druid system a lot and I would like to include it
> in the set of time series database systems that we will benchmark (test and
> evaluate).
>
> I would like to consult with one of your system experts on our use-case,
> our data format, our configuration and schema and our queries. We would
> also like some support with the implementation of some of the queries. The
> goal is to ensure that we are using Druid in the most optimal way possible.
>
> If you would like to help us, please reply to me with your preferred format
> and I will provide you with more context about our use-case.
>
> Sincerely, yours,
> _
> Abdel Khelifati
> Ph.D. of Computer Science
> University of Fribourg - Switzerland
> a...@exascale.info
>


Re: [RESULT][VOTE] Release Apache Druid 0.22.1 [RC2]

2021-12-11 Thread Gian Merlino
Thank you for running this release!

On Sat, Dec 11, 2021 at 12:28 AM Jihoon Son  wrote:

> Thanks to everyone who participated in the vote! The vote has passed
> with 3 binding +1s.
>
> Gian Merlino: +1 (binding)
> Clint Wylie: +1 (binding)
> Jonathan Wei: +1 (binding)
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: [VOTE] Release Apache Druid 0.22.1 [RC2]

2021-12-10 Thread Gian Merlino
+1 on releasing 0.22.1-rc2

I verified:

- hashes / gpg
- unit tests
- compared the src and bin packages against 0.22.0 to make sure there were
no unexpected changes (see the sketch after this list)
- attempted to trigger the jndi lookup functionality; it triggered on
0.22.0 but not 0.22.1-rc2
- verified that task logs look normal and do not have the problem mentioned
in https://github.com/apache/druid/pull/12056
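
A minimal sketch of the src/bin package comparison step (artifact names follow
the vote email below; any recursive diff works):

  tar -xzf apache-druid-0.22.0-bin.tar.gz
  tar -xzf apache-druid-0.22.1-bin.tar.gz
  diff -r apache-druid-0.22.0 apache-druid-0.22.1 | less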

On Fri, Dec 10, 2021 at 9:02 PM Jihoon Son  wrote:

> Hi all,
>
> I have created a build for Apache Druid 0.22.1, release
> candidate 2.
>
> Thanks to everyone who has helped contribute to the release! You can read
> the proposed release notes here:
> https://github.com/apache/druid/issues/12054
>
> The release candidate has been tagged in GitHub as
> druid-0.22.1-rc2 (81e4da747d4fcfd15fa15bfebb942058152a3bba),
> available here:
> https://github.com/apache/druid/releases/tag/druid-0.22.1-rc2
>
> The artifacts to be voted on are located here:
> https://dist.apache.org/repos/dist/dev/druid/0.22.1-rc2/
>
> A staged Maven repository is available for review at:
> https://repository.apache.org/content/repositories/orgapachedruid-1028/
>
> A Docker image containing the binary of the release candidate can be
> retrieved via:
> docker pull apache/druid:0.22.1-rc2
>
> artifact checksums
> src:
>
> 2fea9417a7c164703d8f8bc19bbe70e743b82e6c5cf3ba9b7bc63c60a545b507757e36412bf06f6f522cf6de1cff1fe5141575c030c5fb779270012110f1427b
> bin:
>
> 716b83e07a76b5c9e0e26dd49028ca088bde81befb070989b41e71f0e8082d11a26601f4ac1e646bf099a4bc7420bdfeb9f7450d6da53d2a6de301e08c3cab0d
> docker: 93aae94d4509768e455c444ea7d8515e1a1b447179e043f7b39630c2350125a9
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/jihoonson.asc
>
> This key and the key of other committers can also be found in the project's
> KEYS file here:
> https://dist.apache.org/repos/dist/release/druid/KEYS
>
> (If you are a committer, please feel free to add your own key to that file
> by following the instructions in the file's header.)
>
>
> Verify checksums:
> diff <(shasum -a512 apache-druid-0.22.1-src.tar.gz | \
> cut -d ' ' -f1) \
> <(cat apache-druid-0.22.1-src.tar.gz.sha512 ; echo)
>
> diff <(shasum -a512 apache-druid-0.22.1-bin.tar.gz | \
> cut -d ' ' -f1) \
> <(cat apache-druid-0.22.1-bin.tar.gz.sha512 ; echo)
>
> Verify signatures:
> gpg --verify apache-druid-0.22.1-src.tar.gz.asc \
> apache-druid-0.22.1-src.tar.gz
>
> gpg --verify apache-druid-0.22.1-bin.tar.gz.asc \
> apache-druid-0.22.1-bin.tar.gz
>
> Please review the proposed artifacts and vote. Note that Apache has
> specific requirements that must be met before +1 binding votes can be cast
> by PMC members. Please refer to the policy at
> http://www.apache.org/legal/release-policy.html#policy for more details.
>
> As part of the validation process, the RAT license check can be run from
> source by:
> mvn apache-rat:check -Prat
>
> The release artifacts can be generated from source by running:
> mvn clean install -Papache-release,dist -Dgpg.skip
>
> The vote will pass if a majority of at least three +1 PMC votes are cast.
> Because we want to release this version as soon as possible, the vote will
> be closed as soon as it passes.
>
> [ ] +1 Release this package as Apache Druid 0.22.1
> [ ] 0 I don't feel strongly about it, but I'm okay with the release
> [ ] -1 Do not release this package because...
>
> Thanks!
>


Re: Log4j vulnerability - hotfix?

2021-12-10 Thread Gian Merlino
Hi David,

Right now we are very much dedicating our efforts to getting a 0.22.1 patch
release out. It's taking longer than we'd hoped due to an unexpected issue
with the upgrade to log4j 2.15.0: https://github.com/apache/druid/pull/12056
.

Based on the testing we've done so far, though, I think there's another
mitigation available to you if you want to stay on 0.21: you could drop in
the 5 new log4j2 2.15.0 jars to lib/, remove the 2.8.2 jars, and add
-Dlog4j2.is.webapp=false to your jvm command line. The new jars will fix
the vulnerability and the jvm config avoids the error on shutdown.
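
Roughly, that swap might look like the following; file names and paths are
illustrative only, and the actual 2.15.0 jar set should be taken from a 0.22.1
build:

  cd /path/to/apache-druid-0.21.0
  rm lib/log4j-*-2.8.2.jar                                      # remove the old jars
  cp /path/to/apache-druid-0.22.1/lib/log4j-*-2.15.0.jar lib/   # drop in the new ones
  # then add -Dlog4j2.is.webapp=false to each service's jvm.config and restart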

On Fri, Dec 10, 2021 at 2:35 PM David Glasser 
wrote:

> I will note that the `%m{nolookups}` workaround feels a lot more
> challenging to feel comfortable using than the `-D`/env var
> workarounds that work in the newer versions. For example, our
> log4j2.xml file has two Appenders, one of which uses JsonLayout and
> one of which uses PatternLayout. It's hard to understand from the docs
> as a non-log4j-expert if the JsonLayout appender is vulnerable or not
> and if there's a way to apply `%m{nolookups}` to it.
>
> Because the workarounds for Druid are more challenging than for
> projects on the slightly newer versions of log4j2, perhaps it would be
> appropriate to put out one or two more patch releases, against 0.21
> and/or 0.20? I know our installation is still on 0.21, which is less
> than 2 months old.
>
> On Fri, Dec 10, 2021 at 11:35 AM Gian Merlino  wrote:
> >
> > We're working on this right now and will be getting a vote / release for
> > 0.22.1 out asap.
> >
> > Btw, the log4j announcement mentions a mitigation that does work for our
> > current version (2.8.2). It's part (b) here, specifying "%m{nolookups}"
> in
> > the PatternLayout configuration:
> > https://lists.apache.org/thread/bfnl1stql187jytr0t5k0hv0go6b76g4.
> However,
> > I haven't personally tested this, so I cannot provide any more details
> > beyond pointing to the announcement.
> >
> > On Fri, Dec 10, 2021 at 10:27 AM Lucas Capistrant <
> > capistrant.lu...@gmail.com> wrote:
> >
> > > Since it is “critical” severity, I think it would be a good idea to
> > > seriously consider pushing out a minor version of 0.22.x. Especially
> since
> > > the mitigation strategy outlined in the CVE is not available in the
> log4j
> > > version that exists today in the current stable release. There is past
> > > precedent for such releases: see 0.20.2
> > >
> > > On Fri, Dec 10, 2021 at 12:14 PM Eyal Yurman  > > .invalid>
> > > wrote:
> > >
> > > > Hello, regarding https://github.com/apache/druid/pull/12051 which
> merged
> > > > to
> > > > master,
> > > >
> > > > Is it a common practice for the project to backport and release a new
> > > minor
> > > > for the latest version?
> > > >
> > >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: [VOTE] Release Apache Druid 0.22.1 [RC1]

2021-12-10 Thread Gian Merlino
My vote is 0 on this release.

I verified the usual things, and compared the src and bin packages against
0.22.0 to make sure there were no unexpected changes. That all looks OK to
me. But there is an issue with weird errors at the end of logfiles for
processes that exit normally. It's especially noticeable for indexing
tasks. It looks like some issue with the shutter-downer stuff that was
introduced in https://github.com/apache/druid/pull/1387. The error looks
like the following. If this is fixable I think it'd be good to fix it in
0.22.1-rc2. But if we can't get that done today, I think it's OK to go
ahead with releasing 0.22.1-rc1 with this listed as a known issue, because
the security issue is serious and shouldn't wait another day.

2021-12-10T23:13:38,740 ERROR [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Exception when stopping method[public void org.apache.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop()] on object[org.apache.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner@30a5a58d]
java.lang.reflect.InvocationTargetException: null
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
  at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
  at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:465) [druid-core-0.22.1.jar:0.22.1]
  at org.apache.druid.java.util.common.lifecycle.Lifecycle.stop(Lifecycle.java:368) [druid-core-0.22.1.jar:0.22.1]
  at org.apache.druid.cli.CliPeon.run(CliPeon.java:323) [druid-services-0.22.1.jar:0.22.1]
  at org.apache.druid.cli.Main.main(Main.java:113) [druid-services-0.22.1.jar:0.22.1]
Caused by: org.apache.druid.java.util.common.ISE: Expected state [STARTED] found [INITIALIZED]
  at org.apache.druid.common.config.Log4jShutdown.stop(Log4jShutdown.java:105) ~[druid-core-0.22.1.jar:0.22.1]
  at org.apache.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop(Log4jShutterDownerModule.java:116) ~[druid-server-0.22.1.jar:0.22.1]
  ... 8 more
Finished peon task

On Fri, Dec 10, 2021 at 12:25 PM Jihoon Son  wrote:

> Hi all,
>
> I have created a build for Apache Druid 0.22.1, release
> candidate 1.
>
> Thanks to everyone who has helped contribute to the release! You can read
> the proposed release notes here:
> https://github.com/apache/druid/issues/12054
>
> The release candidate has been tagged in GitHub as
> druid-0.22.1-rc1 (c052fa52b2266e25c5c31b9156c530aa29aeb147),
> available here:
> https://github.com/apache/druid/releases/tag/druid-0.22.1-rc1
>
> The artifacts to be voted on are located here:
> https://dist.apache.org/repos/dist/dev/druid/0.22.1-rc1/
>
> A staged Maven repository is available for review at:
> https://repository.apache.org/content/repositories/orgapachedruid-1027/
>
> A Docker image containing the binary of the release candidate can be
> retrieved via:
> docker pull apache/druid:0.22.1-rc1
>
> artifact checksums
> src:
>
> 9a6304d2c434e0a8226ef8621c1d6653cf80eca10e2be24936b98ec4f7cc19f80193599aea6cd3992a6998e600134e6e71d001630b7e51ca0ab98310850cc064
> bin:
>
> eceacdb0ffca7da462eddc31aaed735c02f639c6f7bafc826fd050095c00a84bffc5694f0887d43436d704e3686e437d44e1a226b4f0d7701f9c450c63f1e1c8
> docker: 929d217c5c86d59b69db3f0202387a4b0103c8a35b7011ce8e00e95582ade380
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/jihoonson.asc
>
> This key and the key of other committers can also be found in the project's
> KEYS file here:
> https://dist.apache.org/repos/dist/release/druid/KEYS
>
> (If you are a committer, please feel free to add your own key to that file
> by following the instructions in the file's header.)
>
>
> Verify checksums:
> diff <(shasum -a512 apache-druid-0.22.1-src.tar.gz | \
> cut -d ' ' -f1) \
> <(cat apache-druid-0.22.1-src.tar.gz.sha512 ; echo)
>
> diff <(shasum -a512 apache-druid-0.22.1-bin.tar.gz | \
> cut -d ' ' -f1) \
> <(cat apache-druid-0.22.1-bin.tar.gz.sha512 ; echo)
>
> Verify signatures:
> gpg --verify apache-druid-0.22.1-src.tar.gz.asc \
> apache-druid-0.22.1-src.tar.gz
>
> gpg --verify apache-druid-0.22.1-bin.tar.gz.asc \
> apache-druid-0.22.1-bin.tar.gz
>
> Please review the proposed artifacts and vote. Note that Apache has
> specific requirements that must be met before +1 binding votes can be cast
> by PMC members. Please refer to the policy at
> http://www.apache.org/legal/release-policy.html#policy for more details.
>
> As part of the validation process, the release artifacts can be generated
> from source by running:
> mvn clean install -Papache-release,dist -Dgpg.skip
>
> The RAT license check can be run from source by:
> mvn apache-rat:check -Prat
>
> Because we want to make this release as soon as possible, this 

Re: Log4j vulnerability - hotfix?

2021-12-10 Thread Gian Merlino
We're working on this right now and will be getting a vote / release for
0.22.1 out asap.

Btw, the log4j announcement mentions a mitigation that does work for our
current version (2.8.2). It's part (b) here, specifying "%m{nolookups}" in
the PatternLayout configuration:
https://lists.apache.org/thread/bfnl1stql187jytr0t5k0hv0go6b76g4. However,
I haven't personally tested this, so I cannot provide any more details
beyond pointing to the announcement.
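
As an untested sketch of that mitigation (the pattern shown is hypothetical,
not necessarily Druid's shipped default): add {nolookups} to the message
converter in every PatternLayout, in every log4j2.xml the installation uses,
then restart.

  # e.g. change "%m" to "%m{nolookups}" in each PatternLayout, such as:
  #   <PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m{nolookups}%n"/>
  # locate the config files to edit:
  find . -name log4j2.xml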

On Fri, Dec 10, 2021 at 10:27 AM Lucas Capistrant <
capistrant.lu...@gmail.com> wrote:

> Since it is “critical” severity, I think it would be a good idea to
> seriously consider pushing out a minor version of 0.22.x. Especially since
> the mitigation strategy outlined in the CVE is not available in the log4j
> version that exists today in the current stable release. There is past
> precedent for such releases: see 0.20.2
>
> On Fri, Dec 10, 2021 at 12:14 PM Eyal Yurman  .invalid>
> wrote:
>
> > Hello, regarding https://github.com/apache/druid/pull/12051 which merged
> > to
> > master,
> >
> > Is it a common practice for the project to backport and release a new
> minor
> > for the latest version?
> >
>


Re: Need help in understanding real-time ingestion task pause behavior during checkpointing

2021-12-02 Thread Gian Merlino
Harini, those are interesting findings. I'm not sure if the two pauses are
necessary, but my thought is that it ideally shouldn't matter because the
supervisor shouldn't be taking that long to handle its notices. A couple
things come to mind about that:

1) Did you see what specifically the supervisor is doing when it's handling
the notices? Maybe from a stack trace? We should look into optimizing it,
or making it asynchronous or something, depending on what it is.
2) Although, there isn't really a need to trigger a run for every single
task status change anyway; I think it's ok to coalesce them. This patch
would do it: https://github.com/apache/druid/pull/12018

Jason, also interesting findings! I took a crack at rebasing your patch on
master and adding a scale test for the TaskQueue with 1000 simulated tasks:
https://github.com/apache/druid/compare/master...gianm:tq-scale-test. When
I run the scale test, "doMassLaunchAndExit" passes quickly but
"doMassLaunchAndShutdown" times out. I suppose shutting down lots of tasks
is still a bottleneck.

Looking at RemoteTaskRunner and HttpRemoteTaskRunner, it should be pretty
straightforward to make the shutdown API asynchronous, which would help
speed up anything that is shutting down lots of tasks all at once. Would
that be helpful in your environments? Or are the changes to move shutdown
out of critical sections going to be enough?

On Wed, Dec 1, 2021 at 1:27 PM Jason Koch  wrote:

> Hi Harini,
>
> We have seen issues like this related to task roll time, related to task
> queue notifications on overlord instances; I have a patch running
> internally that resolves this.
>
> These are my internal triage notes:
> ==
> - Whenever task scheduling is happening (startup, ingest segment task
> rollover, redeployment of datasource) Overlord takes a long time to assign
> workers. This compounds because tasks sit so long before deployment that it
> starts failing tasks and having to relaunch them.
>
>- TaskQueue: notifyStatus() which receives updates from the
>middlemanagers, and the manage() loop which controls services, runs
> through
>a single lock. For example, the shutdown request involves submitting
>downstream HTTP requests synchronously (while holding the lock).
>- This means for a cluster with ~700 tasks that tasks are held for
>nearly 1second, and only after each 1 second around the manage loop can
> 1-2
>notifications be processed. For a new startup, with 700 tasks, and a
> 1sec
>delay, that is 300-600-or-more seconds for the overlord to realise all
> the
>tasks are started by the middle manager.
>- Similar delays happen for any other operations.
>    - Sub-optimal logging code path (double-concatenating very long log
>entries),
>- ZkWorker: Worker fully deserializing all ZK payload data every time
>looking up task IDs rather than only looking at the ID fields.
> Similarly,
>repeat fetching data on task assignment.
>
> =
>
> The patch I have is here:
> https://github.com/jasonk000/druid/pull/7/files
>
> It fixes a couple of things, most importantly the task queue notification
> system. The system is much more stable with high task counts and will
> easily restart many tasks concurrently.
>
> I have other perf issues I want to look at first before I can document it
> fully, build a test case, rebase it on apache/master, etc. If you test it
> out, and it works, we could submit a PR that would resolve it.
>
> PS - I have a queue of similar fixes I'd like to submit, but need some time
> to do the documentation, build test cases, upstreaming, etc, if anyone
> wants to collaborate, I could open some Issues and share my partial notes.
>
> Thanks
> Jason
>
> On Wed, Dec 1, 2021 at 12:59 PM Harini Rajendran
>  wrote:
>
> > Hi all,
> >
> > I have been investigating this in the background for a few days now and
> > need some help from the community.
> >
> > We noticed that every hour, when the tasks roll, we see a spike in the
> > ingestion lag for about 2-4 minutes. We have 180 tasks running on this
> > datasource.
> > [image: Screen Shot 2021-12-01 at 9.14.23 AM.png]
> >
> > On further debugging of the task logs, we found out that around the time
> > when the ingestion lag spikes up, *the gap between pause and resume
> > commands in the task logs during checkpointing is wide, ranging from a few
> > seconds to a couple of minutes*. For example, in the following task logs you
> > can see that it was about 1.5 minutes.
> > {"@timestamp":"2021-11-18T*20:06:58.513Z*", "log.level":"DEBUG",
> > "message":"Received pause command, *pausing* ingestion until resumed.", "
> > service.name
> > ":"druid/middleManager","event.dataset":"druid/middleManager.log","
> > process.thread.name
> >
> ":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
> > {"@timestamp":"2021-11-18T*20:08:26.326Z*", "log.level":"DEBUG",
> > "message":"Received pause command, *pausing* ingestion 

Re: Push-down of operations for SystemSchema tables

2021-11-29 Thread Gian Merlino
Hey Jason,

The concept you sketched out looks promising! I hope to have a chance to
look into the code soon, but for now I just had a couple of questions based
on your email.

First: what do you think of the relative merits of having the CursorFactory
do the sort vs. having the Scan engine do the sort?

Second: I recall you were originally trying to speed up system tables and
help them scale better. Have you tried this patch to see if it helps in
that regard?

On Mon, Nov 22, 2021 at 9:19 AM Jason Koch 
wrote:

> Hi Gian
>
> It looks like I have a solution that works, and I'd like to present it for
> your thoughts.
>
>
> https://github.com/apache/druid/compare/master...jasonk000:segment-table-interim?diff=unified
>
> At runtime, the DruidQuery attempts to construct a ScanQuery. To do so, it
> checks with the underlying datasource whether a scan is supported, and if
> it is, it constructs a ScanQuery to match and passes it on. Later,
> ScanQueryEngine will look at whether the adapter is a SortedCursorFactory
> and if it is, it will request the cursors to be constructed and pass
> through all relevant sort/limit/filter to the underlying table logic.
> VirtualTable implementation can then perform its magic and return a result!
>
> The solution is:
> - Create a new VirtualDataSource, a new VirtualSegment and
> VirtualStorageAdapter, as well as a new SortedCursorFactory to support the
> VirtualDataSource.
> - System tables have a DruidVirtualTableRule that will convert a
> LogicalTableScan of a VirtualTable into a PartialDruidQuery whilst passing
> the query authentication information.
> - DataSource gets a function canScanOrdered so that it can advise the Query
> parser whether it can handle a given sort function - default is that only
> existing time-ordering is supported. Implementations (ie:
> VirtualDataSource) can decide to accept an ordered scan or not.
>
> Overall, I think this provides a viable solution and meets your goals. In
> rebasing this morning I see you're on the same pathway with ordered scan
> query, so I could rebase on top of that and break into a smaller set of
> PRs, nonetheless the conceptual approach and direction is something that I
> think will work.
>
> Thanks!
> Jason
>
>
>
>
>
>
> On Wed, May 19, 2021 at 9:54 PM Gian Merlino  wrote:
>
> > Hey Jason,
> >
> > It sounds like we have two different, but related goals:
> >
> > 1) Your goal is to improve the performance of system tables.
> >
> > 2) My goal with the branch Clint linked is to enable using Druid's native
> > query engine for system tables, in order to achieve consistency in how
> SQL
> > queries are executed and also so all of Druid's special functions,
> > aggregations, extensions, etc, are available to use in system tables.
> >
> > Two notes about what I'm trying to do, btw, in response to things you
> > raised. First, I'm interested in using Druid's native query engine under
> > the hood, but I'm not necessarily driving towards being able to actually
> > query system tables using native Druid queries. It still achieves my
> goals
> > if these tables are only available in SQL queries. Second, you're correct
> > that for this to work for queries with 'order by' but no 'group by', we'd
> > need to add ordering support to the Scan query.
> >
> > That's actually the main reason I stopped working on this branch: I
> started
> > looking into Scan ordering instead. Then I got distracted with other
> stuff,
> > and now I'm working on neither of those things. Anyway, I think it'll
> be
> > somewhat involved to implement Scan ordering in a scalable way for any
> > possible query on any possible datasource, but if we're focused on sys
> > tables, we can take a shortcut that is less-scalable. It wouldn't be
> tough
> > to make something that works for anything that works today, since the
> > bindable convention we use today simply does the sort in memory (see
> > org.apache.calcite.interpreter.SortNode). That might be a good way to
> > unblock the sys-table-via-native-engine work. We'd just need some
> safeguard
> > to prevent that code from executing on real datasources that are too big
> to
> > materialize. Perhaps a row limit, or perhaps enabling in-memory ordering
> > using an undocumented Scan context parameter set by the SQL layer only
> for
> > sys tables.
> >
> > > I am interested. For my current work, I do want to keep focus on the
> > > sys.* performance work. If there's a way to do it and lay the
> > > groundwork or even get all the work done, then I am 100% for that.
> > >  Looking at what you want to do to convert the

Re: Druid-specific Calcite keywords

2021-11-05 Thread Gian Merlino
Thanks for the note! Do you have any pointers to projects that do this in
sort of a "best practices" way?

As to specifics: one thing I wanted to explore is adding keywords that do
some of the same things as our query context parameters, so you don't have
to set context parameters in order to get the behavior you want. (Sometimes
people find it difficult to do that due to the abstractions that sit
between them and the Druid API.) That's stuff like useApproximateTopN,
which maybe could be "ORDER BY APPROXIMATE <expr>" or "ORDER BY
<expr> APPROXIMATE".

I had also wanted to look at adding a PARTITION BY keyword that controls
partitioning of query result sets.

On Thu, Nov 4, 2021 at 11:48 PM Julian Hyde  wrote:

> Some specifics would be useful. But in general, adding a new keyword
> (reserved or non-reserved) will require changes to the parser. Calcite
> allows (I won't say it makes it easy) for projects like Druid to
> create a derived parser by building a parser from the same parser
> template as Calcite's core parser but with different template
> variables.
>
> If the keyword is non-reserved, there is an additional grammar rule
> that transforms the keyword token back into an identifier. It applies
> in all contexts except the one where the keyword is specifically
> needed by the grammar.  For example, the non-reserved keyword
> BERNOULLI can only occur immediately after the keyword TABLESAMPLE. In
> a location that expects an identifier (e.g. after FROM), BERNOULLI
> will be converted into an identifier. Thus you can use BERNOULLI as a
> table name.
>
> Julian
>
> On Thu, Nov 4, 2021 at 2:18 PM Gian Merlino  wrote:
> >
> > Hey Druids,
> >
> > I'm looking into how to add keywords to Druid's SQL dialect, and I wanted
> > to ask if anyone has enough familiarity with Calcite to point at some
> info
> > about how to do that without needing to modify Calcite itself?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Druid-specific Calcite keywords

2021-11-04 Thread Gian Merlino
Hey Druids,

I'm looking into how to add keywords to Druid's SQL dialect, and I wanted
to ask if anyone has enough familiarity with Calcite to point at some info
about how to do that without needing to modify Calcite itself?


Druid Summit 2021

2021-09-28 Thread Gian Merlino
Hey Druids,

I am excited to write to you about Druid Summit (https://druidsummit.org/),
an event being held virtually on November 9–10, 2021. The entire Apache
Druid community is welcome, and registration is free.

It would also be great to see a bunch of people from the community giving
talks about Druid. The call for presentations is open here:
https://forms.gle/81VQEdBrDs3M3Ysx5 until *Oct 4* (next week). A title and
short abstract is OK for a proposal. It's a great opportunity to share your
story with the Druid community.

There are 3 regional events scheduled. They're all happening at the same
time, virtually, and are really different facets of the same global event.
The CFP doesn't have a space to specify which region you want to speak in,
so if your talk is accepted, we will contact you to determine which region
makes the most sense for your talk.

If you have any questions about putting together a proposal, I'd be happy
to answer them.

The following kinds of presentations are all great:

- A deep dive into Druid features or architectural design.
- Architectural patterns you have used when deploying Druid.
- Operational stories about running Druid in production.
- How you use Druid to address a specific use case.
- Anything else Druid-related that sounds interesting.

Looking forward to seeing many of you there!

Gian


Re: [Proposal] - Kafka Input Format for headers, key and payload parsing

2021-09-21 Thread Gian Merlino
Hey Lokesh,

Thanks for the details. To me it makes more sense to have the user specify
the entire timestamp and key field name (it seems weird to have a
"timestamp prefix" and "key prefix" that are only used for single fields).
I just wrote that + a few comments on the PR itself:
https://github.com/apache/druid/pull/11630#pullrequestreview-760351816

On Fri, Sep 17, 2021 at 9:43 AM Lokesh Lingarajan 
wrote:

> Hi Gian,
>
> Thanks for your reply; please find my comments below.
>
> 1) How is the timestamp exposed exactly? I see there is a
> recordTimestampLabelPrefix, but what is that a prefix to? Also: what do you
> think about accepting the entire name of the timestamp field instead?
> Finally: in the docs it would be good to have an example of how people can
> write a timestampSpec that refers to the Kafka timestamp, and also how they
> can load the Kafka timestamp as a long-typed dimension storing millis since
> the epoch (our convention for secondary timestamps).
>
> >>> The input format allows users to pick and choose the timestamp value
> either from the header/key/value portions of the kafka record. If the
> timestamp is missing in both key and value parts, then users can always
> default to the timestamp that is available in the header. Code will default
> this column with the name "kafka.timestamp". recordTimestampLabelPrefix allows
> users to change the "kafka" to something else. If this model is deviating
> from what we currently have in druid, then I agree we should change this to
> giving a full name.  Currently timestamp is loaded directly from
> ConsumerRecord data structure as follows
>
> // Add kafka record timestamp to the mergelist, we will skip record timestamp 
> if the same key exists already in the header list
> mergeMap.putIfAbsent(recordTimestampColumn, record.getRecord().timestamp());
>
>
> 2) You mention that the key will show up as "kafka.key", and in the
> example you provide I don't see a parameter enabling a choice of what that
> field is called. Is it hard-coded or is it configurable somehow?
>
> >>> this behavior is exactly the same as the timestamp discussed above. If
> nothing is done, we will have a column named "kafka.key", users have the
> choice to change the "kafka" to something else. We can make the change
> uniform here as well based on the above decision.
>
> 3) Could you write up some user-facing docs too, like an addition to
> development/extensions-core/kafka-ingestion.md? That way, people will know
> how to use this feature. And it'll help us better understand how it's
> supposed to work. (Perhaps it could have answered the two questions above)
>
> >>> Absolutely agree with you, I will do that along with other review
> comments from the code.
>
> Thanks again for looking into this :)
>
> -Lokesh
>
>
> On Thu, Sep 16, 2021 at 9:34 AM Gian Merlino  wrote:
>
>> Lokesh, it looks like you got dropped from the thread, so I'm adding you
>> back. Please check out the previous message for some comments.
>>
>> By the way, by default, replies to the dev list go back to the dev list
>> only, which can cause you to miss some replies. If you join the list you
>> will be sure to get all your replies 
>>
>> On Tue, Sep 14, 2021 at 10:10 PM Gian Merlino  wrote:
>>
>>> Hey Lokesh,
>>>
>>> The concept and API looks solid to me! Thank you for writing this up. I
>>> agree with Ben's comment. This will be really useful functionality.
>>>
>>> I have a few questions about how it would work:
>>>
>>> 1) How is the timestamp exposed exactly? I see there is a
>>> recordTimestampLabelPrefix, but what is that a prefix to? Also: what do you
>>> think about accepting the entire name of the timestamp field instead?
>>> Finally: in the docs it would be good to have an example of how people can
>>> write a timestampSpec that refers to the Kafka timestamp, and also how they
>>> can load the Kafka timestamp as a long-typed dimension storing millis since
>>> the epoch (our convention for secondary timestamps).
>>>
>>> 2) You mention that the key will show up as "kafka.key", and in the
>>> example you provide I don't see a parameter enabling a choice of what that
>>> field is called. Is it hard-coded or is it configurable somehow?
>>>
>>> 3) Could you write up some user-facing docs too, like an addition to
>>> development/extensions-core/kafka-ingestion.md? That way, people will know
>>> how to use this feature. And it'll help us better understand how it's
>>> supposed to work. 

Re: [Proposal] - Kafka Input Format for headers, key and payload parsing

2021-09-16 Thread Gian Merlino
Lokesh, it looks like you got dropped from the thread, so I'm adding you
back. Please check out the previous message for some comments.

By the way, by default, replies to the dev list go back to the dev list
only, which can cause you to miss some replies. If you join the list you
will be sure to get all your replies 

On Tue, Sep 14, 2021 at 10:10 PM Gian Merlino  wrote:

> Hey Lokesh,
>
> The concept and API looks solid to me! Thank you for writing this up. I
> agree with Ben's comment. This will be really useful functionality.
>
> I have a few questions about how it would work:
>
> 1) How is the timestamp exposed exactly? I see there is a
> recordTimestampLabelPrefix, but what is that a prefix to? Also: what do you
> think about accepting the entire name of the timestamp field instead?
> Finally: in the docs it would be good to have an example of how people can
> write a timestampSpec that refers to the Kafka timestamp, and also how they
> can load the Kafka timestamp as a long-typed dimension storing millis since
> the epoch (our convention for secondary timestamps).
>
> 2) You mention that the key will show up as "kafka.key", and in the
> example you provide I don't see a parameter enabling a choice of what that
> field is called. Is it hard-coded or is it configurable somehow?
>
> 3) Could you write up some user-facing docs too, like an addition to
> development/extensions-core/kafka-ingestion.md? That way, people will know
> how to use this feature. And it'll help us better understand how it's
> supposed to work. (Perhaps it could have answered the two questions above)
>
> Full disclosure: I haven't reviewed the patch yet; these questions are
> just based on your writeup.
>
> On Mon, Aug 30, 2021 at 3:00 PM Lokesh Lingarajan
>  wrote:
>
>> Motivation
>>
>> Today we ingest a number of high cardinality metrics into Druid across
>> dimensions. These metrics are rolled up on a per minute basis, and are
>> very
>> useful when looking at metrics on a partition or client basis. Events is
>> another class of data that provides useful information about a particular
>> incident/scenario inside a Kafka cluster. Events themselves are carried
>> inside the kafka payload, but nonetheless there is some very useful
>> metadata that is carried in kafka headers that can serve as a useful
>> dimension for aggregation and in turn bringing better insights.
>>
>> PR(#10730 <https://github.com/apache/druid/pull/10730>) introduced
>> support
>> for Kafka headers in InputFormats.
>>
>> We still need an input format to parse out the headers and translate those
>> into relevant columns in Druid. Until that’s implemented, none of the
>> information available in the Kafka message headers would be exposed. So
>> first there is a need to implement an input format that can parse headers
>> in any given format(provided we support the format) like we parse payloads
>> today. Apart from headers there is also some useful information present in
>> the key portion of the kafka record. We also need a way to expose the data
>> present in the key as druid columns. We need a generic way to express at
>> configuration time what attributes from headers, key and payload need to
>> be
>> ingested into druid. We need to keep the design generic enough so that
>> users can specify different parsers for headers, key and payload.
>>
>> Proposal is to design an input format to solve the above by providing
>> wrapper around any existing input formats and merging the data into a
>> single unified Druid row.
>> Proposed changes
>>
>> Let's look at a sample input format from the above discussion
>>
>>
>> "inputFormat": {
>>   "type": "kafka",                        // New input format type
>>   "headerLabelPrefix": "kafka.header.",   // Label prefix for header columns,
>>                                           // this will avoid collisions while merging columns
>>   "recordTimestampLabelPrefix": "kafka.", // Kafka record's timestamp is made
>>                                           // available in case payload does not carry timestamp
>>   "headerFormat": {                       // Header parser specifying that values are of type string
>>     "type": "string"
>>   },
>>   "valueFormat": {                        // Value parser from json parsing
>>     "type": "json",
>>     "flattenSpec": {
>>       "useFieldDiscovery": true,
>>       "fields": [...]
>>     }
>>   },
>>   "keyFormat": {                          // Key parser also from json parsing
>>     "type": "json"
>>   }
>> }

Re: compression strategy concurrency

2021-09-14 Thread Gian Merlino
Hey Rahul,

What kind of errors are you seeing? I ran the test a few times with a
bumped up number of threads, and I did see a few problems but they were in
the Closer. It looks like a single Closer is used for every thread, which
is bad because Closers are not thread-safe (they are built around an
ArrayDeque and don't use synchronization). But that's a problem with the
test. (That should be fixed, ideally!)

As to the actual CompressionStrategy methods, it looks like the various
implementations all return singletons from getDecompressor, getCompressor.
So they better be thread safe! The LZ4 one uses singleton instances of
LZ4Factory.fastestInstance().safeDecompressor() and
LZ4Factory.fastestInstance().highCompressor(). Both of those look
thread-safe to me. The LZF and Uncompressed ones don't use anything at all
so they seem ok too.
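For the test itself, the simplest fix I can think of is to give each thread
its own Closer rather than sharing one. A rough sketch (illustrative only, not
the actual CompressionStrategyTest code; it uses Guava's Closer just to keep
the example self-contained):

import com.google.common.io.Closer;
import java.util.concurrent.Callable;

class PerThreadCloserSketch
{
  // Each thread/task gets its own Closer, so the unsynchronized ArrayDeque
  // inside it is never touched from more than one thread.
  Callable<Void> compressionTask()
  {
    return () -> {
      final Closer closer = Closer.create();
      try {
        // ... allocate input/output buffers, register them with closer,
        // run compress/decompress round trips, assert results ...
        return null;
      } finally {
        closer.close();
      }
    };
  }
}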

On Tue, Sep 14, 2021 at 2:29 PM rahul gidwani  wrote:

> What is the desired thread safety of the CompressionStrategy class?  From
> looking at it from an API perspective, it looks to be you:
>
> Allocate input buffer, Allocate output buffer, compress / decompress.
>
> In the CompressionStrategyTest.testConcurrency() test, if you bump the number
> of threads to 100 and run it a few times, you will see there are race
> conditions which will cause failures.
>
> The quick and easy solution to make this thread safe is to synchronize the
> methods.  But in reality it looks like this class is mainly used to
> compress and decompress segments, so that will be a thread per segment, which
> is okay.
>
> My question is, I am confused as to what the behavior should be: should it
> be thread safe, with a generic API to compress / decompress? If so, then we
> should fix the code in CompressionStrategy to be thread safe; if not, then
> maybe remove that test and mark the class as NotThreadSafe.
>
> Was wondering about other people's thoughts.
> Thanks
>


Re: [Proposal] - Kafka Input Format for headers, key and payload parsing

2021-09-14 Thread Gian Merlino
Hey Lokesh,

The concept and API looks solid to me! Thank you for writing this up. I
agree with Ben's comment. This will be really useful functionality.

I have a few questions about how it would work:

1) How is the timestamp exposed exactly? I see there is a
recordTimestampLabelPrefix, but what is that a prefix to? Also: what do you
think about accepting the entire name of the timestamp field instead?
Finally: in the docs it would be good to have an example of how people can
write a timestampSpec that refers to the Kafka timestamp, and also how they
can load the Kafka timestamp as a long-typed dimension storing millis since
the epoch (our convention for secondary timestamps).

2) You mention that the key will show up as "kafka.key", and in the example
you provide I don't see a parameter enabling a choice of what that field is
called. Is it hard-coded or is it configurable somehow?

3) Could you write up some user-facing docs too, like an addition to
development/extensions-core/kafka-ingestion.md? That way, people will know
how to use this feature. And it'll help us better understand how it's
supposed to work. (Perhaps it could have answered the two questions above)

Full disclosure: I haven't reviewed the patch yet; these questions are just
based on your writeup.

On Mon, Aug 30, 2021 at 3:00 PM Lokesh Lingarajan
 wrote:

> Motivation
>
> Today we ingest a number of high cardinality metrics into Druid across
> dimensions. These metrics are rolled up on a per minute basis, and are very
> useful when looking at metrics on a partition or client basis. Events is
> another class of data that provides useful information about a particular
> incident/scenario inside a Kafka cluster. Events themselves are carried
> inside the kafka payload, but nonetheless there is some very useful
> metadata that is carried in kafka headers that can serve as a useful
> dimension for aggregation and in turn bringing better insights.
>
> PR(#10730) introduced support
> for Kafka headers in InputFormats.
>
> We still need an input format to parse out the headers and translate those
> into relevant columns in Druid. Until that’s implemented, none of the
> information available in the Kafka message headers would be exposed. So
> first there is a need to implement an input format that can parse headers
> in any given format(provided we support the format) like we parse payloads
> today. Apart from headers there is also some useful information present in
> the key portion of the kafka record. We also need a way to expose the data
> present in the key as druid columns. We need a generic way to express at
> configuration time what attributes from headers, key and payload need to be
> ingested into druid. We need to keep the design generic enough so that
> users can specify different parsers for headers, key and payload.
>
> Proposal is to design an input format to solve the above by providing
> wrapper around any existing input formats and merging the data into a
> single unified Druid row.
> Proposed changes
>
> Let's look at a sample input format from the above discussion
>
>
> "inputFormat": {
>   "type": "kafka",                        // New input format type
>   "headerLabelPrefix": "kafka.header.",   // Label prefix for header columns,
>                                           // this will avoid collisions while merging columns
>   "recordTimestampLabelPrefix": "kafka.", // Kafka record's timestamp is made
>                                           // available in case payload does not carry timestamp
>   "headerFormat": {                       // Header parser specifying that values are of type string
>     "type": "string"
>   },
>   "valueFormat": {                        // Value parser from json parsing
>     "type": "json",
>     "flattenSpec": {
>       "useFieldDiscovery": true,
>       "fields": [...]
>     }
>   },
>   "keyFormat": {                          // Key parser also from json parsing
>     "type": "json"
>   }
> }
>
> Since we have independent sections for header, key and payload, it will
> also enable parsing each section with its own parser, eg., headers coming
> in as string and payload as json.
>
> KafkaInputFormat(the new inputFormat class) will be the uber class
> extending inputFormat interface and will be responsible for creating
> individual parsers for header, key and payload, blend the data resolving
> conflicts in columns and generating a single unified InputRow for Druid
> ingestion.
>
> "headerFormat" will allow users to plug in a parser type for the header
> values and will add the default header prefix as "kafka.header."(can be
> overridden) for attributes to avoid collision while merging attributes with
> payload.
>
> Kafka payload parser will be responsible for parsing the Value portion of
> the Kafka record. This is where most of the data will come from and we
> should be able to plugin existing parsers. One thing to note here is that
> if batching is performed, then the code should be augmenting header and 

Re: Get Druid Service details in runtime (via extension)

2021-08-23 Thread Gian Merlino
Ah, if you're wanting to get the roles of the node that your own code is
running on, try adding Set<NodeRole>. You might need the @Self annotation
also.
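Something along these lines is what I mean (a sketch only; the class name is
made up, and I'm going from memory on the exact binding, so treat the @Self
Set<NodeRole> injection as something to verify):

import com.google.inject.Inject;
import java.util.Set;
import org.apache.druid.discovery.NodeRole;
import org.apache.druid.guice.annotations.Self;

public class MyEmitterRoleAware
{
  private final Set<NodeRole> nodeRoles;

  @Inject
  public MyEmitterRoleAware(@Self Set<NodeRole> nodeRoles)
  {
    // Roles of the node this extension is running on, e.g. BROKER, HISTORICAL.
    this.nodeRoles = nodeRoles;
  }

  public boolean isBroker()
  {
    return nodeRoles.contains(NodeRole.BROKER);
  }
}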

On Mon, Aug 23, 2021 at 6:27 AM Jeet Patel  wrote:

> Hi Gian,
>
> Thank you for pointing that out. Although I'm getting following error when
> using DiscoveryDruidNode
>
> // Declared a variable in MyEmitter
> private final DiscoveryDruidNode discoveryDruidNode;
>
> // Added in Constructor
> public MyEmitter(
>   MyEmitterConfig myEmitterConfig,
>   ObjectMapper mapper,
>   DiscoveryDruidNode discoveryDruidNode
>   )
>   {
> this.mapper = mapper;
> this.myEmitterConfig = myEmitterConfig;
> this.whiteListMetricsMap =
> readMap(myEmitterConfig.getWhiteListMetricsMapPath());
> this.discoveryDruidNode = discoveryDruidNode;
> log.info("Constructed MyEmitter");
>   }
>
> // Added a log.info in emit() method just to check if it works
> log.info("NodeRole: " + discoveryDruidNode.getNodeRole().getJsonName());
>
> Below is the error I'm getting:
> 1) Could not find a suitable constructor in
> org.apache.druid.discovery.DiscoveryDruidNode. Classes must have either one
> (and only one) constructor annotated with @Inject
> or a zero-argument constructor that is not private.
>   at
> org.apache.druid.discovery.DiscoveryDruidNode.class(DiscoveryDruidNode.java:47)
>   while locating org.apache.druid.discovery.DiscoveryDruidNode
> for the 3rd parameter of
> com.custom.MyEmitterModule.getEmitter(MyEmitterModule.java:39)
>
> According to the error, it looks like I cannot add DiscoveryDruidNode
> because it does not have @Inject or a zero-argument constructor. But I'm
> able to add my MyEmitterConfig class, which does not have a zero-argument
> constructor.
>
> On 2021/08/22 23:40:08, Gian Merlino  wrote:
> > Does the "getNodeRole()" method on DiscoveryDruidNode do what you want?
> >
> > On Fri, Aug 20, 2021 at 3:07 PM Jeet Patel  wrote:
> >
> > > Hi all,
> > >
> > > Is there a way to know what druid services are running in a
> DruidNode
> > > (Not
> > > talking about the HTTP APIs)?
> > > I went through druid-server module, class
> > > DruidNodeDiscoveryProvider.getForNodeRole which accepts a NodeRole and
> > > returns a DruidNodeDiscovery instance after which we can use
> > > getAllNodes() method
> > > which returns Collection<DiscoveryDruidNode>. And for each item in the
> > > Collection we can use the getServiceName() method to get
> > > the service name.
> > >
> > > The question is, how can we get the instance of NodeRole running in the
> > > druid process. For example, if we have a host running broker service,
> is
> > > there a way to get NodeRole for broker process dynamically?
> > >
> > > For now I'm doing something like this. Adding all NodeRole in every
> host,
> > > since our extension runs in every host.:
> > >
> > > List<DruidNodeDiscovery> druidNodeDiscoveryList = ImmutableList.of(
> > >
>  druidNodeDiscoveryProvider.getForNodeRole(NodeRole.COORDINATOR),
> > > druidNodeDiscoveryProvider.getForNodeRole(NodeRole.OVERLORD),
> > > druidNodeDiscoveryProvider.getForNodeRole(NodeRole.HISTORICAL),
> > >
>  druidNodeDiscoveryProvider.getForNodeRole(NodeRole.MIDDLE_MANAGER),
> > > druidNodeDiscoveryProvider.getForNodeRole(NodeRole.INDEXER),
> > > druidNodeDiscoveryProvider.getForNodeRole(NodeRole.BROKER),
> > > druidNodeDiscoveryProvider.getForNodeRole(NodeRole.ROUTER)
> > > );
> > >
> > > I'm trying to build an extension. So this extension will run on every host
> > > in our druid cluster. After getting the service details we wanted to do
> > > some further processing from our side.
> > >
> > > Will really appreciate some pointers on this.
> > >
> > > Thank you :)
> > >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: Get Druid Service details in runtime (via extension)

2021-08-22 Thread Gian Merlino
Does the "getNodeRole()" method on DiscoveryDruidNode do what you want?

On Fri, Aug 20, 2021 at 3:07 PM Jeet Patel  wrote:

> Hi all,
>
> Is there a way to know what druid services are running in a DruidNode
> (Not
> talking about the HTTP APIs)?
> I went through druid-server module, class
> DruidNodeDiscoveryProvider.getForNodeRole which accepts a NodeRole and
> returns a DruidNodeDiscovery instance after which we can use
> getAllNodes() method
> which returns Collection<DiscoveryDruidNode>. And for each item in the
> Collection we can use the getServiceName() method to get
> the service name.
>
> The question is, how can we get the instance of NodeRole running in the
> druid process. For example, if we have a host running broker service, is
> there a way to get NodeRole for broker process dynamically?
>
> For now I'm doing something like this. Adding all NodeRole in every host,
> since our extension runs in every host.:
>
> List<DruidNodeDiscovery> druidNodeDiscoveryList = ImmutableList.of(
> druidNodeDiscoveryProvider.getForNodeRole(NodeRole.COORDINATOR),
> druidNodeDiscoveryProvider.getForNodeRole(NodeRole.OVERLORD),
> druidNodeDiscoveryProvider.getForNodeRole(NodeRole.HISTORICAL),
> druidNodeDiscoveryProvider.getForNodeRole(NodeRole.MIDDLE_MANAGER),
> druidNodeDiscoveryProvider.getForNodeRole(NodeRole.INDEXER),
> druidNodeDiscoveryProvider.getForNodeRole(NodeRole.BROKER),
> druidNodeDiscoveryProvider.getForNodeRole(NodeRole.ROUTER)
> );
>
> I'm trying to build an extension. So this extension will run on every host
> in our druid cluster. After getting the service details we wanted to do some
> further processing from our side.
>
> Will really appreciate some pointers on this.
>
> Thank you :)
>


Re: Apache Druid Project Structure

2021-08-18 Thread Gian Merlino
Hey Jeet,

It sounds useful, maybe something in the main README.md or in the docs at
https://druid.apache.org/docs/latest/development/overview.html. If you are
volunteering to contribute it then that sounds awesome. Otherwise, whoever
is reading this… know that we all think it's a good idea 

On Wed, Aug 18, 2021 at 12:03 AM Jeet Patel  wrote:

> Hi Gian,
>
> This was very helpful information.
> Do you think it's a good idea to create a README explaining the project
> structure at a high level? As you explained it, this might be very helpful
> information for newcomers who are looking to contribute to the project, and
> it would make them feel more confident knowing the project layout.
>
> Thank you,
> Jeet
>
> On 2021/08/17 17:12:33, Gian Merlino  wrote:
> > Hey Jeet,
> >
> > I think it is a case of "it seemed like a good idea at the time". Some
> > things about the current layout do work well: one is that there is
> actually
> > a lot of common query engine code between anything that handles queries.
> > That's historical, broker, peon, and indexer. That common query engine
> code
> > today is mostly in "core" and "processing". Another is that Druid SQL is
> > architected as a layer that sits atop the native query system, and it's
> all
> > cleanly separated into its own "sql" module. Outside of the query engine
> > code, there is a bunch of historical, broker, and coordinator specific
> > stuff in the "server" module that could be broken out into 3 separate
> > modules, but I suppose the appropriate cost/benefit hasn't been there for
> > someone to actually do that.
> >
> > On Mon, Aug 16, 2021 at 7:07 AM Jeet Patel  wrote:
> >
> > > Hello,
> > >
> > > A question about how druid directory structure came into existence.
> Druid
> > > has processes like historical, coordinator, overlord, broker, etc.
> > >
> > > We see that the current project root level directories are like
> > >
> > > druid
> > > |- indexing-service
> > > |- services
> > > |- sql
> > > |- core
> > > ...
> > > ...
> > >
> > > Can someone explain why this directory structure was formed instead of
> > > having something like following and place the code/modules related to
> the
> > > processes in their respective folders?
> > >
> > > druid
> > > |- historical
> > > |- broker
> > > |- coordinator
> > > |- extensions
> > > ...
> > > ...
> > >
> > > It would be great to know the background of this topic.
> > >
> > > Thank you :)
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > > For additional commands, e-mail: dev-h...@druid.apache.org
> > >
> > >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: Apache Druid Project Structure

2021-08-17 Thread Gian Merlino
Hey Jeet,

I think it is a case of "it seemed like a good idea at the time". Some
things about the current layout do work well: one is that there is actually
a lot of common query engine code between anything that handles queries.
That's historical, broker, peon, and indexer. That common query engine code
today is mostly in "core" and "processing". Another is that Druid SQL is
architected as a layer that sits atop the native query system, and it's all
cleanly separated into its own "sql" module. Outside of the query engine
code, there is a bunch of historical, broker, and coordinator specific
stuff in the "server" module that could be broken out into 3 separate
modules, but I suppose the appropriate cost/benefit hasn't been there for
someone to actually do that.

On Mon, Aug 16, 2021 at 7:07 AM Jeet Patel  wrote:

> Hello,
>
> A question about how druid directory structure came into existence. Druid
> has processes like historical, coordinator, overlord, broker, etc.
>
> We see that the current project root level directories are like
>
> druid
> |- indexing-service
> |- services
> |- sql
> |- core
> ...
> ...
>
> Can someone explain why this directory structure was formed instead of
> having something like following and place the code/modules related to the
> processes in their respective folders?
>
> druid
> |- historical
> |- broker
> |- coordinator
> |- extensions
> ...
> ...
>
> It would be great to know the background of this topic.
>
> Thank you :)
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: Question about merging groupby v2 spill files

2021-08-10 Thread Gian Merlino
Hey Will,

The sorting that happens on the data servers is really useful, because it
means the Broker can do its part of the query fully streaming instead of
buffering things up.

At one point we had a similar problem in ingestion (you could have a ton of
spill files if you had a lot of sketches) and ended up addressing that by
doing the merge hierarchically. Something similar should work in the
SpillingGrouper. Instead of opening everything all at once, we could open
things in chunks of N, merge those, and then proceed hierarchically. The
merge tree would have logarithmic height.

The ingestion version of this was:
https://github.com/apache/druid/pull/10689
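In the grouper, the shape of the idea is roughly the following (a sketch only,
with made-up names standing in for spill files and the k-way merge; not the
actual SpillingGrouper code): merge at most a fixed number of spill files at a
time, respill each intermediate result, and repeat, so the number of files open
at once stays bounded.

import java.util.ArrayList;
import java.util.List;

class HierarchicalMergeSketch
{
  static final int MAX_OPEN = 128;

  // Each String here stands in for a spill file; mergeChunk() would open,
  // merge-sort, and respill the given chunk, returning the new file.
  static String mergeAll(final List<String> spillFiles)
  {
    List<String> current = spillFiles;
    while (current.size() > 1) {
      final List<String> next = new ArrayList<>();
      for (int i = 0; i < current.size(); i += MAX_OPEN) {
        next.add(mergeChunk(current.subList(i, Math.min(i + MAX_OPEN, current.size()))));
      }
      current = next; // tree height is log base MAX_OPEN of the file count
    }
    return current.get(0);
  }

  static String mergeChunk(final List<String> chunk)
  {
    // Placeholder: open only the files in this chunk, k-way merge them,
    // and spill the merged result to a new file.
    return "merged(" + chunk.size() + ")";
  }
}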

On Mon, Aug 9, 2021 at 2:30 PM Will Lauer 
wrote:

> I recently submitted an issue about "Too many open files" in GroupBy v2 (
> https://github.com/apache/druid/issues/11558) and have been investigating
> a
> solution. It looked like the problem was happening because the code
> preemptively opened all the spill files for reading, which when there are a
> huge number of spill files (in our case, a single query is generating 110k
> spill files), causes the "too many open files" error when the files ulimit
> is set to an otherwise reasonable number. We can work around this for now
> by setting "ulimit -n" to a huge value (like 1 million), but I was hoping
> for a better solution.
>
> In https://github.com/apache/druid/pull/11559, I attempted to fix this by
> lazily opening files only when they were ready to be read and closing them
> immediately after they had finished being read. While this looks like it
> fixes the issue in some small edge cases, it isn't a general solution
> because many queries end up calling CloseableIterators.mergeSorted() to
> merge all the spill files together, which due to sorting necessitates
> reading all the files at once, causing the "too many files" error again. It
> looks like mergeSorted() is called because frequently the grouping code is
> assuming the results should be sorted and is calling
> ConcurrentGrouper.parallelSortAndGetGroupersIterator().
>
> My question is, can anyone think of a way to avoid the need for sorting at
> this level so as to avoid the need for opening all the spill files. Given
> how sketches work in druid right now, I don't see an easy way to reduce the
> number of spill files we are seeing, so I was hoping to address this on the
> grouper side, but right now I can't see a solution that makes this any
> better. We aren't blocked, because we can set the maximum number of files
> to a much larger number, but that is an unpalatable long term solution.
>
> Will
>
>
> 
>
> Will Lauer
>
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
>
> M 508 561 6427
> 1908 S. First St
> Champaign, IL 61822
>
>


Re: Interested in contributing an article to your site

2021-07-30 Thread Gian Merlino
Hi Angela,

There are a couple of places on the Druid website where we include content
from the community.

1) If Sisu Data uses Druid internally, or produces Druid-based products, it
would be appropriate to describe Sisu's usage of Druid on our Powered By
page: https://druid.apache.org/druid-powered

2) If Sisu Data offers support or services for Druid, it would be
appropriate to link to information about that on our Community page:
https://druid.apache.org/community/

3) If Sisu Data writes an informative article about Druid, and that article
isn't overly commercial in nature, it would be appropriate to link to it
from the Featured Content section on our home page:
https://druid.apache.org/. These links are typically up for a limited
amount of time.

The best way to suggest new content is to make a pull request for
https://github.com/apache/druid-website-src/blob/master/druid-powered.md,
https://github.com/apache/druid-website-src/blob/master/community/index.md,
or
https://github.com/apache/druid-website-src/blob/master/_data/featured.yml,
depending on which section is appropriate.

There is no charge for any of this. If the content is appropriate then we
will include or link to it, otherwise we won't.

On Fri, Jul 30, 2021 at 10:59 AM Angela S. 
wrote:

> Hi there,
>
> My name is Angela; I work on the writers' team at Sisu Data, an analytics
> company geared towards helping companies make decisions using data. I
> recently found this page: http://druid.apache.org/community/
> and was wondering if you are open to taking article submissions on
> druid.apache.org. We would love to contribute an informative post
> submission, if you're open to the idea.
>
> Let me know what your thoughts are, and if this is something you'd be
> interested in seeing what are the costs/guidelines?
>
> Thank you!
>
> Angela
> https://sisudata.com/
>


Re: ItemsSketch Aggregator in druid-datasketches extension

2021-07-23 Thread Gian Merlino
Btw, one additional comment on suggestion (2) above. It's also a hack, of
course, and it's more of a hack than (1) is. At least (1) is using the
proper off-heap storage almost all of the time, which means there is not
much risk of using excessive heap memory. Approach (2) has a higher risk of
using too much heap memory. The only advantage (2) has is that you don't
need a Direct version of the ItemsSketch for it to work.

On Fri, Jul 23, 2021 at 1:35 PM Gian Merlino  wrote:

> Hey Michael,
>
> Very cool!
>
> To answer your question: it is critical to have a BufferAggregator. Some
> context; there are 3 kinds of aggregators:
>
> - Aggregator: stores intermediate state on heap; is used during ingestion
> and by the non-vectorized timeseries query engine. Required, or else some
> queries won't work properly.
> - BufferAggregator: stores intermediate state off-heap; is used by the
> non-vectorized topN and groupBy engines. Required, or else some queries
> won't work properly.
> - VectorAggregator: stores intermediate state off-heap; is used by
> vectorized query engines. Optional, but if there are any aggregators in a
> query that don't have a VectorAggregator implementation, then the whole
> query will run non-vectorized, which will slow it down.
>
> The main reason that BufferAggregators exist is to support off-heap
> aggregation, which minimizes GC pressure and prevents out-of-memory errors
> during the aggregation process.
>
> > Assuming it is, we can begin talking with the datasketches team about
> the possibility of a Direct implementation.
>
> With ItemsSketch, the biggest roadblock you're likely to run into is the
> fact that the items may be of variable size. Currently in Druid each
> BufferAggregator (and VectorAggregator) must have a fixed-size amount of
> space allocated off heap. The way this is done is by returning the number
> of bytes you need from AggregatorFactory.getMaxIntermediateSize. It's all
> allocated upfront. It can depend on aggregator configuration (for example,
> the HLL aggregators need more space if they're configured to have higher
> accuracy) but it can't depend on the data that is encountered during the
> query. So if you do have variable size items, you get into quite a
> conundrum!
>
> In order to have a "proper" implementation of an off-heap variable-sized
> sketch, we'd need an aggregator API that allows them to allocate additional
> memory at runtime. The DataSketches folks have asked us for this before but
> we haven't built it yet. There's a couple ways you could solve it in the
> meantime:
>
> 1) Allocate a fixed-size amount of memory that is enough to store most
> data you reasonably expect to encounter, update that off-heap, and allocate
> additional memory on heap if needed, without Druid "knowing" about it. This
> is something that'd require a Direct version of the ItemsSketch. It's a
> technique used by the quantiles sketches too. It's a hack but it works. You
> can see it in action by looking at DirectUpdateDoublesSketch
> <https://github.com/apache/datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DirectUpdateDoublesSketch.java>
> in DataSketches (check out the spots where "mem_" is reassigned: it's
> allocating new on-heap memory that Druid doesn't know about) and
> DoublesSketchBuildBufferAggregator in Druid (check out the "sketches" map
> and "relocate" method in DoublesSketchBuildBufferAggregatorHelper: it's
> making sure to retain the references just in case one of them grew into the
> heap).
>
> 2) Allocate some arbitrary amount of memory, but don't use it, use a
> technique like the one in DoublesSketchBuildBufferAggregatorHelper to
> simply maintain references to on-heap sketches. This would work but you
> would run the risk of running out of heap memory. (There's nothing that
> explicitly controls how much heap you'll use.) To a degree you can mitigate
> this by allocating a larger amount of unused off-heap memory via
> AggregatorFactory.getMaxIntermediateSize, which will get Druid to flush
> aggregation state more often (it does this when it runs out of off-heap
> buffer space), which will sort of limit how many sketches are going to be
> in flight at once. It's quite indirect but it's the best I can think of.
>
> > I am also thinking of finishing the implementation by explicitly
> serializing the entire sketch on each update, but this would only be for
> experimentation as I doubt this is the intended behavior for
> implementations of BufferedAggregator.
>
> You're right, that's not the intended sort of implementation. It works but
> it's usually too slow to be practical. (It incurs a full serialization
> roundtrip for every row.)
&

Re: ItemsSketch Aggregator in druid-datasketches extension

2021-07-23 Thread Gian Merlino
Hey Michael,

Very cool!

To answer your question: it is critical to have a BufferAggregator. Some
context; there are 3 kinds of aggregators:

- Aggregator: stores intermediate state on heap; is used during ingestion
and by the non-vectorized timeseries query engine. Required, or else some
queries won't work properly.
- BufferAggregator: stores intermediate state off-heap; is used by the
non-vectorized topN and groupBy engines. Required, or else some queries
won't work properly.
- VectorAggregator: stores intermediate state off-heap; is used by
vectorized query engines. Optional, but if there are any aggregators in a
query that don't have a VectorAggregator implementation, then the whole
query will run non-vectorized, which will slow it down.

The main reason that BufferAggregators exist is to support off-heap
aggregation, which minimizes GC pressure and prevents out-of-memory errors
during the aggregation process.
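For context, here's the shape of the contract a BufferAggregator works under,
using a trivial long-sum with simplified method signatures (this is only an
illustration of the fixed-position ByteBuffer pattern, not the sketch
aggregator itself; in the real interface the input value comes from a selector
rather than a parameter):

import java.nio.ByteBuffer;

class LongSumBufferAggregatorSketch
{
  // The matching AggregatorFactory.getMaxIntermediateSize() would return
  // Long.BYTES: all state lives in a fixed-size slot at 'position'.
  void init(final ByteBuffer buf, final int position)
  {
    buf.putLong(position, 0L);
  }

  void aggregate(final ByteBuffer buf, final int position, final long value)
  {
    buf.putLong(position, buf.getLong(position) + value);
  }

  Object get(final ByteBuffer buf, final int position)
  {
    return buf.getLong(position);
  }
}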

> Assuming it is, we can begin talking with the datasketches team about the
possibility of a Direct implementation.

With ItemsSketch, the biggest roadblock you're likely to run into is the
fact that the items may be of variable size. Currently in Druid each
BufferAggregator (and VectorAggregator) must have a fixed-size amount of
space allocated off heap. The way this is done is by returning the number
of bytes you need from AggregatorFactory.getMaxIntermediateSize. It's all
allocated upfront. It can depend on aggregator configuration (for example,
the HLL aggregators need more space if they're configured to have higher
accuracy) but it can't depend on the data that is encountered during the
query. So if you do have variable size items, you get into quite a
conundrum!

In order to have a "proper" implementation of an off-heap variable-sized
sketch, we'd need an aggregator API that allows them to allocate additional
memory at runtime. The DataSketches folks have asked us for this before but
we haven't built it yet. There's a couple ways you could solve it in the
meantime:

1) Allocate a fixed-size amount of memory that is enough to store most data
you reasonably expect to encounter, update that off-heap, and allocate
additional memory on heap if needed, without Druid "knowing" about it. This
is something that'd require a Direct version of the ItemsSketch. It's a
technique used by the quantiles sketches too. It's a hack but it works. You
can see it in action by looking at DirectUpdateDoublesSketch
<https://github.com/apache/datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DirectUpdateDoublesSketch.java>
in DataSketches (check out the spots where "mem_" is reassigned: it's
allocating new on-heap memory that Druid doesn't know about) and
DoublesSketchBuildBufferAggregator in Druid (check out the "sketches" map
and "relocate" method in DoublesSketchBuildBufferAggregatorHelper: it's
making sure to retain the references just in case one of them grew into the
heap).

2) Allocate some arbitrary amount of memory, but don't use it, use a
technique like the one in DoublesSketchBuildBufferAggregatorHelper to
simply maintain references to on-heap sketches. This would work but you
would run the risk of running out of heap memory. (There's nothing that
explicitly controls how much heap you'll use.) To a degree you can mitigate
this by allocating a larger amount of unused off-heap memory via
AggregatorFactory.getMaxIntermediateSize, which will get Druid to flush
aggregation state more often (it does this when it runs out of off-heap
buffer space), which will sort of limit how many sketches are going to be
in flight at once. It's quite indirect but it's the best I can think of.
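The bookkeeping that both of these approaches lean on looks roughly like this
(illustrative only; the names are made up, and the real-world reference is the
"sketches" map and "relocate" method mentioned above): keep on-heap references
keyed by buffer position so sketches that outgrew their off-heap slot aren't
garbage collected, and move the reference whenever the grouper relocates state.

import java.util.HashMap;
import java.util.Map;

class OnHeapSketchReferences<T>
{
  private final Map<Integer, T> sketches = new HashMap<>();

  void put(final int position, final T sketch)
  {
    sketches.put(position, sketch);
  }

  T get(final int position)
  {
    return sketches.get(position);
  }

  // Called when aggregation state is moved to a new buffer position, so the
  // on-heap reference follows the off-heap slot it belongs to.
  void relocate(final int oldPosition, final int newPosition)
  {
    sketches.put(newPosition, sketches.remove(oldPosition));
  }
}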

> I am also thinking of finishing the implementation by explicitly
serializing the entire sketch on each update, but this would only be for
experimentation as I doubt this is the intended behavior for
implementations of BufferedAggregator.

You're right, that's not the intended sort of implementation. It works but
it's usually too slow to be practical. (It incurs a full serialization
roundtrip for every row.)

On Fri, Jul 23, 2021 at 12:18 PM Michael Schiff 
wrote:

> I am looking into implementing a new Aggregator in the datasketches
> extension using the ItemSketch in the frequencies package:
>
> https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html
>
> https://github.com/apache/datasketches-java/tree/master/src/main/java/org/apache/datasketches/frequencies
>
> I've started on a partial implementation here (still a WIP, lots of TODOs):
>
> https://github.com/apache/druid/compare/master...michaelschiff:fis-aggregator?expand=1
>
> From everything I've seen, it's critical that there is an efficient
> implementation of BufferAggregator. The existing aggregators take advantage
> of other sketch types providing "Direct" implementations that are
> implemented directly against a ByteBuffer.  This leads to fairly
> transparent implementation of BufferAggregator.  ItemSketch is able to
> serialize 

Re: druid can't parse string

2021-07-16 Thread Gian Merlino
Including the original poster in case they are not on the dev list
themselves (hello!).

On Fri, Jul 16, 2021 at 9:44 AM Gian Merlino  wrote:

> Druid stores strings as UTF-8 and from a storage and query basis, it
> should work fine with any language. The
> "wikiticker-2015-09-12-sampled.json.gz" dataset used for the tutorial has
> strings in a variety of languages (check the "page" field):
> https://druid.apache.org/docs/latest/tutorials/index.html
>
> So I wonder if there is an encoding problem with reading your input data?
> If it's in a text format, it should be encoded as UTF-8 for Druid to be
> able to read it properly.
>
>
> On Fri, Jul 16, 2021 at 7:51 AM Y H  wrote:
>
>> Hi, I am using Druid to develop an analytics web app,
>> and I found that Druid can't parse languages other than English.
>>
>> [image: image.png]
>>
>> Is there an option for UTF-8, or a way to parse these strings correctly?
>>
>> I attached my Druid environment file;
>> please let me know how to parse strings correctly in Druid.
>>
>> Thanks.
>>
>>
>>
>> environment
>> ___
>> DRUID_XMS=1g
>> DRUID_MAXNEWSIZE=250m
>> DRUID_NEWSIZE=250m
>> DRUID_MAXDIRECTMEMORYSIZE=6172m
>>
>> druid_emitter_logging_logLevel=debug
>>
>> druid_extensions_loadList=["druid-stats","druid-histogram",
>> "druid-datasketches", "druid-lookups-cached-global",
>> "postgresql-metadata-storage", "druid-kafka-indexing-service",
>> "druid-kafka-extraction-namespace"]
>>
>> druid_zk_service_host=zookeeper
>>
>> # kafka config
>> listeners=PLAINTEXT://211.253.8.155:59092
>>
>>
>> # druid_metadata_storage_host=
>> druid_metadata_storage_type=postgresql
>>
>> druid_metadata_storage_connector_connectURI=jdbc:postgresql://postgres:5432/druid
>> druid_metadata_storage_connector_user=druid
>> druid_metadata_storage_connector_password=FoolishPassword
>>
>> druid_coordinator_balancer_strategy=cachingCost
>>
>> druid_indexer_runner_javaOptsArray=["-server", "-Xmx1g", "-Xms1g",
>> "-XX:MaxDirectMemorySize=3g", "-Duser.timezone=UTC",
>> "-Dfile.encoding=UTF-8",
>> "-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]
>> druid_indexer_fork_property_druid_processing_buffer_sizeBytes=268435456
>>
>> druid_storage_type=local
>> druid_storage_storageDirectory=/opt/data/segments
>> druid_indexer_logs_type=file
>> druid_indexer_logs_directory=/opt/data/indexing-logs
>>
>> druid_processing_numThreads=2
>> druid_processing_numMergeBuffers=2
>>
>>
>> DRUID_LOG4J=<?xml version="1.0" encoding="UTF-8" ?><Configuration status="WARN"><Appenders><Console name="Console" target="SYSTEM_OUT"><PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/></Console></Appenders><Loggers><Root level="info"><AppenderRef ref="Console"/></Root><Logger name="org.apache.druid.jetty.RequestLog" additivity="false" level="DEBUG"><AppenderRef ref="Console"/></Logger></Loggers></Configuration>
>>
>>


Re: druid can't parse string

2021-07-16 Thread Gian Merlino
Druid stores strings as UTF-8 and from a storage and query basis, it should
work fine with any language. The "wikiticker-2015-09-12-sampled.json.gz"
dataset used for the tutorial has strings in a variety of languages (check
the "page" field): https://druid.apache.org/docs/latest/tutorials/index.html

So I wonder if there is an encoding problem with reading your input data?
If it's in a text format, it should be encoded as UTF-8 for Druid to be
able to read it properly.
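If the file was written with some other charset, re-encoding it before
ingestion should be enough. A minimal sketch (the file names and the EUC-KR
source charset here are just assumptions for the example; substitute whatever
charset your data is actually in):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ReencodeToUtf8Sketch
{
  public static void main(String[] args) throws Exception
  {
    final Path in = Paths.get("input-euc-kr.json");   // hypothetical input file
    final Path out = Paths.get("input-utf8.json");    // what you'd hand to Druid

    // Decode with the charset the file was actually written in, then write UTF-8.
    final String text = new String(Files.readAllBytes(in), Charset.forName("EUC-KR"));
    Files.write(out, text.getBytes(StandardCharsets.UTF_8));
  }
}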


On Fri, Jul 16, 2021 at 7:51 AM Y H  wrote:

> Hi, I am using Druid to develop an analytics web app,
> and I found that Druid can't parse languages other than English.
>
> [image: image.png]
>
> Is there an option for UTF-8, or a way to parse these strings correctly?
>
> I attached my Druid environment file;
> please let me know how to parse strings correctly in Druid.
>
> Thanks.
>
>
>
> environment
> ___
> DRUID_XMS=1g
> DRUID_MAXNEWSIZE=250m
> DRUID_NEWSIZE=250m
> DRUID_MAXDIRECTMEMORYSIZE=6172m
>
> druid_emitter_logging_logLevel=debug
>
> druid_extensions_loadList=["druid-stats","druid-histogram",
> "druid-datasketches", "druid-lookups-cached-global",
> "postgresql-metadata-storage", "druid-kafka-indexing-service",
> "druid-kafka-extraction-namespace"]
>
> druid_zk_service_host=zookeeper
>
> # kafka config
> listeners=PLAINTEXT://211.253.8.155:59092
>
>
> # druid_metadata_storage_host=
> druid_metadata_storage_type=postgresql
>
> druid_metadata_storage_connector_connectURI=jdbc:postgresql://postgres:5432/druid
> druid_metadata_storage_connector_user=druid
> druid_metadata_storage_connector_password=FoolishPassword
>
> druid_coordinator_balancer_strategy=cachingCost
>
> druid_indexer_runner_javaOptsArray=["-server", "-Xmx1g", "-Xms1g",
> "-XX:MaxDirectMemorySize=3g", "-Duser.timezone=UTC",
> "-Dfile.encoding=UTF-8",
> "-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]
> druid_indexer_fork_property_druid_processing_buffer_sizeBytes=268435456
>
> druid_storage_type=local
> druid_storage_storageDirectory=/opt/data/segments
> druid_indexer_logs_type=file
> druid_indexer_logs_directory=/opt/data/indexing-logs
>
> druid_processing_numThreads=2
> druid_processing_numMergeBuffers=2
>
>
> DRUID_LOG4J=<?xml version="1.0" encoding="UTF-8" ?><Configuration status="WARN"><Appenders><Console name="Console" target="SYSTEM_OUT"><PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/></Console></Appenders><Loggers><Root level="info"><AppenderRef ref="Console"/></Root><Logger name="org.apache.druid.jetty.RequestLog" additivity="false" level="DEBUG"><AppenderRef ref="Console"/></Logger></Loggers></Configuration>
>
>


Re: A question about a potential bug in Druid Joins

2021-06-24 Thread Gian Merlino
Hey Jason,

I suppose you're talking about this patch:
https://github.com/apache/druid/pull/10942

1) The patch applies to a situation where the left-hand column of a join
condition is long-typed, and the right-hand side is string-typed. I suspect
the DIM CTE is getting run as a Scan subquery, and the type information is
getting lost here:
https://github.com/apache/druid/blob/druid-0.21.1/processing/src/main/java/org/apache/druid/query/scan/ScanQueryQueryToolChest.java#L177-L178.
That means that when we build an index on the subquery, we'll do it using
string keys, since strings are used when the actual type is unknown. The
code also suggests a possible workaround. If you look a few lines up, you
can see that type info is only present for virtual columns. So you could
try doing "SELECT api_client_id + 0" instead of "SELECT api_client_id",
which should cause a virtual column to be generated and thereby preserve
the type info. I haven't tested this, though, so let us know if it works. And
like the comment says: in the future we'd like to be able to fill
in the real type info. [* see note below for details]

2) I'm not sure when the next Druid release will be published. There's
usually one release every few months. The next release from master will
include this patch. That'll likely be the very next release, unless a patch
release is needed for some reason. (Patch releases are usually made from
the prior release branch instead of from master.)

[*] Some details on why we don't fill in type info here today. The
challenge is that the native query toolchests don't currently know the data
types that will actually be encountered. Usually that's fine, because we
only need to know the *result* types, and most native query outputs are
strongly typed. For example, the groupBy toolchest is able to introspect
all of its dimensions and aggregators to generate a fully typed result
signature. This isn't possible for plain Scan queries though. All it has is
the names of the input columns. So, we'd need to carry the type info
through from somewhere else; perhaps from the SQL layer.


On Thu, Jun 24, 2021 at 1:27 PM Jason Chen 
wrote:

> Hello, Druid community,
>
> Ben Krug from Imply points me to this mail list for my question about
> Druid Joins. We have a following Druid Join query that may trigger a bug in
> Druid:
> > WITH DIM AS (
> >   SELECT api_client_id, title
> >   FROM inline_dimension_api_clients_1 AS API_CLIENTS
> > ),
> > FACTS AS (
> >   SELECT api_client_id, COUNT(*) as api_client_count
> >   FROM inline_data AS ORDERS
> >   WHERE ORDERS.__time >= TIMESTAMP '2021-06-10 00:00:00' AND
> ORDERS.__time < TIMESTAMP '2021-06-18 00:00:00' AND ORDERS.shop_id =
> 25248974
> >   GROUP BY 1
> > )
> > SELECT DIM.title, FACTS.api_client_id, FACTS.api_client_count
> > FROM FACTS
> > LEFT JOIN DIM ON FACTS.api_client_id = DIM.api_client_id
>
> So the “api_client_id” field is `long` type in both
> “inline_data” and “inline_dimension_api_clients_1” datasources. However,
> when doing a join, the makeLongProcessor method will be called, and
> throw an “UnsupportedOperationException" because "index.keyType()" is
> string in MapIndex.
>
> Then I found Gian Merlino has a PR to fix the issue. I have validated that
> this fix works for our case in my local Druid cluster. The fix is not
> included in Druid v0.21.1.
>
> I have the following questions:
>
> 1. Why is the index key type `string` rather than `long` for my subquery?
> Is it implicitly transformed to `string` type for performance benefit?
> 2. When will you publish a new Druid release? Will the fix be part of the
> next release?
>
>
> Thank you
> Jason Chen
>
>
>
> Jason (Jianbin) Chen
> Senior Data Developer
> p: +1 2066608351 | e: jason.c...@shopify.com
> a: 234 Laurier Ave W Ottawa, ON K1N 5X8
>


Re: Enabling dependabot in our github repository

2021-06-08 Thread Gian Merlino
Here's a running list of PRs opened by the dependabot:
https://github.com/apache/druid/pulls?q=is%3Apr+author%3Aapp%2Fdependabot

On Mon, Jun 7, 2021 at 12:22 PM Gian Merlino  wrote:

> There's been some extra discussion this PR:
> https://github.com/apache/druid/pull/11079
>
> I just +1'ed it, but I wanted to come back here to say that IMO, we should
> avoid getting in the habit of blindly applying these updates without
> testing. There have been lots of situations in the past where a
> harmless-looking dependency upgrade broke something. Sometimes the new
> dependency version had a regression in it, and sometimes even without
> regressions it can introduce compatibility problems.
>
> So, I think it'd be good to apply the updates when we're confident in our
> ability to test them, and add ignores (or tests!) for the rest.
>
> On Thu, Apr 8, 2021 at 12:35 PM Xavier Léauté 
> wrote:
>
>> Thanks Maytas, I asked in that thread. They seemed concerned about write
>> access requested by dependabot,
>> but that should no longer be required as far as I can tell, now that it is
>> natively integrated into GitHub.
>> It should only be a matter of adding the config file to the repo, similar
>> to what we do to automate closing stale issues / PR.
>>
>> On Tue, Apr 6, 2021 at 2:50 PM Maytas Monsereenusorn 
>> wrote:
>>
>> > I remember seeing someone asking about Dependabot in the asfinfra slack
>> channel
>> > a few weeks ago. However, asfinfra said they cannot allow it.
>> > Here is the link:
>> > https://the-asf.slack.com/archives/CBX4TSBQ8/p1616539376210800
>> > I think this is the same as Github's dependabot.
>> >
>> > Best Regards,
>> > Maytas
>> >
>> >
>> > On Tue, Apr 6, 2021 at 2:37 PM Xavier Léauté  wrote:
>> >
>> > > Hi folks, as you know Druid has a lot of dependencies, and keeping up
>> > with
>> > > the latest versions of everything, whether it relates to fixing CVEs
>> or
>> > > other improvements is a lot of manual work.
>> > >
>> > > I suggest we enable Github's dependabot in our repository to keep our
>> > > dependencies up to date. The bot is also helpful in providing a short
>> > > commit log summary to understand changes.
>> > > This might yield a flurry of PRs initially, but we can configure it to
>> > > exclude libraries or version ranges that we know are unsafe for us to
>> > > upgrade to.
>> > >
>> > > It looks like some other ASF repos have this enabled already (see
>> > > https://github.com/apache/commons-imaging/pull/126), so hopefully
>> this
>> > > only
>> > > requires filing an INFRA ticket.
>> > >
>> > > Happy to take care of it if folks are on board.
>> > >
>> > > Thanks!
>> > > Xavier
>> > >
>> >
>>
>


Re: Enabling dependabot in our github repository

2021-06-07 Thread Gian Merlino
There's been some extra discussion this PR:
https://github.com/apache/druid/pull/11079

I just +1'ed it, but I wanted to come back here to say that IMO, we should
avoid getting in the habit of blindly applying these updates without
testing. There have been lots of situations in the past where a
harmless-looking dependency upgrade broke something. Sometimes the new
dependency version had a regression in it, and sometimes even without
regressions it can introduce compatibility problems.

So, I think it'd be good to apply the updates when we're confident in our
ability to test them, and add ignores (or tests!) for the rest.

On Thu, Apr 8, 2021 at 12:35 PM Xavier Léauté 
wrote:

> Thanks Maytas, I asked in that thread. They seemed concerned about write
> access requested by dependabot,
> but that should no longer be required as far as I can tell, now that it is
> natively integrated into GitHub.
> It should only be a matter of adding the config file to the repo, similar
> to what we do to automate closing stale issues / PR.
>
> On Tue, Apr 6, 2021 at 2:50 PM Maytas Monsereenusorn 
> wrote:
>
> > I remember seeing someone asking about Dependabot in the asfinfra slack
> channel
> > a few weeks ago. However, asfinfra said they cannot allow it.
> > Here is the link:
> > https://the-asf.slack.com/archives/CBX4TSBQ8/p1616539376210800
> > I think this is the same as Github's dependabot.
> >
> > Best Regards,
> > Maytas
> >
> >
> > On Tue, Apr 6, 2021 at 2:37 PM Xavier Léauté  wrote:
> >
> > > Hi folks, as you know Druid has a lot of dependencies, and keeping up
> > with
> > > the latest versions of everything, whether it relates to fixing CVEs or
> > > other improvements is a lot of manual work.
> > >
> > > I suggest we enable Github's dependabot in our repository to keep our
> > > dependencies up to date. The bot is also helpful in providing a short
> > > commit log summary to understand changes.
> > > This might yield a flurry of PRs initially, but we can configure it to
> > > exclude libraries or version ranges that we know are unsafe for us to
> > > upgrade to.
> > >
> > > It looks like some other ASF repos have this enabled already (see
> > > https://github.com/apache/commons-imaging/pull/126), so hopefully this
> > > only
> > > requires filing an INFRA ticket.
> > >
> > > Happy to take care of it if folks are on board.
> > >
> > > Thanks!
> > > Xavier
> > >
> >
>


Re: FlattenSpec for Nested Data With Unknown Array Length

2021-05-20 Thread Gian Merlino
Hey Evan,

Druid's data model doesn't currently have a good way of storing arrays of
objects like this. And you're right that even though joins exist, to get
peak performance you want to avoid them at query time.

In similar situations I have stored data models like this as 3 tables
(entries, comments, reactions) and used 3 techniques to avoid the need for
joins at query time:

1) Store aggregate information about comments and reactions in the entries
table: number of comments, number of each type of reaction, etc. That way,
no join is necessary if you just want to — for example — see the average
number of comments for certain entries. You can do something like "select
avg(num_comments) from entries".

2) Store attributes about the entries in the comments and reactions table.
That way, no join is necessary if you want to find all comments that match
entries with specific attributes. For example, if you want to get the
number of users that commented on a particular user's entry, you'd do
"select count(distinct comment_username) from comments where entry_username
= 'alice'".

3) Mash up visualizations sourced from different tables in your
presentation layer. The idea is that if all tables have entry attributes
materialized in them, then you can build a dashboard that has one viz based
on comments, one based on entries, etc, each sourced with a different query
that queries just one table. Then, when the user filters on, e.g.,
"entry_country", you can apply that filter to all of the individual queries.

Hope these techniques help in your case too.

On Wed, May 5, 2021 at 9:37 PM Evan Galpin  wrote:

> Hi Druid devs!
>
> I’m investigating Druid for an analytical workload and I think it would be
> a great fit for the data and use case I have. One thing I’m stuck on right
> now is data modelling.
>
> I’ll use a somewhat classic “Blog Post” example to illustrate. Let’s assume
> a Blog Entry may have many associated “comments” (in unknown quantity), and
> many unknown “reactions” (quantity also unknown).
>
> What is the best way to model this? The example flattenSpec’s that I’ve
> seen showing array handling seem to indicate that the size of the array
> must be known and constant. Does that then rule out the possibility of
> modelling the above Blog Entry as a singular row for peak performance?
>
> One natural way to model the above with an RDBMS would be a table for each
> of Blog Entries, Comments, and Reactions, then performing joins as needed.
> But am I correct in assuming that joins ought to be avoided?
>
> Thanks in advance,
> Evan
>


Re: Push-down of operations for SystemSchema tables

2021-05-19 Thread Gian Merlino
Hey Frank,

These notes are really interesting. Thanks for writing them down.

I agree that the three things you laid out are all important. With regard
to SQL clauses from the web console, I did notice one recent change went in
that changed the SQL clauses to only query sys.segments for columns that
are actually visible, part of https://github.com/apache/druid/pull/10909.
That isn't very useful right now, since there isn't projection pushdown.
But if we add it, this will limit JSON serialization to only the fields
that are actually requested, which will be useful if not all of them are
requested by default. Switching to use OFFSET / LIMIT for tasks too would
also be good (or even just LIMIT would be a good start).
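
For illustration, server-side pagination against sys.tasks could look
something like this (just a sketch of the idea, not what the console issues
today):

  -- page 3, 100 rows per page, newest tasks first
  SELECT task_id, type, datasource, created_time, status
  FROM sys.tasks
  ORDER BY created_time DESC
  LIMIT 100
  OFFSET 200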

Out of curiosity how many tasks do you typically have in your sys.tasks
table?

Side note: I'm not sure if you looked into
druid.indexer.storage.recentlyFinishedThreshold, but that might be useful
as a workaround for you until some of these changes are made. You can set
it lower and it will reduce the number of complete tasks that the APIs
return.
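
For example, something like this in the Overlord runtime properties (the
default is PT24H, if I remember correctly) would keep only the last hour of
completed tasks visible through the task APIs:

  druid.indexer.storage.recentlyFinishedThreshold=PT1H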

On Tue, May 18, 2021 at 8:13 AM Chen Frank 
wrote:

> Hi Jason
>
> I have tracked this problem for quite a while. Since you are interested in
> it, I would like to share some things I know with you so that you can take
> them into consideration.
>
> In 0.19.0, there was a PR #9883 improving the performance of segments
> query by eliminating the JSON serialization.
> But PR #10752, merged in 0.21.0, brought back JSON serialization. I do not
> know whether this change reverts the performance gain from the previous PR.
>
> For tasks, the performance is much worse. There are some problems reported
> about task UI, e.g. #11042 and #11140. But I do not see any feedback on
> segment UI.
> One reason is that the web-console fetches ALL task records from broker
> and does pagination at client side instead of using a LIMIT clause in SQL
> to do pagination at server side.
> Another reason is that broker fetches ALL tasks via REST API from overlord
> that loads records from metadata storage directly and deserializes data
> from `pay_load` field.
>
> For segments, on the other hand, the two problems above do not exist because
>
> 1. LIMIT clause is used in SQL queries
>
> 2. the segments query returns a snapshot of in-memory segment data, which
> means there is no query against the metadata database and no JSON
> deserialization of the `pay_load` field.
>
> In 0.20, OFFSET is supported for SQL queries. I think this could also be
> used in the queries from the web console, which would bring some performance
> gain.
>
> IMO, to improve the performance, we might need to make changes to
>
> 1. the SQL layer you mentioned above
>
> 2. the SQL clauses from web console
>
> 3. the task REST API to support search conditions and ordering to
> narrow down the search range on the metadata table
>
> Thanks.
>
> From: Jason Koch 
> Date: Saturday, May 15, 2021, 3:51 AM
> To: dev@druid.apache.org 
> Subject: Re: Push-down of operations for SystemSchema tables
> @Julian - thank you for review & confirming.
>
> Hi Clint
>
> Thank you, I appreciate the response. I have responded Inline, some
> q's, I've also written in my words as a confirmation that I understand
> ...
>
> > In the mid term, I think that some of us have been thinking that moving
> > system tables into the Druid native query engine is the way to go, and
> have
> > been working on resolving a number of hurdles that are required to make
> > this happen. One of the main motivators to do this is so that we have
> just
> > the Druid query path in the planner in the Calcite layer, and deprecating
> > and eventually dropping the "bindable" path completely, described in
> > https://github.com/apache/druid/issues/9896. System tables would be
> pushed
> > into Druid Datasource implementations, and queries would be handled in
> the
> > native engine. Gian has even made a prototype of what this might look
> like,
> >
> https://github.com/apache/druid/compare/master...gianm:sql-sys-table-native
> > since much of the ground work is now in place, though it takes a
> hard-line
> > approach of completely removing bindable instead of hiding it behind a
> > flag, and doesn't implement all of the system tables yet, at least last
> > time I looked at it.
>
> Looking over the changes it seems that:
> - a new VirtualDataSource is introduced, which the Druid non-sql
> processing engine can process, that can wrap an Iterable. This exposes
> lazy segment & iterable using  InlineDataSource.
> - the SegmentsTable has been converted from a ScannableTable to a
> DruidTable, and a ScannableTableIterator is introduced to generate an
> iterable containing the rows; the new VirtualDataSource can be used to
> access the rows of this table.
> - finally, the Bindable convention is discarded from DruidPlanner and
> Rules.
>
> > I think there are a couple of remaining parts to resolve that would make
> > this feasible. The first is native scan queries need support for ordering
> > by 

Re: Push-down of operations for SystemSchema tables

2021-05-19 Thread Gian Merlino
Hey Jason,

It sounds like we have two different, but related goals:

1) Your goal is to improve the performance of system tables.

2) My goal with the branch Clint linked is to enable using Druid's native
query engine for system tables, in order to achieve consistency in how SQL
queries are executed and also so all of Druid's special functions,
aggregations, extensions, etc, are available to use in system tables.

Two notes about what I'm trying to do, btw, in response to things you
raised. First, I'm interested in using Druid's native query engine under
the hood, but I'm not necessarily driving towards being able to actually
query system tables using native Druid queries. It still achieves my goals
if these tables are only available in SQL queries. Second, you're correct
that for this to work for queries with 'order by' but no 'group by', we'd
need to add ordering support to the Scan query.

That's actually the main reason I stopped working on this branch: I started
looking into Scan ordering instead. Then I got distracted with other stuff,
and now I'm working on neither of those things. Anyway, I think it'll be
somewhat involved to implement Scan ordering in a scalable way for any
possible query on any possible datasource, but if we're focused on sys
tables, we can take a shortcut that is less-scalable. It wouldn't be tough
to make something that works for anything that works today, since the
bindable convention we use today simply does the sort in memory (see
org.apache.calcite.interpreter.SortNode). That might be a good way to
unblock the sys-table-via-native-engine work. We'd just need some safeguard
to prevent that code from executing on real datasources that are too big to
materialize. Perhaps a row limit, or perhaps enabling in-memory ordering
using an undocumented Scan context parameter set by the SQL layer only for
sys tables.

> I am interested. For my current work, I do want to keep focus on the
> sys.* performance work. If there's a way to do it and lay the
> groundwork or even get all the work done, then I am 100% for that.
>  Looking at what you want to do to convert these sys.* to native
> tables, if we have a viable solution or are comfortable with my
> suggestions above I'd be happy to build it out.

I think the plan you laid out makes sense, especially part (3). Using a
VirtualDruidTable that 'represents' a system table, but hasn't yet actually
built an Iterable, would allow us to defer creation of the
VirtualDataSource until later in the planning process. That'll enable more
kinds of pushdown via rules, like you mentioned. Coupling that with an
in-memory sort for the Scan query — appropriately guarded — I think would
get us all the way there. Later on, as a separate project, we'll want to
extend the Scan ordering to scale beyond what can be done in memory, but
that wouldn't be necessary for the system tables.

Btw, I would suggest not removing the bindable stuff completely. I did that
in my branch, but I regret it, since I think the change is risky enough
that it should be an option that is opt-in for a while. We could rip out
the old stuff after a few releases, once we're happy with the stability of
the new mechanism.

What do you think?

On Fri, May 14, 2021 at 2:51 PM Jason Koch 
wrote:

> @Julian - thank you for review & confirming.
>
> Hi Clint
>
> Thank you, I appreciate the response. I have responded Inline, some
> q's, I've also written in my words as a confirmation that I understand
> ...
>
> > In the mid term, I think that some of us have been thinking that moving
> > system tables into the Druid native query engine is the way to go, and
> have
> > been working on resolving a number of hurdles that are required to make
> > this happen. One of the main motivators to do this is so that we have
> just
> > the Druid query path in the planner in the Calcite layer, and deprecating
> > and eventually dropping the "bindable" path completely, described in
> > https://github.com/apache/druid/issues/9896. System tables would be
> pushed
> > into Druid Datasource implementations, and queries would be handled in
> the
> > native engine. Gian has even made a prototype of what this might look
> like,
> >
> https://github.com/apache/druid/compare/master...gianm:sql-sys-table-native
> > since much of the ground work is now in place, though it takes a
> hard-line
> > approach of completely removing bindable instead of hiding it behind a
> > flag, and doesn't implement all of the system tables yet, at least last
> > time I looked at it.
>
> Looking over the changes it seems that:
> - a new VirtualDataSource is introduced, which the Druid non-sql
> processing engine can process, that can wrap an Iterable. This exposes
> lazy segment & iterable using  InlineDataSource.
> - the SegmentsTable has been converted from a ScannableTable to a
> DruidTable, and a ScannableTableIterator is introduced to generate an
> iterable containing the rows; the new VirtualDataSource can be used to
> access the rows 

Re: Adding support to Kafka events keys

2021-04-21 Thread Gian Merlino
Hey Noam,

I think this would certainly be useful, and thank you for your interest in
contributing!

I think the toughest part will be designing a good API (meaning: what would
users specify in the kafka supervisor json spec in order to activate and
configure this feature?). So a good way to proceed would be to propose some
API, gather some community feedback on the design of the API, and then
start working on a patch.

Some thoughts on API design:

1) https://github.com/apache/druid/pull/10730 adds some related
functionality that you would want to hook into. This patch added Java APIs
that can be used in extensions, but didn't add any JSON APIs that can be
used by regular users. But you could build some JSON APIs on top of this.

2) Some keys are "formatted" (like the examples you gave: json and
delimited). Formatted keys should be parsed and fields extracted from them
somehow, using their own InputFormat. Maybe we should call it the
"keyInputFormat". We need to figure out what semantics make the most sense
for presenting the parsed key to later stages of the system (which expect a
single namespace). Merging the parsed key map with the parsed value map
seems like a bad idea, since there might be field name collisions. So maybe
we should prefix them with some string like "__key.". There could still be
collisions, but they'd be less likely if we choose an uncommon prefix. At
some point, we may also need to let users specify their own prefix, or even
something fancier like an explicit mapping. But I think we won't need that
feature on day 1.

3) There are also unformatted keys that might be simple strings or byte
arrays. These unformatted keys should become a single field. I’m not sure
which is more prevalent, or which one we should build first, but I think
ultimately we’ll want to support both styles.

On Fri, Apr 16, 2021 at 3:36 PM noam shaish  wrote:

> Hi,
> I would like to try to add an InputFormat for Kafka that also supports fields
> coming from the event key.
> In my scenario there are two options:
> 1. both key and value are json
> 2. key is delimited string and the value is json.
>
> Would such a feature be welcome as a contribution? Or should I keep it on
> my own fork?
>
> Thanks,
> Noam
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: Subject: [CVE-2021-26919] Authenticated users can execute arbitrary code from malicious MySQL database systems

2021-04-01 Thread Gian Merlino
I wanted to add a few more details about this advisory, in the hopes that
it will be helpful to people that are upgrading.

Here's a link to the relevant docs about the new properties:
https://druid.apache.org/docs/latest/configuration/index.html#ingestion-security-configuration

And the most secure setup for these properties is:

druid.access.jdbc.enforceAllowedProperties = true
druid.access.jdbc.allowUnknownJdbcUrlFormat = false

If you aren't reading any data from JDBC into Druid, you should add both of
these to your common.runtime.properties. If you are reading data from JDBC,
then you need to understand a little bit about how the properties work to
get a secure setup that won't break your JDBC workflow.

The first property enforces jdbc property validation for mysql and
postgresql. This is enough to block the MySQL-based attack mentioned in
this CVE. That's because the attack relies on setting a specific property a
specific way, which will be blocked by the validation. To set this without
breaking your workflow, make sure that any properties you use in JDBC urls
are added to the cluster-wide druid.access.jdbc.allowedProperties whitelist.
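
As a sketch, assuming the property names below are the ones that ship in the
default allow list (adjust them to whatever your ingestion specs actually
use), the whitelist could look like:

  druid.access.jdbc.enforceAllowedProperties=true
  druid.access.jdbc.allowedProperties=["useSSL", "requireSSL", "ssl", "sslmode"]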

The second property disables connections to other kinds of databases, where
we don't have code to validate properties. (Each driver's URL format is
unfortunately a bit different, so Druid can't understand what properties
are in use for arbitrary JDBC drivers.) This doesn't prevent any known
attacks, because the only one we know of specifically exploits the MySQL
driver. The purpose of this setting is to prevent any similar and
currently-unknown attacks that may involve other jdbc drivers. We provide
this option in case you are feeling paranoid.

Setting these properties may impact legitimate use cases. For example,
legitimate use cases would be impacted if you were using mysql or
postgresql properties that aren't on the default allow list, or if you were
using jdbc connections to database types other than mysql and postgresql.
We didn't want to break these things by surprise in a patch release, so the
most secure setup isn't enabled by default. In a future major version we'll
switch the defaults to the more secure ones.

On Mon, Mar 29, 2021 at 12:22 PM Jihoon Son  wrote:

> Severity: Medium
>
> Vendor:
> The Apache Software Foundation
>
> Versions Affected:
> Druid 0.20.1 and earlier
>
> Description:
> Druid allows users to read data from other database systems using
> JDBC. This functionality is to allow trusted users with the proper
> permissions to set up lookups or submit ingestion tasks. The MySQL
> JDBC driver supports certain properties, which, if left unmitigated,
> can allow an attacker to execute arbitrary code from a
> hacker-controlled malicious MySQL server within Druid server
> processes.
>
> Mitigation:
> Users should upgrade to Druid 0.20.2 and enable new Druid
> configurations to mitigate vulnerable MySQL JDBC properties.
> Whenever possible, network access to cluster machines should be
> restricted to trusted hosts only.
> Ensure that users have the minimum set of Druid permissions necessary,
> and are not granted access to functionality that they do not require.
>
> Credit:
> This issue was discovered by fantasyC4t from the Ant FG Security Lab.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: SpringBoot +MyBatis +Apache Druid

2021-03-10 Thread Gian Merlino
Hey Shamriya,

It would help to know some more about what kind of integration you're
trying to do, and which kind of driver class isn't being recognized.

On Wed, Mar 10, 2021 at 11:36 AM nandalapadu shamriyashaik <
nshamr...@gmail.com> wrote:

> Hi,
>
> I am new to Druid and struggling to integrate apache druid + SpringBoot +
> Mybatis.
> Tried to integrate SpringBoot +Druid but failed. It says driver classname
> is not recognized.
>
> Please help.
>
> Thanks
> Shamriya
>


Re: Contribute a new Community extensions : Launch Peon Pods Based on K8s

2021-03-02 Thread Gian Merlino
Hey Yue,

Very interesting idea. I am not a kubernetes expert, but this seems like a
neat concept. I guess the idea is only one MM would be needed? (Or maybe a
handful, if one can't manage every pod?) If so, great. Hopefully someone
that is more of a kubernetes expert will be able to chime in on the PR.

On Fri, Feb 26, 2021 at 12:46 AM Yue Zhang  wrote:

> Hi Druid,
>
> I’d like to contribute a new Community Extension named
> druid-kubernetes-middlemanager-extensions. Proposal Link:
> https://github.com/apache/druid/issues/10824 and PR link :
> https://github.com/apache/druid/pull/10910
>
> When enable this feature, Druid can launch Peon as a pod to do data
> ingestion. The advantages are as follows:
>
> Cost saving: there is no need to let the MiddleManager take up a lot of
> resources in advance; resources are requested only when they are actually
> needed. Now we can use a single 2-core, 2Gi-memory MiddleManager pod to
> control dozens or even hundreds of Peon pods.
> Improve resource utilization: Now different kinds of tasks can use
> different configs including CPU resources and Memory resources if necessary.
>
>   If you are interested in this feature,  please let me know. Looking
> forward to your reply.
>
> Best wishes,
>
> Yue Zhang
>
>
>


Re: Spark-Druid Connectors

2021-03-02 Thread Gian Merlino
Thank you!

On Thu, Feb 25, 2021 at 12:03 AM Julian Jaffe 
wrote:

> Hey Gian,
>
> I’d be overjoyed to be proven wrong! For what it’s worth, my pessimism was
> not driven by a lack of faith in the Druid community or the Druid
> committers but by the fact that these connectors may be an awkward fit in
> the Druid code base without more buy-in from the community writ large.
>
> The information you’re asking for is spread across a few places. I’ll
> consolidate it into the PR, emphasizing the UX and the tests. I should have
> it up within a day or so.
>
> Thanks,
> Julian
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: L1 (caffeine) cache hits/misses metrics not emitted

2021-02-24 Thread Gian Merlino
Hey Vadim,

According to
https://druid.apache.org/docs/latest/operations/metrics.html#cache, today,
the number of hits and misses for the hybrid cache are both emitted, but
there isn't differentiation between L1 hits and L2 hits. Is that what you
mean?

If so, I think the main issue is there just isn't any place for them in the
metric schema described in the doc page above. There's just one set of
standardized "query/cache/delta/" and "query/cache/total/" metrics, without
specifying how things would be broken down if you had a multi-level cache.
If you proposed an update to the schema that maintains compatibility with
the existing one (this'll probably mean emitting some new metrics), I think
that'd be a good start. Then you could implement it and have it
incorporated into Druid via pull request.

On Wed, Feb 24, 2021 at 12:37 AM Vararu, Vadim 
wrote:

> Hi all,
>
> I’ve found out that the L1 cache hits/misses metrics are not emitted. Only
> the total number of requests (hits+misses). Any reason for that? Where
> could I suggest that to be implemented? It seems essential when it comes to
> caching and I see no technical challenge with that.
>
> Thanks,
> Vadim.
>


Re: Spark-Druid Connectors

2021-02-23 Thread Gian Merlino
Hey Julian,

Your pessimism in this matter is understandable but regrettable!

It would be great to see this effort become part of mainline Druid. It is a
more maintainable approach than a separate repo, because it gets rid of the
risk of interface drift, and it makes sure that all the tests are run
whenever we do a Druid release. It's more upfront work for you (and for
us), but Spark and Druid are both important OSS projects and I think it is
good to encourage better integration between them. I have also written in
the past about the importance of us getting better at accepting
contributions (at https://s.apache.org/aqicd). It is not always easy, since
reviewing contributions takes time, and it is mostly done on a volunteer
basis. But I think if you are game to work with us on this one, let's try
to get it in. I say that out of pure idealism, not having looked at the
design or code at all.

In the mail I linked, I had written:

> For contributors, focusing on UX and tests means writing out (in natural
> language) how your patch changes user experience, and why you think this
> change is a good idea. It also means having good testing of the new stuff
> you're adding, and writing out (in natural language) why you think your
> tests cover all the important cases. Speaking as a person that has
reviewed
> a lot of code: these natural language descriptions are *very helpful*,
> especially when they add context to the patch. Don't make reviewers
> reverse-engineer your code to guess what you were thinking.

As I said, I haven't looked at your design doc or PR yet. But if they cover
the above stuff, could you please point me to the right places that have
the most up-to-date info, and I will put my money where my mouth is and
review them in the way that I suggested in that thread. (i.e., focusing on
user experience and test coverage.)

By the way, I think the mailing list chomped your links. I'll reproduce
them here.

1) Mailing list:
https://lists.apache.org/thread.html/r8219a7be0583ae3d9a2303fa7f21872782cf0703812a410bb62acfef%40%3Cdev.druid.apache.org%3E
2) Slack: https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600
3) GitHub: https://github.com/apache/druid/issues/9780
4) Pull request: https://github.com/apache/druid/pull/10920

On Tue, Feb 23, 2021 at 10:37 PM Julian Jaffe 
wrote:

>
> Hey Druids,
>
> Last April, there was some discussion on this mailing list, Slack, and
> GitHub around building Spark-Druid connectors. After working up a rough
> cut, the effort was dormant until a few weeks ago when I returned to it.
> I’ve opened a pull request for the connectors, but I don’t realistically
> expect it to be accepted. Am I too pessimistic in my assumptions here?
> Otherwise, what’s the best course of action - create a standalone repo and
> add a link in the Druid docs?
>
> Julian
>


Re: Deprecate support for ZooKeeper 3.4.x

2021-01-19 Thread Gian Merlino
About time, I suppose. I replied to the issue on GitHub. I think the
trickiest part is figuring out what migration will look like for users so
we can write up some useful release notes.

On Tue, Jan 19, 2021 at 5:43 PM Xavier Léauté  wrote:

> Hi everyone, I wrote up a short issue on deprecating support for ZooKeeper
> 3.4.x
> https://github.com/apache/druid/issues/10780
>
> It might be worth marking 3.4.x deprecated in 0.21 so we can accelerate the
> move to 3.5.x and unblock the work to support more recent JDKs.
>
> If we wait until 0.22, it means removing support in 0.23 at the earliest,
> which would mean waiting at least another 6 months.
>
> Does anyone else think this is reasonable, any thoughts?
>
> Thanks,
> Xavier
>


Re: Forbidding forced git push

2021-01-15 Thread Gian Merlino
Will this help for the (common) case where PR branches are in people's
forks?

On Fri, Jan 15, 2021 at 1:00 PM Jihoon Son  wrote:

> Hi all,
>
> The forced git push is usually used to make the commit history clean, which
> I understand its importance. However, one of its downsides is, because it
> overwrites the commit history, we cannot tell the exact change between
> commits while reviewing a PR. This increases the burden for reviewers
> because they have to go through the entire PR again after a forced push.
> For the same reason, we are suggesting to not use it in our documentation (
>
> https://github.com/apache/druid/blob/master/CONTRIBUTING.md#if-your-pull-request-shows-conflicts-with-master
> ),
> but I don't believe this documentation is well read by many people (It is a
> good doc, BTW. Maybe we should promote it more effectively).
>
> Since branch sharing doesn't usually happen for us (AFAIK, there has been
> no branch sharing so far), I think this is the biggest downside of using
> forced push. To me, clean commit history is not a big gain compared to how
> much it can make the review process worse, especially when the PR is big.
>
> So, I would like to suggest forbidding git forced push for the Druid
> repository. It seems possible to disable it by creating an infra ticket (
>
> https://issues.apache.org/jira/browse/INFRA-13613?jql=text%20~%20%22force%20push%22
> ).
> I can do it if everyone agrees.
>
> Would like to hear what people think.
> Jihoon
>


Re: Non JSON-query API clients

2020-11-13 Thread Gian Merlino
I'm not aware of plans to build out official clients for those other APIs;
when I've written python programs to integrate with them I've usually
called them through http directly.

I'm not familiar with OpenAPI, but looking at it briefly, it seems like an
interesting concept and a potential way to get some nice clients going.

The API endpoints are stable, in the sense that they change rarely, and if
they ever do change it will be in a major version and will be called out in
the release notes.

On Fri, Nov 6, 2020 at 7:56 AM ayoub mrini  wrote:

> Hello,
>
> I'm aware of some Python clients for the JSON query API, but I cannot find
> any Python client for the other endpoints, the ones documented here:
> https://druid.apache.org/docs/latest/operations/api-reference.html
>
> Are you planning to provide any official ones? Are those endpoints stable?
>
> If we can write the OpenAPI spec, for example, for those endpoints, we can
> generate the basic boilerplate client code for multiple langages.
>
>
> Ayoub Mrini
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: [E] Re: Removing Druid support for JDK 8 and adding support for JDK 11

2020-11-13 Thread Gian Merlino
Seconding (thirding?) the idea that keeping JDK 8 for integration with
Hadoop is important. Druid's Hadoop integration is built against Hadoop 2.x
and that version only supports JDK 8:
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions. We
shouldn't drop JDK 8 support until we move to Hadoop 3.3 or later (or drop
the Hadoop integration completely).

The costs of keeping things stable on JDK 8 don't seem that high. The ones
Suneet raised were:

1) Java 8 is end of life.
2) We can reduce our Travis usage by almost half.

As Xavier pointed out, though, (1) isn't wholly fair since there are still
plenty of supported enterprise distributions for JDK 8. And ASF Infra
hasn't complained to us about (2).

To me, the benefits of dropping JDK 8 support don't outweigh the downsides.

I do think we should start telling people to deploy on JDK 11 by default,
though, and we should focus any future expansion of testing efforts (like
perf tests, etc) on JDK 11. I would whole-heartedly support this for the
next Druid release.

I also agree with Julian that supporting newer JDKs, if feasible, would be
good too.

On Thu, Nov 12, 2020 at 5:25 PM Xavier Léauté  wrote:

> I agree with Himanshu we should keep JDK 8 support for integrations if we
> feel those are important.
>
> As far as lifecycle goes, JDK 8 support will continue as part of several
> linux distributions.
> For instance, RedHat has taken on the role of publishing clean upstream
> builds for OpenJDK JDK8 and JDK11 updates, with support until 2026.
> This means JDK8 will probably not go away for some time with enterprise
> users.
>
> Before we can even deprecate 8 we should at least make 11 (or later) the
> default / preferred JDK version.
> What would help is to have some first hand accounts of people running
> production deployments with 11.
> I don't think we have a good sense of whether there are any significant
> performance differences between 8 and 11.
> We should rule out any regressions before we make it the default.
>
> I also agree with Julian that we should keep the ball rolling. The biggest
> hurdle was (hopefully) to get from 8 to 11,
> and the changes for 15 should hopefully be smaller, but it's important we
> keep the momentum going.
>
> Generally I would recommend we support the LTS JDK versions + the latest
> Other projects (e.g. Apache Kafka) do the same and build 8+11+15.
> If we are concerned about build times, we can consider running integration
> tests for preferred version
> and reserve the full suite of JDKs for release candidates.
>
> Xavier
>


Code reviews, UX, and tests

2020-10-15 Thread Gian Merlino
Hey Druids,

I am writing to you all to ask for your help.

In particular, your help in ensuring that potential code contributions are
reviewed in a timely fashion. Right now we have 72 open PRs, which due to
stalebot are mostly opened pretty recently. That's a lot of people that
want to contribute stuff!

First, I wanted to point out that a lot of the reviews are done by a
relatively small number of people. If you are one of those people, thank
you. If you aren't, please consider becoming one.

Second, I've been thinking about how to make reviewing a little easier.
Maybe this will help more people become reviewers. I'd like to suggest
focusing on user experience and tests.

1) "User experience" means any new properties being introduced, any new
features being introduced, any behavior changes to existing features.

2) "Tests" are ideally automated tests that run as part of Travis CI. But
there are other kinds of tests too.

It's important to get user experience right, because if we release a
feature with poor UX, then it's hard to change later without breaking
compatibility. We'd like the master branch to be releasable at all times,
which means we should figure this stuff out before committing a patch.

It's also important to get tests right, because they ensure high quality
releases, and they also ensure that future changes can be made with low
risk.

It's less important to get other code-level things right, because we can
always change them later, and if there aren't UX or quality impacts then no
harm was done.

For contributors, focusing on UX and tests means writing out (in natural
language) how your patch changes user experience, and why you think this
change is a good idea. It also means having good testing of the new stuff
you're adding, and writing out (in natural language) why you think your
tests cover all the important cases. Speaking as a person that has reviewed
a lot of code: these natural language descriptions are *very helpful*,
especially when they add context to the patch. Don't make reviewers
reverse-engineer your code to guess what you were thinking.

For reviewers, it means that if we are confident in the UX and testing, we
don't need to spend a ton of time poring over every line of code. (I'd
still take the time to review key classes and interfaces, but this doesn't
take as much time as reviewing every line.) It also means that if we get a
PR that isn't set up to enable quick review, then we should write that to
the contributor and invite them to improve their PR, rather than ignoring
it or spending too much time on it. (Of course, we should only ask for this
if we can actually follow up with a review when the PR is improved later
on.)

I'd love to hear what people think.

Gian


Re: Help in Configuring data retention

2020-09-21 Thread Gian Merlino
Hey Satish,

Are you asking if Druid can write a log of load/drop rule changes to a
Kafka topic?

If so, no, it cannot. But it does write them to the metadata store, and
perhaps you could use a tool to copy them from the metadata store into
Kafka.

On Mon, Sep 21, 2020 at 6:46 AM Satish Embadi <
satish.emb...@senecaglobal.com> wrote:

> Hi Team,
>
> We are working on configuring data retention. Is there any way to publish
> the configured rules to kafka topic,  whenever we save?
>
> Appreciate your help at earliest.
>
> Thanks,
> Satish Embadi
>


New committer: Atul Mohan

2020-09-02 Thread Gian Merlino
Hey Druids,

The Druid PMC has invited Atul Mohan (@a2l007 
on github) to become a committer and we are pleased to announce that he has
accepted. Atul has been actively working on various parts of Druid,
including indexing from SQL sources and result-level caching.

Congratulations Atul!


Re: [CRON] Broken: apache/druid#28120 (master - c72f96a)

2020-08-19 Thread Gian Merlino
There's a lot of these with messages like:

> [ERROR] Failed to execute goal
org.owasp:dependency-check-maven:5.3.2:check (default-cli) on project
druid: Fatal exception(s) analyzing Druid: One or more exceptions occurred
during analysis:
> [ERROR] Unable to connect to the dependency-check database

This Github issue suggests there's a variety of fiddly things that can
cause the error to happen:
https://github.com/jeremylong/DependencyCheck/issues/1783

Is anyone that's familiar with this check able to look into it?

On Wed, Aug 19, 2020 at 5:21 AM Travis CI  wrote:

> Build Update for apache/druid
> -
>
> Build: #28120
> Status: Broken
>
> Duration: 2 mins and 52 secs
> Commit: c72f96a (master)
> Author: Clint Wylie
> Message: fix bug with expressions on sparse string realtime columns
> without explicit null valued rows (#10248)
>
> * fix bug with realtime expressions on sparse string columns
>
> * fix test
>
> * add comment back
>
> * push capabilities for dimensions to dimension indexers since they know
> things
>
> * style
>
> * style
>
> * fixes
>
> * getting a bit carried away
>
> * missed one
>
> * fix it
>
> * benchmark build fix
>
> * review stuffs
>
> * javadoc and comments
>
> * add comment
>
> * more strict check
>
> * fix missed usaged of impl instead of interface
>
> View the changeset:
> https://github.com/apache/druid/compare/35284e51668b7315808176a1a04464ad48fb15d2...c72f96a4babdf5055912bb0fb5eb2236cfe0ef23
>
> View the full build log and details:
> https://travis-ci.org/github/apache/druid/builds/717091967?utm_medium=notification_source=email
>
>
> --
>
> You can unsubscribe from build emails from the apache/druid repository
> going to
> https://travis-ci.org/account/preferences/unsubscribe?repository=578446_medium=notification_source=email
> .
> Or unsubscribe from *all* email updating your settings at
> https://travis-ci.org/account/preferences/unsubscribe?utm_medium=notification_source=email
> .
> Or configure specific recipients for build notifications in your
> .travis.yml file. See https://docs.travis-ci.com/user/notifications.
>
>


Re: SQL Support for Tuple Sketches

2020-08-12 Thread Gian Merlino
Hey Mithal,

I'm not aware of anyone currently working on it, so you certainly are
welcome to!

On Mon, Aug 10, 2020 at 11:56 AM Mithal Kothari  wrote:

> Hi Druid Dev team,
>
> I just wanted to follow up with you all and find out if there is a
> plan/possibility to introduce sql support for tuple sketches?
>
> --
> Regards,
> Mithal
>


Re: Study On Rejected Refactorings

2020-08-12 Thread Gian Merlino
Hey Jevgenija,

I recently filled out the survey — hope the response is helpful!

On Tue, Aug 11, 2020 at 1:05 PM Jevgenija Pantiuchina <
jevgenija.pantiuch...@usi.ch> wrote:

> Dear contributors,
>
> As part of a research team from Università della Svizzera italiana
> (Switzerland) and University of Sannio (Italy), we have analyzed
> refactoring pull requests in apache/druid repository and are looking for
> developers for a short 5-10 min survey (
> https://usi.eu.qualtrics.com/jfe/form/SV_cO6Ayah0D6q4eSF). Would you
> please spare your time by answering some questions about
> refactoring-related contributions? We would greatly appreciate your input —
> it would help us understand how developers can improve the quality of
> refactoring contributions, and benefit the development process. The
> responses will be anonymized and handled confidentially! Thank you a lot!
>
>


Re: Druid not listed in Apache project list by category?

2020-07-31 Thread Gian Merlino
That's a good point. We must be missing some metadata. I'm not sure how
this page works — does anyone else know?

On Fri, Jul 31, 2020 at 11:49 AM Will Lauer 
wrote:

> I was browsing the list of Apache projects today looking for something, and
> while I was there, I noticed that Druid was missing. While it showed up in
> the list of all projects (https://projects.apache.org/projects.html), I
> don't see it listed anywhere in the list of projectes grouped by category (
> https://projects.apache.org/projects.html?category). I think this second
> place (projects grouped by category) is the more useful of the two for
> exploring Apache projects, so I was surprised to see Druid missing from
> that list. Perhaps some metadata is missing on the druid project to get it
> to be categorized correctly?
>
> Will
>
> 
>
> Will Lauer
>
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
>
> M 508 561 6427
> 1908 S. First St
> Champaign, IL 61822
>
>


Re: Druid + Presto?

2020-07-10 Thread Gian Merlino
One other thing I'm wondering is how similar are the two forks of Presto?
Are patches generally being shared between them or are they going off in
different directions? One example: as I understand it, aggregate pushdown
support was added to the core of both forks relatively recently — within
the last year or so — does it work the same way in each one? I'm wondering
how much work can be shared between these different efforts and perhaps
between these efforts and the Druid project itself.

On Thu, Jul 9, 2020 at 11:24 PM Gian Merlino  wrote:

> Hey Samarth,
>
> Thanks for sharing these details.
>
> In the overall warehouse + Druid setup you're envisioning, would Druid be
> the main way of querying the tables that it stores? Or would they all be
> synced periodically from the warehouse into Druid, using the warehouse as a
> source of truth? I'm asking since I'm wondering how important it is to
> think about functionality that might help load datasources based on tables
> that are in the Presto metastore.
>
> >  You bring up an interesting idea on the reverse connector. What do you
> think the value of such a connector will be? I am assuming Druid SQL for
> the most part is ANSI SQL.
>
> Druid SQL is ANSI SQL for the most part but there are two big differences.
> First, it doesn't support everything in ANSI SQL (two examples: it
> currently doesn't support shuffle joins and windowed aggregations). Second,
> it supports some functionality that is not in ANSI SQL (like the TIME_ and
> DS_ operators). So it is smaller in some ways and bigger in other ways. I
> was thinking a reverse translator could let you write a Druid SQL query
> that uses our special operators, but also requires a shuffle join, and then
> translate and execute it as an equivalent Presto SQL query. The idea being
> you can express your query in either dialect and get routed to the right
> place in the end.
>
> On Thu, Jul 9, 2020 at 4:36 PM Samarth Jain  wrote:
>
>> Gian,
>>
>> For the presto-sql version of Druid connector, for V1, we decided to
>> pursue
>> the JDBC route. You can follow along on the progress here -
>> https://github.com/prestosql/presto/issues/1855
>> My colleague, Parth (cc'ed as well) is working on implementing Druid
>> aggregation push down including support for top-n style queries. Our
>> immediate use cases, and what we think Druid
>> generally is more suitable for, is for solving for aggregate group by
>> style
>> queries. Having a presto-druid connector also enables us to join data in
>> Druid with the rest of our warehouse.
>> In general though, for queries that don't do any aggregations i.e. which
>> get translated to Druid SCAN queries, it makes sense to by-pass the Druid
>> datanodes altogether and directly go
>> to the deep storage. I think Druid provides enough metadata about the
>> active segment files to be able to do that relatively easily.
>>
>> You bring up an interesting idea on the reverse connector. What do you
>> think the value of such a connector will be? I am assuming Druid SQL for
>> the most part is ANSI SQL.
>>
>> On Thu, Jul 9, 2020 at 12:56 PM Zhenxiao Luo 
>> wrote:
>>
>> > Thank you, Mainak.
>> >
>> > Hi Gian,
>> >
>> > Glad to see you are interested in Presto Druid connector.
>> >
>> > My colleague, @Hao Luo  @Beinan Wang
>> >  and
>> > me, together, implemented the Presto Druid connector in PrestoDB:
>> > https://prestodb.io/docs/current/connector/druid.html
>> >
>> > Our implementation includes:
>> > 1. Presto could scan Druid segments to compute SQL results
>> > 2. aggregation pushdown, where Presto leverages Druid fast aggregation
>> > capabilities, and stream aggregated result from Druid
>> > actually, we implemented 2 execution paths, users could use
>> configurations
>> > to control whether they'd like to scan segments or pushdown all
>> sub-queries
>> > to Druid
>> >
>> > We ran benchmarks comparing the Presto Druid connector with other SQL
>> > engines, and are ready to run production workloads.
>> >
>> > Thanks,
>> > Zhenxiao
>> >
>> > On Thu, Jul 9, 2020 at 12:40 PM Mainak Ghosh 
>> wrote:
>> >
>> > > Hello Gian,
>> > >
>> > > We are currently testing the (other) Presto Druid connector at our
>> end.
>> > It
>> > > has aggregation push down support. Adding Zhenxiao to this thread
>> since
>> > he
>> > > is the primary developer of the connector. He can provide the kind of
>> > &

Re: Any benchmarks for druid ingesting, querying (min, max, topn avg etc)

2020-07-10 Thread Gian Merlino
Hey Rajiv,

I'm not aware of one for ingestion. For querying, two recent results using
the Star Schema Benchmark are this paper comparing Druid, Hive, and Presto:
https://www.researchgate.net/publication/333831332_Challenging_SQL-on-Hadoop_Performance_with_Apache_Druid,
and this blog post comparing Druid and BigQuery:
https://imply.io/post/apache-druid-google-bigquery-benchmark. In these
results Druid came out being many times faster than the other systems it
was compared with.

On Thu, Jul 9, 2020 at 11:07 PM Rajiv Mordani  wrote:

> Are there any benchmarks that measure ingestion, querying and processing
> of data in druid?
>
>
>   *   Rajiv
>


Re: Druid + Presto?

2020-07-10 Thread Gian Merlino
Hey Samarth,

Thanks for sharing these details.

In the overall warehouse + Druid setup you're envisioning, would Druid be
the main way of querying the tables that it stores? Or would they all be
synced periodically from the warehouse into Druid, using the warehouse as a
source of truth? I'm asking since I'm wondering how important it is to
think about functionality that might help load datasources based on tables
that are in the Presto metastore.

>  You bring up an interesting idea on the reverse connector. What do you
think the value of such a connector will be? I am assuming Druid SQL for
the most part is ANSI SQL.

Druid SQL is ANSI SQL for the most part but there are two big differences.
First, it doesn't support everything in ANSI SQL (two examples: it
currently doesn't support shuffle joins and windowed aggregations). Second,
it supports some functionality that is not in ANSI SQL (like the TIME_ and
DS_ operators). So it is smaller in some ways and bigger in other ways. I
was thinking a reverse translator could let you write a Druid SQL query
that uses our special operators, but also requires a shuffle join, and then
translate and execute it as an equivalent Presto SQL query. The idea being
you can express your query in either dialect and get routed to the right
place in the end.
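
As a small illustration of the second category (assuming the datasketches
extension is loaded, and using made-up datasource and column names), a query
like this is valid Druid SQL but not ANSI SQL:

  SELECT
    TIME_FLOOR(__time, 'PT1H') AS "hour",
    APPROX_COUNT_DISTINCT_DS_HLL(user_id) AS unique_users
  FROM events
  GROUP BY 1

A reverse translator would need to map operators like these onto whatever
equivalents exist in Presto's dialect, or push those pieces of the query down
to Druid.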

On Thu, Jul 9, 2020 at 4:36 PM Samarth Jain  wrote:

> Gian,
>
> For the presto-sql version of Druid connector, for V1, we decided to pursue
> the JDBC route. You can follow along on the progress here -
> https://github.com/prestosql/presto/issues/1855
> My colleague, Parth (cc'ed as well) is working on implementing Druid
> aggregation push down including support for top-n style queries. Our
> immediate use cases, and what we think Druid
> generally is more suitable for, is for solving for aggregate group by style
> queries. Having a presto-druid connector also enables us to join data in
> Druid with the rest of our warehouse.
> In general though, for queries that don't do any aggregations i.e. which
> get translated to Druid SCAN queries, it makes sense to by-pass the Druid
> datanodes altogether and directly go
> to the deep storage. I think Druid provides enough metadata about the
> active segment files to be able to do that relatively easily.
>
> You bring up an interesting idea on the reverse connector. What do you
> think the value of such a connector will be? I am assuming Druid SQL for
> the most part is ANSI SQL.
>
> On Thu, Jul 9, 2020 at 12:56 PM Zhenxiao Luo 
> wrote:
>
> > Thank you, Mainak.
> >
> > Hi Gian,
> >
> > Glad to see you are interested in Presto Druid connector.
> >
> > My colleague, @Hao Luo  @Beinan Wang
> >  and
> > me, together, implemented the Presto Druid connector in PrestoDB:
> > https://prestodb.io/docs/current/connector/druid.html
> >
> > Our implementation includes:
> > 1. Presto could scan Druid segments to compute SQL results
> > 2. aggregation pushdown, where Presto leverages Druid fast aggregation
> > capabilities, and stream aggregated result from Druid
> > actually, we implemented 2 execution paths, users could use
> configurations
> > to control whether they'd like to scan segments or pushdown all
> sub-queries
> > to Druid
> >
> > We ran benchmarks comparing the Presto Druid connector with other SQL
> > engines, and are ready to run production workloads.
> >
> > Thanks,
> > Zhenxiao
> >
> > On Thu, Jul 9, 2020 at 12:40 PM Mainak Ghosh  wrote:
> >
> > > Hello Gian,
> > >
> > > We are currently testing the (other) Presto Druid connector at our end.
> > It
> > > has aggregation push down support. Adding Zhenxiao to this thread since
> > he
> > > is the primary developer of the connector. He can provide the kind of
> > > details you are looking for.
> > >
> > > Thanks,
> > > Mainak
> > >
> > > > On Jul 9, 2020, at 12:25 PM, Gian Merlino  wrote:
> > > >
> > > > By the way, I see that the other Presto has a Druid connector too:
> > > > https://prestodb.io/docs/current/connector/druid.html. From the docs
> > it
> > > > looks like it has different lineage and might even work differently.
> > > >
> > > > On Thu, Jul 9, 2020 at 12:22 PM Gian Merlino 
> wrote:
> > > >
> > > >> I was thinking of exploring ideas like pushing down aggregations,
> > > enabling
> > > >> Presto to query directly from deep storage (in cases where there
> > aren't
> > > any
> > > >> interesting things to push down, this may be more efficient than
> > > querying
> > > >> Druid s

Re: Druid + Presto?

2020-07-10 Thread Gian Merlino
Hey Zhenxiao, Hao, Beinan, Mainak,

Thanks for sharing information about your work.

You mention benchmarks — I'm curious, did you have a chance to benchmark
each execution path? How do they look?

When you were developing the connector, did you feel like any changes in
Druid would make it easier to integrate things between the two projects?

On Thu, Jul 9, 2020 at 12:56 PM Zhenxiao Luo 
wrote:

> Thank you, Mainak.
>
> Hi Gian,
>
> Glad to see you are interested in Presto Druid connector.
>
> My colleagues @Hao Luo and @Beinan Wang and I together implemented the
> Presto Druid connector in PrestoDB:
> https://prestodb.io/docs/current/connector/druid.html
>
> Our implementation includes:
> 1. Presto can scan Druid segments to compute SQL results
> 2. Aggregation pushdown, where Presto leverages Druid's fast aggregation
> capabilities and streams aggregated results from Druid
> Actually, we implemented two execution paths; users can use configuration
> to control whether they'd like to scan segments or push down all
> sub-queries to Druid.
>
> We have run benchmarks comparing the Presto Druid connector with other SQL
> engines, and we are ready to run production workloads.
>
> Thanks,
> Zhenxiao
>
> On Thu, Jul 9, 2020 at 12:40 PM Mainak Ghosh  wrote:
>
> > Hello Gian,
> >
> > We are currently testing the (other) Presto Druid connector at our end.
> It
> > has aggregation push down support. Adding Zhenxiao to this thread since
> he
> > is the primary developer of the connector. He can provide the kind of
> > details you are looking for.
> >
> > Thanks,
> > Mainak
> >
> > > On Jul 9, 2020, at 12:25 PM, Gian Merlino  wrote:
> > >
> > > By the way, I see that the other Presto has a Druid connector too:
> > > https://prestodb.io/docs/current/connector/druid.html. From the docs
> it
> > > looks like it has different lineage and might even work differently.
> > >
> > > On Thu, Jul 9, 2020 at 12:22 PM Gian Merlino  wrote:
> > >
> > >> I was thinking of exploring ideas like pushing down aggregations,
> > enabling
> > >> Presto to query directly from deep storage (in cases where there
> aren't
> > any
> > >> interesting things to push down, this may be more efficient than
> > querying
> > >> Druid servers), enabling translation from Druid's SQL dialect to
> > Presto's
> > >> SQL dialect (a "reverse connector"), etc. Do you (or anyone else on
> this
> > >> list) have any thoughts on any of those?
> > >>
> > >> I'm also curious what kinds of improvements you're planning to the
> > >> connector you built.
> > >>
> > >> On Thu, Jul 9, 2020 at 10:18 AM Samarth Jain 
> > >> wrote:
> > >>
> > >>> Hi Gian,
> > >>>
> > >>> I contributed the jdbc based presto-druid connector in prestosql
> which
> > >>> went
> > >>> out in release 337
> > >>> https://prestosql.io/docs/current/release/release-337.html. The v1
> > >>> version
> > >>> of the connector doesn’t support aggregate push down yet. It is being
> > >>> actively worked on and we expect it to be improved over the next few
> > >>> releases. We are currently evaluating using the presto-druid
> connector
> > in
> > >>> our Tableau setup. It would be interesting to see what changes in
> Druid
> > >>> would be needed to support that integration.
> > >>>
> > >>> Thanks,
> > >>> Samarth
> > >>>
> > >>> On Thu, Jul 9, 2020 at 10:07 AM Gian Merlino 
> wrote:
> > >>>
> > >>>> Hey Druids,
> > >>>>
> > >>>> I was wondering, is anyone on this list using Druid + Presto
> together?
> > >>> If
> > >>>> so, what does your architecture look like and which edition / flavor
> > of
> > >>>> Presto and Druid connector are you using? What's your experience
> been
> > >>> like?
> > >>>> I'm asking since I'm starting to think about whether it makes sense
> to
> > >>> look
> > >>>> at ways to improve the integration between the two projects.
> > >>>>
> > >>>> Gian
> > >>>>
> > >>>
> > >>
> >
> >
>


Re: Druid + Presto?

2020-07-09 Thread Gian Merlino
By the way, I see that the other Presto has a Druid connector too:
https://prestodb.io/docs/current/connector/druid.html. From the docs it
looks like it has different lineage and might even work differently.

On Thu, Jul 9, 2020 at 12:22 PM Gian Merlino  wrote:

> I was thinking of exploring ideas like pushing down aggregations, enabling
> Presto to query directly from deep storage (in cases where there aren't any
> interesting things to push down, this may be more efficient than querying
> Druid servers), enabling translation from Druid's SQL dialect to Presto's
> SQL dialect (a "reverse connector"), etc. Do you (or anyone else on this
> list) have any thoughts on any of those?
>
> I'm also curious what kinds of improvements you're planning to the
> connector you built.
>
> On Thu, Jul 9, 2020 at 10:18 AM Samarth Jain 
> wrote:
>
>> Hi Gian,
>>
>> I contributed the jdbc based presto-druid connector in prestosql which
>> went
>> out in release 337
>> https://prestosql.io/docs/current/release/release-337.html. The v1
>> version
>> of the connector doesn’t support aggregate push down yet. It is being
>> actively worked on and we expect it to be improved over the next few
>> releases. We are currently evaluating using the presto-druid connector in
>> our Tableau setup. It would be interesting to see what changes in Druid
>> would be needed to support that integration.
>>
>> Thanks,
>> Samarth
>>
>> On Thu, Jul 9, 2020 at 10:07 AM Gian Merlino  wrote:
>>
>> > Hey Druids,
>> >
>> > I was wondering, is anyone on this list using Druid + Presto together?
>> If
>> > so, what does your architecture look like and which edition / flavor of
>> > Presto and Druid connector are you using? What's your experience been
>> like?
>> > I'm asking since I'm starting to think about whether it makes sense to
>> look
>> > at ways to improve the integration between the two projects.
>> >
>> > Gian
>> >
>>
>


Re: Druid + Presto?

2020-07-09 Thread Gian Merlino
I was thinking of exploring ideas like pushing down aggregations, enabling
Presto to query directly from deep storage (in cases where there aren't any
interesting things to push down, this may be more efficient than querying
Druid servers), enabling translation from Druid's SQL dialect to Presto's
SQL dialect (a "reverse connector"), etc. Do you (or anyone else on this
list) have any thoughts on any of those?

I'm also curious what kinds of improvements you're planning to the
connector you built.

On Thu, Jul 9, 2020 at 10:18 AM Samarth Jain  wrote:

> Hi Gian,
>
> I contributed the jdbc based presto-druid connector in prestosql which went
> out in release 337
> https://prestosql.io/docs/current/release/release-337.html. The v1 version
> of the connector doesn’t support aggregate push down yet. It is being
> actively worked on and we expect it to be improved over the next few
> releases. We are currently evaluating using the presto-druid connector in
> our Tableau setup. It would be interesting to see what changes in Druid
> would be needed to support that integration.
>
> Thanks,
> Samarth
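
As a reference point for what this looks like from the Presto side: the
JDBC-based connector goes through Druid's Avatica endpoint on the Broker, so
Druid tables simply appear under a catalog. A small sketch, assuming a
catalog named "druid" with a "druid" schema, plus a made-up warehouse table
to illustrate joining Druid data with the rest of a warehouse:

  -- Browse what the connector exposes:
  SHOW TABLES FROM druid.druid

  -- Since v1 has no aggregate pushdown, Presto pulls rows over JDBC and does
  -- the aggregation and the join itself ("hive.dim.channel_regions" is a
  -- hypothetical warehouse table):
  SELECT w.channel, d.region, count(*) AS edits
  FROM druid.druid.wikipedia AS w
  JOIN hive.dim.channel_regions AS d
    ON w.channel = d.channel
  GROUP BY w.channel, d.region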
>
> On Thu, Jul 9, 2020 at 10:07 AM Gian Merlino  wrote:
>
> > Hey Druids,
> >
> > I was wondering, is anyone on this list using Druid + Presto together? If
> > so, what does your architecture look like and which edition / flavor of
> > Presto and Druid connector are you using? What's your experience been
> like?
> > I'm asking since I'm starting to think about whether it makes sense to
> look
> > at ways to improve the integration between the two projects.
> >
> > Gian
> >
>


Druid + Presto?

2020-07-09 Thread Gian Merlino
Hey Druids,

I was wondering, is anyone on this list using Druid + Presto together? If
so, what does your architecture look like and which edition / flavor of
Presto and Druid connector are you using? What's your experience been like?
I'm asking since I'm starting to think about whether it makes sense to look
at ways to improve the integration between the two projects.

Gian


New committer: Maggie Brewster

2020-07-07 Thread Gian Merlino
Hey Druids,

The Druid PMC has invited Maggie Brewster (@mcbrewster
 on github) to become a committer and we are
pleased to announce that she has accepted. Maggie has made dozens of
contributions to Druid, especially to the (relatively) new web console.

Congratulations Maggie!


New committer: Suneet Saldanha

2020-07-07 Thread Gian Merlino
Hey Druids,

The Druid PMC has invited Suneet Saldanha (@suneet-s
 on github) to become a committer and we are
pleased to announce that he has accepted. Suneet has contributed to areas
including the new join functionality, documentation, and general code
quality. He has also been active in reviewing the work of others, even
before becoming a committer, which is always appreciated.

Congratulations Suneet!


New committer: Lucas Capistrant

2020-07-07 Thread Gian Merlino
Hey Druids,

The Druid PMC has invited Lucas Capistrant (@capistrant
 on github) to become a committer and we are
pleased to announce that he has accepted. Lucas has been active throughout
the past year, contributing various enhancements and fixes.

Congratulations Lucas!

