Coordinator and Overlord Readiness Checks

2019-04-05 Thread Julian Jaffe
Hey all, I'm looking to add readiness checks to coordinator and overlord nodes to improve automatic deployment of clusters. More specifically, I'm planning to add an endpoint to the coordinator and overlord that returns 200 OK when the node is ready to process lookup/tiering
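The deployment pattern this proposal enables — a script polling the new endpoint until it returns 200 OK before proceeding — might look like the sketch below. The endpoint URL and timeouts are placeholders, not part of the proposal.

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(probe, timeout_s=300.0, interval_s=5.0):
    """Poll `probe` (a zero-arg callable returning an HTTP status code)
    until it reports 200 OK or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if probe() == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # node not up yet; keep polling
        time.sleep(interval_s)
    return False

def http_probe(url):
    """Build a probe that issues a GET and returns the status code."""
    def probe():
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status
    return probe

# A deployment script would then gate on readiness, e.g.
# (URL and path are hypothetical placeholders):
# wait_until_ready(http_probe("http://coordinator:8081/status/ready"))
```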

Re: Spark batch with Druid

2019-02-11 Thread Julian Jaffe
data from deep storage instead of sending an HTTP request to the Druid cluster and waiting for the response. On Sat, Feb 9, 2019 at 5:02 PM Rajiv Mordani wrote: > Thanks Julian, > See some questions in-line: > > On 2/6/19, 3:01 PM, "Julian Jaffe" wrote: >

Re: Spark batch with Druid

2019-02-06 Thread Julian Jaffe
I think this question is going the other way (e.g. how to read data into Spark, as opposed to into Druid). For that, the quickest and dirtiest approach is probably to use Spark's json support to parse a Druid response. You may also be able to repurpose some code from
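The "quick and dirty" approach mentioned here relies on the shape of Druid's native JSON responses. A timeseries response, for example, is a JSON array of `{"timestamp": ..., "result": {...}}` objects; flattening each entry yields rows that Spark's JSON reader (or `createDataFrame`) can consume. The sketch below shows the flattening in plain Python; the response payload is invented for illustration.

```python
import json

def flatten_druid_response(payload):
    """Flatten a Druid timeseries-style JSON response (an array of
    {"timestamp": ..., "result": {...}} objects) into flat row dicts
    suitable for handing to spark.createDataFrame or spark.read.json."""
    rows = []
    for entry in json.loads(payload):
        row = {"timestamp": entry["timestamp"]}
        row.update(entry.get("result", {}))
        rows.append(row)
    return rows

# Illustrative response; field names are made up.
response = '[{"timestamp": "2019-02-06T00:00:00Z", "result": {"edits": 42, "pages": 7}}]'
rows = flatten_druid_response(response)
```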

Re: Coordinator and Overlord Readiness Checks

2019-04-08 Thread Julian Jaffe
On Fri, 5 Apr 2019 at 23:23, Julian Jaffe > wrote: > > > Hey all, > > > > I'm looking to add readiness checks to coordinator and overlord nodes to > > improve automatic deployment of clusters. More specifically, I'm planning > > to add an endpoint to the

Running UNION ALL queries in parallel

2019-04-19 Thread Julian Jaffe
Hey all, Druid currently executes UNION ALL queries sequentially ( https://github.com/apache/incubator-druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidUnionRel.java#L98). There's a comment in that method that restates this, but does not explain why. Is there a reason why
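The parallelization being asked about can be sketched generically: run each operand of the UNION ALL concurrently, then concatenate the per-operand results in their original order (UNION ALL preserves duplicates and operand order). This is a schematic of the idea, not Druid's actual query runner.

```python
from concurrent.futures import ThreadPoolExecutor

def union_all(run_query, queries, max_workers=4):
    """Execute each sub-query concurrently and concatenate results
    in the operands' original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order even though execution overlaps
        results = pool.map(run_query, queries)
        return [row for result in results for row in result]

# Toy stand-in for real sub-queries:
tables = {"a": [1, 2], "b": [3], "c": [4, 5]}
rows = union_all(lambda q: tables[q], ["a", "b", "c"])
```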

Re: Spark-based ingestion into Druid

2020-03-03 Thread Julian Jaffe
I've submitted https://github.com/apache/druid/pull/9454 today to add an `OnHeapMemorySegmentWriteOutMediumFactory`. On Mon, Mar 2, 2020 at 8:57 AM Oğuzhan Mangır wrote: > > > On 2020/02/26 13:26:13, itai yaffe wrote: > > Hey, > > Per Gian's proposal, and following this thread in Druid user

Re: Spark-based ingestion into Druid

2020-02-27 Thread Julian Jaffe
I think for whatever approach we take, we'll need to expose an OnHeapMemorySegmentWriteOutMediumFactory for OnHeapMemorySegmentWriteOutMedium that parallels OffHeapMemorySegmentWriteOutMediumFactory. Although off-heap index building will be faster, it's very difficult to get most schedulers to

Re: Spark-based ingestion into Druid

2020-03-05 Thread Julian Jaffe
dataQueries, to allow > batch-oriented queries to work with against the deep storage :) > > Anyway, as I said, I think we can focus on write capabilities for now, and > worry about read capabilities later (if that's OK). > > On 2020/03/05 18:29:09, Julian Jaffe > wrote: > >

Re: Spark-based ingestion into Druid

2020-03-05 Thread Julian Jaffe
The spark-druid-connector you shared brings up another design decision we should probably talk through. That connector effectively wraps an HTTP query client with Spark plumbing. An alternative approach (and the one I ended up building due to our business requirements) is to build a reader that

Re: Druid Namespacing Proposal

2020-03-10 Thread Julian Jaffe
op > and Kafka? > > Julian > > > On Mar 9, 2020, at 4:48 PM, Julian Jaffe > wrote: > > > > Hey all, > > > > I recently wrote a proposal <https://github.com/apache/druid/issues/9463 > > > > to add namespacing to Druid segments to a

Re: Spark-based ingestion into Druid

2020-04-02 Thread Julian Jaffe
t of the requirements please include querying / reading from > Spark > > > as well. This is a high priority for us. > > > > > > - Rajiv > > > > > > On 3/10/20, 1:26 AM, "Oguzhan Mangir" > > > wrote: > > > > > > What we wil

Spark-Druid Connectors Proposal

2020-04-28 Thread Julian Jaffe
Hey all, There have been ongoing discussions on this list and in Slack about improving interoperability between Spark and Druid by creating Spark connectors that can read from and write to Druid clusters. As these discussions have begun to converge on a potential solution, I've opened a proposal

Re: Spark-based ingestion into Druid

2020-04-28 Thread Julian Jaffe
Github proposal <https://github.com/apache/druid/issues/9780>. I'll send a separate email to the dev list in the morning as well. On Thu, Apr 2, 2020 at 11:04 AM Julian Jaffe wrote: > I had a few hours last night, so I worked up a rough cut of a spark reader > <htt

Re: Spark-Druid Connectors

2021-06-27 Thread Julian Jaffe
Bimonthly ping for reviews :) I’m perfectly willing to hop on Slack or a video call to walk through the code and design as well if potential reviewers would find that helpful. > On Apr 14, 2021, at 10:06 AM, Julian Jaffe wrote: > >  > Hey Samarth, > > I’m overjoyed to

Re: Type 2 SCDs

2021-04-23 Thread Julian Jaffe
Hey Jagannatha, Please see the Druid Schema Design Tips[1] for more information, but for SCD2 usually the easiest and most performant solution is to denormalize your data. If you store the current value at ingestion time, when your dimensions change, new rows will be written with the new
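The denormalize-at-ingestion approach described above — stamping each incoming fact row with the current value of each slowly changing dimension — can be illustrated schematically. Column and key names here are invented for the example.

```python
def denormalize(event, current_dims):
    """At ingestion time, copy the *current* dimension values onto the
    event. When a dimension later changes, newly ingested rows carry the
    new value while older rows keep the value that was current when they
    were written — type 2 behavior with no join at query time."""
    row = dict(event)
    row.update(current_dims.get(event["user_id"], {}))
    return row

dims = {"u1": {"plan": "free"}}
r1 = denormalize({"user_id": "u1", "clicks": 3}, dims)
dims["u1"] = {"plan": "pro"}   # the dimension changes
r2 = denormalize({"user_id": "u1", "clicks": 5}, dims)
```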

Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-07 Thread Julian Jaffe
Hey Benedict, Have you tried creating indices on your segments table? I’ve managed Druid clusters with orders of magnitude more segments without this issue by indexing key filter columns. (The coordinator is still a painful bottleneck, just not due to query times to the metadata server.)
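The effect of indexing the filter columns can be demonstrated schematically with SQLite: an index on the columns the coordinator filters by turns a full table scan into an index lookup. Table and column names below are illustrative, not Druid's actual metadata schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schematic stand-in for a segments metadata table.
conn.execute("""CREATE TABLE segments (
    id TEXT PRIMARY KEY, datasource TEXT, used INTEGER, end_time TEXT)""")
# Index the columns commonly used as filters.
conn.execute("CREATE INDEX idx_segments_used ON segments (used, end_time)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM segments WHERE used = 1"
).fetchall()
# The query plan should reference the index rather than a full table scan.
```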

Re: Spark-Druid Connectors

2021-02-25 Thread Julian Jaffe
Hey Gian, I’d be overjoyed to be proven wrong! For what it’s worth, my pessimism was not driven by a lack of faith in the Druid community or the Druid committers but by the fact that these connectors may be an awkward fit in the Druid code base without more buy-in from the community writ

Spark-Druid Connectors

2021-02-23 Thread Julian Jaffe
Hey Druids, Last April, there was some discussion on this mailing list, Slack, and GitHub around building Spark-Druid connectors. After working up a rough cut, the effort was dormant until a few weeks ago when I returned to it. I’ve opened a pull request for the connectors, but I don’t

Re: Spark-Druid Connectors

2021-04-14 Thread Julian Jaffe
our Spark-Druid connector PRs. Ingesting data > into Druid using Spark SQL and Dataframe API is something we are very keen > to onboard. > Could you point me to them or alternatively add me as a reviewer? > > - Samarth > >> On Tue, Apr 13, 2021 at 11:51 PM Julian Jaffe >

Re: Spark-Druid Connectors

2021-04-14 Thread Julian Jaffe
Thu, Feb 25, 2021 at 12:03 AM Julian Jaffe >> wrote: >> >> Hey Gian, >> >> I’d be overjoyed to be proven wrong! For what it’s worth, my pessimism was >> not driven by a lack of faith in the Druid community or the Druid >> committers but by the fact that

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

2022-08-22 Thread Julian Jaffe
For Spark support, the connector I wrote remains functional but I haven’t updated the PR for six months or so since it didn’t seem like there was an appetite for review. If that’s changing I could migrate back some more recent changes to the OSS PR. Even with an up-to-date patch though I see

Spark Druid connectors, take 2

2023-08-08 Thread Julian Jaffe
Hey all, There was talk earlier this year about resurrecting the effort to add direct Spark readers and writers to Druid. Rather than repeat the previous attempt and parachute in with updated connectors, I’d like to start by building a little more consensus around what the Druid dev community