Fwd: Announcing ApacheCon @Home 2020

2020-07-01 Thread Felix Cheung

-- Forwarded message -

We are pleased to announce that ApacheCon @Home will be held online,
September 29 through October 1.

More event details are available at https://apachecon.com/acah2020, but
there are a few things I want to highlight for you, the members.

- Yes, the CFP has been reopened. It will remain open until the morning of
  July 13th. With no restrictions on space or time at the venue, we can
  accept talks from a much wider pool of speakers, so we look forward to
  hearing from those of you who may have been reluctant, or unwilling, to
  travel to the US.
- Yes, you can add your project to the event, whether that's one talk or an
  entire track - we have the room now. Those of you who are PMC members will
  receive information about how to get your projects represented at the
  event.
- Attendance is free, as has been the trend for these events in our
  industry. We do, however, offer donation options for attendees who feel
  that our content is worth paying for.
- Sponsorship opportunities are available immediately at
  https://www.apachecon.com/acna2020/sponsors.html

If you would like to volunteer to help, please join the
plann...@apachecon.com mailing list and discuss it there rather than here,
so that we don't end up with a split discussion while we're trying to
coordinate everything that has to get done in this very short time window.

Rich Bowen,
VP Conferences, The Apache Software Foundation




Re: REST Structured Streaming Sink

2020-07-01 Thread Burak Yavuz
Well, the difference is that a technical user writes the UDF, whereas a
non-technical user may use this built-in thing, misconfigure it, and shoot
themselves in the foot.

On Wed, Jul 1, 2020, 6:40 PM Andrew Melo  wrote:

> On Wed, Jul 1, 2020 at 8:13 PM Burak Yavuz  wrote:
> >
> > I'm not sure having a built-in sink that allows you to DDoS servers is
> > the best idea either. foreachWriter is typically used for such use cases,
> > not foreachBatch. It's also pretty hard to guarantee exactly-once, rate
> > limiting, etc.
>
> If you control the machines and can run arbitrary code, you can DDoS
> whatever you want. What's the difference between this proposal and
> writing a UDF that opens 1,000 connections to a target machine?


Re: REST Structured Streaming Sink

2020-07-01 Thread Andrew Melo
On Wed, Jul 1, 2020 at 8:13 PM Burak Yavuz  wrote:
>
> I'm not sure having a built-in sink that allows you to DDoS servers is the
> best idea either. foreachWriter is typically used for such use cases, not
> foreachBatch. It's also pretty hard to guarantee exactly-once, rate
> limiting, etc.

If you control the machines and can run arbitrary code, you can DDoS
whatever you want. What's the difference between this proposal and
writing a UDF that opens 1,000 connections to a target machine?




Re: REST Structured Streaming Sink

2020-07-01 Thread Holden Karau
On Wed, Jul 1, 2020 at 6:13 PM Burak Yavuz  wrote:

> I'm not sure having a built-in sink that allows you to DDoS servers is the
> best idea either
>
Do you think it would be misused accidentally? If so, we could ship it with
default per-server rate limits that people would have to explicitly tune.

> foreachWriter is typically used for such use cases, not foreachBatch.
> It's also pretty hard to guarantee exactly-once, rate limiting, etc.
>
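A minimal sketch of how those two ideas might combine - the foreachWriter
pattern quoted above plus an explicit rate limit - assuming Guava's
RateLimiter is on the classpath; the class name, parameter, and endpoint
handling are illustrative, not an existing Spark feature:

import com.google.common.util.concurrent.RateLimiter
import org.apache.spark.sql.{ForeachWriter, Row}

// Illustrative sink: each partition task creates its own limiter, so the
// effective cluster-wide rate is roughly maxRequestsPerSec * numPartitions.
class RateLimitedRestWriter(endpoint: String, maxRequestsPerSec: Double)
    extends ForeachWriter[Row] {

  @transient private var limiter: RateLimiter = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    limiter = RateLimiter.create(maxRequestsPerSec) // permits per second
    true // accept this partition/epoch
  }

  override def process(row: Row): Unit = {
    limiter.acquire() // block until a permit is available
    // POST `row` to `endpoint` here; HTTP client omitted for brevity.
  }

  override def close(errorOrNull: Throwable): Unit = ()
}

// Usage:
// df.writeStream
//   .foreach(new RateLimitedRestWriter("https://example.com/ingest", 10.0))
//   .start()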

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: REST Structured Streaming Sink

2020-07-01 Thread Burak Yavuz
I'm not sure having a built-in sink that allows you to DDoS servers is the
best idea either. foreachWriter is typically used for such use cases, not
foreachBatch. It's also pretty hard to guarantee exactly-once, rate
limiting, etc.

Best,
Burak



Re: REST Structured Streaming Sink

2020-07-01 Thread Holden Karau
I think adding something like this (if it doesn't already exist) could help
make Structured Streaming easier to use; foreachBatch is not the best API.

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: REST Structured Streaming Sink

2020-07-01 Thread Jungtaek Lim
I guess the method, query parameters, headers, and payload would all be
different for almost every use case - that makes it hard to generalize, and
any implementation would have to be quite complicated to be flexible enough.

I'm not aware of any custom sink implementing REST, so your best bet would
be to implement your own with foreachBatch - but someone might jump in and
provide a pointer if there is something in the Spark ecosystem.

Thanks,
Jungtaek Lim (HeartSaVioR)
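For anyone searching the archives, a minimal sketch of the foreachBatch
route suggested above; the endpoint and the JSON-per-row payload are
assumptions, and a real implementation would still need auth, retries, and
the exactly-once handling discussed elsewhere in this thread:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import org.apache.spark.sql.DataFrame

// Hypothetical target endpoint - method, headers, and payload would differ
// per service, which is exactly the generalization problem noted above.
val endpoint = "https://api.example.com/ingest"

def post(json: String): Unit = {
  val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  conn.getOutputStream.write(json.getBytes(StandardCharsets.UTF_8))
  conn.getResponseCode // forces the request; response handling omitted
  conn.disconnect()
}

// streamingDF is an existing streaming DataFrame.
streamingDF.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // toJSON yields one JSON string per row; post them from each partition.
    batch.toJSON.foreachPartition { (rows: Iterator[String]) =>
      rows.foreach(post)
    }
  }
  .start()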



REST Structured Streaming Sink

2020-07-01 Thread Sam Elamin
Hi All,


We ingest a lot of RESTful APIs into our lake, and I'm wondering if it is at
all possible to create a REST sink in Structured Streaming?

For now I'm only focusing on RESTful services that have an incremental ID,
so my sink can just poll for new data and then ingest it.

I can't seem to find a connector that does this, and my gut instinct tells
me that's probably because it isn't possible due to something completely
obvious that I am missing.

I know some RESTful APIs obfuscate the IDs into string hashes, and that
could be a problem, but since I'm planning on focusing on just numerical IDs
that get incremented, I don't think I'll be facing that issue.


Can anyone let me know if this sounds like a daft idea? Will I need
something like Kafka or Kinesis as a buffer and for redundancy, or am I
overthinking this?


I would love to bounce ideas around with people who run Structured Streaming
jobs in production.


Kind regards
Sam
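A sketch of the polling pattern described above, run as a periodic batch job
rather than a streaming sink; the API shape (a since_id parameter returning
a JSON array with a numeric id field) and the lake path are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max
import scala.io.Source

val spark = SparkSession.builder().appName("rest-poller").getOrCreate()
import spark.implicits._

val lakePath = "s3://my-lake/records" // illustrative

// Highest ID already ingested; fall back to 0 on the first run.
val lastId: Long =
  try spark.read.parquet(lakePath).agg(max($"id")).as[Long].first()
  catch { case _: Exception => 0L }

// Hypothetical endpoint that supports incremental-ID polling.
val body =
  Source.fromURL(s"https://api.example.com/records?since_id=$lastId").mkString

// Let Spark parse the JSON payload, then append the new rows to the lake.
val newRecords = spark.read.json(Seq(body).toDS())
if (!newRecords.isEmpty) newRecords.write.mode("append").parquet(lakePath)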


Re: Truncate table

2020-07-01 Thread Russell Spitzer
I'm not sure what you're really trying to do here, but it sounds like saving
the data to a parquet file or other temporary store before truncating would
protect you in case of failure.
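A hedged sketch of that staging approach, assuming the DataStax
spark-cassandra-connector; the keyspace, table, filter, and path names are
illustrative. Re-reading the staged copy from disk is what breaks the
lineage back to the table being overwritten:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rewrite-table-a").getOrCreate()
import spark.implicits._

val staging = "hdfs:///tmp/table_a_staging" // any durable temporary store

// 1. Read the rows to keep from Cassandra (the filter is illustrative).
val df1 = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "table_a"))
  .load()
  .filter($"status" === "keep")

// 2. Persist them somewhere that survives a failure mid-rewrite.
df1.write.mode("overwrite").parquet(staging)

// 3. Write back from the staged copy. Re-reading from disk means the plan
//    no longer depends on table_a, so truncating it cannot empty the input.
//    With this connector, SaveMode.Overwrite truncates the target first and
//    requires confirm.truncate=true as a safety check.
spark.read.parquet(staging)
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "table_a",
               "confirm.truncate" -> "true"))
  .mode("overwrite")
  .save()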



Truncate table

2020-07-01 Thread Amit Sharma
Hi, I have a scenario where I have to read certain rows from a table,
truncate the table, and store those rows back in the table. I am doing the
steps below:

1. Read the rows into DF1 from Cassandra table A.
2. Save DF1 back to Cassandra table A in overwrite mode.


The problem is that when the table is truncated at step 2, I lose the data
in DF1, as it shows up empty.
I have two possible solutions:
1. Store DF1 in another temp table before truncating table A.
2. Cache DF1 before truncating.

Do we have any better solution?


Thanks
Amit


Re: Running Apache Spark Streaming on the GraalVM Native Image

2020-07-01 Thread Pasha Finkelshteyn
Hi Ivo, 

I believe there's absolutely no way that Spark will work on GraalVM
Native Image, because Spark generates code and loads classes at runtime,
while GraalVM Native Image works only in a closed world and has no way
to load classes that are not present on the classpath at compile time.
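One way to see that runtime code generation directly, as a spark-shell
sketch: Spark's debug helpers print the Java source that whole-stage codegen
compiles on the fly - precisely the kind of class a closed-world AOT image
cannot load.

// In spark-shell (the `spark` session is already defined there):
import org.apache.spark.sql.execution.debug._

val df = spark.range(100).selectExpr("id * 2 AS doubled")
df.debugCodegen() // prints Java code generated and compiled at runtime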




-- 
Regards,
Pasha

Big Data Tools @ JetBrains




upsert dataframe to kudu

2020-07-01 Thread Umesh Bansal
Hi All,

We are running into issues when Spark tries to insert a DataFrame into a
Kudu table that has 300 columns. A few of the columns are getting inserted
with NULL values.

In the code, we are using the built-in upsert method and passing the
DataFrame to it.

Thanks
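For context, a sketch of that upsert path using the kudu-spark integration;
the master address and table name are illustrative, and `spark` and `df` are
assumed to exist already. When unexpected NULLs show up, one thing worth
ruling out is a mismatch between the DataFrame's columns and the Kudu
schema:

import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.functions.col

// Illustrative master address.
val kuduContext = new KuduContext("kudu-master-1:7051", spark.sparkContext)

// Select the columns explicitly so all 300 line up with the Kudu schema;
// the list here is an assumed stand-in for the table's real column names.
val kuduColumns = Seq("id", "col1", "col2" /* ... all 300 columns ... */)
val aligned = df.select(kuduColumns.map(col): _*)

kuduContext.upsertRows(aligned, "impala::db.wide_table")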


Running Apache Spark Streaming on the GraalVM Native Image

2020-07-01 Thread ivo.kn...@t-online.de
Hi guys,
 
So I want to get Apache Spark to run on the GraalVM Native Image in a
simple single-node streaming application, but I get the following error
when trying to build the native image (see attached file).

From my research online, there seems to be no successful combination of
Spark and GraalVM Native Image. Has anyone ever succeeded, and how?

Best regards,

Ivo


