
Re: Spark with GPU

2023-02-07 Thread Alessandro Bellina
For Apache Spark a stand-alone worker can manage all the resources of the
box, including all GPUs. So a Spark worker could be set up to manage N GPUs
in the box via *spark.worker.resource.gpu.amount*, and then
*spark.executor.resource.gpu.amount*, as provided on app submit, assigns
GPU resources to executors as they come up. Here is a getting started guide
for spark-rapids, but I am not sure if that's what you are looking to use.
Either way, it may help with the resource setup:
https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-cluster

Not every node in the cluster needs to have GPUs. You could request 0 GPUs
for an app (default value of spark.executor.resource.gpu.amount), and the
executors will not require this resource.
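
For illustration, here is a minimal Scala sketch of the application-side
half of that setup; the one-GPU-per-executor values and the app name are
assumptions rather than anything from this thread, and the worker-side part
(spark.worker.resource.gpu.amount plus a GPU discovery script) still has to
be configured on each standalone worker separately.

import org.apache.spark.sql.SparkSession

// Request GPU resources for executors and tasks on a standalone cluster.
// The executor amount must not exceed what the worker advertises via
// spark.worker.resource.gpu.amount; leaving it at its default of 0 means
// the executors do not require GPUs at all.
val spark = SparkSession.builder()
  .appName("gpu-example")                             // hypothetical app name
  .config("spark.executor.resource.gpu.amount", "1")  // GPUs per executor (assumed)
  .config("spark.task.resource.gpu.amount", "1")      // GPUs per task (assumed)
  .getOrCreate()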

If you are using a yarn/k8s cluster there are other configs to pay
attention to. If you need help with those let us know.

On Sun, Feb 5, 2023 at 1:50 PM Jack Goodson  wrote:

> As far as I understand, you will need a GPU for each worker node, or you
> will need to partition the GPU processing somehow across nodes, which I
> think would defeat the purpose. In Databricks, for example, when you select
> GPU workers there is a GPU allocated to each worker. I assume this is the
> “correct” approach to this problem
>
> On Mon, 6 Feb 2023 at 8:17 AM, Mich Talebzadeh 
> wrote:
>
>> If you have several nodes with only one node having GPUs, you still have
>> to wait for the result set to complete. In other words, it will only be as
>> fast as the slowest node, the lowest common denominator.
>>
>> my postulation
>>
>> HTH
>>
>> On Sun, 5 Feb 2023 at 13:38, Irene Markelic  wrote:
>>
>>> Hello,
>>>
>>> has anyone used spark with GPUs? I wonder if every worker node in a
>>> cluster needs one GPU or if you can have several worker nodes of which
>>> only one has a GPU.
>>>
>>> Thank you!
>>>
>>>
>>>
>>>
>>>


Re: How to upgrade a spark structure streaming application

2023-02-07 Thread Mich Talebzadeh
Hi,

Check the thread on graceful shutdown. That might help.

On Tue, 7 Feb 2023 at 12:47, Yoel Benharrous 
wrote:

> Hi all,
>
> I would like to ask how you perform a Spark Streaming application upgrade?
> I didn't find any built-in solution.
> I found some people writing a marker file on the file system and polling it
> periodically to stop the running query.
>
> Thanks,
>
> Yoel
>
>
>
> --





Fwd: Graceful shutdown SPARK Structured Streaming

2023-02-07 Thread Mich Talebzadeh
-- Forwarded message -
From: Mich Talebzadeh 
Date: Thu, 6 May 2021 at 20:07
Subject: Re: Graceful shutdown SPARK Structured Streaming
To: ayan guha 
Cc: Gourav Sengupta , user @spark <
user@spark.apache.org>


That is a valid question and I am not aware of any new addition to Spark
Structured Streaming (SSS) in newer releases for this graceful shutdown.

Going back to my earlier explanation, there are occasions when you may want
to stop the Spark program gracefully. Gracefully meaning that the Spark
application handles the last streaming message completely and then terminates
the application. This is different from invoking interrupts such as CTRL-C.
Of course one can terminate the process based on the following:


   1. query.awaitTermination()  # waits for the termination of this query,
      with stop() or with error
   2. query.awaitTermination(timeoutMs)  # returns true if this query is
      terminated within the timeout in milliseconds

So the first one above waits until the query is stopped or fails with an
error. The second one counts down the timeout and will exit when the timeout
in milliseconds is reached.
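
As a minimal Scala sketch of those two options (awaitTermination is the
Structured Streaming API call; the helper names and how the result is used
are just illustrative):

import org.apache.spark.sql.streaming.StreamingQuery

// Option 1: block until the query is stopped with stop() or fails with an error.
def waitForever(query: StreamingQuery): Unit =
  query.awaitTermination()

// Option 2: wait at most timeoutMs; returns true if the query terminated
// within the timeout, false if it was still running when the timeout expired.
def waitWithTimeout(query: StreamingQuery, timeoutMs: Long): Boolean =
  query.awaitTermination(timeoutMs)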

The issue is that one needs to predict how long the streaming job needs to
run. Clearly any interrupt at the terminal or OS level (kill process) may
terminate the processing without a proper completion of the streaming process.
So I gather that if we agree on what constitutes a graceful shutdown, we can
consider both the tooling Spark itself offers and what solutions we can come
up with.

HTH







On Thu, 6 May 2021 at 13:28, ayan guha  wrote:

> What are some other "newer" methodologies?
>
> Really interested to understand what is possible here, as this is a topic
> that has come up in this forum time and again.
>
> On Thu, 6 May 2021 at 5:13 pm, Gourav Sengupta <
> gourav.sengupta.develo...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> thanks a ton for your kind response, it looks like we are still using the
>> earlier methodologies for stopping a Spark streaming program gracefully.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Wed, May 5, 2021 at 6:04 PM Mich Talebzadeh 
>> wrote:
>>
>>>
>>> Hi,
>>>
>>>
>>> I believe I discussed this in this forum. I sent the following to the
>>> spark-dev forum as an add-on to Spark functionality. This is the gist of
>>> it.
>>>
>>>
>>> Spark Structured Streaming AKA SSS is a very useful tool in dealing with
>>> Event Driven Architecture. In an Event Driven Architecture, there is
>>> generally a main loop that listens for events and then triggers a callback
>>> function when one of those events is detected. In a streaming application,
>>> the application waits to receive the source messages at a set interval or
>>> whenever they happen, and reacts accordingly.
>>>
>>> There are occasions when you may want to stop the Spark program
>>> gracefully. Gracefully meaning that the Spark application handles the last
>>> streaming message completely and then terminates the application. This is
>>> different from invoking interrupts such as CTRL-C. Of course one can
>>> terminate the process based on the following:
>>>
>>>
>>>    1. query.awaitTermination()  # waits for the termination of this query,
>>>       with stop() or with error
>>>    2. query.awaitTermination(timeoutMs)  # returns true if this query is
>>>       terminated within the timeout in milliseconds
>>>
>>> So the first one above waits until the query is stopped or fails with an
>>> error. The second one counts down the timeout and will exit when the
>>> timeout in milliseconds is reached.
>>>
>>> The issue is that one needs to predict how long the streaming job needs
>>> to run. Clearly any interrupt at the terminal or OS level (kill process)
>>> may terminate the processing without a proper completion of the streaming
>>> process.
>>>
>>> I have devised a method that allows one to terminate the Spark
>>> application internally after processing the last received message. Within,
>>> say, 2 seconds of the confirmation of shutdown, the process will invoke a
>>> shutdown of the topic doing work for the message being processed, wait for
>>> it to complete, and shut down the streaming process for the given topic.
>>>
>>>
>>> I thought about this and looked at options. Using sensors to
>>> implement this, as Airflow does, would be expensive, as for example reading
>>> a file from object storage or from an underlying database would incur
>>> additional I/O overheads through continuous polling.
>>>
>>>
>>> So the design had to be incorporated into the streaming process itself.
>>> What I came up with was an 

[Spark SQL] : Delete is only supported on V2 tables.

2023-02-07 Thread Jeevan Chhajed
Hi,
How do we create V2 tables? I tried a couple of things using SQL but was
unable to do so.

Can you share links/content? It will be of much help.

Is delete support on V2 tables still under development?

Thanks,
Jeevan


How to upgrade a spark structure streaming application

2023-02-07 Thread Yoel Benharrous
Hi all,

I would like to ask how you perform a Spark Streaming application upgrade?
I didn't find any built-in solution.
I found some people writing a marker file on the file system and polling it
periodically to stop the running query.
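
A rough Scala sketch of that marker-file pattern might look like the
following; the marker path, the 10-second polling interval, and the helper
name are assumptions, and this is not a built-in Spark mechanism.

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.streaming.StreamingQuery

// Keep the query running until a marker file appears, then stop it between
// micro-batches so the upgraded application can be redeployed and, where the
// code changes allow it, restarted from the same checkpoint.
def runUntilMarker(query: StreamingQuery, markerPath: String): Unit = {
  val marker = Paths.get(markerPath)
  while (query.isActive && !Files.exists(marker)) {
    query.awaitTermination(10000L)  // poll roughly every 10 seconds
  }
  if (query.isActive) {
    query.stop()
  }
}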

Thanks,

Yoel


SQL GROUP BY alias with dots, was: Spark SQL question

2023-02-07 Thread Enrico Minack

Hi,

you are right, that is an interesting question.

Looks like GROUP BY is doing something funny / magic here (spark-shell 
3.3.1 and 3.5.0-SNAPSHOT):


With an alias, it behaves as you have pointed out:

spark.range(3).createTempView("ids_without_dots")
spark.sql("SELECT * FROM ids_without_dots").show()

// works
spark.sql("SELECT id AS `an.id` FROM ids_without_dots GROUP BY 
an.id").show()

// fails
spark.sql("SELECT id AS `an.id` FROM ids_without_dots GROUP BY 
`an.id`").show()



Without an alias, it behaves as expected, which is the opposite of above 
(a column with a dot exists, no alias used in SELECT):


spark.range(3).select($"id".as("an.id")).createTempView("ids_with_dots")
spark.sql("SELECT `an.id` FROM ids_with_dots").show()

// works
spark.sql("SELECT `an.id` FROM ids_with_dots GROUP BY `an.id`").show()
// fails
spark.sql("SELECT `an.id` FROM ids_with_dots GROUP BY an.id").show()


With a struct column, it also behaves as expected:

spark.range(3).select(struct($"id").as("an")).createTempView("ids_with_struct")
spark.sql("SELECT an.id FROM ids_with_struct").show()

// works
spark.sql("SELECT an.id FROM ids_with_struct GROUP BY an.id").show()
// fails
spark.sql("SELECT `an.id` FROM ids_with_struct GROUP BY an.id").show()
spark.sql("SELECT an.id FROM ids_with_struct GROUP BY `an.id`").show()
spark.sql("SELECT `an.id` FROM ids_with_struct GROUP BY `an.id`").show()


This does not feel very consistent.

Enrico



On 28.01.23 at 00:34, Kohki Nishio wrote:

this SQL works

select 1 as `data.group` from tbl group by data.group


Since there's no such field as *data*, I thought the SQL has to look
like this:


select 1 as `data.group` from tbl group by `data.group`


But that gives an error (cannot resolve '`data.group`') ... I'm no
expert in SQL, but it feels like strange behavior... does anybody
have a good explanation for it?


Thanks

--
Kohki Nishio



