Re: How to change a DataFrame column from nullable to not nullable in PySpark

2021-10-14 Thread Sonal Goyal
I see some nice answers at
https://stackoverflow.com/questions/46072411/can-i-change-the-nullability-of-a-column-in-my-spark-dataframe
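The common approach there is to rebuild the DataFrame with an explicit
schema. Roughly, as a sketch against the schema in your question (assuming
an active spark session; note this only changes the metadata - Spark won't
re-scan the data, so an actual null in COL-1 would only surface as an error
later at runtime):

from pyspark.sql.types import (StructType, StructField, LongType,
                               DoubleType, StringType)

# Target schema, with COL-1 marked non-nullable.
schema = StructType([
    StructField("COL-1", LongType(), nullable=False),
    StructField("COl-2", DoubleType(), nullable=True),
    StructField("COl-3", StringType(), nullable=True),
])

# Rebuild the DataFrame from the underlying RDD with the new schema.
df2 = spark.createDataFrame(df.rdd, schema)
df2.printSchema()  # COL-1: long (nullable = false)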

On Thu, 14 Oct 2021 at 5:21 PM, ashok34...@yahoo.com.INVALID
 wrote:

> Gurus,
>
> I have an RDD in PySpark that I can convert to DF through
>
> df = rdd.toDF()
>
> However, when I do
>
> df.printSchema()
>
> I see the columns as nullable = true by default:
>
> root
>  |-- COL-1: long (nullable = true)
>  |-- COl-2: double (nullable = true)
>  |-- COl-3: string (nullable = true)
>
> What would be the easiest way to make COL-1 NOT NULLABLE?
>
> Thanking you
>
-- 
Cheers,
Sonal
https://github.com/zinggAI/zingg


unsubscribe

2021-10-14 Thread Luis Mateos
unsubscribe


Unsubscribe

2021-10-14 Thread 676366545
unsubscribe

Unsubscribe

2021-10-14 Thread Jesús Vásquez
I want to unsubscribe


Re: Spark for Image Processing Acceleration

2021-10-14 Thread Sean Owen
(The suggestion here is to use Tensorflow with Spark - that has been doable
for a long time with tools like Horovod. Spark handles the image
processing just fine.)
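
For the distributed loading part, here is a minimal sketch, assuming Spark
3.0+ and images readable from a shared filesystem; process_scan is a
hypothetical stand-in for whatever per-image routine you already have:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Each file becomes one row with its raw bytes in the "content" column,
# read in parallel across the cluster (binaryFile data source, Spark 3.0+).
images = spark.read.format("binaryFile").load("/data/scans/")

# Hypothetical placeholder for an existing per-image routine; Spark only
# distributes the calls, the processing logic itself is unchanged.
def process_scan(content):
    return "processed %d bytes" % len(content)

process_udf = udf(process_scan, StringType())
results = images.select("path", process_udf("content").alias("summary"))
results.show(truncate=False)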

On Thu, Oct 14, 2021 at 10:17 AM Artemis User 
wrote:

> Spark is good with SQL-type structured data, not image data, unless your
> algorithms don't require dealing with image data directly. I guess your
> best option would be to go with Tensorflow, since it has image
> classification models built in and can integrate with NVidia GPUs out of
> the box.  There are no out-of-the-box data source APIs for image data in
> Spark.  Hope this helps.
>
> -- ND
>
> On 10/13/21 11:54 PM, 刘沛文 wrote:
>
> Hi,
> My name is Peiwen. I'm working with Dr. Brain, an AI company focused on
> medical image processing and deep learning. Our website is
> http://drbrain.net/index_en.aspx
> We basically do 2 major things: 1. image processing, like lesion drawing,
> and 2. deep learning for neural disease prediction, like stroke and
> Alzheimer's Disease.
> Currently we use Tensorflow and other deep learning frameworks. Due to the
> size of the medical images (1 ~ 5 GB per record), with a traditional
> framework on a single computer it takes a long time (a few hours) for data
> processing and model training before we get results.
> I'm writing to ask whether there's a good solution that Apache Spark can
> provide to accelerate the computation.
> I know Tensorflow can work with Spark. I just want a rough idea of how much
> faster Apache Spark could be compared to standalone Tensorflow, say on a
> cluster of 10 nodes.
>
> Thank you very much!
>
> Peiwen
>
>
>


Re: Spark for Image Processing Acceleration

2021-10-14 Thread Artemis User
Spark is good with SQL-type structured data, not image data, unless
your algorithms don't require dealing with image data directly. I guess
your best option would be to go with Tensorflow, since it has image
classification models built in and can integrate with NVidia GPUs out of
the box.  There are no out-of-the-box data source APIs for image data in
Spark.  Hope this helps.


-- ND

On 10/13/21 11:54 PM, 刘沛文 wrote:

Hi,
My name is Peiwen. I'm working with Dr. Brain, an AI company focused
on medical image processing and deep learning. Our website is
http://drbrain.net/index_en.aspx
We basically do 2 major things: 1. image processing, like lesion
drawing, and 2. deep learning for neural disease prediction, like
stroke and Alzheimer's Disease.
Currently we use Tensorflow and other deep learning frameworks. Due to
the size of the medical images (1 ~ 5 GB per record), with a
traditional framework on a single computer it takes a long time (a few
hours) for data processing and model training before we get results.
I'm writing to ask whether there's a good solution that Apache Spark
can provide to accelerate the computation.
I know Tensorflow can work with Spark. I just want a rough idea of how
much faster Apache Spark could be compared to standalone Tensorflow,
say on a cluster of 10 nodes.


Thank you very much!

Peiwen




Re: apache-spark

2021-10-14 Thread Mich Talebzadeh
Also, have you tried to see what is going on within the k8s driver?

DRIVER_POD_NAME=`kubectl get pods -n $NAMESPACE | grep driver | awk '{print $1}'`
kubectl describe pod $DRIVER_POD_NAME -n $NAMESPACE
kubectl logs $DRIVER_POD_NAME -n $NAMESPACE








*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 14 Oct 2021 at 14:13, Mich Talebzadeh 
wrote:

> Hi,
>
> Airflow is nothing but a new version of cron on Linux, with DAG
> dependencies. Which Airflow operator are you using to run your
> spark-submit, for example BashOperator?
>
> Can you actually run the command outside of Airflow by submitting
> spark-submit to the K8s cluster directly? Is that a GKE cluster or
> something else?
>
> HTH
>
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 14 Oct 2021 at 14:02, Nick Shivhare 
> wrote:
>
>> Hi All,
>>
>> We are facing an issue and would be thankful for any help.
>> Environment: Spark, Kubernetes and Airflow.
>> Airflow is used to schedule spark jobs over Kubernetes.
>> We are using a bash script that runs the spark-submit command to submit
>> spark jobs.
>> Issue:
>> We are submitting the spark job through Airflow in *cluster mode*.
>> However, when the job is completed and the executors are closed, Airflow
>> is not able to schedule another job.
>>
>> As per our investigation, the jobs are completed but the spark-submit
>> command in our script does not exit and keeps running with the
>> following logs:
>> 21/10/12 08:54:26 INFO LoggingPodStatusWatcherImpl: Application status
>> for spark-3f914f93ad684743b1a7b17aa26b4329 (phase: Running)
>>
>> To confirm the issue is not on the Airflow side, we killed the
>> spark-submit command manually, after which Airflow was able to schedule
>> another job. So our observation is that, even after the job completes,
>> spark-submit does not exit and keeps running.
>>
>> FYI, we have already closed the spark session in our code.
>> One odd observation: it runs completely fine in local mode, and we are
>> currently testing client mode.
>>
>> We would be thankful for any guidance on this.
>>
>>
>> Thanks,
>> Shishir
>>
>


Re: apache-spark

2021-10-14 Thread Mich Talebzadeh
Hi,

Airflow is nothing but a new version of cron on Linux, with DAG
dependencies. Which Airflow operator are you using to run your
spark-submit, for example BashOperator?

Can you actually run the command outside of Airflow by submitting
spark-submit to the K8s cluster directly? Is that a GKE cluster or
something else?
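
For reference, a BashOperator setup typically looks something like the
sketch below (assuming Airflow 2.x; the master URL, image, namespace and
script path are all placeholders). Note that
spark.kubernetes.submission.waitAppCompletion defaults to true, which makes
spark-submit block and keep logging pod status until the driver pod
terminates - consistent with the log line you are seeing:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG; adjust the master URL, image, namespace and paths.
with DAG(
    dag_id="spark_k8s_job",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit = BashOperator(
        task_id="spark_submit",
        bash_command=(
            "spark-submit "
            "--master k8s://https://K8S_API_SERVER:443 "
            "--deploy-mode cluster "
            "--name my-spark-job "
            "--conf spark.kubernetes.container.image=my-spark-image "
            "--conf spark.kubernetes.namespace=spark "
            # Default is true: spark-submit waits for the application to
            # finish and streams pod status until the driver terminates.
            "--conf spark.kubernetes.submission.waitAppCompletion=true "
            "local:///opt/spark/app/job.py"
        ),
    )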

HTH







*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 14 Oct 2021 at 14:02, Nick Shivhare 
wrote:

> Hi All,
>
> We are facing an issue and would be thankful for any help.
> Environment: Spark, Kubernetes and Airflow.
> Airflow is used to schedule spark jobs over Kubernetes.
> We are using a bash script that runs the spark-submit command to submit
> spark jobs.
> Issue:
> We are submitting the spark job through Airflow in *cluster mode*. However,
> when the job is completed and the executors are closed, Airflow is not able
> to schedule another job.
>
> As per our investigation, the jobs are completed but the spark-submit
> command in our script does not exit and keeps running with the
> following logs:
> 21/10/12 08:54:26 INFO LoggingPodStatusWatcherImpl: Application status for
> spark-3f914f93ad684743b1a7b17aa26b4329 (phase: Running)
>
> To confirm the issue is not on the Airflow side, we killed the spark-submit
> command manually, after which Airflow was able to schedule another job. So
> our observation is that, even after the job completes, spark-submit does
> not exit and keeps running.
>
> FYI, we have already closed the spark session in our code.
> One odd observation: it runs completely fine in local mode, and we are
> currently testing client mode.
>
>
> We would be thankful for any guidance on this.
>
>
> Thanks,
> Shishir
>


apache-spark

2021-10-14 Thread Nick Shivhare
Hi All,

We are facing an issue and would be thankful for any help.
Environment: Spark, Kubernetes and Airflow.
Airflow is used to schedule spark jobs over Kubernetes.
We are using a bash script that runs the spark-submit command to submit
spark jobs.
Issue:
We are submitting the spark job through Airflow in *cluster mode*. However,
when the job is completed and the executors are closed, Airflow is not able
to schedule another job.

As per our investigation, the jobs are completed but the spark-submit
command in our script does not exit and keeps running with the following
logs:
21/10/12 08:54:26 INFO LoggingPodStatusWatcherImpl: Application status for
spark-3f914f93ad684743b1a7b17aa26b4329 (phase: Running)

To confirm the issue is not on the Airflow side, we killed the spark-submit
command manually, after which Airflow was able to schedule another job. So
our observation is that, even after the job completes, spark-submit does
not exit and keeps running.

FYI, we have already closed the spark session in our code.
One odd observation: it runs completely fine in local mode, and we are
currently testing client mode.


We would be thankful for any guidance on this.


Thanks,
Shishir


How to change a DataFrame column from nullable to not nullable in PySpark

2021-10-14 Thread ashok34...@yahoo.com.INVALID
Gurus,
I have an RDD in PySpark that I can convert to DF through
df = rdd.toDF()

However, when I do
df.printSchema()

I see the columns as nullable = true by default:
root
 |-- COL-1: long (nullable = true)
 |-- COl-2: double (nullable = true)
 |-- COl-3: string (nullable = true)

What would be the easiest way to make COL-1 NOT NULLABLE?
Thanking you