Re: measure running time

2021-12-24 Thread bitfox
Thanks a lot Hollis. It was indeed due to the pypi version. I have now updated it.


$ pip3 -V
pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.6)

$ pip3 install sparkmeasure
Collecting sparkmeasure
  Using cached 
https://files.pythonhosted.org/packages/9f/bf/c9810ff2d88513ffc185e65a3ab9df6121ad5b4c78aa8d134a06177f9021/sparkmeasure-0.14.0-py2.py3-none-any.whl

Installing collected packages: sparkmeasure
Successfully installed sparkmeasure-0.14.0

$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
...

from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from 
range(1000) cross join range(1000) cross join range(100)").show()')

+---------+
| count(1)|
+---------+
|100000000|
+---------+
...
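
To dig into the collected metrics beyond the query output, sparkmeasure's Python
StageMetrics also supports an explicit begin/end/report pattern; a minimal sketch,
assuming the begin(), end() and print_report() methods of that API:

>>> stagemetrics.begin()
>>> spark.sql("select count(*) from range(1000) cross join range(1000)").show()
>>> stagemetrics.end()
>>> stagemetrics.print_report()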


Hope it helps to others who have met the same issue.
Happy holidays. :0

Bitfox


On 2021-12-25 09:48, Hollis wrote:

 Replied mail 
 From:    Mich Talebzadeh
 Date:    12/25/2021 00:25
 To:      Sean Owen
 Cc:      user, Luca Canali
 Subject: Re: measure running time

Hi Sean,

I have already discussed an issue in my case with Spark 3.1.1 and
sparkmeasure  with the author Luca Canali on this matter. It has been
reproduced. I think we ought to wait for a patch.

HTH,

Mich

   view my Linkedin profile [1]

Disclaimer: Use it at your own risk. Any and all responsibility for
any loss, damage or destruction of data or any other property which
may arise from relying on this email's technical content is explicitly
disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.

On Fri, 24 Dec 2021 at 14:51, Sean Owen  wrote:


You probably did not install it on your cluster, nor included the
python package with your app

On Fri, Dec 24, 2021, 4:35 AM  wrote:


but I already installed it:

Requirement already satisfied: sparkmeasure in
/usr/local/lib/python2.7/dist-packages

so how? thank you.

On 2021-12-24 18:15, Hollis wrote:

Hi bitfox,

you need to pip install sparkmeasure first. Then you can launch it in pyspark.



from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*)
from range(1000) cross join range(1000) cross join
range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+

Regards,
Hollis

At 2021-12-24 09:18:19, bit...@bitfox.top wrote:

Hello list,

I run with Spark 3.2.0

After I started pyspark with:
$ pyspark --packages

ch.cern.sparkmeasure:spark-measure_2.12:0.17


I can't load from the module sparkmeasure:


from sparkmeasure import StageMetrics

Traceback (most recent call last):
File "", line 1, in 
ModuleNotFoundError: No module named 'sparkmeasure'

Do you know why? @Luca thanks.


On 2021-12-24 04:20, bit...@bitfox.top wrote:

Thanks Gourav and Luca. I will try with the tools you provide

in

the

Github.

On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a

simplistic

approach that may lead you to miss important details, in

particular

when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be

quite

useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of

automating

collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at

all in

distributed computation. Just saying that an operation in RDD

and

Dataframe can be compared based on their start and stop time

may

not

provide any valid information.

You will have to look into the details of timing and the

steps.

For

example, please look at the SPARK UI to see how timings are

calculated

in distributed computing mode, there are several well written

papers

on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:


hello community,

In pyspark how can I measure the running time to the

command?

I just want to compare the running time of the RDD API and

dataframe


API, in my this blog:

https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/


I tried spark.time() it doesn't work.
Thank you.

-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: measure running time

2021-12-24 Thread Hollis
Hi,
I can run this on my PC.
I checked the email chain: bitfox installed sparkmeasure with Python 2, but he 
launched pyspark with Python 3. I think that is the reason.
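
A quick way to make sure the package lands in the interpreter that pyspark 
actually uses (a sketch, assuming the driver runs python3; PYSPARK_PYTHON can 
be set to pin it):

$ echo $PYSPARK_PYTHON                    # interpreter pyspark will use, if set
$ python3 -m pip install sparkmeasure     # install into that same interpreter
$ python3 -c "import sparkmeasure; print(sparkmeasure.__file__)"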

Regards.
Hollis




 Replied mail 
| From | Mich Talebzadeh |
| Date | 12/25/2021 00:25 |
| To | Sean Owen |
| Cc | user, Luca Canali |
| Subject | Re: measure running time |



Hi Sean,

I have already discussed an issue in my case with Spark 3.1.1 and sparkmeasure 
with the author Luca Canali on this matter. It has been reproduced. I think we 
ought to wait for a patch.

HTH,

Mich

   view my Linkedin profile

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.

On Fri, 24 Dec 2021 at 14:51, Sean Owen  wrote:

You probably did not install it on your cluster, nor included the python 
package with your app 


On Fri, Dec 24, 2021, 4:35 AM  wrote:

but I already installed it:

Requirement already satisfied: sparkmeasure in
/usr/local/lib/python2.7/dist-packages

so how? thank you.

On 2021-12-24 18:15, Hollis wrote:
> Hi bitfox,
>
> you need to pip install sparkmeasure first. Then you can launch it in pyspark.
>
>>>> from sparkmeasure import StageMetrics
>>>> stagemetrics = StageMetrics(spark)
>>>> stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*)
> from range(1000) cross join range(1000) cross join
> range(100)").show()')
> +---------+
> | count(1)|
> +---------+
> |100000000|
> +---------+
>
> Regards,
> Hollis
>
> At 2021-12-24 09:18:19, bit...@bitfox.top wrote:
>> Hello list,
>>
>> I run with Spark 3.2.0
>>
>> After I started pyspark with:
>> $ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
>>
>> I can't load from the module sparkmeasure:
>>
>>>>> from sparkmeasure import StageMetrics
>> Traceback (most recent call last):
>>   File "", line 1, in 
>> ModuleNotFoundError: No module named 'sparkmeasure'
>>
>> Do you know why? @Luca thanks.
>>
>>
>> On 2021-12-24 04:20, bit...@bitfox.top wrote:
>>> Thanks Gourav and Luca. I will try with the tools you provide in
> the
>>> Github.
>>>
>>> On 2021-12-23 23:40, Luca Canali wrote:
>>>> Hi,
>>>>
>>>> I agree with Gourav that just measuring execution time is a
> simplistic
>>>> approach that may lead you to miss important details, in
> particular
>>>> when running distributed computations.
>>>>
>>>> WebUI, REST API, and metrics instrumentation in Spark can be quite
>>>> useful for further drill down. See
>>>> https://spark.apache.org/docs/latest/monitoring.html
>>>>
>>>> You can also have a look at this tool that takes care of
> automating
>>>> collecting and aggregating some executor task metrics:
>>>> https://github.com/LucaCanali/sparkMeasure
>>>>
>>>> Best,
>>>>
>>>> Luca
>>>>
>>>> From: Gourav Sengupta 
>>>> Sent: Thursday, December 23, 2021 14:23
>>>> To: bit...@bitfox.top
>>>> Cc: user 
>>>> Subject: Re: measure running time
>>>>
>>>> Hi,
>>>>
>>>> I do not think that such time comparisons make any sense at all in
>>>> distributed computation. Just saying that an operation in RDD and
>>>> Dataframe can be compared based on their start and stop time may
> not
>>>> provide any valid information.
>>>>
>>>> You will have to look into the details of timing and the steps.
> For
>>>> example, please look at the SPARK UI to see how timings are
> calculated
>>>> in distributed computing mode, there are several well written
> papers
>>>> on this.
>>>>
>>>> Thanks and Regards,
>>>>
>>>> Gourav Sengupta
>>>>
>>>> On Thu, Dec 23, 2021 at 10:57 AM  wrote:
>>>>
>>>>> hello community,
>>>>>
>>>>> In pyspark how can I measure the running time to the command?
>>>>> I just want to compare the running time of the RDD API and
> dataframe
>>>>>
>>>>> API, in my this blog:
>>>>>
>>>>
> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
>>>>>
>>>>> I tried spark.time() it doesn't work.
>>>>> Thank you.
>>>>>
>>>>>
>>>>
> -
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: measure running time

2021-12-24 Thread Mich Talebzadeh
Hi Sean,


I have already discussed an issue in my case with Spark 3.1.1
and sparkmeasure  with the author Luca Canali on this matter. It has been
reproduced. I think we ought to wait for a patch.


HTH,


Mich



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 24 Dec 2021 at 14:51, Sean Owen  wrote:

> You probably did not install it on your cluster, nor included the python
> package with your app
>
> On Fri, Dec 24, 2021, 4:35 AM  wrote:
>
>> but I already installed it:
>>
>> Requirement already satisfied: sparkmeasure in
>> /usr/local/lib/python2.7/dist-packages
>>
>> so how? thank you.
>>
>> On 2021-12-24 18:15, Hollis wrote:
>> > Hi bitfox,
>> >
>> > you need to pip install sparkmeasure first. Then you can launch it in pyspark.
>> >
>> >>>> from sparkmeasure import StageMetrics
>> >>>> stagemetrics = StageMetrics(spark)
>> >>>> stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*)
>> > from range(1000) cross join range(1000) cross join
>> > range(100)").show()')
>> > +---------+
>> > | count(1)|
>> > +---------+
>> > |100000000|
>> > +---------+
>> >
>> > Regards,
>> > Hollis
>> >
>> > At 2021-12-24 09:18:19, bit...@bitfox.top wrote:
>> >> Hello list,
>> >>
>> >> I run with Spark 3.2.0
>> >>
>> >> After I started pyspark with:
>> >> $ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
>> >>
>> >> I can't load from the module sparkmeasure:
>> >>
>> >>>>> from sparkmeasure import StageMetrics
>> >> Traceback (most recent call last):
>> >>   File "", line 1, in 
>> >> ModuleNotFoundError: No module named 'sparkmeasure'
>> >>
>> >> Do you know why? @Luca thanks.
>> >>
>> >>
>> >> On 2021-12-24 04:20, bit...@bitfox.top wrote:
>> >>> Thanks Gourav and Luca. I will try with the tools you provide in
>> > the
>> >>> Github.
>> >>>
>> >>> On 2021-12-23 23:40, Luca Canali wrote:
>> >>>> Hi,
>> >>>>
>> >>>> I agree with Gourav that just measuring execution time is a
>> > simplistic
>> >>>> approach that may lead you to miss important details, in
>> > particular
>> >>>> when running distributed computations.
>> >>>>
>> >>>> WebUI, REST API, and metrics instrumentation in Spark can be quite
>> >>>> useful for further drill down. See
>> >>>> https://spark.apache.org/docs/latest/monitoring.html
>> >>>>
>> >>>> You can also have a look at this tool that takes care of
>> > automating
>> >>>> collecting and aggregating some executor task metrics:
>> >>>> https://github.com/LucaCanali/sparkMeasure
>> >>>>
>> >>>> Best,
>> >>>>
>> >>>> Luca
>> >>>>
>> >>>> From: Gourav Sengupta 
>> >>>> Sent: Thursday, December 23, 2021 14:23
>> >>>> To: bit...@bitfox.top
>> >>>> Cc: user 
>> >>>> Subject: Re: measure running time
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I do not think that such time comparisons make any sense at all in
>> >>>> distributed computation. Just saying that an operation in RDD and
>> >>>> Dataframe can be compared based on their start and stop time may
>> > not
>> >>>> provide any valid information.
>> >>>>
>> >>>> You will have to look into the details of timing and the steps.
>> > For
>> >>>> example, please look at the SPARK UI to see how timings are
>> > calculated
>> >>>> in distributed computing mode, there are several well written
>> > papers
>> >>>> on this.
>> >>>>
>> >>>> Thanks and Regards,
>> >>>>
>> >>>> Gourav Sengupta
>> >>>>
>> >>>> On Thu, Dec 23, 2021 at 10:57 AM  wrote:
>> >>>>
>> >>>>> hello community,
>> >>>>>
>> >>>>> In pyspark how can I measure the running time to the command?
>> >>>>> I just want to compare the running time of the RDD API and
>> > dataframe
>> >>>>>
>> >>>>> API, in my this blog:
>> >>>>>
>> >>>>
>> >
>> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
>> >>>>>
>> >>>>> I tried spark.time() it doesn't work.
>> >>>>> Thank you.
>> >>>>>
>> >>>>>
>> >>>>
>> > -
>> >>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >>>
>> >>>
>> > -
>> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >>
>> >> -
>> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: measure running time

2021-12-24 Thread Sean Owen
You probably did not install it on your cluster, nor included the python
package with your app

On Fri, Dec 24, 2021, 4:35 AM  wrote:

> but I already installed it:
>
> Requirement already satisfied: sparkmeasure in
> /usr/local/lib/python2.7/dist-packages
>
> so how? thank you.
>
> On 2021-12-24 18:15, Hollis wrote:
> > Hi bitfox,
> >
> > you need to pip install sparkmeasure first. Then you can launch it in pyspark.
> >
> >>>> from sparkmeasure import StageMetrics
> >>>> stagemetrics = StageMetrics(spark)
> >>>> stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*)
> > from range(1000) cross join range(1000) cross join
> > range(100)").show()')
> > +---------+
> > | count(1)|
> > +---------+
> > |100000000|
> > +---------+
> >
> > Regards,
> > Hollis
> >
> > At 2021-12-24 09:18:19, bit...@bitfox.top wrote:
> >> Hello list,
> >>
> >> I run with Spark 3.2.0
> >>
> >> After I started pyspark with:
> >> $ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
> >>
> >> I can't load from the module sparkmeasure:
> >>
> >>>>> from sparkmeasure import StageMetrics
> >> Traceback (most recent call last):
> >>   File "", line 1, in 
> >> ModuleNotFoundError: No module named 'sparkmeasure'
> >>
> >> Do you know why? @Luca thanks.
> >>
> >>
> >> On 2021-12-24 04:20, bit...@bitfox.top wrote:
> >>> Thanks Gourav and Luca. I will try with the tools you provide in
> > the
> >>> Github.
> >>>
> >>> On 2021-12-23 23:40, Luca Canali wrote:
> >>>> Hi,
> >>>>
> >>>> I agree with Gourav that just measuring execution time is a
> > simplistic
> >>>> approach that may lead you to miss important details, in
> > particular
> >>>> when running distributed computations.
> >>>>
> >>>> WebUI, REST API, and metrics instrumentation in Spark can be quite
> >>>> useful for further drill down. See
> >>>> https://spark.apache.org/docs/latest/monitoring.html
> >>>>
> >>>> You can also have a look at this tool that takes care of
> > automating
> >>>> collecting and aggregating some executor task metrics:
> >>>> https://github.com/LucaCanali/sparkMeasure
> >>>>
> >>>> Best,
> >>>>
> >>>> Luca
> >>>>
> >>>> From: Gourav Sengupta 
> >>>> Sent: Thursday, December 23, 2021 14:23
> >>>> To: bit...@bitfox.top
> >>>> Cc: user 
> >>>> Subject: Re: measure running time
> >>>>
> >>>> Hi,
> >>>>
> >>>> I do not think that such time comparisons make any sense at all in
> >>>> distributed computation. Just saying that an operation in RDD and
> >>>> Dataframe can be compared based on their start and stop time may
> > not
> >>>> provide any valid information.
> >>>>
> >>>> You will have to look into the details of timing and the steps.
> > For
> >>>> example, please look at the SPARK UI to see how timings are
> > calculated
> >>>> in distributed computing mode, there are several well written
> > papers
> >>>> on this.
> >>>>
> >>>> Thanks and Regards,
> >>>>
> >>>> Gourav Sengupta
> >>>>
> >>>> On Thu, Dec 23, 2021 at 10:57 AM  wrote:
> >>>>
> >>>>> hello community,
> >>>>>
> >>>>> In pyspark how can I measure the running time to the command?
> >>>>> I just want to compare the running time of the RDD API and
> > dataframe
> >>>>>
> >>>>> API, in my this blog:
> >>>>>
> >>>>
> >
> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
> >>>>>
> >>>>> I tried spark.time() it doesn't work.
> >>>>> Thank you.
> >>>>>
> >>>>>
> >>>>
> > -
> >>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>>
> >>>
> > -
> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>
> >> -
> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: measure running time

2021-12-24 Thread Gourav Sengupta
Hi,

There are too many blogs out there with absolutely no value. Before writing
another blog, which does not make much sense by doing run time comparisons
between RDD and dataframes (as stated earlier), it may be  useful to first
understand what you are trying to achieve by writing this blog.

Then perhaps based on that you may want to look at different options.


Regards,
Gourav Sengupta



On Fri, Dec 24, 2021 at 10:42 AM  wrote:

> As you see below:
>
> $ pip install sparkmeasure
> Collecting sparkmeasure
>Using cached
>
> https://files.pythonhosted.org/packages/9f/bf/c9810ff2d88513ffc185e65a3ab9df6121ad5b4c78aa8d134a06177f9021/sparkmeasure-0.14.0-py2.py3-none-any.whl
> Installing collected packages: sparkmeasure
> Successfully installed sparkmeasure-0.14.0
>
>
> $ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
> Python 3.6.9 (default, Jan 26 2021, 15:33:00)
> [GCC 8.4.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> ..
> >>> from sparkmeasure import StageMetrics
> Traceback (most recent call last):
>File "", line 1, in 
> ModuleNotFoundError: No module named 'sparkmeasure'
>
>
> That doesn't work still.
> I run spark 3.2.0 on an ubuntu system.
>
> Regards.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: measure running time

2021-12-24 Thread bitfox

As you see below:

$ pip install sparkmeasure
Collecting sparkmeasure
  Using cached 
https://files.pythonhosted.org/packages/9f/bf/c9810ff2d88513ffc185e65a3ab9df6121ad5b4c78aa8d134a06177f9021/sparkmeasure-0.14.0-py2.py3-none-any.whl

Installing collected packages: sparkmeasure
Successfully installed sparkmeasure-0.14.0


$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
..

from sparkmeasure import StageMetrics

Traceback (most recent call last):
  File "", line 1, in 
ModuleNotFoundError: No module named 'sparkmeasure'


That still doesn't work.
I run Spark 3.2.0 on an Ubuntu system.

Regards.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: measure running time

2021-12-24 Thread bitfox

but I already installed it:

Requirement already satisfied: sparkmeasure in 
/usr/local/lib/python2.7/dist-packages


so how? thank you.

On 2021-12-24 18:15, Hollis wrote:

Hi bitfox,

you need to pip install sparkmeasure first. Then you can launch it in pyspark.


from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*)
from range(1000) cross join range(1000) cross join
range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+

Regards,
Hollis

At 2021-12-24 09:18:19, bit...@bitfox.top wrote:

Hello list,

I run with Spark 3.2.0

After I started pyspark with:
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

I can't load from the module sparkmeasure:


from sparkmeasure import StageMetrics

Traceback (most recent call last):
  File "", line 1, in 
ModuleNotFoundError: No module named 'sparkmeasure'

Do you know why? @Luca thanks.


On 2021-12-24 04:20, bit...@bitfox.top wrote:

Thanks Gourav and Luca. I will try with the tools you provide in

the

Github.

On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a

simplistic

approach that may lead you to miss important details, in

particular

when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of

automating

collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and
Dataframe can be compared based on their start and stop time may

not

provide any valid information.

You will have to look into the details of timing and the steps.

For

example, please look at the SPARK UI to see how timings are

calculated

in distributed computing mode, there are several well written

papers

on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:


hello community,

In pyspark how can I measure the running time to the command?
I just want to compare the running time of the RDD API and

dataframe


API, in my this blog:




https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/


I tried spark.time() it doesn't work.
Thank you.





-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org




-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re:Re: measure running time

2021-12-24 Thread Hollis
Hi bitfox,


you need to pip install sparkmeasure first. Then you can launch it in pyspark.


>>> from sparkmeasure import StageMetrics
>>> stagemetrics = StageMetrics(spark)
>>> stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from 
>>> range(1000) cross join range(1000) cross join range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+


Regards,
Hollis






At 2021-12-24 09:18:19, bit...@bitfox.top wrote:
>Hello list,
>
>I run with Spark 3.2.0
>
>After I started pyspark with:
>$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
>
>I can't load from the module sparkmeasure:
>
>>>> from sparkmeasure import StageMetrics
>Traceback (most recent call last):
>   File "", line 1, in 
>ModuleNotFoundError: No module named 'sparkmeasure'
>
>Do you know why? @Luca thanks.
>
>
>On 2021-12-24 04:20, bit...@bitfox.top wrote:
>> Thanks Gourav and Luca. I will try with the tools you provide in the 
>> Github.
>> 
>> On 2021-12-23 23:40, Luca Canali wrote:
>>> Hi,
>>> 
>>> I agree with Gourav that just measuring execution time is a simplistic
>>> approach that may lead you to miss important details, in particular
>>> when running distributed computations.
>>> 
>>> WebUI, REST API, and metrics instrumentation in Spark can be quite
>>> useful for further drill down. See
>>> https://spark.apache.org/docs/latest/monitoring.html
>>> 
>>> You can also have a look at this tool that takes care of automating
>>> collecting and aggregating some executor task metrics:
>>> https://github.com/LucaCanali/sparkMeasure
>>> 
>>> Best,
>>> 
>>> Luca
>>> 
>>> From: Gourav Sengupta 
>>> Sent: Thursday, December 23, 2021 14:23
>>> To: bit...@bitfox.top
>>> Cc: user 
>>> Subject: Re: measure running time
>>> 
>>> Hi,
>>> 
>>> I do not think that such time comparisons make any sense at all in
>>> distributed computation. Just saying that an operation in RDD and
>>> Dataframe can be compared based on their start and stop time may not
>>> provide any valid information.
>>> 
>>> You will have to look into the details of timing and the steps. For
>>> example, please look at the SPARK UI to see how timings are calculated
>>> in distributed computing mode, there are several well written papers
>>> on this.
>>> 
>>> Thanks and Regards,
>>> 
>>> Gourav Sengupta
>>> 
>>> On Thu, Dec 23, 2021 at 10:57 AM  wrote:
>>> 
>>>> hello community,
>>>> 
>>>> In pyspark how can I measure the running time to the command?
>>>> I just want to compare the running time of the RDD API and dataframe
>>>> 
>>>> API, in my this blog:
>>>> 
>>> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
>>>> 
>>>> I tried spark.time() it doesn't work.
>>>> Thank you.
>>>> 
>>>> 
>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>-
>To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: measure running time

2021-12-23 Thread bitfox

Hello list,

I run with Spark 3.2.0

After I started pyspark with:
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

I can't load from the module sparkmeasure:


from sparkmeasure import StageMetrics

Traceback (most recent call last):
  File "", line 1, in 
ModuleNotFoundError: No module named 'sparkmeasure'

Do you know why? @Luca thanks.


On 2021-12-24 04:20, bit...@bitfox.top wrote:
Thanks Gourav and Luca. I will try with the tools you provide in the 
Github.


On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a simplistic
approach that may lead you to miss important details, in particular
when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating
collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and
Dataframe can be compared based on their start and stop time may not
provide any valid information.

You will have to look into the details of timing and the steps. For
example, please look at the SPARK UI to see how timings are calculated
in distributed computing mode, there are several well written papers
on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:


hello community,

In pyspark how can I measure the running time to the command?
I just want to compare the running time of the RDD API and dataframe

API, in my this blog:


https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/


I tried spark.time() it doesn't work.
Thank you.



-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: measure running time

2021-12-23 Thread bitfox
Thanks Gourav and Luca. I will try with the tools you provide in the 
Github.


On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a simplistic
approach that may lead you to miss important details, in particular
when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating
collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and
Dataframe can be compared based on their start and stop time may not
provide any valid information.

You will have to look into the details of timing and the steps. For
example, please look at the SPARK UI to see how timings are calculated
in distributed computing mode, there are several well written papers
on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:


hello community,

In pyspark how can I measure the running time to the command?
I just want to compare the running time of the RDD API and dataframe

API, in my this blog:


https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/


I tried spark.time() it doesn't work.
Thank you.



-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: measure running time

2021-12-23 Thread Mich Talebzadeh
h,
>
>
>
> With Spark 3.1.1 you need to use spark-measure built with Scala 2.12:
>
>
>
> bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
>
>
>
> Best,
>
> Luca
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, December 23, 2021 19:59
> *To:* Luca Canali 
> *Cc:* user 
> *Subject:* Re: measure running time
>
>
>
> Hi Luca,
>
>
>
> Have you tested this link  https://github.com/LucaCanali/sparkMeasure
>
>
>
> With Spark 3.1.1/PySpark,   I am getting this error
>
>
>
>
>
> pyspark --packages ch.cern.sparkmeasure:spark-measure_2.11:0.17
>
>
>
> :: problems summary ::
>
>  ERRORS
>
> unknown resolver null
>
>
>
> SERVER ERROR: Bad Gateway url=
> https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-bom/2.9.9/jackson-bom-2.9.9.jar
>
>
>
> SERVER ERROR: Bad Gateway url=
> https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-base/2.9.9/jackson-base-2.9.9.jar
>
>
>
> Using Python version 3.7.3 (default, Mar 27 2019 22:11:17)
>
> Spark context Web UI available at http://rhes76:4040
>
> Spark context available as 'sc' (master = local[*], app id =
> local-1640285629478).
>
> SparkSession available as 'spark'.
>
>
>
> >>> from sparkmeasure import StageMetrics
>
> >>> stagemetrics = StageMetrics(spark)
>
> Traceback (most recent call last):
>
>   File "", line 1, in 
>
>   File
> "/home/hduser/anaconda3/envs/pyspark_venv/lib/python3.7/site-packages/sparkmeasure/stagemetrics.py",
> line 15, in __init__
>
> self.stagemetrics =
> self.sc._jvm.ch.cern.sparkmeasure.StageMetrics(self.sparksession._jsparkSession)
>
>   File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
> line 1569, in __call__
>
>   File "/opt/spark/python/pyspark/sql/utils.py", line 111, in deco
>
> return f(*a, **kw)
>
>   File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line
> 328, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> None.ch.cern.sparkmeasure.StageMetrics.
>
> : java.lang.NoClassDefFoundError: scala/Product$class
>
> at ch.cern.sparkmeasure.StageMetrics.(stagemetrics.scala:111)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
>
> at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>
> at py4j.Gateway.invoke(Gateway.java:238)
>
> at
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
>
> at
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
>
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
>
> at java.lang.Thread.run(Thread.java:748)
>
> Caused by: java.lang.ClassNotFoundException: scala.Product$class
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 12 more
>
>
>
> Thanks
>
>
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 23 Dec 2021 at 15:41, Luca Canali  wrote:
>
> Hi,
>
>
>
> I agree with Gourav that just measuring execution time is a simplistic
> approach that may lead you to miss important details, in particular when
> running distributed computations.
>
> WebUI, REST API, and metrics instrumentation in Spark can be quite useful
> for further drill down. See
> https://spark.apache.org/docs/latest/monitoring.html
>
> You can also have a look at this tool that takes care of automating
>

RE: measure running time

2021-12-23 Thread Luca Canali
Hi Mich,

 

With Spark 3.1.1 you need to use spark-measure built with Scala 2.12:  

 

bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
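
If in doubt about which Scala build you are on, the version banner reports it; 
a sketch:

$ spark-submit --version   # the banner includes the Scala version Spark was built with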

 

Best,

Luca

 

From: Mich Talebzadeh  
Sent: Thursday, December 23, 2021 19:59
To: Luca Canali 
Cc: user 
Subject: Re: measure running time

 

Hi Luca,

 

Have you tested this link  https://github.com/LucaCanali/sparkMeasure

 

With Spark 3.1.1/PySpark,   I am getting this error 

 

 

pyspark --packages ch.cern.sparkmeasure:spark-measure_2.11:0.17

 

:: problems summary ::

 ERRORS

unknown resolver null

 

SERVER ERROR: Bad Gateway 
url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-bom/2.9.9/jackson-bom-2.9.9.jar

 

SERVER ERROR: Bad Gateway 
url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-base/2.9.9/jackson-base-2.9.9.jar

 

Using Python version 3.7.3 (default, Mar 27 2019 22:11:17)

Spark context Web UI available at http://rhes76:4040

Spark context available as 'sc' (master = local[*], app id = 
local-1640285629478).

SparkSession available as 'spark'.

 

>>> from sparkmeasure import StageMetrics

>>> stagemetrics = StageMetrics(spark)

Traceback (most recent call last):

  File "", line 1, in 

  File 
"/home/hduser/anaconda3/envs/pyspark_venv/lib/python3.7/site-packages/sparkmeasure/stagemetrics.py",
 line 15, in __init__

self.stagemetrics = 
self.sc._jvm.ch.cern.sparkmeasure.StageMetrics(self.sparksession._jsparkSession)

  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1569, in __call__

  File "/opt/spark/python/pyspark/sql/utils.py", line 111, in deco

return f(*a, **kw)

  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, 
in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling 
None.ch.cern.sparkmeasure.StageMetrics.

: java.lang.NoClassDefFoundError: scala/Product$class

at ch.cern.sparkmeasure.StageMetrics.(stagemetrics.scala:111)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)

at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)

at py4j.Gateway.invoke(Gateway.java:238)

at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)

at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)

at py4j.GatewayConnection.run(GatewayConnection.java:238)

at java.lang.Thread.run(Thread.java:748)

Caused by: java.lang.ClassNotFoundException: scala.Product$class

at java.net.URLClassLoader.findClass(URLClassLoader.java:382)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 12 more

 

Thanks

 

 

   view my Linkedin profile 
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> 

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 

 

 

 

On Thu, 23 Dec 2021 at 15:41, Luca Canali  wrote:

Hi,

 

I agree with Gourav that just measuring execution time is a simplistic approach 
that may lead you to miss important details, in particular when running 
distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite useful for 
further drill down. See https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating collecting 
and aggregating some executor task metrics: 
https://github.com/LucaCanali/sparkMeasure

 

Best,

Luca

 

From: Gourav Sengupta  
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

 

Hi,

 

I do not think that such time comparisons make any sense at all in distributed 
computation. Just saying that an operation in RDD and Dataframe can be compared 
based on their start and stop time may not provide any valid information.

 

You will have to look into the details of timing and the steps. For example, 
please look at the SPARK UI to see how timings are calculated in distributed 
computing mode, there are several well written papers on this.

Re: measure running time

2021-12-23 Thread Mich Talebzadeh
Hi Luca,

Have you tested this link  https://github.com/LucaCanali/sparkMeasure


With Spark 3.1.1/PySpark,   I am getting this error



pyspark --packages ch.cern.sparkmeasure:spark-measure_2.11:0.17

:: problems summary ::

 ERRORS

unknown resolver null


SERVER ERROR: Bad Gateway url=
https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-bom/2.9.9/jackson-bom-2.9.9.jar


SERVER ERROR: Bad Gateway url=
https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-base/2.9.9/jackson-base-2.9.9.jar


Using Python version 3.7.3 (default, Mar 27 2019 22:11:17)

Spark context Web UI available at http://rhes76:4040

Spark context available as 'sc' (master = local[*], app id =
local-1640285629478).

SparkSession available as 'spark'.

>>> from sparkmeasure import StageMetrics
>>> stagemetrics = StageMetrics(spark)
Traceback (most recent call last):
  File "", line 1, in 
  File
"/home/hduser/anaconda3/envs/pyspark_venv/lib/python3.7/site-packages/sparkmeasure/stagemetrics.py",
line 15, in __init__
self.stagemetrics =
self.sc._jvm.ch.cern.sparkmeasure.StageMetrics(self.sparksession._jsparkSession)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
line 1569, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line
328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
None.ch.cern.sparkmeasure.StageMetrics.
: java.lang.NoClassDefFoundError: scala/Product$class
at ch.cern.sparkmeasure.StageMetrics.(stagemetrics.scala:111)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more

Thanks



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 23 Dec 2021 at 15:41, Luca Canali  wrote:

> Hi,
>
>
>
> I agree with Gourav that just measuring execution time is a simplistic
> approach that may lead you to miss important details, in particular when
> running distributed computations.
>
> WebUI, REST API, and metrics instrumentation in Spark can be quite useful
> for further drill down. See
> https://spark.apache.org/docs/latest/monitoring.html
>
> You can also have a look at this tool that takes care of automating
> collecting and aggregating some executor task metrics:
> https://github.com/LucaCanali/sparkMeasure
>
>
>
> Best,
>
> Luca
>
>
>
> *From:* Gourav Sengupta 
> *Sent:* Thursday, December 23, 2021 14:23
> *To:* bit...@bitfox.top
> *Cc:* user 
> *Subject:* Re: measure running time
>
>
>
> Hi,
>
>
>
> I do not think that such time comparisons make any sense at all in
> distributed computation. Just saying that an operation in RDD and Dataframe
> can be compared based on their start and stop time may not provide any
> valid information.
>
>
>
> You will have to look into the details of timing and the steps. For
> example, please look at the SPARK UI to see how timings are calculated in
> distributed computing mode, there are several well written papers on this.
>
>
>
>
>
> Thanks and Regards,
>
> Gourav Sengupta
>
> On Thu, Dec 23, 2021 at 10:57 AM  wrote:
>
> hello community,
>
> In pyspark how can I measure the running time to the command?

RE: measure running time

2021-12-23 Thread Luca Canali
Hi,

 

I agree with Gourav that just measuring execution time is a simplistic approach 
that may lead you to miss important details, in particular when running 
distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite useful for 
further drill down. See https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating collecting 
and aggregating some executor task metrics: 
https://github.com/LucaCanali/sparkMeasure
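
For example, with an application running on the default UI port, the REST API 
can be queried directly; a sketch, assuming the driver UI is reachable on 
localhost:4040 (replace <app-id> with the id returned by the first call):

$ curl http://localhost:4040/api/v1/applications
$ curl http://localhost:4040/api/v1/applications/<app-id>/stages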

 

Best,

Luca

 

From: Gourav Sengupta  
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

 

Hi,

 

I do not think that such time comparisons make any sense at all in distributed 
computation. Just saying that an operation in RDD and Dataframe can be compared 
based on their start and stop time may not provide any valid information.

 

You will have to look into the details of timing and the steps. For example, 
please look at the SPARK UI to see how timings are calculated in distributed 
computing mode, there are several well written papers on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM bit...@bitfox.top wrote:

hello community,

In pyspark how can I measure the running time to the command?
I just want to compare the running time of the RDD API and dataframe 
API, in my this blog:
https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/

I tried spark.time() it doesn't work.
Thank you.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org 



Re: measure running time

2021-12-23 Thread Gourav Sengupta
Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and Dataframe
can be compared based on their start and stop time may not provide any
valid information.

You will have to look into the details of timing and the steps. For
example, please look at the SPARK UI to see how timings are calculated in
distributed computing mode, there are several well written papers on this.


Thanks and Regards,
Gourav Sengupta





On Thu, Dec 23, 2021 at 10:57 AM  wrote:

> hello community,
>
> In pyspark how can I measure the running time to the command?
> I just want to compare the running time of the RDD API and dataframe
> API, in my this blog:
>
> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
>
> I tried spark.time() it doesn't work.
> Thank you.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: measure running time

2021-12-23 Thread Mich Talebzadeh
Try this simple thing first:

import time

def main():
    start_time = time.time()

    print("\nStarted at"); uf.println(lst)   # uf.println(lst): the author's own utility call (not defined here)
    # your code goes here

    print("\nFinished at"); uf.println(lst)
    end_time = time.time()
    time_elapsed = (end_time - start_time)
    print(f"""Elapsed time in seconds is {time_elapsed}""")
    spark_session.stop()                      # spark_session: the active SparkSession


see

https://github.com/michTalebzadeh/spark_on_gke/blob/main/src/RandomDataBigQuery.py

HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 23 Dec 2021 at 10:58,  wrote:

> hello community,
>
> In pyspark how can I measure the running time to the command?
> I just want to compare the running time of the RDD API and dataframe
> API, in my this blog:
>
> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
>
> I tried spark.time() it doesn't work.
> Thank you.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


measure running time

2021-12-23 Thread bitfox

hello community,

In pyspark, how can I measure the running time of a command?
I just want to compare the running time of the RDD API and the dataframe 
API, in this blog of mine:

https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/

I tried spark.time() but it doesn't work.
Thank you.
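
For reference: spark.time() is part of the Scala SparkSession API and, as far 
as I can tell, is not exposed in PySpark, which would explain why it does not 
work here. A minimal way to time a single action from PySpark (a sketch, 
assuming an active spark session; the .show() is the action that actually 
triggers execution):

>>> import time
>>> t0 = time.perf_counter()
>>> spark.sql("select count(*) from range(1000) cross join range(1000)").show()
>>> print(f"elapsed: {time.perf_counter() - t0:.3f} s")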

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org