RE: why the pyspark RDD API is so slow?

2022-01-30 Thread Theodore J Griesenbrock
Can you suggest a particular code sample that illustrates your tips?

> On Jan 30, 2022, at 06:16, Sebastian Piu  wrote:
> 
> It's because all data needs to be pickled back and forth between Java and a 
> spun-up Python worker, so there is more overhead than if you stay fully in 
> Scala. 
> 
> Your Python code might make this worse too, for example if it does not yield 
> from operations.
> 
> You can look at using UDFs and Arrow, or try to stay as much as possible on 
> DataFrame operations only.
> 
>> On Sun, 30 Jan 2022, 10:11 Bitfox,  wrote:
>> Hello list,
>> 
>> I did a comparison for pyspark RDD, scala RDD, pyspark dataframe and a pure 
>> scala program. The result shows the pyspark RDD is too slow.
>> 
>> For the operations and dataset please see:
>> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>> 
>> The result table is below.
>> Can you give suggestions on how to optimize the RDD operation?
>> 
>> Thanks a lot.
>> 
>> 
>> program            time
>> scala program      49s
>> pyspark dataframe  56s
>> scala RDD          1m31s
>> pyspark RDD        7m15s
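
For anyone looking for a code sample of the tips quoted above, here is a minimal PySpark sketch. It is not taken from the thread: it assumes the benchmark is a word count (suggested only by the blog URL), and the function names and the `path` argument are hypothetical. The first function is a generator that pre-aggregates a partition and yields its results, the second uses it through the RDD API, and the third expresses the same job with built-in DataFrame functions so that no rows have to be pickled into Python workers.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def count_words_in_partition(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Pre-aggregate one partition and yield (word, count) pairs."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    # Yield the pairs instead of building and returning a separate list.
    yield from counts.items()


def word_count_rdd(spark, path: str):
    """RDD version: records are pickled between the JVM and Python workers."""
    return (
        spark.sparkContext.textFile(path)
        .mapPartitions(count_words_in_partition)  # called once per partition
        .reduceByKey(lambda a, b: a + b)          # merge the partial counts
    )


def word_count_df(spark, path: str):
    """DataFrame version: only built-in functions, so the work stays in the JVM."""
    from pyspark.sql import functions as F  # lazy import; requires pyspark

    words = spark.read.text(path).select(
        F.explode(F.split(F.lower(F.col("value")), r"\s+")).alias("word")
    )
    return words.where(F.col("word") != "").groupBy("word").count()
```

Both Spark functions expect an existing SparkSession, for example `word_count_df(spark, "words.txt").show()`. How much of the gap in the table this closes depends on the data and the cluster, so treat it as a starting point for measurement rather than a guaranteed fix.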



RE: Is user@spark indexed by google?

2022-01-21 Thread Theodore J Griesenbrock
Try searching here:
 
https://lists.apache.org/list.html?user@spark.apache.org
 
-T.J.
 
 
T.J. Griesenbrock
Technical Release Manager
Watson Health
He/Him/His
 
+1 (602) 377-7673 (Text only)
t...@ibm.com
IBM
 
 
- Original message -
From: "Mich Talebzadeh"
To:
Cc: "user @spark"
Subject: [EXTERNAL] Re: Is user@spark indexed by google?
Date: Fri, Jan 21, 2022 16:08
 
Agreed, user@spark is a great place to search for answers, and no, I don't think this email list is indexed by Google.
 
For this reason I use Gmail, and all my user@/dev@ memberships are added to my Gmail account. For example, I can search the mailing list in Gmail from 2016 onwards. I suggest you explore that option if it helps.
 
Mich
 
 
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
  

On Fri, 21 Jan 2022 at 18:03, Andrew Davidson  wrote:
There is a ton of great info in this archive. I noticed that when I do a Google search, it does not seem to find results from this source.
 
Kind regards
 
Andy
 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: questions on these functions

2022-01-21 Thread Theodore J Griesenbrock
I found several discussions of leftFold and rightFold in various forums, but I cannot find anything about them in the official RDD documentation:
 
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html
 
They appear to be unrelated to Spark, and are probably something Hadoop-related.  Can you please be more specific about how leftFold and rightFold are imported, and which language you are using with Spark?
 
Thanks!
 
-T.J.
 
 
- Original message -
From: "Sherd Fox"
To: user@spark.apache.org
Cc:
Subject: [EXTERNAL] questions on these functions
Date: Fri, Jan 21, 2022 04:26
 
Hello sparkers,
 
What are the differences between leftFold, rightFold and fold in the RDD functions?
 
I am not very clear on how to use them.
 
Thanks.
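 
For reference, the Spark RDD API has `fold` (alongside `reduce` and `aggregate`), while `foldLeft` and `foldRight` are methods on Scala collections; reading the question's leftFold and rightFold as those two methods is an assumption. The pure-Python sketch below imitates the three behaviours without Spark. All helper names are hypothetical, and the partitioned version only mimics how an RDD fold applies the zero value once per partition and once more when merging the partial results.

```python
from functools import reduce
from operator import add, sub
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def fold_left(items: Sequence[T], zero: T, op: Callable[[T, T], T]) -> T:
    """Combine from the left: op(op(op(zero, x1), x2), x3)."""
    return reduce(op, items, zero)


def fold_right(items: Sequence[T], zero: T, op: Callable[[T, T], T]) -> T:
    """Combine from the right: op(x1, op(x2, op(x3, zero)))."""
    return reduce(lambda acc, x: op(x, acc), reversed(items), zero)


def partitioned_fold(
    partitions: Sequence[Sequence[T]], zero: T, op: Callable[[T, T], T]
) -> T:
    """Imitate an RDD-style fold: fold each partition, then fold the partials."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


# Order matters for a non-associative operator such as subtraction.
left = fold_left([1, 2, 3], 0, sub)    # ((0 - 1) - 2) - 3
right = fold_right([1, 2, 3], 0, sub)  # 1 - (2 - (3 - 0))

# With a neutral zero value the partitioned fold matches a plain sum.
neutral = partitioned_fold([[1, 2], [3, 4]], 0, add)
# A non-neutral zero value is added once per partition plus once at the end.
skewed = partitioned_fold([[1, 2], [3, 4]], 1, add)
```

This is why a distributed fold is normally used with an associative operator and a neutral zero value: the grouping of elements into partitions is not something the caller controls.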
 





RE: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? your target release day for Spark3.3?

2022-01-19 Thread Theodore J Griesenbrock
Again, sorry to bother you.
 
What is the best way to ensure we are notified when a new version of Apache Spark is released?  I do not see any RSS feeds, nor do I see any e-mail subscription option for this page:  https://spark.apache.org/news/index.html
 
Please let me know what we can do to ensure we stay up to date with the news.
 
Thanks!
 
-T.J.
 
 
- Original message -
From: "Sean Owen"
To: "Juan Liu"
Cc: "Theodore J Griesenbrock", "User"
Subject: [EXTERNAL] Re: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? your target release day for Spark3.3?
Date: Thu, Jan 13, 2022 08:05
 
Yes; however, Spark does not use the SocketServer mentioned in CVE-2019-17571, so it is not affected.
3.3.0 will probably be out in a couple of months.

On Thu, Jan 13, 2022 at 3:14 AM Juan Liu <liuj...@cn.ibm.com> wrote:
We have been informed that CVE-2021-4104 is not the only problem with Log4j 1.x. There is one more, CVE-2019-17571, and since Apache announced EOL for Log4j 1.x in 2015, Spark 3.3.0 is highly anticipated. Do you think mid-2022 is a reasonable time for the Spark 3.3.0 release?
 
Juan Liu (刘娟) PMP®
Release Management, Watson Health, China Development Lab
Email: liuj...@cn.ibm.com
Phone: 86-10-82452506
 

