Re: [pyspark 2.3] count followed by write on dataframe

2019-05-20 Thread Keith Chapman
Yes, that is correct; that would cause the computation to run twice. If you want the computation to happen only once, you can cache the dataframe and call count and write on the cached dataframe. Regards, Keith. http://keith-chapman.com On Mon, May 20, 2019 at 6:43 PM Rishi Shah wrote: > Hi All, > > Just
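A minimal PySpark sketch of the pattern Keith describes (the dataframe, paths, and the spark session name are placeholders, not from the thread):

    df = spark.read.parquet("/path/to/input")               # placeholder input
    df.cache()                                              # mark for caching; materialized by the first action
    n = df.count()                                          # first action: computes the dataframe and fills the cache
    df.write.mode("overwrite").parquet("/path/to/output")   # second action: reuses the cached data instead of recomputing
    df.unpersist()                                          # optionally free the cached blocks afterwards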

[pyspark 2.3] count followed by write on dataframe

2019-05-20 Thread Rishi Shah
Hi All, Just wanted to confirm my understanding around actions on a dataframe. If the dataframe is not persisted at any point, and count() is called on it followed by a write action, this would trigger the dataframe computation twice (which could be a performance hit for a larger dataframe)..

Re: run new spark version on old spark cluster ?

2019-05-20 Thread Koert Kuipers
You most likely have to set something in spark-defaults.conf, like spark.master yarn and spark.submit.deployMode client. On Mon, May 20, 2019 at 3:14 PM Nicolas Paris wrote: > Finally that was easy to connect to both hive/hdfs. I just had to copy > the hive-site.xml from the old spark version and that
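For reference, the two settings quoted above would sit in conf/spark-defaults.conf of the newly installed Spark, roughly like this (the HADOOP_CONF_DIR note is an assumption, not something stated in the thread):

    spark.master              yarn
    spark.submit.deployMode   client
    # typically HADOOP_CONF_DIR / YARN_CONF_DIR must also point at the cluster's Hadoop config,
    # usually exported in conf/spark-env.sh or in the shell environment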

Re: High level explanation of dropDuplicates

2019-05-20 Thread Nicholas Hakobian
From doing some searching around in the spark codebase, I found the following: https://github.com/apache/spark/blob/163a6e298213f216f74f4764e241ee6298ea30b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1452-L1474 So it appears there is no direct operation
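As a rough PySpark illustration of what that optimizer rule (ReplaceDeduplicateWithAggregate) amounts to: the deduplication is rewritten into an aggregate that keeps the first value of every non-key column. The column name and dataframe are placeholders; this is a sketch of the idea, not the actual generated plan:

    from pyspark.sql import functions as F

    # df.dropDuplicates(["id"]) is logically rewritten to roughly:
    deduped = (df.groupBy("id")
                 .agg(*[F.first(c).alias(c) for c in df.columns if c != "id"]))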

Re: run new spark version on old spark cluster ?

2019-05-20 Thread Nicolas Paris
Finally, connecting to both hive/hdfs was easy: I just had to copy the hive-site.xml from the old spark version, and that worked instantly after unzipping. Right now I am stuck on connecting to yarn. On Mon, May 20, 2019 at 02:50:44PM -0400, Koert Kuipers wrote: > we had very few issues

Re: run new spark version on old spark cluster ?

2019-05-20 Thread Koert Kuipers
We had very few issues with hdfs or hive, but then we use hive only for basic reading and writing of tables. Depending on your vendor, you might have to add a few settings to your spark-defaults.conf. I remember on hdp you had to set the hdp.version somehow. We prefer to build spark with hadoop
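If memory serves, the HDP workaround alluded to here is usually passing hdp.version as a Java system property in spark-defaults.conf, along these lines (the version string is a placeholder for your actual HDP build; this is recalled from experience rather than stated in the thread):

    spark.driver.extraJavaOptions    -Dhdp.version=2.x.x.x-xxxx
    spark.yarn.am.extraJavaOptions   -Dhdp.version=2.x.x.x-xxxx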

High level explanation of dropDuplicates

2019-05-20 Thread Yeikel
Hi, I am looking for a high-level explanation (overview) of how dropDuplicates [1] works. [1] https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326 Could someone please explain? Thank you -- Sent from:
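For readers skimming the digest, basic usage looks like this (column names are placeholders); the reply earlier in this digest sketches how it is rewritten internally:

    df.dropDuplicates()                    # deduplicate on all columns
    df.dropDuplicates(["user_id", "day"])  # deduplicate on a subset; which surviving row is kept is not guaranteed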

Re: run new spark version on old spark cluster ?

2019-05-20 Thread Nicolas Paris
> correct. note that you only need to install spark on the node you launch it > from. spark doesn't need to be installed on the cluster itself. That sounds reasonably doable for me. My guess is I will have some trouble making that spark version work with both hive & hdfs installed on the cluster - or

Re: run new spark version on old spark cluster ?

2019-05-20 Thread Koert Kuipers
Correct. Note that you only need to install spark on the node you launch it from; spark doesn't need to be installed on the cluster itself. The shared components between spark jobs on yarn are really only the spark-shuffle-service in yarn and the spark-history-server. I have found compatibility for these to
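For context, these are the job-side settings that typically touch those two shared components (a sketch, not from the thread; the HDFS path is a placeholder):

    spark.shuffle.service.enabled   true                  # use the external shuffle service registered with YARN
    spark.eventLog.enabled          true                  # write event logs that the history server reads
    spark.eventLog.dir              hdfs:///spark-logs    # placeholder log directory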

Re: run new spark version on old spark cluster ?

2019-05-20 Thread Pat Ferrel
It is always dangerous to run a NEWER version of code on an OLDER cluster. The danger increases with the semver change, and this one is not just a build #. In other words, 2.4 is considered to be a fairly major change from 2.3. Not much else can be said. From: Nicolas Paris Reply:

Re: run new spark version on old spark cluster ?

2019-05-20 Thread Nicolas Paris
> you will need the spark version you intend to launch with on the machine you > launch from and point to the correct spark-submit Does this mean installing a second spark version (2.4) on the cluster? Thanks. On Mon, May 20, 2019 at 01:58:11PM -0400, Koert Kuipers wrote: > yarn can happily run

Re: run new spark version on old spark cluster ?

2019-05-20 Thread Koert Kuipers
Yarn can happily run multiple spark versions side by side. You will need the spark version you intend to launch with on the machine you launch from, and point to the correct spark-submit. On Mon, May 20, 2019 at 1:50 PM Nicolas Paris wrote: > Hi > > I am wondering whether that's feasible to: > -
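Concretely, that might look like the following, assuming the newer Spark is simply unpacked under /opt (the paths, version, and jar name are placeholders, not from the thread):

    export HADOOP_CONF_DIR=/etc/hadoop/conf          # point at the existing cluster's Hadoop/YARN config
    /opt/spark-2.4.3-bin-hadoop2.7/bin/spark-submit \
        --master yarn \
        --deploy-mode client \
        path/to/my-app.jar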

run new spark version on old spark cluster ?

2019-05-20 Thread Nicolas Paris
Hi, I am wondering whether it's feasible to: - build a spark application (with sbt/maven) based on spark 2.4 - deploy that jar on yarn on a spark 2.3 based installation. Thanks in advance, -- nicolas - To unsubscribe e-mail:
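For the "build against 2.4" half, a typical sbt sketch looks like this (versions are illustrative; marking Spark as "provided" keeps it out of the assembly jar, so whichever spark-submit you launch with supplies Spark at runtime):

    // build.sbt (illustrative versions, not from the thread)
    scalaVersion := "2.11.12"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.4.3" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.4.3" % "provided"
    )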

Re: Spark-YARN | Scheduling of containers

2019-05-20 Thread Hariharan
It makes scheduling faster. If you have a node that can accommodate 20 containers and you schedule one container per heartbeat, it would take 20 seconds to schedule all the containers (assuming the default one-second node heartbeat interval). On the other hand, if you schedule multiple containers per heartbeat, it is much faster. - Hari On Mon, 20 May 2019, 15:40

Fetching LinkedIn data into PySpark using OAuth2.0

2019-05-20 Thread Aakash Basu
Hi, Just curious to know if anyone has been successful in connecting to LinkedIn using OAuth 2.0 (client ID and client secret) to fetch data and process it in Python/PySpark. I'm getting stuck at connection establishment. Any help? Thanks, Aakash.

Re: Spark-YARN | Scheduling of containers

2019-05-20 Thread Akshay Bhardwaj
Hi Hari, Thanks for this information. Do you have any resources on, or can you explain, why YARN has this as the default behaviour? What would be the advantages/scenarios of having multiple assignments in a single heartbeat? Regards Akshay Bhardwaj +91-97111-33849 On Mon, May 20, 2019 at 1:29 PM Hariharan

Watermark handling on initial query start (Structured Streaming)

2019-05-20 Thread Joe Ammann
Hi all, I'm currently developing a Spark structured streaming application which joins/aggregates messages from ~7 Kafka topics and produces messages onto another Kafka topic. Quite often in my development cycle, I want to "reprocess from scratch": I stop the program, delete the target topic

Re: [spark on yarn] spark on yarn without DFS

2019-05-20 Thread JB Data31
There is a kind of check in the yarn-site.xml: yarn.nodemanager.remote-app-log-dir set to /var/yarn/logs. Using hdfs://:9000 as fs.defaultFS in core-site.xml, you have to hdfs dfs -mkdir /var/yarn/logs. Using S3:// as fs.defaultFS... Take care of the ".dir" properties in
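Rendered as the XML it refers to, that is roughly the following (the namenode host is written as a placeholder because it was elided in the original message):

    <!-- yarn-site.xml -->
    <property>
      <name>yarn.nodemanager.remote-app-log-dir</name>
      <value>/var/yarn/logs</value>
    </property>

    <!-- core-site.xml -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://namenode-host:9000</value>
    </property>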

Re: Spark-YARN | Scheduling of containers

2019-05-20 Thread Hariharan
Hi Akshay, I believe HDP uses the capacity scheduler by default. In the capacity scheduler, assignment of multiple containers on the same node is determined by the option yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled, which is true by default. If you would like YARN to
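The property Hari names lives in capacity-scheduler.xml; turning it off would look roughly like this (a sketch only; check the YARN/HDP documentation for your version):

    <!-- capacity-scheduler.xml -->
    <property>
      <name>yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled</name>
      <!-- default is true; false limits the scheduler to one container assignment per node heartbeat -->
      <value>false</value>
    </property>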

Re: [spark on yarn] spark on yarn without DFS

2019-05-20 Thread Hariharan
Hi Huizhe, You can set the "fs.defaultFS" field in core-site.xml to some path on S3. That way your Spark job will use S3 for all operations that need HDFS. Intermediate data will still be stored on local disk, though. Thanks, Hari On Mon, May 20, 2019 at 10:14 AM Abdeali Kothari wrote: > While
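A sketch of what that core-site.xml change might look like (the bucket name is a placeholder, and using the s3a scheme plus the hadoop-aws connector and credentials is an assumption, not something stated in the thread):

    <!-- core-site.xml -->
    <property>
      <name>fs.defaultFS</name>
      <value>s3a://my-bucket</value>
    </property>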