join with just 1 record causes all data to go to a single node

2019-11-21 Thread Marcelo Valle
Hi,

I am using Spark on EMR 5.28.0.

We were having a problem in production where, after a join between two
dataframes, all the data was sometimes moved to a single node, and the
cluster then failed after many retries.

Our join looks something like this:

```
df1.join(df2,
  df1("field1") <=> df2("field1")
    && df1("field2") <=> df2("field2"))
```

After some investigation, we were able to isolate the corner case: this was
only happening when all of the join fields were NULL. Notice the `<=>`
(null-safe equality) operator instead of `===`.
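
For reference, here is a minimal sketch of what we see (a local-mode toy, not
our real job; the column values are made up). With `<=>`, NULL `<=>` NULL is
true, so the all-NULL rows appear to share a single join key:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

// Minimal reproduction sketch (assumption: local mode, toy data).
val spark = SparkSession.builder()
  .appName("null-safe-join-repro")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(("a", "b"), (null, null), (null, null)).toDF("field1", "field2")
val df2 = Seq(("a", "b"), (null, null)).toDF("field1", "field2")

// NULL <=> NULL evaluates to true, so the all-NULL rows match each
// other; with === the NULL rows would not match at all.
val joined = df1.join(df2,
  df1("field1") <=> df2("field1")
    && df1("field2") <=> df2("field2"))

// Because all matching NULL keys hash to the same value, the matched
// rows appear to land in a single shuffle partition:
joined.groupBy(spark_partition_id()).count().show()
```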

Would someone be able to explain this behavior? It looks like a bug to me,
but I could be missing something.

Thanks,
Marcelo.



Re: Spark onApplicationEnd run multiple times during the application failure

2019-11-21 Thread hemant singh
This is how it works: it is a whole-application retry. After a task fails 4
attempts (the default spark.task.maxFailures), the whole application fails,
and the application is then retried.
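
For illustration, here is a minimal sketch of that setup (the `EndLogger`
class name is made up, and I am assuming a YARN deployment). Each application
attempt runs the full listener lifecycle, so a failed first attempt plus its
retry means `onApplicationEnd` fires once per attempt:

```
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Hypothetical listener mirroring the setup described in the question.
class EndLogger extends SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    // On a retried application this runs once per attempt, so a failed
    // first attempt plus its retry produces two invocations.
    println(s"onApplicationEnd at ${end.time}")
  }
}

val sc = SparkContext.getOrCreate()
sc.addSparkListener(new EndLogger)

// Assumption: on YARN, capping application attempts makes the callback
// fire only once:
//   spark-submit --conf spark.yarn.maxAppAttempts=1 ...
```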

Thanks,
Hemant

On Thu, 21 Nov 2019 at 7:24 PM, Jiang, Yi J (CWM-NR) wrote:

> Hello,
>
> Thank you for replying. It is a retry, but why does the retry happen at the
> whole-application level?
>
> My understanding was that retries happen at the job level.
>
> Jacky
>
>
>
>
>
> *From:* hemant singh [mailto:hemant2...@gmail.com]
> *Sent:* November 21, 2019 3:12 AM
> *To:* Jiang, Yi J (CWM-NR) 
> *Cc:* Martin, Phil ; user@spark.apache.org
> *Subject:* Re: Spark onApplicationEnd run multiple times during the
> application failure
>
>
>
> Could it be because of a retry?
>
>
>
> Thanks
>
>
>
> On Thu, 21 Nov 2019 at 3:35 AM, Jiang, Yi J (CWM-NR) <yi.j.ji...@rbc.com.invalid> wrote:
>
> Hello
>
> We are running into an issue.
>
> We have subclassed SparkListener and added our listener to the Spark
> context. But when the Spark job fails, we find that the “onApplicationEnd”
> callback is triggered twice.
>
> Is it supposed to be triggered just once when the job fails? Since the
> application is only launched once, how can it be triggered twice on
> failure?
>
> Please let us know.
>
> Thank you


RE: Spark onApplicationEnd run multiple times during the application failure

2019-11-21 Thread Jiang, Yi J (CWM-NR)
Hello,
Thank you for replying. It is a retry, but why does the retry happen at the
whole-application level? My understanding was that retries happen at the job
level.
Jacky


From: hemant singh [mailto:hemant2...@gmail.com]
Sent: November 21, 2019 3:12 AM
To: Jiang, Yi J (CWM-NR) 
Cc: Martin, Phil ; user@spark.apache.org
Subject: Re: Spark onApplicationEnd run multiple times during the application 
failure

Could it be because of a retry?

Thanks

On Thu, 21 Nov 2019 at 3:35 AM, Jiang, Yi J (CWM-NR) <yi.j.ji...@rbc.com.invalid> wrote:
Hello
We are running into an issue.
We have subclassed SparkListener and added our listener to the Spark context.
But when the Spark job fails, we find that the “onApplicationEnd” callback is
triggered twice.
Is it supposed to be triggered just once when the job fails? Since the
application is only launched once, how can it be triggered twice on failure?
Please let us know.
Thank you





Re: Spark onApplicationEnd run multiple times during the application failure

2019-11-21 Thread hemant singh
Could it be because of a retry?

Thanks

On Thu, 21 Nov 2019 at 3:35 AM, Jiang, Yi J (CWM-NR) wrote:

> Hello
>
> We are running into an issue.
>
> We have subclassed SparkListener and added our listener to the Spark
> context. But when the Spark job fails, we find that the “onApplicationEnd”
> callback is triggered twice.
>
> Is it supposed to be triggered just once when the job fails? Since the
> application is only launched once, how can it be triggered twice on
> failure?
>
> Please let us know.
>
> Thank you