Re: Why my spark job STATE--> Running FINALSTATE --> Undefined.

2019-06-11 Thread Akshay Bhardwaj
Hi Shyam,

It would help if you mentioned what you are using as the --master URL. Is
it running on YARN, Mesos, or a standalone Spark cluster?

That said, I faced a similar issue in my earlier trials with Spark, where I
created connections to several external databases, such as Cassandra, within
the driver (the main program of my app).
After the job completed, my main program/driver task never finished; after
debugging, I found the cause to be open sessions with Cassandra. Closing
those connections at the end of my main program resolved the problem. As you
can guess, the issue was independent of the cluster manager used.
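For illustration, a minimal sketch of that pattern - the Cassandra driver calls
and names below are assumptions for the example, not the exact code I had. The
point is to close every external session opened in the driver before main()
returns, so no non-daemon threads keep the driver process alive after the job
finishes:

import com.datastax.driver.core.Cluster
import org.apache.spark.sql.SparkSession

object MyJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("my-job").getOrCreate()

    // Hypothetical Cassandra session opened in the driver for lookups/bookkeeping
    val cluster = Cluster.builder().addContactPoint("cassandra-host").build()
    val session = cluster.connect("my_keyspace")

    try {
      // ... run the actual Spark job and any driver-side Cassandra calls here ...
    } finally {
      // Close driver-side connections so their threads don't keep the JVM alive
      session.close()
      cluster.close()
      spark.stop()
    }
  }
}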


Akshay Bhardwaj
+91-97111-33849


On Tue, Jun 11, 2019 at 7:41 PM Shyam P  wrote:

> Hi,
> Any clue why a Spark job goes into the UNDEFINED state?
>
> More details are at the URL below.
>
> https://stackoverflow.com/questions/56545644/why-my-spark-sql-job-stays-in-state-runningfinalstatus-undefined
>
>
> Appreciate your help.
>
> Regards,
> Shyam
>


What is the compatibility between releases?

2019-06-11 Thread email
Dear Community,

From what I understand, Spark uses a variation of Semantic Versioning [1],
but this is not enough for me to determine what is compatible across
versions.

 

For example, if my cluster is running Spark 2.3.1, can I develop using API
additions from Spark 2.4 (higher-order functions, to give an example)? What
about the other way around?

Typically, I assume that a job built against Spark 1.x will fail on Spark 2.x,
but that is also something I would like to have confirmed.

 

Thank you for your help!

 

[1] https://spark.apache.org/versioning-policy.html 



[pyspark 2.3+] count distinct returns different value every time it is run on the same dataset

2019-06-11 Thread Rishi Shah
Hi All,

countDistinct on a DataFrame returns different results every time it is run.
I would expect that with approxCountDistinct, but with countDistinct()?
Is there a way to get an accurate (deterministic) count using PySpark?
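For reference, a minimal sketch in Scala (the PySpark calls are analogous;
column and path names here are hypothetical) of the two APIs in question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{approx_count_distinct, countDistinct}

val spark = SparkSession.builder.appName("distinct-counts").getOrCreate()
val df = spark.read.parquet("/data/events")  // hypothetical input path

df.agg(
  countDistinct("user_id").as("exact_distinct"),               // exact distinct count
  approx_count_distinct("user_id", 0.01).as("approx_distinct") // HyperLogLog++, ~1% relative error
).show()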

-- 
Regards,

Rishi Shah


Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

2019-06-11 Thread Prudhvi Chennuru (CONT)
Hey Olivier,

 I am also facing the same issue on my Kubernetes cluster (v1.11.5)
on AWS with Spark version 2.3.3. Any luck in figuring out
the root cause?

On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi,
> I did not try another vendor, so I can't say if it's only related to
> GKE, and no, I did not notice anything on the kubelet or kube-dns
> processes...
>
> Regards
>
> On Fri, May 3, 2019 at 03:05, Li Gao wrote:
>
>> hi Olivier,
>>
>> This seems like a GKE-specific issue? Have you tried other vendors? Also,
>> on the kubelet nodes, did you notice any pressure on the DNS side?
>>
>> Li
>>
>>
>> On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi everyone,
>>> I have ~300 Spark jobs on Kubernetes (GKE) using the cluster auto-scaler,
>>> and sometimes while running these jobs a pretty bad thing happens: the
>>> driver (in cluster mode) gets scheduled on Kubernetes and launches many
>>> executor pods.
>>> So far so good, but the k8s "Service" associated with the driver does not
>>> seem to be propagated in terms of DNS resolution, so all the executors fail
>>> with a "spark-application-..cluster.svc.local" does not exist error.
>>>
>>> With all executors failing, the driver should be failing too, but it considers
>>> this a "pending" initial allocation and stays stuck forever in a loop
>>> of "Initial job has not accepted any resources, please check Cluster UI".
>>>
>>> Has anyone else observed this kind of behaviour?
>>> We had it on 2.3.1, and I upgraded to 2.4.1, but the issue still seems to
>>> exist even after the "big refactoring" of the Kubernetes cluster scheduler
>>> backend.
>>>
>>> I can work on a fix / workaround, but I'd like to check with you on the
>>> proper way forward:
>>>
>>>    - Some processes (like the Airflow Helm recipe) rely on a "sleep 30s"
>>>    before launching the dependent pods (that could be added to the
>>>    /opt/entrypoint.sh used in the Kubernetes packaging)
>>>    - We could add a simple step to the init container that attempts the
>>>    DNS resolution and fails after 60s if it does not succeed
>>>
>>> But these steps won't change the fact that the driver will stay stuck,
>>> thinking we're still within the initial allocation delay.
>>>
>>> Thoughts ?
>>>
>>> --
>>> *Olivier Girardot*
>>> o.girar...@lateral-thoughts.com
>>>
>>

-- 
*Thanks,*
*Prudhvi Chennuru.*




RE: Spark on Kubernetes - log4j.properties not read

2019-06-11 Thread Dave Jaffe
That did the trick, Abhishek! Thanks for the explanation, that answered a lot
of questions I had.

Dave






Why my spark job STATE--> Running FINALSTATE --> Undefined.

2019-06-11 Thread Shyam P
Hi,
Any clue why a Spark job goes into the UNDEFINED state?

More details are at the URL below.
https://stackoverflow.com/questions/56545644/why-my-spark-sql-job-stays-in-state-runningfinalstatus-undefined


Appreciate your help.

Regards,
Shyam


Re: Fwd: [Spark SQL Thrift Server] Persistence errors with PostgreSQL and MySQL in 2.4.3

2019-06-11 Thread rmartine
Hi folks,

Does anyone know what is happening in this case? I tried both MySQL and
PostgreSQL, and neither of them finishes schema creation without errors. It seems
something changed from 2.2 to 2.4 that broke schema generation for the Hive
Metastore.






AWS EMR slow write to HDFS

2019-06-11 Thread Femi Anthony


I'm writing a large dataset in Parquet format to HDFS using Spark, and it runs
rather slowly on EMR versus, say, Databricks. I understand that if I were able to use
Hadoop 3.1, it would be much more performant because it has a high-performance
output committer. Is this the case, and if so, when will there be a version of
EMR that uses Hadoop 3.1? The current version I'm using is 5.21.
Sent from my iPhone



Re: Spark kafka streaming job stopped

2019-06-11 Thread Amit Sharma
Please provide an update if anyone knows.

On Monday, June 10, 2019, Amit Sharma  wrote:

>
> We have a Spark Kafka streaming job running on a standalone Spark cluster. We
> have the Kafka architecture below:
>
> 1. Two clusters running in two data centers.
> 2. There is an LTM on top of each data center (load balancer).
> 3. There is a GSLB on top of the LTMs.
>
> I observed that whenever any node in the Kafka cluster is down, our Spark
> streaming job stops. We are using the GSLB URL in our code to connect to Kafka,
> not the IP addresses. Please let me know whether this is expected behavior; if
> not, what config do we need to change?
>
> Thanks
> Amit
>


Re: best docker image to use

2019-06-11 Thread Riccardo Ferrari
Hi Marcelo,

I usually work with https://github.com/jupyter/docker-stacks. There's a
Scala + Jupyter option too, though there might be better options with Zeppelin
as well.
HTH


On Tue, 11 Jun 2019, 11:52 Marcelo Valle,  wrote:

> Hi,
>
> I would like to run spark-shell + Scala in a Docker environment, just to
> play on my development machine without having to install a JVM + a
> lot of other things.
>
> Is there an "official Docker image" I am recommended to use?
> I saw some on Docker Hub, but it seems they are all contributions from
> proactive individuals. I wonder whether the group maintaining Apache Spark
> also maintains docker images for use cases like this?
>
> Thanks,
> Marcelo.
>
>


Re: Read hdfs files in spark streaming

2019-06-11 Thread nitin jain
Hi Deepak,
Please let us know how you managed it.

Thanks,
NJ

On Mon, Jun 10, 2019 at 4:42 PM Deepak Sharma  wrote:

> Thanks All.
> I managed to get this working.
> Marking this thread as closed.
>
> On Mon, Jun 10, 2019 at 4:14 PM Deepak Sharma 
> wrote:
>
>> This is the project requirement, where paths are being streamed into a Kafka
>> topic.
>> It seems it's not possible using Spark Structured Streaming.
>>
>>
>> On Mon, Jun 10, 2019 at 3:59 PM Shyam P  wrote:
>>
>>> Hi Deepak,
>>>  Why are you getting paths from a Kafka topic? Any specific reason to do
>>> so?
>>>
>>> Regards,
>>> Shyam
>>>
>>> On Mon, Jun 10, 2019 at 10:44 AM Deepak Sharma 
>>> wrote:
>>>
 The context is different here.
 The file paths are coming as messages in a Kafka topic.
 Spark Structured Streaming consumes from this topic.
 Now it has to get the value from the message (thus the path to the file) and
 read the JSON stored at that file location into another DataFrame.

 Thanks
 Deepak

 On Sun, Jun 9, 2019 at 11:03 PM vaquar khan 
 wrote:

> Hi Deepak,
>
> You can use textFileStream.
>
> https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
>
> Please start using Stack Overflow to ask questions, so other people also get
> the benefit of the answers.
>
>
> Regards,
> Vaquar khan
>
> On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma 
> wrote:
>
>> I am using a Spark streaming application to read from Kafka.
>> The value coming in the Kafka message is the path to an HDFS file.
>> I am using Spark 2.x, spark.read.stream.
>> What is the best way to read this path in Spark streaming and then
>> read the JSON stored at the HDFS path, maybe using spark.read.json, into
>> a df inside the Spark streaming app?
>> Thanks a lot in advance
>>
>> --
>> Thanks
>> Deepak
>>
>

 --
 Thanks
 Deepak
 www.bigdatabig.com
 www.keosha.net

>>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: Spark structured streaming leftOuter join not working as I expect

2019-06-11 Thread Jungtaek Lim
Got the point. If you would like to get "correct" output, you may need to
set the global watermark policy to "min", because the watermark is not only used
for evicting rows from state, but also for discarding input rows that arrive later
than the watermark. Here you may want to be aware that there are two stateful
operators which receive inputs from the previous stage and discard them
via the watermark before processing.

Btw, you may also need to consider how the concept of a watermark differs
between Spark and other frameworks:

1. Spark uses a high watermark (it picks the highest event timestamp of the input
rows) even for a single watermark, whereas other frameworks use a low watermark
(the lowest event timestamp of the input rows). So you may always need to set a
long enough delay on the watermark.

2. Spark uses a global watermark, whereas other frameworks normally use
operator-wise watermarks. This is a limitation of Spark (given that the outputs of
a previous stateful operator become the inputs of the next stateful operator,
they should have different watermarks), and one contributor has proposed an
approach [1] that would fit Spark (unfortunately it hasn't been reviewed by
committers for a long time).
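
As a rough sketch (topic names, columns, and delay values below are placeholders,
not taken from your code), this is the shape I mean: a watermark on each join
input, generous delays, and the global watermark policy pinned via
spark.sql.streaming.multipleWatermarkPolicy (available since Spark 2.4):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder.appName("stream-stream-left-outer").getOrCreate()
// Take the minimum of the per-input watermarks as the global watermark
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "min")

// Requires the spark-sql-kafka-0-10 package on the classpath
val left = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topicA")
  .load()
  .selectExpr("CAST(key AS STRING) AS a_id", "timestamp AS a_ts")
  .withWatermark("a_ts", "30 minutes")   // generous delay, since Spark tracks max event time

val right = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topicB")
  .load()
  .selectExpr("CAST(key AS STRING) AS b_id", "timestamp AS b_ts")
  .withWatermark("b_ts", "30 minutes")

// Left outer join needs watermarks on both sides plus an event-time range condition
val joined = left.join(
  right,
  expr("a_id = b_id AND b_ts BETWEEN a_ts AND a_ts + INTERVAL 15 minutes"),
  "leftOuter")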

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://github.com/apache/spark/pull/23576

On Tue, Jun 11, 2019 at 7:06 AM Joe Ammann  wrote:

> Hi all
>
> It took me some time to extract the issues into a piece of
> standalone code. I created the following gist:
>
> https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17
>
> It has messages for 4 topics A/B/C/D and a simple Python program which
> shows 6 use cases, with my expectations and observations with Spark 2.4.3.
>
> It would be great if you could have a look and check whether I'm doing
> something wrong, or whether this is indeed a limitation of Spark.
>
> On 6/5/19 5:35 PM, Jungtaek Lim wrote:
> > Nice to hear you're investigating the issue deeply.
> >
> > Btw, if attaching code is not easy, maybe you could share the
> > logical/physical plan of any batch: "detail" in the SQL tab shows the
> > plan as a string. Plans from sequential batches would be very helpful - and
> > the streaming query status in these batches (especially the watermark) would be
> > helpful too.
> >
>
>
> --
> CU, Joe
>


-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior


best docker image to use

2019-06-11 Thread Marcelo Valle
Hi,

I would like to run spark-shell + Scala in a Docker environment, just to
play on my development machine without having to install a JVM + a
lot of other things.

Is there an "official Docker image" I am recommended to use? I
saw some on Docker Hub, but it seems they are all contributions from
proactive individuals. I wonder whether the group maintaining Apache Spark
also maintains docker images for use cases like this?

Thanks,
Marcelo.



AW: Getting driver logs in Standalone Cluster

2019-06-11 Thread Lourier, Jean-Michel (FIX1)
Hi Patrick,

I guess the easiest way is to use log aggregation: 
https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application

BR

Jean-Michel







-----Original Message-----
From: tkrol
Sent: Friday, June 7, 2019 16:22
To: user@spark.apache.org
Subject: Getting driver logs in Standalone Cluster

Hey Guys,

I am wondering what the best way is to get the driver logs in cluster mode on a
standalone cluster. Normally I run in client mode, so I can capture the logs from
the console.

Now I've started running jobs in cluster mode; the driver obviously runs on a
worker, so I can't see the logs.

I would like to store the logs (preferably in HDFS). Is there any easy way to do that?

Thanks






Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-11 Thread Georg Heiler
For grouping by each one: look into grouping sets
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-multi-dimensional-aggregation.html
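
A minimal sketch of what that looks like (the "activity" table and its period
columns are placeholder names, assuming they are precomputed): one pass over the
data instead of a separate groupBy per period column.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("grouping-sets").getOrCreate()
spark.sql("""
  SELECT date, week, month, quarter, year,
         COUNT(DISTINCT user_id) AS active_users
  FROM activity
  GROUP BY date, week, month, quarter, year
  GROUPING SETS ((date), (week), (month), (quarter), (year))
""").show()  // non-grouped period columns come back as NULL in each grouping set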

On Tue, Jun 11, 2019 at 06:09, Rishi Shah <rishishah.s...@gmail.com> wrote:

> Thank you both for your input!
>
> To calculate a moving average of active users, could you comment on whether
> to go for an RDD-based implementation or a DataFrame? If a DataFrame, will
> window functions work here?
>
> In general, how would Spark behave when working with a DataFrame with date,
> week, month, quarter, and year columns, grouping by each one, one by one?
>
>
>
> On Sun, Jun 9, 2019 at 1:17 PM Jörn Franke  wrote:
>
>> Depending on what accuracy is needed, hyperloglogs can be an interesting
>> alternative
>> https://en.m.wikipedia.org/wiki/HyperLogLog
>>
>> On Jun 9, 2019 at 15:59, big data wrote:
>>
>> In my opinion, a bitmap is the best solution for active-user calculation.
>> Other solutions are mostly based on a count(distinct) calculation, which
>> is slower.
>>
>> If you've implemented a bitmap solution, including how to build and how to
>> load the bitmap, then a bitmap is the best choice.
>> On Jun 5, 2019 at 6:49 PM, Rishi Shah wrote:
>>
>> Hi All,
>>
>> Is there a best practice around calculating daily, weekly, monthly,
>> quarterly, yearly active users?
>>
>> One approach is to create a window of daily bitmaps and aggregate them by
>> period later. However, I was wondering if anyone has a better approach to
>> tackling this problem.
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>>
>
> --
> Regards,
>
> Rishi Shah
>


Re: Spark 2.2 With Column usage

2019-06-11 Thread Jacek Laskowski
Hi,

Why are you doing the following two lines?

.select("id",lit(referenceFiltered))
.selectexpr(
"id"
)

What are you trying to achieve? What's lit and what's referenceFiltered?
What's the difference between select and selectexpr? Please start at
http://spark.apache.org/docs/latest/sql-programming-guide.html and then hop onto
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
to get to know the Spark API better. I'm sure you'll quickly find out the answer(s).
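
For a quick illustration (made-up data, not your code): lit() wraps a literal
value into a Column for the Column-based select, while selectExpr parses SQL
expression strings.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder.appName("select-vs-selectExpr").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

// select takes Column objects; lit turns a constant into a Column
df.select(col("id"), lit("some_constant").as("source")).show()

// selectExpr takes SQL expression strings
df.selectExpr("id", "upper(name) AS name_upper").show()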

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming
https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski



On Sat, Jun 8, 2019 at 12:53 PM anbutech  wrote:

> Thanks, Jacek Laskowski sir, but I didn't get the point here.
>
> Please advise: is the below what you are expecting?
>
> dataset1.as("t1")
>   .join(dataset3.as("t2"),
>     col(t1.col1) === col(t2.col1), JOINTYPE.Inner)
>   .join(dataset4.as("t3"),
>     col(t3.col1) === col(t1.col1), JOINTYPE.Inner)
>   .select("id", lit(referenceFiltered))
>   .selectexpr(
>     "id"
>   )
>
>
>
>
>