UDFs in Spark

2019-03-27 Thread Achilleus 003
A couple of questions regarding UDFs:
1) Is there a way to get all the registered UDFs in Spark (Scala)?
I couldn't find any straightforward API, but I found a pattern to get all the
registered UDFs:
spark.catalog.listFunctions.filter(_.className == null).collect

This does the trick, but I am not sure it will hold true in all cases. Is there a
better way to get all the registered UDFs?

2) Is there a way I can share my UDFs across sessions when not using a Databricks
notebook?
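
For illustration (not part of the original question), here is a minimal Scala sketch
of registering a UDF, listing user-registered functions with the pattern above, and
one way to make a function visible across sessions via a permanent function; the
function names, UDF class and JAR path are hypothetical.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("udf-demo").getOrCreate()

// Register a session-scoped UDF (visible only in this SparkSession).
spark.udf.register("toUpper", (s: String) => if (s == null) null else s.toUpperCase)

// List user-registered UDFs: built-in functions carry a className, UDFs do not.
spark.catalog.listFunctions.filter(_.className == null).show(false)

// To share a function across sessions, one option is a permanent function backed by
// a JAR on shared storage (requires Hive support; class and path are hypothetical):
spark.sql("""CREATE FUNCTION my_upper AS 'com.example.udf.MyUpper'
             USING JAR 'hdfs:///libs/my-udfs.jar'""")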
 

Sent from my iPhone



How to extract data in parallel from RDBMS tables

2019-03-27 Thread Surendra , Manchikanti
Hi All,

Is there any way to copy all the tables in parallel from RDBMS using Spark?
We are looking for a functionality similar to Sqoop.

Thanks,
Surendra
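
For illustration (not part of the original question), a minimal Scala sketch of
reading RDBMS tables in parallel over JDBC and writing them out, looping over a
table list for Sqoop-like behaviour; the URL, credentials, table names, partition
column and bounds are assumptions.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("jdbc-copy-demo").getOrCreate()

// Hypothetical connection details and table list.
val jdbcUrl = "jdbc:postgresql://dbhost:5432/sales"
val tables  = Seq("orders", "customers", "products")

tables.foreach { table =>
  // numPartitions plus partitionColumn/lowerBound/upperBound make Spark open
  // several JDBC connections and read slices of the table in parallel.
  val df = spark.read
    .format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", table)
    .option("user", "etl_user")
    .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
    .option("partitionColumn", "id")   // assumes a numeric, evenly spread key
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load()

  df.write.mode("overwrite").parquet(s"hdfs:///warehouse/$table")
}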


How Spark coordinates multi contender race on writing zookeeper? (Also on stackoverflow)

2019-03-27 Thread Zili Chen
Hi guys,

Recently I opened a question[1] on StackOverflow about leader election
with the ZooKeeper high-availability backend. It has puzzled me for some days
and it would really help if you could take a look or even share some
thoughts.

Copying the content to the mailing list:

Spark uses Curator's LeaderLatch for leader election, and a PersistenceEngine
for persistence. I'd like to know what happens if an old leader loses its
leadership but, before it gets notified, a new leader starts to work, in which
case both contenders regard themselves as the leader and write to ZooKeeper.
Isn't it a race condition if the new leader creates a znode on ZooKeeper but the
old one, which is no longer the leader, deletes it?
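
For illustration (not from the original question), a minimal Scala sketch of the
Curator LeaderLatch notification callbacks the question revolves around; the
ZooKeeper connection string and latch path are assumptions.

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.{LeaderLatch, LeaderLatchListener}
import org.apache.curator.retry.ExponentialBackoffRetry

// Hypothetical connection string and election path.
val client = CuratorFrameworkFactory.newClient(
  "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3))
client.start()

val latch = new LeaderLatch(client, "/demo/leader-election")
latch.addListener(new LeaderLatchListener {
  // Called when this contender becomes the leader.
  override def isLeader(): Unit = println("Gained leadership")
  // Called when leadership is lost; there is a window between actually losing the
  // ZooKeeper session and this callback firing, which is what the question asks about.
  override def notLeader(): Unit = println("Lost leadership")
})
latch.start()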

Best,
tison.

[1] https://stackoverflow.com/questions/55380498/how-spark-coordinates-multi-contender-race-on-writing-zookeeper



Spark migration to Kubernetes

2019-03-27 Thread thrisha
Hi All,



We have Spark Streaming pipelines (written in Java) currently running on YARN
in production. We are evaluating moving these streaming pipelines onto
Kubernetes. We have set up a working Kubernetes cluster. I have been reading the
Spark documentation and a few other blogs on migrating to Kubernetes.



1. It's not very clear how to migrate existing pipelines to Spark on
Kubernetes. Any pointers on this would be helpful.



2. Also, I am trying to run the sample wordcount example using the commands from the
documentation (https://spark.apache.org/docs/2.4.0/running-on-kubernetes.html#cluster-mode).
However, I am not able to figure out a way to pass in the Spark docker image as one of
the confs (spark.kubernetes.container.image). Our machines have no access to
the internet, so I have manually pre-loaded a Spark docker image available at
gcr.io into our local docker images. So, what should my spark-submit command look
like? (See the sketch after question 3 below.)



3. Would specifying spark.kubernetes.container.image.pullPolicy=IfNotPresent
cause Kubernetes to pull the docker image only if it is not already present in the
local docker image list?
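
For illustration (not from the original post), here is a sketch of a cluster-mode
spark-submit against Kubernetes that points spark.kubernetes.container.image at a
pre-loaded image and sets the pull policy from question 3; the API server address,
namespace, image name/tag and example jar path are assumptions.

bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.container.image=my-local-registry/spark:2.4.0 \
  --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar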



Any help in answering the above questions would be appreciated.






Re: Streaming data out of spark to a Kafka topic

2019-03-27 Thread Gabor Somogyi
Hi Mich,

Please take a look at how to write data into Kafka topic with DStreams:
https://github.com/gaborgsomogyi/spark-dstream-secure-kafka-sink-app/blob/62d64ce368bc07b385261f85f44971b32fe41327/src/main/scala/com/cloudera/spark/examples/DirectKafkaSinkWordCount.scala#L77
(DStreams has no native Kafka sink; if you need one, use Structured Streaming.)
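
For illustration (not part of the original reply), a minimal Scala sketch of the
usual DStream workaround: write the rows out with a plain KafkaProducer created per
partition inside foreachRDD/foreachPartition; the broker list, output topic and the
row-to-JSON conversion are assumptions.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.DataFrame

def sendToKafka(lookups: DataFrame, bootstrapServers: String, topicOut: String): Unit = {
  // Serialize each row to JSON and produce it; one producer per partition.
  lookups.toJSON.rdd.foreachPartition { partition =>
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    partition.foreach { json =>
      producer.send(new ProducerRecord[String, String](topicOut, json))
    }
    producer.flush()
    producer.close()
  }
}

// e.g. inside the foreachRDD block, after building the lookups dataframe:
// sendToKafka(lookups, bootstrapServers, topicsValueOut)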

BR,
G


On Wed, Mar 27, 2019 at 8:47 PM Mich Talebzadeh 
wrote:

> Hi,
>
> In a traditional we get data via Kafka into Spark streaming, do some work
> and write to a NoSQL database like Mongo, Hbase or Aerospike.
>
> That part can be done below and is best explained by the code as follows:
>
> Once a high value DF lookups is created I want send the data to a new
> topic for recipients!
>
> val kafkaParams = Map[String, String](
>   "bootstrap.servers" ->
> bootstrapServers,
>   "schema.registry.url" ->
> schemaRegistryURL,
>"zookeeper.connect" ->
> zookeeperConnect,
>"group.id" -> sparkAppName,
>"zookeeper.connection.timeout.ms"
> -> zookeeperConnectionTimeoutMs,
>"rebalance.backoff.ms" ->
> rebalanceBackoffMS,
>"zookeeper.session.timeout.ms" ->
> zookeeperSessionTimeOutMs,
>"auto.commit.interval.ms" ->
> autoCommitIntervalMS
>  )
> //val topicsSet = topics.split(",").toSet
> val topics = Set(topicsValue)
> val dstream = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)
> // This returns a tuple of key and value (since messages in Kafka are
> optionally keyed). In this case it is of type (String, String)
> dstream.cache()
> //
> val topicsOut = Set(topicsValueOut)
> val dstreamOut = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder](streamingContext, kafkaParams, topicsOut)
> dstreamOut.cache()
>
>
> dstream.foreachRDD
> { pricesRDD =>
>   if (!pricesRDD.isEmpty)  // data exists in RDD
>   {
> val op_time = System.currentTimeMillis.toString
> val spark =
> SparkSessionSingleton.getInstance(pricesRDD.sparkContext.getConf)
> val sc = spark.sparkContext
> import spark.implicits._
> var operation = new operationStruct(op_type, op_time)
> // Convert RDD[String] to RDD[case class] to DataFrame
> val RDDString = pricesRDD.map { case (_, value) =>
> value.split(',') }.map(p =>
> priceDocument(priceStruct(p(0).toString,p(1).toString,p(2).toString,p(3).toDouble,
> currency), operation))
> val df = spark.createDataFrame(RDDString)
> //df.printSchema
> var document = df.filter('priceInfo.getItem("price") > 90.0)
> MongoSpark.save(document, writeConfig)
>  println("Current time is: " + Calendar.getInstance.getTime)
>  totalPrices += document.count
>  var endTimeQuery = System.currentTimeMillis
>  println("Total Prices added to the collection so far: "
> +totalPrices+ " , Runnig for  " + (endTimeQuery -
> startTimeQuery)/(1000*60)+" Minutes")
>  // Check if running time > runTime exit
>  if( (endTimeQuery - startTimeQuery)/(10*60) > runTime)
>  {
>println("\nDuration exceeded " + runTime + " minutes exiting")
>System.exit(0)
>  }
>  // picking up individual arrays -->
> df.select('otherDetails.getItem("tickerQuotes")(0)) shows first element
>  //val lookups = df.filter('priceInfo.getItem("ticker") ===
> tickerWatch && 'priceInfo.getItem("price") > priceWatch)
>  val lookups = df.filter('priceInfo.getItem("price") > priceWatch)
>  if(lookups.count > 0) {
>println("High value tickers")
>lookups.select('priceInfo.getItem("timeissued").as("Time
> issued"), 'priceInfo.getItem("ticker").as("Ticker"),
> 'priceInfo.getItem("price").cast("Double").as("Latest price")).show
>
> // Now here I want to send the content of lookups DF to another kafka
> topic!!!
> //Note that above I have created a new dstreamOut with a new topic
> topicsOut
> How that can be done?
>  }
>   }
> }
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.

Re: Spark Profiler

2019-03-27 Thread Jack Kolokasis
Thanks for your reply. Your help is very valuable and all these links
are helpful (especially your example).


Best Regards

--Iacovos

On 3/27/19 10:42 PM, Luca Canali wrote:


I find that the Spark metrics system is quite useful to gather 
resource utilization metrics of Spark applications, including CPU, 
memory and I/O.


If you are interested, there is an example of how this works for us at:
https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark

If instead you are rather looking at ways to instrument your Spark 
code with performance metrics, Spark task metrics and event listeners 
are quite useful for that. See also 
https://github.com/apache/spark/blob/master/docs/monitoring.md and 
https://github.com/LucaCanali/sparkMeasure


Regards,

Luca

From: manish ranjan
Sent: Tuesday, March 26, 2019 15:24
To: Jack Kolokasis
Cc: user
Subject: Re: Spark Profiler

I have found Ganglia very helpful in understanding network I/O, CPU
and memory usage for a given Spark cluster.


I have not used it, but have heard good things about Dr Elephant (which
I think was contributed by LinkedIn, but I'm not 100% sure).


On Tue, Mar 26, 2019, 5:59 AM Jack Kolokasis wrote:


Hello all,

I am looking for a Spark profiler to trace my application to find
the bottlenecks. I need to trace CPU usage, memory usage and I/O usage.

I am looking forward to your reply.

--Iacovos






RE: Spark Profiler

2019-03-27 Thread Luca Canali
I find that the Spark metrics system is quite useful to gather resource 
utilization metrics of Spark applications, including CPU, memory and I/O.
If you are interested, there is an example of how this works for us at:
https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark
If instead you are rather looking at ways to instrument your Spark code with 
performance metrics, Spark task metrics and event listeners are quite useful 
for that. See also 
https://github.com/apache/spark/blob/master/docs/monitoring.md and 
https://github.com/LucaCanali/sparkMeasure
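
As an illustration (not from Luca's message), here is a minimal Scala sketch of the
event-listener approach: a SparkListener that accumulates per-task CPU time and peak
execution memory; the metric choice and aggregation are assumptions.

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

// Accumulates executor CPU time (ns) and the largest peak execution memory
// reported by finished tasks.
class SimpleTaskMetricsListener extends SparkListener {
  @volatile var totalCpuTimeNs: Long = 0L
  @volatile var maxPeakMemory: Long = 0L

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      totalCpuTimeNs += m.executorCpuTime
      maxPeakMemory = math.max(maxPeakMemory, m.peakExecutionMemory)
    }
  }
}

// Usage: register the listener before running the job, read the counters afterwards.
val spark = SparkSession.builder.appName("metrics-demo").getOrCreate()
val listener = new SimpleTaskMetricsListener
spark.sparkContext.addSparkListener(listener)
spark.range(0L, 1000000L).selectExpr("sum(id)").show()
println(s"CPU time (ms): ${listener.totalCpuTimeNs / 1000000}, " +
  s"peak execution memory (bytes): ${listener.maxPeakMemory}")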

Regards,
Luca

From: manish ranjan 
Sent: Tuesday, March 26, 2019 15:24
To: Jack Kolokasis 
Cc: user 
Subject: Re: Spark Profiler

I have found Ganglia very helpful in understanding network I/O, CPU and memory
usage for a given Spark cluster.
I have not used it, but have heard good things about Dr Elephant (which I think
was contributed by LinkedIn, but I'm not 100% sure).

On Tue, Mar 26, 2019, 5:59 AM Jack Kolokasis 
mailto:koloka...@ics.forth.gr>> wrote:
Hello all,

I am looking for a Spark profiler to trace my application to find
the bottlenecks. I need to trace CPU usage, memory usage and I/O usage.

I am looking forward to your reply.

--Iacovos




Streaming data out of spark to a Kafka topic

2019-03-27 Thread Mich Talebzadeh
Hi,

In a traditional setup we get data via Kafka into Spark Streaming, do some work
and write to a NoSQL database like MongoDB, HBase or Aerospike.

That part can be done as below and is best explained by the code that follows.

Once a high-value DF, lookups, is created I want to send the data to a new topic
for recipients!

val kafkaParams = Map[String, String](
  "bootstrap.servers" -> bootstrapServers,
  "schema.registry.url" -> schemaRegistryURL,
  "zookeeper.connect" -> zookeeperConnect,
  "group.id" -> sparkAppName,
  "zookeeper.connection.timeout.ms" -> zookeeperConnectionTimeoutMs,
  "rebalance.backoff.ms" -> rebalanceBackoffMS,
  "zookeeper.session.timeout.ms" -> zookeeperSessionTimeOutMs,
  "auto.commit.interval.ms" -> autoCommitIntervalMS
)
//val topicsSet = topics.split(",").toSet
val topics = Set(topicsValue)
val dstream = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)
// This returns a tuple of key and value (since messages in Kafka are
optionally keyed). In this case it is of type (String, String)
dstream.cache()
//
val topicsOut = Set(topicsValueOut)
val dstreamOut = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](streamingContext, kafkaParams, topicsOut)
dstreamOut.cache()


dstream.foreachRDD
{ pricesRDD =>
  if (!pricesRDD.isEmpty)  // data exists in RDD
  {
val op_time = System.currentTimeMillis.toString
val spark =
SparkSessionSingleton.getInstance(pricesRDD.sparkContext.getConf)
val sc = spark.sparkContext
import spark.implicits._
var operation = new operationStruct(op_type, op_time)
// Convert RDD[String] to RDD[case class] to DataFrame
val RDDString = pricesRDD.map { case (_, value) => value.split(',')
}.map(p =>
priceDocument(priceStruct(p(0).toString,p(1).toString,p(2).toString,p(3).toDouble,
currency), operation))
val df = spark.createDataFrame(RDDString)
//df.printSchema
var document = df.filter('priceInfo.getItem("price") > 90.0)
MongoSpark.save(document, writeConfig)
 println("Current time is: " + Calendar.getInstance.getTime)
 totalPrices += document.count
 var endTimeQuery = System.currentTimeMillis
 println("Total Prices added to the collection so far: "
+totalPrices+ " , Runnig for  " + (endTimeQuery -
startTimeQuery)/(1000*60)+" Minutes")
 // Exit if the running time exceeds runTime (minutes)
 if ((endTimeQuery - startTimeQuery)/(1000*60) > runTime)
 {
   println("\nDuration exceeded " + runTime + " minutes, exiting")
   System.exit(0)
 }
 // picking up individual arrays -->
df.select('otherDetails.getItem("tickerQuotes")(0)) shows first element
 //val lookups = df.filter('priceInfo.getItem("ticker") ===
tickerWatch && 'priceInfo.getItem("price") > priceWatch)
 val lookups = df.filter('priceInfo.getItem("price") > priceWatch)
 if(lookups.count > 0) {
   println("High value tickers")
   lookups.select('priceInfo.getItem("timeissued").as("Time
issued"), 'priceInfo.getItem("ticker").as("Ticker"),
'priceInfo.getItem("price").cast("Double").as("Latest price")).show

// Now here I want to send the content of the lookups DF to another Kafka
// topic!!!
// Note that above I have created a new dstreamOut with a new topic, topicsOut.

How can that be done?
 }
  }
}

Thanks


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Parquet File Output Sink - Spark Structured Streaming

2019-03-27 Thread Matt Kuiper
Thanks Gabor - your comment helps me clarify my question.


Yes, I have maxFilesPerTrigger set to 1 on the readStream call. I am also
seeing the Streaming Query process the single input file; however, a single input
file does not appear to result in the Streaming Query writing the output to
the Parquet file.


Matt




From: Gabor Somogyi 
Sent: Wednesday, March 27, 2019 10:20:18 AM
To: Matt Kuiper
Cc: user@spark.apache.org
Subject: Re: Parquet File Output Sink - Spark Structured Streaming

Hi Matt,

Maybe you could set maxFilesPerTrigger to 1.

BR,
G


On Wed, Mar 27, 2019 at 4:45 PM Matt Kuiper 
mailto:matt.kui...@polarisalpha.com>> wrote:

Hello,

I am new to Spark and Structured Streaming and have the following File Output 
Sink question:

Wondering what (and how to modify) triggers a Spark Structured Streaming Query
(with a Parquet File output sink configured) to write data to the parquet files.
I periodically feed the Stream input data (using the Stream Reader to read in
files), but it does not write output to a Parquet file for each file provided as
input. Once I have given it a few files, it tends to write a Parquet file
just fine.

I am wondering how to control the threshold to write. I would like to be able to
force a new write to a Parquet file for every new file provided as input (at
least for initial testing). Any tips appreciated!

Thanks,
Matt



Re: Parquet File Output Sink - Spark Structured Streaming

2019-03-27 Thread Gabor Somogyi
Hi Matt,

Maybe you could set maxFilesPerTrigger to 1.
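
For illustration (not part of the original reply), a minimal Structured Streaming
sketch in Scala that reads at most one file per micro-batch and writes each batch to
a Parquet sink; the input/output/checkpoint paths, schema and trigger interval are
assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("parquet-sink-demo").getOrCreate()

// Pick up at most one new file per micro-batch (file sources need an explicit schema).
val input = spark.readStream
  .option("maxFilesPerTrigger", "1")
  .schema("id INT, name STRING")
  .csv("/data/incoming")

// Write each micro-batch to Parquet; a checkpoint location is required for file sinks.
val query = input.writeStream
  .format("parquet")
  .option("path", "/data/parquet-out")
  .option("checkpointLocation", "/data/checkpoints/parquet-out")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

query.awaitTermination()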

BR,
G


On Wed, Mar 27, 2019 at 4:45 PM Matt Kuiper 
wrote:

> Hello,
>
> I am new to Spark and Structured Streaming and have the following File
> Output Sink question:
>
> Wondering what (and how to modify) triggers a Spark Structured Streaming
> Query (with a Parquet File output sink configured) to write data to the
> parquet files.  I periodically feed the Stream input data (using the
> Stream Reader to read in files), but it does not write output to a Parquet
> file for each file provided as input.   Once I have given it a few files,
> it tends to write a Parquet file just fine.
>
> I am wondering how to control the threshold to write.  I would like to be
> able to force a new write to a Parquet file for every new file provided as
> input (at least for initial testing).   Any tips appreciated!
>
> Thanks,
> Matt
>
>


Parquet File Output Sink - Spark Structured Streaming

2019-03-27 Thread Matt Kuiper
Hello,

I am new to Spark and Structured Streaming and have the following File Output 
Sink question:

Wondering what (and how to modify) triggers a Spark Structured Streaming Query
(with a Parquet File output sink configured) to write data to the parquet files.
I periodically feed the Stream input data (using the Stream Reader to read in
files), but it does not write output to a Parquet file for each file provided as
input. Once I have given it a few files, it tends to write a Parquet file
just fine.

I am wondering how to control the threshold to write. I would like to be able to
force a new write to a Parquet file for every new file provided as input (at
least for initial testing). Any tips appreciated!

Thanks,
Matt



Spark Kafka Batch Write guarantees

2019-03-27 Thread hemant singh
We are using Spark batch to write a Dataframe to a Kafka topic, using
write.format("kafka").
Does Spark provide a guarantee similar to the one it provides when saving a
dataframe to disk, i.e. that partial data is not written to Kafka: either the full
dataframe is saved, or, if the job fails, no data is written to the Kafka topic?
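
For illustration (not part of the original question), a minimal Scala sketch of the
batch write being asked about; the broker list, topic and columns are assumptions.
Note that the Kafka sink writes per task and is not transactional, so a failed job
can still leave some records in the topic (at-least-once rather than all-or-nothing).

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-batch-write-demo").getOrCreate()
import spark.implicits._

// Hypothetical dataframe; the Kafka sink expects "key" and "value" columns.
val df = Seq(("k1", "hello"), ("k2", "world")).toDF("key", "value")

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("topic", "demo-topic")
  .save()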

Thanks.


Re: writing a small csv to HDFS is super slow

2019-03-27 Thread Gezim Sejdiu
Hi Lian,

many thanks for the detailed information and for sharing the solution with us.
I will forward this to a student and hopefully it will resolve the issue.
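
For illustration (not part of the original reply), a small Scala sketch of the
divide-and-conquer approach Lian describes below: materialize each child dataframe
to HDFS first, then union the materialized results and write the final CSV; the
paths and child dataframes are hypothetical.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("union-materialize-demo").getOrCreate()

// Stand-ins for the many expensive child dataframes built elsewhere in the job.
val children: Seq[DataFrame] =
  (0 until 3).map(i => spark.range(i * 10, i * 10 + 10).toDF("id"))

// 1) Materialize each child so its (possibly expensive) plan is executed once.
children.zipWithIndex.foreach { case (child, i) =>
  child.write.mode("overwrite").parquet(s"hdfs:///tmp/children/part_$i")
}

// 2) Union the already-materialized children; this plan is now cheap to execute.
val union = children.indices
  .map(i => spark.read.parquet(s"hdfs:///tmp/children/part_$i"))
  .reduce(_ union _)

// 3) Write the small result as a single CSV file.
union.coalesce(1)
  .write.mode("overwrite")
  .option("header", "true")
  .csv("hdfs:///output/result")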

Best regards,

On Wed, Mar 27, 2019 at 1:55 AM Lian Jiang  wrote:

> Hi Gezim,
>
> My execution plan of the data frame to write into HDFS is a union of 140
> children dataframes. All these children data frames are not materialized
> when writing to HDFS. It is not saving file taking time. Instead, it is
> materializing the dataframes taking time. My solution is to materialize all
> the children dataframe and save into HDFS. Then union the pre-existing
> children dataframes and saving to HDFS is very fast.
>
> Hope this helps!
>
> On Tue, Mar 26, 2019 at 1:50 PM Gezim Sejdiu  wrote:
>
>> Hi Lian,
>>
>> I was following the thread since one of my students had the same issue.
>> The problem was when trying to save a larger XML dataset into HDFS and due
>> to the connectivity timeout between Spark and HDFS, the output wasn't able
>> to be displayed.
>> I also suggested him to do the same as @Apostolos said in the previous
>> mail, using saveAsTextFile instead (haven't got any result/reply after my
>> suggestion).
>>
>> Seeing the last commit date "*Jan 10, 2017*" made
>> on databricks/spark-csv [1] project, not sure how much inline with Spark
>> 2.x is. Even though there is a *note* about it on the README file.
>>
>> Would it be possible that you share your solution (in case the project is
>> open-sourced already) with us and then we can have a look at it?
>>
>> Many thanks in advance.
>>
>> Best regards,
>> [1]. https://github.com/databricks/spark-csv
>>
>> On Tue, Mar 26, 2019 at 1:09 AM Lian Jiang  wrote:
>>
>>> Thanks guys for reply.
>>>
>>> The execution plan shows a giant query. After divide and conquer, saving
>>> is quick.
>>>
>>> On Fri, Mar 22, 2019 at 4:01 PM kathy Harayama 
>>> wrote:
>>>
 Hi Lian,
 Since you using repartition(1), do you want to decrease the number of
 partitions? If so, have you tried to use coalesce instead?

 Kathleen

 On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang 
 wrote:

> Hi,
>
> Writing a csv to HDFS takes about 1 hour:
>
>
> df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)
>
> The generated csv file is only about 150kb. The job uses 3 containers
> (13 cores, 23g mem).
>
> Other people have similar issues but I don't see a good explanation
> and solution.
>
> Any clue is highly appreciated! Thanks.
>
>
>
>>
>> --
>>
>> _
>>
>> *Gëzim Sejdiu*
>>
>>
>>
>> *PhD Student & Research Associate*
>>
>> *SDA, University of Bonn*
>>
>> *Endenicher Allee 19a, 53115 Bonn, Germany*
>>
>> *https://gezimsejdiu.github.io/ *
>>
>> GitHub  | Twitter
>>  | LinkedIn
>>  | Google Scholar
>> 
>>
>>

-- 

_

*Gëzim Sejdiu*



*PhD Student & Research Associate*

*SDA, University of Bonn*

*Endenicher Allee 19a, 53115 Bonn, Germany*

*https://gezimsejdiu.github.io/ *

GitHub  | Twitter
 | LinkedIn
 | Google Scholar



RE: RPC timeout error for AES based encryption between driver and executor

2019-03-27 Thread Sinha, Breeta (Nokia - IN/Bangalore)
Hi Vanzin,

"spark.authenticate" is working properly for our environment (Spark 2.4 on 
Kubernetes).
We have made few code changes through which secure communication between driver 
and executor is working fine using shared spark.authenticate.secret.

Even SASL encryption works but when we set, 
spark.network.crypto.enabled true
to enable AES based encryption, we see RPC timeout error message sporadically.

Kind Regards,
Breeta


-Original Message-
From: Marcelo Vanzin  
Sent: Tuesday, March 26, 2019 9:10 PM
To: Sinha, Breeta (Nokia - IN/Bangalore) 
Cc: user@spark.apache.org
Subject: Re: RPC timeout error for AES based encryption between driver and 
executor

I don't think "spark.authenticate" works properly with k8s in 2.4 (which would 
make it impossible to enable encryption since it requires authentication). I'm 
pretty sure I fixed it in master, though.

On Tue, Mar 26, 2019 at 2:29 AM Sinha, Breeta (Nokia - IN/Bangalore) 
 wrote:
>
> Hi All,
>
>
>
> We are trying to enable RPC encryption between driver and executor. Currently 
> we're working on Spark 2.4 on Kubernetes.
>
>
>
> According to Apache Spark Security document 
> (https://spark.apache.org/docs/latest/security.html) and our understanding on 
> the same, it is clear that Spark supports AES-based encryption for RPC 
> connections. There is also support for SASL-based encryption, although it 
> should be considered deprecated.
>
>
>
> spark.network.crypto.enabled true , will enable AES-based RPC encryption.
>
> However, when we enable AES based encryption between driver and executor, we 
> could observe a very sporadic behaviour in communication between driver and 
> executor in the logs.
>
>
>
> Following are the options and their default values we used for
> enabling encryption:
>
>
>
> spark.authenticate true
>
> spark.authenticate.secret 
>
> spark.network.crypto.enabled true
>
> spark.network.crypto.keyLength 256
>
> spark.network.crypto.saslFallback false
>
>
>
> A snippet of the executor log is provided below:-
>
> Exception in thread "main" 19/02/26 07:27:08 ERROR RpcOutboxMessage: 
> Ask timeout before connecting successfully
>
> Caused by: java.util.concurrent.TimeoutException: Cannot receive any 
> reply from 
> sts-spark-thrift-server-1551165767426-driver-svc.default.svc:7078 in 
> 120 seconds
>
>
>
> But, there is no error message or any message from executor seen in the 
> driver log for the same timestamp.
>
>
>
> We also tried increasing spark.network.timeout, but no luck.
>
>
>
> This issue is seen sporadically, as the following observations were 
> noted:-
>
> 1) Sometimes, enabling AES encryption works completely fine.
>
> 2) Sometimes, enabling AES encryption works fine for around 10 consecutive 
> spark-submits but next trigger of spark-submit would go into hang state with 
> the above mentioned error in the executor log.
>
> 3) Also, there are times when enabling AES encryption does not work at all,
> as it keeps spawning more than 50 executors, where the executors fail
> with the above mentioned error.
>
> Even, setting spark.network.crypto.saslFallback to true didn't help.
>
>
>
> Things are working fine when we enable SASL encryption, that is, only 
> setting the following parameters:-
>
> spark.authenticate true
>
> spark.authenticate.secret 
>
>
>
> I have attached the log file containing detailed error message. Please let us 
> know if any configuration is missing or if any one has faced the same issue.
>
>
>
> Any leads would be highly appreciated!!
>
>
>
> Kind Regards,
>
> Breeta Sinha
>
>
>
>



--
Marcelo