[Structured Streaming] Connecting to Kafka via a Custom Consumer / Producer

2020-04-22 Thread Patrick McGloin
Hi, The large international bank I work for has a custom Kafka implementation. The client libraries that are used to connect to Kafka have extra security steps. They implement the Kafka Consumer and Producer interfaces in this client library so once we use it to connect to Kafka, we can treat
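
A hedged aside on this thread: I am not aware of a supported way to hand the built-in Kafka source a custom Consumer/Producer implementation, but options prefixed with "kafka." are forwarded to the underlying Kafka client, which covers many in-house security setups. A sketch, with all servers, protocols and paths as placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("secure-kafka-read").getOrCreate()

    // "kafka."-prefixed options are passed straight through to the Kafka consumer config
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093,broker2:9093")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.ssl.truststore.location", "/path/to/truststore.jks")
      .option("subscribe", "my-topic")
      .load()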

Re: Spark Structured Streaming resource contention / memory issue

2018-10-15 Thread Patrick McGloin
> 1. https://issues.apache.org/jira/browse/SPARK-24441 > 2. https://issues.apache.org/jira/browse/SPARK-24637 > 3. https://issues.apache.org/jira/browse/SPARK-24717 > > > On Fri, 12 Oct 2018 at 21:31, Patrick McGloin wrote: > >> Hi all, I sent this earlier but the screenshots were not a

Spark Structured Streaming resource contention / memory issue

2018-10-12 Thread Patrick McGloin
Hi all, We have a Spark Structured Streaming query which uses mapGroupsWithState. After processing in a stable manner for some time, each mini-batch suddenly starts taking 40 seconds. Suspiciously, it looks like exactly 40 seconds each time. Before this the batches were taking less than a

Re: How to Create one DB connection per executor and close it after the job is done?

2018-07-30 Thread Patrick McGloin
You could use an object in Scala, of which only one instance will be created on each JVM / Executor. E.g. object MyDatabaseSingleton { var dbConn = ??? } On Sat, 28 Jul 2018, 08:34 kant kodali wrote: > Hi All, > > I understand creating a connection forEachPartition but I am wondering can >
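
A minimal sketch of the singleton-per-executor idea from this reply, assuming a JDBC connection; the URL, credentials, the MyDatabaseSingleton name and the rdd being written are placeholders:

    import java.sql.{Connection, DriverManager}

    object MyDatabaseSingleton {
      // initialised lazily, once per executor JVM, on first use
      lazy val dbConn: Connection =
        DriverManager.getConnection("jdbc:postgresql://dbhost:5432/mydb", "user", "password")
    }

    // inside tasks, e.g. per partition; every task on the same executor reuses the connection
    rdd.foreachPartition { rows =>
      val conn = MyDatabaseSingleton.dbConn
      rows.foreach { row => /* write row using conn */ }
    }

Closing such a connection "after the job is done" usually means registering a JVM shutdown hook on the executor, e.g. sys.addShutdownHook(MyDatabaseSingleton.dbConn.close()), since executors are not told explicitly when the job finishes.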

Re: How to handle java.sql.Date inside Maps with to_json / from_json

2018-06-28 Thread Patrick McGloin
Hi all, I tested this with a Date outside a map and it works fine, so I think the issue is only with Dates inside Maps. I will create a Jira for this unless there are objections. Best regards, Patrick On Thu, 28 Jun 2018, 11:53 Patrick McGloin wrote: > Consider the following test, wh

How to handle java.sql.Date inside Maps with to_json / from_json

2018-06-28 Thread Patrick McGloin
Consider the following test, which will fail on the final show: case class UnitTestCaseClassWithDateInsideMap(map: Map[Date, Int]) test("Test a Date as key in a Map") { val map = UnitTestCaseClassWithDateInsideMap(Map(Date.valueOf("2018-06-28") -> 1)) val options =
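
A reconstructed sketch of the kind of round-trip described here (not the original test verbatim; the options in the original preview are cut off, so none are supplied below). It assumes a SparkSession named spark:

    import java.sql.Date
    import org.apache.spark.sql.functions.{col, from_json, struct, to_json}
    import spark.implicits._

    case class UnitTestCaseClassWithDateInsideMap(map: Map[Date, Int])

    val ds = Seq(UnitTestCaseClassWithDateInsideMap(Map(Date.valueOf("2018-06-28") -> 1))).toDS()
    val schema = ds.schema

    // serialise to JSON and parse it back with the original schema
    val asJson = ds.select(to_json(struct(col("map"))).as("json"))
    val roundTrip = asJson.select(from_json(col("json"), schema).as("parsed")).select("parsed.*")
    roundTrip.show()   // the final show is where the failure was reported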

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread Patrick McGloin
# Write key-value data from a DataFrame to a Kafka topic specified in an option query = df \ .selectExpr("CAST(userId AS STRING) AS key", "to_json(struct(*)) AS value") \ .writeStream \ .format("kafka") \ .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \ .option("topic",
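
For reference, the same idea in Scala (a sketch; df is assumed to be a streaming DataFrame with a userId column, and the topic name and checkpoint location below are placeholders):

    val query = df
      .selectExpr("CAST(userId AS STRING) AS key", "to_json(struct(*)) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("topic", "events")
      .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
      .start()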

Re: Structured Streaming + initialState

2017-05-06 Thread Patrick McGloin
ts in > a database? > If it's in a database, then when you initialize the GroupState, you can fetch > it from the database. > > On Fri, May 5, 2017 at 7:35 AM, Patrick McGloin <mcgloin.patr...@gmail.com > > wrote: > >> Hi all, >> >> With Spark Structure
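
A sketch of the approach suggested in this reply: load the initial value from your own store the first time a key is seen inside the state function. Event, SessionState and loadStateFromDb are hypothetical, and the sketch is written against the GroupState API; it assumes import spark.implicits._ is in scope for the encoders.

    import org.apache.spark.sql.Dataset
    import org.apache.spark.sql.streaming.GroupState

    case class Event(userId: String, value: Long)
    case class SessionState(total: Long)

    // hypothetical lookup against whatever database holds the initial state
    def loadStateFromDb(userId: String): Option[SessionState] = None

    def updateState(userId: String, events: Iterator[Event], state: GroupState[SessionState]): SessionState = {
      val current =
        if (state.exists) state.get
        else loadStateFromDb(userId).getOrElse(SessionState(0L))   // first time this key is seen
      val updated = SessionState(current.total + events.map(_.value).sum)
      state.update(updated)
      updated
    }

    // assuming events: Dataset[Event]
    val sessions = events.groupByKey(_.userId).mapGroupsWithState(updateState _)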

Structured Streaming + initialState

2017-05-05 Thread Patrick McGloin
Hi all, With Spark Structured Streaming, is there a possibility to set an "initial state" for a query? Using a join between a streaming Dataset and a static Dataset does not support full joins. Using mapGroupsWithState to create a GroupState does not support an initialState (as the Spark

Streaming WriteAheadLogBasedBlockHandler disallows parallelism via StorageLevel replication factor

2016-04-13 Thread Patrick McGloin
Hi all, If I am using a Custom Receiver with the storage level set to StorageLevel.MEMORY_ONLY_SER_2 and the WAL enabled, I get this warning in the logs: 16/04/13 14:03:15 WARN WriteAheadLogBasedBlockHandler: Storage level replication 2 is unnecessary when write ahead log is enabled, change to

Understanding Spark Task failures

2016-01-28 Thread Patrick McGloin
I am trying to understand what will happen when Spark has an exception during processing, especially while streaming. If I have a small code snippet like this: myDStream.foreachRDD { (rdd: RDD[String]) => println(s"processed => [${rdd.collect().toList}]") throw new Exception("User

Re: Understanding Spark Task failures

2016-01-28 Thread Patrick McGloin
ime it retries that app code. > > On Thu, Jan 28, 2016 at 8:51 AM, Patrick McGloin < > mcgloin.patr...@gmail.com> wrote: > >> I am trying to understand what will happen when Spark has an exception >> during processing, especially while streaming. >> >> If I
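
As an aside (not quoted from the thread), the number of retries mentioned here is controlled by spark.task.maxFailures, e.g.:

    import org.apache.spark.SparkConf

    // a failing task is re-attempted up to this many times before the stage is failed
    val conf = new SparkConf()
      .setAppName("task-failure-demo")
      .set("spark.task.maxFailures", "4")   // 4 is the default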

“java.io.IOException: Class not found” on long running Streaming application

2016-01-28 Thread Patrick McGloin
I am getting the exception below on a long-running Spark Streaming application. The exception could occur after a few minutes, but it may also not happen for days. This is with pretty consistent input data. I have seen this Jira ticket (

Re: Spark Streaming Write Ahead Log (WAL) not replaying data after restart

2016-01-26 Thread Patrick McGloin
ing-programming-guide.html#how-to-configure-checkpointing > > On Thu, Jan 21, 2016 at 3:32 AM, Patrick McGloin < > mcgloin.patr...@gmail.com> wrote: > >> Hi all, >> >> To have a simple way of testing the Spark Streaming Write Ahead Log I >> created a very
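
The checkpoint-recovery pattern that link refers to, as a sketch; the paths, app name and batch interval are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val conf = new SparkConf()
        .setAppName("wal-test")
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint("hdfs:///checkpoints/wal-test")
      // define the input streams and the processing graph here, before returning ssc
      ssc
    }

    // on restart, the context (and unprocessed WAL data) is recovered from the checkpoint
    val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/wal-test", createContext _)
    ssc.start()
    ssc.awaitTermination()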

Spark Streaming Write Ahead Log (WAL) not replaying data after restart

2016-01-21 Thread Patrick McGloin
Hi all, To have a simple way of testing the Spark Streaming Write Ahead Log I created a very simple custom input receiver, which generates strings and stores them: class InMemoryStringReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) { val batchID =
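
A minimal sketch of a string-generating receiver along these lines (not the original InMemoryStringReceiver; the class name, message format and sleep interval are made up):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class GeneratingStringReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

      @volatile private var running = true

      def onStart(): Unit = {
        running = true
        new Thread("string-generator") {
          override def run(): Unit = {
            var i = 0L
            while (running && !isStopped()) {
              store(s"message-$i")   // store() hands the string to Spark (and the WAL, if enabled)
              i += 1
              Thread.sleep(100)
            }
          }
        }.start()
      }

      def onStop(): Unit = { running = false }
    }

    // usage sketch: val lines = ssc.receiverStream(new GeneratingStringReceiver)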

Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Patrick McGloin
> df.coalesce(1).repartition("entity", "year", "month", "day", > "status").write.partitionBy("entity", "year", "month", "day", > "status").mode(SaveMode.Append).parquet(s"$location")

DataFrame partitionBy to a single Parquet file (per partition)

2016-01-14 Thread Patrick McGloin
Hi, I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could do that like this: df.coalesce(1).write.partitionBy("entity", "year", "month", "day",
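
A sketch of the usual way to get one file per output partition: repartition by the same columns passed to partitionBy, so each partition directory is written by a single task. It assumes df and location from the thread and the Column-based repartition overload (Spark 1.6+):

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    df.repartition(col("entity"), col("year"), col("month"), col("day"), col("status"))
      .write
      .partitionBy("entity", "year", "month", "day", "status")
      .mode(SaveMode.Append)
      .parquet(s"$location")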

Re: Data in one partition after reduceByKey

2015-11-23 Thread Patrick McGloin
h + dateTime.getMinuteOfDay + dateTime.getSecondOfDay sum % numPartitions case _ => 0 } } } On 20 November 2015 at 17:17, Patrick McGloin <mcgloin.patr...@gmail.com> wrote: > Hi, > > I have a Spark application which contains the following segment: > > val reparition
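
A reconstructed sketch of a partitioner along the lines quoted above, keyed on a Joda DateTime (the class name and the usage line are assumptions):

    import org.apache.spark.Partitioner
    import org.joda.time.DateTime

    class TimeBasedPartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = key match {
        case dateTime: DateTime =>
          // spread keys using several time components instead of relying on the default hashCode
          val sum = dateTime.getDayOfYear + dateTime.getMinuteOfDay + dateTime.getSecondOfDay
          sum % numPartitions
        case _ => 0
      }
    }

    // usage sketch: mapped.reduceByKey(new TimeBasedPartitioner(16), (a, b) => merge(a, b))
    // where merge is whatever combine function the job already uses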

Data in one partition after reduceByKey

2015-11-20 Thread Patrick McGloin
Hi, I have a Spark application which contains the following segment: val reparitioned = rdd.repartition(16) val filtered: RDD[(MyKey, myData)] = MyUtils.filter(reparitioned, startDate, endDate) val mapped: RDD[(DateTime, myData)] = filtered.map(kv => (kv._1.processingTime, kv._2)) val reduced:

Re: Spark UI consuming lots of memory

2015-10-27 Thread Patrick McGloin
Hi Nicholas, I think you are right about the issue relating to SPARK-11126, I'm seeing it as well. Did you find any workaround? Looking at the pull request for the fix, it doesn't look possible. Best regards, Patrick On 15 October 2015 at 19:40, Nicholas Pritchard <
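
A hedged aside for anyone hitting this later (not from the thread): the amount of state the UI retains can be reduced with settings like the ones below, although that may not fully work around the leak described in SPARK-11126.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.ui.retainedJobs", "200")            // defaults are higher
      .set("spark.ui.retainedStages", "200")
      .set("spark.sql.ui.retainedExecutions", "50")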

Fwd: [akka-user] Akka Camel plus Spark Streaming

2014-10-27 Thread Patrick McGloin
the following error is logged by the worker who tries to use Akka Camel: -- Forwarded message -- From: Patrick McGloin mcgloin.patr...@gmail.com Date: 24 October 2014 15:09 Subject: Re: [akka-user] Akka Camel plus Spark Streaming To: akka-u...@googlegroups.com Hi Patrik, Thanks

Re: [akka-user] Akka Camel plus Spark Streaming

2014-10-27 Thread Patrick McGloin
() } In local mode this works. When deployed to the Spark Cluster the following error is logged by the worker who tries to use Akka Camel: -- Forwarded message -- From: Patrick McGloin mcgloin.patr...@gmail.com Date: 24 October 2014 15:09 Subject: Re: [akka-user] Akka Camel

Re: Spark SQL + Hive + JobConf NoClassDefFoundError

2014-10-01 Thread Patrick McGloin
FYI, in case anybody else has this problem, we switched to Spark 1.1 (outside CDH) and the same Spark application worked first time (once recompiled with Spark 1.1 libs of course). I assume this is because Spark 1.1 is compiled with Hive. On 29 September 2014 17:41, Patrick McGloin mcgloin.patr

Spark SQL + Hive + JobConf NoClassDefFoundError

2014-09-29 Thread Patrick McGloin
Hi, I have an error when submitting a Spark SQL application to our Spark cluster: 14/09/29 16:02:11 WARN scheduler.TaskSetManager: Loss was due to java.lang.NoClassDefFoundError java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobConf at

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Patrick McGloin
Hi Amit, I think the type of the data contained in your RDD needs to be a known case class, not an abstract type, for createSchemaRDD. This makes sense when you consider that it needs to know about the fields in the object to create the schema. I had the same issue when I used an abstract base class for a
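
A small sketch of the distinction, against the Spark 1.x SchemaRDD API (the class names are made up, sc and sqlContext are assumed, and exact registration method names varied slightly across 1.x releases):

    import org.apache.spark.rdd.RDD

    abstract class Event
    case class Click(userId: String, url: String) extends Event

    val clicks: RDD[Click] = sc.parallelize(Seq(Click("u1", "/home")))

    import sqlContext.createSchemaRDD      // implicit RDD[case class] -> SchemaRDD conversion
    clicks.registerTempTable("clicks")     // fine: Click is a concrete case class

    // val events: RDD[Event] = clicks.map(c => c: Event)
    // events.registerTempTable("events")  // does not compile: Event is abstract, so no schema can be derived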

Re: Spark SQL, Parquet and Impala

2014-08-02 Thread Patrick McGloin
the table? This sounds like a configuration that needs to be changed on the impala side. On Fri, Aug 1, 2014 at 7:20 AM, Patrick McGloin mcgloin.patr...@gmail.com wrote: Sorry, sent early, wasn't finished typing. CREATE EXTERNAL TABLE Then we can select the data using Impala

Spark SQL, Parquet and Impala

2014-08-01 Thread Patrick McGloin
Hi, We would like to use Spark SQL to store data in Parquet format and then query that data using Impala. We've come up with a solution that works, but it doesn't seem like a good approach, so I was wondering if you could tell us the correct way to do this. We are using Spark 1.0
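
A hedged sketch of one way to wire this up, using the Spark 1.0-era SchemaRDD API on the Spark side and hand-written DDL on the Impala side (the query, table name, columns and paths are all placeholders):

    // Spark side: write the SchemaRDD out as Parquet files on HDFS
    val results = sqlContext.sql("SELECT userId, score FROM events")
    results.saveAsParquetFile("hdfs:///data/events_parquet")

    // Impala side (run in impala-shell), pointing an external table at the same directory:
    //   CREATE EXTERNAL TABLE events_parquet (userId STRING, score DOUBLE)
    //     STORED AS PARQUET LOCATION 'hdfs:///data/events_parquet';
    //   REFRESH events_parquet;   -- pick up newly written files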

Re: Spark SQL, Parquet and Impala

2014-08-01 Thread Patrick McGloin
insert data from SparkSQL into a Parquet table which can be directly queried by Impala? Best regards, Patrick On 1 August 2014 16:18, Patrick McGloin mcgloin.patr...@gmail.com wrote: Hi, We would like to use Spark SQL to store data in Parquet format and then query that data using Impala

Spark Streaming and JMS

2014-05-05 Thread Patrick McGloin
Hi all, Is there a best practice for subscribing to JMS with Spark Streaming? I have searched but not found anything conclusive. In the absence of a standard practice, the solution I was thinking of was to use Akka + Camel (akka.camel.Consumer) to create a subscription for a Spark Streaming
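
For comparison with the Akka + Camel route, here is a minimal sketch of a custom receiver built directly on the javax.jms API. Connection-factory construction is vendor specific, so it is passed in as a function; the class name, queue name and storage level are assumptions.

    import javax.jms.{Connection, ConnectionFactory, Message, MessageListener, Session, TextMessage}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class JmsReceiver(factoryBuilder: () => ConnectionFactory, queueName: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

      private var connection: Connection = _

      def onStart(): Unit = {
        connection = factoryBuilder().createConnection()
        val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
        val consumer = session.createConsumer(session.createQueue(queueName))
        consumer.setMessageListener(new MessageListener {
          override def onMessage(message: Message): Unit = message match {
            case text: TextMessage => store(text.getText)   // hand the payload to Spark
            case _                 => // ignore non-text messages in this sketch
          }
        })
        connection.start()
      }

      def onStop(): Unit = if (connection != null) connection.close()
    }

    // usage sketch: ssc.receiverStream(new JmsReceiver(() => myVendorConnectionFactory, "my.queue"))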