Here is a link to the JIRA for adding StructType support for scalar
pandas_udf https://issues.apache.org/jira/browse/SPARK-24579
On Wed, Jul 25, 2018 at 3:36 PM, Hichame El Khalfi
wrote:
> Hey Holden,
> Thanks for your reply,
>
> We are currently using a python function that produces a Row(TS=LongT
Hello,
I’m getting the below error in the Spark driver pod logs, and executor pods are
getting killed midway through while the job is running; the driver pod also
terminates with the below intermittent error. This happens if I run multiple jobs in parallel.
Not able to see executor logs as executor pods are
object MyDatabseSingleton {
  @transient
  lazy val dbConn = DB.connect(…)
}
`transient` marks the variable to be excluded from serialization,
and `lazy` opens the connection only when it is first needed and also makes
the initialization of the val thread-safe.
http://fdahms.com/2015/10/14/scala-and-the-transi
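For illustration, here is a minimal sketch (the DataFrame `df` and the write logic are assumptions, not from the thread) of using such a singleton from within a partition, so the connection is opened lazily on each executor JVM rather than serialized from the driver:

import org.apache.spark.sql.Row

df.foreachPartition { rows: Iterator[Row] =>
  // First access initializes the @transient lazy val on this executor's JVM;
  // later tasks on the same executor reuse the already-open connection.
  val conn = MyDatabseSingleton.dbConn
  rows.foreach { row =>
    // write `row` through `conn` here ...
  }
}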
Hi Patrick,
This object must be serializable, right? I wonder if I will have access to this
object in my driver (since it is getting created on the executor side) so I
can close it when I am done with my batch?
Thanks!
On Mon, Jul 30, 2018 at 7:37 AM, Patrick McGloin
wrote:
> You could use an object in
While working with larger datasets I run into out-of-memory issues.
Basically a Hadoop sequence file is read, its contents are sorted and a
Hadoop map file is written back. The code works fine for workloads greater
than 20 GB. Then I changed one column in my dataset to store a large
object and the size of r
Here's a proposal to add it - https://github.com/apache/spark/pull/21819
It's always good to set "maxOffsetsPerTrigger" unless you want Spark to
process to the end of the stream in each micro-batch. Even without
"maxOffsetsPerTrigger" the lag can be non-zero by the time the micro-batch
completes.
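As a rough sketch of where this option goes (broker and topic names below are placeholders, not from the thread):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  // cap each micro-batch at roughly 10000 offsets across all topic partitions
  .option("maxOffsetsPerTrigger", "10000")
  .load()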
If you don't set rate limiting through `maxOffsetsPerTrigger`, Structured
Streaming will always process until the end of the stream, so the number of
records waiting to be processed should be 0 at the start of each trigger.
On Mon, Jul 30, 2018 at 8:03 AM, Kailash Kalahasti <
kailash.kalaha...@gmail.c
Is there any way to find out the backlog on a Kafka topic while using Spark
Structured Streaming? I checked a few consumer APIs, but those require
enabling a group id for streaming, which seems not to be allowed.
Basically I want to know the number of records waiting to be processed.
Any suggestions?
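For monitoring progress, here is a sketch using the public StreamingQueryListener API; comparing the logged end offsets against the topic's latest offsets (e.g. with Kafka's own tooling) gives a backlog estimate:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().getOrCreate()

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // startOffset/endOffset are JSON strings of per-partition Kafka offsets
    event.progress.sources.foreach { s =>
      println(s"rows=${s.numInputRows} start=${s.startOffset} end=${s.endOffset}")
    }
  }
})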
You could use an object in Scala, of which only one instance will be
created on each JVM / Executor. E.g.
object MyDatabseSingleton {
var dbConn = ???
}
On Sat, 28 Jul 2018, 08:34 kant kodali, wrote:
> Hi All,
>
> I understand creating a connection forEachPartition but I am wondering can
>
Thanks guys, it really helps.
We have a use case where there's a stream of events, and every event has an
ID and its current state with a timestamp:
…
111,ready,1532949947
111,offline,1532949955
111,ongoing,1532949955
111,offline,1532949973
333,offline,1532949981
333,ongoing,1532949987
…
We want to ask questions about the
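For reference, a minimal sketch of representing such events as a typed Dataset (the file path and column names are assumptions for illustration, not from the thread):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Event(id: String, state: String, ts: Long)

val events = spark.read
  .csv("events.csv")                       // columns arrive as _c0, _c1, _c2
  .toDF("id", "state", "ts")
  .selectExpr("id", "state", "cast(ts as long) as ts")
  .as[Event]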
How to add a new source to an existing Structured Streaming application, like a
Kafka source?
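One common approach (a sketch; broker and topic names are placeholders, and both sources must share the same schema) is to define a second readStream and union it with the existing one:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val existing = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topic-a")
  .load()

val added = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topic-b")
  .load()

// Kafka sources all expose the same schema (key, value, topic, partition,
// offset, timestamp, ...), so the two streaming DataFrames can be unioned
// and handled by a single query.
val combined = existing.union(added)

For Kafka specifically, the "subscribe" option also accepts a comma-separated list of topics, which avoids the union entirely; note that restarting from an existing checkpoint after changing the set of sources is generally not supported.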
I am trying to read a CSV into a Spark DataFrame. My OS is Ubuntu 18.04,
Spark version 2.3.1, Python version 2.7.15.
My code :
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
conf = SparkConf()
sc = SparkContext(conf = conf)