You can do it with a custom RDD implementation.
You will mainly implement "getPartitions" (the logic to split your input
into partitions) and "compute" (to compute and return the values from the
executors).
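To illustrate, a rough sketch of such an RDD (untested; the range-based split
and the record type are just placeholders for your own input logic):

  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // Hypothetical partition that carries an index range of the input.
  class RangePartition(val index: Int, val start: Long, val end: Long) extends Partition

  class RangeRDD(sc: SparkContext, total: Long, numParts: Int) extends RDD[Long](sc, Nil) {

    // Split the input into partitions; this runs on the driver.
    override protected def getPartitions: Array[Partition] = {
      val step = (total + numParts - 1) / numParts
      (0 until numParts).map { i =>
        new RangePartition(i, i * step, math.min((i + 1) * step, total)): Partition
      }.toArray
    }

    // Compute and return the values for one partition; this runs on the executors.
    override def compute(split: Partition, context: TaskContext): Iterator[Long] = {
      val p = split.asInstanceOf[RangePartition]
      (p.start until p.end).iterator // replace with whatever your source actually reads
    }
  }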
On Tue, 17 Sep 2019 at 08:47, Marcelo Valle wrote:
> Just to be more clear about my requireme
You can check out
https://github.com/hortonworks-spark/spark-atlas-connector/
On Wed, 15 May 2019 at 19:44, lk_spark wrote:
> hi,all:
> When I use Spark, if I run some SQL to do ETL, how can I get
> lineage info? I found that CDH Spark has some config about lineage:
> spark.l
Spark TaskMetrics[1] has a "jvmGCTime" metric that captures the amount of
time spent in GC. This is also available via the listener I guess.
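For example, a rough sketch of picking that up from a listener (untested):

  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

  // Accumulates the GC time reported per task via TaskMetrics.
  class GcTimeListener extends SparkListener {
    @volatile var totalGcTimeMs: Long = 0L

    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      val metrics = taskEnd.taskMetrics
      if (metrics != null) {
        totalGcTimeMs += metrics.jvmGCTime
      }
    }
  }

  // Register it before running jobs:
  // spark.sparkContext.addSparkListener(new GcTimeListener)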
Thanks,
Arun
[1]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala#L89
On Mon, 15 Apr 2019 at 09
I don't think it's feasible with the current logic. Typically the query
planning time should be a tiny fraction unless you are processing tiny
micro-batches very frequently. You might want to consider adjusting the
trigger interval to process more data per micro-batch and see if it
helps. The tiny
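For example (just a sketch; assuming "df" is your streaming DataFrame and the
interval is something you'd tune):

  import org.apache.spark.sql.streaming.Trigger

  // Trigger less frequently so each micro-batch processes more data.
  val query = df.writeStream
    .format("console")
    .trigger(Trigger.ProcessingTime("1 minute"))
    .start()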
Read the link carefully:
this solution is available (*only*) in Databricks Runtime.
You can enable RocksDB-based state management by setting the following
configuration in the SparkSession before starting the streaming query.
spark.conf.set(
"spark.sql.streaming.stateStore.providerClass",
"co
Yes, the script should be present on all the executor nodes.
You can pass your script via spark-submit (e.g. --files script.sh) and then
you should be able to refer to it (e.g. "./script.sh") in rdd.pipe.
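A rough sketch of that flow (script name and data are placeholders; assuming an
existing SparkContext "sc"):

  // Ship the script with the job:
  //   spark-submit --files script.sh --class ... your-app.jar
  // On the executors it lands in the task's working directory, so a relative path works:
  val piped = sc.parallelize(Seq("a", "b", "c"))
    .pipe("./script.sh") // each partition's records are fed to the script's stdin
  piped.collect().foreach(println)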
- Arun
On Thu, 17 Jan 2019 at 14:18, Mkal wrote:
> Hi, im trying to run an external script
Maybe you have Spark listeners that are not processing the events fast
enough?
Do you have Spark event logging enabled?
You might have to profile the built-in and your custom listeners to see
what's going on.
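If the listener bus turns out to be dropping events, one knob you could try (an
assumption on my part, not something from your logs) is giving the event queue
more headroom:

  import org.apache.spark.SparkConf

  // Default capacity is 10000; bump it if listeners occasionally fall behind.
  val conf = new SparkConf()
    .set("spark.scheduler.listenerbus.eventqueue.capacity", "20000")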
- Arun
On Wed, 24 Oct 2018 at 16:08, karan alang wrote:
>
> Pls note - Spark version is
Here's a proposal to add this: https://github.com/apache/spark/pull/21819
It's always good to set "maxOffsetsPerTrigger", unless you want Spark to
process to the end of the stream in each micro-batch. Even without
"maxOffsetsPerTrigger" the lag can be non-zero by the time the micro-batch
completes.
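For example, on a Kafka source (the cap below is arbitrary):

  val stream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")
    .option("subscribe", "events")
    .option("maxOffsetsPerTrigger", "100000") // max offsets consumed per micro-batch
    .load()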
“activityQuery.awaitTermination()” is a blocking call.
You can just skip this line and run other commands in the same shell to query
the stream.
Running the query from a different shell won't help, since the memory sink where
the results are stored is not shared between the two shells.
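In other words, something like this in a single shell (names are just examples;
assuming "counts" is your aggregated streaming DataFrame):

  // Start the query with a memory sink and a name to query it by.
  val activityQuery = counts.writeStream
    .queryName("activity_counts") // becomes the in-memory table name
    .format("memory")
    .outputMode("complete")
    .start()

  // Skip activityQuery.awaitTermination() and keep using the same shell:
  spark.sql("SELECT * FROM activity_counts").show()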
d basis (after deciding which particular sink a record belongs
to), whereas in the current implementation all data
under an RDD partition gets committed to the sink atomically in one go. Please
correct me if I am wrong here.
Regards,
Chandan
On Thu, Jul 12, 2018 at 10:53 PM Arun Mahadevan wrot
Yes, ForeachWriter [1] could be an option if you want to write to different
sinks. You can put in your custom logic to split the data into different sinks.
The drawback here is that you cannot plug in existing sinks like Kafka; you
need to write the custom logic yourself, and you cannot scale the p
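A bare-bones sketch of that approach (the routing column and the actual sink
writes are placeholders you'd fill in):

  import org.apache.spark.sql.{ForeachWriter, Row}

  // Route each record to one of two hypothetical sinks inside a ForeachWriter.
  val splitWriter = new ForeachWriter[Row] {
    override def open(partitionId: Long, epochId: Long): Boolean = {
      // open connections to both sinks here
      true
    }

    override def process(row: Row): Unit = {
      if (row.getAs[String]("sink") == "a") {
        // write to sink A
      } else {
        // write to sink B
      }
    }

    override def close(errorOrNull: Throwable): Unit = {
      // close the connections
    }
  }

  // df.writeStream.foreach(splitWriter).start()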
I think you need to group by a (tumbling) window and define a watermark (a
very low one, or even 0) to discard the state. Here the window duration
becomes your logical batch.
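Roughly (untested; column names and the 1-minute window are assumptions - the
window is the "logical batch" here):

  import org.apache.spark.sql.functions.{col, max, window}

  // 1-minute tumbling windows; the 0-second watermark lets Spark drop a
  // window's state as soon as event time moves past it.
  val result = events
    .withWatermark("eventTime", "0 seconds")
    .groupBy(window(col("eventTime"), "1 minute"), col("id"))
    .agg(max(col("amount")))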
- Arun
From: kant kodali
Date: Thursday, May 3, 2018 at 1:52 AM
To: "user @spark"
Subject: Re: question on
I guess you can wait for the termination, catch the exception and then restart
the query in a loop. Something like…
while (true) {
  try {
    val query = df.writeStream
      …
      .start()
    query.awaitTermination()
  } catch {
    case e: StreamingQueryException =>
      // handle/log the failure, then loop around and restart the query
  }
}
Mode(OutputMode.Update())
.start()
It still has a minor issue: the column "AMOUNT" shows up twice in the result
table, but everything works like a charm.
-Jungtaek Lim (HeartSaVioR)
On Thu, Apr 19, 2018 at 9:43 AM, Arun Mahadevan wrote:
The below expr might work:
df.groupBy($"id")
her raw sql like
sparkSession.sql("sql query") or similar to raw sql but not something like
mapGroupWithState
On Wed, Apr 18, 2018 at 9:36 AM, Arun Mahadevan wrote:
Can't the “max” function be used here? Something like..
stream.groupBy($"id").max("amount").wri
Can't the “max” function be used here? Something like..
stream.groupBy($"id").max("amount").writeStream.outputMode(“complete”/“update")….
Unless the “stream” is already a grouped stream, in which case the above would
not work, since support for multiple aggregate operations is not there yet.
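Spelled out a bit more fully (sink and column names are assumptions):

  import org.apache.spark.sql.functions.{col, max}

  val query = stream
    .groupBy(col("id"))
    .agg(max(col("amount")).as("amount"))
    .writeStream
    .outputMode("update") // or "complete"
    .format("console")
    .start()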