Structured Streaming - checkpoint metadata growing indefinitely

2022-04-28 Thread Wojciech Indyk
Hello!
I use Spark Structured Streaming. I need to use S3 for storing checkpoint
metadata (I know it's not an optimal store for checkpoint metadata). The
compaction interval is 10 (the default) and I set
"spark.sql.streaming.minBatchesToRetain"=5. After the job had been running
for a few weeks, checkpointing time increased significantly (causing a few
minutes' delay in processing). I looked at the checkpoint metadata
structure. There is one heavy path there: checkpoint/source/0. A single
.compact file weighs 25GB. I looked into its content and it contains all
entries since batch 0 (the current batch is around 25000). I tried a few
parameters to remove already-processed data from the compact file, namely:
"spark.cleaner.referenceTracking.cleanCheckpoints"=true - does not work. As
far as I can see in the code, it relates to the previous (DStream) version
of streaming, doesn't it?
"spark.sql.streaming.fileSource.log.deletion"=true and
"spark.sql.streaming.fileSink.log.deletion"=true don't work either.
The compact file stores the full history even after all the data have been
processed (except for the most recent checkpoint), so I would expect most
of the entries to be deleted. Is there any parameter to remove entries from
the compact file, or to remove the compact file gracefully from time to
time? For now I am testing a scenario where I stop the job, delete most of
the checkpoint/source/0/* files, keeping just a few recent (non-compacted)
checkpoints, and rerun the job. The job recovers correctly from the most
recent checkpoint. This looks like a possible workaround for my problem,
but manually deleting checkpoint files is ugly, so I would prefer something
managed by Spark.
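[Editor's note: the manual workaround described above can be sketched as a small script. This is a hypothetical sketch, not a Spark-provided tool; the file layout (plain batch numbers like "24998" plus compacted logs like "24989.compact" under checkpoint/source/0) is assumed from the post, and the `prune_source_log` helper name is invented for illustration. Run it only while the streaming job is stopped, and back up the checkpoint directory first.]

```python
# Sketch of the manual cleanup workaround: with the streaming job STOPPED,
# prune a source metadata log directory down to the newest few
# non-compacted batch files, deleting the oversized .compact files.
import os
import re


def prune_source_log(log_dir: str, keep_last: int = 5) -> list[str]:
    """Delete all entries except the newest `keep_last` plain batch files.

    Returns the sorted names of the files that were removed.
    """
    entries = []
    for name in os.listdir(log_dir):
        # Batch files are plain integers; compacted logs end in ".compact".
        m = re.fullmatch(r"(\d+)(\.compact)?", name)
        if m:
            entries.append((int(m.group(1)), name.endswith(".compact"), name))

    # Keep only the newest `keep_last` non-compacted batch numbers.
    plain_batches = sorted(b for b, is_compact, _ in entries if not is_compact)
    keep = set(plain_batches[-keep_last:])

    removed = []
    for batch, is_compact, name in entries:
        if is_compact or batch not in keep:
            os.remove(os.path.join(log_dir, name))
            removed.append(name)
    return sorted(removed)
```

On restart, Spark recovers from the remaining recent batch files and rebuilds compaction from there, which is what the poster reports observing.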

--
Kind regards/ Pozdrawiam,
Wojciech Indyk


Unsubscribe

2022-04-28 Thread Sahil Bali
Unsubscribe


Re: Unsubscribe

2022-04-28 Thread wilson

Please send a message to user-unsubscr...@spark.apache.org
to unsubscribe.


Ajay Thompson wrote:

Unsubscribe


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Unsubscribe

2022-04-28 Thread Ajay Thompson
Unsubscribe


Re: Reg: CVE-2020-9480

2022-04-28 Thread Sean Owen
It is not a real dependency, so it should not cause any issue. I am not
sure why your tool flags it at all.

On Thu, Apr 28, 2022 at 10:04 PM Sundar Sabapathi Meenakshi <
sun...@mcruncher.com> wrote:

> Hi all,
>
> I am using the spark-sql_2.12 dependency, version 3.2.1, in my
> project. My dependency tracker highlights the transitive dependency
> "unused" from spark-sql_2.12 as vulnerable. I checked, and there has been
> no update for this artifact since 2014. Is the artifact used anywhere in
> Spark?
>
> To resolve this vulnerability, can I exclude the "unused" artifact from
> spark-sql_2.12? Will it cause any issues in my project?
>
>


Reg: CVE-2020-9480

2022-04-28 Thread Sundar Sabapathi Meenakshi
Hi all,

  I am using the spark-sql_2.12 dependency, version 3.2.1, in my project.
My dependency tracker highlights the transitive dependency "unused"
from spark-sql_2.12 as vulnerable. I checked, and there has been no update
for this artifact since 2014. Is the artifact used anywhere in Spark?

To resolve this vulnerability, can I exclude the "unused" artifact from
spark-sql_2.12? Will it cause any issues in my project?
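[Editor's note: for reference, the exclusion the poster asks about might look like the following Maven fragment. This is a hedged sketch: the coordinates `org.spark-project.spark:unused` are assumed for the "unused" artifact, which (per Sean Owen's reply above) is an empty placeholder and not a real dependency, so excluding it should be harmless.]

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.2.1</version>
  <exclusions>
    <!-- Empty placeholder artifact used by Spark's build; safe to exclude -->
    <exclusion>
      <groupId>org.spark-project.spark</groupId>
      <artifactId>unused</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```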


unsubscribe

2022-04-28 Thread Deepak Gajare
unsubscribe