Re: How to get logging right for Spark applications in the YARN ecosystem

2019-08-01 Thread Srinath C
Hi Raman,

You could probably use the rolling file appender in log4j to compress the
rotated log files?
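
For example, here is a minimal sketch in log4j 2 properties syntax. This is an
assumption on my side: Spark 2.x bundles log4j 1.x by default, which would need
the apache-log4j-extras rolling appender to get compression instead, and the
file names, pattern, and sizes below are placeholders to tune:

    # Hedged sketch: a log4j 2 RollingFileAppender that gzips rotated files.
    rootLogger.level = info
    rootLogger.appenderRef.rolling.ref = RollingFile

    appender.rolling.type = RollingFile
    appender.rolling.name = RollingFile
    # spark.yarn.app.container.log.dir is set by Spark for YARN containers.
    appender.rolling.fileName = ${sys:spark.yarn.app.container.log.dir}/spark.log
    # The ".gz" suffix tells log4j 2 to compress each rolled file.
    appender.rolling.filePattern = ${sys:spark.yarn.app.container.log.dir}/spark-%d{yyyy-MM-dd}-%i.log.gz
    appender.rolling.layout.type = PatternLayout
    appender.rolling.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    appender.rolling.policies.type = Policies
    appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
    appender.rolling.policies.size.size = 128MB
    appender.rolling.strategy.type = DefaultRolloverStrategy
    appender.rolling.strategy.max = 10

You would ship the config to the executors with --files and point them at it
via spark.executor.extraJavaOptions (-Dlog4j.configurationFile=... for log4j 2,
-Dlog4j.configuration=... for log4j 1.x), then pick up the .gz files and ship
them to S3.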

Regards.


On Fri, Aug 2, 2019 at 12:47 AM raman gugnani 
wrote:

> Hi,
>
> I am looking for the right solution for handling the logs produced by the
> executors. In most places I have seen logging configured through log4j
> properties, but nowhere have I seen a solution where the logs are
> compressed.
>
> Is there any way I can compress the logs so that they can then be shipped
> to S3?
>
> --
> Raman Gugnani
>


Re: spark stream kafka wait for all data process done

2019-08-01 Thread 刘 勇
Hi,
You can set spark.streaming.backpressure.enabled=true.
If your tasks can't keep up with the volume of data, this setting lets Spark
throttle the rate at which Kafka data flows into the stream. You can also
increase your streaming batch interval.
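
For illustration, a minimal Scala sketch of those settings; the application
name, the per-partition rate cap, and the 30-second batch interval are
placeholder values to tune for your workload:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("kafka-backpressure-example")
      // Let Spark adapt the Kafka ingestion rate to the observed processing speed.
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional hard cap on records per Kafka partition per second.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")

    // A longer batch interval gives each micro-batch more time to finish.
    val ssc = new StreamingContext(conf, Seconds(30))

With backpressure enabled, Spark uses feedback from how long each batch took
to size the next one, so a slow pandas step naturally slows the ingestion rate.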



Sent from my Samsung Galaxy smartphone.


 Original message 
From: zenglong chen 
Date: 8/2/19 09:59 (GMT+08:00)
To: user@spark.apache.org
Subject: spark stream kafka wait for all data process done

How can Kafka wait until all tasks have finished processing before it begins
receiving the next batch? I want to process 5000 records at once with pandas,
and that may take too long to process.


spark stream kafka wait for all data process done

2019-08-01 Thread zenglong chen
How can Kafka wait until all tasks have finished processing before it begins
receiving the next batch? I want to process 5000 records at once with pandas,
and that may take too long to process.


Announcing Delta Lake 0.3.0

2019-08-01 Thread Tathagata Das
Hello everyone,

We are excited to announce the availability of Delta Lake 0.3.0 which
introduces new programmatic APIs for manipulating and managing data in
Delta Lake tables.

Here are the main features:


   - Scala/Java APIs for DML commands: You can now modify data in Delta Lake
     tables using programmatic APIs for *Delete*, *Update*, and *Merge*. These
     APIs mirror the syntax and semantics of their corresponding SQL commands
     and are great for many workloads, e.g., Slowly Changing Dimension (SCD)
     operations, merging change data for replication, and upserts from
     streaming queries. See the documentation for more details. (A combined
     Scala sketch of these APIs follows this list.)

   - Scala/Java APIs for querying commit history: You can now query a table's
     commit history to see what operations modified the table. This enables
     you to audit data changes, run time travel queries on specific versions,
     debug and recover data from accidental deletions, etc. See the
     documentation for more details.

   - Scala/Java APIs for vacuuming old files: Delta Lake uses MVCC to enable
     snapshot isolation and time travel. However, keeping all versions of a
     table forever can be prohibitively expensive. Stale snapshots (as well as
     other uncommitted files from aborted transactions) can be garbage
     collected by vacuuming the table. See the documentation for more details.

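Here is a minimal Scala sketch of the new APIs. It assumes an existing
SparkSession named spark, a Delta table at the hypothetical path
/tmp/delta/events, and an updatesDF DataFrame; column names are placeholders,
and the linked documentation has the authoritative signatures:

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.functions._

    // Bind to an existing Delta table (path is a placeholder).
    val deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events")

    // Delete rows matching a predicate.
    deltaTable.delete(col("date") < "2019-01-01")

    // Update a column for rows matching a predicate.
    deltaTable.update(col("eventType") === "clck", Map("eventType" -> lit("click")))

    // Upsert ("merge") new data from updatesDF, keyed on eventId.
    deltaTable.as("events")
      .merge(updatesDF.as("updates"), "events.eventId = updates.eventId")
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .execute()

    // Query the commit history of the table (most recent operations first).
    deltaTable.history().show()

    // Garbage-collect files older than the retention period (in hours).
    deltaTable.vacuum(168)

As with their SQL equivalents, these calls commit new versions to the table's
transaction log.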

To try out Delta Lake 0.3.0, please follow the Delta Lake Quickstart:
https://docs.delta.io/0.3.0/quick-start.html

To view the release notes:
https://github.com/delta-io/delta/releases/tag/v0.3.0

We would like to thank all the community members for contributing to this
release.

TD


How to get logging right for Spark applications in the YARN ecosystem

2019-08-01 Thread raman gugnani
Hi,

I am looking for the right solution for handling the logs produced by the
executors. In most places I have seen logging configured through log4j
properties, but nowhere have I seen a solution where the logs are
compressed.

Is there any way I can compress the logs so that they can then be shipped
to S3?

-- 
Raman Gugnani