I'm using version 2.0.2.
The difference I see between using latest and earliest is a series of jobs
that take less than a second vs. one job that goes on for over 24 hours.
On Sun, Jan 22, 2017 at 6:54 PM Shixiong (Ryan) Zhu wrote:
> Which Spark version are you using? If you are using 2.1.0, could you use
> the monitoring APIs
> (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries)
> to check the input rate and the processing rate? One possible issue is that
> the Kafka source
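For reference, `query.lastProgress` in PySpark returns the most recent `StreamingQueryProgress` as a plain dict, so the two rates Ryan mentions can be compared directly. A minimal sketch with made-up sample values (only the two field names come from the progress JSON the monitoring docs describe):

```python
# Shaped like query.lastProgress in PySpark; the numbers are invented
# for illustration.
sample_progress = {
    "inputRowsPerSecond": 12000.0,
    "processedRowsPerSecond": 350.0,
}

def is_falling_behind(progress):
    """True if rows arrive faster than the query processes them."""
    if progress is None:  # no batch has completed yet
        return False
    inp = progress.get("inputRowsPerSecond", 0.0)
    proc = progress.get("processedRowsPerSecond", 0.0)
    return inp > proc

print(is_falling_behind(sample_progress))  # → True: a backlog is building
```

Polling this periodically (or attaching a `StreamingQueryListener`) shows at a glance whether the job can ever reach the tip of the topic.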
I'm running my structured streaming jobs on EMR. We were thinking a
worst-case recovery scenario would be to spin up another cluster and set
startingOffsets to earliest (our Kafka cluster has a retention policy of
7 days).
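The recovery setup described above comes down to one source option. A sketch with hypothetical broker and topic names; only the option keys are from the Kafka source integration:

```python
import json

# Broker addresses and topic name here are placeholders.
kafka_options = {
    "kafka.bootstrap.servers": "broker1:9092,broker2:9092",
    "subscribe": "events",
    # Replay everything still retained (7 days here) instead of
    # starting from the tip of the log:
    "startingOffsets": "earliest",
}

# With a real SparkSession this would be applied roughly as:
#   df = spark.readStream.format("kafka").options(**kafka_options).load()

# startingOffsets also accepts a JSON string pinning exact offsets per
# topic-partition (in that JSON, -2 stands for earliest, -1 for latest):
pinned = json.dumps({"events": {"0": 23, "1": -2}})
print(pinned)
```

Note that `startingOffsets` only applies when a query starts without a checkpoint; a restarted query resumes from its checkpointed offsets.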
My observation is that the job never catches up to latest. This
Also, do you know why this happens?
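Whether a replay from earliest can ever reach latest is simple arithmetic: the backlog only shrinks when the processing rate exceeds the input rate, otherwise it grows without bound. A small sketch with invented numbers:

```python
def catch_up_seconds(backlog_rows, input_rate, processing_rate):
    """Seconds until a replay reaches the tip of the log, or None if it
    never will (processing no faster than input)."""
    if processing_rate <= input_rate:
        return None  # the backlog grows (or stays flat) forever
    return backlog_rows / (processing_rate - input_rate)

# 7 days of retained data, with fresh input arriving at 1,000 rows/s:
backlog = 7 * 24 * 3600 * 1000

print(catch_up_seconds(backlog, 1000, 5000))  # → 151200.0 (42 hours)
print(catch_up_seconds(backlog, 1000, 800))   # → None: never catches up
```

This is why the rates from the monitoring API are the first thing to check: if `processedRowsPerSecond` never exceeds `inputRowsPerSecond`, no amount of waiting closes the gap.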
> On Jan 20, 2017, at 18:23, Pavel Plotnikov wrote:
>
> Hi Yang,
> I have faced the same problem on Mesos, and to circumvent this issue I
> usually increase the partition number. On the last step in your code you reduce
>
Hi,
Thank you for your suggestion. As far as I know, if I set a bigger number I
won't get the output as one file, right? My task is designed to combine all
of a day's small files into one big Parquet file. Thanks again.
Best,