Re: Setting startingOffsets to earliest in structured streaming never catches up

2017-01-22 Thread Timothy Chan
I'm using version 2.0.2. The difference I see between using latest and earliest is a series of jobs that each take less than a second vs. one job that goes on for over 24 hours.
On Sun, Jan 22, 2017 at 6:54 PM Shixiong(Ryan) Zhu wrote:
> Which Spark version are you using? If

Re: Setting startingOffsets to earliest in structured streaming never catches up

2017-01-22 Thread Shixiong(Ryan) Zhu
Which Spark version are you using? If you are using 2.1.0, could you use the monitoring APIs (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries) to check the input rate and the processing rate? One possible issue is that the Kafka source
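The rate check suggested above could be sketched roughly as follows. This assumes PySpark 2.1 or later, where `StreamingQuery.lastProgress` returns a dict that includes `inputRowsPerSecond` and `processedRowsPerSecond`. The helper name `is_falling_behind` and its `tolerance` parameter are hypothetical, introduced only for illustration.

```python
def is_falling_behind(progress, tolerance=1.0):
    """Return True when rows arrive faster than the query processes them.

    `progress` is assumed to be the dict from `query.lastProgress`
    (PySpark 2.1+). It is None until the first progress update exists.
    """
    if not progress:
        # Nothing reported yet, so there is nothing to compare.
        return False
    # Either rate may be absent from the report; treat a missing rate as 0.
    input_rate = float(progress.get("inputRowsPerSecond", 0.0))
    processed_rate = float(progress.get("processedRowsPerSecond", 0.0))
    return input_rate > processed_rate * tolerance


# Usage against a running query (needs a live SparkSession, so commented out):
# query = df.writeStream.format("console").start()
# print(is_falling_behind(query.lastProgress))
```

A query that stays in this state across many progress updates is unlikely to catch up without more resources or smaller batches.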

Setting startingOffsets to earliest in structured streaming never catches up

2017-01-22 Thread Timothy Chan
I'm running my structured streaming jobs on EMR. We were thinking that a worst-case recovery scenario would be to spin up another cluster and set startingOffsets to earliest (our Kafka cluster has a retention policy of 7 days). My observation is that the job never catches up to latest. This
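For reference, the recovery setup described above might be configured along these lines. This is a minimal sketch, not the poster's actual code: the option names follow the Structured Streaming Kafka integration guide for Spark 2.1 and later, while the helper name `kafka_replay_options`, the broker address, and the topic name are hypothetical placeholders.

```python
def kafka_replay_options(bootstrap_servers, topic, starting_offsets="earliest"):
    """Build Kafka source options for a Structured Streaming reader.

    Only the two named positions are accepted in this sketch.
    """
    if starting_offsets not in ("earliest", "latest"):
        raise ValueError("starting_offsets must be 'earliest' or 'latest'")
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


# Hypothetical broker and topic names, for illustration only.
opts = kafka_replay_options("broker1:9092", "events")

# Usage with a live SparkSession (commented out so the sketch runs without Spark):
# df = spark.readStream.format("kafka").options(**opts).load()
```

Note that startingOffsets only takes effect when a new query starts. A query resumed from an existing checkpoint continues from where it left off.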

Re: physical memory usage keep increasing for spark app on Yarn

2017-01-22 Thread Yang Cao
Also, do you know why this happens?
> On Jan 20, 2017, at 18:23, Pavel Plotnikov wrote:
>
> Hi Yang,
> I have faced the same problem on Mesos, and to circumvent this issue I usually increase the partition number. In the last step of your code you reduce

Re: physical memory usage keep increasing for spark app on Yarn

2017-01-22 Thread Yang Cao
Hi, thank you for your suggestion. As far as I know, if I set it to a bigger number I won’t get the output as a single file, right? My task is designed to combine all the small files from one day into one big Parquet file. Thanks again. Best,
> On Jan 20, 2017, at 18:23, Pavel Plotnikov