How do I deal with an ever-growing application log

2017-03-05 Thread Timothy Chan
I'm running a single worker EMR cluster for a Structured Streaming job. How
do I deal with my application log filling up HDFS?

/var/log/spark/apps/application_1487823545416_0021_1.inprogress

is currently 21.8 GB
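Assuming that file is the Spark event log (the `.inprogress` suffix and the history-server directory `/var/log/spark/apps` suggest it), it will keep growing for the life of a long-running streaming query on Spark 2.x. One option, sketched here as an assumption rather than a definitive fix, is to disable event logging for the streaming job in spark-defaults.conf (or via --conf at submit time):

```
# spark-defaults.conf -- turn off the per-application event log,
# which grows unboundedly for a long-running streaming query
spark.eventLog.enabled  false
```

The trade-off is losing the history-server view for that application. If the history is needed, the history server's cleaner settings (spark.history.fs.cleaner.enabled and spark.history.fs.cleaner.maxAge) only prune completed logs, so they won't help with a single never-ending .inprogress file.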



Counting things in Spark Structured Streaming

2017-02-08 Thread Timothy Chan
I would like to count running totals for events coming in since a given
date for a given user. How would I go about doing this?

For example, user data comes in, we score it, and then we keep a running total of
that score since a given date. Specifically, I always want to score the data, but
I only want to add to the running total if the event date is after a certain date
(which would probably have to be looked up each time data is scored).
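The core logic can be sketched outside Spark. This is a plain-Python illustration, with a hypothetical score function and a hypothetical fixed cutoff date standing in for the per-user lookup:

```python
from datetime import date

CUTOFF = date(2017, 1, 1)          # hypothetical "given date"; would be looked up per user

def score(event):
    # stand-in for the real scoring step; this always runs
    return event["value"] * 2

def update_totals(totals, event):
    s = score(event)               # score every event...
    if event["date"] >= CUTOFF:    # ...but accumulate only on/after the cutoff
        totals[event["user"]] = totals.get(event["user"], 0) + s
    return totals

events = [
    {"user": "a", "date": date(2016, 12, 31), "value": 1},  # before cutoff: scored, not counted
    {"user": "a", "date": date(2017, 1, 2), "value": 2},
    {"user": "b", "date": date(2017, 1, 3), "value": 3},
]
totals = {}
for e in events:
    update_totals(totals, e)
print(totals)  # {'a': 4, 'b': 6}
```

In Structured Streaming this would presumably become a scoring transformation, a filter on the event date (likely via a join against a small per-user reference table for the cutoff), and a groupBy/sum aggregation emitted in update output mode.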


Spark 2.1.0 and Shapeless

2017-01-30 Thread Timothy Chan
I'm using a library, https://github.com/guardian/scanamo, that uses
shapeless 2.3.2. What are my options if I want to use this with Spark
2.1.0?

Based on this:
http://apache-spark-developers-list.1001551.n3.nabble.com/shapeless-in-spark-2-1-0-tt20392.html

I'm guessing I would have to release my own version of scanamo with a
shaded shapeless?
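Shading is the usual route here, since Spark's own classpath pulls in an older shapeless via its breeze dependency. A sketch of what the sbt-assembly shade rule might look like (the rename pattern is the standard sbt-assembly mechanism; the target package name is an arbitrary placeholder):

```
// build.sbt -- rename shapeless packages inside the fat jar so they
// cannot clash with the shapeless version on Spark's classpath
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("shapeless.**" -> "com.example.shaded.shapeless.@1").inAll
)
```

This assumes the job is deployed as an assembly jar; the rule rewrites both the shapeless classes and every reference to them in scanamo's bytecode, so no republishing of scanamo itself should be needed.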


Re: Setting startingOffsets to earliest in structured streaming never catches up

2017-01-22 Thread Timothy Chan
I'm using version 2.0.2.

The difference I see between using latest and earliest is a series of jobs
that take less than a second vs. one job that goes on for over 24 hours.

On Sun, Jan 22, 2017 at 6:54 PM Shixiong(Ryan) Zhu <shixi...@databricks.com>
wrote:

> Which Spark version are you using? If you are using 2.1.0, could you use
> the monitoring APIs (
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries)
> to check the input rate and the processing rate? One possible issue is that
> the Kafka source launched a pretty large batch and it took too long to
> finish it. If so, you can use "maxOffsetsPerTrigger" option to limit the
> data size in a batch in order to observe the progress.
>
> On Sun, Jan 22, 2017 at 10:22 AM, Timothy Chan <tc...@lumoslabs.com>
> wrote:
>
> I'm running my structured streaming jobs in EMR. We were thinking a worst
> case scenario recovery situation would be to spin up another cluster and
> set startingOffsets to earliest (our Kafka cluster has a retention policy
> of 7 days).
>
> My observation is that the job never catches up to latest. This is not
> acceptable. I've set the number of partitions for the topic to 6. I've
> tried using a cluster of 4 in EMR.
>
> The producer rate for this topic is 4 events/second. Does anyone have any
> suggestions on what I can do to have my consumer catch up to latest faster?
>
>
>


Setting startingOffsets to earliest in structured streaming never catches up

2017-01-22 Thread Timothy Chan
I'm running my structured streaming jobs in EMR. We were thinking a worst
case scenario recovery situation would be to spin up another cluster and
set startingOffsets to earliest (our Kafka cluster has a retention policy
of 7 days).

My observation is that the job never catches up to latest. This is not
acceptable. I've set the number of partitions for the topic to 6. I've
tried using a cluster of 4 in EMR.

The producer rate for this topic is 4 events/second. Does anyone have any
suggestions on what I can do to have my consumer catch up to latest faster?
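For a sense of scale, a back-of-envelope estimate of the backlog implied by the numbers above (assuming a constant producer rate and a full 7-day retention window at startup):

```python
producer_rate = 4                  # events per second (stated above)
retention_s = 7 * 24 * 3600        # 7-day Kafka retention, in seconds
backlog = producer_rate * retention_s
print(backlog)                     # 2419200 events to replay

# To catch up within one hour, the consumer would need to sustain roughly:
target_catchup_s = 3600
required_rate = backlog / target_catchup_s + producer_rate
print(required_rate)               # 676.0 events per second
```

That absolute volume is modest, which suggests the problem is less raw throughput than one enormous initial batch, consistent with the maxOffsetsPerTrigger advice in the reply thread above.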


Re: structured streaming polling timeouts

2017-01-11 Thread Timothy Chan
We're currently using EMR and they are still on Spark 2.0.2.

Do you have any other suggestions for additional parameters to adjust
besides "kafkaConsumer.pollTimeoutMs"?

On Wed, Jan 11, 2017 at 11:17 AM Shixiong(Ryan) Zhu <shixi...@databricks.com>
wrote:

> You can increase the timeout using the option
> "kafkaConsumer.pollTimeoutMs". In addition, I would recommend you try Spark
> 2.1.0 as there are many improvements in Structured Streaming.
>
> On Wed, Jan 11, 2017 at 11:05 AM, Timothy Chan <tc...@lumoslabs.com>
> wrote:
>
> I'm using Spark 2.0.2 and running a structured streaming query. When I set
> startingOffsets to earliest I get the following timeout errors:
>
> java.lang.AssertionError: assertion failed: Failed to get records for
> spark-kafka-source-be89d84c-f6e9-4d2b-b6cd-570942dc7d5d-185814897-executor
> my-favorite-topic-0 1127918 after polling for 2048
>
> I do not get these errors when I set startingOffsets to latest.
>
>
>
>


structured streaming polling timeouts

2017-01-11 Thread Timothy Chan
I'm using Spark 2.0.2 and running a structured streaming query. When I set
startingOffsets to earliest I get the following timeout errors:

java.lang.AssertionError: assertion failed: Failed to get records for
spark-kafka-source-be89d84c-f6e9-4d2b-b6cd-570942dc7d5d-185814897-executor
my-favorite-topic-0 1127918 after polling for 2048

I do not get these errors when I set startingOffsets to latest.
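For reference, the options discussed in the reply thread above are both passed to the Kafka source via .option(key, value) on the stream reader. The values below are illustrative placeholders, not recommendations:

```
kafkaConsumer.pollTimeoutMs   10000     # raise the executor poll timeout (ms)
startingOffsets               earliest
maxOffsetsPerTrigger          10000     # Spark 2.1.0+ only: cap each micro-batch
```

Raising the poll timeout addresses the assertion error directly; capping the batch size addresses the underlying cause, since starting from earliest makes the first batch span the entire retained topic.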