Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Jeff Nadler
Yes, we do something very similar and it's working well: Kafka -> Spark Streaming (writes temp files of serialized RDDs) -> Spark batch application (builds partitioned Parquet files on HDFS; this step is needed because building Parquet files of a reasonable size is too slow to do inside the streaming job) -> query with
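
A minimal sketch of the batch step described above, assuming hypothetical paths, a JSON staging format, and a made-up partition column (the thread doesn't show the actual temp-file format or schema):

    // Hypothetical compaction job: read the small temp files flushed by the
    // streaming job and rewrite them as larger partitioned Parquet files.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object CompactToParquet {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("compact-to-parquet"))
        val sqlContext = new SQLContext(sc)

        sqlContext.read.json("hdfs:///staging/events/*")   // temp files (format assumed)
          .coalesce(8)                                     // fewer, larger output files
          .write
          .partitionBy("event_date")                       // assumed partition column
          .mode("append")
          .parquet("hdfs:///warehouse/events")
      }
    }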

Re: Not all KafkaReceivers processing the data Why?

2016-09-14 Thread Jeff Nadler
: 88 Topic: CEQReceiver Partition: 4 Leader: 89 Replicas: 89 Isr: 89

Re: Not all KafkaReceivers processing the data Why?

2016-09-14 Thread Jeff Nadler
Have you checked your Kafka brokers to be certain that data is going to all 5 partitions? We use something very similar (but in Scala) and have no problems. Also you might not get the best response blasting both user+dev lists like this. Normally you'd want to use 'user' only. -Jeff
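
For context, a hedged sketch of the receiver-based pattern being discussed (Spark 1.x Kafka 0.8 API): several KafkaUtils.createStream receivers unioned into one DStream. The ZooKeeper quorum, group id, and per-receiver thread counts are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("ceq"), Seconds(10))

    // One receiver per Kafka partition; each consumes the topic with one thread.
    val streams = (1 to 5).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "ceq-group",
        Map("CEQReceiver" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
    }
    ssc.union(streams).map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()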

Re: Streaming Backpressure with Multiple Streams

2016-09-14 Thread Jeff Nadler
2016 at 5:54 PM, Jeff Nadler <jnad...@srcginc.com> wrote: > Yes I'll test that next. > On Sep 9, 2016 5:36 PM, "Cody Koeninger" <c...@koeninger.org> wrote: >> Does the same thing happen if you're only using direct stream plus backpressure, not

Re: Streaming Backpressure with Multiple Streams

2016-09-09 Thread Jeff Nadler
Yes I'll test that next. On Sep 9, 2016 5:36 PM, "Cody Koeninger" <c...@koeninger.org> wrote: > Does the same thing happen if you're only using direct stream plus backpressure, not the receiver stream? > On Sep 9, 2016 6:41 PM, "Jeff Nadler"

Streaming Backpressure with Multiple Streams

2016-09-09 Thread Jeff Nadler
it is eventually consuming 1 record / second / partition. This happens even though there's no scheduling delay, and the receiver-based stream does not appear to be throttled. Anyone ever see anything like this? Thanks! Jeff Nadler Aerohive Networks
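
The behavior described (rate settling at 1 record/second/partition with no scheduling delay) involves the backpressure rate estimator, so for context here is a sketch of the relevant Spark 1.x/2.0-era properties; the values are illustrative, not taken from the thread:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("multi-stream-app")
      .set("spark.streaming.backpressure.enabled", "true")
      // Floor for the PID rate estimator (default 100):
      .set("spark.streaming.backpressure.pid.minRate", "100")
      // Hard cap for direct Kafka streams (records/sec/partition):
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")
      // Cap for receiver-based streams (records/sec/receiver):
      .set("spark.streaming.receiver.maxRate", "10000")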

Re: Storing object in spark streaming

2015-10-12 Thread Jeff Nadler
Your receiver must extend Receiver[String]. Try changing it to extend Receiver[Message]? On Mon, Oct 12, 2015 at 2:03 PM, Something Something <mailinglist...@gmail.com> wrote: > In my custom receiver for Spark Streaming I have code such as this: >
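
A minimal sketch of the suggested change, following the standard custom-receiver pattern; the Message type and the source loop are placeholders:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    case class Message(id: Long, body: String)

    class MessageReceiver extends Receiver[Message](StorageLevel.MEMORY_AND_DISK_SER) {

      override def onStart(): Unit = {
        // Receive on a background thread so onStart() returns promptly.
        new Thread("message-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              val msg = Message(System.nanoTime(), "payload") // stand-in for a real source
              store(msg) // store() now accepts Message rather than String
            }
          }
        }.start()
      }

      override def onStop(): Unit = {} // the loop above exits via isStopped()
    }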

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Jeff Nadler
Gerard - any chance this is related to task locality waiting? Can you try (just as a diagnostic) something like this? Does the unexpected delay go away? .set("spark.locality.wait", "0") On Tue, Oct 6, 2015 at 12:00 PM, Gerard Maas wrote: > Hi Cody, > > The job is
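
Spelled out, the suggested diagnostic looks like this; a value of "0" tells the scheduler never to hold a task waiting for a preferred-locality slot:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("direct-kafka-job")
      .set("spark.locality.wait", "0") // disable locality wait as a test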

Re: API to run spark Jobs

2015-10-06 Thread Jeff Nadler
Spark standalone doesn't come with a UI for submitting jobs. Some Hadoop distros might; for example, EMR in AWS has a job-submit UI. spark-submit just calls a REST API, so you could build any UI you want on top of that... On Tue, Oct 6, 2015 at 9:37 AM, shahid qadri
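
For reference, a hedged sketch of posting to the standalone master's REST submission endpoint, the same one spark-submit uses in cluster mode. This API was undocumented at the time, so the port (6066), path, and payload field names below are assumptions to verify against your Spark version:

    import java.io.OutputStreamWriter
    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    object RestSubmit {
      def main(args: Array[String]): Unit = {
        // Field names follow the CreateSubmissionRequest shape used by
        // spark-submit's REST client; hosts and paths are placeholders.
        val payload = """{
          |  "action": "CreateSubmissionRequest",
          |  "appResource": "hdfs:///jobs/my-app.jar",
          |  "mainClass": "com.example.MyApp",
          |  "clientSparkVersion": "1.5.1",
          |  "appArgs": ["arg1"],
          |  "environmentVariables": {"SPARK_ENV_LOADED": "1"},
          |  "sparkProperties": {
          |    "spark.master": "spark://master-host:6066",
          |    "spark.app.name": "MyApp",
          |    "spark.jars": "hdfs:///jobs/my-app.jar"
          |  }
          |}""".stripMargin

        val conn = new URL("http://master-host:6066/v1/submissions/create")
          .openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setRequestProperty("Content-Type", "application/json")
        conn.setDoOutput(true)
        val out = new OutputStreamWriter(conn.getOutputStream)
        out.write(payload)
        out.close()
        println(Source.fromInputStream(conn.getInputStream).mkString)
      }
    }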

Re: API to run spark Jobs

2015-10-06 Thread Jeff Nadler
bmit pyspark job, can you point me to Spark submit REST api >> On Oct 6, 2015, at 10:25 PM, Jeff Nadler <jnad...@srcginc.com> wrote: >> Spark standalone doesn't come with a UI for submitting jobs. Some Hadoop distros might, for ex

Streaming Performance w/ UpdateStateByKey

2015-10-05 Thread Jeff Nadler
While investigating performance challenges in a Streaming application using UpdateStateByKey, I found that serialization of state was a meaningful (not dominant) portion of our execution time. In StateDStream.scala, serialized persistence is required:
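
For context, the pattern in question; updateStateByKey's state stream is persisted serialized (StateDStream calls persist(StorageLevel.MEMORY_ONLY_SER) internally), which is where that serialization time shows up. Key and value types here are placeholders:

    import org.apache.spark.streaming.dstream.DStream

    // Running count per key; requires ssc.checkpoint(...) to be set.
    def updateCounts(newValues: Seq[Long], state: Option[Long]): Option[Long] =
      Some(newValues.sum + state.getOrElse(0L))

    def countsByKey(events: DStream[(String, Long)]): DStream[(String, Long)] =
      events.updateStateByKey(updateCounts)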

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Jeff Nadler
You can run multiple Spark clusters against one ZK cluster. Just use this config to set an independent ZK root for each cluster: spark.deploy.zookeeper.dir, the directory in ZooKeeper used to store recovery state (default: /spark). -Jeff
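
A hedged spark-env.sh excerpt showing how two standalone clusters might share one ZooKeeper ensemble by rooting each at its own directory; host names and paths are placeholders:

    # Masters of cluster A (cluster B would use -Dspark.deploy.zookeeper.dir=/spark-b)
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark-a"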

Re: Streaming with Java: Expected ReduceByWindow to Return JavaDStream

2015-01-19 Thread Jeff Nadler
JavaDStreamLike is used from Java code, so returning a Scala DStream is not reasonable. You can fix this by submitting a PR, or I can help you fix it. Thanks, Jerry