Re: Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Adrian Tanase
We do - using Spark Streaming, Kafka and HDFS, all collocated on the same nodes. Works great so far. Spark picks up the location information and reads data from the partitions hosted by the local broker, showing up as NODE_LOCAL in the UI. You also need to look at the locality options in the

Re: What's the best practice to parse JSON using spark

2015-09-21 Thread Adrian Tanase
I've been using spray-json for general JSON ser/deser in Scala (Spark app), mostly for config files and data exchange. I haven't used it in conjunction with jobs that process large JSON data sources, so I can't speak for those use cases. -adrian

Re: Deploying spark-streaming application on production

2015-09-21 Thread Adrian Tanase
I'm wondering, isn't this the canonical use case for WAL + reliable receiver? As far as I know you can tune the MQTT server to wait for an ack on messages (QoS level 2?). With some support from the client library you could achieve exactly-once semantics on the read side, if you ack a message only after

Re: Reasonable performance numbers?

2015-09-25 Thread Adrian Tanase
It’s really hard to answer this, as the comparison is not really fair – Storm is much lower level than Spark and has less overhead when dealing with stateless operations. I’d be curious how your colleague is implementing the average on a “batch” and what the Storm equivalent of a batch is.

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Adrian Tanase
Hi Radu, The problem itself is not checkpointing the data – if your operations are stateless then you are only checkpointing the Kafka offsets, you are right. The problem is that you are also checkpointing metadata – including the actual code and serialized Java classes – that’s why you’ll see

Re: Using Spark for portfolio manager app

2015-09-25 Thread Adrian Tanase
Hi Adrian, Thanks. Cassandra seems to be a good candidate too. I will give it a try. Do you know of any stable connector that helps Spark work with Cassandra, or should I write it myself? Regarding my second question, i

Re: Using Spark for portfolio manager app

2015-09-25 Thread Adrian Tanase
Re: DB I strongly encourage you to look at Cassandra – it’s almost as powerful as HBase and a lot easier to set up and manage. It is well suited for this type of use case, with a combination of K/V store and time series data. For the second question, I’ve used this pattern all the time for “flash

Re: Receiver and Parallelization

2015-09-25 Thread Adrian Tanase
1) Yes, just use .repartition on the inbound stream; this will shuffle data across your whole cluster and process it in parallel as specified. 2) Yes, although I’m not sure how to do it for a totally custom receiver. Does this help as a starting point?
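
To make point 1 concrete, here is a toy model in plain Python of what repartitioning an inbound stream achieves: records that arrived on a single receiver partition get spread over several partitions so that downstream work can run in parallel. This is only a sketch. The `repartition` helper below is hypothetical and uses a simple round-robin rule; it is not Spark's implementation. In Spark itself the call would be `repartition(numPartitions)` on the DStream.

```python
from itertools import cycle


def repartition(partitions, num_partitions):
    """Toy stand-in for a shuffle: spread every record round-robin
    across num_partitions output partitions (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    out = [[] for _ in range(num_partitions)]
    targets = cycle(range(num_partitions))
    for part in partitions:
        for record in part:
            out[next(targets)].append(record)
    return out


# A single receiver produced one partition holding 10 records.
received = [list(range(10))]

# Spread them over 4 partitions so 4 tasks could work in parallel.
spread = repartition(received, 4)
sizes = [len(p) for p in spread]
```

The trade-off is the same as in the real system: the data has to move between nodes once, in exchange for using more of the cluster in the stages that follow.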

[REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-03 Thread Adrian Tanase
Hi all, Trying to repost this question from a colleague on my team, somehow his subscription is not active: http://apache-spark-user-list.1001560.n3.nabble.com/Severe-Spark-Streaming-performance-degradation-after-upgrading-to-1-6-1-td27056.html Appreciate any thoughts, -adrian
