Thanks, Ryan. I will study Algebird first and try to adapt TopKMonoid to a Spark Streaming program.
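
For reference, a rough, untested sketch of what such an adaptation could look like, assuming Algebird's TopKMonoid and standard Spark Streaming operations (socketTextStream, reduceByKeyAndWindow); the k value, batch/window sizes, and the direction of the Ordering are illustrative and should be checked against the Algebird source:

import com.twitter.algebird.{TopK, TopKMonoid}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object TopKWordCount {
  def main(args: Array[String]): Unit = {
    // Local master and one-second batches, just for illustration.
    val ssc = new StreamingContext("local[4]", "TopKWordCount", Seconds(1))

    // Order (word, count) pairs so that higher counts sort first.
    // (Check TopKMonoid's docs for which end of the Ordering it keeps.)
    implicit val byCountDesc: Ordering[(String, Int)] =
      Ordering.by { case (_, count) => -count }

    // Monoid that keeps only the first 10 elements under the Ordering,
    // i.e. the 10 most frequent words with the ordering above.
    val topK = new TopKMonoid[(String, Int)](10)

    // Word counts over a sliding 10-second window.
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(10))

    // Lift each (word, count) pair into a TopK and merge with the monoid;
    // each batch reduces to a single TopK holding the current top 10 words.
    val topWords = counts
      .map(pair => topK.build(pair))
      .reduce(topK.plus)

    topWords.map(_.items).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
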
On Mon, Jan 27, 2014 at 2:54 PM, Ryan Weald <[email protected]> wrote:

> Hi dachuan,
>
> Getting top-k up and running using Spark Streaming is actually very easy
> using Twitter's Algebird project. I gave a presentation recently at a Spark
> user meetup that went through an example of using Algebird in a Spark
> Streaming job. You can find the video and slides here -
> http://isurfsoftware.com/blog/2014/01/20/spark-meetup-monoids/
>
> Once you get the general idea of using monoids for aggregation, it will be
> easy to drop in the
> TopKMonoid<https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/TopKMonoid.scala>
> from Algebird to solve your problem.
>
> As far as cluster configuration goes, at my old company Sharethrough, we
> set Spark to coarse-grained mode on Apache Mesos, with the Spark config set
> to limit the number of CPUs per job. We also made some minor tweaks to JVM
> settings for a bigger heap size and reduced RDD cache time.
>
> Cheers
>
> Ryan Weald
>
>
> On Fri, Jan 24, 2014 at 7:28 PM, dachuan <[email protected]> wrote:
>
>> Hello, community,
>>
>> I have three questions about Spark Streaming.
>>
>> 1.
>> I noticed an interesting phenomenon in one streaming example
>> (StatefulNetworkWordCount): since this workload only prints the first 10
>> rows of the final RDD, if the data influx rate is fast enough (much faster
>> than typing by hand at the keyboard), the final RDD will have more than
>> one partition. Assume it has 2 partitions; the second partition won't be
>> computed at all, because the first partition suffices to serve the first
>> 10 rows. However, these workloads must still checkpoint that RDD. This
>> leads to a very time-consuming checkpoint process, because checkpointing
>> the second partition cannot start until that partition has been computed.
>> So, is this workload only designed for demonstration purposes, e.g., only
>> for a single-partition RDD?
>>
>> (I have attached a figure to illustrate this; please tell me if the
>> mailing list doesn't welcome attachments.
>> A short description of the experiment:
>> Hardware specs: 4 cores
>> Software specs: Spark local cluster, 5 executors (workers), each with one
>> core and 1 GB of memory
>> Data influx speed: 3 MB/s
>> Data source: one ServerSocket fed from a local file
>> Streaming app's name: StatefulNetworkWordCount
>> Job generation frequency: one job per second
>> Checkpoint time: once per 10 s
>> JobManager.numThreads = 2)
>>
>> (Another workload might have the same problem: PageViewStream's
>> slidingPageCounts.)
>>
>> 2.
>> Does anybody have Top-K word count streaming source code?
>>
>> 3.
>> Can anybody share a real-world streaming example, including, for example,
>> source code and cluster configuration details?
>>
>> thanks,
>> dachuan.
>>
>> --
>> Dachuan Huang
>> Cellphone: 614-390-7234
>> 2015 Neil Avenue
>> Ohio State University
>> Columbus, Ohio
>> U.S.A.
>> 43210
>>

--
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A.
43210
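
P.S. Regarding the cluster configuration Ryan describes above (coarse-grained mode on Mesos, a cap on CPUs per job, a bigger heap, shorter RDD cache time), here is a rough sketch of how those knobs might be expressed with SparkConf (Spark 0.9+); the values and the Mesos master URL are placeholders, not Sharethrough's actual settings:

import org.apache.spark.SparkConf

// Property names are from the standard Spark / Mesos configuration docs;
// all values below are placeholders for illustration only.
val conf = new SparkConf()
  .setMaster("mesos://zk://zk-host:2181/mesos") // hypothetical Mesos master URL
  .setAppName("StreamingTopK")
  .set("spark.mesos.coarse", "true")   // coarse-grained mode on Mesos
  .set("spark.cores.max", "8")         // cap the total CPUs this job can claim
  .set("spark.executor.memory", "4g")  // bigger executor heap
  .set("spark.cleaner.ttl", "3600")    // keep cached RDDs/metadata at most this many seconds
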
