Thanks, Ryan. I will study Algebird first and try to adapt TopKMonoid to a Spark Streaming program.
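
For reference, a rough, untested sketch of what such an adaptation could look like, assuming Algebird's TopKMonoid and standard Spark Streaming operations (socketTextStream, reduceByKeyAndWindow); the k value, batch/window sizes, and the direction of the Ordering are illustrative and should be checked against the Algebird source:

import com.twitter.algebird.{TopK, TopKMonoid}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object TopKWordCount {
  def main(args: Array[String]): Unit = {
    // Local master and one-second batches, just for illustration.
    val ssc = new StreamingContext("local[4]", "TopKWordCount", Seconds(1))

    // Order (word, count) pairs so that higher counts sort first.
    // (Check TopKMonoid's docs for which end of the Ordering it keeps.)
    implicit val byCountDesc: Ordering[(String, Int)] =
      Ordering.by { case (_, count) => -count }

    // Monoid that keeps only the first 10 elements under the Ordering,
    // i.e. the 10 most frequent words with the ordering above.
    val topK = new TopKMonoid[(String, Int)](10)

    // Word counts over a sliding 10-second window.
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(10))

    // Lift each (word, count) pair into a TopK and merge with the monoid;
    // each batch reduces to a single TopK holding the current top 10 words.
    val topWords = counts
      .map(pair => topK.build(pair))
      .reduce(topK.plus)

    topWords.map(_.items).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
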
On Mon, Jan 27, 2014 at 2:54 PM, Ryan Weald <[email protected]> wrote:

> Hi dachuan,
>
> Getting top-k up and running using Spark Streaming is actually very easy
> using Twitter's Algebird project. I gave a presentation recently at a Spark
> user meetup that went through an example of using Algebird in a Spark
> Streaming job. You can find the video and slides here -
> http://isurfsoftware.com/blog/2014/01/20/spark-meetup-monoids/
>
> Once you get the general idea of using monoids for aggregation, it will be
> easy to drop in the
> TopKMonoid<https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/TopKMonoid.scala>
> from Algebird to solve your problem.
>
> As far as cluster configuration goes, at my old company Sharethrough, we
> set Spark to coarse-grained mode on Apache Mesos, with the Spark config set
> to limit the number of CPUs per job. We also made some minor tweaks to JVM
> settings for a bigger heap size and reduced RDD cache time.
>
> Cheers
>
> Ryan Weald
>
>
> On Fri, Jan 24, 2014 at 7:28 PM, dachuan <[email protected]> wrote:
>
>> Hello, community,
>>
>> I have three questions about Spark Streaming.
>>
>> 1.
>> I noticed an interesting phenomenon in one streaming example
>> (StatefulNetworkWordCount): since this workload only prints the first 10
>> rows of the final RDD, if the data influx rate is fast enough (much faster
>> than typing by hand at the keyboard), the final RDD will have more than
>> one partition. Assume it has 2 partitions; the second partition won't be
>> computed at all, because the first partition suffices to serve the first
>> 10 rows. However, these workloads must still checkpoint that RDD. This
>> leads to a very time-consuming checkpoint process, because checkpointing
>> the second partition cannot start until that partition has been computed.
>> So, is this workload only designed for demonstration purposes, e.g., only
>> for a single-partition RDD?
>>
>> (I have attached a figure to illustrate this; please tell me if the
>> mailing list doesn't welcome attachments.
>> A short description of the experiment:
>> Hardware specs: 4 cores
>> Software specs: Spark local cluster, 5 executors (workers), each with one
>> core and 1 GB of memory
>> Data influx speed: 3 MB/s
>> Data source: one ServerSocket fed from a local file
>> Streaming app's name: StatefulNetworkWordCount
>> Job generation frequency: one job per second
>> Checkpoint time: once per 10 s
>> JobManager.numThreads = 2)
>>
>> (Another workload might have the same problem: PageViewStream's
>> slidingPageCounts.)
>>
>> 2.
>> Does anybody have Top-K word count streaming source code?
>>
>> 3.
>> Can anybody share a real-world streaming example, including, for example,
>> source code and cluster configuration details?
>>
>> thanks,
>> dachuan.
>>
>> --
>> Dachuan Huang
>> Cellphone: 614-390-7234
>> 2015 Neil Avenue
>> Ohio State University
>> Columbus, Ohio
>> U.S.A.
>> 43210
>>

--
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A.
43210
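
P.S. Regarding the cluster configuration Ryan describes above (coarse-grained mode on Mesos, a cap on CPUs per job, a bigger heap, shorter RDD cache time), here is a rough sketch of how those knobs might be expressed with SparkConf (Spark 0.9+); the values and the Mesos master URL are placeholders, not Sharethrough's actual settings:

import org.apache.spark.SparkConf

// Property names are from the standard Spark / Mesos configuration docs;
// all values below are placeholders for illustration only.
val conf = new SparkConf()
  .setMaster("mesos://zk://zk-host:2181/mesos") // hypothetical Mesos master URL
  .setAppName("StreamingTopK")
  .set("spark.mesos.coarse", "true")   // coarse-grained mode on Mesos
  .set("spark.cores.max", "8")         // cap the total CPUs this job can claim
  .set("spark.executor.memory", "4g")  // bigger executor heap
  .set("spark.cleaner.ttl", "3600")    // keep cached RDDs/metadata at most this many seconds
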
