Yes, that is certainly feasible. But Chapter 11 <http://shop.oreilly.com/product/0636920028512.do> isn't written yet.
On Tue, Oct 22, 2013 at 11:02 AM, Timothy Perrigo <[email protected]> wrote:

> As the newbie who started the conversation, I'd like to thank everyone for the feedback and the subsequent discussion. I certainly understand the point that there's no magic rule book that can take the place of learning the ins-and-outs of distributed / cluster computing -- a certain amount of pain is to be expected. I'd like to add, too, that so far, with Spark, this pain has been surprisingly minimal, thanks in no small part to the information I've gleaned (directly or indirectly) from this mailing list.
>
> However, any additional information is always welcome. In my own case, what I think I would really benefit from would be a start-to-finish example of a problem that works on a large-ish dataset. In particular, it would be helpful to know what parameters have to be considered, what they are set to, and the rationale behind how those values were obtained, as well as a discussion about determining a "good" cluster size / configuration for the example problem. (In fact, if anyone knows of such an example, I would be very appreciative!) This certainly won't make everything completely painless, but it would be invaluable and certainly seems feasible.
>
> Thanks again, everyone, for your help and advice.
>
> Tim
>
> On Tue, Oct 22, 2013 at 12:01 PM, Mark Hamstra <[email protected]> wrote:
>
>> Yes, there are certainly rough spots and sharp edges that we can work at polishing out and rounding over, and there are people working on such things. Don't get me wrong, feedback from users about what they are finding too difficult, opaque or impenetrable is useful; but I don't think that the expectation that working with a framework like Spark should be smooth and easy can be completely met. Even when all of the documentation, guidance, instrumentation and user interface are in place, there will still be a lot for users to come to terms with.
>>
>> On Tue, Oct 22, 2013 at 9:50 AM, Aaron Davidson <[email protected]> wrote:
>>
>>> On the other hand, I totally agree that memory usage in Spark is rather opaque, and it is one area where we could do a lot better in terms of communicating issues, through both docs and instrumentation. At least with serialization and such, you can get meaningful exceptions (hopefully), but OOMs are just a blanket "something wasn't right somewhere." Debugging them empirically would require deep diving into Spark's heap allocations, which requires a lot more knowledge of Spark internals than should be needed for general usage.
>>>
>>> On Tue, Oct 22, 2013 at 9:22 AM, Mark Hamstra <[email protected]> wrote:
>>>
>>>> Yes, but that also illustrates the problem faced by anyone trying to write a "little white paper or guidelines" to make newbies' experience painless. Distributed computing clusters are necessarily complex things, and problems can crop up in multiple locations, layers or subsystems. It's just not feasible to quickly bring up to speed someone with no experience in distributed programming and cluster systems. It takes a lot of knowledge, both broad and deep. Very few people have the complete scope of knowledge and experience required, so creating, debugging and maintaining a cluster computing application almost always has to be a team effort.
>>>> Support organizations and communities can replace some of the need for a knowledgeable and well-functioning team, but not all of it; and at some point you have to expect that debugging is going to take a considerable amount of painstaking, systematic effort -- including a close reading of the available docs.
>>>>
>>>> Several people are working on making more and better reference and training material available, and some of that will include troubleshooting guidance, but that doesn't mean that there can ever be "one little paper" to solve newbies' (or more experienced developers') problems or provide adequate guidance. There's just too much to cover and too many different kinds or levels of initial-user knowledge to make that completely feasible.
>>>>
>>>> On Tue, Oct 22, 2013 at 8:50 AM, Shay Seng <[email protected]> wrote:
>>>>
>>>>> Hey Mark, I didn't mean to say that the information isn't out there -- just that when something goes wrong with Spark, the scope of what could be wrong is so large: some bad setting with the JVM, the serializer, akka, badly written Scala code, a wrong algorithm, check worker logs, check executor stderrs, ....
>>>>>
>>>>> When I looked at this post this morning, my initial thought wasn't that "countByValue" would be at fault -- probably since I've only been using Scala/Spark for a month or so.
>>>>>
>>>>> It was just a suggestion to help newbies come up to speed more quickly and gain insights into how to debug issues.
>>>>>
>>>>> On Tue, Oct 22, 2013 at 8:14 AM, Mark Hamstra <[email protected]> wrote:
>>>>>
>>>>>> There's no need to guess at that. The docs tell you directly:
>>>>>>
>>>>>> def countByValue(): Map[T, Long]
>>>>>>
>>>>>> Return the count of each unique value in this RDD as a map of (value, count) pairs. The final combine step happens locally on the master, equivalent to running a single reduce task.
>>>>>>
>>>>>> On Tue, Oct 22, 2013 at 7:22 AM, Shay Seng <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Matei,
>>>>>>>
>>>>>>> I've seen several memory-tuning queries on this mailing list, and also heard the same kinds of queries at the Spark meetup. In fact, the last bullet point in Josh Carver's(?) slides, the guy from Bizo, was "memory tuning is still a mystery".
>>>>>>>
>>>>>>> I certainly had lots of issues when I first started. From memory issues to GC issues, things seem to run fine until you try something with 500GB of data, etc.
>>>>>>>
>>>>>>> I was wondering if you could write up a little white paper or some guidelines on how to set memory values, and what to look at when something goes wrong? E.g. I would never have guessed that countByValue happens on a single machine, etc.
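A rough sketch of what the countByValue doc quoted above implies (and what Matei explains just below) -- this is an illustration, not Spark's actual implementation, and it reuses the mapped RDD from Tim's code at the bottom of the thread: the per-value counts can be computed in parallel, but the full (value, count) map ends up on the driver, which is why counting millions of distinct trigrams this way can run out of memory even when each task is small.

  // Sketch only: roughly what the quoted countByValue doc describes.
  val localCounts: Map[String, Long] =
    mapped.map(v => (v, 1L))
          .reduceByKey(_ + _)   // partial counts computed in parallel on the cluster
          .collect()            // ...but every (trigram, count) pair is shipped to the driver
          .toMap                // and held in a single in-memory map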
>>>>>>> On Oct 21, 2013 6:18 PM, "Matei Zaharia" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> The problem is that countByValue happens in only a single reduce task -- this is probably something we should fix, but it's basically not designed for lots of values. Instead, do the count in parallel as follows:
>>>>>>>>
>>>>>>>> val counts = mapped.map(str => (str, 1)).reduceByKey((a, b) => a + b)
>>>>>>>>
>>>>>>>> If this still has trouble, you can also increase the level of parallelism of reduceByKey by passing it a second parameter for the number of tasks (e.g. 100).
>>>>>>>>
>>>>>>>> BTW, one other small thing with your code: flatMap should actually work fine if your function returns an Iterator or Traversable, so there's no need to call toList and return a Seq in ngrams; you can just return an Iterator[String].
>>>>>>>>
>>>>>>>> Matei
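Putting Matei's two suggestions together, a sketch of how Tim's job (quoted below) could find the top trigram without countByValue -- the names are reused from his example, and the task count of 100 is just the illustrative figure from Matei's note, not a tuned value:

  // Sketch only: ngrams returning an Iterator, and the count done with reduceByKey.
  def ngrams(s: String, n: Int = 3): Iterator[String] =
    s.split("\\s+").sliding(n).filter(_.length == n).map(_.mkString(" ").trim)

  val mapped = text.filter(_.trim.length > 0).flatMap(ngrams(_, 3))

  // Parallel counts (100 reduce tasks), then a parallel reduce for the max --
  // the full (trigram, count) map never has to fit on one machine.
  val counts   = mapped.map(str => (str, 1L)).reduceByKey((a, b) => a + b, 100)
  val topNgram = counts.reduce((a, b) => if (a._2 > b._2) a else b)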
>>>>>>>> On Oct 21, 2013, at 1:05 PM, Timothy Perrigo <[email protected]> wrote:
>>>>>>>>
>>>>>>>> > Hi everyone,
>>>>>>>> >
>>>>>>>> > I am very new to Spark, so as a learning exercise I've set up a small cluster consisting of 4 EC2 m1.large instances (1 master, 3 slaves), which I'm hoping to use to calculate ngram frequencies from text files of various sizes (I'm not doing anything with them; I just thought this would be slightly more interesting than the usual 'word count' example). Currently, I'm trying to work with a 1GB text file, but I'm running into memory issues. I'm wondering what parameters I should be setting (in spark-env.sh) in order to properly utilize the cluster. Right now, I'd be happy just to have the process complete successfully with the 1 gig file, so I'd really appreciate any suggestions you all might have.
>>>>>>>> >
>>>>>>>> > Here's a summary of the code I'm running through the spark shell on the master:
>>>>>>>> >
>>>>>>>> > def ngrams(s: String, n: Int = 3): Seq[String] = {
>>>>>>>> >   (s.split("\\s+").sliding(n)).filter(_.length == n).map(_.mkString(" ")).map(_.trim).toList
>>>>>>>> > }
>>>>>>>> >
>>>>>>>> > val text = sc.textFile("s3n://my-bucket/my-1gb-text-file")
>>>>>>>> >
>>>>>>>> > val mapped = text.filter(_.trim.length > 0).flatMap(ngrams(_, 3))
>>>>>>>> >
>>>>>>>> > So far so good; the problems come during the reduce phase. With small files, I was able to issue the following to calculate the most frequently occurring trigram:
>>>>>>>> >
>>>>>>>> > val topNgram = (mapped countByValue) reduce((a: (String, Long), b: (String, Long)) => if (a._2 > b._2) a else b)
>>>>>>>> >
>>>>>>>> > With the 1 gig file, though, I've been running into OutOfMemory errors, so I decided to split the reduction into several steps, starting with simply issuing countByValue on my "mapped" RDD, but I have yet to get it to complete successfully.
>>>>>>>> >
>>>>>>>> > SPARK_MEM is currently set to 6154m. I also bumped up the spark.akka.framesize setting to 500 (though at this point, I was grasping at straws; I'm not sure what a "proper" value would be). What properties should I be setting for a job of this size on a cluster of 3 m1.large slaves? (The cluster was initially configured using the spark-ec2 scripts.) Also, programmatically, what should I be doing differently? (For example, should I be setting the minimum number of splits when reading the text file? If so, what would be a good default?)
>>>>>>>> >
>>>>>>>> > I apologize for what I'm sure are very naive questions. I think Spark is a fantastic project and have enjoyed working with it, but I'm still very much a newbie and would appreciate any help you all can provide (as well as any 'rules-of-thumb' or best practices I should be following).
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Tim Perrigo
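On Tim's last question about splits, one concrete knob as a sketch rather than a recommendation: textFile takes an optional second argument for the minimum number of splits, so a 1 GB file can be read into more partitions than the default. The 64 below is purely illustrative, not a suggested value.

  // Sketch only: ask for more input splits when reading the file.
  val text   = sc.textFile("s3n://my-bucket/my-1gb-text-file", 64)
  val mapped = text.filter(_.trim.length > 0).flatMap(ngrams(_, 3))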
