I think it would be very handy to be able to specify that you want sorting during a partitioning stage.
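To make the idea concrete, here is a rough sketch in plain Python rather than Spark (the `sort_within_partitions` helper and the list-of-lists "partitions" are illustrative stand-ins, not a Spark API): each partition is sorted independently, so at most one partition's worth of data is held at a time and no global ordering is produced.

```python
# Illustrative sketch, NOT a Spark API: sort each partition on its own,
# analogous in spirit to rdd.mapPartitions(sorted) -- per-partition order
# only, no global sort, and only one partition is materialized at a time.

def sort_within_partitions(partitions):
    """Yield each partition with its items sorted; partitions stay separate."""
    for part in partitions:
        yield sorted(part)

partitions = [[3, 1, 2], [9, 7, 8], [5, 4, 6]]
result = list(sort_within_partitions(partitions))
# result -> [[1, 2, 3], [7, 8, 9], [4, 5, 6]]
# Each partition is sorted, but the dataset as a whole is not.
```

Note that a `mapPartitions`-style approach like this still materializes each individual partition in memory to sort it, which is exactly the limitation the thread below discusses.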
On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover <roger.hoo...@gmail.com> wrote:

> Hi Aaron,
>
> When you say that sorting is being worked on, can you elaborate a little
> more please?
>
> In particular, I want to sort the items within each partition (not
> globally) without necessarily bringing them all into memory at once.
>
> Thanks,
>
> Roger
>
> On Sat, May 31, 2014 at 11:10 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> There is no fundamental issue if you're running on data that is larger
>> than cluster memory size. Many operations can stream data through, and
>> thus memory usage is independent of input data size. Certain operations
>> require an entire *partition* (not dataset) to fit in memory, but there
>> are not many instances of this left (sorting comes to mind, and this is
>> being worked on).
>>
>> In general, one problem with Spark today is that you *can* OOM under
>> certain configurations, and it's possible you'll need to change from the
>> default configuration if you're doing very memory-intensive jobs.
>> However, there are very few cases where Spark would simply fail as a
>> matter of course -- for instance, you can always increase the number of
>> partitions to decrease the size of any given one, or repartition data to
>> eliminate skew.
>>
>> Regarding impact on performance, as Mayur said, there may absolutely be
>> an impact depending on your jobs. If you're doing a join on a very large
>> amount of data with few partitions, then we'll have to spill to disk. If
>> you can't cache your working set of data in memory, you will also see a
>> performance degradation. Spark enables the use of memory to make things
>> fast, but if you just don't have enough memory, it won't be terribly fast.
>>
>> On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>
>>> Clearly there will be an impact on performance, but frankly it depends
>>> on what you are trying to achieve with the dataset.
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>> On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga <vibhorba...@gmail.com> wrote:
>>>
>>>> Some inputs will be really helpful.
>>>>
>>>> Thanks,
>>>> -Vibhor
>>>>
>>>> On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga <vibhorba...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am planning to use Spark with HBase, where I generate an RDD by
>>>>> reading data from an HBase table.
>>>>>
>>>>> I want to know: in the case where the size of the HBase table grows
>>>>> larger than the size of RAM available in the cluster, will the
>>>>> application fail, or will there be an impact on performance?
>>>>>
>>>>> Any thoughts in this direction will be helpful and are welcome.
>>>>>
>>>>> Thanks,
>>>>> -Vibhor
>>>>
>>>> --
>>>> Vibhor Banga
>>>> Software Development Engineer
>>>> Flipkart Internet Pvt. Ltd., Bangalore
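Aaron's remark above, that you can always increase the number of partitions to decrease the size of any given one, can be sketched in plain Python (the `round_robin` helper is an illustrative stand-in for a repartitioning step, not a Spark API):

```python
# Illustrative sketch, NOT a Spark API: distributing the same dataset
# across more partitions shrinks the largest partition, which is what
# keeps any single partition within available memory.

def round_robin(items, num_partitions):
    """Distribute items across num_partitions in round-robin order."""
    parts = [[] for _ in range(num_partitions)]
    for i, item in enumerate(items):
        parts[i % num_partitions].append(item)
    return parts

data = list(range(100))
few  = round_robin(data, 4)   # largest partition holds 25 items
many = round_robin(data, 20)  # largest partition holds only 5 items
```

In Spark terms this is the effect of asking for more partitions when reading the data or repartitioning an RDD: total data volume is unchanged, but the per-partition (and hence per-task) memory footprint drops.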