And I didn't mean to skip over you, Koert.  I'm just more familiar with
what Oscar said on the subject than with your opinion.



On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra <[email protected]> wrote:

>> Hmmm... I was unaware of this concept that Spark is for medium to large
>> datasets but not for very large datasets.
>
>
> It is in the opinion of some at Twitter.  That doesn't make it true or a
> universally held opinion.
>
>
>
> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole <[email protected]> wrote:
>
>> Hmmm... I was unaware of this concept that Spark is for medium to large
>> datasets but not for very large datasets. What size is very large?
>>
>> Can someone please elaborate on why this would be the case and what stops
>> Spark, as it is today, from being run successfully on very large datasets?
>> I'd appreciate it.
>>
>> I would think that Spark should be able to pull off Hadoop-level
>> throughput in the worst case with DISK_ONLY caching.
>>
>> Thanks
>> On Oct 28, 2013 1:37 PM, "Koert Kuipers" <[email protected]> wrote:
>>
>>> i would say scalding (cascading + DSL for scala) offers similar
>>> functionality to spark, and a similar syntax.
>>> the main difference between spark and scalding is target jobs:
>>> scalding is for long running jobs on very large data. the data is read
>>> from and written to disk between steps. jobs run from minutes to days.
>>> spark is for faster jobs on medium to large data. the data is primarily
>>> held in memory. jobs run from a few seconds to a few hours. although spark
>>> can work with data on disks it still makes assumptions that data needs to
>>> fit in memory for certain steps (although less and less with every
>>> release). spark also makes iterative designs much easier.
>>>
>>> i have found them both great to program in and complementary. we use
>>> scalding for overnight batch processes and spark for more realtime
>>> processes. at this point i would trust scalding a lot more due to the
>>> robustness of the stack, but spark is getting better every day.
>>>
>>>
>>>
>>>
>>> On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan <[email protected]> wrote:
>>>
>>>> Hi Philip,
>>>>
>>>> Cascading is relatively agnostic about the distributed topology
>>>> underneath it, especially as of the 2.0 release over a year ago. There's
>>>> been some discussion about writing a flow planner for Spark -- e.g., which
>>>> would replace the Hadoop flow planner. Not sure if there's active work on
>>>> that yet.
>>>>
>>>> There are a few commercial workflow abstraction layers (probably what
>>>> was meant by "application layer" ?), in terms of the Cascading family
>>>> (incl. Cascalog, Scalding), and also Actian's integration of
>>>> Hadoop/Knime/etc., and also the work by Continuum, ODG, and others in the
>>>> Py data stack.
>>>>
>>>> Spark would not be at the same level of abstraction as Cascading
>>>> (business logic, effectively); however, something like MLbase is ostensibly
>>>> intended for that http://www.mlbase.org/
>>>>
>>>> With respect to Spark, two other things to watch... One would
>>>> definitely be the Py data stack and ability to integrate with PySpark,
>>>> which is turning out to be a very powerful abstraction -- quite close to a
>>>> large segment of industry needs.  The other project to watch, on the Scala
>>>> side, is Summingbird and its evolution at Twitter:
>>>> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>>>>
>>>> Paco
>>>> http://amazon.com/dp/1449358721/
>>>>
>>>>
>>>> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren <[email protected]> wrote:
>>>>
>>>>>
>>>>> My team is investigating a number of technologies in the Big Data
>>>>> space.  A team member recently got turned on to Cascading
>>>>> <http://www.cascading.org/about-cascading/> as an application
>>>>> layer for orchestrating complex workflows/scenarios.  He
>>>>> asked me if Spark had an "application layer".  My initial reaction is "no":
>>>>> Spark would not have a separate orchestration/application layer.
>>>>> Instead, the core Spark API (along with Streaming) would compete directly
>>>>> with Cascading for this kind of functionality and that the two would not
>>>>> likely be all that complementary.  I realize that I am exposing my
>>>>> ignorance here and could be way off.  Is there anyone who knows a bit about
>>>>> both of these technologies who could speak to this in broad strokes?
>>>>>
>>>>> Thanks!
>>>>> Philip
>>>>>
>>>>>
>>>>
>>>
>
