Spark can handle this, true, but it is optimized for working on the same full 
dataset in memory, because machine learning algorithms are iterative by nature. 
Of course you can spill over to disk, but that is something you should avoid.
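
To make that concrete, here is a minimal sketch of the kind of iterative job I 
mean (the input path, model update, and iteration count are placeholders, not 
anything from this thread). The dataset is cached once and re-scanned on every 
pass; MEMORY_AND_DISK lets partitions that do not fit spill to local disk, which 
works but costs a disk read on each iteration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("iterative-sketch").getOrCreate()

// Cache the dataset once; MEMORY_AND_DISK spills partitions that do not fit
// in memory to local disk instead of failing or recomputing them.
val training = spark.read.parquet("hdfs:///data/training")  // placeholder path
  .persist(StorageLevel.MEMORY_AND_DISK)

var weights = Array.fill(10)(0.0)  // placeholder model state
for (_ <- 1 to 20) {
  // Each iteration scans the same cached dataset; spilled partitions are
  // read back from disk, which is the overhead to avoid.
  val gradient = training.rdd.map(_ => 1.0).reduce(_ + _)  // placeholder computation
  weights = weights.map(_ - 0.01 * gradient)
}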

That being said, you should have read my final sentence about this: both systems 
develop and change.


> On 25 May 2016, at 22:14, Reynold Xin <r...@databricks.com> wrote:
> 
> 
>> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> >> Spark is more for machine learning, working iteratively over the whole same 
>> dataset in memory. Additionally it has streaming and graph processing 
>> capabilities that can be used together. 
> 
> Hi Jörn,
> 
> The first part is actually not true. Spark can handle data far greater than 
> the aggregate memory available on a cluster. The more recent versions (1.3+) 
> of Spark have external operations for almost all built-in operators, and 
> while things may not be perfect, those external operators are becoming more 
> and more robust with each version of Spark.
> 
