Re: Loading RDDs in a streaming fashion

2014-12-02 Thread Ashish Rangole
This is a common use case and this is how Hadoop APIs for reading data
work, they return an Iterator [Your Record] instead of reading every record
in at once.
On Dec 1, 2014 9:43 PM, "Andy Twigg"  wrote:

> You may be able to construct RDDs directly from an iterator - not sure
> - you may have to subclass your own.
>
> On 1 December 2014 at 18:40, Keith Simmons  wrote:
> > Yep, that's definitely possible.  It's one of the workarounds I was
> > considering.  I was just curious if there was a simpler (and perhaps more
> > efficient) approach.
> >
> > Keith
> >
> > On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg  wrote:
> >>
> >> Could you modify your function so that it streams through the files
> record
> >> by record and outputs them to hdfs, then read them all in as RDDs and
> take
> >> the union? That would only use bounded memory.
> >>
> >> On 1 December 2014 at 17:19, Keith Simmons  wrote:
> >>>
> >>> Actually, I'm working with a binary format.  The api allows reading
> out a
> >>> single record at a time, but I'm not sure how to get those records into
> >>> spark (without reading everything into memory from a single file at
> once).
> >>>
> >>>
> >>>
> >>> On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg 
> wrote:
> >
> > file => tranform file into a bunch of records
> 
> 
>  What does this function do exactly? Does it load the file locally?
>  Spark supports RDDs exceeding global RAM (cf the terasort example),
> but
>  if your example just loads each file locally, then this may cause
> problems.
>  Instead, you should load each file into an rdd with
> context.textFile(),
>  flatmap that and union these rdds.
> 
>  also see
> 
> 
> http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
> 
> 
>  On 1 December 2014 at 16:50, Keith Simmons  wrote:
> >
> > This is a long shot, but...
> >
> > I'm trying to load a bunch of files spread out over hdfs into an RDD,
> > and in most cases it works well, but for a few very large files, I
> exceed
> > available memory.  My current workflow basically works like this:
> >
> > context.parallelize(fileNames).flatMap { file =>
> >   tranform file into a bunch of records
> > }
> >
> > I'm wondering if there are any APIs to somehow "flush" the records
> of a
> > big dataset so I don't have to load them all into memory at once.  I
> know
> > this doesn't exist, but conceptually:
> >
> > context.parallelize(fileNames).streamMap { (file, stream) =>
> >  for every 10K records write records to stream and flush
> > }
> >
> > Keith
> 
> 
> >>>
> >>
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
You may be able to construct RDDs directly from an iterator - not sure
- you may have to subclass your own.

On 1 December 2014 at 18:40, Keith Simmons  wrote:
> Yep, that's definitely possible.  It's one of the workarounds I was
> considering.  I was just curious if there was a simpler (and perhaps more
> efficient) approach.
>
> Keith
>
> On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg  wrote:
>>
>> Could you modify your function so that it streams through the files record
>> by record and outputs them to hdfs, then read them all in as RDDs and take
>> the union? That would only use bounded memory.
>>
>> On 1 December 2014 at 17:19, Keith Simmons  wrote:
>>>
>>> Actually, I'm working with a binary format.  The api allows reading out a
>>> single record at a time, but I'm not sure how to get those records into
>>> spark (without reading everything into memory from a single file at once).
>>>
>>>
>>>
>>> On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg  wrote:
>
> file => tranform file into a bunch of records


 What does this function do exactly? Does it load the file locally?
 Spark supports RDDs exceeding global RAM (cf the terasort example), but
 if your example just loads each file locally, then this may cause problems.
 Instead, you should load each file into an rdd with context.textFile(),
 flatmap that and union these rdds.

 also see

 http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files


 On 1 December 2014 at 16:50, Keith Simmons  wrote:
>
> This is a long shot, but...
>
> I'm trying to load a bunch of files spread out over hdfs into an RDD,
> and in most cases it works well, but for a few very large files, I exceed
> available memory.  My current workflow basically works like this:
>
> context.parallelize(fileNames).flatMap { file =>
>   tranform file into a bunch of records
> }
>
> I'm wondering if there are any APIs to somehow "flush" the records of a
> big dataset so I don't have to load them all into memory at once.  I know
> this doesn't exist, but conceptually:
>
> context.parallelize(fileNames).streamMap { (file, stream) =>
>  for every 10K records write records to stream and flush
> }
>
> Keith


>>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
Yep, that's definitely possible.  It's one of the workarounds I was
considering.  I was just curious if there was a simpler (and perhaps more
efficient) approach.

Keith

On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg  wrote:

> Could you modify your function so that it streams through the files record
> by record and outputs them to hdfs, then read them all in as RDDs and take
> the union? That would only use bounded memory.
>
> On 1 December 2014 at 17:19, Keith Simmons  wrote:
>
>> Actually, I'm working with a binary format.  The api allows reading out a
>> single record at a time, but I'm not sure how to get those records into
>> spark (without reading everything into memory from a single file at once).
>>
>>
>>
>> On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg  wrote:
>>
>>> file => tranform file into a bunch of records
>>>
>>>
>>> What does this function do exactly? Does it load the file locally?
>>> Spark supports RDDs exceeding global RAM (cf the terasort example), but
>>> if your example just loads each file locally, then this may cause problems.
>>> Instead, you should load each file into an rdd with context.textFile(),
>>> flatmap that and union these rdds.
>>>
>>> also see
>>>
>>> http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
>>>
>>>
>>> On 1 December 2014 at 16:50, Keith Simmons  wrote:
>>>
 This is a long shot, but...

 I'm trying to load a bunch of files spread out over hdfs into an RDD,
 and in most cases it works well, but for a few very large files, I exceed
 available memory.  My current workflow basically works like this:

 context.parallelize(fileNames).flatMap { file =>
   tranform file into a bunch of records
 }

 I'm wondering if there are any APIs to somehow "flush" the records of a
 big dataset so I don't have to load them all into memory at once.  I know
 this doesn't exist, but conceptually:

 context.parallelize(fileNames).streamMap { (file, stream) =>
  for every 10K records write records to stream and flush
 }

 Keith

>>>
>>>
>>
>


Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
Could you modify your function so that it streams through the files record
by record and outputs them to hdfs, then read them all in as RDDs and take
the union? That would only use bounded memory.

On 1 December 2014 at 17:19, Keith Simmons  wrote:

> Actually, I'm working with a binary format.  The api allows reading out a
> single record at a time, but I'm not sure how to get those records into
> spark (without reading everything into memory from a single file at once).
>
>
>
> On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg  wrote:
>
>> file => tranform file into a bunch of records
>>
>>
>> What does this function do exactly? Does it load the file locally?
>> Spark supports RDDs exceeding global RAM (cf the terasort example), but
>> if your example just loads each file locally, then this may cause problems.
>> Instead, you should load each file into an rdd with context.textFile(),
>> flatmap that and union these rdds.
>>
>> also see
>>
>> http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
>>
>>
>> On 1 December 2014 at 16:50, Keith Simmons  wrote:
>>
>>> This is a long shot, but...
>>>
>>> I'm trying to load a bunch of files spread out over hdfs into an RDD,
>>> and in most cases it works well, but for a few very large files, I exceed
>>> available memory.  My current workflow basically works like this:
>>>
>>> context.parallelize(fileNames).flatMap { file =>
>>>   tranform file into a bunch of records
>>> }
>>>
>>> I'm wondering if there are any APIs to somehow "flush" the records of a
>>> big dataset so I don't have to load them all into memory at once.  I know
>>> this doesn't exist, but conceptually:
>>>
>>> context.parallelize(fileNames).streamMap { (file, stream) =>
>>>  for every 10K records write records to stream and flush
>>> }
>>>
>>> Keith
>>>
>>
>>
>


Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
Actually, I'm working with a binary format.  The api allows reading out a
single record at a time, but I'm not sure how to get those records into
spark (without reading everything into memory from a single file at once).



On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg  wrote:

> file => tranform file into a bunch of records
>
>
> What does this function do exactly? Does it load the file locally?
> Spark supports RDDs exceeding global RAM (cf the terasort example), but if
> your example just loads each file locally, then this may cause problems.
> Instead, you should load each file into an rdd with context.textFile(),
> flatmap that and union these rdds.
>
> also see
>
> http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
>
>
> On 1 December 2014 at 16:50, Keith Simmons  wrote:
>
>> This is a long shot, but...
>>
>> I'm trying to load a bunch of files spread out over hdfs into an RDD, and
>> in most cases it works well, but for a few very large files, I exceed
>> available memory.  My current workflow basically works like this:
>>
>> context.parallelize(fileNames).flatMap { file =>
>>   tranform file into a bunch of records
>> }
>>
>> I'm wondering if there are any APIs to somehow "flush" the records of a
>> big dataset so I don't have to load them all into memory at once.  I know
>> this doesn't exist, but conceptually:
>>
>> context.parallelize(fileNames).streamMap { (file, stream) =>
>>  for every 10K records write records to stream and flush
>> }
>>
>> Keith
>>
>
>


Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
>
> file => tranform file into a bunch of records


What does this function do exactly? Does it load the file locally?
Spark supports RDDs exceeding global RAM (cf the terasort example), but if
your example just loads each file locally, then this may cause problems.
Instead, you should load each file into an rdd with context.textFile(),
flatmap that and union these rdds.

also see
http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files


On 1 December 2014 at 16:50, Keith Simmons  wrote:

> This is a long shot, but...
>
> I'm trying to load a bunch of files spread out over hdfs into an RDD, and
> in most cases it works well, but for a few very large files, I exceed
> available memory.  My current workflow basically works like this:
>
> context.parallelize(fileNames).flatMap { file =>
>   tranform file into a bunch of records
> }
>
> I'm wondering if there are any APIs to somehow "flush" the records of a
> big dataset so I don't have to load them all into memory at once.  I know
> this doesn't exist, but conceptually:
>
> context.parallelize(fileNames).streamMap { (file, stream) =>
>  for every 10K records write records to stream and flush
> }
>
> Keith
>


Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
This is a long shot, but...

I'm trying to load a bunch of files spread out over hdfs into an RDD, and
in most cases it works well, but for a few very large files, I exceed
available memory.  My current workflow basically works like this:

context.parallelize(fileNames).flatMap { file =>
  tranform file into a bunch of records
}

I'm wondering if there are any APIs to somehow "flush" the records of a big
dataset so I don't have to load them all into memory at once.  I know this
doesn't exist, but conceptually:

context.parallelize(fileNames).streamMap { (file, stream) =>
 for every 10K records write records to stream and flush
}

Keith