RE: Re: RDD to InputStream

2022-12-25 Thread ayuio5799
The General

On 2015/03/18 17:20:54 Ayoub wrote:
> In case it is of interest to others, here is what I came up with, and it
> seems to work fine:
> [...]

Sent from a Samsung Galaxy smartphone.

Re: RDD to InputStream

2015-03-18 Thread Ayoub
In case it is of interest to others, here is what I came up with, and it
seems to work fine:

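  import org.apache.spark.rdd.RDD

  // Exposes an RDD[String] as a local stream of UTF-8 bytes without
  // collecting the whole RDD on the driver.
  case class RDDAsInputStream(private val rdd: RDD[String]) extends
  java.io.InputStream {
    // toLocalIterator pulls one partition at a time to the driver.
    private val bytes = rdd.flatMap(_.getBytes("UTF-8")).toLocalIterator

    // InputStream.read() must return 0-255, or -1 at end of stream, so
    // mask the signed Byte instead of calling toInt on it.
    override def read(): Int =
      if (bytes.hasNext) bytes.next() & 0xff
      else -1

    override def markSupported(): Boolean = false
  }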
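
For illustration only, the same idea can be sketched without Spark by wrapping any Iterator[String], which is what rdd.toLocalIterator would supply. The class name IteratorInputStream and the per-string buffering are assumptions, not part of the post above. Encoding one string at a time and overriding the bulk read is one possible way to avoid a method call per byte. As in the post, the strings are concatenated with no separator.

```scala
import java.io.InputStream
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical helper (not from the thread): streams the UTF-8 bytes of each
// string in turn, holding only one encoded string in memory at a time.
class IteratorInputStream(strings: Iterator[String]) extends InputStream {
  private var buf: Array[Byte] = Array.emptyByteArray
  private var pos = 0

  // Advance to the next non-empty string; false once the input is exhausted.
  private def fill(): Boolean = {
    while (pos >= buf.length && strings.hasNext) {
      buf = strings.next().getBytes(UTF_8)
      pos = 0
    }
    pos < buf.length
  }

  // Single-byte read: returns 0-255, or -1 at end of stream.
  override def read(): Int =
    if (fill()) { val b = buf(pos) & 0xff; pos += 1; b } else -1

  // Bulk read: copies as much of the current string as fits in the request.
  override def read(b: Array[Byte], off: Int, len: Int): Int =
    if (len == 0) 0
    else if (!fill()) -1
    else {
      val n = math.min(len, buf.length - pos)
      System.arraycopy(buf, pos, b, off, n)
      pos += n
      n
    }
}
```

With Spark on the classpath, it could presumably be constructed as new IteratorInputStream(rdd.toLocalIterator).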






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-RDD-to-InputStream-tp22121.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: RDD to InputStream

2015-03-13 Thread Sean Owen
OK, then you do not want to collect() the RDD. You can get an iterator, yes.
There is no direct way to turn an Iterator into an InputStream: an
Iterator is a sequence of arbitrary objects, while an InputStream is a
channel to a stream of bytes.
I think you can employ similar Guava / Commons utilities to turn an
Iterator of Strings into a stream of Readers, join the Readers, and
encode the result as bytes in an InputStream.
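
One way to sketch the "join the pieces" idea using only the JDK instead of Guava / Commons (a substitution, not what the message prescribes): java.io.SequenceInputStream concatenates an Enumeration of InputStreams and pulls them from the Enumeration one at a time as each is exhausted. The helper name asInputStream is hypothetical, and a plain Iterator[String] stands in for rdd.toLocalIterator.

```scala
import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Enumeration

object JoinStreams {
  // Hypothetical helper: one small in-memory stream per string, joined in order.
  def asInputStream(strings: Iterator[String]): InputStream = {
    val parts = new Enumeration[InputStream] {
      override def hasMoreElements(): Boolean = strings.hasNext
      // Each string is encoded only when the joined stream asks for it.
      override def nextElement(): InputStream =
        new ByteArrayInputStream(strings.next().getBytes(UTF_8))
    }
    new SequenceInputStream(parts)
  }
}
```

This encodes each string on its own, so it only suits encodings such as UTF-8 where concatenating the encoded pieces gives the same bytes as encoding the concatenated text.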


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDD to InputStream

2015-03-13 Thread Ayoub
Thanks Sean,

I forgot to mention that the data is too big to be collected on the driver.

So yes, your proposal would work in theory, but in my case I cannot hold
all the data in driver memory, so it wouldn't work.

I guess the crucial point is to do the collect in a lazy way. On that
subject, I noticed that we can get a local iterator from an RDD, but that
raises two questions:

- does that involve an immediate collect, just like with "collect()", or is it
  a lazy process?
- how do we go from an iterator to an InputStream?
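
As a rough illustration of the difference the first question is about, here is a Spark-free sketch in which a list of in-memory "partitions" stands in for the RDD (an assumption made so the example runs on its own; the names LazyVsEager, fetch, eager and lazily are hypothetical). Materializing everything touches every partition up front, much as collect() brings the whole RDD to the driver. Iterating lazily touches a partition only when its elements are requested, which matches how RDD.toLocalIterator is documented: the iterator consumes about as much memory as the largest partition.

```scala
object LazyVsEager {
  // Stand-in for an RDD: each inner Seq plays the role of one partition.
  val partitions: Seq[Seq[String]] = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e"))

  // Counts how many partitions have been fetched so far.
  var fetched = 0
  private def fetch(i: Int): Seq[String] = { fetched += 1; partitions(i) }

  // Eager: every partition is fetched before the caller sees anything.
  def eager(): Array[String] = partitions.indices.flatMap(fetch).toArray

  // Lazy: a partition is fetched only when iteration reaches it.
  def lazily(): Iterator[String] = partitions.indices.iterator.flatMap(fetch)
}
```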






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-RDD-to-InputStream-tp22032.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: RDD to InputStream

2015-03-13 Thread Sean Owen
These are quite different creatures. You have a distributed set of
Strings, but want a local stream of bytes, which involves three
conversions:

- collect data to driver
- concatenate strings in some way
- encode strings as bytes according to an encoding

Your approach is OK, but it might be faster to avoid disk if you have
enough memory:

- collect() to an Array[String] locally
- use Guava utilities to turn a bunch of Strings into a Reader
- use the Apache Commons ReaderInputStream to read it as encoded bytes

I do wonder whether that's really what you want to do, though.
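
A minimal sketch of the in-memory route described above, assuming the data fits on the driver. It uses only the JDK rather than Guava and Commons IO (a deliberate substitution to keep the example self-contained). An Array[String] stands in for the result of rdd.collect(), and the newline separator and the helper name toInputStream are assumptions.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets.UTF_8

object InMemory {
  // Hypothetical helper covering the "concatenate" and "encode" steps.
  def toInputStream(collected: Array[String], sep: String = "\n"): InputStream = {
    // Concatenate the strings in some way (here: joined with a separator).
    val joined = collected.mkString(sep)
    // Encode the result as bytes according to an explicit encoding.
    new ByteArrayInputStream(joined.getBytes(UTF_8))
  }
}
```

Note that this holds the data in memory more than once (the array, the joined string, and the encoded bytes), which is why it only fits the "enough memory" case.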


On Fri, Mar 13, 2015 at 9:54 AM, Ayoub  wrote:
> Hello,
>
> I need to convert an RDD[String] to a java.io.InputStream but I didn't find
> an easy way to do it.
> Currently I am saving the RDD as a temporary file and then opening an
> InputStream on the file, but that is not really optimal.
>
> Does anybody know a better way to do that?
>
> Thanks,
> Ayoub.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-InputStream-tp22031.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.