RE: Re: RDD to InputStream
الجنرال

On 2015/03/18 17:20:54 Ayoub wrote:
> In case it would interest other people, here is what I came up with and it
> seems to work fine:
> [...]
Re: RDD to InputStream
In case it would interest other people, here is what I came up with, and it seems to work fine:

    import org.apache.spark.rdd.RDD

    case class RDDAsInputStream(private val rdd: RDD[String]) extends java.io.InputStream {
      // Encode each string as UTF-8 and fetch the bytes lazily on the driver.
      private val bytes = rdd.flatMap(_.getBytes("UTF-8")).toLocalIterator

      def read(): Int = {
        // Mask to 0..255 as InputStream.read() requires; toInt would go negative for bytes >= 0x80.
        if (bytes.hasNext) bytes.next() & 0xff
        else -1
      }

      override def markSupported(): Boolean = false
    }

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Re-RDD-to-InputStream-tp22121.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
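A variation on the class above, offered only as a sketch: it encodes one string at a time on the driver instead of turning the RDD into individual bytes, and it inserts a separator between strings. The name StringIteratorInputStream and the separator parameter are hypothetical, not part of Spark. It takes any Iterator[String], so it is assumed to be fed with rdd.toLocalIterator.

```scala
import java.io.InputStream
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of Spark: wraps any Iterator[String],
// for example the one returned by rdd.toLocalIterator.
class StringIteratorInputStream(strings: Iterator[String], separator: String = "\n")
    extends InputStream {

  private var buf: Array[Byte] = Array.emptyByteArray
  private var pos: Int = 0

  // Encode the next string once the current buffer is used up.
  // Returns false when there is nothing left to read.
  private def fill(): Boolean = {
    while (pos >= buf.length && strings.hasNext) {
      buf = (strings.next() + separator).getBytes(StandardCharsets.UTF_8)
      pos = 0
    }
    pos < buf.length
  }

  override def read(): Int =
    if (fill()) {
      val b = buf(pos) & 0xff // keep the value in 0..255
      pos += 1
      b
    } else -1
}
```

Only one encoded string is held in the buffer at a time, so memory use on the driver is governed by what the underlying iterator fetches rather than by this wrapper.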
Re: RDD to InputStream
OK, then you do not want to collect() the RDD. You can get an iterator, yes.

There is no such thing as making an Iterator into an InputStream. An Iterator is a sequence of arbitrary objects; an InputStream is a channel to a stream of bytes.

I think you can employ similar Guava / Commons utilities to turn an Iterator of Strings into a stream of Readers, join the Readers, and encode the result as bytes in an InputStream.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
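The "join the pieces, then encode" idea can be sketched with the JDK alone. Note that this swaps in java.io.SequenceInputStream for the Guava / Commons utilities mentioned above, and it joins byte streams rather than Readers. The object and method names are hypothetical.

```scala
import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
import java.nio.charset.StandardCharsets
import java.util.{Enumeration => JEnumeration}

object JoinedStreams {
  // Hypothetical helper: presents an Iterator[String] as one InputStream by
  // chaining a small in-memory stream per string. Each string is encoded
  // only when the previous stream has been read to its end.
  def toInputStream(strings: Iterator[String]): InputStream = {
    val streams = new JEnumeration[InputStream] {
      def hasMoreElements(): Boolean = strings.hasNext
      def nextElement(): InputStream =
        new ByteArrayInputStream(strings.next().getBytes(StandardCharsets.UTF_8))
    }
    new SequenceInputStream(streams)
  }
}
```

With an RDD this would presumably be called as JoinedStreams.toInputStream(rdd.toLocalIterator). No separator is added between strings here; append one to each string before encoding if the consumer needs it.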
Re: RDD to InputStream
Thanks Sean,

I forgot to mention that the data is too big to be collected on the driver.

So yes, your proposal would work in theory, but in my case I cannot hold all the data in driver memory, so it wouldn't work.

I guess the crucial point is to do the collect lazily. On that subject, I noticed that we can get a local iterator from an RDD, but that raises two questions:

- Does that involve an immediate collect, just like with collect(), or is it a lazy process?
- How do I go from an iterator to an InputStream?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Re-RDD-to-InputStream-tp22032.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
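On the laziness question: the Spark API documentation for toLocalIterator states that the iterator consumes as much memory as the largest partition in the RDD, which suggests partitions are brought to the driver one at a time rather than all at once. The sketch below is not Spark code. It is a plain-Scala model of that behaviour, with each partition represented by a hypothetical loader function, showing that nothing is loaded until the iterator is advanced.

```scala
object LazyPartitions {
  // Hypothetical model of a partition-at-a-time iterator: each element of
  // `partitions` is a function that loads one partition when called.
  // Iterator.flatMap is lazy, so a loader runs only when the previous
  // partition has been fully consumed.
  def localIterator[T](partitions: Seq[() => Seq[T]]): Iterator[T] =
    partitions.iterator.flatMap(load => load().iterator)
}
```

In this model, reading the first element triggers exactly one loader call, and the second loader runs only after the first partition is exhausted.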
Re: RDD to InputStream
These are quite different creatures. You have a distributed set of Strings, but want a local stream of bytes, which involves three conversions:

- collect data to the driver
- concatenate the strings in some way
- encode the strings as bytes according to an encoding

Your approach is OK, but it might be faster to avoid disk, if you have enough memory:

- collect() to an Array[String] locally
- use Guava utilities to turn a bunch of Strings into a Reader
- use the Apache Commons ReaderInputStream to read it as encoded bytes

I wonder whether that's really what you want to do, though.

On Fri, Mar 13, 2015 at 9:54 AM, Ayoub wrote:
> Hello,
>
> I need to convert an RDD[String] to a java.io.InputStream but I didn't find
> an easy way to do it.
> Currently I am saving the RDD as a temporary file and then opening an
> InputStream on the file, but that is not really optimal.
>
> Does anybody know a better way to do that?
>
> Thanks,
> Ayoub.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-InputStream-tp22031.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
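For the case where everything fits in driver memory, the three conversions listed above can be sketched with the JDK alone. This uses ByteArrayInputStream in place of the Guava and Commons utilities named in the message. The object name, method name and the newline separator are assumptions made for illustration.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets

object CollectedStream {
  // `collected` is assumed to be the result of rdd.collect(), already on the driver.
  def toInputStream(collected: Array[String], separator: String = "\n"): InputStream = {
    // Concatenate the strings, then encode the result as UTF-8 bytes.
    val joined = collected.mkString(separator)
    new ByteArrayInputStream(joined.getBytes(StandardCharsets.UTF_8))
  }
}
```

This holds the collected strings, the joined string and the encoded bytes in memory at the same time, so it only suits small data sets.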