Re: Re: Can I collect Dataset[Row] to driver without converting it to Array[Row]?

2020-04-23 Thread maqy
Hi Jinxin,
 Thanks for your suggestions; I will try foreachPartition later.
 
Best regards,
maqy

From: Tang Jinxin
Sent: April 23, 2020, 7:31
To: maqy
Cc: Andrew Melo; user@spark.apache.org
Subject: Re: Can I collect Dataset[Row] to driver without converting it to Array[Row]?

Hi maqy,
Thanks for your question. After some thought, I have two suggestions. First, avoid
collecting to the driver when it is not necessary; instead, send the data directly
from the executors (using foreachPartition). Second, if you are not already using a
high-performance serializer such as Kryo, it is worth a try. In summary, I recommend
the first approach (design a more efficient path when the data is too large); see
the sketch below.
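
A minimal sketch of that first suggestion, assuming the TensorFlow side listens on a
plain TCP socket; the host, port, and row encoding are placeholders, and df stands
for the Dataset[Row] in question:

import java.io.PrintWriter
import java.net.Socket
import org.apache.spark.sql.Row

// Placeholder endpoint for the TensorFlow receiver.
val tfHost = "tf-machine.example.com"
val tfPort = 9999

// Each executor streams its own partition out, so the driver never
// materializes an Array[Row].
df.foreachPartition { (rows: Iterator[Row]) =>
  val socket = new Socket(tfHost, tfPort)
  val out = new PrintWriter(socket.getOutputStream, true)
  try {
    rows.foreach(row => out.println(row.mkString(",")))  // placeholder encoding
  } finally {
    out.close()
    socket.close()
  }
}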
Best wishes,
littlestar

xiaoxingstack
Email: xiaoxingst...@gmail.com
(Signature customized by NetEase Mail Master)
On April 22, 2020, 23:24, maqy wrote:
 Hi Andrew,
    Thank you for your reply. I am using the Scala API of Spark, and the TensorFlow
machine is not in the Spark cluster. Is that JIRA/PR still applicable in this
situation?
 In addition, the application's current bottleneck is that the amount of data
transferred over the network (via collect()) is too large, and the deserialization
also seems to take some time.
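
A sketch of the Kryo suggestion, assuming it is set when the session is created.
Note that Kryo mainly affects RDD and shuffle serialization, so whether it speeds up
a Dataset collect(), which goes through Spark's own encoders, depends on the code
path and should be measured rather than assumed:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dataset-export")  // placeholder name
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optionally register classes to shrink the serialized form further.
  .config("spark.kryo.registrationRequired", "false")
  .getOrCreate()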
 
Best wishes,
maqy

From: Andrew Melo
Sent: April 22, 2020, 21:02
To: maqy
Cc: Michael Artz; user@spark.apache.org
Subject: Re: Can I collect Dataset[Row] to driver without converting it to Array[Row]?

On Wed, Apr 22, 2020 at 3:24 AM maqy <454618...@qq.com> wrote:
>
> I will traverse this Dataset, convert it to Arrow, and send it to TensorFlow
> through a socket.

(I presume you're using the Python TensorFlow API; if you're not, please ignore.)

There is a JIRA/PR ([1], [2]) that proposes adding native support for
Arrow serialization to Python.

Under the hood, Spark is already serializing into Arrow format to
transmit to Python; it is just additionally doing an unconditional
conversion to pandas once the data reaches the Python runner -- which is
good if you're using pandas, not so great if pandas isn't what you
operate on. The JIRA above would let you receive the Arrow buffers
(that already exist) directly.
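
Until that lands, the manual route you describe can be sketched on the Scala side
with the Arrow Java API. This sketch assumes a single Int32 column and a TCP
listener on the TensorFlow machine; the endpoint, schema, and batching are
placeholders, and a real job would mirror the Dataset's actual schema:

import java.net.Socket
import scala.collection.JavaConverters._
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{IntVector, VectorSchemaRoot}
import org.apache.arrow.vector.ipc.ArrowStreamWriter
import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema}
import org.apache.spark.sql.Row

df.foreachPartition { (rows: Iterator[Row]) =>
  // One Int32 column named "value"; mirror the real schema in practice.
  val field = new Field("value", FieldType.nullable(new ArrowType.Int(32, true)), null)
  val schema = new Schema(List(field).asJava)
  val allocator = new RootAllocator(Long.MaxValue)
  val root = VectorSchemaRoot.create(schema, allocator)
  val socket = new Socket("tf-machine.example.com", 9999)  // placeholder endpoint
  val writer = new ArrowStreamWriter(root, null, socket.getOutputStream)
  writer.start()

  // For brevity the whole partition becomes one Arrow batch; a real job
  // should write bounded batches to cap executor memory.
  val values = rows.map(_.getInt(0)).toArray
  val vec = root.getVector("value").asInstanceOf[IntVector]
  vec.allocateNew(values.length)
  values.zipWithIndex.foreach { case (v, i) => vec.setSafe(i, v) }
  vec.setValueCount(values.length)
  root.setRowCount(values.length)
  writer.writeBatch()

  writer.end()
  writer.close()
  root.close()
  allocator.close()
  socket.close()
}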

Cheers,
Andrew
[1] https://issues.apache.org/jira/browse/SPARK-30153
[2] https://github.com/apache/spark/pull/26783

>
> I tried using toLocalIterator() to traverse the dataset instead of collecting
> to the driver, but toLocalIterator() creates a lot of jobs and adds a lot of
> time.
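
As an aside on the toLocalIterator() point: it launches one job per partition, and
unless the dataset is persisted, each of those jobs recomputes the lineage, which is
usually where the extra time goes. A sketch of the common mitigation:

import org.apache.spark.storage.StorageLevel

// Cache first so each of toLocalIterator's per-partition jobs reads
// cached blocks instead of recomputing the whole lineage.
df.persist(StorageLevel.MEMORY_AND_DISK)
val it = df.toLocalIterator()
while (it.hasNext) {
  val row = it.next()
  // ... hand the row to the socket/Arrow writer ...
}
df.unpersist()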
>
>
>
> Best regards,
>
> maqy
>
>
>
> From: Michael Artz
> Sent: April 22, 2020, 16:09
> To: maqy
> Cc: user@spark.apache.org
> Subject: Re: Can I collect Dataset[Row] to driver without converting it to
> Array[Row]?
>
>
>
> What would you do with it once you get it into the driver as a Dataset[Row]?
>
> Sent from my iPhone
>
>
>
> On Apr 22, 2020, at 3:06 AM, maqy <454618...@qq.com> wrote:
>
> 
>
> When the data is stored as a Dataset[Row], the memory usage is very small.
>
> When I use collect() to bring the data to the driver, each row of the dataset
> is converted to a Row and stored in an array, which brings a great deal of
> memory overhead.
>
> So, can I collect a Dataset[Row] to the driver while keeping its data format?
>
>
>
> Best regards,
>
> maqy
>
>
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



