Re: question about pyarrow.Table to pyspark.DataFrame conversion

2020-10-24 Thread shouheng
Hi Bryan,

I came across SPARK-29040
(https://issues.apache.org/jira/browse/SPARK-29040) and I'm very excited
that others are looking for such a feature as well. It would be
tremendously useful if we could implement it.

Currently, my workaround is to serialize the `pyarrow.Table` to a Parquet
file and then let Spark read that file. I avoided going through
`pd.DataFrame`, same as what Artem mentioned above.
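
A minimal sketch of that workaround, in case it helps others (the local
path and the SparkSession setup are illustrative, not from my actual job):

    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A nested column (List[int]) that pyarrow represents natively.
    table = pa.table(
        {"ids": pa.array([[1, 2], [3], []], type=pa.list_(pa.int64()))}
    )

    # Round-trip through Parquet: pyarrow writes, Spark reads.
    pq.write_table(table, "/tmp/arrow_roundtrip.parquet")
    df = spark.read.parquet("/tmp/arrow_roundtrip.parquet")
    df.printSchema()  # ids: array<bigint>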

Do you think this ticket has a chance to get prioritized?

Thank you very much.

Best,
Shouheng






Re: question about pyarrow.Table to pyspark.DataFrame conversion

2019-09-10 Thread Bryan Cutler
Hi Artem,

I don't believe this is currently possible, but it could be a great
addition to PySpark, since it would offer a convenient and efficient way
to parallelize nested column data. I created the JIRA
https://issues.apache.org/jira/browse/SPARK-29040 for this.
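
If implemented, usage might look something like this sketch (a
hypothetical API, not something PySpark supports today):

    import pyarrow as pa
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    table = pa.table(
        {"ids": pa.array([[1, 2], [3]], type=pa.list_(pa.int64()))}
    )

    # Hypothetical: createDataFrame accepting a pyarrow.Table directly,
    # preserving nested Arrow types without a pandas detour.
    df = spark.createDataFrame(table)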

On Tue, Aug 27, 2019 at 7:55 PM Artem Kozhevnikov <
kozhevnikov.ar...@gmail.com> wrote:

> I wonder if there's a recommended method to convert an in-memory
> pyarrow.Table (or pyarrow.RecordBatch) to a pyspark.DataFrame without
> going through pandas?
> My motivation is converting nested data (like List[int]) that has an
> efficient representation in pyarrow but not in pandas (I don't want to
> go through a Python list of ints ...).
>
> Thanks in advance!
> Artem


question about pyarrow.Table to pyspark.DataFrame conversion

2019-08-27 Thread Artem Kozhevnikov
I wonder if there's a recommended method to convert an in-memory
pyarrow.Table (or pyarrow.RecordBatch) to a pyspark.DataFrame without
going through pandas?
My motivation is converting nested data (like List[int]) that has an
efficient representation in pyarrow but not in pandas (I don't want to
go through a Python list of ints ...).
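
A small self-contained example of that limitation (a sketch, only pyarrow
required):

    import pyarrow as pa

    table = pa.table(
        {"ids": pa.array([[1, 2], [3]], type=pa.list_(pa.int64()))}
    )
    print(table.schema)              # ids: list<item: int64>

    # Converting to pandas degrades the column to object dtype,
    # one boxed Python object per row: the representation I want to avoid.
    print(table.to_pandas().dtypes)  # ids: object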

Thanks in advance!
Artem