You have a pyspark dataframe and you want to convert it to pandas?

Convert it first to the pandas API on Spark:


pf01 = f01.to_pandas_on_spark()


Then convert that to pandas:


pdf01 = pf01.to_pandas()

Or?
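
To make the two steps concrete, a minimal self-contained sketch (assuming Spark 3.2+, where to_pandas_on_spark() is available; f01 is just a small example frame here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# a small PySpark dataframe to convert
f01 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# PySpark -> pandas API on Spark (still distributed)
pf01 = f01.to_pandas_on_spark()

# pandas API on Spark -> plain pandas (collects everything to the driver)
pdf01 = pf01.to_pandas()

print(type(pdf01))  # <class 'pandas.core.frame.DataFrame'>

Plain PySpark also has df.toPandas(), which collects straight to pandas in one step; the pandas-on-Spark route is useful when you want pandas syntax while staying distributed.

And on the question quoted below, building the same rows directly in pandas without Spark, a sketch that reuses the UsedFunctions class from the quoted mail (numRows and the column names are my assumptions; the quoted snippet never defines them):

import pandas as pd

start = 1
end = start + 9
numRows = end - start  # assumed; not defined in the quoted snippet
usedFunctions = UsedFunctions()  # the class from the quoted mail

rows = [(x,
         usedFunctions.clustered(x, numRows),
         usedFunctions.scattered(x, numRows),
         usedFunctions.randomised(x, numRows),
         usedFunctions.randomString(50),
         usedFunctions.padString(x, " ", 50),
         usedFunctions.padSingleChar("x", 4000)) for x in range(start, end)]

df = pd.DataFrame(rows, columns=["ID", "clustered", "scattered", "randomised",
                                 "random_string", "padded_string", "padded_char"])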

On Tue, 15 Mar 2022 at 22:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Thanks everyone.
>
> I want to do the following in pandas and numpy without using Spark.
>
> This is what I do in spark to generate some random data using class
> UsedFunctions (not important).
>
> import math
> import random
> import string
>
> class UsedFunctions:
>   def randomString(self, length):
>     # random string of ASCII letters
>     letters = string.ascii_letters
>     result_str = ''.join(random.choice(letters) for i in range(length))
>     return result_str
>   def clustered(self, x, numRows):
>     # monotonically increasing value in [0, 1)
>     return math.floor(x - 1) / numRows
>   def scattered(self, x, numRows):
>     # value scattered over 0 .. numRows - 1
>     return abs((x - 1) % numRows) * 1.0
>   def randomised(self, seed, numRows):
>     # seeded pseudo-random value in 0 .. numRows - 1
>     random.seed(seed)
>     return abs(random.randint(0, numRows) % numRows) * 1.0
>   def padString(self, x, chars, length):
>     # left-pad str(x) with random chars up to the given length
>     n = int(math.log10(x) + 1)
>     result_str = ''.join(random.choice(chars) for i in range(length - n)) + str(x)
>     return result_str
>   def padSingleChar(self, chars, length):
>     # repeat a single character length times
>     result_str = ''.join(chars for i in range(length))
>     return result_str
>   def println(self, lst):
>     # print the first element of each item
>     for ll in lst:
>       print(ll[0])
>
>
> usedFunctions = UsedFunctions()
>
> start = 1
> end = start + 9
> numRows = end - start  # assumed; numRows is not defined in the original snippet
> print("starting at ID =", start, ", ending on =", end)
> Range = range(start, end)
> rdd = sc.parallelize(Range). \
>          map(lambda x: (x, usedFunctions.clustered(x, numRows), \
>                            usedFunctions.scattered(x, numRows), \
>                            usedFunctions.randomised(x, numRows), \
>                            usedFunctions.randomString(50), \
>                            usedFunctions.padString(x, " ", 50), \
>                            usedFunctions.padSingleChar("x", 4000)))
> df = rdd.toDF()
>
> OK, how can I create a pandas DataFrame df without using Spark?
>
> Thanks
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 15 Mar 2022 at 21:19, Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> Hi Andrew. Mich asked, and I answered with transpose():
>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
>>
>> And now you are asking, in the same thread, about the pandas API on Spark
>> and transform().
>>
>> Apache Spark has a pandas API on Spark.
>>
>> This means that Spark has API calls for pandas functions, and when you use
>> the pandas API on Spark, it is still Spark you are using.
>>
>> Add this line to your imports:
>>
>> from pyspark import pandas as ps
>>
>>
>> Now you can pass your dataframe back and forth between PySpark and the
>> pandas API on Spark by using
>>
>> pf01 = f01.to_pandas_on_spark()
>>
>>
>> f01 = pf01.to_spark()
>>
>>
>> Note that I have changed pd to ps here.
>>
>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>>
>> df.transform(lambda x: x + 1)
>>
>> You will now see that all numbers are incremented by 1.
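>>
>> For this small frame the result is:
>>
>>    A  B
>> 0  1  2
>> 1  2  3
>> 2  3  4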
>>
>> You can find more information about the pandas API on Spark transform() at
>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
>> or in your notebook:
>> df.transform?
>>
>> Signature:
>> df.transform(
>>     func: Callable[..., ForwardRef('Series')],
>>     axis: Union[int, str] = 0,
>>     *args: Any,
>>     **kwargs: Any,
>> ) -> 'DataFrame'
>> Docstring:
>> Call ``func`` on self producing a Series with transformed values
>> and that has the same length as its input.
>>
>> See also `Transform and apply a function
>> <https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html>`_.
>>
>> .. note:: this API executes the function once to infer the type which is
>>      potentially expensive, for instance, when the dataset is created after
>>      aggregations or sorting.
>>
>>      To avoid this, specify return type in ``func``, for instance, as below:
>>
>>      >>> def square(x) -> ps.Series[np.int32]:
>>      ...     return x ** 2
>>
>>      pandas-on-Spark uses return type hint and does not try to infer the type.
>>
>> .. note:: the series within ``func`` is actually multiple pandas series as the
>>     segments of the whole pandas-on-Spark series; therefore, the length of each
>>     series is not guaranteed. As an example, an aggregation against each series
>>     does work as a global aggregation but an aggregation of each segment. See
>>     below:
>>
>>     >>> def func(x) -> ps.Series[np.int32]:
>>     ...     return x + sum(x)
>>
>> Parameters
>> ----------
>> func : function
>>     Function to use for transforming the data. It must work when pandas
>>     Series is passed.
>> axis : int, default 0 or 'index'
>>     Can only be set to 0 at the moment.
>> *args
>>     Positional arguments to pass to func.
>> **kwargs
>>     Keyword arguments to pass to func.
>>
>> Returns
>> -------
>> DataFrame
>>     A DataFrame that must have the same length as self.
>>
>> Raises
>> ------
>> Exception : If the returned DataFrame has a different length than self.
>>
>> See Also
>> --------
>> DataFrame.aggregate : Only perform aggregating type operations.
>> DataFrame.apply : Invoke function on DataFrame.
>> Series.transform : The equivalent function for Series.
>>
>> Examples
>> --------
>> >>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)}, columns=['A', 'B'])
>> >>> df
>>    A  B
>> 0  0  1
>> 1  1  2
>> 2  2  3
>>
>> >>> def square(x) -> ps.Series[np.int32]:
>> ...     return x ** 2
>> >>> df.transform(square)
>>    A  B
>> 0  0  1
>> 1  1  4
>> 2  4  9
>>
>> You can omit the type hint and let pandas-on-Spark infer its type.
>>
>> >>> df.transform(lambda x: x ** 2)
>>    A  B
>> 0  0  1
>> 1  1  4
>> 2  4  9
>>
>> For multi-index columns:
>>
>> >>> df.columns = [('X', 'A'), ('X', 'B')]
>> >>> df.transform(square)  # doctest: +NORMALIZE_WHITESPACE
>>    X
>>    A  B
>> 0  0  1
>> 1  1  4
>> 2  4  9
>>
>> >>> (df * -1).transform(abs)  # doctest: +NORMALIZE_WHITESPACE
>>    X
>>    A  B
>> 0  0  1
>> 1  1  2
>> 2  2  3
>>
>> You can also specify extra arguments.
>>
>> >>> def calculation(x, y, z) -> ps.Series[int]:
>> ...     return x ** y + z
>> >>> df.transform(calculation, y=10, z=20)  # doctest: +NORMALIZE_WHITESPACE
>>       X
>>       A      B
>> 0    20     21
>> 1    21   1044
>> 2  1044  59069
>>
>> File:      /opt/spark/python/pyspark/pandas/frame.py
>> Type:      method
>>
>>
>>
>>
>>
>> On Tue, 15 Mar 2022 at 19:33, Andrew Davidson <aedav...@ucsc.edu> wrote:
>>
>>> Hi Bjorn
>>>
>>>
>>>
>>> I have been looking for spark transform for a while. Can you send me a
>>> link to the pyspark function?
>>>
>>>
>>>
>>> I assume pandas transform is not really an option. I think it will try
>>> to pull the entire dataframe into the driver's memory.
>>>
>>>
>>>
>>> Kind regards
>>>
>>>
>>>
>>> Andy
>>>
>>>
>>>
>>> p.s. My real problem is that Spark does not allow you to bind columns.
>>> You can use union() to bind rows. I could get the equivalent of cbind()
>>> using union().transform().
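>>>
>>> (As a hedged aside, a minimal sketch of a cbind-style column bind using the
>>> pandas API on Spark: ps.concat along axis=1 joins frames on the index. This
>>> is an alternative to the union().transform() idea, assuming the pandas API
>>> on Spark fits the use case.)
>>>
>>> import pyspark.pandas as ps
>>>
>>> left = ps.DataFrame({"A": [1, 2]})
>>> right = ps.DataFrame({"B": [3, 4]})
>>> both = ps.concat([left, right], axis=1)  # columns A and B side by side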
>>>
>>>
>>>
>>> *From: *Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>> *Date: *Tuesday, March 15, 2022 at 10:37 AM
>>> *To: *Mich Talebzadeh <mich.talebza...@gmail.com>
>>> *Cc: *"user @spark" <user@spark.apache.org>
>>> *Subject: *Re: pivoting panda dataframe
>>>
>>>
>>>
>>>
>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html
>>> We have that transpose in the pandas API on Spark too.
>>>
>>>
>>>
>>> You also have stack() and multi-level reshaping:
>>> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
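>>>
>>> A minimal sketch of the transpose route (fine for small frames; transpose
>>> is expensive on large data):
>>>
>>> import pyspark.pandas as ps
>>>
>>> df = ps.DataFrame({"col1": [1, 2], "col2": [3, 4]})
>>> print(df.T)  # rows become columns: the index turns into the column headings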
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 15 Mar 2022 at 17:50, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> Is it possible to pivot a pandas dataframe by making a row the column
>>> heading?
>>>
>>>
>>>
>>> thanks
>>>
>>>
>>>
>>>
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>>
>>> +47 480 94 297
>>>
>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>
