You have a PySpark dataframe and you want to convert it to pandas? Convert it first to the pandas API on Spark:

pf01 = f01.to_pandas_on_spark()

Then convert it to pandas:

pdf01 = pf01.to_pandas()

Or did you mean something else?
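A minimal runnable sketch of that round trip, assuming a small example dataframe named f01 (the name and its contents are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A small Spark DataFrame standing in for f01.
f01 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])

# Spark DataFrame -> pandas-on-Spark DataFrame (still distributed).
pf01 = f01.to_pandas_on_spark()

# pandas-on-Spark DataFrame -> plain pandas DataFrame (collected on the driver).
pdf01 = pf01.to_pandas()
print(type(pdf01))  # <class 'pandas.core.frame.DataFrame'>

# And back again if needed: pandas-on-Spark -> Spark DataFrame.
f01_again = pf01.to_spark()

Keep in mind that to_pandas() collects the whole dataset onto the driver, so it only makes sense when the data fits in driver memory.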
On Tue, 15 Mar 2022 at 22:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Thanks everyone.
>
> I want to do the following in pandas and numpy without using Spark.
>
> This is what I do in Spark to generate some random data using class
> UsedFunctions (not important).
>
> import math
> import random
> import string
>
> class UsedFunctions:
>     def randomString(self, length):
>         letters = string.ascii_letters
>         result_str = ''.join(random.choice(letters) for i in range(length))
>         return result_str
>     def clustered(self, x, numRows):
>         return math.floor(x - 1) / numRows
>     def scattered(self, x, numRows):
>         return abs((x - 1 % numRows)) * 1.0
>     def randomised(self, seed, numRows):
>         random.seed(seed)
>         return abs(random.randint(0, numRows) % numRows) * 1.0
>     def padString(self, x, chars, length):
>         n = int(math.log10(x) + 1)
>         result_str = ''.join(random.choice(chars) for i in range(length - n)) + str(x)
>         return result_str
>     def padSingleChar(self, chars, length):
>         result_str = ''.join(chars for i in range(length))
>         return result_str
>     def println(self, lst):
>         for ll in lst:
>             print(ll[0])
>
> usedFunctions = UsedFunctions()
>
> start = 1
> end = start + 9
> print("starting at ID = ", start, ", ending on = ", end)
> Range = range(start, end)
> rdd = sc.parallelize(Range). \
>     map(lambda x: (x, usedFunctions.clustered(x, numRows), \
>         usedFunctions.scattered(x, numRows), \
>         usedFunctions.randomised(x, numRows), \
>         usedFunctions.randomString(50), \
>         usedFunctions.padString(x, " ", 50), \
>         usedFunctions.padSingleChar("x", 4000)))
> df = rdd.toDF()
>
> OK, how can I create a pandas DataFrame df without using Spark?
>
> Thanks
>
>
> On Tue, 15 Mar 2022 at 21:19, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
>> Hi Andrew. Mich asked, and I answered with transpose()
>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html.
>>
>> And now you are asking in the same thread about the pandas API on Spark and transform().
>>
>> Apache Spark has a pandas API on Spark. This means that Spark provides the
>> pandas functions as an API, so when you use the pandas API on Spark, it is
>> still Spark you are using.
>>
>> Add this line to your imports:
>>
>> from pyspark import pandas as ps
>>
>> Now you can pass your dataframe back and forth between Spark and the
>> pandas API on Spark with:
>>
>> pf01 = f01.to_pandas_on_spark()
>>
>> f01 = pf01.to_spark()
>>
>> Note that I have changed pd to ps here.
>>
>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>>
>> df.transform(lambda x: x + 1)
>>
>> You will now see that all numbers are incremented by 1.
>>
>> You can find more information about the pandas API on Spark transform at
>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
>> or in your notebook with df.transform?
>>
>> Signature:
>> df.transform(
>>     func: Callable[..., ForwardRef('Series')],
>>     axis: Union[int, str] = 0,
>>     *args: Any,
>>     **kwargs: Any,
>> ) -> 'DataFrame'
>> Docstring:
>> Call ``func`` on self producing a Series with transformed values
>> and that has the same length as its input.
>>
>> See also `Transform and apply a function
>> <https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html>`_.
>>
>> .. note:: this API executes the function once to infer the type which is
>>     potentially expensive, for instance, when the dataset is created after
>>     aggregations or sorting.
>>
>>     To avoid this, specify the return type in ``func``, for instance, as below:
>>
>>     >>> def square(x) -> ps.Series[np.int32]:
>>     ...     return x ** 2
>>
>>     pandas-on-Spark uses the return type hint and does not try to infer the type.
>>
>> .. note:: the series within ``func`` is actually multiple pandas series as the
>>     segments of the whole pandas-on-Spark series; therefore, the length of each
>>     series is not guaranteed. As an example, an aggregation against each series
>>     does work as a global aggregation but an aggregation of each segment. See below:
>>
>>     >>> def func(x) -> ps.Series[np.int32]:
>>     ...     return x + sum(x)
>>
>> Parameters
>> ----------
>> func : function
>>     Function to use for transforming the data. It must work when a pandas
>>     Series is passed.
>> axis : int, default 0 or 'index'
>>     Can only be set to 0 at the moment.
>> *args
>>     Positional arguments to pass to func.
>> **kwargs
>>     Keyword arguments to pass to func.
>>
>> Returns
>> -------
>> DataFrame
>>     A DataFrame that must have the same length as self.
>>
>> Raises
>> ------
>> Exception : If the returned DataFrame has a different length than self.
>>
>> See Also
>> --------
>> DataFrame.aggregate : Only perform aggregating type operations.
>> DataFrame.apply : Invoke function on DataFrame.
>> Series.transform : The equivalent function for Series.
>>
>> Examples
>> --------
>> >>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)}, columns=['A', 'B'])
>> >>> df
>>    A  B
>> 0  0  1
>> 1  1  2
>> 2  2  3
>>
>> >>> def square(x) -> ps.Series[np.int32]:
>> ...     return x ** 2
>> >>> df.transform(square)
>>    A  B
>> 0  0  1
>> 1  1  4
>> 2  4  9
>>
>> You can omit the type hint and let pandas-on-Spark infer its type.
>>
>> >>> df.transform(lambda x: x ** 2)
>>    A  B
>> 0  0  1
>> 1  1  4
>> 2  4  9
>>
>> For multi-index columns:
>>
>> >>> df.columns = [('X', 'A'), ('X', 'B')]
>> >>> df.transform(square)  # doctest: +NORMALIZE_WHITESPACE
>>    X
>>    A  B
>> 0  0  1
>> 1  1  4
>> 2  4  9
>>
>> >>> (df * -1).transform(abs)  # doctest: +NORMALIZE_WHITESPACE
>>    X
>>    A  B
>> 0  0  1
>> 1  1  2
>> 2  2  3
>>
>> You can also specify extra arguments.
>>
>> >>> def calculation(x, y, z) -> ps.Series[int]:
>> ...     return x ** y + z
>> >>> df.transform(calculation, y=10, z=20)  # doctest: +NORMALIZE_WHITESPACE
>>       X
>>       A      B
>> 0    20     21
>> 1    21   1044
>> 2  1044  59069
>>
>> File: /opt/spark/python/pyspark/pandas/frame.py
>> Type: method
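>> Pulling the examples above together into one small runnable sketch, including
>> the return type hint that the note recommends so pandas-on-Spark does not have
>> to execute the function once just to infer the result type (the column names
>> are only for illustration):
>>
>> import numpy as np
>> from pyspark import pandas as ps
>>
>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>>
>> # With an explicit return type hint, pandas-on-Spark skips the extra
>> # inference pass mentioned in the docstring above.
>> def square(x) -> ps.Series[np.int32]:
>>     return x ** 2
>>
>> print(df.transform(square))
>>
>> # Without the hint the result is the same, but the type is inferred by
>> # running the function once, which can be expensive on large inputs.
>> print(df.transform(lambda x: x ** 2))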
>>
>> On Tue, 15 Mar 2022 at 19:33, Andrew Davidson <aedav...@ucsc.edu> wrote:
>>
>>> Hi Bjørn,
>>>
>>> I have been looking for a Spark transform for a while. Can you send me a
>>> link to the PySpark function?
>>>
>>> I assume pandas transform is not really an option. I think it will try to
>>> pull the entire dataframe into the driver's memory.
>>>
>>> Kind regards
>>>
>>> Andy
>>>
>>> p.s. My real problem is that Spark does not allow you to bind columns.
>>> You can use union() to bind rows. I could get the equivalent of cbind()
>>> using union().transform().
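>> One way to get a cbind-like result is to go through the pandas API on Spark,
>> which can concatenate along the column axis by aligning frames on their index.
>> A minimal sketch, assuming two single-column dataframes whose rows should be
>> matched positionally (all names and data here are made up for illustration):
>>
>> from pyspark.sql import SparkSession
>> from pyspark import pandas as ps
>>
>> spark = SparkSession.builder.getOrCreate()
>>
>> # Two hypothetical Spark DataFrames whose rows belong side by side.
>> left = spark.createDataFrame([(1,), (2,), (3,)], ["a"])
>> right = spark.createDataFrame([(10,), (20,), (30,)], ["b"])
>>
>> # Move both to pandas-on-Spark and concatenate column-wise (cbind-like).
>> pleft = left.to_pandas_on_spark()
>> pright = right.to_pandas_on_spark()
>> bound = ps.concat([pleft, pright], axis=1)
>>
>> bound.to_spark().show()  # back to a plain Spark DataFrame if needed
>>
>> Note that this relies on the default index pandas-on-Spark attaches to each
>> frame; if the two frames come from different pipelines, it is safer to join
>> on an explicit key column instead.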
>>>
>>> From: Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>> Date: Tuesday, March 15, 2022 at 10:37 AM
>>> To: Mich Talebzadeh <mich.talebza...@gmail.com>
>>> Cc: "user @spark" <user@spark.apache.org>
>>> Subject: Re: pivoting panda dataframe
>>>
>>> We have that transpose in the pandas API on Spark too:
>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html
>>>
>>> You also have stack() and multilevel reshaping:
>>> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
>>>
>>> On Tue, 15 Mar 2022 at 17:50, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Is it possible to pivot a pandas dataframe by making the row the column heading?
>>>
>>> Thanks
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>> +47 480 94 297
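For the original pivot question, a minimal sketch of that transpose with the pandas API on Spark (the column names and values are made up for illustration):

from pyspark import pandas as ps

df = ps.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# transpose() swaps rows and columns, so the index labels become the
# column headings.
print(df.transpose())

If the values of a particular column should become the headings, set that column as the index first, for example df.set_index('some_col').transpose(), where 'some_col' is a placeholder for your own column name. Note that pandas-on-Spark limits transpose() to frames with at most compute.max_rows rows (1000 by default), because it is an expensive operation on distributed data.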