The standalone koalas project should have the same functionality for older Spark versions: https://koalas.readthedocs.io/en/latest/
You should be moving to Spark 3 though; 2.x is EOL.

On Wed, Feb 23, 2022 at 9:06 AM Sid <flinkbyhe...@gmail.com> wrote:

> Cool. Here, the problem is that I have to run the Spark jobs on Glue ETL,
> which supports Spark 2.4.3, and I don't think this distributed pandas
> support was added in that version. AFAIK, it was added in version 3.2.
>
> So how can I do it in Spark 2.4.3? Correct me if I'm wrong.
>
> On Wed, Feb 23, 2022 at 8:28 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
>> You will. The pandas API on Spark, imported with `from pyspark import
>> pandas as ps`, is not pandas but an API that uses PySpark underneath.
>>
>> On Wed, Feb 23, 2022 at 3:54 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Bjørn,
>>>
>>> Thanks for your reply. This doesn't help while loading huge datasets.
>>> I won't be able to get Spark's functionality of loading the file in a
>>> distributed manner.
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Wed, Feb 23, 2022 at 7:38 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>
>>>> from pyspark import pandas as ps
>>>>
>>>> ps.read_excel?
>>>> "Support both `xls` and `xlsx` file extensions from a local filesystem
>>>> or URL"
>>>>
>>>> pdf = ps.read_excel("file")
>>>>
>>>> df = pdf.to_spark()
>>>>
>>>> On Wed, Feb 23, 2022 at 2:57 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> Hi Gourav,
>>>>>
>>>>> Thanks for your time.
>>>>>
>>>>> I am worried about the distribution of data in the case of a huge
>>>>> dataset file. Is Koalas still a better option to go ahead with? If
>>>>> yes, how can I use it with Glue ETL jobs? Do I have to pass some kind
>>>>> of external jars for it?
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>> On Wed, Feb 23, 2022 at 7:22 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> this looks like a very specific and exact problem in its scope.
>>>>>>
>>>>>> Do you think that you can load the data into a pandas dataframe and
>>>>>> load it back to Spark using a pandas UDF?
>>>>>>
>>>>>> Koalas is now natively integrated with Spark; try to see if you can
>>>>>> use those features.
>>>>>>
>>>>>> Regards,
>>>>>> Gourav
>>>>>>
>>>>>> On Wed, Feb 23, 2022 at 1:31 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>
>>>>>>> I have an Excel file which unfortunately cannot be converted to CSV
>>>>>>> format, and I am trying to load it using the pyspark shell.
>>>>>>>
>>>>>>> I tried invoking the below pyspark session with the jars provided:
>>>>>>>
>>>>>>> pyspark --jars /home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,/home/siddhesh/Downloads/commons-collections4-4.4.jar,/home/siddhesh/Downloads/poi-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar,/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar,/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar
>>>>>>>
>>>>>>> and below is the code to read the Excel file:
>>>>>>>
>>>>>>> df = spark.read.format("excel") \
>>>>>>>     .option("dataAddress", "'Sheet1'!") \
>>>>>>>     .option("header", "true") \
>>>>>>>     .option("inferSchema", "true") \
>>>>>>>     .load("/home/.../Documents/test_excel.xlsx")
>>>>>>>
>>>>>>> It is giving me the below error message:
>>>>>>>
>>>>>>> java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>>>>>>>
>>>>>>> I tried several jars for this error but no luck. Also, what would be
>>>>>>> the efficient way to load it?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sid
>>>>
>>>> --
>>>> Bjørn Jørgensen
>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>> Norge
>>>>
>>>> +47 480 94 297
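The driver-side workaround suggested in the thread for Spark 2.4.x (where the distributed pandas API does not exist) can be sketched as follows. This is a minimal illustration, not the thread's exact code: the DataFrame contents stand in for an actual `pd.read_excel` call, and the Spark hand-off lines are shown as comments so the sketch runs without a cluster.

```python
import pandas as pd

# On Spark 2.4.x the pandas API on Spark (pyspark.pandas, Spark >= 3.2)
# is unavailable, so one common pattern is to read the whole sheet on the
# driver with plain pandas and then convert it to a Spark DataFrame.
# The frame below is an invented stand-in for:
#     pdf = pd.read_excel("test_excel.xlsx", sheet_name="Sheet1")
# (reading .xlsx requires the openpyxl engine to be installed).
pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Hand-off to Spark -- note the entire sheet must fit in driver memory:
#     spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # faster conversion
#     df = spark.createDataFrame(pdf)

print(pdf.shape[0])
```

The conversion itself is not distributed; only the resulting Spark DataFrame is. For sheets too large for driver memory, this pattern does not apply.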
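Since the thread mentions both the standalone koalas project (for Spark 2.x) and `pyspark.pandas` (Spark 3.2+), a version-tolerant import guard can select whichever is available. This is a sketch under the assumption that neither package is necessarily installed; both module names are the real published ones, but the fallbacks below are defensive rather than prescriptive.

```python
# Pick the pandas-on-Spark entry point that matches the cluster:
# - Spark >= 3.2 ships it as pyspark.pandas
# - older clusters (e.g. Glue with Spark 2.4.3) can use the standalone
#   package from PyPI ("pip install koalas"), imported as databricks.koalas
try:
    from pyspark import pandas as ps          # Spark >= 3.2
except ImportError:
    try:
        import databricks.koalas as ps        # standalone koalas for Spark 2.x
    except ImportError:
        ps = None                             # neither is installed here

# Usage is the same through either entry point (names from the thread):
#     pdf = ps.read_excel("file")
#     df = pdf.to_spark()

print(ps is None or hasattr(ps, "read_excel"))
```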