from pyspark import pandas as ps
ps.read_excel?
"Support both `xls` and `xlsx` file extensions from a local filesystem or URL"

pdf = ps.read_excel("file")
df = pdf.to_spark()

On Wed, 23 Feb 2022 at 14:57, Sid <flinkbyhe...@gmail.com> wrote:

> Hi Gourav,
>
> Thanks for your time.
>
> I am worried about the distribution of data in case of a huge dataset
> file. Is Koalas still a better option to go ahead with? If yes, how can I
> use it with Glue ETL jobs? Do I have to pass some kind of external jars for
> it?
>
> Thanks,
> Sid
>
> On Wed, Feb 23, 2022 at 7:22 PM Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi,
>>
>> this looks like a very specific and exact problem in its scope.
>>
>> Do you think that you can load the data into panda dataframe and load it
>> back to SPARK using PANDAS UDF?
>>
>> Koalas is now natively integrated with SPARK, try to see if you can use
>> those features.
>>
>>
>> Regards,
>> Gourav
>>
>> On Wed, Feb 23, 2022 at 1:31 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> I have an excel file which unfortunately cannot be converted to CSV
>>> format and I am trying to load it using pyspark shell.
>>>
>>> I tried invoking the below pyspark session with the jars provided.
>>>
>>> pyspark --jars
>>> /home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,/home/siddhesh/Downloads/commons-collections4-4.4.jar,/home/siddhesh/Downloads/poi-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar,/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar,/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar
>>>
>>> and below is the code to read the excel file:
>>>
>>> df = spark.read.format("excel") \
>>> .option("dataAddress", "'Sheet1'!") \
>>> .option("header", "true") \
>>> .option("inferSchema", "true") \
>>> .load("/home/.../Documents/test_excel.xlsx")
>>>
>>> It is giving me the below error message:
>>>
>>> java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>>>
>>> I tried several Jars for this error but no luck. Also, what would be the
>>> efficient way to load it?
>>>
>>> Thanks,
>>> Sid
>>>
>>

--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
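
A slightly fuller sketch of the `ps.read_excel` route suggested at the top of this reply, touching on the question about data distribution. The function name `excel_to_spark` and its parameters are hypothetical, made up for illustration. It assumes PySpark 3.2 or later (where the pandas API on Spark ships with PySpark, so no extra jars should be needed) and an Excel engine such as `openpyxl` installed on the Python side for `.xlsx` files. As far as I know, a single workbook is still parsed by pandas rather than split across executors, so the optional `repartition` afterwards is only there to spread the rows out for later stages; treat that as an assumption to verify on your own data.

```python
def excel_to_spark(path, sheet_name=0, num_partitions=None):
    """Read one Excel sheet with the pandas API on Spark and return a Spark DataFrame.

    Hypothetical helper, not part of PySpark.
    """
    # Imported lazily so that merely defining the helper does not require PySpark.
    import pyspark.pandas as ps  # available in PySpark >= 3.2

    # ps.read_excel mirrors pandas.read_excel; the first row is used as the header by default.
    psdf = ps.read_excel(path, sheet_name=sheet_name)

    # Convert the pandas-on-Spark DataFrame to a regular Spark DataFrame.
    sdf = psdf.to_spark()

    # Optionally redistribute the rows across the cluster for downstream work.
    if num_partitions is not None:
        sdf = sdf.repartition(num_partitions)
    return sdf
```

With the file from the original question this would be called as `excel_to_spark("/home/.../Documents/test_excel.xlsx", sheet_name="Sheet1", num_partitions=8)`, where the partition count is an arbitrary example value.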