You will. The pandas API on Spark, which you import with `from pyspark import pandas as ps`, is not pandas; it is an API that uses PySpark under the hood.
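To make that concrete, here is a minimal sketch of the two calls from my earlier mail wrapped in a function. The function name `excel_to_spark` and its parameters are only illustrative, and I am assuming PySpark 3.2 or later plus an Excel engine such as openpyxl on the driver.

```python
def excel_to_spark(path, sheet_name=0):
    """Read one Excel sheet with the pandas API on Spark and return a Spark DataFrame.

    `excel_to_spark`, `path` and `sheet_name` are illustrative names, not part of PySpark.
    """
    # Imported lazily so the module still loads where PySpark is not installed.
    from pyspark import pandas as ps

    # ps.read_excel returns a pyspark.pandas.DataFrame, not a pandas.DataFrame.
    psdf = ps.read_excel(path, sheet_name=sheet_name)

    # to_spark() hands back a regular pyspark.sql.DataFrame.
    return psdf.to_spark()
```

Calling `sdf = excel_to_spark("file")` and then `sdf.printSchema()` should show an ordinary Spark DataFrame. The sketch does not show how the workbook is parsed internally, so it is worth checking `sdf.rdd.getNumPartitions()` on your own data.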

On Wed, 23 Feb 2022 at 15:54, Sid <flinkbyhe...@gmail.com> wrote:

> Hi Bjørn,
>
> Thanks for your reply. This doesn't help while loading huge datasets.
> Won't be able to achieve Spark functionality while loading the file in
> distributed manner.
>
> Thanks,
> Sid
>
> On Wed, Feb 23, 2022 at 7:38 PM Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> from pyspark import pandas as ps
>>
>>
>> ps.read_excel?
>> "Support both `xls` and `xlsx` file extensions from a local filesystem or
>> URL"
>>
>> pdf = ps.read_excel("file")
>>
>> df = pdf.to_spark()
>>
>> On Wed, 23 Feb 2022 at 14:57, Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Gourav,
>>>
>>> Thanks for your time.
>>>
>>> I am worried about the distribution of data in case of a huge dataset
>>> file. Is Koalas still a better option to go ahead with? If yes, how can I
>>> use it with Glue ETL jobs? Do I have to pass some kind of external jars for
>>> it?
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Wed, Feb 23, 2022 at 7:22 PM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> this looks like a very specific and exact problem in its scope.
>>>>
>>>> Do you think that you can load the data into a pandas DataFrame and load
>>>> it back to Spark using a pandas UDF?
>>>>
>>>> Koalas is now natively integrated with Spark, try to see if you can use
>>>> those features.
>>>>
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>> On Wed, Feb 23, 2022 at 1:31 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> I have an excel file which unfortunately cannot be converted to CSV
>>>>> format and I am trying to load it using pyspark shell.
>>>>>
>>>>> I tried invoking the below pyspark session with the jars provided.
>>>>>
>>>>> pyspark --jars
>>>>> /home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,/home/siddhesh/Downloads/commons-collections4-4.4.jar,/home/siddhesh/Downloads/poi-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar,/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar,/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar
>>>>>
>>>>> and below is the code to read the excel file:
>>>>>
>>>>> df = spark.read.format("excel") \
>>>>>     .option("dataAddress", "'Sheet1'!") \
>>>>>     .option("header", "true") \
>>>>>     .option("inferSchema", "true") \
>>>>>     .load("/home/.../Documents/test_excel.xlsx")
>>>>>
>>>>> It is giving me the below error message:
>>>>>
>>>>> java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>>>>>
>>>>> I tried several Jars for this error but no luck. Also, what would be
>>>>> the efficient way to load it?
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>

--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297