The standalone koalas project should have the same functionality for older Spark versions: https://koalas.readthedocs.io/en/latest/
You should be moving to Spark 3 though; 2.x is EOL.

On Wed, Feb 23, 2022 at 9:06 AM Sid <flinkbyhe...@gmail.com> wrote:

> Cool. Here, the problem is that I have to run the Spark jobs on Glue ETL,
> which supports Spark 2.4.3, and I don't think this distributed pandas
> support was added in that version. AFAIK, it was added in version 3.2.
>
> So how can I do it in Spark 2.4.3? Correct me if I'm wrong.
>
> On Wed, Feb 23, 2022 at 8:28 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
>> You will. The pandas API on Spark, imported with `from pyspark import
>> pandas as ps`, is not pandas but an API that uses PySpark underneath.
>>
>> On Wed, Feb 23, 2022 at 3:54 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Bjørn,
>>>
>>> Thanks for your reply. This doesn't help while loading huge datasets.
>>> I won't be able to get Spark's functionality of loading the file in a
>>> distributed manner.
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Wed, Feb 23, 2022 at 7:38 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>
>>>> from pyspark import pandas as ps
>>>>
>>>> ps.read_excel?
>>>> "Support both `xls` and `xlsx` file extensions from a local filesystem
>>>> or URL"
>>>>
>>>> pdf = ps.read_excel("file")
>>>>
>>>> df = pdf.to_spark()
>>>>
>>>> On Wed, Feb 23, 2022 at 2:57 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> Hi Gourav,
>>>>>
>>>>> Thanks for your time.
>>>>>
>>>>> I am worried about the distribution of data in the case of a huge
>>>>> dataset file. Is Koalas still a better option to go ahead with? If
>>>>> yes, how can I use it with Glue ETL jobs? Do I have to pass some kind
>>>>> of external jars for it?
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>> On Wed, Feb 23, 2022 at 7:22 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> this looks like a very specific and exact problem in its scope.
>>>>>>
>>>>>> Do you think that you can load the data into a pandas dataframe and
>>>>>> load it back to Spark using a pandas UDF?
>>>>>>
>>>>>> Koalas is now natively integrated with Spark; try to see if you can
>>>>>> use those features.
>>>>>>
>>>>>> Regards,
>>>>>> Gourav
>>>>>>
>>>>>> On Wed, Feb 23, 2022 at 1:31 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>
>>>>>>> I have an Excel file which unfortunately cannot be converted to CSV
>>>>>>> format, and I am trying to load it using the pyspark shell.
>>>>>>>
>>>>>>> I tried invoking the below pyspark session with the jars provided:
>>>>>>>
>>>>>>> pyspark --jars /home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,/home/siddhesh/Downloads/commons-collections4-4.4.jar,/home/siddhesh/Downloads/poi-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar,/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar,/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar
>>>>>>>
>>>>>>> and below is the code to read the Excel file:
>>>>>>>
>>>>>>> df = spark.read.format("excel") \
>>>>>>>     .option("dataAddress", "'Sheet1'!") \
>>>>>>>     .option("header", "true") \
>>>>>>>     .option("inferSchema", "true") \
>>>>>>>     .load("/home/.../Documents/test_excel.xlsx")
>>>>>>>
>>>>>>> It is giving me the below error message:
>>>>>>>
>>>>>>> java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>>>>>>>
>>>>>>> I tried several jars for this error but no luck. Also, what would be
>>>>>>> the efficient way to load it?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sid
>>>>
>>>> --
>>>> Bjørn Jørgensen
>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>> Norge
>>>>
>>>> +47 480 94 297
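The driver-side workaround suggested in the thread for Spark 2.4.x (where the distributed pandas API does not exist) can be sketched as follows. This is a minimal illustration, not the thread's exact code: the DataFrame contents stand in for an actual `pd.read_excel` call, and the Spark hand-off lines are shown as comments so the sketch runs without a cluster.

```python
import pandas as pd

# On Spark 2.4.x the pandas API on Spark (pyspark.pandas, Spark >= 3.2)
# is unavailable, so one common pattern is to read the whole sheet on the
# driver with plain pandas and then convert it to a Spark DataFrame.
# The frame below is an invented stand-in for:
#     pdf = pd.read_excel("test_excel.xlsx", sheet_name="Sheet1")
# (reading .xlsx requires the openpyxl engine to be installed).
pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Hand-off to Spark -- note the entire sheet must fit in driver memory:
#     spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # faster conversion
#     df = spark.createDataFrame(pdf)

print(pdf.shape[0])
```

The conversion itself is not distributed; only the resulting Spark DataFrame is. For sheets too large for driver memory, this pattern does not apply.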
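Since the thread mentions both the standalone koalas project (for Spark 2.x) and `pyspark.pandas` (Spark 3.2+), a version-tolerant import guard can select whichever is available. This is a sketch under the assumption that neither package is necessarily installed; both module names are the real published ones, but the fallbacks below are defensive rather than prescriptive.

```python
# Pick the pandas-on-Spark entry point that matches the cluster:
# - Spark >= 3.2 ships it as pyspark.pandas
# - older clusters (e.g. Glue with Spark 2.4.3) can use the standalone
#   package from PyPI ("pip install koalas"), imported as databricks.koalas
try:
    from pyspark import pandas as ps          # Spark >= 3.2
except ImportError:
    try:
        import databricks.koalas as ps        # standalone koalas for Spark 2.x
    except ImportError:
        ps = None                             # neither is installed here

# Usage is the same through either entry point (names from the thread):
#     pdf = ps.read_excel("file")
#     df = pdf.to_spark()

print(ps is None or hasattr(ps, "read_excel"))
```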