Re: Loading .xlsx and .xlx files using pyspark

Bjørn Jørgensen Wed, 23 Feb 2022 06:59:12 -0800

You will. The pandas API on Spark, which is imported with `from pyspark
import pandas as ps`, is not pandas but an API that runs on PySpark underneath.
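
For reference, here is a minimal sketch of that route, not a definitive
recipe. It assumes Spark 3.2 or later (where pyspark.pandas ships with
PySpark) and that an Excel engine such as openpyxl is installed for pandas.
The function name excel_to_spark and its parameters are made up for
illustration.

```python
def excel_to_spark(path, sheet_name=0):
    """Read one Excel sheet with the pandas API on Spark and return a Spark DataFrame.

    `excel_to_spark` is a hypothetical helper name, not part of PySpark.
    """
    # Imported lazily so this module can still be loaded where PySpark is absent.
    import pyspark.pandas as ps

    # With a single sheet (the default, sheet_name=0) this returns a
    # pandas-on-Spark DataFrame, not a plain pandas DataFrame.
    psdf = ps.read_excel(path, sheet_name=sheet_name)

    # Convert to a regular Spark DataFrame for the rest of the job.
    return psdf.to_spark()
```

Calling excel_to_spark("file") should give a regular Spark DataFrame. As far
as I know, each workbook is still parsed by pandas inside a single task, so
one very large file is not split while it is being read; only the result is
distributed afterwards.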


On Wed, 23 Feb 2022 at 15:54, Sid <flinkbyhe...@gmail.com> wrote:

> Hi Bjørn,
>
> Thanks for your reply. This doesn't help when loading huge datasets.
> I won't be able to use Spark to load the file in a distributed
> manner.
>
> Thanks,
> Sid
>
> On Wed, Feb 23, 2022 at 7:38 PM Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> from pyspark import pandas as ps
>>
>>
>> ps.read_excel?
>> "Support both `xls` and `xlsx` file extensions from a local filesystem or
>> URL"
>>
>> pdf = ps.read_excel("file")
>>
>> df = pdf.to_spark()
>>
>> On Wed, 23 Feb 2022 at 14:57, Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Gourav,
>>>
>>> Thanks for your time.
>>>
>>> I am worried about the distribution of data in case of a huge dataset
>>> file. Is Koalas still a better option to go ahead with? If yes, how can I
>>> use it with Glue ETL jobs? Do I have to pass some kind of external jars for
>>> it?
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Wed, Feb 23, 2022 at 7:22 PM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> this looks like a very specific and exact problem in its scope.
>>>>
>>>> Do you think that you can load the data into a pandas DataFrame and load
>>>> it back into Spark using a pandas UDF?
>>>>
>>>> Koalas is now natively integrated with Spark; try to see if you can use
>>>> those features.
>>>>
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>> On Wed, Feb 23, 2022 at 1:31 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> I have an excel file which unfortunately cannot be converted to CSV
>>>>> format and I am trying to load it using pyspark shell.
>>>>>
>>>>> I tried invoking the below pyspark session with the jars provided.
>>>>>
>>>>> pyspark --jars
>>>>> /home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,/home/siddhesh/Downloads/commons-collections4-4.4.jar,/home/siddhesh/Downloads/poi-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar,/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar,/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar
>>>>>
>>>>> and below is the code to read the excel file:
>>>>>
>>>>> df = spark.read.format("excel") \
>>>>>      .option("dataAddress", "'Sheet1'!") \
>>>>>      .option("header", "true") \
>>>>>      .option("inferSchema", "true") \
>>>>>      .load("/home/.../Documents/test_excel.xlsx")
>>>>>
>>>>> It is giving me the below error message:
>>>>>
>>>>>  java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>>>>>
>>>>> I tried several jars to fix this error, but had no luck. Also, what
>>>>> would be an efficient way to load the file?
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
