You have forgotten a y: it must be MM/dd/yyyy
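The patterns here are Java SimpleDateFormat patterns, where the number of pattern letters changes how the year is interpreted. Python's strptime has an analogous behavior that can be demonstrated without a Spark session (this is a Python-only analogy, not Spark code): a year under the full-year directive is taken literally, which is exactly the 0015-instead-of-2015 symptom, while the two-digit directive applies a century pivot. If the source column really holds two-digit years, a two-letter year pattern ('yy' in SimpleDateFormat, '%y' in strptime) may be what is actually needed.

```python
from datetime import datetime

# A two-digit year under the two-digit directive gets a century pivot:
# "15" -> 2015 (like SimpleDateFormat's 'yy' pattern).
pivoted = datetime.strptime("08/16/15", "%m/%d/%y")

# A full-year directive takes its digits literally: "0015" stays year 15,
# which is the 0015-instead-of-2015 symptom seen in df2.show().
literal = datetime.strptime("08/16/0015", "%m/%d/%Y")

print(pivoted.year, literal.year)  # 2015 15
```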
> On 17. Aug 2017, at 21:30, Aakash Basu <aakash.spark....@gmail.com> wrote:
>
> Hi Palwell,
>
> Tried doing that, but it's becoming null for all the dates after the
> transformation with functions.
>
> df2 = dflead.select('Enter_Date', f.to_date(df2.Enter_Date))
>
> <image.png>
>
> Any insight?
>
> Thanks,
> Aakash.
>
>> On Fri, Aug 18, 2017 at 12:23 AM, Patrick Alwell <palw...@hortonworks.com> wrote:
>> Aakash,
>>
>> I've had similar issues with date-time formatting. Try using the functions
>> library from pyspark.sql and the DF withColumn() method.
>>
>> ——————————————————————————————
>>
>> from pyspark.sql import functions as f
>>
>> lineitem_df = lineitem_df.withColumn('shipdate', f.to_date(lineitem_df.shipdate))
>>
>> ——————————————————————————————
>>
>> You should have first ingested the column as a string, and then leveraged
>> the DF API to make the conversion to DateType.
>>
>> That should work.
>>
>> Kind regards,
>>
>> -Pat Alwell
>>
>>> On Aug 17, 2017, at 11:48 AM, Aakash Basu <aakash.spark....@gmail.com> wrote:
>>>
>>> Hey all,
>>>
>>> Thanks! I had a discussion with the person who authored that package and
>>> informed him about this bug, but in the meantime, with the same package, I
>>> found a small tweak to get the job done.
>>>
>>> Now that is fine; I'm getting the date as a string by predefining the
>>> schema, but I want to later convert it to a datetime format, which is making
>>> it this -
>>>
>>> from pyspark.sql.functions import from_unixtime, unix_timestamp
>>>
>>> df2 = dflead.select('Enter_Date',
>>>     from_unixtime(unix_timestamp('Enter_Date', 'MM/dd/yyy')).alias('date'))
>>>
>>> df2.show()
>>>
>>> <image.png>
>>>
>>> Which is not correct, as it is converting the 15 to 0015 instead of 2015.
>>> Do you guys think using the DateUtil package will solve this? Or any other
>>> solution with this built-in package?
>>>
>>> Please help!
>>>
>>> Thanks,
>>> Aakash.
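A note on the nulls above: in Spark 2.1, f.to_date() takes no format argument and only parses the default yyyy-MM-dd layout, so MM/dd/yyyy strings silently come back as null rather than raising. (The quoted line also references df2 on the right-hand side before df2 exists; presumably dflead.Enter_Date was meant.) A rough Python-only sketch of that lenient null-on-mismatch behavior, where parse_or_none is a made-up helper, not a Spark API:

```python
from datetime import datetime

def parse_or_none(value, fmt):
    """Return a date on success, None on mismatch -- loosely mirroring
    how Spark's to_date yields null instead of raising on bad input."""
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None

# Default-style input parses; MM/dd/yyyy-style input does not.
print(parse_or_none("2015-08-16", "%Y-%m-%d"))   # 2015-08-16
print(parse_or_none("08/16/2015", "%Y-%m-%d"))   # None
```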
>>>
>>>> On Thu, Aug 17, 2017 at 12:01 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> You can use Apache POI DateUtil to convert a double to a Date
>>>> (https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DateUtil.html).
>>>> Alternatively you can try HadoopOffice
>>>> (https://github.com/ZuInnoTe/hadoopoffice/wiki); it supports Spark 1.x and
>>>> Spark 2.0 datasources.
>>>>
>>>> On 16. Aug 2017, at 20:15, Aakash Basu <aakash.spark....@gmail.com> wrote:
>>>>
>>>>> Hey Irving,
>>>>>
>>>>> Thanks for the quick revert. In Excel that column is purely a string; I
>>>>> actually want to import it as a string and later play around with the DF to
>>>>> convert it back to a date type, but the API itself is not allowing me to
>>>>> dynamically assign a schema to the DF, and I'm forced to use inferSchema,
>>>>> which converts all numeric columns to double (though I don't know how the
>>>>> date column is getting converted to double if it is a string in the Excel
>>>>> source).
>>>>>
>>>>> Thanks,
>>>>> Aakash.
>>>>>
>>>>> On 16-Aug-2017 11:39 PM, "Irving Duran" <irving.du...@gmail.com> wrote:
>>>>> I think there is a difference between the actual value in the cell and
>>>>> how Excel formats that cell. You probably want to import that field as
>>>>> a string, or not have it as a date format in Excel.
>>>>>
>>>>> Just a thought....
>>>>>
>>>>> Thank you,
>>>>>
>>>>> Irving Duran
>>>>>
>>>>>> On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu
>>>>>> <aakash.spark....@gmail.com> wrote:
>>>>>> Hey all,
>>>>>>
>>>>>> Forgot to attach the link to the discussion about overriding the schema
>>>>>> through the external package:
>>>>>>
>>>>>> https://github.com/crealytics/spark-excel/pull/13
>>>>>>
>>>>>> You can see my comment there too.
>>>>>>
>>>>>> Thanks,
>>>>>> Aakash.
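The POI DateUtil suggestion points at the underlying cause: Excel stores dates as day-count serial numbers, which is why an inferred schema surfaces the column as double. A minimal pure-Python sketch of the conversion DateUtil performs, assuming the 1900 date system and the conventional 1899-12-30 epoch (this ignores Lotus's fictitious 1900-02-29 for serials below 61):

```python
from datetime import datetime, timedelta

# Conventional epoch for Excel's 1900 date system (post-Feb-1900 serials).
EXCEL_EPOCH = datetime(1899, 12, 30)

def excel_serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel serial date (whole days since the epoch, with a
    fractional part for time of day) to a Python datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)

print(excel_serial_to_datetime(43831.0))   # 2020-01-01 00:00:00
print(excel_serial_to_datetime(43831.5))   # 2020-01-01 12:00:00
```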
>>>>>>
>>>>>>> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu
>>>>>>> <aakash.spark....@gmail.com> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am working with PySpark (Python 3.6 and Spark 2.1.1) and trying to
>>>>>>> fetch data from an Excel file using
>>>>>>> spark.read.format("com.crealytics.spark.excel"), but it is inferring
>>>>>>> double for a date-type column.
>>>>>>>
>>>>>>> The detailed description is given here (the question I posted):
>>>>>>>
>>>>>>> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>>>>>>
>>>>>>> Found that it is probably a bug in the crealytics Excel read package.
>>>>>>>
>>>>>>> Can somebody help me with a workaround for this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Aakash.