Re: CTAS and save as parquet last column values are shown as null

Abhishek Girish Mon, 24 Jul 2017 08:43:26 -0700

Filed DRILL-5684 <https://issues.apache.org/jira/browse/DRILL-5684> to
track the doc issue.


On Mon, Jul 24, 2017 at 8:33 AM, Abhishek Girish <agir...@apache.org> wrote:

> Glad to know that it worked!
>
> As you are using Drill on Windows, the new line delimiter in text files
> can be different from that on Linux / Mac. We could see \r\n as the
> lineDelimiter (carriage return & new line) and hence when we set the same
> in the format plugin, the issue gets resolved. This particular one doesn't
> seem to be documented - however ideally you should be able to see this in
> [1], [2].
>
> Coming to your question on date formats, you can refer to [3]. It has
> clear examples on how to convert to a different date format. Do let us know
> if that helped.
>
> [1] https://drill.apache.org/docs/plugin-configuration-
> basics/#list-of-attributes-and-definitions
> [2] https://drill.apache.org/docs/text-files-csv-tsv-psv/#
> configuring-drill-to-read-text-files
> [3] https://drill.apache.org/docs/data-type-conversion/#to_date
>
> On Sun, Jul 23, 2017 at 11:52 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
>> Thank you so much it worked
>>
>> Can you please provide me the pointer to the documentation where updation
>> for different format type are mentioned .
>>
>> As I am facing another issue with date type as the data which I receive
>> in csv format has the format of 15/1/2016 when I try to cast or convert
>> to_date it throws me error
>>
>>
>>
>>
>> Thanks ,
>> Divya
>>
>> On 24 July 2017 at 14:17, Abhishek Girish <agir...@apache.org> wrote:
>>
>>> Can you update your csv format plugin as shown below and retry your
>>> query?
>>>
>>>     "csv": {
>>>       "type": "text",
>>>       "extensions": [
>>>         "csv"
>>>       ],
>>>       "lineDelimiter": "\r\n",
>>>       "extractHeader": true,
>>>       "delimiter": ","
>>>     }
>>>
>>> On Sun, Jul 23, 2017 at 10:37 PM, Divya Gehlot <divya.htco...@gmail.com>
>>> wrote:
>>>
>>> > 0: jdbc:drill:zk=local> select * FROM
>>> >  dfs.`installedsoftwares/ApacheDrill/apache-drill-1.10.
>>> > 0.tar/apache-drill-1.10.0/sample-data/jll/data/mapping/
>>> > PublicHoliday/PublicHoliday.csv`
>>> > limit 10 ;
>>> > +-----------------------------------------+
>>> > |                 columns                 |
>>> > +-----------------------------------------+
>>> > | ["Day","Date","Area\r"]                 |
>>> > | ["Friday","15/1/2016","Karnataka\r"]    |
>>> > | ["Tuesday","26/1/2016","Karnataka\r"]   |
>>> > | ["Monday","7/3/2016","Karnataka\r"]     |
>>> > | ["Friday","25/3/2016","Karnataka\r"]    |
>>> > | ["Friday","1/4/2016","Karnataka\r"]     |
>>> > | ["Friday","8/4/2016","Karnataka\r"]     |
>>> > | ["Thursday","14/4/2016","Karnataka\r"]  |
>>> > | ["Tuesday","19/4/2016","Karnataka\r"]   |
>>> > | ["Sunday","1/5/2016","Karnataka\r"]     |
>>> > +-----------------------------------------+
>>> > 10 rows selected (0.122 seconds)
>>> > 0: jdbc:drill:zk=local> select * from
>>> > `dfs`.`tmp`.`installedsoftwares/ApacheDrill/apache-drill-1.10.
>>> > 0.tar/apache-drill-1.10.0/sample-data/jll/publicholiday.parquet`
>>> > limit 10 ;
>>> > +-----------+------------+-------+
>>> > |    Day    |    Date    | Area  |
>>> > +-----------+------------+-------+
>>> > | Friday    | 15/1/2016  | null  |
>>> > | Tuesday   | 26/1/2016  | null  |
>>> > | Monday    | 7/3/2016   | null  |
>>> > | Friday    | 25/3/2016  | null  |
>>> > | Friday    | 1/4/2016   | null  |
>>> > | Friday    | 8/4/2016   | null  |
>>> > | Thursday  | 14/4/2016  | null  |
>>> > | Tuesday   | 19/4/2016  | null  |
>>> > | Sunday    | 1/5/2016   | null  |
>>> > | Monday    | 9/5/2016   | null  |
>>> > +-----------+------------+-------+
>>> > 10 rows selected (0.1 seconds)
>>> > 0: jdbc:drill:zk=local>
>>> >
>>> >
>>> > *Drill set up* : Aapche drill is set up on Windows machine in embedded
>>> mode
>>> > .
>>> >
>>> > On 24 July 2017 at 13:30, Divya Gehlot <divya.htco...@gmail.com>
>>> wrote:
>>> >
>>> > >
>>> > > Pasting the result set in text format
>>> > >
>>> > > *Reading parquet file format :*
>>> > >
>>> > >> Day       Date         Area
>>> > >> Friday 15/1/2016 null
>>> > >> Tuesday 26/1/2016 null
>>> > >> Monday 7/3/2016 null
>>> > >> Friday 25/3/2016 null
>>> > >> Friday 1/4/2016 null
>>> > >> Friday 8/4/2016 null
>>> > >
>>> > >
>>> > >
>>> > > *Reading csv file format *
>>> > >
>>> > >> columns
>>> > >> ["Day","Date","Area\r"]
>>> > >> ["Friday","1/4/2016","Karnataka\r"]
>>> > >> ["Friday","15/1/2016","Karnataka\r"]
>>> > >> ["Friday","25/3/2016","Karnataka\r"]
>>> > >> ["Friday","8/4/2016","Karnataka\r"]
>>> > >> ["Monday","7/3/2016","Karnataka\r"]
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > *CTAS query csv to parquet :*
>>> > >
>>> > > Create table `dfs`.`tmp`.`publicholiday.parquet` AS
>>> > >> SELECT
>>> > >> CASE WHEN `Day` = '' THEN CAST(NULL AS VARCHAR(100)) ELSE
>>> CAST(`Day` AS
>>> > >> VARCHAR(100)) END AS `Day`,
>>> > >> CASE WHEN `Date` = '' THEN CAST(NULL AS VARCHAR(100)) ELSE
>>> CAST(`Date`
>>> > AS
>>> > >> VARCHAR(100)) END AS `Date`,
>>> > >> CASE WHEN `Area` = '' THEN CAST(NULL AS VARCHAR(100)) ELSE
>>> CAST(`Area`
>>> > AS
>>> > >> VARCHAR(100)) END AS `Area`
>>> > >> FROM TABLE (dfs.`PublicHoliday.csv`(type => 'text',fieldDelimiter =>
>>> > ',',
>>> > >> extractHeader => true))
>>> > >
>>> > >
>>> > >
>>> > > Thanks,
>>> > > Divya
>>> > >
>>> > > On 24 July 2017 at 13:20, Abhishek Girish <agir...@apache.org>
>>> wrote:
>>> > >
>>> > >> Unfortunately, the attachments / pictures haven't come through.
>>> Mailing
>>> > >> lists sometimes do not support these. Can you paste as text or share
>>> > links
>>> > >> to it instead?
>>> > >>
>>> > >> On Sun, Jul 23, 2017 at 9:14 PM, Divya Gehlot <
>>> divya.htco...@gmail.com>
>>> > >> wrote:
>>> > >>
>>> > >> > yes it shows the proper values when I query the csv file.
>>> > >> > CTAS query csv to parquet :
>>> > >> > Create table `dfs`.`tmp`.`publicholiday.parquet` AS
>>> > >> > SELECT
>>> > >> > CASE WHEN `Day` = '' THEN CAST(NULL AS VARCHAR(100)) ELSE
>>> CAST(`Day`
>>> > AS
>>> > >> > VARCHAR(100)) END AS `Day`,
>>> > >> > CASE WHEN `Date` = '' THEN CAST(NULL AS VARCHAR(100)) ELSE
>>> CAST(`Date`
>>> > >> AS
>>> > >> > VARCHAR(100)) END AS `Date`,
>>> > >> > CASE WHEN `Area` = '' THEN CAST(NULL AS VARCHAR(100)) ELSE
>>> CAST(`Area`
>>> > >> AS
>>> > >> > VARCHAR(100)) END AS `Area`
>>> > >> > FROM TABLE (dfs.`PublicHoliday.csv`(type => 'text',fieldDelimiter
>>> =>
>>> > >> ',',
>>> > >> > extractHeader => true))
>>> > >> >
>>> > >> > CSV File
>>> > >> >
>>> > >> > Parquet File
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > Appreciate the help !
>>> > >> >
>>> > >> > Thanks,
>>> > >> > Divya 
>>> > >> >
>>> > >> > On 24 July 2017 at 11:52, Abhishek Girish <agir...@apache.org>
>>> wrote:
>>> > >> >
>>> > >> >> Can you share a sample row from the CSV and the CTAS query? Also
>>> test
>>> > >> if a
>>> > >> >> select columns[n] query on the CSV file works as expected [1] ?
>>> > >> >>
>>> > >> >> It could be an issue with delimiters.
>>> > >> >>
>>> > >> >> [1]
>>> > >> >> https://drill.apache.org/docs/querying-plain-text-files/#col
>>> > >> >> umns[n]-syntax
>>> > >> >> On Sun, Jul 23, 2017 at 8:44 PM Divya Gehlot <
>>> > divya.htco...@gmail.com>
>>> > >> >> wrote:
>>> > >> >>
>>> > >> >> > Hi ,
>>> > >> >> > I am facing as weird issue when I CTAS and save the csv file as
>>> > >> parquet
>>> > >> >> it
>>> > >> >> > displays the last column values as null .
>>> > >> >> > This is not the case with one file .
>>> > >> >> > If I take any csv file with even with any data type and do a
>>> > >> >> > select column1,column2,column3 from table.parquet
>>> > >> >> > it shows the column3 values as null.
>>> > >> >> >
>>> > >> >> > Appreciate the help.
>>> > >> >> >
>>> > >> >> > Thanks,
>>> > >> >> > Divya
>>> > >> >> >
>>> > >> >>
>>> > >> >
>>> > >> >
>>> > >>
>>> > >
>>> > >
>>> >
>>>
>>
>>
>

Re: CTAS and save as parquet last column values are shown as null

Reply via email to