[jira] [Assigned] (HIVE-27498) Support custom delimiter in SkippingTextInputFormat

Mayank Kunwar (Jira) Thu, 23 May 2024 08:34:04 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mayank Kunwar reassigned HIVE-27498:
------------------------------------

    Assignee: Mayank Kunwar

> Support custom delimiter in SkippingTextInputFormat
> ---------------------------------------------------
>
>                 Key: HIVE-27498
>                 URL: https://issues.apache.org/jira/browse/HIVE-27498
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Taraka Rama Rao Lethavadla
>            Assignee: Mayank Kunwar
>            Priority: Major
>
> A simple select returns the expected results when the table has these configs
> {noformat}
> 'skip.header.line.count'='1',                    
> 'textinputformat.record.delimiter'='|'{noformat}
> but select count(*), or any other query that launches a Tez job, treats the 
> whole file as a single line.
> *Test case*
> data.csv
> {noformat}
> Code    Name|A AAAA|B BBBB
> CCCC|C  DDDD{noformat}
> DDL
> {noformat}
> create external table test(code string,name string)
> ROW FORMAT SERDE
>    'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>  WITH SERDEPROPERTIES (
>    'field.delim'='\t')
>  STORED AS INPUTFORMAT
>    'org.apache.hadoop.mapred.TextInputFormat'
>  OUTPUTFORMAT
>    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>    location '${system:test.tmp.dir}/test'
>  TBLPROPERTIES (
>    'skip.header.line.count'='1',
>    'textinputformat.record.delimiter'='|');{noformat}
> Query result
> select code,name from test;
> {noformat}
> A AAAA
> B BBBB
> CCCC
> C DDDD{noformat}
> *Problem:* The query _+select count(*) from test+_ returns 1 instead of 3.
> This worked in older Hive versions.
> The behaviour changed with the feature introduced in 
> https://issues.apache.org/jira/browse/HIVE-21924
> That feature allows text files to be split while reading even when the 
> table is configured to skip headers, thereby increasing the number of 
> mappers that process the query and improving its throughput.
> The actual problem lies in how the new feature reads a file. It does not 
> consider the 'textinputformat.record.delimiter' property and reads the 
> file looking for newline characters. Since the input file does not have a 
> newline after every record, the whole file is read as a single line and 
> the count is returned as 1.
> Ref: 
> [https://github.com/apache/hive/blob/24a82a65f96b65eeebe4e23b2fec425037a70216/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L548]
>  
>  *Workaround*
> Removing the header from the data and the skip-header config from the table 
> properties, or compressing the files, avoids this issue.
>  
>  
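
To illustrate why the record delimiter matters for the count, here is a minimal Python sketch. It is not Hive's implementation and does not model SkippingTextInputFormat or split handling; the helper names `split_records` and `count_rows` are hypothetical, and the sample data assumes the fields in data.csv are separated by tabs, as the 'field.delim'='\t' setting in the DDL suggests.

```python
# Toy illustration only: shows that record boundaries (and therefore the row
# count after skipping the header) depend on which delimiter the reader uses.

def split_records(data: str, delimiter: str = "\n") -> list[str]:
    """Split raw text into records, dropping one trailing empty record."""
    records = data.split(delimiter)
    if records and records[-1] == "":
        records.pop()
    return records


def count_rows(data: str, delimiter: str, skip_header_lines: int) -> int:
    """Count records after skipping the first skip_header_lines records."""
    records = split_records(data, delimiter)
    return max(len(records) - skip_header_lines, 0)


# Contents of data.csv from the test case (tab-separated fields are assumed).
DATA = "Code\tName|A\tAAAA|B\tBBBB\nCCCC|C\tDDDD"

# Honouring 'textinputformat.record.delimiter'='|' gives a header plus 3 rows.
rows_with_custom_delimiter = count_rows(DATA, "|", skip_header_lines=1)

# Looking for newlines instead finds boundaries that do not match the records:
# the embedded newline is part of the second data row, not a record separator.
newline_chunks = split_records(DATA, "\n")

print(rows_with_custom_delimiter)
print(len(newline_chunks))
```

The second data row keeps its embedded newline when '|' is the delimiter, which is why the simple select prints "B BBBB" and "CCCC" on separate lines while still being one record.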



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
