[jira] [Commented] (HIVE-25765) skip.header.line.count property skips rows of each block in FetchOperator when file size is larger

2024-04-17 Thread Miklos Szurap (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838024#comment-17838024
 ] 

Miklos Szurap commented on HIVE-25765:
--

We've also faced this recently, and it's even more apparent when using S3 as 
storage, since the [block size in 
S3|https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/performance.html]
 is only 32 MB, so large files are read in more blocks and more rows are skipped:
{code}
fs.s3a.block.size=32M
{code}
Can somebody reopen the pull request and help with the commit?

> skip.header.line.count property skips rows of each block in FetchOperator 
> when file size is larger
> --
>
> Key: HIVE-25765
> URL: https://issues.apache.org/jira/browse/HIVE-25765
> Project: Hive
> Issue Type: Bug
> Affects Versions: 3.1.2, 4.0.0
> Reporter: Ganesha Shreedhara
> Assignee: Ganesha Shreedhara
> Priority: Major
> Labels: pull-request-available
> Attachments: data.txt.gz
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> When the _skip.header.line.count_ property is set in the table properties, simple 
> select queries that get converted into a FetchTask skip rows of each block 
> instead of skipping only the header lines of each file. This happens when the 
> file is large enough to be read in multiple blocks. The issue doesn't occur when 
> the select query is converted into a map-only job by setting 
> _hive.fetch.task.conversion_ to _none_, because the header lines are then skipped 
> only for the first block thanks to [this 
> check|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java#L330].
> We should have a similar check in FetchOperator to avoid this issue. 
>  
> *Steps to reproduce:* 
> {code:sql}
> -- Create a table on top of the data file attached to this ticket
> -- (uncompressed size: ~239M)
> CREATE EXTERNAL TABLE test_table(
>   col1 string,
>   col2 string,
>   col3 string,
>   col4 string,
>   col5 string,
>   col6 string,
>   col7 string,
>   col8 string,
>   col9 string,
>   col10 string,
>   col11 string,
>   col12 string)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'location_of_data_file'
> TBLPROPERTIES ('skip.header.line.count'='1');
> -- Counting the rows gives the correct result, with only one header line
> -- skipped
> select count(*) from test_table;
> -- 3145727
> -- The select query skips more rows, and the result depends on the number of
> -- blocks configured in the underlying filesystem. 3 rows are skipped when the
> -- file is read in 3 blocks.
> select * from test_table;
> -- .
> -- .
> -- Fetched 3145724 rows
> {code}
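
The behaviour described in the issue can be illustrated with a small standalone simulation. This is only a sketch and not Hive code: the function names (`make_splits`, `read_split`, `fetch`) are hypothetical, the line-reading rule is a simplified model of how a text file is read in byte ranges, and the "fixed" variant assumes the check amounts to skipping header lines only for the split that starts at offset 0.

```python
from typing import List, Tuple


def make_splits(data: bytes, split_size: int) -> List[Tuple[int, int]]:
    """Cut the file into (start, length) byte ranges, like fixed-size blocks."""
    return [(start, min(split_size, len(data) - start))
            for start in range(0, len(data), split_size)]


def read_split(data: bytes, start: int, length: int) -> List[bytes]:
    """Return the lines owned by one split.

    Simplified model: a split that does not begin at offset 0 discards its
    first (possibly partial) line, and every split keeps reading while the
    line begins at or before its end offset.
    """
    end = start + length
    pos = start
    if start != 0:
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    lines = []
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        nxt = len(data) if newline == -1 else newline + 1
        lines.append(data[pos:nxt].rstrip(b"\n"))
        pos = nxt
    return lines


def fetch(data: bytes, split_size: int, header_count: int,
          only_first_split: bool) -> List[bytes]:
    """Read all splits and drop header lines.

    only_first_split=False models the reported bug (headers dropped from
    every split); True models the assumed fix (only the split at offset 0).
    """
    rows = []
    for start, length in make_splits(data, split_size):
        lines = read_split(data, start, length)
        if header_count > 0 and (start == 0 or not only_first_split):
            lines = lines[header_count:]
        rows.extend(lines)
    return rows


# One header line followed by ten data rows, read in 20-byte splits.
expected = [f"row{i}".encode() for i in range(10)]
data = b"\n".join([b"header"] + expected) + b"\n"

splits = make_splits(data, 20)
buggy = fetch(data, 20, header_count=1, only_first_split=False)
fixed = fetch(data, 20, header_count=1, only_first_split=True)

print(len(splits), len(buggy), len(fixed))
```

In this toy input the file is read in three splits, so the buggy variant drops the header plus the first data row of each of the two later splits, while the variant that checks the split offset returns all ten rows.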



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-25765) skip.header.line.count property skips rows of each block in FetchOperator when file size is larger

2021-12-04 Thread Ganesha Shreedhara (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17453520#comment-17453520
 ] 

Ganesha Shreedhara commented on HIVE-25765:
---

[~pgaref] Yes, this issue is reproducible in the latest master branch. 






[jira] [Commented] (HIVE-25765) skip.header.line.count property skips rows of each block in FetchOperator when file size is larger

2021-12-03 Thread Panagiotis Garefalakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17453204#comment-17453204
 ] 

Panagiotis Garefalakis commented on HIVE-25765:
---

Hey [~ganeshas]  – thanks for reporting this! 
Is this bug also visible in the latest master branch?



