Goden Yao created HAWQ-56:
-----------------------------

             Summary: Non-deterministic header results with "HEADER" option 
from external table
                 Key: HAWQ-56
                 URL: https://issues.apache.org/jira/browse/HAWQ-56
             Project: Apache HAWQ
          Issue Type: Bug
          Components: External Tables & PXF
            Reporter: Goden Yao
            Priority: Minor


** Repro Steps**
External table definition
{code:sql}
drop external table if exists testtbl;
create external table testtbl(i text, j text)
location 
('pxf://nakaphdns/tmp/testdata/*?Fragmenter=com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter&Accessor=com.pivotal.pxf.plugins.hdfs.TextFileAccessor&Resolver=com.pivotal.pxf.plugins.hdfs.TextResolver')
format 'TEXT' (delimiter ',' header);

select * from testtbl;
{code}

example with 4 segment servers and 4 test files with headers in hdfs
{code:sql}
gpadmin=# select * from testtbl ;
 i | j
---+---
 3 | c
 2 | b
 1 | a
 4 | d
(4 rows)
{code}

With 5 test files
{code:sql}
gpadmin=# select * from testtbl ;
 i  |   j
----+-------
 5  | e
 2  | b
 ID | Value
 3  | c
 1  | a
 ID | VALUE
 4  | d
(7 rows)
{code}

**Analysis**
When using HEADER option, header line is removed only once per segment.
As a result there will be different results depending on the number of 
segments/fragments are scanned.

The reason for this behavior is that header line handling is done by the 
external protocol code (fileam.c) which checks if the header_line flag is true, 
and if so skips the first line and sets the flag to false. This code calls the 
custom protocol code (in our case pxf) to get the next tuples, and so doesn't 
know if the tuples are from the same resource or not.
>From what I can see, in gpfdist protocol the problem is solved by letting the 
>custom protocol code handle this and marking the flag as false for the 
>external protocol infrastructure in fileam.c.

** Proposed Solution **
Add a check in pxf validator function (a function that is being called as part 
of external table creation).
This check will error out if HEADER option is used in a PXF table.
Currently the validation function API only allows access to the table's URLs 
(in the LOCATION part of the table), and not the format options. In order to 
add the check an API change in the external protocol is required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to