Liu Weizheng created HIVE-28262:
-----------------------------------

             Summary: MultiDelimitSerDe parses columns incorrectly for a single-column table
                 Key: HIVE-28262
                 URL: https://issues.apache.org/jira/browse/HIVE-28262
             Project: Hive
          Issue Type: Bug
          Components: HiveServer2
    Affects Versions: 3.1.3, 4.1.0
         Environment: Hive version: 3.1.3
            Reporter: Liu Weizheng
            Assignee: Liu Weizheng
             Fix For: 4.1.0
         Attachments: CleanShot 2024-05-16 at 15.13...@2x.png, CleanShot 
2024-05-16 at 15.17...@2x.png

ENV:

Hive: 3.1.3/4.1.0

HDFS: 3.3.1

--------------------------

Create a text file to load into the external table (e.g. /tmp/data):

 
{code:java}
1|@|
2|@|
3|@| {code}

Create an external table:

 
{code:java}
CREATE EXTERNAL TABLE IF NOT EXISTS test_split_tmp (`ID` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim'='|@|')
STORED AS TEXTFILE
LOCATION '/tmp/test_split_tmp'; {code}
 

Put the text file into the external table path:

 
{code:java}
hdfs dfs -put /tmp/data /tmp/test_split_tmp {code}

Query the table, casting column id to long:

 
{code:java}
select UDFToLong(`id`) from test_split_tmp; {code}
*Why use the UDFToLong function? Because it returns NULL in this condition, while applying it to the string '1' should return the long value 1.*


{code:java}
+--------+
| id     |
+--------+
| NULL   |
| NULL   |
| NULL   |
+--------+ {code}
Therefore, I suspect there is an issue with the field splitting in MultiDelimitSerDe.

While debugging, I found the following problems:
 * org.apache.hadoop.hive.serde2.lazy.LazyStruct#findIndexes

           *when fields.length=1, the delimiter index is never found*

 
{code:java}
private int[] findIndexes(byte[] array, byte[] target) {
  if (fields.length <= 1) {  // bug
    return new int[0];
  }
  ...
  for (int i = 1; i < indexes.length; i++) {  // bug
    array = Arrays.copyOfRange(array, indexInNewArray + target.length, 
array.length);
    indexInNewArray = Bytes.indexOf(array, target);
    if (indexInNewArray == -1) {
      break;
    }
    indexes[i] = indexInNewArray + indexes[i - 1] + target.length;
  }
  return indexes;
}{code}
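For illustration, here is a minimal standalone sketch of that early-out (hypothetical demo class, not the real LazyStruct; `fieldCount` stands in for `fields.length`). It shows that for a single-column row such as "1|@|", the delimiter at offset 1 is never searched for at all:

```java
public class FindIndexesBugDemo {
    // Standalone mirror of the buggy guard; fieldCount plays the role
    // of fields.length in LazyStruct#findIndexes.
    static int[] findIndexes(byte[] array, byte[] target, int fieldCount) {
        if (fieldCount <= 1) {
            // Buggy early-out: the delimiter search is skipped entirely,
            // even though "1|@|" contains the delimiter at offset 1.
            return new int[0];
        }
        throw new UnsupportedOperationException("multi-field search elided");
    }

    public static void main(String[] args) {
        int[] idx = findIndexes("1|@|".getBytes(), "|@|".getBytes(), 1);
        System.out.println(idx.length);  // no delimiter index is returned
    }
}
```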
 
 * org.apache.hadoop.hive.serde2.lazy.LazyStruct#parseMultiDelimit

           *when fields.length=1, the column startPosition is never computed*

 
{code:java}
public void parseMultiDelimit(byte[] rawRow, byte[] fieldDelimit) {
  ...
  int[] delimitIndexes = findIndexes(rawRow, fieldDelimit);
  ...
    if (fields.length > 1 && delimitIndexes[i - 1] != -1) { // bug
      int start = delimitIndexes[i - 1] + fieldDelimit.length;
      startPosition[i] = start - i * diff;
    } else {
      startPosition[i] = length + 1;
    }
  }
  Arrays.fill(fieldInited, false);
  parsed = true;
}{code}

Multi-delimiter processing:

*Actual:*  1|@| -> 1^A; the id column starts at 0, the next column starts at 1

*Expected:*  1|@| -> 1^A; the id column starts at 0, the next column starts at 2
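The expected offsets follow from the delimiter-shrink arithmetic in parseMultiDelimit: each 3-byte |@| is rewritten to the 1-byte ^A, so diff = 2, and the column after the first delimiter should start at delimitIndexes[0] + fieldDelimit.length - 1 * diff. A minimal standalone sketch of that arithmetic (demo values hard-coded for the row "1|@|", not the real class):

```java
public class StartPositionDemo {
    public static void main(String[] args) {
        // The 3-byte delimiter |@| is rewritten to the 1-byte ^A, so each
        // consumed delimiter shortens the row by diff bytes.
        int delimLen = 3;                   // "|@|" length
        int replacedLen = 1;                // "^A" length
        int diff = delimLen - replacedLen;  // 2

        int delimIndex = 1;                 // offset of |@| in the raw row "1|@|"

        int startCol0 = 0;                                 // id column
        int startCol1 = delimIndex + delimLen - 1 * diff;  // 1 + 3 - 2 = 2
        System.out.println(startCol0 + " " + startCol1);
    }
}
```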

 

Fix:
 # fields.length=1 should still find the multi-delimiter indexes
 # fields.length=1 should calculate the column start positions correctly
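A sketch of what the corrected index search could look like (a hypothetical standalone method, not the actual patch): scan for every occurrence of the delimiter regardless of the declared field count, so a single-column row still reports the trailing delimiter.

```java
import java.util.Arrays;

public class MultiDelimitIndexDemo {
    // Find every occurrence of target in array, independent of field count.
    static int[] findAllIndexes(byte[] array, byte[] target) {
        int[] indexes = new int[array.length];
        int count = 0;
        int i = 0;
        while (i + target.length <= array.length) {
            boolean match = true;
            for (int j = 0; j < target.length; j++) {
                if (array[i + j] != target[j]) { match = false; break; }
            }
            if (match) {
                indexes[count++] = i;
                i += target.length;  // skip past the matched delimiter
            } else {
                i++;
            }
        }
        return Arrays.copyOf(indexes, count);
    }

    public static void main(String[] args) {
        // For the raw row "1|@|", the delimiter sits at byte offset 1,
        // so the single ID column is the byte range [0, 1).
        int[] idx = findAllIndexes("1|@|".getBytes(), "|@|".getBytes());
        System.out.println(Arrays.toString(idx));
    }
}
```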

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
