RE: READING STRING, CONTAINS \R\N, FROM ORC FILES VIA JDBC DRIVER PRODUCES DIRTY DATA

2017-11-03 Thread Залеский Александр Андреевич
Yes, we are storing the data in ORC files correctly; the problem appears when 
we read it via JDBC. We generate the ORC files with the org.apache.orc library 
and load them into Hive with the LOAD DATA INPATH command. But when we read 
them, JDBC does that awful split.

From: Owen O'Malley [mailto:owen.omal...@gmail.com]
Sent: Thursday, November 02, 2017 6:21 PM
To: user@hive.apache.org
Subject: Re: READING STRING, CONTAINS \R\N, FROM ORC FILES VIA JDBC DRIVER 
PRODUCES DIRTY DATA

ORC stores the data in UTF-8 with the length of the value stored explicitly. 
Therefore, it doesn't do any parsing of newlines.

You can see the contents of an ORC file by using:

% hive --orcfiledump -d 

from https://orc.apache.org/docs/hive-ddl.html . How did you load the data into 
Hive?

... Owen

On Thu, Nov 2, 2017 at 5:29 AM, Залеский Александр Андреевич 
<aazal...@mts.ru> wrote:
My problem is reading data that contains a “newline” character from ORC via 
JDBC. The standard behavior when reading a string is to split the row at every 
newline symbol, and that seems like a bug. Why can't I store arbitrary symbols 
in my data? Why does JDBC read them as control symbols? I have opened an issue 
with Teradata (https://tays.teradata.com/home/?language=en_US=RECHDBRVV) and 
they advised me to write my own SerDe. Perhaps this is not a unique task and 
you have already written such a SerDe; could I ask for it?



Re: READING STRING, CONTAINS \R\N, FROM ORC FILES VIA JDBC DRIVER PRODUCES DIRTY DATA

2017-11-02 Thread Gopal Vijayaraghavan

> Why does JDBC read them as control symbols?

Most likely this is already fixed by 

https://issues.apache.org/jira/browse/HIVE-1608

That change effectively makes the following the default:

set hive.query.result.fileformat=SequenceFile;
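
As a rough illustration of why a line-oriented result format would produce
the split rows described in this thread, the Python sketch below models a text
result file in which fields are joined by Ctrl-A (Hive's default text field
delimiter) and each row ends with "\n". The function names are hypothetical
and this is a simplified model, not Hive's actual code. It assumes, as
suggested above, that query results were being staged as text before being
fetched.

```python
FIELD_SEP = "\x01"  # Ctrl-A, Hive's default field delimiter for text data
ROW_SEP = "\n"      # a line-oriented format ends every row with a newline


def write_text_rows(rows):
    """Serialize rows as delimiter-separated lines (hypothetical helper)."""
    return "".join(FIELD_SEP.join(fields) + ROW_SEP for fields in rows)


def read_text_rows(text):
    """Parse rows back by splitting on the row separator (hypothetical helper)."""
    return [line.split(FIELD_SEP) for line in text.split(ROW_SEP) if line]


rows = [["1", "first\r\nsecond"], ["2", "ok"]]
read_back = read_text_rows(write_text_rows(rows))

# The embedded "\n" is indistinguishable from a row terminator,
# so 3 rows come back although only 2 were written.
print(len(rows), len(read_back))
for row in read_back:
    print(row)
```

A record-oriented container such as SequenceFile does not rely on a newline
to mark the end of a row, which is why switching the result file format is
expected to avoid this symptom.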

Cheers,
Gopal 





Re: READING STRING, CONTAINS \R\N, FROM ORC FILES VIA JDBC DRIVER PRODUCES DIRTY DATA

2017-11-02 Thread Owen O'Malley
ORC stores the data in UTF-8 with the length of the value stored
explicitly. Therefore, it doesn't do any parsing of newlines.
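
To illustrate the point about explicit lengths, here is a minimal Python
sketch of length-prefixed storage. It is not ORC's actual on-disk layout, and
the function names are made up for the example; it only shows that a reader
driven by stored lengths never has to interpret \r or \n inside a value.

```python
import struct


def encode_values(values):
    """Serialize strings as [4-byte big-endian length][UTF-8 bytes] records."""
    out = bytearray()
    for value in values:
        data = value.encode("utf-8")
        out += struct.pack(">I", len(data))  # length is stored explicitly
        out += data
    return bytes(out)


def decode_values(blob):
    """Read records back using only the stored lengths, never scanning for delimiters."""
    values = []
    pos = 0
    while pos < len(blob):
        (length,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        values.append(blob[pos:pos + length].decode("utf-8"))
        pos += length
    return values


rows = ["line one\r\nline two", "plain value"]
# The value containing \r\n survives the round trip unchanged.
assert decode_values(encode_values(rows)) == rows
```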

You can see the contents of an ORC file by using:

% hive --orcfiledump -d 

from https://orc.apache.org/docs/hive-ddl.html . How did you load the data
into Hive?

... Owen

On Thu, Nov 2, 2017 at 5:29 AM, Залеский Александр Андреевич <
aazal...@mts.ru> wrote:

> My problem is reading data that contains a “newline” character from ORC via
> JDBC. The standard behavior when reading a string is to split the row at
> every newline symbol, and that seems like a bug. Why can't I store arbitrary
> symbols in my data? Why does JDBC read them as control symbols? I have
> opened an issue with Teradata (
> https://tays.teradata.com/home/?language=en_US=RECHDBRVV)
> and they advised me to write my own SerDe. Perhaps this is not a unique
> task and you have already written such a SerDe; could I ask for it?
>