Re: READING STRING, CONTAINS \R\N, FROM ORC FILES VIA JDBC DRIVER PRODUCES DIRTY DATA

2017-11-02 Thread Gopal Vijayaraghavan

> Why does JDBC read them as control symbols?

Most likely this is already fixed by 

https://issues.apache.org/jira/browse/HIVE-1608

That pretty much makes the default:

set hive.query.result.fileformat=SequenceFile;
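
To check what a given build defaults to, the value can be printed from the session; a minimal sketch:

-- prints the current value; older builds can set it explicitly as above
set hive.query.result.fileformat;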

Cheers,
Gopal 





Re: READING STRING, CONTAINS \R\N, FROM ORC FILES VIA JDBC DRIVER PRODUCES DIRTY DATA

2017-11-02 Thread Owen O'Malley
ORC stores the data in UTF-8 with the length of the value stored
explicitly. Therefore, it doesn't do any parsing of newlines.
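
A quick way to see that is a round trip entirely inside Hive; a minimal sketch, with the table name and value purely illustrative:

-- create an ORC table and store a value with an embedded \r\n
CREATE TABLE newline_test (s STRING) STORED AS ORC;
INSERT INTO newline_test VALUES ('line1\r\nline2');
-- one row comes back with the control characters intact in the stored bytes;
-- any row splitting happens later, when results are serialized as text
SELECT s, length(s) FROM newline_test;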

You can see the contents of an ORC file by using:

% hive --orcfiledump -d <path_to_orc_file>

from https://orc.apache.org/docs/hive-ddl.html. How did you load the data
into Hive?

... Owen

On Thu, Nov 2, 2017 at 5:29 AM, Залеский Александр Андреевич <
aazal...@mts.ru> wrote:

> My problem is reading data containing “newline” characters from ORC via
> JDBC. The standard behavior when reading a string is to split the row at
> every newline symbol, and that seems like a bug. Why can't I store arbitrary
> symbols in my data? Why does JDBC read them as control symbols? I have
> created an issue with Teradata (
> https://tays.teradata.com/home/?language=en_US=RECHDBRVV)
> and they advised me to write my own SerDe. This is probably not a unique
> task, and such a SerDe may already exist; could I ask for it?
>


READING STRING, CONTAINS \R\N, FROM ORC FILES VIA JDBC DRIVER PRODUCES DIRTY DATA

2017-11-02 Thread Залеский Александр Андреевич
My problem is reading data containing "newline" characters from ORC via JDBC.
The standard behavior when reading a string is to split the row at every
newline symbol, and that seems like a bug. Why can't I store arbitrary symbols
in my data? Why does JDBC read them as control symbols? I have created an
issue with Teradata
(https://tays.teradata.com/home/?language=en_US=RECHDBRVV) and they
advised me to write my own SerDe. This is probably not a unique task, and such
a SerDe may already exist; could I ask for it?


Hive LIMIT clause slows query

2017-11-02 Thread Igor Kuzmenko
I'm using HDP 2.5.0 with Hive 1.2.1.
While running some tests I noticed that my query performs better if I don't
use a limit clause.

My query is:

insert into table results_table partition (task_id=xxx)
select * from data_table
where dt=20171102
and ...
limit 100


This query runs in about 30 seconds, but without the limit clause it finishes
in about 20 seconds.

Query execution plan with limit <https://pastebin.com/Cmp2rPNr>, and without
<https://pastebin.com/z1ps2EhG>.


I can't remove the limit clause because in some cases there are too many
results and I don't want to store them all in the result table.
Why does limit affect performance so much? Intuitively, it seems that a query
with a limit clause should run faster. What can I do to improve performance?
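
As an aside, something that may be worth testing: Hive has input-sampling
settings for simple LIMIT queries; a hedged sketch, not verified against this
plan (with selective filters, sampling can return fewer rows than the limit):

set hive.limit.optimize.enable=true;   -- sample input files for LIMIT instead of scanning all of them
set hive.limit.row.max.size=100000;    -- per-row size the sampler assumes (this is the default)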


Retry the failed stage.

2017-11-02 Thread Piyush Mukati
Hi,
I am working with Hive running on MapReduce.
Our query creates multiple stages of MR jobs. If any of the stages fails
due to an intermittent issue, we have to retry the full query.
Is there any config in Hive so that only the failed stage is retried,
rather than failing the full query?
Thanks.
-piyush
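
As an aside, one thing that absorbs many intermittent failures without a full
re-run: MapReduce already retries failed task attempts within a stage, and the
attempt counts are tunable from the Hive session; a minimal sketch using
standard MR properties (values illustrative):

set mapreduce.map.maxattempts=8;      -- default is 4; retries apply per task, within a stage
set mapreduce.reduce.maxattempts=8;
-- note: this does not re-run a whole failed stage; a stage that exhausts its
-- task attempts still fails the query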