Is the underlying table partitioned, i.e. does the query

'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND month=6
AND day=1 AND hour=10)'

prune down to a single partition? Also, what result set size do you expect?
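Whether the predicate hits a single partition can be checked directly from beeline (a sketch; the connection URL, credentials, and `db`.`table` are placeholders mirroring the query above):

```shell
# List the partitions of the table; if (year, month, day, hour) are
# partition columns, the WHERE clause above should prune to one of them.
# Connection details and the table name are placeholders.
beeline -u jdbc:hive2://srv:10000/db -n user -p password \
    -e 'SHOW PARTITIONS `db`.`table`;'
```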

JDBC on its own should work. Is this an ORC table?

What version of Hive are you using?

HTH





Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 21 June 2016 at 07:52, David Nies <david.n...@adition.com> wrote:

> In my test case below, I’m using `beeline` as the Java application
> receiving the JDBC stream. As I understand, this is the reference command
> line interface to Hive. Are you saying that the reference command line
> interface is not efficiently implemented? :)
>
> -David Nies
>
> Am 20.06.2016 um 17:46 schrieb Jörn Franke <jornfra...@gmail.com>:
>
> Aside from this the low network performance could also stem from the Java
> application receiving the JDBC stream (not threaded / not efficiently
> implemented etc). However that being said, do not use jdbc for this.
>
> On 20 Jun 2016, at 17:28, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Hello,
>
> For no database (including traditional ones) is it advisable to fetch this
> amount of data through JDBC. JDBC is not designed for this (neither for
> import nor for export of large data volumes). It is a highly questionable
> approach from a reliability point of view.
>
> Export it as a file to HDFS and fetch it from there, or use Oozie to dump
> the file from HDFS to an SFTP or other server. There are alternatives
> depending on your use case.
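The export-to-HDFS approach can be sketched as follows (hypothetical paths and connection details; `INSERT OVERWRITE DIRECTORY` writes the result set to HDFS, from where plain HDFS tooling streams it from the datanodes rather than funneling everything through HiveServer2):

```shell
# Write the result set to an HDFS directory instead of returning it
# over JDBC (paths, connection details, and the column list are
# placeholders for this sketch).
beeline -u jdbc:hive2://srv:10000/db -n user -p password \
    -e "INSERT OVERWRITE DIRECTORY '/tmp/hive_export'
        SELECT <a lot of columns> FROM \`db\`.\`table\`
        WHERE (year=2016 AND month=6 AND day=1 AND hour=10)"

# Then pull the files to the client with the HDFS shell, which reads
# directly from the datanodes:
hdfs dfs -getmerge /tmp/hive_export /local/path/export.out
```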
>
> Best regards
>
> On 20 Jun 2016, at 16:43, David Nies <david.n...@adition.com> wrote:
>
> Dear Hive mailing list,
>
> in my setup, network throughput from the HiveServer2 to the client seems
> to be the bottleneck and I’m seeking a way do increase throughput. Let me
> elaborate my use case:
>
> I’m using Hive version 1.1.0, which is bundled with Cloudera CDH 5.5.1.
>
> I want to fetch a huge amount of data from our Hive cluster. By huge I
> mean something around 100 million rows. The Hive table I’m querying is an
> external table whose data is stored in .avro. On HDFS, the data I want to
> fetch (i.e. the aforementioned 100 million rows) is about 5GB in size. A
> cleverer filtering strategy (to reduce the amount of data) is no option,
> sadly, since I need all the data.
>
> I was able to reduce the time the MapReduce job takes to an agreeable
> interval fiddling around with
> `mapreduce.input.fileinputformat.split.maxsize`. The part that is taking
> ages comes after MapReduce. I’m observing that the Hadoop namenode that is
> hosting the HiveServer2 is sending data at only around 3 MB/s. Our
> network is capable of much more. Playing around with `fetchSize` did not
> increase throughput.
>
> As I identified network throughput to be the bottleneck, I restricted my
> efforts to trying to increase it. For this, I simply ran the query I’d
> normally run through JDBC (from Clojure/Java) via `beeline` and dumped the
> output to `/dev/null`. My `beeline` query looks something like this:
>
> beeline \
>     -u jdbc:hive2://srv:10000/db \
>     -n user -p password \
>     --outputformat=csv2 \
>     --incremental=true \
>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
> month=6 AND day=1 AND hour=10)' > /dev/null
>
> I already tried playing around with additional `--hiveconf` settings:
>
>     --hiveconf hive.exec.compress.output=true \
>     --hiveconf mapred.output.compression.type=BLOCK \
>     --hiveconf
> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>
> without success.
>
> In all cases, Hive is able only to utilize a tiny fraction of the
> bandwidth that is available. Is there a possibility to increase network
> throughput?
>
> Thank you in advance!
>
> Yours
>
> David Nies
> Entwickler Business Intelligence
>  ADITION technologies AG
>
> Oststraße 55, D-40211 Düsseldorf
> Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau
>
> T +49 211 987400 30
> F +49 211 987400 33
> E david.n...@adition.com
>
> Technischen Support erhalten Sie unter der +49 1805 2348466
> (Festnetzpreis: 14 ct/min; Mobilfunkpreise: maximal 42 ct/min)
>
> Abonnieren Sie uns auf XING oder besuchen Sie uns unter www.adition.com.
>
> Vorstände: Andreas Kleiser, Jörg Klekamp, Dr. Lutz Lowis, Marcus Schlüter
> Aufsichtsratsvorsitzender: Joachim Schneidmadl
> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
> UStIDNr.: DE 218 858 434
>
