To get the partitions, you can execute 'show partitions table_name', then use 
SplitRecord with an AvroReader and a JSON writer to generate one flow file per 
partition. Each flow file can then be read with EvaluateJsonPath to pull the 
partition name into an attribute on the flow file. Finally, use ReplaceText to 
write out the select statement, substituting in the partition attribute.
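
To make the last step concrete, here is a rough sketch in plain Python (not 
NiFi configuration) of the string substitution involved. It assumes 'show 
partitions' returns specs of the form key=value/key=value, that the values 
need no URL-unescaping, and that comparing them as quoted strings is 
acceptable for your partition column types. The function names and the table 
name are hypothetical.

```python
from typing import List


def partition_to_predicate(partition_spec: str) -> str:
    """Turn a spec such as 'dt=2018-06-27/region=us' into a WHERE predicate."""
    clauses = []
    for part in partition_spec.strip().split("/"):
        # Split on the first '=' only, so values containing '=' stay intact.
        key, _, value = part.partition("=")
        # Assumption: backslash-escaping single quotes is enough for these values.
        escaped = value.replace("'", "\\'")
        clauses.append(f"`{key}` = '{escaped}'")
    return " AND ".join(clauses)


def build_selects(table: str, partition_specs: List[str]) -> List[str]:
    """Build one SELECT statement per non-empty partition spec."""
    return [
        f"SELECT * FROM {table} WHERE {partition_to_predicate(spec)}"
        for spec in partition_specs
        if spec.strip()
    ]


if __name__ == "__main__":
    for statement in build_selects("my_table", ["dt=2018-06-26", "dt=2018-06-27"]):
        print(statement)
```

In the flow itself, ReplaceText would do the same substitution using the 
attribute that EvaluateJsonPath populated.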


Thanks

Shawn

________________________________
From: Mohit <[email protected]>
Sent: Wednesday, June 27, 2018 8:40:20 AM
To: [email protected]
Subject: RE: SelectHiveQl gets stuck when query table containning 12 Billion 
rows


Hi,



Yes, I tried fetching around 40 million rows. It took a while, but the query 
did complete. I’ll try the Avro approach.



How do I break the select into multiple parts? Could you briefly explain the 
partition flow to start with?



Thanks,

Mohit



From: Shawn Weeks <[email protected]>
Sent: 27 June 2018 18:51
To: [email protected]
Subject: Re: SelectHiveQl gets stuck when query table containning 12 Billion 
rows



It's probably not stuck doing nothing. Using a JDBC connection to fetch 12 
billion rows is going to be painful no matter what you do. At that size you're 
probably better off having Hive create a temporary table in Avro format and 
then consuming the Avro files from HDFS into NiFi. The largest number of rows 
I've pulled into NiFi via JDBC in a single query is around 10-20 million, and 
that took a long time. You can also try breaking the select into multiple 
parts and running them simultaneously. I've done something similar, where I 
first ran a query to get all of the partitions and then executed a select for 
each partition in parallel.
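
As a rough illustration of the "one select per partition, in parallel" idea, 
here is a minimal Python sketch of the fan-out pattern. It is not NiFi code: 
fetch_one is a hypothetical callable standing in for whatever actually runs 
the per-partition query, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fetch_partitions_in_parallel(
    partitions: List[str],
    fetch_one: Callable[[str], int],
    max_workers: int = 4,
) -> Dict[str, int]:
    """Call fetch_one for every partition concurrently.

    fetch_one is assumed to run the select for a single partition and
    return the number of rows it fetched.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with partitions.
        counts = list(pool.map(fetch_one, partitions))
    return dict(zip(partitions, counts))


if __name__ == "__main__":
    # Stand-in for a real query: pretends each partition yields len(name) rows.
    print(fetch_partitions_in_parallel(["dt=2018-06-26", "dt=2018-06-27"], len))
```

Within NiFi, the comparable knob would presumably be the number of concurrent 
tasks on the processor that executes the per-partition selects.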



Thanks

Shawn

________________________________

From: Mohit <[email protected]>
Sent: Wednesday, June 27, 2018 8:14:25 AM
To: [email protected]
Subject: SelectHiveQl gets stuck when query table containning 12 Billion rows



Hi all,



I’m trying to fetch data from Hive using SelectHiveQL. It works fine for small 
to medium-sized tables, but when I try to fetch data from a large table with 
around 12 billion rows, it gets stuck for hours and appears to do nothing. I 
have set the Max Rows Per Flow File property to 10 million.

We have a 4-node NiFi cluster with 150 GB of RAM per node.

Is there any configuration I should change to make this work?



Regards,

Mohit
