Re: Hive query taking a lot of time just to launch map-reduce jobs

David Morel Tue, 26 Nov 2013 02:54:46 -0800

On 26 Nov 2013, at 7:02, Sreenath wrote:

Hey David,
Thanks for the swift reply. Each id will have exactly one file. Regarding
the volume, on average each file is about 100MB of compressed data, with
the maximum going up to around 200MB of compressed data.

And how will RC files be an advantage here?


I was thinking RCFiles would be an advantage but was confusing their
utility with that of indexes used on non-partitioned data, my bad.
ORCFiles (or indexes) would be an advantage as it would allow you to not
use partitions and regroup your files, thus reducing the overall number
greatly. You could additionally specify a greater block size (say 512MB)
so the number of files to read is divided by 5.
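
As a rough sketch of that "divided by 5" estimate: the snippet below assumes
the figures quoted in this thread (1400 ids per day, one file of about 100MB
per id, a 30-day query) and further assumes that each regrouped file ends up
close to one 512MB block. The function name and parameters are hypothetical,
for illustration only.

```python
import math


def files_after_regrouping(ids_per_day, avg_file_mb, target_file_mb, days):
    """Estimate file counts before and after regrouping each day's data.

    Assumption: regrouped files are filled up to roughly target_file_mb each.
    """
    # Current layout: one file per (date, id) partition.
    files_before = ids_per_day * days
    # Regrouped layout: a day's data packed into files of the target size.
    daily_volume_mb = ids_per_day * avg_file_mb
    files_per_day_after = math.ceil(daily_volume_mb / target_file_mb)
    files_after = files_per_day_after * days
    return files_before, files_after


before, after = files_after_regrouping(
    ids_per_day=1400, avg_file_mb=100, target_file_mb=512, days=30
)
print(before, after, round(before / after, 1))
```

With these inputs the ratio comes out at a little over 5, which is where the
rough factor above comes from; real numbers depend on compression and on how
the files are actually written.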

I guess the real issue is having a Hive instance communicating with
remote storage over a large number of files, as the metastore only keeps
track of the directories, not the files. As a result, assembling all
your file paths, which are needed for query execution, takes a long time
and a large number of I/O operations on the data store, which happens to
be remote and possibly slow to poll.
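
To put a number on that guess, here is a back-of-the-envelope sketch. It
assumes the whole 1 hr 52 min launch delay reported in this thread is spent
resolving paths one after another, and that there is exactly one file per
(date, id) partition over 30 days. Both are simplifying assumptions, and the
function name is hypothetical.

```python
def seconds_per_path(total_seconds, days, ids_per_day, files_per_id=1):
    """Average time spent per file path if the delay is all path resolution."""
    paths = days * ids_per_day * files_per_id
    return paths, total_seconds / paths


# Reported delay before the map-reduce job launches: 1 hr 52 min.
launch_delay_s = (1 * 60 + 52) * 60

paths, per_path = seconds_per_path(launch_delay_s, days=30, ids_per_day=1400)
print(paths, round(per_path, 2))
```

Under those assumptions the average works out to a fraction of a second per
path, which is at least plausible for sequential requests against a remote
store, though it proves nothing about the actual cause.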

This is only a wild guess (and I could be completely wrong in my
understanding of Hive altogether), and there might be a bug and/or
something to optimize; the error you're seeing is maybe the key to the
issue, but then it is for more knowledgeable people than me to comment on.


Sorry (and good luck)

David

On Mon, Nov 25, 2013 at 5:50 PM, David Morel <dmore...@gmail.com> wrote:
On 25 Nov 2013, at 11:50, Sreenath wrote:

hi all,
We are using Hive for ad-hoc querying and have a Hive table which is
partitioned on two fields (date, id). For each date there are around
1400 ids, so on a single day around that many partitions are added. The
actual data resides in S3. The issue we are facing is this: suppose we
do a select count(*) for a month from the table; it then takes quite a
long time (approx. 1 hr 52 min) just to launch the map-reduce job. When
I ran the query in Hive verbose mode I could see that it is spending
this time deciding how many mappers to spawn (calculating splits). Is
there any means by which I can reduce this lag time for the launch of
the map-reduce job?
This is one of the log messages that is being logged during this lag time:
13/11/19 07:11:06 INFO mapred.FileInputFormat: Total input paths to process : 1
13/11/19 07:11:06 WARN httpclient.RestS3Service: Response '/Analyze%2F2013%2F10%2F03%2F465' - Unexpected response code 404, expected 200
Does anyone have a quick fix for this?
So we're talking about 30 days x 1400 ids x number of files per ID
(usually more than 1).

This is at least 42,000 file paths, and (regardless of the error you
posted) Hive won't perform well on this many files when making the
query. It is IMHO a typical case of over-partitioning. I'd use RCFile
and keep IDs unpartitioned.
What volume of data are we talking about here? What's the volume of the
biggest ID for a day, and the average?

David
