RE: Accessing Table Properties from InputFormat

Peter Marron Wed, 29 May 2013 02:17:52 -0700

Hi,

I am a newbie and I don't want to break any layered abstractions.


I am in the situation where I want to be able to examine
the predicate in the query and if it's a filter that I recognize
then I would like to use it to cut down on the number of records
processed. In particular I would like to make sure that in such a case the
records aren't even read, so they don't need to be filtered. At the moment
I can see that by providing my own InputFormat I can arrange that
the splits that I return are filtered down to the subset that I want.
However this means that the InputFormat needs to know something
about the table to be able to parse the predicate and see if it matches
the filtering criteria. But I learn that the InputFormat doesn't have access
to the table properties. So I have a problem.

OK, the serde has access to the table properties but it's in no position
to be able to perform the filtering. By the time it sees a record it's too late.

Similarly by the time the recordReader is invoked the record has been read.

I would use a facility like indexing, but I want this to work when the query
does not perform a Map/Reduce and my understanding is that Hive will
not use an index if there is no Map/Reduce. So indexing is a non-starter.
Also there are cases where creating an index seems massive overkill for
what I am trying to achieve.

So where is the Hive hook that allows me to do what I would like to do?
Which of the layers allows me to examine the table properties and
the predicate and to (pre-)filter the records returned?

Or are you saying that what I am trying to do doesn't make sense?

Z


From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
Sent: 28 May 2013 16:45
To: user@hive.apache.org
Cc: Peter Marron
Subject: Re: Accessing Table Properties from InputFormat

That does not really make sense. You're breaking the layered approach.
InputFormats read/write data, serdes interpret data based on the table
definition. It's like asking "Why can't my input format run assembly code?"

On Tue, May 28, 2013 at 11:42 AM, Owen O'Malley <omal...@apache.org> wrote:


On Tue, May 28, 2013 at 7:59 AM, Peter Marron
<peter.mar...@trilliumsoftware.com> wrote:
Hi,

Hive 0.10.0 over Hadoop 1.0.4.

Further to my earlier filtering questions:
I would like to be able to access the table properties from inside my custom
InputFormat. I've done searches and there seem to be some other people who
have had a similar problem.
The closest I can see to a solution is to use
                MapredWork mrwork = Utilities.getMapRedWork(configuration);
but this fails for me with the error below.
I'm not truly surprised because I am trying to make sure that my query
runs without a map/reduce and some of the e-mails suggest that in this case:

"...no mapred job is
run, so this trick doesn't work (and instead, the Configuration object
can be used, since it's local)."

Any pointers would be very much appreciated.

Yeah, as you discovered, that only works in the MapReduce case and breaks on 
cases like "select count(*)" that don't run in MapReduce.

I haven't tried it, but it looks like the best you can do with the current 
interface is to implement a SerDe which is passed the table properties in 
initialize. In terms of passing it to the InputFormat, I'd try a thread local 
variable. It looks like the getRecordReader is called soon after the 
serde.initialize although I didn't do a very deep search of the code.

-- Owen
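
A minimal sketch of the thread-local idea suggested above, in plain Java with
no Hive dependencies. The class name TablePropertiesHolder and its methods are
hypothetical, not part of Hive. It assumes that serde.initialize and
getRecordReader run on the same thread and in that order, which the message
above says was not checked in depth.

```java
import java.util.Properties;

/**
 * Hypothetical helper (not part of Hive): hands the table properties seen by
 * a SerDe's initialize() to an InputFormat running later on the same thread.
 */
public final class TablePropertiesHolder {

    // One slot per thread; other threads never see this thread's value.
    private static final ThreadLocal<Properties> CURRENT =
            new ThreadLocal<Properties>();

    private TablePropertiesHolder() {
        // Static helper, not meant to be instantiated.
    }

    /** Intended to be called from the SerDe's initialize(conf, tbl). */
    public static void set(Properties tableProperties) {
        // Copy the directly-set entries so later changes by the caller do
        // not leak in (entries inherited from defaults are not copied).
        Properties copy = new Properties();
        copy.putAll(tableProperties);
        CURRENT.set(copy);
    }

    /**
     * Intended to be called from the InputFormat. Returns null if nothing
     * was stored on this thread, so callers need a fallback (no filtering).
     */
    public static Properties get() {
        return CURRENT.get();
    }

    /** Clears the slot so a reused thread does not see stale properties. */
    public static void clear() {
        CURRENT.remove();
    }
}
```

In this sketch the custom SerDe would call TablePropertiesHolder.set(tbl) at
the end of initialize, and the InputFormat would call
TablePropertiesHolder.get() in getRecordReader (or getSplits), treating a null
result as "no table properties available, do not filter".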
