Hi, I am a newbie and I don't want to break any layered abstractions.
I am in the situation where I want to examine the predicate in the query and, if it's a filter that I recognize, use it to cut down on the number of records processed. In particular I would like to make sure that in such a case the records aren't even read, so they don't need to be filtered.

At the moment I can see that by providing my own InputFormat I can arrange that the splits that I return are filtered down to the subset that I want. However this means that the InputFormat needs to know something about the table to be able to parse the predicate and see if it matches the filtering criteria. But I learn that the InputFormat doesn't have access to the table properties. So I have a problem.

OK, the serde has access to the table properties but it's in no position to perform the filtering. By the time it sees a record it's too late. Similarly, by the time the recordReader is invoked the record has been read.

I would use a facility like indexing, but I want this to work when the query does not perform a Map/Reduce, and my understanding is that Hive will not invoke an index if there is no Map/Reduce. So indexing is a non-starter. Also there are cases where creating an index seems massive overkill for what I am trying to achieve.

So where is the Hive hook that allows me to do what I would like to do? Which of the layers allows me to examine the table properties and the predicate and to (pre-)filter the records returned? Or are you saying that what I am trying to do doesn't make sense?

Z

From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
Sent: 28 May 2013 16:45
To: user@hive.apache.org
Cc: Peter Marron
Subject: Re: Accessing Table Properies from InputFormat

That does not really make sense. You're breaking the layered approach. InputFormats read/write data; serdes interpret data based on the table definition. It's like asking "Why can't my input format run assembly code?"
On Tue, May 28, 2013 at 11:42 AM, Owen O'Malley <omal...@apache.org> wrote:

On Tue, May 28, 2013 at 7:59 AM, Peter Marron <peter.mar...@trilliumsoftware.com> wrote:

Hi,

Hive 0.10.0 over Hadoop 1.0.4.

Further to my earlier filtering questions: I would like to be able to access the table properties from inside my custom InputFormat. I've done searches and there seem to be some other people who have had a similar problem. The closest I can see to a solution is to use

MapredWork mrwork = Utilities.getMapRedWork(configuration);

but this fails for me with the error below. I'm not truly surprised, because I am trying to make sure that my query runs without a map/reduce, and some of the e-mails suggest that in this case: "...no mapred job is run, so this trick doesn't work (and instead, the Configuration object can be used, since it's local)."

Any pointers would be very much appreciated.

Yeah, as you discovered, that only works in the MapReduce case and breaks on cases like "select count(*)" that don't run in MapReduce. I haven't tried it, but it looks like the best you can do with the current interface is to implement a SerDe, which is passed the table properties in initialize. In terms of passing it to the InputFormat, I'd try a thread-local variable. It looks like getRecordReader is called soon after serde.initialize, although I didn't do a very deep search of the code.

-- Owen
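
For anyone following the thread, the thread-local suggestion above might look roughly like the sketch below. It is plain Java with no Hive dependency and has not been exercised against Hive: `TablePropertiesHolder`, `ThreadLocalHandoffDemo` and the property name `my.filter.column` are hypothetical names used only for illustration. It also assumes that serde.initialize and getRecordReader run on the same thread, which the thread above suggests but does not confirm.

```java
import java.util.Properties;

// Hypothetical demo driver. In Hive, step 1 would sit in the custom SerDe's
// initialize method and step 2 in the custom InputFormat's getRecordReader.
class ThreadLocalHandoffDemo {
    public static void main(String[] args) {
        Properties tbl = new Properties();
        tbl.setProperty("my.filter.column", "region"); // hypothetical property

        // Step 1: the SerDe stashes the table properties it was given.
        TablePropertiesHolder.set(tbl);

        // Step 2: the InputFormat picks them up later on the same thread.
        Properties seen = TablePropertiesHolder.get();
        System.out.println(seen == null
                ? "no table properties on this thread"
                : seen.getProperty("my.filter.column"));

        // Clear the slot so a reused thread does not see stale properties.
        TablePropertiesHolder.clear();
    }
}

// Hypothetical holder: one slot per thread for the current table's properties.
final class TablePropertiesHolder {
    private static final ThreadLocal<Properties> CURRENT = new ThreadLocal<>();

    private TablePropertiesHolder() {}

    // Stores a copy so later changes by the caller do not leak in.
    // Note: putAll copies explicit entries only, not Properties defaults.
    static void set(Properties tbl) {
        Properties copy = new Properties();
        copy.putAll(tbl);
        CURRENT.set(copy);
    }

    // Returns null when nothing was stored on the calling thread, so the
    // reader must be prepared to fall back to an unfiltered scan.
    static Properties get() {
        return CURRENT.get();
    }

    static void clear() {
        CURRENT.remove();
    }
}
```

The null return is deliberate: if the two calls ever end up on different threads, or the order of calls differs from what is assumed here, the InputFormat simply sees no properties and can skip the pre-filtering rather than fail.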