Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-03 Thread Alex Boisvert
Hi John,

Just to clarify where I was going with my line of questioning: there's no
Apache policy that prevents dependencies on incubator projects, whether on
releases, snapshots or even home-made, hacked-together packaging of an
incubator project.  It's been done before, and as long as the incubator
code's IP has been cleared and the packaging isn't represented as an
official release when it isn't one, there's nothing wrong with doing that.

Now, whether the project chooses to use and release with an incubator
dependency is a matter of judgment (and ultimately a vote by committers if
there is no consensus).  I just wanted to make sure no incorrect
assumptions were being made.

alex


On Thu, Feb 3, 2011 at 4:07 PM, John Sichi jsi...@fb.com wrote:

> I was going off of what I read in HADOOP-3676 (which lacks a reference as
> well).  But I guess if a release can be made from the incubator, then it's
> not a blocker.
>
> JVS
>
> On Feb 3, 2011, at 3:29 PM, Alex Boisvert wrote:
>
> > On Thu, Feb 3, 2011 at 11:38 AM, John Sichi jsi...@fb.com wrote:
> > > Besides the fact that the refactoring required is significant, I don't
> > > think this is possible to do quickly since:
> > >
> > > 1) Hive (unlike Pig) requires a metastore
> > >
> > > 2) Hive releases can't depend on an incubator project
> >
> > I'm not sure what you mean by can't depend on an incubator project
> > here.  AFAIK, there is no policy at Apache that projects should not depend
> > on incubator projects.  Can you clarify what you mean and why you think such
> > a restriction exists?
> >
> > alex




Re: Help with last 30 day unique user query

2010-10-15 Thread Alex Boisvert
As far as I know, Hive has no built-in support for sliding-window analytics.
There is an enhancement request here:
https://issues.apache.org/jira/browse/HIVE-896

Without such support, the brute-force way of doing things is:

SELECT COUNT(DISTINCT user_id) FROM events WHERE event_date > start_date
AND event_date <= end_date;

(repeated N times to cover each day of your time window).
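
For illustration, the per-day queries could be generated with a small
script. This is only a sketch: the events table and the user_id and
event_date columns come from the query above, while the function name, the
30-day default, the strict lower / inclusive upper bound, and the ISO
'YYYY-MM-DD' string format for event_date are assumptions.

```python
from datetime import date, timedelta


def window_queries(first_day, last_day, window_days=30):
    """Yield (day, query) pairs, one brute-force query per day.

    Assumes event_date is stored as an ISO 'YYYY-MM-DD' string, so that
    string comparison orders dates correctly. The function and parameter
    names are hypothetical; only events, user_id and event_date come
    from the thread.
    """
    day = first_day
    while day <= last_day:
        # The window covers the window_days days ending on `day`:
        # start is exclusive, day is inclusive.
        start = day - timedelta(days=window_days)
        query = (
            "SELECT COUNT(DISTINCT user_id) FROM events "
            f"WHERE event_date > '{start.isoformat()}' "
            f"AND event_date <= '{day.isoformat()}'"
        )
        yield day, query
        day += timedelta(days=1)


if __name__ == "__main__":
    # Print one query per day for a three-day reporting range.
    for day, query in window_queries(date(2010, 10, 1), date(2010, 10, 3)):
        print(day, query)
```

Each generated statement would then be run separately, so the table is
scanned once per day in the range; that cost is what makes this the brute
force option.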

alex

On Thu, Oct 14, 2010 at 11:36 PM, Vijay tec...@gmail.com wrote:

> Hi, I need help with this scenario. We have a table of events which has
> columns date, event (not important for this discussion), and user_id. It is
> obviously easy to find the number of unique users for each day. I also need
> to find the number of unique users in the last 30 days for each day. This is
> also quite simple to do for one day. However, I cannot figure out how to do
> this for a range of days. Something like this is pretty straightforward in
> most RDBMSs, but with HiveQL I'm finding this hard. I might be missing
> something simple though. Any help is appreciated. Ideally the query should
> also be as optimized as possible, as this table could be huge.
>
> Thanks,
> Vijay




UDAF modes

2010-10-15 Thread Alex Boisvert
Hi,

I'm writing a UDAF and I'm a little unclear about the PARTIAL1, PARTIAL2,
FINAL and COMPLETE modes.

I've read the extent of the Javadoc ;) and looked at some of the built-in
UDAFs in the Hive source tree and I'm still unclear about the properties of
the input data in each aggregation step.

Could anybody elaborate a little on the input data in each mode?  Say, what
are the safe assumptions for each mode given, e.g., a CLUSTERED BY clause?

thanks!
alex