https://bugzilla.wikimedia.org/show_bug.cgi?id=65420

--- Comment #31 from Andrew Otto <[email protected]> ---
Not sure what you mean by old and new partitions.  Do you mean the single table
vs the old 4 tables?

There is a difference, yes, in that you query much more data by default with
the webrequest table.  For example, bits is very large.  If you are pretty sure
you don't want bits data, add a "where webrequest_source != 'bits'" clause to
the query.  That will cut the data size down a lot.
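For example, from the Hive CLI that filter could look like this (just a sketch; the table and column names are the ones from this comment, and the SELECT itself is only illustrative):

```shell
# Sketch: exclude the very large 'bits' source from a webrequest query.
# Only the WHERE clause comes from the comment above; the COUNT is illustrative.
hive -e "
  SELECT COUNT(*)
  FROM webrequest
  WHERE webrequest_source != 'bits';
"
```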

I'm googling for ways to make these large queries run and am learning things,
but I'm not sure yet.  I'm also looking for errors in the logs to find out why
they died.

See also:
http://mail-archives.apache.org/mod_mbox/hive-user/201212.mbox/%[email protected]%3E

Also, since we were talking about HADOOP_HEAPSIZE and Hive CLI earlier, this is
the documentation on HADOOP_HEAPSIZE for Hive CLI:

  # Larger heap size may be required when running queries over large number of
  # files or partitions.
  # By default hive shell scripts use a heap size of 256 (MB).  Larger heap
  # size would also be appropriate for hive server (hwi etc).
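So something like this should raise the CLI heap before a big query (a sketch; 2048 is an arbitrary illustrative value, and it assumes the stock Hive wrapper scripts, which read HADOOP_HEAPSIZE in MB):

```shell
# Sketch: raise the Hive CLI client heap from the 256 MB default before
# running a large query.  2048 MB is an arbitrary illustrative value.
export HADOOP_HEAPSIZE=2048
hive -e "SELECT COUNT(*) FROM webrequest WHERE webrequest_source != 'bits';"
```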


So it seems the Hive CLI itself needs a larger heap size when running over
larger datasets, as we were assuming.  I'm still not sure why that would be.  I
suppose it examines the partition and file metadata client-side before
submitting the job to Hadoop?

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
