Re: save several 64MB files in Pig Latin

2013-06-10 Thread Bertrand Dechoux
I wasn't clear. Specifying the size of the files is not your real aim, I guess. But you think that's what is needed in order to solve your problem that we don't know about. 500MB is not a really big file in itself and is not an issue for HDFS and MapReduce. There is no absolute way to know how

GROUP BY Issue

2013-06-10 Thread Gourav Sengupta
Hi, On running the following query I am getting multiple records with the same value of F1: SELECT F1, COUNT(*) FROM ( SELECT F1, F2, COUNT(*) FROM TABLE1 GROUP BY F1, F2 ) a GROUP BY F1; As I understand it, the number of such records depends on the number of reducers. Replicating the test
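For reference, a minimal Python sketch of what the nested query above would normally be expected to return: one row per F1, with the count equal to the number of distinct (F1, F2) pairs. The sample rows are hypothetical, and the sketch only illustrates the intended semantics; it does not diagnose the duplicate rows reported in the message.

```python
from collections import Counter

# Hypothetical (F1, F2) rows standing in for TABLE1.
rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3)]

# Inner query: GROUP BY F1, F2 yields one row per distinct (F1, F2) pair.
inner = Counter(rows)

# Outer query: GROUP BY F1 over the inner rows counts the pairs per F1.
outer = Counter(f1 for (f1, _f2) in inner)

# One entry per F1 value, regardless of how the work is partitioned.
result = sorted(outer.items())
print(result)
```

With these rows, each F1 value appears exactly once in the result, so repeated F1 values in the real output would point to something other than the query's logical semantics.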

Re: GROUP BY Issue

2013-06-10 Thread Gourav Sengupta
Hi Shahab, It will be great if someone can delete this email from PIG group. I am aware of this mistake and had posted this issue to HIVE group almost immediately. Regards, Gourav On Mon, Jun 10, 2013 at 5:28 PM, Shahab Yunus shahab.yu...@gmail.com wrote: Gourav, this is not a HIVE mailing

Re: problems with .gz

2013-06-10 Thread Alan Crosswell
Ignore what I said and see https://forums.aws.amazon.com/thread.jspa?threadID=51232 bzip2 was documented somewhere as being splittable but this appears to not actually be implemented at least in AWS S3. /a On Mon, Jun 10, 2013 at 12:41 PM, Alan Crosswell a...@crosswell.us wrote: Suggest that

Re: problems with .gz

2013-06-10 Thread Niels Basjes
Bzip2 is only splittable in newer versions of hadoop. On Jun 10, 2013 10:28 PM, Alan Crosswell a...@crosswell.us wrote: Ignore what I said and see https://forums.aws.amazon.com/thread.jspa?threadID=51232 bzip2 was documented somewhere as being splittable but this appears to not actually be

Loading data from ranges of ordered subdirs

2013-06-10 Thread Rodrick Megraw
Let's say I have my input data from the past 12 months organized into subdirs by date: /data/2012-06-10 /data/2012-06-11 ... /data/2013-06-09 And now say that I want to run a Pig script to process data from a range of dates within the last 12 months, say 2012-11-07 through 2013-05-26. The
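One possible way to approach the date-range question above, sketched in Python: generate the list of daily subdirectories outside Pig and hand it to the script as a parameter. This is not taken from the thread. The function names `date_range_paths` and `as_glob` are hypothetical, and the sketch assumes that the loader in use accepts Hadoop-style `{a,b,c}` globs in the LOAD path, which should be checked against the Pig version at hand.

```python
from datetime import date, timedelta


def date_range_paths(base, start, end):
    """Return one path per day from start to end, inclusive (hypothetical helper)."""
    if end < start:
        raise ValueError("end date precedes start date")
    days = (end - start).days
    return [base + "/" + (start + timedelta(days=i)).isoformat()
            for i in range(days + 1)]


def as_glob(base, start, end):
    """Collapse the daily paths into a single {a,b,...} glob (hypothetical helper)."""
    names = [p.rsplit("/", 1)[1] for p in date_range_paths(base, start, end)]
    return base + "/{" + ",".join(names) + "}"


# The range from the question: 2012-11-07 through 2013-05-26.
paths = date_range_paths("/data", date(2012, 11, 7), date(2013, 5, 26))
glob = as_glob("/data", date(2012, 11, 7), date(2013, 5, 26))
print(len(paths), paths[0], paths[-1])
```

The resulting string could then be passed with `pig -param INPUT=...` and referenced as `'$INPUT'` in the LOAD statement. Directories missing from the range would need to be filtered out first if the loader fails on nonexistent paths.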

running pig from eclipse on hadoop cluster

2013-06-10 Thread Weiping Qu
Hi, I am currently running pig from eclipse on hadoop cluster. I added the hadoop conf location to the runtime configuration. But the mapreduce jobs failed as the built class files of pig cannot be called by hadoop. I added class file location to the classpath, but it did not work. Any hints?

Re: Loading data from ranges of ordered subdirs

2013-06-10 Thread Pradeep Gollakota
There are two possibilities that come to mind. 1. Write a custom LoadFunc in which you can handle these regular expressions. *Not the most ideal solution* 2. Use HCatalog. The example they have in their documentation seems to fit your use case perfectly.

Re: running pig from eclipse on hadoop cluster

2013-06-10 Thread Weiping Qu
Hi, Forget the question raised before. It's solved. Hi, I am currently running pig from eclipse on hadoop cluster. I added the hadoop conf location to the runtime configuration. But the mapreduce jobs failed as the built class files of pig cannot be called by hadoop. I added class file

RE: Loading data from ranges of ordered subdirs

2013-06-10 Thread Rodrick Megraw
Thank you for the suggestions. Writing a custom LoadFunc seems like a valid solution for me, given that I don't currently have Hive or HCatalog installed and I'm working on more of an ad-hoc problem at this point. HCatalog seems like a good solution for doing this type of thing on a repeated