Hello all,

I am using Pig's local mode to drive some tests of a Pig script.  Everything 
works fine when I run the tests one after the other, but I'm trying to speed 
things up by running all of the tests in parallel.  Unfortunately, it appears 
that Pig uses the same naming scheme for job-specific files in every run, so 
the files related to one job are often overwritten and/or deleted by another 
job running at the same time.

As a concrete example, I'm running Pig scripts with commands of the following 
general form:

java -cp pig-0.9.1.jar org.apache.pig.Main -x local events_by_day_reporting.pig

And I'm seeing errors that look like:

File already exists: file:/tmp/temp-1092987147/tmp-627659504/_temporary/_attempt_local_0001_r_000000_0/part-r-00000

I was also seeing errors with files stored under 
"/tmp/hadoop-<my_username>/mapred/system".  I didn't copy those errors when I 
had the chance, but they all seemed to be related to "checksum errors" when 
reading from the job.xml file.

My question for the user group is: what configuration parameters (if any) could 
I pass on the command line so that all of Pig's temporary files are created in 
distinct directories?  For example, I tried adding "-Dmapred.system.dir" (e.g. 
java -Dmapred.system.dir=dir1 -cp pig-0.9.1.jar org.apache.pig.Main -x local 
events_by_day_reporting.pig) to my command, but I haven't had much luck with 
that approach so far.
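
For what it's worth, here is the sort of thing I've been experimenting with. 
The property names are my best guesses at the relevant knobs (pig.temp.dir 
controls where Pig writes its intermediate data, and hadoop.tmp.dir is the 
base for Hadoop's scratch files, which defaults to /tmp/hadoop-<username>); 
the "run1" suffix stands in for a unique per-test value I would generate 
myself:

java -Dpig.temp.dir=/tmp/pig-run1 -Dhadoop.tmp.dir=/tmp/hadoop-run1 \
     -cp pig-0.9.1.jar org.apache.pig.Main -x local events_by_day_reporting.pig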

I am planning to try PigUnit soon, but any quick insights/advice/help would be 
greatly appreciated.
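
In case it helps frame the question, here is roughly the kind of PigUnit test 
I have in mind (a minimal sketch; the output alias and the expected tuples 
below are made up for illustration):

import org.apache.pig.pigunit.PigTest;
import org.junit.Test;

public class EventsByDayReportingTest {

    @Test
    public void testEventsByDay() throws Exception {
        // Run the script through PigUnit, which executes it in local mode.
        PigTest test = new PigTest("events_by_day_reporting.pig");

        // Hypothetical alias and expected tuples, purely for illustration.
        String[] expected = {
            "(2011-10-01,42)",
            "(2011-10-02,17)",
        };
        test.assertOutput("events_by_day", expected);
    }
}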

Thanks,
Matt Martin
Think Big Analytics
[email protected]


