Hello all,

I am using Pig's local mode to drive some tests of a Pig script. Everything works fine when I run the tests one after the other, but I'm trying to speed things up by running all of the tests in parallel. Unfortunately, I get the impression that Pig uses the same naming scheme for job-specific files across invocations, so the temporary files belonging to one job are often overwritten and/or deleted by another job running at the same time.
As a concrete example, I'm running Pig scripts with commands of the following general form:

  java -jar pig-0.9.1.jar -x local events_by_day_reporting.pig

And I'm seeing errors that look like:

  File already exists:file:/tmp/temp-1092987147/tmp-627659504/_temporary/_attempt_local_0001_r_000000_0/part-r-00000

I was also seeing errors with files stored under /tmp/hadoop-<my_username>/mapred/system. I didn't copy those errors when I had the chance, but they all appeared to be checksum errors raised while reading the job.xml file.

My question for the user group is: what configuration parameters (if any) could I pass on the command line so that all of Pig's temporary files are created in distinct directories? For example, I tried adding -Dmapred.system.dir to my command:

  java -Dmapred.system.dir=dir1 -jar pig-0.9.1.jar -x local events_by_day_reporting.pig

but I haven't had much luck with that approach so far.

I am planning to try PigUnit soon, but any quick insights/advice/help would be greatly appreciated.

Thanks,
Matt Martin
Think Big Analytics
[email protected]
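P.S. In case it helps, below is the kind of per-run isolation I've been experimenting with. This is only a sketch: the mktemp-based TESTDIR layout is my own invention, and while pig.temp.dir, hadoop.tmp.dir, mapred.local.dir, and mapred.system.dir are the properties I believe govern these temp locations, I'm not certain that Pig forwards all of them to the Hadoop configuration in local mode.

  # Give each test run its own scratch directory so parallel jobs can't collide
  TESTDIR=$(mktemp -d /tmp/pigtest-XXXXXX)

  # Point the JVM, Pig, and Hadoop temp locations at the per-run directory
  java \
    -Djava.io.tmpdir=$TESTDIR/java \
    -Dpig.temp.dir=$TESTDIR/pig \
    -Dhadoop.tmp.dir=$TESTDIR/hadoop \
    -Dmapred.local.dir=$TESTDIR/mapred-local \
    -Dmapred.system.dir=$TESTDIR/mapred-system \
    -jar pig-0.9.1.jar -x local events_by_day_reporting.pig

The idea is that if each parallel test wraps its invocation this way, no two jobs should ever share a temp path. I'd welcome corrections if these aren't the right properties for local mode.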
