Indeed, I changed generate_data.sh and the jobs now run faster. However, I still get an error on the file input, plus the familiar DataStorage exceptions throughout the run.
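Before the questions below, here is what I have checked so far about the recurring "Failed to create DataStorage" error. This is only a sketch of the commands I ran on the node where I launch the scripts; my working hypothesis (which may be wrong) is a mismatch between the Hadoop classes bundled inside pig.jar and the Hadoop version the cluster is actually running:

    # Hadoop version the client/cluster actually runs
    hadoop version

    # Does the pig.jar I am using bundle its own copy of the Hadoop classes?
    unzip -l pig.jar | grep "org/apache/hadoop" | head

If pig.jar ships its own org.apache.hadoop classes and they differ from the cluster's (CDH3 here), I assume that would also explain why pig-withouthadoop.jar keeps coming up (question 3 below).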
1. Any idea what the issue is with the file input? The generation was successful, so why aren't the files found?

2. I ran the patch from https://issues.apache.org/jira/browse/PIG-2183, but it failed (the commands I am using to re-check it are sketched right after this list):

   > patch -p0 < PIG-2183-1.patch
   patching file build.xml
   Hunk #1 FAILED at 776.
   1 out of 1 hunk FAILED -- saving rejects to file build.xml.rej
   patching file bin/pig
   Hunk #2 FAILED at 126.
   Hunk #3 FAILED at 171.
   Hunk #4 FAILED at 236.

3. Can you give me more details about pig-withouthadoop.jar in this context (why is it necessary to generate it, and how)?
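For question 2, this is how I am re-checking the patch on my side. It is just a sketch, assuming I run it from the Pig source root, that -p0 is the right strip level for this patch, and that patch left .rej files next to the files it tried to modify:

    # Look at the rejected hunks and the context they expected
    cat build.xml.rej

    # Re-run without modifying anything, to confirm the failures are reproducible
    patch -p0 --dry-run < PIG-2183-1.patch

    # If the hunks only miss by a small offset, relaxing the fuzz sometimes helps
    patch -p0 --fuzz=3 < PIG-2183-1.patch

If the expected context simply does not exist in my build.xml and bin/pig, I assume the patch was generated against a different Pig branch than the tree I am patching.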
Error details below (the relevant lines are highlighted):

[kouaknine@dataland_oss:node020062 scripts]$ ./generate_data.sh
Generating pages250m
Using seed 1315763867700
Generate data in hadoop mode.
Generating column config file in hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp1321141821/tmp45618478
Generating mapping file for column s:20:1600000:z:7 into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp1321141821/tmp777110755
processed 18%. processed 37%. processed 56%. processed 75%. processed 93%. processed 99%.
Generating mapping file for column s:10:1800000:z:20 into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp1321141821/tmp2109113430
processed 16%. processed 33%. processed 50%. processed 66%. processed 83%. processed 99%.
Generating mapping file for column d:1:100000:z:5 into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp1321141821/tmp1328108041
processed 99%.
Generating input files into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp1321141821/tmp757503163
Submit hadoop job...
11/09/11 13:58:14 INFO mapred.FileInputFormat: Total input paths to process : 90
11/09/11 13:58:15 INFO mapred.JobClient: Running job: job_201109111357_0001
11/09/11 13:58:16 INFO mapred.JobClient: map 0% reduce 0%
11/09/11 13:58:54 INFO mapred.JobClient: map 1% reduce 0%
11/09/11 13:58:55 INFO mapred.JobClient: map 2% reduce 0%
11/09/11 13:58:56 INFO mapred.JobClient: map 3% reduce 0%
11/09/11 13:58:57 INFO mapred.JobClient: map 10% reduce 0%
11/09/11 13:58:58 INFO mapred.JobClient: map 15% reduce 0%
11/09/11 13:58:59 INFO mapred.JobClient: map 16% reduce 0%
11/09/11 13:59:01 INFO mapred.JobClient: map 22% reduce 0%
11/09/11 14:00:04 INFO mapred.JobClient: map 23% reduce 0%
11/09/11 14:00:05 INFO mapred.JobClient: map 25% reduce 0%
11/09/11 14:00:06 INFO mapred.JobClient: map 26% reduce 0%
11/09/11 14:00:07 INFO mapred.JobClient: map 34% reduce 0%
11/09/11 14:00:08 INFO mapred.JobClient: map 37% reduce 0%
11/09/11 14:00:09 INFO mapred.JobClient: map 41% reduce 0%
11/09/11 14:00:10 INFO mapred.JobClient: map 44% reduce 0%
11/09/11 14:01:10 INFO mapred.JobClient: map 50% reduce 0%
11/09/11 14:01:12 INFO mapred.JobClient: map 52% reduce 0%
11/09/11 14:01:13 INFO mapred.JobClient: map 56% reduce 0%
11/09/11 14:01:14 INFO mapred.JobClient: map 58% reduce 0%
11/09/11 14:01:15 INFO mapred.JobClient: map 61% reduce 0%
11/09/11 14:01:16 INFO mapred.JobClient: map 62% reduce 0%
11/09/11 14:01:18 INFO mapred.JobClient: map 65% reduce 0%
11/09/11 14:01:20 INFO mapred.JobClient: map 66% reduce 0%
11/09/11 14:02:16 INFO mapred.JobClient: map 67% reduce 0%
11/09/11 14:02:18 INFO mapred.JobClient: map 68% reduce 0%
11/09/11 14:02:19 INFO mapred.JobClient: map 72% reduce 0%
11/09/11 14:02:20 INFO mapred.JobClient: map 73% reduce 0%
11/09/11 14:02:21 INFO mapred.JobClient: map 78% reduce 0%
11/09/11 14:02:22 INFO mapred.JobClient: map 83% reduce 0%
11/09/11 14:02:24 INFO mapred.JobClient: map 84% reduce 0%
11/09/11 14:02:25 INFO mapred.JobClient: map 86% reduce 0%
11/09/11 14:02:26 INFO mapred.JobClient: map 87% reduce 0%
11/09/11 14:02:28 INFO mapred.JobClient: map 88% reduce 0%
11/09/11 14:03:23 INFO mapred.JobClient: map 91% reduce 0%
11/09/11 14:03:24 INFO mapred.JobClient: map 92% reduce 0%
11/09/11 14:03:26 INFO mapred.JobClient: map 93% reduce 0%
11/09/11 14:03:27 INFO mapred.JobClient: map 96% reduce 0%
11/09/11 14:03:29 INFO mapred.JobClient: map 97% reduce 0%
11/09/11 14:03:30 INFO mapred.JobClient: map 98% reduce 0%
11/09/11 14:03:31 INFO mapred.JobClient: map 100% reduce 0%
11/09/11 14:04:02 INFO mapred.JobClient: Job complete: job_201109111357_0001
11/09/11 14:04:02 INFO mapred.JobClient: Counters: 14
11/09/11 14:04:02 INFO mapred.JobClient: Job Counters
11/09/11 14:04:02 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=6114352
11/09/11 14:04:02 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/09/11 14:04:02 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/09/11 14:04:02 INFO mapred.JobClient: Launched map tasks=100
11/09/11 14:04:02 INFO mapred.JobClient: Data-local map tasks=100
11/09/11 14:04:02 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
11/09/11 14:04:02 INFO mapred.JobClient: FileSystemCounters
11/09/11 14:04:02 INFO mapred.JobClient: HDFS_BYTES_READ=7140595587
11/09/11 14:04:02 INFO mapred.JobClient: FILE_BYTES_WRITTEN=4632920
11/09/11 14:04:02 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1010430007
11/09/11 14:04:02 INFO mapred.JobClient: Map-Reduce Framework
11/09/11 14:04:02 INFO mapred.JobClient: Map input records=90
11/09/11 14:04:02 INFO mapred.JobClient: Spilled Records=0
11/09/11 14:04:02 INFO mapred.JobClient: Map input bytes=450
11/09/11 14:04:02 INFO mapred.JobClient: Map output records=625000
11/09/11 14:04:02 INFO mapred.JobClient: SPLIT_RAW_BYTES=13227
Job is successful! It took 374 seconds.
Skimming users
11/09/11 14:04:02 INFO pig.Main: Logging error messages to: /mnt/disk1/home/kouaknine/cdh3/pig/test/utils/pigmix/scripts/pig_1315764242612.log
2011-09-11 14:04:03,036 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.239.20.62:54310
2011-09-11 14:04:03,380 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage
Details at logfile: /mnt/disk1/home/kouaknine/cdh3/pig/test/utils/pigmix/scripts/pig_1315764242612.log
Generating users100m
Using seed 1315764244229
Generate data in hadoop mode.
Generating column config file in hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp-1292311506/tmp1030611576
Generating mapping file for column s:10:1600000:z:20 into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp-1292311506/tmp-897220645
processed 18%. processed 37%. processed 56%. processed 75%. processed 93%. processed 99%.
Generating mapping file for column s:20:1600000:z:20 into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp-1292311506/tmp1918108696
processed 18%. processed 37%. processed 56%. processed 75%. processed 93%. processed 99%.
Generating mapping file for column s:10:1600000:z:20 into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp-1292311506/tmp1429175303
processed 18%. processed 37%. processed 56%. processed 75%. processed 93%. processed 99%.
Generating mapping file for column s:2:1600:z:20 into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp-1292311506/tmp-1749462746
processed 99%.
Submit hadoop job...
11/09/11 14:04:27 INFO mapred.JobClient: Cleaning up the staging area hdfs://node020062.boca.local:54310/mnt/disk1/home/kouaknine/cdh3/tmp/mapred/staging/kouaknine/.staging/job_201109111357_0002
*Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://node020062.boca.local:54310/user/kouaknine/users*
*        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:194)*
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:205)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
        at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1242)
        at org.apache.pig.test.utils.datagen.DataGenerator$HadoopRunner.goHadoop(DataGenerator.java:598)
        at org.apache.pig.test.utils.datagen.DataGenerator.go(DataGenerator.java:153)
        at org.apache.pig.test.utils.datagen.DataGenerator.main(DataGenerator.java:60)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Skimming power users
11/09/11 14:04:27 INFO pig.Main: Logging error messages to: /mnt/disk1/home/kouaknine/cdh3/pig/test/utils/pigmix/scripts/pig_1315764267489.log
2011-09-11 14:04:27,904 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.239.20.62:54310
*2011-09-11 14:04:28,246 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage*
Details at logfile: /mnt/disk1/home/kouaknine/cdh3/pig/test/utils/pigmix/scripts/pig_1315764267489.log
Generating
Using seed 1315764269071
Generate data in hadoop mode.
Generating column config file in hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp2025262375/tmp-763782967
Generating mapping file for column s:10:1600000:z:20 into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp2025262375/tmp109692083
processed 18%. processed 37%.
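In case it helps with question 1, these are the checks I am running on HDFS next. It is only a sketch: the paths are copied from the trace above, and I am assuming the failing job expects its input under /user/kouaknine/users.

    # Is the users data anywhere under my HDFS home directory?
    hadoop fs -ls /user/kouaknine
    hadoop fs -ls /user/kouaknine/users

    # The temporary directories the generator reported above, in case the
    # intermediate files were written there and never moved
    hadoop fs -ls /user/kouaknine/tmp

From the ordering in the log, the "Skimming users" Pig step dies with the DataStorage error right before the job that needs /user/kouaknine/users, so my guess is that the missing input and the DataStorage exceptions are the same underlying problem.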
On Sun, Sep 11, 2011 at 2:12 PM, Daniel Dai <[email protected]> wrote:

> It takes a while, but did you check the jobtracker UI? You will get the notice "xxxx tuples generated". Or you can try to generate fewer data first, just change "generate_data.sh".
>
> Daniel
>
> On Sun, Sep 11, 2011 at 9:33 AM, Keren Ouaknine <[email protected]> wrote:
>
> > Hello Daniel,
> >
> > > Have you checked mapreduce UI? Most probably it is caused by OOM. If you see that in mapreduce log,
> > You mean the jobtracker's log?
> >
> > I applied the settings you sent me, and they solved that exception, but I am stuck in the middle of the job: it reached 22% within less than a minute, and then there has been no progress for the last 45 minutes...
> >
> > I am running on a 10-node cluster, with 4 GB of memory on each node and the default number of mappers and reducers.
> > I am looking at the web interface of the jobtracker, but nothing looks abnormal.
> >
> > Thanks for your help!
> > Keren
> >
> > Generating mapping file for column d:1:100000:z:5 into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp-1343473685/tmp-467941747
> > processed 99%.
> > Generating input files into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp-1343473685/tmp1142302988
> > Submit hadoop job...
> > 11/09/11 11:50:00 INFO mapred.FileInputFormat: Total input paths to process : 90
> > 11/09/11 11:50:01 INFO mapred.JobClient: Running job: job_201109111147_0001
> > 11/09/11 11:50:02 INFO mapred.JobClient: map 0% reduce 0%
> > 11/09/11 11:50:42 INFO mapred.JobClient: map 1% reduce 0%
> > 11/09/11 11:50:43 INFO mapred.JobClient: map 3% reduce 0%
> > 11/09/11 11:50:44 INFO mapred.JobClient: map 7% reduce 0%
> > 11/09/11 11:50:45 INFO mapred.JobClient: map 13% reduce 0%
> > 11/09/11 11:50:46 INFO mapred.JobClient: map 17% reduce 0%
> > 11/09/11 11:50:47 INFO mapred.JobClient: map 20% reduce 0%
> > 11/09/11 11:50:48 INFO mapred.JobClient: map 21% reduce 0%
> > *11/09/11 11:50:49 INFO mapred.JobClient: map 22% reduce 0%*
> >
> > On Sun, Sep 11, 2011 at 3:07 AM, Daniel Dai <[email protected]> wrote:
> >
> > > Hi, Keren,
> > > Have you checked mapreduce UI? Most probably it is caused by OOM. If you see that in mapreduce log, try to put this entry to mapred-site.xml:
> > > <property>
> > >   <name>mapred.child.java.opts</name>
> > >   <value>-Xmx2048m</value>
> > > </property>
> > >
> > > Also change hadoop-env.sh:
> > > export HADOOP_HEAPSIZE=2000
> > >
> > > I tried 0.20.204 with pig 0.8.1, I didn't finish the run but I didn't see any error for the first 15m (still running the first hadoop job to generate page_view).
> > >
> > > Daniel
> > >
> > > On Sat, Sep 10, 2011 at 10:04 PM, Keren Ouaknine <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I tried several versions to generate data for the pigmix queries:
> > > > *- Hadoop apache 0.20.204 with pig 0.7*
> > > > *==>* java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils...
> > > > *[Full error at the very bottom]*
> > > >
> > > > *- Hadoop apache 0.20.204 with pig 0.9*
> > > > ==> I get an error while patching pigmix2.patch (on build.xml: "Reversed (or previously applied) patch detected!"). I applied the patch up to that error, and when generating the data:
> > > > Exception in thread "main" org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol.. org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch.
> > > >
> > > > *- CDH3 with pig 0.7*
> > > > *==>* java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils...
> > > >
> > > > *- CDH3 with CDH3-pig (which I downloaded from http://nightly.cloudera.com/cdh/3/ )*
> > > > I applied pigmix2.patch, and used pig.jar and pigperf.jar (which I couldn't recompile locally for an internal reason), and got the same error:
> > > > *==>* java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils...
> > > >
> > > > Bottom line, none of these configurations worked.
> > > > Keren
> > > >
> > > > [kouaknine@dataland_oss:node020062 scripts]$ ./generate_data.sh
> > > > Generating pages50m
> > > > Using seed 1315606991650
> > > > Generate data in hadoop mode.
> > > > Generating column config file in hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp1129210882/tmp-1545723655
> > > > Generating mapping file for column s:20:1600000:z:7 into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp1129210882/tmp-163757285
> > > > processed 18%. processed 37%. processed 56%. processed 75%. processed 93%. processed 99%.
> > > > Generating mapping file for column s:10:1800000:z:20 into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp1129210882/tmp-1525412142
> > > > processed 16%. processed 33%. processed 50%. processed 66%. processed 83%. processed 99%.
> > > > Generating mapping file for column d:1:100000:z:5 into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp1129210882/tmp-738880094
> > > > processed 99%.
> > > > Generating input files into hdfs://node020062.boca.local:54310/user/kouaknine/tmp/tmp1129210882/tmp-1696754417
> > > > Submit hadoop job...
> > > > 11/09/09 18:23:38 INFO mapred.FileInputFormat: Total input paths to process :
> > > > 11/09/09 18:23:39 INFO mapred.JobClient: Running job: job_201109091527_0005
> > > > 11/09/09 18:23:40 INFO mapred.JobClient: map 0% reduce 0%
> > > > *11/09/09 18:24:45 INFO mapred.JobClient: Task Id : attempt_201109091527_0005_m_000000_0, Status : FAILED*
> > > > *java.lang.RuntimeException: Error in configuring object*
> > > > *at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)*
> > > > at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java 4)
> > > > at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> > > > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:386)
> > > > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
> > > > at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> > > > at java.security.AccessController.doPrivileged(Native Method)
> > > > at javax.security.auth.Subject.doAs(Subject.java:396)
> > > > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> > > > at org.apache.hadoop.mapred.Child.main(Child.java:262)
> > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > at java.lang.reflect.Method.invoke(Method.java:597)
> > > > at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> > > > ... 9 more
> > > >
> > > > On Fri, Sep 9, 2011 at 1:12 PM, Alan Gates <[email protected]> wrote:
> > > >
> > > > > If you're going to run with Apache Pig 0.8.1 or 0.9, you should use Apache Hadoop 0.20.2. If you want to use CDH, you should stick with their versions of Hadoop and Pig.
> > > > >
> > > > > Alan.
> > > > >
> > > > > On Sep 9, 2011, at 6:59 AM, Keren Ouaknine wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > What is the latest version of pig supporting the pigmix queries? The latest JIRA update mentions pig 0.7 only:
> > > > > > Assuming it's 0.8 or 0.9, can I use hadoop cdh3 or should I switch to apache's version, and which one?
> > > > > >
> > > > > > ==
> > > > > > *1. Download pig 0.7 release
> > > > > > 2. Apply the patch
> > > > > > 3. copy http://www.eli.sdsu.edu/java-SDSU/sdsuLibJKD12.jar to lib
> > > > > > 4. ant jar pigperf
> > > > > > 5. You will use pig.jar, pigperf.jar. Scripts is in test/utils/pigmix/scripts. To generate data, use generate_data.sh. To run PigMix2, use runpigmix-adhoc.pl.*
> > > > > >
> > > > > > Thanks,
> > > > > > Keren
> > > > > >
> > > > > > --
> > > > > > Keren Ouaknine
> > > > > > Cell: +972 54 2565404
> > > > > > Web: www.kereno.com
> > > >
> > > > --
> > > > Keren Ouaknine
> > > > Cell: +972 54 2565404
> > > > Web: www.kereno.com
> >
> > --
> > Keren Ouaknine
> > Cell: +972 54 2565404
> > Web: www.kereno.com

--
Keren Ouaknine
Cell: +972 54 2565404
Web: www.kereno.com
