I'll open a JIRA ticket when I'm back at a real computer. The Hadoop version should be 1.0.3.
Best,
Florian

On 25.06.2013 17:28, "Josh Wills" <[email protected]> wrote:

> Yeah, that's exactly how it works. I'll have some time to look at this
> more closely later this morning. Would you mind opening a JIRA and/or
> letting me know which Hadoop version you're running against?
>
> Thanks!
> Josh
>
>
> On Tue, Jun 25, 2013 at 7:46 AM, Florian Laws <[email protected]> wrote:
>
>> Hi Josh,
>>
>> thanks for the quick reply.
>>
>> With pipeline.done(), there is still no content at the intended output
>> path, and things get even weirder:
>>
>> The log output states
>>
>> 2013-06-25 16:41:12 FileOutputCommitter:173 [INFO] Saved output of
>> task 'attempt_local_0001_m_000000_0' to
>> /tmp/crunch-1483549519/p1/output
>>
>> but the directory /tmp/crunch-1483549519 does not exist. It looks like
>> this directory gets created temporarily during the run, but is removed
>> again when the program finishes: it is kept around with run() and
>> removed with done().
>>
>> Best,
>>
>> Florian
>>
>>
>>
>> On Tue, Jun 25, 2013 at 4:37 PM, Josh Wills <[email protected]> wrote:
>> > Hey Florian,
>> >
>> > At first glance, it seems like a bug to me. I'm curious if the result
>> > is any different if you swap in pipeline.done() for pipeline.run()?
>> >
>> > J
>> >
>> >
>> > On Tue, Jun 25, 2013 at 7:30 AM, Florian Laws <[email protected]>
>> > wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I'm trying to write a simple Crunch job that outputs a sequence file
>> >> consisting of a custom Writable.
>> >>
>> >> The job runs successfully, but the output is not written to the path
>> >> that I specify in To.sequenceFile(); instead it ends up in a Crunch
>> >> working directory.
>> >>
>> >> This happens both when running the job locally and on my 1-node
>> >> Hadoop test cluster, and it happens with both Crunch 0.6.0 and
>> >> 0.7.0-SNAPSHOT as of today (38a97e5).
>> >>
>> >> Code snippet:
>> >> ---
>> >>
>> >> public int run(String[] args) throws IOException {
>> >>   CommandLine cl = parseCommandLine(args);
>> >>   Path output = new Path((String) cl.getValue(OUTPUT_OPTION));
>> >>   int docIdIndex = getColumnIndex(cl, "DocID");
>> >>   int ldaIndex = getColumnIndex(cl, "LDA");
>> >>
>> >>   Pipeline pipeline = new MRPipeline(DbDumpToSeqFile.class);
>> >>   pipeline.setConfiguration(getConf());
>> >>   PCollection<String> lines =
>> >>       pipeline.readTextFile((String) cl.getValue(INPUT_OPTION));
>> >>   PTable<String, NamedQuantizedVecWritable> vectors = lines.parallelDo(
>> >>       new ConvertToSeqFileDoFn(docIdIndex, ldaIndex),
>> >>       tableOf(strings(), writables(NamedQuantizedVecWritable.class)));
>> >>
>> >>   vectors.write(To.sequenceFile(output));
>> >>
>> >>   PipelineResult res = pipeline.run();
>> >>   return res.succeeded() ? 0 : 1;
>> >> }
>> >> ---
>> >>
>> >> Log output from a local run. Note how the intended output path
>> >> "/tmp/foo.seq" is reported in the execution plan, but is not
>> >> actually used.
>> >> ---
>> >>
>> >> 2013-06-25 16:19:44.250 java[10755:1203] Unable to load realm info
>> >> from SCDynamicStore
>> >> 2013-06-25 16:19:44 HadoopUtil:185 [INFO] Deleting /tmp/foo.seq
>> >> 2013-06-25 16:19:44 FileTargetImpl:224 [INFO] Will write output files
>> >> to new path: /tmp/foo.seq
>> >> 2013-06-25 16:19:45 JobClient:741 [WARN] No job jar file set. User
>> >> classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>> >> 2013-06-25 16:19:45 FileInputFormat:237 [INFO] Total input paths to
>> >> process : 1
>> >> 2013-06-25 16:19:45 TrackerDistributedCacheManager:407 [INFO] Creating
>> >> MAP in
>> >> /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1-work--1596891011522800122
>> >> with rwxr-xr-x
>> >> 2013-06-25 16:19:45 TrackerDistributedCacheManager:447 [INFO] Cached
>> >> /tmp/crunch-1128974463/p1/MAP as
>> >> /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
>> >> 2013-06-25 16:19:45 TrackerDistributedCacheManager:470 [INFO] Cached
>> >> /tmp/crunch-1128974463/p1/MAP as
>> >> /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
>> >>
>> >> 2013-06-25 16:19:45 CrunchControlledJob:303 [INFO] Running job
>> >> "com.issuu.mahout.utils.DbDumpToSeqFile:
>> >> Text(/Users/florian/data/docdb.first20.txt)+S0+SeqFile(/tmp/foo.seq)"
>> >>
>> >> 2013-06-25 16:19:45 CrunchControlledJob:304 [INFO] Job status
>> >> available at: http://localhost:8080/
>> >> 2013-06-25 16:19:45 Task:792 [INFO] Task:attempt_local_0001_m_000000_0
>> >> is done. And is in the process of commiting
>> >> 2013-06-25 16:19:45 LocalJobRunner:321 [INFO]
>> >> 2013-06-25 16:19:45 Task:945 [INFO] Task attempt_local_0001_m_000000_0
>> >> is allowed to commit now
>> >>
>> >> 2013-06-25 16:19:45 FileOutputCommitter:173 [INFO] Saved output of
>> >> task 'attempt_local_0001_m_000000_0' to
>> >> /tmp/crunch-1128974463/p1/output
>> >>
>> >> 2013-06-25 16:19:48 LocalJobRunner:321 [INFO]
>> >> 2013-06-25 16:19:48 Task:904 [INFO] Task
>> >> 'attempt_local_0001_m_000000_0' done.
>> >>
>> >> ---
>> >>
>> >> This crude patch makes the output end up in the right place,
>> >> but it breaks a lot of other tests.
>> >> ---
>> >>
>> >> --- a/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
>> >> +++ b/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
>> >> @@ -66,7 +66,7 @@ public class FileTargetImpl implements PathTarget {
>> >>   protected void configureForMapReduce(Job job, Class keyClass, Class
>> >> valueClass,
>> >>       Class outputFormatClass, Path outputPath, String name) {
>> >>     try {
>> >> -      FileOutputFormat.setOutputPath(job, outputPath);
>> >> +      FileOutputFormat.setOutputPath(job, path);
>> >>     } catch (Exception e) {
>> >>       throw new RuntimeException(e);
>> >>     }
>> >>
>> >> ---
>> >>
>> >> Am I doing something wrong, or is this a bug?
>> >>
>> >> Best,
>> >>
>> >> Florian
>> >
>> >
>> >
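
For reference, here is a minimal, self-contained sketch of the
pipeline.run() vs. pipeline.done() distinction discussed above. The class
name RunVsDoneSketch, the plain-text input, and the command-line arguments
are illustrative placeholders, not taken from Florian's job; the
temp-directory behavior noted in the comments is as reported in this thread.

---

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.To;
import org.apache.hadoop.fs.Path;

public class RunVsDoneSketch {
  public static void main(String[] args) {
    // args[0]: input text file, args[1]: sequence file output path
    Pipeline pipeline = new MRPipeline(RunVsDoneSketch.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    lines.write(To.sequenceFile(new Path(args[1])));

    // pipeline.run() executes the plan but leaves Crunch's temporary
    // working directory (e.g. /tmp/crunch-<id>) in place, which is why
    // the misdirected output was still visible after run() above.
    // pipeline.done() runs any remaining jobs and then cleans up that
    // directory, which is why it vanished entirely with done().
    PipelineResult result = pipeline.done();
    System.exit(result.succeeded() ? 0 : 1);
  }
}

---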
