Re: Write to sequence file ignores destination path.

Florian Laws Tue, 25 Jun 2013 12:31:31 -0700

I submitted CRUNCH-227 in JIRA for this now.

Best,


Florian

On Tue, Jun 25, 2013 at 6:12 PM, Florian Laws <[email protected]> wrote:
> I'll open a JIRA ticket when I'm back at a real computer. The Hadoop version
> should be 1.0.3.
>
> Best,
>
> Florian
>
> Am 25.06.2013 17:28 schrieb "Josh Wills" <[email protected]>:
>
>> Yeah, that's exactly how it works. I'll have some time to look at this
>> more closely later this morning. Would you mind opening a JIRA and/or
>> letting me know which Hadoop version you're running against?
>>
>> Thanks!
>> Josh
>>
>>
>> On Tue, Jun 25, 2013 at 7:46 AM, Florian Laws <[email protected]>
>> wrote:
>>>
>>> Hi Josh,
>>>
>>> thanks for the quick reply.
>>>
>>> With pipeline.done(), there is still no content at the intended output
>>> path,
>>> and things get even more wierd:
>>>
>>> The log output states
>>>
>>> 2013-06-25 16:41:12 FileOutputCommitter:173 [INFO] Saved output of
>>> task 'attempt_local_0001_m_000000_0' to
>>> /tmp/crunch-1483549519/p1/output
>>>
>>> but the directory
>>> /tmp/crunch-1483549519
>>>
>>> does not exist. It looks like this directory gets temporarily created
>>> during the run, but gets removed again when the program finishes.
>>> It gets kept around with run() and removed with done().
>>>
>>> Best,
>>>
>>> Florian
>>>
>>>
>>>
>>> On Tue, Jun 25, 2013 at 4:37 PM, Josh Wills <[email protected]> wrote:
>>> > Hey Florian,
>>> >
>>> > At first glance, it seems like a bug to me. I'm curious if the result
>>> > is any
>>> > different if you swap in pipeline.done() for pipeline.run()?
>>> >
>>> > J
>>> >
>>> >
>>> > On Tue, Jun 25, 2013 at 7:30 AM, Florian Laws <[email protected]>
>>> > wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I'm trying to write a simple Crunch job that outputs a sequence file
>>> >> consisting of a custom Writable.
>>> >>
>>> >> The job runs successfully, but the output is not written to the path
>>> >> that I specify in To.sequenceFile(),
>>> >> but instead to a Crunch working directory.
>>> >>
>>> >> This happens when running the job both locally and on my 1-node Hadoop
>>> >> test cluster,
>>> >> and it happens both with Crunch 0.6.0 and 0.7.0-SNAPSHOT as of today
>>> >> (38a97e5).
>>> >>
>>> >> Code snippet:
>>> >> ---
>>> >>
>>> >> public int run(String[] args) throws IOException {
>>> >>   CommandLine cl = parseCommandLine(args);
>>> >>   Path output = new Path((String) cl.getValue(OUTPUT_OPTION));
>>> >>   int docIdIndex = getColumnIndex(cl, "DocID");
>>> >>   int ldaIndex = getColumnIndex(cl, "LDA");
>>> >>
>>> >>   Pipeline pipeline = new MRPipeline(DbDumpToSeqFile.class);
>>> >>   pipeline.setConfiguration(getConf());
>>> >>   PCollection<String> lines = pipeline.readTextFile((String)
>>> >> cl.getValue(INPUT_OPTION));
>>> >>   PTable<String, NamedQuantizedVecWritable> vectors =
>>> >> lines.parallelDo(
>>> >>     new ConvertToSeqFileDoFn(docIdIndex, ldaIndex),
>>> >>     tableOf(strings(), writables(NamedQuantizedVecWritable.class)));
>>> >>
>>> >>   vectors.write(To.sequenceFile(output));
>>> >>
>>> >>   PipelineResult res = pipeline.run();
>>> >>   return res.succeeded() ? 0 : 1;
>>> >> }
>>> >> ---
>>> >>
>>> >> Log output from local run.
>>> >> Note how the intended output path "/tmp/foo.seq" is reported in the
>>> >> execution plan,
>>> >> is not actually used.
>>> >> ---
>>> >>
>>> >> 2013-06-25 16:19:44.250 java[10755:1203] Unable to load realm info
>>> >> from SCDynamicStore
>>> >> 2013-06-25 16:19:44 HadoopUtil:185 [INFO] Deleting /tmp/foo.seq
>>> >> 2013-06-25 16:19:44 FileTargetImpl:224 [INFO] Will write output files
>>> >> to new path: /tmp/foo.seq
>>> >> 2013-06-25 16:19:45 JobClient:741 [WARN] No job jar file set.  User
>>> >> classes may not be found. See JobConf(Class) or
>>> >> JobConf#setJar(String).
>>> >> 2013-06-25 16:19:45 FileInputFormat:237 [INFO] Total input paths to
>>> >> process : 1
>>> >> 2013-06-25 16:19:45 TrackerDistributedCacheManager:407 [INFO] Creating
>>> >> MAP in
>>> >>
>>> >> /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1-work--1596891011522800122
>>> >> with rwxr-xr-x
>>> >> 2013-06-25 16:19:45 TrackerDistributedCacheManager:447 [INFO] Cached
>>> >> /tmp/crunch-1128974463/p1/MAP as
>>> >>
>>> >>
>>> >> /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
>>> >> 2013-06-25 16:19:45 TrackerDistributedCacheManager:470 [INFO] Cached
>>> >> /tmp/crunch-1128974463/p1/MAP as
>>> >>
>>> >>
>>> >> /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
>>> >>
>>> >> 2013-06-25 16:19:45 CrunchControlledJob:303 [INFO] Running job
>>> >> "com.issuu.mahout.utils.DbDumpToSeqFile:
>>> >> Text(/Users/florian/data/docdb.first20.txt)+S0+SeqFile(/tmp/foo.seq)"
>>> >>
>>> >> 2013-06-25 16:19:45 CrunchControlledJob:304 [INFO] Job status
>>> >> available at: http://localhost:8080/
>>> >> 2013-06-25 16:19:45 Task:792 [INFO] Task:attempt_local_0001_m_000000_0
>>> >> is done. And is in the process of commiting
>>> >> 2013-06-25 16:19:45 LocalJobRunner:321 [INFO]
>>> >> 2013-06-25 16:19:45 Task:945 [INFO] Task attempt_local_0001_m_000000_0
>>> >> is allowed to commit now
>>> >>
>>> >> 2013-06-25 16:19:45 FileOutputCommitter:173 [INFO] Saved output of
>>> >> task 'attempt_local_0001_m_000000_0' to
>>> >> /tmp/crunch-1128974463/p1/output
>>> >>
>>> >> 2013-06-25 16:19:48 LocalJobRunner:321 [INFO]
>>> >> 2013-06-25 16:19:48 Task:904 [INFO] Task
>>> >> 'attempt_local_0001_m_000000_0'
>>> >> done.
>>> >>
>>> >> ---
>>> >>
>>> >>
>>> >> This crude patch makes the output end up at the right place,
>>> >> but breaks a lot of other tests.
>>> >> ---
>>> >>
>>> >> ---
>>> >>
>>> >> a/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
>>> >> +++
>>> >>
>>> >> b/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
>>> >> @@ -66,7 +66,7 @@ public class FileTargetImpl implements PathTarget {
>>> >>    protected void configureForMapReduce(Job job, Class keyClass, Class
>>> >> valueClass,
>>> >>        Class outputFormatClass, Path outputPath, String name) {
>>> >>      try {
>>> >> -      FileOutputFormat.setOutputPath(job, outputPath);
>>> >> +      FileOutputFormat.setOutputPath(job, path);
>>> >>      } catch (Exception e) {
>>> >>        throw new RuntimeException(e);
>>> >>      }
>>> >>
>>> >> ---
>>> >>
>>> >>
>>> >> Am I doing something wrong, or is this a bug?
>>> >>
>>> >> Best,
>>> >>
>>> >> Florian
>>> >
>>> >
>>
>>
>

Re: Write to sequence file ignores destination path.

Reply via email to