duplicate hadoop temp files
---------------------------

                 Key: NUTCH-829
                 URL: https://issues.apache.org/jira/browse/NUTCH-829
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 1.0.0, 1.1
            Reporter: Mike Baranczak
            Priority: Minor


When two crawls are started at exactly the same time, I see the following 
error: 
{quote}
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/tmp/hadoop-mike/mapred/temp/generate-temp-1276463469075 already exists
        at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:111)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:793)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:472)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:409)
        [...]
{quote}

I traced it down to this code in Generator (I'm using Nutch 1.0, but this is 
still in the trunk):

{quote}
Path tempDir =
      new Path(getConf().get("mapred.temp.dir", ".") +
               "/generate-temp-"+ System.currentTimeMillis());
{quote}

I admit that this is an unlikely scenario for most users, but I did run into 
it. To make a temp directory name collision practically impossible, I suggest 
replacing System.currentTimeMillis() with 
java.util.UUID.randomUUID().toString().
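
A minimal sketch of the suggested change, for illustration only. The class 
name TempDirName and the helper generateTempDirName are hypothetical and are 
not part of Nutch. The helper builds the path as a plain String so the example 
runs without Hadoop on the classpath; in Generator the result would presumably 
still be wrapped in new Path(...) as in the snippet above.

```java
import java.util.UUID;

public class TempDirName {

    // Hypothetical helper (not Nutch code): builds the temp directory name
    // from a random UUID instead of System.currentTimeMillis(), so two
    // crawls started in the same millisecond get different names.
    static String generateTempDirName(String baseDir) {
        return baseDir + "/generate-temp-" + UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // "." mirrors the default passed to getConf().get("mapred.temp.dir", ".")
        String first = generateTempDirName(".");
        String second = generateTempDirName(".");

        System.out.println(first);
        System.out.println(second);
        System.out.println(first.equals(second) ? "collision" : "distinct");
    }
}
```

Random UUIDs are not a strict mathematical guarantee of uniqueness, but a 
collision between two of them is far less likely than two processes reading 
the same millisecond timestamp.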

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
