duplicate hadoop temp files
---------------------------

                 Key: NUTCH-829
                 URL: https://issues.apache.org/jira/browse/NUTCH-829
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 1.0.0, 1.1
            Reporter: Mike Baranczak
            Priority: Minor

When two crawls are started at exactly the same time, I see the following error:

{quote}
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/tmp/hadoop-mike/mapred/temp/generate-temp-1276463469075 already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:111)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:793)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
    at org.apache.nutch.crawl.Generator.generate(Generator.java:472)
    at org.apache.nutch.crawl.Generator.generate(Generator.java:409)
    [...]
{quote}

I traced it down to this code in Generator (I'm using Nutch 1.0, but this is still in the trunk):

{quote}
Path tempDir = new Path(getConf().get("mapred.temp.dir", ".") + "/generate-temp-" + System.currentTimeMillis());
{quote}

Because the directory name depends only on the current time in milliseconds, two Generator runs that reach this line in the same millisecond get the same temp directory, and the second job fails the output directory check.

I admit that this is an unlikely scenario for most users, but it just so happens that I ran into it. To make sure that the temp directory doesn't already exist, I suggest changing System.currentTimeMillis() to java.util.UUID.randomUUID().toString().
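
To illustrate the suggested change, here is a minimal sketch in plain Java, without the Hadoop Path class so that it runs on its own. The class TempDirNames and both helper methods are hypothetical names made up for this example; they are not part of Nutch or Hadoop. It only shows how the two naming schemes behave, not how the actual patch should look.

```java
import java.util.UUID;

// Hypothetical demo class; not part of Nutch or Hadoop.
class TempDirNames {

    // Mirrors the naming scheme quoted from Generator: the suffix is a
    // millisecond timestamp, so two callers in the same millisecond collide.
    static String timestampTempDir(String baseDir, String prefix, long millis) {
        return baseDir + "/" + prefix + millis;
    }

    // Suggested alternative: the suffix is a random (version 4) UUID, so a
    // collision between two callers is extremely unlikely.
    static String uniqueTempDir(String baseDir, String prefix) {
        return baseDir + "/" + prefix + UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Simulate two crawls reading the clock in the same millisecond.
        long now = System.currentTimeMillis();
        String first = timestampTempDir(".", "generate-temp-", now);
        String second = timestampTempDir(".", "generate-temp-", now);
        System.out.println("timestamp names equal: " + first.equals(second));

        // The UUID-based names do not depend on the clock at all.
        String third = uniqueTempDir(".", "generate-temp-");
        String fourth = uniqueTempDir(".", "generate-temp-");
        System.out.println("uuid names equal: " + third.equals(fourth));
    }
}
```

In Generator itself the same idea would presumably amount to swapping the timestamp suffix for the UUID string inside the existing new Path(...) expression, as proposed above. Strictly speaking, a random UUID makes a collision astronomically unlikely rather than impossible.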