Re: [Nutch-cvs] svn commit: r516888 - /lucene/nutch/trunk/bin/nutch
Sami Siren wrote:
>> How did the code end up in this place on Linux? The $cygwin condition should have prevented that, because it evaluates to true only on Cygwin, where this utility is required to translate the paths. You also changed the if syntax - before it was using the /bin/test utility to evaluate the expression, now it uses a shell built-in - I'm not sure whether these two follow the same evaluation rules on all supported platforms ... Please revert it to the earlier syntax.
>
> Revert it so it doesn't work on Linux, are you sure?

I'll make a patch and check that it works on Linux too - apparently it was me who botched the AND syntax in my previous commit... ;)

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
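For context, the guard being discussed follows the Cygwin-detection pattern common to Nutch/Hadoop launch scripts. A minimal sketch of that pattern (variable names are illustrative, not quoted from bin/nutch):

```shell
# Detect Cygwin; the path translation below must run ONLY there.
cygwin=false
case "$(uname)" in
  CYGWIN*) cygwin=true ;;
esac

# The point of contention: `if $cygwin` runs the shell built-in true/false,
# while the older form used /bin/test. Either way, this branch is skipped
# on Linux because $cygwin is false there.
if $cygwin; then
  # cygpath (a Cygwin-only utility) converts a POSIX path list to Windows form.
  CLASSPATH=$(cygpath -p -w "$CLASSPATH")
fi
```

On Linux the `case` leaves `cygwin=false`, so `cygpath` is never invoked and the script works unchanged.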
[jira] Updated: (NUTCH-451) Tool to recover partial fetcher output
[ https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mathijs Homminga updated NUTCH-451:
-----------------------------------
    Attachment: LocalFetchRecover-0.8.1.java

works with Nutch 0.8.1

> Tool to recover partial fetcher output
> --------------------------------------
>
>                 Key: NUTCH-451
>                 URL: https://issues.apache.org/jira/browse/NUTCH-451
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki
>         Assigned To: Andrzej Bialecki
>             Fix For: 0.9.0
>         Attachments: LocalFetchRecover-0.8.1.java, LocalFetchRecover.java
>
> This class may help you to recover partial data from a failed Fetcher run.
> NOTE 1: this works ONLY if you ran Fetcher on the local file system, i.e. you didn't use DFS - partial output to DFS is permanently lost if a process fails to properly close the output streams.
> NOTE 2: if Fetcher was stopped abruptly (killed or crashed), the partial SequenceFile-s will be corrupted at the end. This means that it won't be possible to recover all data from them - most likely only the data up to the last sync marker can be recovered.
> The recovery process requires some preparation:
> * determine the map directories corresponding to the map task outputs of the failed job. These map directories contain SequenceFile-s consisting of pairs of Text, FetcherOutput, named e.g. part-0.out, or file.out, or spill0.out.
> * create a new input directory, let's say input/. Copy all SequenceFile-s into this directory, renaming them sequentially like this:
>     input/part-0
>     input/part-1
>     input/part-2
>     input/part-3
>     ...
> * specify the input directory as the input to this tool. If all goes well, a new segment will be created as a subdirectory of the output dir.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
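The preparation steps above amount to copying each partial map output into a fresh directory under sequential part-N names. A minimal sketch, where the MAP_OUT location is an assumption for illustration (point it at your failed job's actual map output directory):

```shell
# Hypothetical paths - adjust to your failed job's layout.
MAP_OUT=./failed_job_maps   # holds part-0.out, file.out, spill0.out, etc.
INPUT=./input               # new input directory for the recovery tool

mkdir -p "$INPUT"
i=0
for f in "$MAP_OUT"/*.out; do
  [ -e "$f" ] || continue      # skip if the glob matched nothing
  # Rename sequentially: input/part-0, input/part-1, ...
  cp "$f" "$INPUT/part-$i"
  i=$((i + 1))
done
```

The resulting input/ directory is then handed to the recovery tool as its input, per the last step above.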
Re: Hadoop 0.11.2 vs. 0.12.1
Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> I agree there may be subtle bugs. I can do, say, a full dmoz crawl (~5M pages) with nutch trunk and hadoop 12.1 on a small cluster of 5 machines if this would help?
>
> Certainly, that would be most welcome.

I will start that up today.

>> We have already done some crawls of 100K urls with 11.2 without problems. I say let's test it, and if there aren't any significant issues then let's go with 12.1 if the hadoop team thinks it will be more stable.
>
> 0.12.1 is not out the door yet. I can create a patch that uses the latest Hadoop trunk binaries, so that we could test it.

I can just pull it down from source. Let me know if that isn't what we want.

>> One question though, are there any concerns about upgrading clusters as opposed to new fetches?
>
> Theoretically, there shouldn't be, but this is an uncharted area ... until someone tries it we won't know for sure. :-/