Re: [Nutch-cvs] svn commit: r516888 - /lucene/nutch/trunk/bin/nutch

2007-03-12 Thread Andrzej Bialecki

Sami Siren wrote:

How did the code end up in this place on Linux? The $cygwin condition
should have prevented that, because it evaluates to true only on Cygwin,
where this utility is required to translate the paths.

You also changed the if syntax - before, it used the /bin/test
utility to evaluate the expression; now it uses a shell built-in. I'm
not sure whether these two follow the same evaluation rules on all
supported platforms, so please revert it to the earlier syntax.



Revert it so it doesn't work on Linux - are you sure?


I'll make a patch and check that it works on Linux too - apparently it 
was me who botched the AND syntax in my previous commit... ;)
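
For reference, the kind of guard I mean looks roughly like this - a
sketch only, the actual lines in bin/nutch may differ:

# detect Cygwin once, near the top of the script
cygwin=false
case "`uname`" in
CYGWIN*) cygwin=true;;
esac

# portable AND: two [ ] tests joined by &&, instead of the -a operator;
# the $cygwin guard matters because cygpath does not exist on Linux
if [ "$cygwin" = "true" ] && [ -n "$CLASSPATH" ]; then
  CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi

Simple string tests like these should evaluate the same way whether [ is
a shell built-in or /bin/[, on both Linux and Cygwin.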


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Updated: (NUTCH-451) Tool to recover partial fetcher output

2007-03-12 Thread Mathijs Homminga (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mathijs Homminga updated NUTCH-451:
---

Attachment: LocalFetchRecover-0.8.1.java

works with Nutch 0.8.1

 Tool to recover partial fetcher output
 --

 Key: NUTCH-451
 URL: https://issues.apache.org/jira/browse/NUTCH-451
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Andrzej Bialecki 
 Assigned To: Andrzej Bialecki 
 Fix For: 0.9.0

 Attachments: LocalFetchRecover-0.8.1.java, LocalFetchRecover.java


 This class may help you to recover partial data from a failed Fetcher run. 
 NOTE 1: this works ONLY if you ran Fetcher using local file system, i.e. 
 you didn't use DFS - partial output to DFS is permanently lost if a process 
 fails to properly close the output streams.
 NOTE 2: if Fetcher was stopped abruptly (killed or crashed), then partial 
 SequenceFile-s will be corrupted at the end. This means that it won't be 
 possible to recover all data from them - most likely only the data up to the 
 last sync marker can be recovered.
 The recovery process requires some preparation: 
 * determine the map directories corresponding to the map task outputs of the 
 failed job. These map directories contain SequenceFile-s consisting of pairs 
 of <Text, FetcherOutput>, named e.g. part-0.out, or file.out, or spill0.out.
 * create the new input directory, let's say input/. Copy all SequenceFile-s 
 into this directory, renaming them sequentially like this: 
   input/part-0
   input/part-1
   input/part-2
   input/part-3
   ...
   
 * specify the input directory as the input to this tool. 
 If all goes well, a new segment will be created as a subdirectory of the 
 output dir.
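 As an illustration, the preparation could be scripted like this (a sketch 
 only - $MAPOUT is a placeholder for wherever the map outputs of the failed 
 job ended up): 
   # copy the map outputs into a fresh input dir, renaming them part-0, ... 
   mkdir input 
   i=0 
   for f in $MAPOUT/*.out; do 
     cp "$f" input/part-$i 
     i=`expr $i + 1` 
   done 
   # then run the tool with input/ as its input directory; see the attached 
   # LocalFetchRecover class for the exact invocation 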

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Hadoop 0.11.2 vs. 0.12.1

2007-03-12 Thread Dennis Kubes



Andrzej Bialecki wrote:

Dennis Kubes wrote:

I agree there may be subtle bugs.

I can do, say, a full dmoz crawl (~5M pages) with Nutch trunk and Hadoop
12.1 on a small cluster of 5 machines, if this would help? We have
already


Certainly, that would be most welcome.


I will start that up today.




done some crawls of over 100K urls with 11.2 without problems. I say let's
test it, and if there aren't any significant issues then let's go with 12.1
if the Hadoop team thinks it will be more stable.


0.12.1 is not out the door yet. I can create a patch that uses the 
latest Hadoop trunk binaries, so that we could test it.


I can just pull it down from source.  Let me know if that isn't what we 
want.
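
For the record, something like this is what I had in mind - assuming
Hadoop trunk still lives under the Lucene tree and builds with plain Ant
(adjust the URL and target if not):

# check out and build the current Hadoop trunk
svn co http://svn.apache.org/repos/asf/lucene/hadoop/trunk/ hadoop-trunk
cd hadoop-trunk
ant jar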




One question though: are there any concerns about upgrading existing
clusters, as opposed to starting new fetches?


Theoretically, there shouldn't be, but this is an uncharted area ... 
until someone tries it we won't know for sure. :-/