[jira] Closed: (NUTCH-683) NUTCH-676 broke CrawlDbMerger

2009-02-11 JIRA

[ https://issues.apache.org/jira/browse/NUTCH-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-683.
---

Resolution: Fixed

Fixed in rev. 743277.

 NUTCH-676 broke CrawlDbMerger
 -

 Key: NUTCH-683
 URL: https://issues.apache.org/jira/browse/NUTCH-683
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: crawldbmerger_v2.patch


 The switch to Hadoop's MapWritable broke CrawlDbMerger. Part of the reason is 
 that we reuse the same MapWritable instance during reduce,
 which apparently is a big no-no for Hadoop's MapWritable. Also, Hadoop's 
 MapWritable#putAll doesn't work (I think; see HADOOP-5142),
 so we should work around that as well.
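
 To make the workaround concrete, here is a minimal sketch of a reducer that 
 allocates a fresh MapWritable per key and copies entries by hand instead of 
 calling MapWritable#putAll. This is illustrative only, not the actual 
 crawldbmerger_v2.patch; the class name and job wiring are assumptions, using 
 the Hadoop 0.19-era mapred API.

{{{
// Sketch only -- illustrates the workaround described above, not the
// actual crawldbmerger_v2.patch.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MergeMetaReducer extends MapReduceBase
    implements Reducer<Text, MapWritable, Text, MapWritable> {

  public void reduce(Text key, Iterator<MapWritable> values,
      OutputCollector<Text, MapWritable> output, Reporter reporter)
      throws IOException {
    // Fresh instance for every key: reusing one MapWritable field
    // across reduce() calls is what broke the merger.
    MapWritable merged = new MapWritable();
    while (values.hasNext()) {
      MapWritable value = values.next();
      // Copy entries one by one instead of merged.putAll(value),
      // sidestepping the broken putAll (HADOOP-5142).
      for (Writable k : value.keySet()) {
        merged.put(k, value.get(k));
      }
    }
    output.collect(key, merged);
  }
}
}}}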

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[Nutch Wiki] Trivial Update of RunNutchInEclipse0.9 by FrankMcCown

2009-02-11 Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9

The comment on the change is:
clarified

--
  http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/
  
  Copy the jar files into src/plugin/parse-mp3/lib and 
src/plugin/parse-rtf/lib/ respectively.
- Then add them to the libraries to the build path (First refresh the 
workspace. Then right-click on the source
+ Then add the jar files to the build path (First refresh the workspace. Then 
right-click on the source
  folder > Java Build Path > Libraries > Add Jars. In Eclipse version 3.4, 
right-click the project folder > Build Path > Configure Build Path...  Then 
select the Libraries tab, click Add Jars... and then add each .jar file 
individually).
  
  


[Nutch Wiki] Update of GettingNutchRunningWithWindows by FrankMcCown

2009-02-11 Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

The comment on the change is:
Added some clarifications

--
  
  === Download ===
  
- [http://lucene.apache.org/nutch/release/ Download] the release and extract 
anywhere on your hard disk e.g. `c:\nutch-0.9`
+ [http://lucene.apache.org/nutch/release/ Download] the release and extract on 
your hard disk in a directory that ''does not'' contain a space in it (e.g., 
`c:\nutch-0.9`).  If the directory does contain a space (e.g., `c:\my 
programs\nutch-0.9`), the Nutch scripts will not work properly.
  
- Create an empty text file in your nutch directory e.g. `urls` and add the 
URLs of the sites you want to crawl.
+ Create an empty text file (use any name you wish) in your nutch directory 
(e.g., `urls`) and add the URLs of the sites you want to crawl.
  
- Add your URLs to the `crawl-urlfilter.txt` (e.g. 
`C:\nutch-0.9\conf\crawl-urlfilter.txt`). An entry could look like this:
+ Add your URLs to the `crawl-urlfilter.txt` (e.g., 
`C:\nutch-0.9\conf\crawl-urlfilter.txt`). An entry could look like this:
  {{{
  +^http://([a-z0-9]*\.)*apache.org/
  }}}
  
- Load up cygwin and naviagte to your nutch directory.  When cygwin launches 
you'll usually find yourself in your user folder (e.g. `C:\Documents and 
Settings\username`).
+ Load up cygwin and navigate to your `nutch` directory.  When cygwin launches, 
you'll usually find yourself in your user folder (e.g. `C:\Documents and 
Settings\username`).
  
- If your workstation needs to go through a windows authentication proxy to get 
to the internet then you can use an application such as the 
[http://sourceforge.net/projects/ntlmaps/ NTLM Authorization Proxy Server] to 
get through it.  You'll then need to edit the `nutch-site.xml` file to point to 
the port opened by the app.
+ If your workstation needs to go through a Windows Authentication Proxy to get 
to the Internet (this is not common), then you can use an application such as 
the [http://sourceforge.net/projects/ntlmaps/ NTLM Authorization Proxy Server] 
to get through it.  You'll then need to edit the `nutch-site.xml` file to point 
to the port opened by the app.
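
For reference, Nutch reads its proxy settings from standard configuration 
properties; a nutch-site.xml override would look roughly like the sketch below. 
The host and port values are placeholders for whatever the local proxy app 
listens on.

{{{
<?xml version="1.0"?>
<configuration>
  <!-- Placeholder values: point these at the host/port that the
       local NTLM Authorization Proxy Server is listening on. -->
  <property>
    <name>http.proxy.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>5865</value>
  </property>
</configuration>
}}}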
  
  == Intranet Crawling ==
  
@@ -48, +48 @@

  {{{
  bin/nutch crawl urls -dir crawl -depth 3 > crawl.log
  }}}
- then a folder called crawl/ is created in your nutch directory, along with 
the crawl.log file.  Use this log file to debug any errors you might have.
+ then a folder called `crawl` is created in your `nutch` directory, along with 
the crawl.log file.  Use this log file to debug any errors you might have.
  
  You'll need to delete or move the crawl directory before starting the crawl 
off again unless you specify another path on the command above.
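
As a sanity check on the crawl-urlfilter.txt entry above: the leading + means 
accept, and the rest of the line is an ordinary regular expression. The 
following snippet (illustrative only, not Nutch's actual urlfilter-regex 
plugin) shows which URLs that pattern admits:

{{{
// Illustrative only -- a quick check of the crawl-urlfilter.txt pattern,
// not Nutch's actual urlfilter-regex plugin.
import java.util.regex.Pattern;

public class UrlFilterCheck {
  public static void main(String[] args) {
    // The leading '+' in crawl-urlfilter.txt means "accept";
    // the rest of the line is the regular expression itself.
    Pattern accept = Pattern.compile("^http://([a-z0-9]*\\.)*apache.org/");

    String[] urls = {
      "http://lucene.apache.org/nutch/",  // matches: subdomain of apache.org
      "http://apache.org/",               // matches: zero subdomains allowed
      "http://example.com/apache.org/"    // no match: host isn't apache.org
    };
    for (String url : urls) {
      System.out.println(url + " -> " + accept.matcher(url).find());
    }
  }
}
}}}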
  


[jira] Commented: (NUTCH-683) NUTCH-676 broke CrawlDbMerger

2009-02-11 Hudson (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672871#action_12672871 ]

Hudson commented on NUTCH-683:
--

Integrated in Nutch-trunk #722 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/722/])
NUTCH-683 - NUTCH-676 broke CrawlDbMerger


 NUTCH-676 broke CrawlDbMerger
 -

 Key: NUTCH-683
 URL: https://issues.apache.org/jira/browse/NUTCH-683
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: crawldbmerger_v2.patch


 The switch to Hadoop's MapWritable broke CrawlDbMerger. Part of the reason is 
 that we reuse the same MapWritable instance during reduce,
 which apparently is a big no-no for Hadoop's MapWritable. Also, Hadoop's 
 MapWritable#putAll doesn't work (I think; see HADOOP-5142),
 so we should work around that as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly

2009-02-11 Hudson (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672870#action_12672870 ]

Hudson commented on NUTCH-676:
--

Integrated in Nutch-trunk #722 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/722/])
NUTCH-683 - NUTCH-676 broke CrawlDbMerger


 MapWritable is written inefficiently and confusingly
 

 Key: NUTCH-676
 URL: https://issues.apache.org/jira/browse/NUTCH-676
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: 
 0001-NUTCH-676-Replace-MapWritable-implementation-with-t.patch, 
 NUTCH-676_v2.patch, NUTCH-676_v3.patch


 The MapWritable implementation in o.a.n.crawl is written confusingly: it 
 maintains its own internal linked list, which I think may have a bug somewhere 
 (I'm getting an NPE in certain cases in the code, though it's hard to track 
 down).
 Can anyone comment on why MapWritable is written the way it is, rather 
 than just using a HashMap, or a LinkedHashMap if consistent ordering is 
 important? I imagine that would improve performance.
 What about just using the Hadoop MapWritable? Obviously that would break some 
 backwards compatibility, but it may be a good idea at some point to reduce 
 confusion (I didn't realize that Nutch had its own impl until a few minutes 
 ago).
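
 For illustration, the simpler design being suggested would look roughly like 
 the sketch below. SimpleMapWritable is a made-up class, not 
 o.a.n.crawl.MapWritable, and it fixes keys and values to Text, whereas the 
 real class handles arbitrary Writable types.

{{{
// Sketch: a LinkedHashMap-backed Writable map, illustrating the simpler
// design the report suggests. Not the actual o.a.n.crawl.MapWritable.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class SimpleMapWritable implements Writable {
  // LinkedHashMap keeps insertion order, in case consistent
  // ordering matters to callers.
  private final Map<Text, Text> map = new LinkedHashMap<Text, Text>();

  public void put(Text key, Text value) { map.put(key, value); }
  public Text get(Text key) { return map.get(key); }

  public void write(DataOutput out) throws IOException {
    out.writeInt(map.size());
    for (Map.Entry<Text, Text> e : map.entrySet()) {
      e.getKey().write(out);
      e.getValue().write(out);
    }
  }

  public void readFields(DataInput in) throws IOException {
    map.clear();
    int size = in.readInt();
    for (int i = 0; i < size; i++) {
      Text key = new Text();
      Text value = new Text();
      key.readFields(in);
      value.readFields(in);
      map.put(key, value);
    }
  }
}
}}}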

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.