[jira] Closed: (NUTCH-683) NUTCH-676 broke CrawlDbMerger
[ https://issues.apache.org/jira/browse/NUTCH-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-683.
-------------------------------
    Resolution: Fixed

Fixed in rev. 743277.

NUTCH-676 broke CrawlDbMerger
-----------------------------
                Key: NUTCH-683
                URL: https://issues.apache.org/jira/browse/NUTCH-683
            Project: Nutch
         Issue Type: Bug
   Affects Versions: 1.0.0
           Reporter: Doğacan Güney
           Assignee: Doğacan Güney
           Priority: Minor
            Fix For: 1.0.0
        Attachments: crawldbmerger_v2.patch

The switch to Hadoop's MapWritable broke CrawlDbMerger. Part of the reason is that we reuse the same MapWritable instance during reduce, which apparently is a big no-no for Hadoop's MapWritable. Also, Hadoop's MapWritable#putAll doesn't work (I think this is HADOOP-5142), so we should also work around that.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
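A minimal sketch of the two pitfalls described above, assuming Hadoop's org.apache.hadoop.io.MapWritable; the class and method names are illustrative only, and this is not the actual CrawlDbMerger code:

{{{
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Writable;

import java.util.Map;

public class MetaMergeSketch {

  // Allocate a fresh MapWritable for every merged record instead of reusing
  // a single instance across reduce() calls.
  public static MapWritable merge(Iterable<MapWritable> values) {
    MapWritable merged = new MapWritable();
    for (MapWritable value : values) {
      // Copy entries explicitly rather than relying on MapWritable#putAll.
      for (Map.Entry<Writable, Writable> entry : value.entrySet()) {
        merged.put(entry.getKey(), entry.getValue());
      }
    }
    return merged;
  }
}
}}}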
[Nutch Wiki] Trivial Update of RunNutchInEclipse0.9 by FrankMcCown
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9

The comment on the change is:
clarified

------------------------------------------------------------------------------
  http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/

  Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively.
- Then add them to the libraries to the build path (First refresh the workspace. Then right-click on the source
+ Then add the jar files to the build path (First refresh the workspace. Then right-click on the source
  folder > Java Build Path > Libraries > Add Jars. In Eclipse version 3.4, right-click the project folder > Build Path > Configure Build Path... Then select the Libraries tab, click Add Jars... and then add each .jar file individually).
[Nutch Wiki] Update of GettingNutchRunningWithWindows by FrankMcCown
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

The comment on the change is:
Added some clarifications

------------------------------------------------------------------------------
  === Download ===
- [http://lucene.apache.org/nutch/release/ Download] the release and extract anywhere on your hard disk e.g. `c:\nutch-0.9`
+ [http://lucene.apache.org/nutch/release/ Download] the release and extract on your hard disk in a directory that ''does not'' contain a space in it (e.g., `c:\nutch-0.9`). If the directory does contain a space (e.g., `c:\my programs\nutch-0.9`), the Nutch scripts will not work properly.

- Create an empty text file in your nutch directory e.g. `urls` and add the URLs of the sites you want to crawl.
+ Create an empty text file (use any name you wish) in your nutch directory (e.g., `urls`) and add the URLs of the sites you want to crawl.

- Add your URLs to the `crawl-urlfilter.txt` (e.g. `C:\nutch-0.9\conf\crawl-urlfilter.txt`). An entry could look like this:
+ Add your URLs to the `crawl-urlfilter.txt` (e.g., `C:\nutch-0.9\conf\crawl-urlfilter.txt`). An entry could look like this:
  {{{
  +^http://([a-z0-9]*\.)*apache.org/
  }}}

- Load up cygwin and naviagte to your nutch directory. When cygwin launches you'll usually find yourself in your user folder (e.g. `C:\Documents and Settings\username`).
+ Load up cygwin and navigate to your `nutch` directory. When cygwin launches, you'll usually find yourself in your user folder (e.g. `C:\Documents and Settings\username`).

- If your workstation needs to go through a windows authentication proxy to get to the internet then you can use an application such as the [http://sourceforge.net/projects/ntlmaps/ NTLM Authorization Proxy Server] to get through it. You'll then need to edit the `nutch-site.xml` file to point to the port opened by the app.
+ If your workstation needs to go through a Windows Authentication Proxy to get to the Internet (this is not common), then you can use an application such as the [http://sourceforge.net/projects/ntlmaps/ NTLM Authorization Proxy Server] to get through it. You'll then need to edit the `nutch-site.xml` file to point to the port opened by the app.

  == Intranet Crawling ==

@@ -48, +48 @@

  {{{
  bin/nutch crawl urls -dir crawl -depth 3 > crawl.log
  }}}
- then a folder called crawl/ is created in your nutch directory, along with the crawl.log file. Use this log file to debug any errors you might have.
+ then a folder called `crawl` is created in your `nutch` directory, along with the crawl.log file. Use this log file to debug any errors you might have. You'll need to delete or move the crawl directory before starting the crawl off again unless you specify another path on the command above.
[jira] Commented: (NUTCH-683) NUTCH-676 broke CrawlDbMerger
[ https://issues.apache.org/jira/browse/NUTCH-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672871#action_12672871 ]

Hudson commented on NUTCH-683:
------------------------------

Integrated in Nutch-trunk #722 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/722/])
    NUTCH-676 broke CrawlDbMerger

NUTCH-676 broke CrawlDbMerger
-----------------------------
                Key: NUTCH-683
                URL: https://issues.apache.org/jira/browse/NUTCH-683
            Project: Nutch
         Issue Type: Bug
   Affects Versions: 1.0.0
           Reporter: Doğacan Güney
           Assignee: Doğacan Güney
           Priority: Minor
            Fix For: 1.0.0
        Attachments: crawldbmerger_v2.patch

The switch to Hadoop's MapWritable broke CrawlDbMerger. Part of the reason is that we reuse the same MapWritable instance during reduce, which apparently is a big no-no for Hadoop's MapWritable. Also, Hadoop's MapWritable#putAll doesn't work (I think this is HADOOP-5142), so we should also work around that.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672870#action_12672870 ]

Hudson commented on NUTCH-676:
------------------------------

Integrated in Nutch-trunk #722 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/722/])
    NUTCH-683 - broke CrawlDbMerger

MapWritable is written inefficiently and confusingly
-----------------------------------------------------
                Key: NUTCH-676
                URL: https://issues.apache.org/jira/browse/NUTCH-676
            Project: Nutch
         Issue Type: Improvement
   Affects Versions: 0.9.0
           Reporter: Todd Lipcon
           Assignee: Doğacan Güney
           Priority: Minor
            Fix For: 1.0.0
        Attachments: 0001-NUTCH-676-Replace-MapWritable-implementation-with-t.patch, NUTCH-676_v2.patch, NUTCH-676_v3.patch

The MapWritable implementation in o.a.n.crawl is written confusingly: it maintains its own internal linked list, which I think may have a bug somewhere (I'm getting an NPE in certain cases in the code, though it's hard to track down).
Can anyone comment as to why MapWritable is written the way it is, rather than just using a HashMap or a LinkedHashMap if consistent ordering is important? I imagine that would improve performance.
What about just using the Hadoop MapWritable? Obviously that would break some backwards compatibility, but it may be a good idea at some point to reduce confusion (I didn't realize that Nutch had its own impl until a few minutes ago).

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
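As a rough illustration of the LinkedHashMap alternative suggested above, here is a minimal, hypothetical Writable map keyed by Text with deterministic iteration order. It is neither the o.a.n.crawl MapWritable nor Hadoop's implementation; the class name and the Text-only keys and values are assumptions made to keep the sketch short:

{{{
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderedTextMapWritable implements Writable {

  // LinkedHashMap preserves insertion order, so serialization is deterministic
  // without maintaining a separate hand-rolled linked list.
  private final Map<Text, Text> map = new LinkedHashMap<Text, Text>();

  public void put(Text key, Text value) {
    map.put(key, value);
  }

  public Text get(Text key) {
    return map.get(key);
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(map.size());
    for (Map.Entry<Text, Text> entry : map.entrySet()) {
      entry.getKey().write(out);
      entry.getValue().write(out);
    }
  }

  public void readFields(DataInput in) throws IOException {
    map.clear();
    int size = in.readInt();
    for (int i = 0; i < size; i++) {
      Text key = new Text();
      Text value = new Text();
      key.readFields(in);
      value.readFields(in);
      map.put(key, value);
    }
  }
}
}}}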