Where exactly nutch scoring takes place ?
I want to use Nutch as an environment to test my proposed algorithm for web mining.

1- Where exactly does Nutch scoring take place? In which packages or files?
2- Can the LinkAnalysisTool be run at the intranet level? Some documents mention that it can only be run at the whole-web crawling level.
3- What technologies and concepts must I be familiar with to get into Nutch development? Is it only JSP and servlets, or anything else?
RE: Where exactly nutch scoring takes place ?
Hi,

Scoring in Nutch 0.8 is done in a plugin: scoring-opic. It is called from Indexer.java.

HTH

-----Original Message-----
From: ahmed ghouzia [mailto:[EMAIL PROTECTED]
Sent: Friday, May 26, 2006 3:16 PM
To: nutch-user@lucene.apache.org; nutch-dev@incubator.apache.org
Subject: Where exactly nutch scoring takes place ?
[jira] Commented: (NUTCH-273) When a page is redirected, the original url is NOT updated.
[ http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12413528 ]

Doug Cutting commented on NUTCH-273:

Redirects should really not be followed immediately anyway. We should instead note in the fetcher output that the page was redirected, and to which URL. Then, when the crawl db is updated with the fetcher output, the target of the redirect should be added, with the full OPIC score of the original URL. This will enable proper politeness guarantees.

It would be nice to still associate the original URL with the content of the redirect URL when indexing. Perhaps a list of URLs that redirected to each page could be kept in the CrawlDatum metadata?

Can anyone think of a better way to implement this?

When a page is redirected, the original url is NOT updated.
---

Key: NUTCH-273
URL: http://issues.apache.org/jira/browse/NUTCH-273
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: n/a
Reporter: Lukas Vlcek

[Excerpt from maillist, sender: Andrzej Bialecki]

When a page is redirected, the original URL is NOT updated, so the CrawlDB will never know that a redirect occurred; it won't even know that a fetch occurred. This looks like a bug.

In 0.7 this was recorded in the segment, and then it would affect the Page status during updatedb. It should do so in 0.8, too.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
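
The proposal in the comment above (record the redirect instead of following it, then add the target during the crawl db update with the original URL's full score) can be illustrated with a small sketch. This is not Nutch code: RedirectSketch, Entry and FetchResult are hypothetical names, the "db" is a plain in-memory map, and keeping the existing score when the target is already known is an assumption the comment does not address.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory stand-in for a crawl db; NOT the Nutch CrawlDb API. */
public class RedirectSketch {

    /** Hypothetical per-URL record, loosely modelled on the idea of a CrawlDatum. */
    static class Entry {
        String status = "unfetched";
        float score;
        // URLs that redirected to this one (the "metadata" idea from the comment).
        final List<String> redirectedFrom = new ArrayList<>();

        Entry(float score) {
            this.score = score;
        }
    }

    /** Hypothetical fetcher output: the fetched URL and its redirect target, or null. */
    static class FetchResult {
        final String url;
        final String redirectTarget;

        FetchResult(String url, String redirectTarget) {
            this.url = url;
            this.redirectTarget = redirectTarget;
        }
    }

    final Map<String, Entry> db = new HashMap<>();

    /** Adds a URL with an initial score if it is not already known. */
    void inject(String url, float score) {
        db.putIfAbsent(url, new Entry(score));
    }

    /** Applies one fetcher record to the db; the redirect is recorded, not followed. */
    void update(FetchResult result) {
        Entry source = db.get(result.url);
        if (source == null) {
            return; // unknown URL: nothing to update in this sketch
        }
        if (result.redirectTarget == null) {
            source.status = "fetched";
            return;
        }
        // The db now knows that a fetch happened and that it ended in a redirect.
        source.status = "redirected";
        Entry target = db.get(result.redirectTarget);
        if (target == null) {
            // A new target inherits the full score of the original URL.
            target = new Entry(source.score);
            db.put(result.redirectTarget, target);
        }
        // Assumption: an already-known target keeps its own score.
        if (!target.redirectedFrom.contains(result.url)) {
            target.redirectedFrom.add(result.url);
        }
    }

    public static void main(String[] args) {
        RedirectSketch sketch = new RedirectSketch();
        sketch.inject("http://example.com/old", 2.5f);
        sketch.update(new FetchResult("http://example.com/old", "http://example.com/new"));

        Entry target = sketch.db.get("http://example.com/new");
        System.out.println(sketch.db.get("http://example.com/old").status);
        System.out.println(target.status + " " + target.score + " " + target.redirectedFrom);
    }
}
```

Because the target enters the db as an ordinary unfetched entry, it would be scheduled through the normal fetch list generation, which is where the politeness guarantee mentioned in the comment would come from.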
[jira] Created: (NUTCH-289) CrawlDatum should store IP address
CrawlDatum should store IP address
--

Key: NUTCH-289
URL: http://issues.apache.org/jira/browse/NUTCH-289
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Reporter: Doug Cutting

If the CrawlDatum stored the IP address of the host of its URL, then one could:

- partition fetch lists on the basis of IP address, for better politeness;
- truncate pages to fetch per IP address, rather than just hostname. This would be a good way to limit the impact of domain spammers.

The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.
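
The two uses listed in the issue above can be sketched in a few lines of plain Java. This is only an illustration, not Nutch code: IpPartitionSketch, Item, partitionFor and truncatePerIp are hypothetical names, the IP address is assumed to be already stored with each record (no DNS lookup is done here), and hashing the address string is just one possible partitioning rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; none of these names come from Nutch. */
public class IpPartitionSketch {

    /** Hypothetical record: a URL plus the IP address stored alongside it. */
    static class Item {
        final String url;
        final String ip;

        Item(String url, String ip) {
            this.url = url;
            this.ip = ip;
        }
    }

    /** Maps an IP address to a partition, so all URLs on one IP land in the same fetch list. */
    static int partitionFor(String ip, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(ip.hashCode(), numPartitions);
    }

    /** Keeps at most maxPerIp items per IP address, preserving input order. */
    static List<Item> truncatePerIp(List<Item> items, int maxPerIp) {
        Map<String, Integer> seen = new HashMap<>();
        List<Item> kept = new ArrayList<>();
        for (Item item : items) {
            int count = seen.merge(item.ip, 1, Integer::sum);
            if (count <= maxPerIp) {
                kept.add(item);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two hostnames share one address, as with a domain spammer on a single server.
        List<Item> items = Arrays.asList(
                new Item("http://a.example/1", "192.0.2.1"),
                new Item("http://b.example/1", "192.0.2.1"),
                new Item("http://a.example/2", "192.0.2.1"),
                new Item("http://c.example/1", "192.0.2.2"));

        for (Item item : truncatePerIp(items, 2)) {
            System.out.println(partitionFor(item.ip, 4) + " " + item.url);
        }
    }
}
```

With a per-hostname limit of 2, all three URLs on 192.0.2.1 would pass because they are spread over two hostnames; the per-IP limit drops the third one.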