Where exactly nutch scoring takes place ?

2006-05-26 Thread ahmed ghouzia
I want to use Nutch as an environment to test my proposed algorithm for web
mining.

1- Where exactly does Nutch scoring take place? In which packages or files?

2- Can the LinkAnalysisTool be run at the intranet level? Some documents
mention that it can only be run at the whole-web crawling level.

3- What technologies and concepts must I be familiar with to get into Nutch
development?
Is it only JSP and servlets, or anything else?



RE: Where exactly nutch scoring takes place ?

2006-05-26 Thread Gal Nitzan
Hi,

The scoring in Nutch 0.8 is done in a plugin: scoring-opic. It is called from
Indexer.java.

HTH
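
For readers testing their own algorithm, here is a minimal, self-contained sketch of the general idea the plugin name refers to (OPIC-style scoring, where a page passes its score to its outlinks in equal shares). It is not the scoring-opic source and uses no Nutch classes; the class OpicSketch and its methods are hypothetical names, and the real plugin may differ in how it resets or accumulates scores.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: NOT the scoring-opic plugin. All names are hypothetical.
final class OpicSketch {

    // Current score ("cash") per URL.
    private final Map<String, Double> scores = new HashMap<>();

    // Give a newly injected URL its starting score (no effect if already known).
    void inject(String url, double initialScore) {
        scores.putIfAbsent(url, initialScore);
    }

    // After fetching `url`, split its score equally among its outlinks
    // and add each share to the outlink's existing score.
    // Assumption: the source page keeps its own score; full OPIC would reset it.
    void distribute(String url, List<String> outlinks) {
        if (outlinks.isEmpty()) {
            return; // nothing to pass on
        }
        double share = scores.getOrDefault(url, 0.0) / outlinks.size();
        for (String target : outlinks) {
            scores.merge(target, share, Double::sum);
        }
    }

    // Score of a URL, or 0.0 if the URL is unknown.
    double score(String url) {
        return scores.getOrDefault(url, 0.0);
    }

    public static void main(String[] args) {
        OpicSketch db = new OpicSketch();
        db.inject("http://a.example/", 1.0);
        db.distribute("http://a.example/",
                List.of("http://b.example/", "http://c.example/"));
        db.distribute("http://b.example/", List.of("http://c.example/"));
        // c received 0.5 from a and 0.5 from b.
        System.out.println(db.score("http://c.example/"));
    }
}
```

A custom algorithm would replace the equal split in distribute with its own weighting; the thread does not say which extension points the plugin exposes for that.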



-Original Message-
From: ahmed ghouzia [mailto:[EMAIL PROTECTED] 
Sent: Friday, May 26, 2006 3:16 PM
To: nutch-user@lucene.apache.org; nutch-dev@incubator.apache.org
Subject: Where exactly nutch scoring takes place ?






[jira] Commented: (NUTCH-273) When a page is redirected, the original url is NOT updated.

2006-05-26 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12413528 ]

Doug Cutting commented on NUTCH-273:


Redirects should really not be followed immediately anyway.  We should instead 
note that it was redirected and to which URL in the fetcher output.  Then, when 
the crawl db is updated with the fetcher output, the target of the redirect 
should be added, with the full OPIC score of the original URL.  This will 
enable proper politeness guarantees.

It would be nice to still associate the original URL with the content of the 
redirect URL when indexing.  Perhaps a list of URLs that redirected to each 
page could be kept in the CrawlDatum metadata?  Can anyone think of a better 
way to implement this?
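
The proposal above could be sketched roughly as follows. This is a toy model, not Nutch code: RedirectSketch, FetchResult and Entry are hypothetical stand-ins for the fetcher output and crawl db entries, and adding the score to an already known target is an assumption the comment does not spell out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: these are NOT Nutch classes. All names are hypothetical.
final class RedirectSketch {

    // What the fetcher would emit for one URL. The redirect is not followed;
    // only its target is recorded (null when there was no redirect).
    static final class FetchResult {
        final String url;
        final String redirectTarget;

        FetchResult(String url, String redirectTarget) {
            this.url = url;
            this.redirectTarget = redirectTarget;
        }
    }

    // Minimal stand-in for a crawl db entry with a small metadata list.
    static final class Entry {
        double score;
        String status = "unfetched";
        final List<String> redirectedFrom = new ArrayList<>();

        Entry(double score) {
            this.score = score;
        }
    }

    final Map<String, Entry> crawlDb = new HashMap<>();

    // Crawl db update step: mark the original URL as redirected and add the
    // target with the original's full score, so that the target is fetched
    // in a later round under the normal politeness rules.
    void update(FetchResult result) {
        Entry original = crawlDb.get(result.url);
        if (original == null) {
            return; // unknown URL, nothing to update
        }
        if (result.redirectTarget == null) {
            original.status = "fetched";
            return;
        }
        original.status = "redirected";
        Entry target = crawlDb.computeIfAbsent(result.redirectTarget, u -> new Entry(0.0));
        // Assumption: an already known target accumulates the score.
        target.score += original.score;
        // Remember which URL redirected here, for use at indexing time.
        target.redirectedFrom.add(result.url);
    }
}
```

Keeping the source URLs on the target entry is one way to associate the original URL with the redirected content at indexing time; other layouts are possible.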


 When a page is redirected, the original url is NOT updated.
 ---

  Key: NUTCH-273
  URL: http://issues.apache.org/jira/browse/NUTCH-273
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
  Environment: n/a
 Reporter: Lukas Vlcek


 [Excerpt from maillist, sender: Andrzej Bialecki]
 When a page is redirected, the original url is NOT updated - so, CrawlDB will 
 never know that a redirect occurred, it won't even know that a fetch 
 occurred... This looks like a bug.
 In 0.7 this was recorded in the segment, and then it would affect the Page 
 status during updatedb. It should do so in 0.8, too...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-289) CrawlDatum should store IP address

2006-05-26 Thread Doug Cutting (JIRA)
CrawlDatum should store IP address
--

 Key: NUTCH-289
 URL: http://issues.apache.org/jira/browse/NUTCH-289
 Project: Nutch
Type: Bug

  Components: fetcher  
Versions: 0.8-dev
Reporter: Doug Cutting


If the CrawlDatum stored the IP address of the host of its URL, then one could:

- partition fetch lists on the basis of IP address, for better politeness;
- truncate pages to fetch per IP address, rather than just hostname.  This 
would be a good way to limit the impact of domain spammers.

The IP addresses could be resolved when a CrawlDatum is first created for a new 
outlink, or perhaps during CrawlDB update.
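
As a rough illustration of the two uses listed above, the following sketch partitions and truncates entries by IP address. It is not Nutch code: IpFetchListSketch and Datum are hypothetical names, the IP is assumed to have been resolved already, and hashing the IP string is just one possible partitioning rule.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: NOT Nutch classes. All names are hypothetical.
final class IpFetchListSketch {

    // Stand-in for a CrawlDatum that carries the resolved IP of its URL's host.
    static final class Datum {
        final String url;
        final String ip;

        Datum(String url, String ip) {
            this.url = url;
            this.ip = ip;
        }
    }

    // Keep at most maxPerIp entries per IP address, preserving input order.
    // Hostnames that share one IP therefore share one quota.
    static List<Datum> truncatePerIp(List<Datum> input, int maxPerIp) {
        Map<String, Integer> seen = new HashMap<>();
        List<Datum> kept = new ArrayList<>();
        for (Datum d : input) {
            int count = seen.merge(d.ip, 1, Integer::sum);
            if (count <= maxPerIp) {
                kept.add(d);
            }
        }
        return kept;
    }

    // Assign each entry to a fetch list by hashing its IP, so that all URLs
    // served from one IP end up in the same list.
    static List<List<Datum>> partitionByIp(List<Datum> input, int numLists) {
        List<List<Datum>> lists = new ArrayList<>();
        for (int i = 0; i < numLists; i++) {
            lists.add(new ArrayList<>());
        }
        for (Datum d : input) {
            int index = Math.floorMod(d.ip.hashCode(), numLists);
            lists.get(index).add(d);
        }
        return lists;
    }
}
```

Because each IP maps to exactly one list, a single fetcher task can enforce per-IP delays without coordinating with other tasks.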
