[jira] Created: (NUTCH-558) Need tool to retrieve domain statistics

2007-09-19 Thread Chris Schneider (JIRA)
Need tool to retrieve domain statistics
---

 Key: NUTCH-558
 URL: https://issues.apache.org/jira/browse/NUTCH-558
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Chris Schneider
Assignee: Chris Schneider


Several developers have expressed interest in a tool to retrieve statistics 
from a crawl on a domain basis (e.g., how many pages were successfully fetched 
from www.apache.org vs. apache.org, where the latter total would include the 
former).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

2007-09-19 Thread Susam Pal (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528854
 ] 

Susam Pal commented on NUTCH-557:
-

No, there isn't any significant difference in performance. Here's a list of the
CPU time consumed by the Nutch crawl over 15 runs (5 per plugin).

The columns are, in order: serial no., protocol-http11, protocol-http,
protocol-httpclient. All values are in seconds.

1) 17.6, 17.4, 17.4
2) 17.4, 17.2, 17.5
3) 23.6, 23.7, 23.3
4) 31.9, 33.7, 31.6
5) 51.1, 51.2, 52.1
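(A quick way to check the "no significant difference" claim is to average the five runs per plugin; the sketch below just recomputes the means from the table above, nothing more:)

```java
public class CrawlTimeAverages {
    public static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    public static void main(String[] args) {
        // CPU seconds per run, copied from the table above
        double[] http11     = {17.6, 17.4, 23.6, 31.9, 51.1};
        double[] http       = {17.4, 17.2, 23.7, 33.7, 51.2};
        double[] httpclient = {17.4, 17.5, 23.3, 31.6, 52.1};
        System.out.printf("protocol-http11:     %.2f s%n", mean(http11));     // ~28.32
        System.out.printf("protocol-http:       %.2f s%n", mean(http));       // ~28.64
        System.out.printf("protocol-httpclient: %.2f s%n", mean(httpclient)); // ~28.38
    }
}
```

The three means differ by well under 2%, which supports the conclusion above.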


> protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication
> --
>
> Key: NUTCH-557
> URL: https://issues.apache.org/jira/browse/NUTCH-557
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.0.0
>Reporter: Susam Pal
>Priority: Minor
> Attachments: protocol-http11v0.1.patch
>
>
> 'protocol-http11' is a protocol plugin which supports retrieving documents 
> via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest 
> and NTLM authentication schemes for web servers as well as proxy servers.
> The user guide and other information can be found here: 
> [http://wiki.apache.org/nutch/protocol-http11]




Re: Host-level stats, ranking and recrawl

2007-09-19 Thread Chris Schneider
Andrzej, et al.,

At 9:38 PM +0200 9/17/07, Andrzej Bialecki wrote:
>I was recently reading again some scoring-related papers, and found some 
>interesting data in a paper by Baeza-Yates et al, "Crawling a Country: Better 
>Strategies than Breadth-First for Web Page Ordering" 
>(http://citeseer.ist.psu.edu/730674.html).
>
>This paper compares various strategies for prioritizing a crawl of unfetched 
>pages. Among others, it compared the OPIC scoring and a simple strategy which 
>is called "large sites first". This strategy prioritizes pages from large 
>sites and deprioritizes pages from small / medium sites. In order to measure 
>the effectiveness the authors used the value of accumulated PageRank vs. the 
>percentage of crawled pages - the strategy that ensures quick ramp-up of 
>aggregate pagerank is the best.
>
>A bit surprisingly, they found that large-sites-first wins over OPIC:
>
>"Breadth-first is close to the best strategies for the first 20-30% of pages, 
>but after that it becomes less efficient.
> The strategies batch-pagerank, larger-sites-first and OPIC have better 
> performance than the other strategies, with an advantage towards 
> larger-sites-first when the desired coverage is high. These strategies can 
> retrieve about half of the Pagerank value of their domains downloading only 
> around 20-30% of the pages."
>
>Nutch currently uses OPIC-like scoring for this, so most likely it suffers 
>from the same symptoms (the authors also mention a relatively poor OPIC 
>performance at the beginning of a crawl).
>
>Nutch doesn't currently collect any host-level statistics, so we couldn't 
>use the other strategy even if we wanted to.
>
>What if we added a host-level DB to Nutch? Arguments against this: it's an 
>additional data structure to maintain, and this adds complexity to the system; 
>it's an additional step in the workflow (-> it takes longer to complete 
>one cycle of crawling). Arguments for are the following: we could implement 
>the above scoring method ;), plus the host-level statistics are good for 
>detecting spam sites, limiting the crawl by site size, etc.
>
>We could start by implementing a tool to collect such statistics from CrawlDb 
>- this should be a trivial map-reduce job, so if anyone wants to take a crack 
>at this it would be a good exercise ... ;)
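(For illustration, the "larger-sites-first" ordering described above reduces to a simple sort: given per-host counts of known-but-unfetched pages, which is exactly what a host-level DB would provide, generate fetch batches starting from the largest sites. A hypothetical sketch, not Nutch's actual generator code:)

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class LargerSitesFirst {
    // hostname -> number of known, unfetched pages for that host
    public static List<String> order(Map<String, Integer> unfetchedByHost) {
        List<String> hosts = new ArrayList<>(unfetchedByHost.keySet());
        // Largest sites first; ties broken alphabetically for determinism
        hosts.sort(Comparator
                .comparing((String h) -> unfetchedByHost.get(h)).reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return hosts;
    }
}
```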

Stefan Groschupf developed a tool (with a little help from me) called 
DomainStats that collects such domain-level data from the crawl results (both 
crawldb and segment data). We use it to count both pages crawled in each domain 
and pages crawled that meet a "technical" threshold, since the tool can be used 
to select for various field and metadata conditions when counting pages. We use 
the results to create a "white list" of the most technical domains on which to 
focus our next crawl. Domains and sub-domains are counted separately, so you 
get separate counts for www.apache.org, apache.org, and org.
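(To make the sub-domain roll-up concrete: counting each URL once under every suffix of its hostname yields the separate www.apache.org / apache.org / org totals described above. The real DomainStats is a map-reduce job over crawldb and segment data; the class and method names below are assumptions for a toy, in-memory sketch of just the counting step:)

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DomainCounts {
    // Count each URL once under every domain suffix of its host,
    // so www.apache.org also increments apache.org and org.
    public static Map<String, Integer> count(List<String> urls) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) continue;
            String[] parts = host.split("\\.");
            for (int i = 0; i < parts.length; i++) {
                String suffix = String.join(".",
                        Arrays.copyOfRange(parts, i, parts.length));
                counts.merge(suffix, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With ["http://www.apache.org/x", "http://lucene.apache.org/y", "http://apache.org/z"] this gives apache.org a count of 3 while www.apache.org stays at 1, matching the behaviour described in NUTCH-558.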

Is there a Jira ticket open for this? If not, I could create one and submit a 
patch. We're currently using a Nutch code base that originated around 417928, 
but I think this is pretty self-contained.

Let me know,

- Schmed
-- 

Chris Schneider
Krugle, Inc.
http://www.krugle.com
[EMAIL PROTECTED]



[jira] Commented: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

2007-09-19 Thread Emmanuel Joke (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528729
 ] 

Emmanuel Joke commented on NUTCH-557:
-

Did you notice any difference in terms of performance? An improvement or a 
degradation?





Re: Scoring API issues (LONG)

2007-09-19 Thread Andrzej Bialecki

Doğacan Güney wrote:
> On 9/18/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>> Doğacan Güney wrote:
>>>
>>> public void prepareGeneratorConfig(MapWritable crawlDbMeta,
>>>     Configuration config);
>>
>> What about the segment's metadata in prepareUpdateConfig? Following your
>> idea, we would have to pass a Map ...
>
> Yeah, I think it looks good but I guess you disagree?

:) Each variant looks a bit like a hack ... The variant with Path[]
doesn't require that you retrieve all segments' metadata first, even if
your plugins don't use it.

On the other hand the variant with Map avoids the
pesky issue of doing file I/O in the plugins. So perhaps it's a little
better ...
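(For readers following along, the two variants being weighed might look roughly like the overloads below. This is a sketch of the design trade-off only, not the actual Nutch scoring API; the interface name and the stand-in types are assumptions, since the real Path, MapWritable and Configuration come from Hadoop:)

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-ins for the Hadoop/Nutch types named in the thread,
// just so the two signatures can be shown side by side (assumptions,
// not the real classes):
class Configuration { final Map<String, String> props = new HashMap<>(); }
class MapWritable extends HashMap<String, String> {}
class Path { final String p; Path(String p) { this.p = p; } }

// The two variants being debated, sketched as overloads of one hook:
interface ScoringConfigHook {
    // Variant A: pass segment paths; each plugin reads whatever metadata
    // it needs itself (file I/O happens inside the plugins).
    void prepareUpdateConfig(Path[] segments, Configuration config);

    // Variant B: read all segments' metadata up front and pass it in;
    // no I/O in the plugins, at the cost of always loading the metadata
    // even when no plugin uses it.
    void prepareUpdateConfig(Map<String, MapWritable> segmentMeta,
                             Configuration config);
}
```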



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com