[jira] Created: (NUTCH-558) Need tool to retrieve domain statistics
Need tool to retrieve domain statistics
---------------------------------------

                 Key: NUTCH-558
                 URL: https://issues.apache.org/jira/browse/NUTCH-558
             Project: Nutch
          Issue Type: New Feature
    Affects Versions: 0.9.0
            Reporter: Chris Schneider
            Assignee: Chris Schneider

Several developers have expressed interest in a tool to retrieve statistics from a crawl on a domain basis (e.g., how many pages were successfully fetched from www.apache.org vs. apache.org, where the latter total would include the former).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
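The inclusive counting the issue describes (the apache.org total includes www.apache.org) could be implemented by expanding each fetched host into all of its domain levels before counting. A minimal sketch of that expansion follows; the class and method names are illustrative only, not part of Nutch or of any proposed patch:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: expand a host name into the domain levels a
 * stats tool might count, so a page fetched from www.apache.org also
 * increments the counters for apache.org and org.
 */
public class DomainLevels {
    public static List<String> levels(String host) {
        List<String> out = new ArrayList<>();
        String rest = host;
        while (true) {
            out.add(rest);
            // Strip one leading label per iteration: www.apache.org
            // -> apache.org -> org.
            int dot = rest.indexOf('.');
            if (dot < 0) break;
            rest = rest.substring(dot + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [www.apache.org, apache.org, org]
        System.out.println(levels("www.apache.org"));
    }
}
```

A counting job would emit one increment per level, which is what makes the apache.org total a superset of its sub-domains' totals.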
[jira] Commented: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication
    [ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528854 ]

Susam Pal commented on NUTCH-557:
---------------------------------

No, there isn't any significant difference in performance. Here are the CPU
times consumed by the Nutch crawl over 15 attempts (5 per plugin), in seconds:

Attempt  protocol-http11  protocol-http  protocol-httpclient
1        17.6             17.4           17.4
2        17.4             17.2           17.5
3        23.6             23.7           23.3
4        31.9             33.7           31.6
5        51.1             51.2           52.1

> protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-557
>                 URL: https://issues.apache.org/jira/browse/NUTCH-557
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.0.0
>            Reporter: Susam Pal
>            Priority: Minor
>         Attachments: protocol-http11v0.1.patch
>
> 'protocol-http11' is a protocol plugin which supports retrieving documents
> via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest
> and NTLM authentication schemes for the web server as well as the proxy server.
> The user guide and other information can be found here:
> [http://wiki.apache.org/nutch/protocol-http11]
Re: Host-level stats, ranking and recrawl
Andrzej, et al.,

At 9:38 PM +0200 9/17/07, Andrzej Bialecki wrote:
>I was recently reading again some scoring-related papers, and found some
>interesting data in a paper by Baeza-Yates et al., "Crawling a Country:
>Better Strategies than Breadth-First for Web Page Ordering"
>(http://citeseer.ist.psu.edu/730674.html).
>
>This paper compares various strategies for prioritizing a crawl of unfetched
>pages. Among others, it compared OPIC scoring and a simple strategy called
>"large sites first". This strategy prioritizes pages from large sites and
>deprioritizes pages from small / medium sites. To measure effectiveness, the
>authors used the value of accumulated PageRank vs. the percentage of crawled
>pages - the strategy that ensures the quickest ramp-up of aggregate PageRank
>is the best.
>
>A bit surprisingly, they found that large-sites-first wins over OPIC:
>
>"Breadth-first is close to the best strategies for the first 20-30% of
>pages, but after that it becomes less efficient. The strategies
>batch-pagerank, larger-sites-first and OPIC have better performance than the
>other strategies, with an advantage towards larger-sites-first when the
>desired coverage is high. These strategies can retrieve about half of the
>Pagerank value of their domains downloading only around 20-30% of the
>pages."
>
>Nutch currently uses OPIC-like scoring for this, so most likely it suffers
>from the same symptoms (the authors also mention relatively poor OPIC
>performance at the beginning of a crawl).
>
>Nutch doesn't currently collect any host-level statistics, so we couldn't
>use the other strategy even if we wanted to.
>
>What if we added a host-level DB to Nutch? Arguments against this: it's an
>additional data structure to maintain, which adds complexity to the system,
>and it's an additional step in the workflow (-> it takes longer to complete
>one cycle of crawling).
>Arguments for are the following: we could implement the above scoring
>method ;), plus the host-level statistics are good for detecting spam sites,
>limiting the crawl by site size, etc.
>
>We could start by implementing a tool to collect such statistics from
>CrawlDb - this should be a trivial map-reduce job, so if anyone wants to
>take a crack at this it would be a good exercise ... ;)

Stefan Groschupf developed a tool (with a little help from me) called
DomainStats that collects such domain-level data from the crawl results (both
crawldb and segment data). We use it to count both pages crawled in each
domain and pages crawled that meet a "technical" threshold, since the tool
can be used to select for various field and metadata conditions when counting
pages. We use the results to create a "white list" of the most technical
domains in which to focus our next crawl. Domains and sub-domains are counted
separately, so you get separate counts for www.apache.org, apache.org, and
org.

Is there a Jira ticket open for this? If not, I could create one and submit a
patch. We're currently using a Nutch code base that originated around 417928,
but I think this is pretty self-contained.

Let me know,

- Schmed

--
Chris Schneider
Krugle, Inc.
http://www.krugle.com
[EMAIL PROTECTED]
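For readers unfamiliar with the paper's strategy, the "larger-sites-first" ordering discussed above could be approximated as follows, given the kind of per-host page counts a host-level DB (or a tool like DomainStats) would supply. This is a hedged, in-memory sketch with illustrative names; Nutch's generator does not currently work this way:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch of "larger-sites-first" ordering: generate
 * unfetched URLs from the sites with the most known pages first.
 * Assumes hostSize maps host -> number of known pages on that host.
 */
public class LargerSitesFirst {
    static String host(String url) {
        return URI.create(url).getHost();
    }

    public static List<String> order(List<String> unfetched,
                                     Map<String, Integer> hostSize) {
        List<String> out = new ArrayList<>(unfetched);
        // Larger known site size -> earlier in the fetch list; hosts
        // absent from the stats default to size 0.
        out.sort(Comparator.comparingInt(
                (String u) -> hostSize.getOrDefault(host(u), 0)).reversed());
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> size =
                Map.of("big.example", 10000, "small.example", 3);
        List<String> urls =
                List.of("http://small.example/x", "http://big.example/y");
        // The URL from the larger site comes out first.
        System.out.println(order(urls, size));
    }
}
```

In a real generator this comparison would feed into the sort key alongside the page score, rather than replace it.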
[jira] Commented: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication
    [ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528729 ]

Emmanuel Joke commented on NUTCH-557:
-------------------------------------

Did you notice any difference in terms of performance? An improvement or a
degradation?
Re: Scoring API issues (LONG)
Doğacan Güney wrote:
> On 9/18/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>> Doğacan Güney wrote:
>>> public void prepareGeneratorConfig(MapWritable crawlDbMeta, Configuration config);
>>
>> What about the segment's metadata in prepareUpdateConfig? Following your
>> idea, we would have to pass a Map ...
>
> Yeah, I think it looks good but I guess you disagree? :)

Each variant looks a bit like a hack ... The variant with Path[] doesn't
require that you retrieve all segments' metadata first, even if your plugins
don't use it. On the other hand the variant with Map avoids the pesky issue
of doing file I/O in the plugins. So perhaps it's a little better ...

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
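The two prepareUpdateConfig variants being weighed above can be made concrete with a small sketch. Placeholder types stand in for Hadoop's Configuration, MapWritable and Path; none of these interfaces exist in Nutch, they only illustrate the trade-off (plugins doing their own file I/O vs. the framework pre-reading all segment metadata):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the two API variants under discussion, with stand-in types. */
public class ScoringApiVariants {
    static class Configuration { Map<String, String> props = new HashMap<>(); }
    static class MapWritable extends HashMap<String, String> {}
    static class Path { final String p; Path(String p) { this.p = p; } }

    /**
     * Variant A: plugins receive the segment paths and read whatever
     * metadata they need themselves (file I/O happens in the plugin,
     * but nothing is read for plugins that don't use it).
     */
    interface UpdateConfigByPath {
        void prepareUpdateConfig(Path[] segments, Configuration conf);
    }

    /**
     * Variant B: the framework pre-reads every segment's metadata and
     * passes it in keyed by segment name, so plugins do no file I/O.
     */
    interface UpdateConfigByMeta {
        void prepareUpdateConfig(Map<String, MapWritable> segmentMeta,
                                 Configuration conf);
    }

    public static void main(String[] args) {
        // Variant B in use: the plugin only touches in-memory metadata.
        Map<String, MapWritable> meta = new HashMap<>();
        meta.put("segment-20070918", new MapWritable());
        UpdateConfigByMeta plugin = (m, c) ->
                c.props.put("segments.seen", Integer.toString(m.size()));
        Configuration conf = new Configuration();
        plugin.prepareUpdateConfig(meta, conf);
        System.out.println(conf.props.get("segments.seen")); // prints "1"
    }
}
```

Variant B trades an unconditional up-front read of all segment metadata for keeping plugins free of file I/O, which is the "perhaps a little better" judgement expressed above.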