On Tue, Jan 4, 2011 at 5:40 AM, <[email protected]> wrote:
> Hello,
>
> I used Nutch 1.2 to index a few domains. I noticed that Nutch correctly
> crawled all sub-pages of the domains. By sub-pages I mean, for example,
> for a domain mydomain.com, all links inside it such as
> mydomain.com/show/photos/1 and so on. I also noticed in our Apache logs
> that Googlebot crawled all the sub-pages as well.
> However, when searching for mydomain.com, Google returns mydomain.com on
> the first page and almost no sub-pages, whereas Nutch returns all the
> sub-pages. If a domain has, let's say, 200 sub-pages and we display 10
> results per page, then it would take us 10 pages of paging forward to see
> results from other domains. By contrast, Google displays results from
> other domains in second place. [...]
It is not entirely clear what you want:

* If your goal is only to crawl to a certain depth on a domain, you can use
  the -depth argument to the Nutch crawl command, or use the -topN option
  to specify the maximum number of pages to retrieve.
* Can you give an actual example of what you are searching for? It is
  difficult to understand your description above. E.g., searching Google
  for "yahoo.com" returns many, many links from yahoo.com.
* If you mean that a search with any query string returns different results
  between Google and Nutch, that could be due to many reasons. In both
  cases, the returned pages are ranked by relevancy, but the algorithms are
  different. Also, Google has probably indexed many more sites than your
  Nutch crawl has.

Regards,
Gora
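To illustrate the first point, a typical one-shot crawl with the Nutch 1.x
"crawl" command looks something like the sketch below. The seed directory
(urls) and output directory (crawl.out) names are placeholders; adjust them
to your setup.

```shell
# Nutch 1.x one-shot crawl:
#   urls        directory containing your seed URL list(s)
#   -dir        where the crawldb/segments/index are written
#   -depth 3    follow links at most 3 hops from the seed URLs
#   -topN 50    fetch at most the 50 top-scoring pages per round
bin/nutch crawl urls -dir crawl.out -depth 3 -topN 50
```

With -depth 1, only the seed pages themselves are fetched, so deep
sub-pages such as mydomain.com/show/photos/1 would never enter the index;
a larger depth lets the crawler reach them.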

