Hello,

Thank you for your response.

Let me give you more detail about the issue I have.
First, some definitions. Let's say I have my own domain, hosted on a dedicated
server; call it mydomain.com.
Next, call the following subdomains: answers.mydomain.com, mail.mydomain.com,
maps.mydomain.com, etc.
Call the following subpages: mydomain.com/show/photos/1,
mydomain.com/forum/id/5, etc.

Given these definitions, I have observed by examining the Apache log files that
both the Google and Nutch crawlers crawled all subpages of mydomain.com.
However, if we search Google for the keyword mydomain.com, the results contain
all subdomains of mydomain.com but only some of the subpages, not all of them.
If we search Nutch for the keyword mydomain.com, the results contain all
subdomains and all subpages. What I want is for a search for the keyword
mydomain.com not to return every subpage. Of course, a subpage must still
appear for keywords that occur in that subpage, so the subpages must not be
removed from the index.
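
To make the request concrete, here is a rough Python sketch of the behaviour I am after. It is not Nutch code: the function name limit_hits_per_host, the max_per_host parameter, and otherdomain.com are made up for illustration, and I am assuming the search already returns URLs in rank order. Only the displayed list is trimmed; nothing is removed from the index.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def limit_hits_per_host(ranked_urls, max_per_host=2):
    """Keep at most max_per_host results per host, preserving rank order."""
    seen = defaultdict(int)
    kept = []
    for url in ranked_urls:
        # urlsplit needs a scheme to find the host, so add one if missing.
        host = urlsplit(url if "://" in url else "http://" + url).hostname
        if seen[host] < max_per_host:
            seen[host] += 1
            kept.append(url)
    return kept


# Hypothetical ranked hits for the keyword mydomain.com.
hits = [
    "mydomain.com/",
    "mydomain.com/show/photos/1",
    "mydomain.com/forum/id/5",
    "answers.mydomain.com/",
    "otherdomain.com/",
]
print(limit_hits_per_host(hits, max_per_host=1))
```

With max_per_host=1 the subpages of mydomain.com drop out of this result list, while the subdomain and the other domain move up.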

I hope this gives you more detail of the issue that I have.

Thanks.
Alex.

-----Original Message-----
From: Gora Mohanty <[email protected]>
To: user <[email protected]>
Sent: Tue, Jan 4, 2011 3:28 am
Subject: Re: unnecessary results in search


On Tue, Jan 4, 2011 at 5:40 AM,  <[email protected]> wrote:

> Hello,
>
> I used nutch-1.2 to index a few domains. I noticed that nutch correctly
> crawled all sub-pages of domains. By sub-pages I mean the following: for
> example, for a domain mydomain.com, all links inside it like
> mydomain.com/show/photos/1 etc. I also noticed in our apache logs that
> google-bot also crawled all sub-pages.
> However, in a search for mydomain.com google gives mydomain.com on the first
> page and almost no subpages, but nutch gives all subpages. If a domain has,
> let's say, 200 sub-pages and we display 10 results per page, then it would
> take us 20 pages to go forward to see results from other domains. In
> contrast, google displays results from other domains in the second place.

[...]



It is not entirely clear what you want:
* If your goal is to only crawl to a certain depth on a domain, you can
  use the -depth argument for the Nutch crawl, or use the -topN option
  to specify the max. number of pages to retrieve.
* Can you give an actual example of what you are searching for?
  It is difficult to understand your description above. E.g., searching
  Google for "yahoo.com" returns many, many links from yahoo.com.
* If you mean that a search with any query string returns different
  results between Google and Nutch, that could be due to many
  reasons. In both cases, the returned pages are ranked by relevancy,
  but the algorithm is different. Also, Google has probably indexed many
  more sites than your Nutch crawl.
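
As a rough illustration of the first point, here is a toy Python simulation of how a depth limit and a per-round page cap bound a crawl. It is not Nutch code and does not claim to match Nutch's internals: the function name bounded_crawl, the sample link graph, and taking pages in alphabetical order rather than by score are all assumptions made for this sketch.

```python
def bounded_crawl(links, seeds, depth, top_n):
    """Fetch in rounds: at most `depth` rounds, at most `top_n` pages per round."""
    fetched = []
    seen = set(seeds)
    frontier = sorted(seeds)
    for _ in range(depth):
        batch = frontier[:top_n]  # cap the pages fetched in this round
        if not batch:
            break
        fetched.extend(batch)
        next_frontier = []
        for page in batch:
            for target in links.get(page, []):
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        # Pages cut off by top_n stay queued ahead of newly found links.
        frontier = frontier[top_n:] + sorted(next_frontier)
    return fetched


# Hypothetical link graph for mydomain.com.
links = {
    "mydomain.com/": ["mydomain.com/show/photos/1", "mydomain.com/forum/id/5"],
    "mydomain.com/show/photos/1": ["mydomain.com/show/photos/2"],
}
print(bounded_crawl(links, ["mydomain.com/"], depth=2, top_n=1))
```

A small depth keeps the crawl near the seed page, and a small top_n keeps each round short, so together they limit how many sub-pages of one domain end up in the index.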



Regards,
Gora




 
