I search directly in Nutch, version 1-2.
I think Google gives very low scores to subpages of a domain and higher scores
to other domains for a given keyword.
This must be so because otherwise, if mydomain.com has, let's say, 2000
subpages, then in the search results for the keyword mydomain.com the next 200
pages would all be subpages of mydomain.com.
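
To make the idea concrete, here is a rough sketch of the kind of rule I am guessing at. It is not Nutch or Google code: the function names, the (url, score) input format, and the penalty value are all my own assumptions for illustration.

```python
from urllib.parse import urlsplit


def host_and_path(url):
    """Split a URL (scheme optional) into a lower-cased host and its path."""
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    return (parts.hostname or "").lower(), parts.path


def demote_subpages(hits, query, penalty=0.1):
    """hits: list of (url, score) pairs. Returns a new list, best score first.

    Hypothetical rule: when the query is a domain name, scale down the score
    of every hit on that domain whose path is not the root page.
    """
    domain = query.strip().lower()
    rescored = []
    for url, score in hits:
        host, path = host_and_path(url)
        on_domain = host == domain or host.endswith("." + domain)
        if on_domain and path not in ("", "/"):
            score *= penalty  # subpage of the queried domain: push it down
        rescored.append((url, score))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

With such a rule the root page and the subdomains stay on top for the keyword mydomain.com, while the subpages keep their normal scores for any other keyword.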

If someone could direct me to the part of the source code where Nutch assigns
scores to pages, I can take a look at it.

To test this issue, you can index a domain with a few subpages and compare the
search results with those from Google.

Thanks.
Alex.

-----Original Message-----
From: Gora Mohanty <[email protected]>
To: user <[email protected]>
Sent: Wed, Jan 5, 2011 4:10 am
Subject: Re: unnecessary results in search


On Tue, Jan 4, 2011 at 11:36 PM,  <[email protected]> wrote:
> Hello,
>
> Thank you for your response.
>
> Let me give you more detail on the issue that I have.
> First, some definitions. Let's say I have my own domain, hosted on a
> dedicated server, and call it mydomain.com.
> Next, call the following subdomains: answers.mydomain.com,
> mail.mydomain.com, maps.mydomain.com, etc.
> Call the following subpages: mydomain.com/show/photos/1,
> mydomain.com/forum/id/5, etc.
>
> Given these definitions, I have observed by examining the Apache log
> files that the Google and Nutch crawlers crawled all subpages of
> mydomain.com.
> However, if we search in Google for the keyword mydomain.com, the results
> include all subdomains of mydomain.com but not all subpages, maybe only
> some of them. If we search in Nutch for the keyword mydomain.com, it
> returns all subdomains and subpages. My concern is not to include all
> subpages in a search for the keyword mydomain.com. Of course, we must
> still see subpages for keywords that appear in that subpage. This means
> we must not remove subpages from the index.

[...]



OK, the above description makes more sense, after looking
through Google results for "yahoo.com". I do not have the
results of an equivalent Nutch crawl to compare, but I do
imagine that the result would be what you describe above.

What Google seems to be doing here is some special-case
processing for when it recognises that the search is a primary
domain. Interestingly, while it does this for a popular domain
name, searching for more obscure domain names does not
seem to work in the same manner.

You could probably implement a similar special-case handling
of domain names. How are you searching with Nutch? Directly,
or via indexing through Solr?
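
As a minimal sketch of what such special-casing could look like, here is one way to post-process an already ranked list of result URLs. This is not Nutch or Solr code: the regular expression, the function names, and the one-hit-per-host default are assumptions made only for illustration.

```python
import re
from urllib.parse import urlsplit

# Rough check for "the query is a bare domain name".
# This is an assumption for the sketch, not a full hostname validator.
DOMAIN_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$")


def host_of(url):
    """Return the lower-cased host of a URL (scheme optional)."""
    if "://" not in url:
        url = "http://" + url
    return (urlsplit(url).hostname or "").lower()


def limit_hits_per_host(urls, query, max_per_host=1):
    """urls: result URLs in ranked order. Returns the kept URLs, same order.

    Only domain-like queries are special-cased; any other query is
    returned unchanged.
    """
    if not DOMAIN_RE.match(query.strip().lower()):
        return list(urls)
    seen = {}
    kept = []
    for url in urls:
        host = host_of(url)
        seen[host] = seen.get(host, 0) + 1
        if seen[host] <= max_per_host:
            kept.append(url)  # first hit(s) for this host survive
    return kept
```

Because each subdomain is a separate host, answers.mydomain.com and mail.mydomain.com would still be listed next to mydomain.com, while the extra subpages are only hidden for the domain-name query and stay in the index for ordinary keyword searches.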



Regards,

Gora