I search directly in Nutch version 1.2. I think Google gives very low scores to subpages of a domain and higher scores to other domains for a given keyword. It must do so, because otherwise, if mydomain.com has, let's say, 2000 subpages, the next 200 pages of search results for the keyword mydomain.com would all be subpages of mydomain.com.
If someone could direct me to the part of the source code where Nutch assigns scores to pages, I can take a look at it. To test this issue, you can index a domain with a few subpages and compare the search results with Google's.

Thanks.
Alex.

-----Original Message-----
From: Gora Mohanty <[email protected]>
To: user <[email protected]>
Sent: Wed, Jan 5, 2011 4:10 am
Subject: Re: unnecessary results in search

On Tue, Jan 4, 2011 at 11:36 PM, <[email protected]> wrote:
> Hello,
>
> Thank you for your response.
>
> Let me give you more detail on the issue that I have.
> First, definitions. Let's say I have my own domain that I host on a dedicated server, and call it mydomain.com.
> Next, call the following subdomains: answers.mydomain.com, mail.mydomain.com, maps.mydomain.com, etc.
> Call the following subpages: mydomain.com/show/photos/1,
> mydomain.com/forum/id/5, etc.
>
> Having these definitions, I have observed by examining the Apache log files that the Google and Nutch crawlers crawled all subpages of mydomain.com.
> However, if we search in Google for the keyword mydomain.com, the results contain all subdomains of mydomain.com but not all subpages, maybe only some of them. If we search in Nutch for the keyword mydomain.com, it gives all subdomains and all subpages. My concern is not to include all subpages in a search for the keyword mydomain.com. Of course, we must still see a subpage for keywords that appear in that subpage. This means we must not remove subpages from the index.
[...]

OK, the above description makes more sense after looking through the Google results for "yahoo.com". I do not have the results of an equivalent Nutch crawl to compare, but I do imagine that the result would be what you describe above.

What Google seems to be doing here is some special-case processing for when it recognises that the search term is a primary domain. Interestingly, while it does this for a popular domain name, searching for more obscure domain names does not seem to work in the same manner.

You could probably implement a similar special-case handling of domain names. How are you searching with Nutch? Directly, or via indexing through Solr?

Regards,
Gora
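A minimal sketch of the kind of special-case handling suggested above, assuming it is applied as a post-processing step to an already ranked list of result URLs. It is not Nutch or Solr code, and nothing here reflects what Google actually does. The names `is_domain_query` and `collapse_by_host`, and the "looks like a domain" regular expression, are hypothetical and only for illustration. For a query that looks like a domain name, the sketch keeps one URL per host (the one with the shortest path), so subdomains stay in the results while the many subpages are collapsed. Ordinary keyword queries are returned unchanged.

```python
import re
from urllib.parse import urlparse

# Hypothetical heuristic: a query "looks like" a domain if it is a single
# token of dot-separated labels ending in an alphabetic TLD.
DOMAIN_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)


def is_domain_query(query):
    """Return True if the whole query looks like a domain name."""
    return bool(DOMAIN_RE.match(query.strip()))


def collapse_by_host(query, ranked_urls):
    """Collapse results to one URL per host, but only for domain queries.

    ranked_urls is assumed to be ordered best-first by the search engine.
    For each host, the URL with the shortest path is kept, and hosts stay
    in the order in which they first appeared in the ranking.
    """
    if not is_domain_query(query):
        return list(ranked_urls)

    best = {}  # host -> (path_length, first_rank, url)
    for rank, url in enumerate(ranked_urls):
        # Tolerate URLs given without a scheme, e.g. "mydomain.com/forum".
        parsed = urlparse(url if "://" in url else "http://" + url)
        host = (parsed.hostname or "").lower()
        path_len = len(parsed.path.strip("/"))
        if host not in best:
            best[host] = (path_len, rank, url)
        elif path_len < best[host][0]:
            # Shorter path wins, but the host keeps its original position.
            best[host] = (path_len, best[host][1], url)

    return [url for _, _, url in sorted(best.values(), key=lambda t: t[1])]


results = [
    "http://mydomain.com/show/photos/1",
    "http://mydomain.com/",
    "http://answers.mydomain.com/",
    "http://mydomain.com/forum/id/5",
    "http://mail.mydomain.com/inbox",
]
print(collapse_by_host("mydomain.com", results))
print(collapse_by_host("photos", results))
```

With the sample list, the domain query keeps one entry each for mydomain.com, answers.mydomain.com and mail.mydomain.com, while the keyword query "photos" leaves the list untouched, so subpages remain findable by their own keywords.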

