On Tue, Jan 4, 2011 at 11:36 PM, <[email protected]> wrote:
> Hello,
>
> Thank you for your response.
>
> Let me give you more detail of the issue that I have.
> First, definitions. Let's say I have my own domain that I host on a dedicated
> server, and call it mydomain.com
> Next, call subdomains the following: answers.mydomain.com, mail.mydomain.com,
> maps.mydomain.com, etc.
> Call subpages the following: mydomain.com/show/photos/1,
> mydomain.com/forum/id/5, etc.
>
> Having these definitions, I have observed by examining Apache log files that
> the Google and Nutch crawlers crawled all subpages of mydomain.com
> However, if we search in Google for the keyword mydomain.com, the results
> contain all subdomains of mydomain.com but not all subpages, maybe some of
> them. If we search in Nutch for the keyword mydomain.com, it gives all
> subdomains and subpages. My concern was not to include all subpages in a
> search for the keyword mydomain.com. Of course, we must see subpages for
> keywords that are in that subpage. This means we must not remove subpages
> from the index.
[...]

OK, the above description makes more sense, after looking through
Google results for "yahoo.com". I do not have the results of an
equivalent Nutch crawl to compare, but I do imagine that the result
would be what you describe above.

What Google seems to be doing here is some special-case processing
for when it recognises that the search is a primary domain.
Interestingly, while it does this for a popular domain name, searching
for more obscure domain names does not seem to work in the same
manner.

You could probably implement a similar special-case handling of
domain names. How are you searching with Nutch? Directly, or
via indexing through Solr?

Regards,
Gora
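
P.S. A rough sketch of what such special-case handling might look like, as a post-processing step on the list of result URLs. This is only an illustration in Python, not anything that Nutch or Solr provides: the function names, the regular expression, and the rule "site roots first" are all my own assumptions about the behaviour you want.

```python
import re
from urllib.parse import urlsplit

# Hypothetical test for "the query looks like a bare domain name".
DOMAIN_RE = re.compile(r"^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$")


def is_domain_query(query):
    """Return True if the whole query is a single domain-like token."""
    return bool(DOMAIN_RE.match(query.strip().lower()))


def is_site_root(url, domain):
    """Return True if url is the front page of domain or of a subdomain."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    on_domain = host == domain or host.endswith("." + domain)
    return on_domain and parts.path in ("", "/") and not parts.query


def rerank(query, urls):
    """For domain queries, move site roots ahead of subpages.

    Other queries are returned unchanged. Nothing is dropped, so
    subpages still appear for ordinary keyword searches.
    """
    if not is_domain_query(query):
        return list(urls)
    domain = query.strip().lower()
    # sorted() is stable, so the original ranking is kept within each group.
    return sorted(urls, key=lambda u: not is_site_root(u, domain))


results = [
    "http://mydomain.com/show/photos/1",
    "http://mail.mydomain.com/",
    "http://mydomain.com/forum/id/5",
    "http://mydomain.com",
]
print(rerank("mydomain.com", results))
```

Where this would hook in depends on whether you search Nutch directly or through Solr, hence the question above.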

