On Tue, Jan 4, 2011 at 11:36 PM,  <[email protected]> wrote:
> Hello,
>
> Thank you for your response.
>
> Let me give you more detail of the issue that I have.
> First, some definitions. Let's say I have my own domain, hosted on a
> dedicated server; call it mydomain.com.
> Next, call the following subdomains: answers.mydomain.com,
> mail.mydomain.com, maps.mydomain.com, etc.
> Call the following subpages: mydomain.com/show/photos/1,
> mydomain.com/forum/id/5, etc.
>
> Given these definitions, I have observed by examining the Apache log files
> that the Google and Nutch crawlers crawled all subpages of mydomain.com.
> However, if we search Google for the keyword mydomain.com, the results
> contain all subdomains of mydomain.com but not all subpages, maybe only
> some of them. If we search Nutch for the keyword mydomain.com, it returns
> all subdomains and subpages. What I want is to avoid including all subpages
> in a search for the keyword mydomain.com. Of course, we must still see a
> subpage for keywords that appear in that subpage. This means we must not
> remove subpages from the index.
[...]

OK, the above description makes more sense, after looking
through Google results for "yahoo.com". I do not have the
results of an equivalent Nutch crawl to compare, but I do
imagine that the result would be what you describe above.

What Google seems to be doing here is some special-case
processing for when it recognises that the search term is a
primary domain name. Interestingly, while it does this for a
popular domain name, searching for more obscure domain names
does not seem to work in the same manner.

You could probably implement a similar special-case handling
of domain names. How are you searching with Nutch? Directly,
or via indexing through Solr?
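
As a rough illustration of what such special-case handling might look
like, here is a minimal Python sketch that post-filters a list of result
URLs. It is not tied to the Nutch or Solr APIs: the function names, the
regular expression used to guess that a query is a bare domain name, and
the rule "keep only host root pages" are all assumptions made for this
example.

```python
import re
from urllib.parse import urlparse

# Assumption: a query counts as a "domain query" if it is made of
# dot-separated labels ending in an alphabetic TLD, e.g. "mydomain.com".
DOMAIN_QUERY = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)


def is_domain_query(query):
    """Return True if the query looks like a bare domain name."""
    return bool(DOMAIN_QUERY.match(query.strip()))


def is_host_root(url, domain):
    """Return True if url is the root page of domain or of a subdomain."""
    # Tolerate URLs stored without a scheme.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = (parsed.hostname or "").lower()
    domain = domain.lower()
    on_domain = host == domain or host.endswith("." + domain)
    return on_domain and parsed.path in ("", "/") and not parsed.query


def filter_results(query, urls):
    """For a domain query keep only host roots; otherwise keep everything."""
    q = query.strip()
    if not is_domain_query(q):
        return list(urls)
    return [u for u in urls if is_host_root(u, q)]
```

With this, a search for "mydomain.com" would keep mydomain.com/ and
answers.mydomain.com but drop mydomain.com/show/photos/1, while any
ordinary keyword search is left untouched, so the subpages stay in the
index. Whether to filter after the search or to push the same rule into
the query itself depends on how you are searching.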

Regards,
Gora
