Thanks for the tip on constructing the query, it's a good starting point.

Yes, I'm very aggressive about deduping at several levels. Duplicate pages
don't seem to be that much of a problem at the moment. My question is mostly
about domains that have used keywords excessively to get rankings. Deduping
near duplicates and spammy pages is another topic.

When the query is 'mazda' it returns many different pages from
mazda-parts.tld before returning pages from other domains. This seems to be
because they all score higher in Solr than the best page from the next
domain. Collapsing would help, as there would then be only 2 links for the
domain's hosts (www.mazda-parts.tld and mazda-parts.tld), with the most
relevant link being displayed for each.
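
A minimal sketch of how the collapsed request could be built from a client,
assuming the index has a single-valued `host` field and using the collapse
filter quoted below. The function name, the `rows` default, and the `/select`
path are illustrative assumptions, not something taken from this thread.

```python
from urllib.parse import urlencode


def collapse_params(user_query, collapse_field="host", rows=10):
    """Build Solr request parameters that keep one document per host.

    The names and defaults here are hypothetical; only the
    {!collapse field=...} filter syntax comes from the quoted reply.
    """
    return {
        "q": user_query,
        # The collapsing filter keeps the top-scoring document per group.
        "fq": "{!collapse field=%s}" % collapse_field,
        "rows": rows,
    }


# URL-encode the parameters so they can be appended to a select handler URL.
query_string = urlencode(collapse_params("mazda"))
print("/select?" + query_string)
```

With this filter each host contributes only its best-scoring page, so
www.mazda-parts.tld and mazda-parts.tld would each show up once instead of
filling the first page of results.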

I'll have to work on it a bit. :)


Markus Jelsma-2 wrote
> Hello Alexis, see inline.
> 
> Regards,
> Markus 
> 
> fq={!collapse field=host}





-----
Bee Keeper at IZaBEE.com
--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
