RE: Limiting Results From Single Domain
Thanks for the tip on constructing the query; it's a good starting point.

Yes, I'm very aggressive about deduping at several levels. Duplicate pages don't seem to be much of a problem at the moment. This is mostly about domains that have used keywords excessively to get rankings. Deduping near-duplicates and spammy pages is another topic.

When the query is 'mazda', it returns many different pages from mazda-parts.tld before returning pages from other domains. This seems to be because they all score higher in Solr than the next domain. Collapsing would help, as there would then be only 2 links for the domain's hosts (www and tld), with the most relevant link being displayed.

I'll have to work on it a bit. :)

Markus Jelsma-2 wrote
> Hello Alexis, see inline.
>
> Regards,
> Markus
>
> fq={!collapse field=host}

- Bee Keeper at IZaBEE.com
--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
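To make the intended effect concrete, here is a minimal Python sketch of what collapsing on the host field does to a ranked result list. It is not Solr's implementation; the function name collapse_by_host, the URLs, and the scores are made up for illustration, and mazda-parts.tld is the placeholder host from the message above.

```python
def collapse_by_host(ranked_docs):
    """Keep only the first (highest-ranked) document for each host.

    Illustrative stand-in for what a collapse on the host field achieves;
    this is not how Solr implements it internally.
    """
    seen_hosts = set()
    kept = []
    for doc in ranked_docs:
        if doc["host"] not in seen_hosts:
            seen_hosts.add(doc["host"])
            kept.append(doc)
    return kept


# Hypothetical results for the query 'mazda', already sorted by score.
ranked = [
    {"host": "www.mazda-parts.tld", "url": "http://www.mazda-parts.tld/a", "score": 9.4},
    {"host": "www.mazda-parts.tld", "url": "http://www.mazda-parts.tld/b", "score": 9.1},
    {"host": "mazda-parts.tld", "url": "http://mazda-parts.tld/c", "score": 8.8},
    {"host": "www.mazda-parts.tld", "url": "http://www.mazda-parts.tld/d", "score": 8.5},
    {"host": "www.example.org", "url": "http://www.example.org/mazda", "score": 7.9},
]

collapsed = collapse_by_host(ranked)
for doc in collapsed:
    print(doc["host"], doc["url"])
```

In this made-up list the www host and the bare host each keep their best page, and the first page from another domain moves up from fifth place to third.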
RE: Limiting Results From Single Domain
Hello Alexis, see inline.

Regards,
Markus

-Original message-
> From: IZaBEE_Keeper
> Sent: Wednesday 20th March 2019 1:28
> To: user@nutch.apache.org
> Subject: RE: Limiting Results From Single Domain
>
> Markus Jelsma-2 wrote
> > Hello Alexis,
> >
> > This is definitely a question for Solr. Regardless of that, your choice is
> > between Solr's Result Grouping component, or FieldCollapsing filter query
> > parser.
> >
> > Regards,
> > Markus
>
> Thank you.
>
> I kinda figured that I'd need to figure out how to use the FieldCollapsing
> query parser & figure out how to make it work on a per hostname basis from
> the hostname field. I'm not too sure on how to write the function for it
> but I should be able to figure it out.

fq={!collapse field=host}

Keep in mind that for this to work, documents with the same host must be indexed into the same shard.

> I'm hopeful though that nutch might solve some of this for me as it indexes
> another billion pages. It seems to be less frequent with more pages added
> to the index from multiple domains.

Nutch, out of the box, can't solve this for you unless you crawl or index less, or get rid of a decent amount of duplicates, which are usually around if you crawl a few billion pages.

> Thanks again. :)
>
> -
> Bee Keeper at IZaBEE.com
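As a rough sketch of how the filter query above could be attached to a search request, the Python below builds the request parameters and encodes them. The function name collapse_params and its defaults are hypothetical. The optional expand and expand.rows parameters belong to Solr's Expand component, which is not mentioned in this thread; they are included on the assumption that showing a few more pages per host might be wanted.

```python
from urllib.parse import urlencode


def collapse_params(user_query, host_field="host", rows=10, expand_rows=0):
    """Build Solr request parameters that collapse results to one document per host.

    Assumes host_field is a single-valued field. The helper name and
    defaults are illustrative only.
    """
    params = [
        ("q", user_query),
        ("rows", str(rows)),
        # The filter query suggested in the thread.
        ("fq", "{!collapse field=%s}" % host_field),
    ]
    if expand_rows > 0:
        # Optional: ask the Expand component for extra documents per
        # collapsed host, returned in a separate section of the response.
        params.append(("expand", "true"))
        params.append(("expand.rows", str(expand_rows)))
    return params


params = collapse_params("mazda", expand_rows=1)
query_string = urlencode(params)  # percent-encodes the braces, '!' and '='
print(query_string)
```

The encoded string would be appended to the collection's select handler URL. The sharding caveat above still applies: the collapse happens per shard, so documents sharing a host need to live on the same shard.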
RE: Limiting Results From Single Domain
Markus Jelsma-2 wrote
> Hello Alexis,
>
> This is definitely a question for Solr. Regardless of that, your choice is
> between Solr's Result Grouping component, or FieldCollapsing filter query
> parser.
>
> Regards,
> Markus

Thank you.

I kinda figured that I'd need to figure out how to use the FieldCollapsing query parser & figure out how to make it work on a per hostname basis from the hostname field. I'm not too sure on how to write the function for it but I should be able to figure it out.

I'm hopeful though that Nutch might solve some of this for me as it indexes another billion pages. It seems to be less frequent with more pages added to the index from multiple domains.

Thanks again. :)

- Bee Keeper at IZaBEE.com
RE: Limiting Results From Single Domain
Hello Alexis,

This is definitely a question for Solr. Regardless of that, your choice is between Solr's Result Grouping component, or FieldCollapsing filter query parser.

Regards,
Markus

-Original message-
> From: IZaBEE_Keeper
> Sent: Monday 18th March 2019 1:43
> To: user@nutch.apache.org
> Subject: Limiting Results From Single Domain
>
> I'm not sure if this should be a Nutch question or a Solr question.
>
> I have a large index of the WWW that is rapidly growing daily. Some queries
> to the Solr index return result sets that include page after page from the
> same site/hostname.
>
> I have set the host and the domain fields as stored and indexed. I'm trying
> to figure out how to limit the number of results returned per hostname on a
> Solr query.
>
> Solr 7.5, Nutch 1.5
>
> The site is at izabee.com
>
> -
> Bee Keeper at IZaBEE.com
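For the first of the two options, Result Grouping, a request might be built along these lines. This is a sketch only: the helper name grouping_params and the limit of 2 pages per host are assumptions, not something specified in the thread. The parameters group, group.field, and group.limit are standard Solr Result Grouping parameters.

```python
from urllib.parse import urlencode


def grouping_params(user_query, host_field="host", per_host=2, rows=10):
    """Build Solr request parameters that group results by host.

    The helper name and defaults are illustrative only.
    """
    return {
        "q": user_query,
        # With grouping enabled, rows counts groups (hosts), not documents.
        "rows": str(rows),
        "group": "true",
        "group.field": host_field,
        # Maximum number of documents returned inside each host group.
        "group.limit": str(per_host),
    }


params = grouping_params("mazda")
query_string = urlencode(params)
print(query_string)
```

Grouped responses are nested by host rather than returned as a flat document list, so the search front end has to read them differently from a plain or collapsed result set.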
Limiting Results From Single Domain
I'm not sure if this should be a Nutch question or a Solr question.

I have a large index of the WWW that is rapidly growing daily. Some queries to the Solr index return result sets that include page after page from the same site/hostname.

I have set the host and the domain fields as stored and indexed. I'm trying to figure out how to limit the number of results returned per hostname on a Solr query.

Solr 7.5, Nutch 1.5

The site is at izabee.com

- Bee Keeper at IZaBEE.com