RE: Limiting Results From Single Domain

2019-03-20 Thread IZaBEE_Keeper
Thanks for the tip on constructing the query, it's a good starting point..

Yes, I'm very aggressive about deduping at several levels.  Duplicate pages
don't seem to be that much of a problem at the moment.  This is mostly for
domains that have excessively used keywords to get rankings.. Deduping near
duplicates and spammy pages is another topic..

When the query is 'mazda' it returns many different pages from
mazda-parts.tld before returning pages from other domains. This seems to be
because they all score higher in Solr than the next domain.. Collapsing
would help, as then there would only be 2 links for the domain's hosts, www
and tld, with the most relevant link for each being displayed..
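
To make the expected effect concrete, here is a rough client-side sketch of
what collapsing on host should do to a ranked result list. It is only a
simulation, not Solr's implementation; the hosts, URLs and scores are made up
for illustration.

```python
from typing import Dict, List, Tuple

# (host, url, score) -- illustrative values only, not real query output
Doc = Tuple[str, str, float]


def collapse_by_host(ranked: List[Doc]) -> List[Doc]:
    """Keep only the highest-scoring document per host, best score first."""
    best: Dict[str, Doc] = {}
    for doc in ranked:
        host, _url, score = doc
        if host not in best or score > best[host][2]:
            best[host] = doc
    return sorted(best.values(), key=lambda d: d[2], reverse=True)


results: List[Doc] = [
    ("www.mazda-parts.tld", "http://www.mazda-parts.tld/a", 9.1),
    ("www.mazda-parts.tld", "http://www.mazda-parts.tld/b", 8.7),
    ("mazda-parts.tld", "http://mazda-parts.tld/c", 8.5),
    ("www.example.tld", "http://www.example.tld/mazda", 4.2),
]

for host, url, score in collapse_by_host(results):
    print(host, url, score)
```

With this input the two hosts of mazda-parts.tld each keep one link and the
other domain moves up to third place.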

I'll have to work on it a bit..  :)


Markus Jelsma-2 wrote
> Hello Alexis, see inline.
> 
> Regards,
> Markus 
> 
> fq={!collapse field=host}





-
Bee Keeper at IZaBEE.com
--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html


RE: Limiting Results From Single Domain

2019-03-20 Thread Markus Jelsma
Hello Alexis, see inline.

Regards,
Markus 
 
-Original message-
> From:IZaBEE_Keeper 
> Sent: Wednesday 20th March 2019 1:28
> To: user@nutch.apache.org
> Subject: RE: Limiting Results From Single Domain
> 
> Markus Jelsma-2 wrote
> > Hello Alexis,
> > 
> > This is definitely a question for Solr. Regardless of that, your choice is
> > between Solr's Result Grouping component, or FieldCollapsing filter query
> > parser.
> > 
> > Regards,
> > Markus
> 
> Thank you..  
> 
> I kinda figured that I'd need to figure out how to use the FieldCollapsing
> query parser & figure out how to make it work on a per hostname basis from
> the hostname field.. I'm not too sure on how to write the function for it
> but I should be able to figure it out..

fq={!collapse field=host}

Keep in mind that for this to work, documents with the same host must be
indexed into the same shard.
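
One possible way to get same-host documents onto the same shard, assuming a
SolrCloud collection that uses the compositeId router, is to prefix the
document id with the host and a "!" separator so the router hashes on the
host part. The sketch below only builds such an id; the helper name
routed_id is hypothetical, and whether the id can be changed like this in a
given Nutch setup is an assumption that needs checking.

```python
from urllib.parse import urlsplit


def routed_id(url: str) -> str:
    """Build a 'host!url' id so a compositeId router co-locates same-host docs.

    Hypothetical helper: assumes the collection uses the compositeId router
    and that the unique key may carry a routing prefix.
    """
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    return f"{host}!{url}"


print(routed_id("http://www.mazda-parts.tld/brakes"))
```

The prefix should match the value stored in the host field, since that is the
field being collapsed on.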
 
> I'm hopeful though that nutch might solve some of this for me as it indexes
> another billion pages.. It seems to be less frequent with more pages added
> to the index from multiple domains..

Nutch, out of the box, can't solve this for you unless you crawl or index
less, or get rid of a decent amount of duplicates, which are usually around
if you crawl a few billion pages.

> 
> Thanks again..  :)
> 
> 
> 
> 
> 


RE: Limiting Results From Single Domain

2019-03-19 Thread IZaBEE_Keeper
Markus Jelsma-2 wrote
> Hello Alexis,
> 
> This is definitely a question for Solr. Regardless of that, your choice is
> between Solr's Result Grouping component, or FieldCollapsing filter query
> parser.
> 
> Regards,
> Markus

Thank you..  

I kinda figured that I'd need to figure out how to use the FieldCollapsing
query parser & figure out how to make it work on a per hostname basis from
the hostname field.. I'm not too sure on how to write the function for it
but I should be able to figure it out..

I'm hopeful though that nutch might solve some of this for me as it indexes
another billion pages.. It seems to be less frequent with more pages added
to the index from multiple domains..

Thanks again..  :)






RE: Limiting Results From Single Domain

2019-03-18 Thread Markus Jelsma
Hello Alexis,

This is definitely a question for Solr. Regardless of that, your choice is 
between Solr's Result Grouping component, or FieldCollapsing filter query 
parser.
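
As a rough sketch of how the two options differ on the request side, the
snippet below builds one select URL per approach. The base URL and the core
name "nutch" are assumptions, as are the per-host limits; the request
parameters themselves (group, group.field, group.limit, the collapse filter
query, expand, expand.rows) are standard Solr parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port and core name to your setup.
SOLR_SELECT = "http://localhost:8983/solr/nutch/select"


def grouping_url(query: str, per_host: int = 2) -> str:
    """Result Grouping: return up to per_host documents for each host."""
    params = {
        "q": query,
        "group": "true",
        "group.field": "host",
        "group.limit": per_host,
    }
    return SOLR_SELECT + "?" + urlencode(params)


def collapse_url(query: str, extra_per_host: int = 1) -> str:
    """Collapse: one top document per host, with extras in the expanded section."""
    params = {
        "q": query,
        "fq": "{!collapse field=host}",
        "expand": "true",
        "expand.rows": extra_per_host,
    }
    return SOLR_SELECT + "?" + urlencode(params)


print(grouping_url("mazda"))
print(collapse_url("mazda"))
```

Grouping changes the response layout into groups, while collapsing keeps the
normal flat result list, so the choice also depends on what the front end
expects.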

Regards,
Markus

 
 
-Original message-
> From:IZaBEE_Keeper 
> Sent: Monday 18th March 2019 1:43
> To: user@nutch.apache.org
> Subject: Limiting Results From Single Domain
> 
> I'm not sure if this should be a Nutch question or a Solr question..
> 
> I have a large index of the WWW that is rapidly growing daily.  Some queries
> to the Solr index return result sets that include page after page from the
> same site/hostname..
> 
> I have set the host and the domain fields as stored and indexed.  I'm trying
> to figure out how to limit the number of results returned per hostname on a
> Solr query..
> 
> Solr 7.5, Nutch 1.5
> 
> The site is at izabee.com
> 
> 
> 
> 


Limiting Results From Single Domain

2019-03-17 Thread IZaBEE_Keeper
I'm not sure if this should be a Nutch question or a Solr question..

I have a large index of the WWW that is rapidly growing daily.  Some queries
to the Solr index return result sets that include page after page from the
same site/hostname..

I have set the host and the domain fields as stored and indexed.  I'm trying
to figure out how to limit the number of results returned per hostname on a
Solr query..

Solr 7.5, Nutch 1.5

The site is at izabee.com


