RE: Limiting Results From Single Domain

2019-03-20 Thread IZaBEE_Keeper
Thanks for the tip on constructing the query, it's a good starting point..

Yes, I'm very aggressive about deduping at several levels.  Duplicate pages
don't seem to be much of a problem at the moment.  This is mostly for domains
that have used keywords excessively to get rankings.. Deduping near-duplicates
and spammy pages is another topic..

When the query is 'mazda' it returns many different pages from
mazda-parts.tld before returning pages from other domains. This seems to be
because they all score higher in Solr than the next domain.. Collapsing
would help, as then there would only be 2 links for the domain's hosts, www
and tld, with the most relevant link being displayed..
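
For example (a sketch, assuming my field is named 'host' as in Markus's
snippet below), something like:

q=mazda&fq={!collapse field=host}

should keep only the top-scoring page per host in the results..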

I'll have to work on it a bit..  :)


Markus Jelsma-2 wrote
> Hello Alexis, see inline.
> 
> Regards,
> Markus 
> 
> fq={!collapse field=host}





-
Bee Keeper at IZaBEE.com
--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html


RE: Boilerpipe algorithm is not working as expected

2019-03-20 Thread Markus Jelsma
Hello Hany,

For Boilerpipe you can only select which extractor it should use. By default it 
uses ArticleExtractor, which is the best choice in most cases. However, if 
content is more spread out into separate blocks, CanolaExtractor could be a 
better choice.
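
If it helps, here is a sketch of how to switch the extractor in
nutch-site.xml (assuming the parse-tika plugin; both properties are listed
in nutch-default.xml):

<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- one of: DefaultExtractor, ArticleExtractor, CanolaExtractor -->
  <value>CanolaExtractor</value>
</property>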

Regards,
Markus
 
-Original message-
> From:hany.n...@hsbc.com.INVALID 
> Sent: Tuesday 19th March 2019 18:06
> To: user@nutch.apache.org
> Subject: Boilerpipe algorithm is not working as expected
> 
> Hello,
> 
> I am using the Boilerpipe algorithm in Nutch; however, I noticed the
> extracted content is only about 5% of the page; the main page content is
> removed.
> 
> How does Boilerpipe work, and based on which criteria does it decide to
> remove a section or not?
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 


RE: Limiting Results From Single Domain

2019-03-20 Thread Markus Jelsma
Hello Alexis, see inline.

Regards,
Markus 
 
-Original message-
> From:IZaBEE_Keeper 
> Sent: Wednesday 20th March 2019 1:28
> To: user@nutch.apache.org
> Subject: RE: Limiting Results From Single Domain
> 
> Markus Jelsma-2 wrote
> > Hello Alexis,
> > 
> > This is definitely a question for Solr. Regardless of that, your choice
> > is between Solr's Result Grouping component and the FieldCollapsing
> > filter query parser.
> > 
> > Regards,
> > Markus
> 
> Thank you..  
> 
> I kinda figured that I'd need to learn how to use the FieldCollapsing
> query parser & make it work on a per-hostname basis from the hostname
> field.. I'm not too sure how to write the function for it, but I should be
> able to figure it out..

fq={!collapse field=host}

keep in mind, for this to work, documents from the same host must be indexed into the same shard.
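
For example (a sketch, assuming SolrCloud with the default compositeId
router), you can route all pages of a host to the same shard by prefixing
the document id with the host:

id = www.example.com!http://www.example.com/some-page.html

Solr hashes the part before the '!' to pick the shard, so pages from the
same host land together and the collapse sees all of them.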
 
> I'm hopeful though that Nutch might solve some of this for me as it indexes
> another billion pages.. It seems to be less frequent with more pages added
> to the index from multiple domains..

Nutch, out of the box, can't solve this for you, unless you crawl or index
less, or get rid of a decent number of duplicates, which are usually around
when you crawl a few billion pages.
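
For the duplicates, Nutch's deduplication job can mark them in the CrawlDB
so the indexer deletes them on the next run, e.g. (a sketch, assuming a
Nutch 1.x layout; crawl/crawldb is a hypothetical path, use your own):

bin/nutch dedup crawl/crawldb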

> 
> Thanks again..  :)
> 
> 
> 
> 
> -
> Bee Keeper at IZaBEE.com
> --
> Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
>