Joseph - thank you very much!

This is exactly the crap we are looking for. Now we can train our classifiers 
to detect at least these bastards.

But how would partitioning by IP really help if they don't all point to the 
same IP? All hosts I manually checked are indeed on the same subnet, but many 
have a different 4th octet.
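
If the partitioner keyed on the /24 subnet rather than the exact address, 
those hosts would all land in the same fetch queue. A rough Java sketch of 
the idea (just an illustration, not Nutch's actual partitioner; the class 
and method names are made up):

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    // Sketch: derive a fetch-queue key from the /24 subnet of a host's
    // resolved IPv4 address, so hosts differing only in the 4th octet
    // are grouped (and throttled) together.
    public class SubnetKey {
        static String partitionKey(String host) throws UnknownHostException {
            byte[] a = InetAddress.getByName(host).getAddress();
            // IPv4 only in this sketch; keep the first three octets.
            return (a[0] & 0xff) + "." + (a[1] & 0xff) + "." + (a[2] & 0xff);
        }

        public static void main(String[] args) throws UnknownHostException {
            // Hosts on the same /24 map to the same key.
            System.out.println(partitionKey("e2qya.hjsjp.com"));
            System.out.println(partitionKey("ml081.hjsjp.com"));
        }
    }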

Regards,
Markus

 
 
-----Original message-----
> From: Joseph Naegele <jnaeg...@grierforensics.com>
> Sent: Friday 13th January 2017 15:11
> To: user@nutch.apache.org
> Subject: RE: General question about subdomains
> 
> Markus,
> 
> Interestingly enough, we do use OpenDNS to filter undesirable content, 
> including parked content. In this case, however, the domain in question isn't 
> tagged in OpenDNS and is therefore "allowed", along with all its subdomains.
> 
> This particular domain is "hjsjp.com". It's Chinese-owned, and the URLs all 
> appear to point to the same link-filled content, possibly a domain park site. 
> Example URLs:
> - http://e2qya.hjsjp.com/
> - http://ml081.hjsjp.com/xzudb
> - http://www.ch8yu.hjsjp.com/1805/8371.html
> 
> As Julien mentioned, partitioning and fetching by IP would help.
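> 
> If I remember the property names right, that's controlled by 
> partition.url.mode and fetcher.queue.mode, both of which accept byHost, 
> byDomain or byIP, e.g. in nutch-site.xml:
> 
>   <property>
>     <name>partition.url.mode</name>
>     <value>byIP</value>
>   </property>
>   <property>
>     <name>fetcher.queue.mode</name>
>     <value>byIP</value>
>   </property>
> 
> Note that byIP requires resolving every host during partitioning, which 
> adds DNS load, and it only groups hosts that resolve to the same address.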
> 
> ---
> Joe Naegele
> Grier Forensics
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Wednesday, January 11, 2017 9:43 AM
> To: user@nutch.apache.org
> Subject: RE: General question about subdomains
> 
> Hello Joseph,
> 
> The only feasible method, as I see it, is to detect these kinds of spam 
> sites, as well as domain park sites, since they produce lots of garbage 
> too. Once you detect them, you can choose not to follow their outlinks, or 
> mark them in a domain-blacklist urlfilter.
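> 
> For example, a single rule in conf/regex-urlfilter.txt is enough to reject 
> a parent domain together with all of its subdomains (example.com below is 
> just a placeholder):
> 
>   # Reject every host under the blacklisted parent domain
>   -^https?://([^/.]+\.)*example\.com/
> 
> A domain-blacklist style filter takes one domain per line instead, which 
> is easier to maintain once the list grows.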
> 
> We have seen examples like these as well, and they caused similar problems, 
> but we lost track of them; those domains don't exist anymore. Could you 
> send me the domains that are causing you trouble? We could use them for 
> our classification training sets.
> 
> Regards,
> Markus
>  
> -----Original message-----
> > From: Joseph Naegele <jnaeg...@grierforensics.com>
> > Sent: Wednesday 11th January 2017 15:21
> > To: user@nutch.apache.org
> > Subject: General question about subdomains
> > 
> > This is more of a general question, not Nutch-specific:
> > 
> > Our crawler discovered some URLs pointing to a number of subdomains of a 
> > Chinese-owned spammy domain. It then proceeded to discover millions more 
> > URLs pointing to other subdomains (hosts) of the same domain. Most of the 
> > names appear to be gibberish, but they do have robots.txt files and the 
> > URLs appear to serve HTML. A few days later I found that our crawler 
> > machine was no longer able to resolve these subdomains, as if it were 
> > blocked by their DNS servers, which significantly slowed our crawl (due to 
> > DNS timeouts). This led me to investigate, and I found that 40% of all our 
> > known URLs were hosts on this same parent domain.
> > 
> > Since the hosts are actually different, is Nutch able to prevent this 
> > trap-like behavior? Are there any established methods for preventing 
> > similar issues in web crawlers?
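> > 
> > One partial mitigation that looks applicable is the generator's 
> > per-domain cap, which bounds how many URLs from a single domain enter 
> > each segment. A nutch-site.xml sketch, assuming the stock property names:
> > 
> >   <property>
> >     <name>generate.count.mode</name>
> >     <value>domain</value>
> >   </property>
> >   <property>
> >     <name>generate.max.count</name>
> >     <value>100</value>
> >   </property>
> > 
> > That bounds the per-cycle damage, but it doesn't stop new subdomains from 
> > being discovered, so it isn't a complete answer.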
> > 
> > Thanks
> > 
> > ---
> > Joe Naegele
> > Grier Forensics
> > 