Joseph - thank you very much. We can use the data; otherwise we would just stumble upon it ourselves once again.
The work is, unfortunately, not FOSS. It is one of the few things that lets us bear the costs of continuous R&D. You'd have to contact us off-list for further inquiries. In any case, there may be more list subscribers interested in the set you have, so please share it with the list if you can.

Thanks,
Markus

-----Original message-----
> From:Joseph Naegele <[email protected]>
> Sent: Thursday 9th February 2017 23:39
> To: [email protected]
> Subject: RE: General question about subdomains
>
> Thanks Markus. I'll put together a list shortly. Is your classifier plugin
> open-source or available to share? It sounds interesting and very useful.
>
> ---
> Joe Naegele
> Grier Forensics
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Thursday, February 09, 2017 3:36 AM
> To: [email protected]
> Subject: RE: General question about subdomains
>
> Hello Joseph,
>
> My colleague has not yet started to build a model for these crappy pages,
> but would still like to. We are going to run into this again soon enough,
> so any set of distinct crap sites you have would be most helpful.
> Preferably sites that are not closely interconnected, so we can model and
> evaluate nicely at the same time.
>
> The classifier is a sub-project of our custom parser/stuff detector and
> extractor, packaged as a Nutch parser plugin. It does hierarchical
> classification, first detecting the host type, then the page type. Models
> are built using feature selection via a genetic algorithm, to make them
> perform well and keep them as lightweight as possible. A crap/spam host
> type is one we'd love to add.
>
> Any set, even a small one, will do.
>
> Thanks,
> Markus
>
> -----Original message-----
> > From:Joseph Naegele <[email protected]>
> > Sent: Wednesday 8th February 2017 18:20
> > To: [email protected]
> > Subject: RE: General question about subdomains
> >
> > Markus,
> >
> > The example URLs I sent all resolve to the same IP address. This isn't
> > always the case, however, so you're correct that partitioning by IP won't
> > help us. Additionally, we'd like to avoid resolving the IPs of these
> > domains in the first place, since most of them resolve to the same IP.
> >
> > We're now finding many webs of these spam/parked domains, all
> > interconnected. Do you have more information on classifying domains? This
> > is something we're now very interested in doing.
> >
> > I'm still working on putting together a list of "bad" domains.
> >
> > Thanks
> > ---
> > Joe Naegele
> > Grier Forensics
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Friday, January 13, 2017 10:00 AM
> > To: [email protected]
> > Subject: RE: General question about subdomains
> >
> > Joseph - thank you very much!
> >
> > This is exactly the crap we are looking for; now we can train our
> > classifiers to detect at least these bastards.
> >
> > But how would partitioning by IP really help if they don't all point to
> > the same IP? All hosts I manually checked are indeed on the same subnet,
> > but many have a different fourth octet.
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> > > From:Joseph Naegele <[email protected]>
> > > Sent: Friday 13th January 2017 15:11
> > > To: [email protected]
> > > Subject: RE: General question about subdomains
> > >
> > > Markus,
> > >
> > > Interestingly enough, we do use OpenDNS to filter undesirable content,
> > > including parked content. In this case, however, the domain in question
> > > isn't tagged in OpenDNS and is therefore "allowed", along with all its
> > > subdomains.
> > >
> > > This particular domain is "hjsjp.com". It's Chinese-owned and the URLs
> > > all appear to point to the same link-filled content, possibly a domain
> > > park site. Example URLs:
> > > - http://e2qya.hjsjp.com/
> > > - http://ml081.hjsjp.com/xzudb
> > > - http://www.ch8yu.hjsjp.com/1805/8371.html
> > >
> > > As Julien mentioned, partitioning and fetching by IP would help.
> > >
> > > ---
> > > Joe Naegele
> > > Grier Forensics
> > >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:[email protected]]
> > > Sent: Wednesday, January 11, 2017 9:43 AM
> > > To: [email protected]
> > > Subject: RE: General question about subdomains
> > >
> > > Hello Joseph,
> > >
> > > The only feasible method, as I see it, is to detect these kinds of spam
> > > sites, as well as domain park sites, which produce lots of garbage too.
> > > Once you detect them, you can choose not to follow their outlinks, or
> > > mark them in a domain-blacklist urlfilter.
> > >
> > > We have seen these examples as well and they caused similar problems,
> > > but we lost track of them; those domains don't exist anymore. Can you
> > > send me the domains that cause you trouble? We could use them for our
> > > classification training sets.
> > >
> > > Regards,
> > > Markus
> > >
> > > -----Original message-----
> > > > From:Joseph Naegele <[email protected]>
> > > > Sent: Wednesday 11th January 2017 15:21
> > > > To: [email protected]
> > > > Subject: General question about subdomains
> > > >
> > > > This is more of a general question, not Nutch-specific:
> > > >
> > > > Our crawler discovered some URLs pointing to a number of subdomains of
> > > > a Chinese-owned spammy domain. It then proceeded to discover millions
> > > > more URLs pointing to other subdomains (hosts) of the same domain.
> > > > Most of the names appear to be gibberish, but they do have robots.txt
> > > > files and the URLs appear to serve HTML. A few days later I found that
> > > > our crawler machine was no longer able to resolve these subdomains, as
> > > > if it were blocked by their DNS servers, significantly slowing our
> > > > crawl (due to DNS timeouts). This led me to investigate and find that
> > > > 40% of all our known URLs were hosts on this same parent domain.
> > > >
> > > > Since the hosts are actually different, is Nutch able to prevent this
> > > > trap-like behavior? Are there any established methods for preventing
> > > > similar issues in web crawlers?
> > > >
> > > > Thanks
> > > >
> > > > ---
> > > > Joe Naegele
> > > > Grier Forensics
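
The domain-blacklist urlfilter Markus mentions can be approximated in a stock
Nutch 1.x install with an exclusion rule in conf/regex-urlfilter.txt. A minimal
sketch, using the hjsjp.com domain from the thread; the exact pattern is an
assumption and should be adapted to your existing filter chain:

    # Sketch: reject hjsjp.com and all of its subdomains before fetching.
    # regex-urlfilter.txt rules are tried top-down; '-' rejects, '+' accepts.
    -^https?://([a-z0-9-]+\.)*hjsjp\.com(/|$)
    # keep the usual accept-everything-else rule as the last line
    +.

This blocks the known offender outright; the classifier route discussed above
is what generalizes it to trap domains you have not seen yet.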

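Julien's partition-by-IP suggestion and Joseph's trap question both map onto
stock Nutch 1.x properties. A minimal nutch-site.xml sketch; the property names
come from nutch-default.xml, while the values are illustrative assumptions:

    <!-- Cap the number of URLs per domain in each generated segment, so a
         single spammy parent domain cannot dominate a fetch cycle. -->
    <property>
      <name>generate.count.mode</name>
      <value>domain</value>
    </property>
    <property>
      <name>generate.max.count</name>
      <value>100</value>
    </property>

    <!-- Partition fetch lists by IP instead of by host. Note Joseph's
         caveat: this requires resolving every host during partitioning. -->
    <property>
      <name>partition.url.mode</name>
      <value>byIP</value>
    </property>

Capping generate.max.count keeps the crawl breadth-balanced even when a trap
domain contributes millions of known URLs, which addresses the 40% skew Joseph
observed.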

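Markus' classifier itself is not open source, but the hierarchical scheme he
describes, a host-type decision followed by a page-type model conditioned on
it, is easy to sketch. Everything below is hypothetical (class names,
interfaces), and the genetic-algorithm feature selection is left out:

    import java.util.Map;

    // Minimal sketch of two-stage (host type, then page type) classification.
    // Not the actual plugin; Model and all names here are hypothetical.
    public class HierarchicalClassifier {

        public interface Model {
            // Returns the most likely label for a feature vector.
            String predict(Map<String, Double> features);
        }

        private final Model hostModel;               // labels like "spam", "parked", "normal"
        private final Map<String, Model> pageModels; // one page-type model per host type

        public HierarchicalClassifier(Model hostModel, Map<String, Model> pageModels) {
            this.hostModel = hostModel;
            this.pageModels = pageModels;
        }

        public String classify(Map<String, Double> hostFeatures,
                               Map<String, Double> pageFeatures) {
            String hostType = hostModel.predict(hostFeatures);
            // Host types like "spam" may have no page-level model at all;
            // the host label alone is then the final answer.
            Model pageModel = pageModels.get(hostType);
            return pageModel == null ? hostType
                                     : hostType + "/" + pageModel.predict(pageFeatures);
        }
    }

A crawler would then skip outlink extraction whenever the host type comes back
as "spam", which is the "choose not to follow outlinks" option from Markus'
January reply.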