On Sun, Jan 25, 2004 at 02:00:19PM -0800, Gary Funck wrote:
> [snip]
> Over the weekend, I've collected 3600 host names associated with 16,300
> URLs extracted from about 80,000 spam messages going back to August of
> this year. They're sorted in reverse dot order, for example:
> [snip excellent example]

Do you have information on what you used to do the extraction and sorting?
A lot of tools have trouble just getting THAT part right. :)

> Question: is there a Perl package that can be used to boil these down
> to their domain name part, suitable for a whois lookup? Where I'm going
> with this is to try and build a database of same registrar/technical point
> of contact and so on. One approach I thought of was to try a whois on the
> fully qualified host names above, and if it doesn't succeed, then remove
> the first component and try again, and so on, but that's not very elegant.

I agree. Not very elegant at all. :)

You might check with the rfc-ignorant.org folk. They have logic in their
lookup and submission forms that is close to what you seek, though it may
not be quite on target.

Some top level domains have mandatory second level domains, and all
registrants hold "third level" domains. Some TLDs have a mix. There are
often grandfathered exceptions to every rule.

If your goal is just to obtain the whois info for a domain, there is no
need to reinvent the wheel at this point. See below. If the goal is
something else... please clarify. :)

> Regarding whois, I tried a few of the domains in the list and noticed
> that whois turned up empty. Is there a database somewhere that relates
> domain names to their registrar, or to a server that will reply with their
> whois info?

For generic top level domains, you have the registries. Then you have the
registrars. As far as maintaining the contact info, those should be the
only folk you need to deal with.

EPP registries are simple: the registry itself holds all the whois data.
Most other registries are "thin", in that they just refer you to the
registrar's whois server for information.

Some top level domains won't have available or accurate whois info.

The WHOIS-SERVERS.NET zone is an excellent resource. The zone contains
CNAME entries that point each TLD's name at the A record of the whois
server for that TLD.
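
As a rough illustration of the two ideas above (stripping leading labels
from a host name, and finding a whois server through the WHOIS-SERVERS.NET
zone), here is a minimal Python sketch. The function names
candidate_domains, whois_server_for and whois_query are made up for this
example and do not come from any existing package. It assumes a plain whois
server listening on TCP port 43, and it knows nothing about mandatory
second level domains, so a candidate such as co.uk still has to be rejected
by the caller.

```python
import socket
from functools import lru_cache


def candidate_domains(host):
    """Yield host, then host with leading labels removed, down to two labels."""
    labels = host.strip(".").lower().split(".")
    for i in range(len(labels) - 1):
        yield ".".join(labels[i:])


def whois_server_for(name):
    """Build the TLD.whois-servers.net name for a host or domain name."""
    tld = name.strip(".").lower().rsplit(".", 1)[-1]
    return tld + ".whois-servers.net"


@lru_cache(maxsize=None)  # in-memory cache: never send the same query twice
def whois_query(server, query, timeout=10.0):
    """Send one whois query over TCP port 43 and return the raw reply text."""
    with socket.create_connection((server, 43), timeout=timeout) as sock:
        sock.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

For www.mail.example.co.uk the candidates come out as
www.mail.example.co.uk, mail.example.co.uk, example.co.uk and co.uk, in
that order, and whois_server_for returns uk.whois-servers.net. Deciding
which reply counts as "found" is registry-specific and is left out here.
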
If you point your whois client at TLD.WHOIS-SERVERS.NET, where TLD is the
top level domain that you wish to query, you should get results.

There are many whois clients and scripts out there. The geektools.com
whois proxy is nice, and I believe you can download the proxy code itself.
Many people like the bbwhois client, which offers a nice web interface and
a database-backed cache.

IANA maintains a list of country code top level domains. This list
includes the entity acting as registry, whois servers, etc.

http://www.iana.org/cctld/cctld-whois.htm

A similar list of registries for gTLDs is available here:

http://www.icann.org/registries/listing.html

I don't think I've answered most of your questions, but hopefully the
information I've provided will be helpful.

Please keep in mind that there are various restrictions placed on the data
in whois, and on access to the whois servers for each registry. Take care
that your actions do not cause others harm, or get you blacklisted, etc.
Bulk automated whois queries for any reason place a load on the target
whois server(s). Tread lightly, and use caching as you see fit.

hth,
-jeff

-- 
Jeff Godin
Network Specialist
Traverse Area District Library / Traverse Community Network
[EMAIL PROTECTED]
