Like most of you, I imagine, I want to crawl and index only files from a particular set of types. I want to index HTML, but I may or may not want to index cgi-bin output or PDFs. There seem to be two general approaches for selecting what to include and exclude, and neither seems ideal.
1. Include the files I care about by matching URLs against regexes. I can keep a list of extensions (html, HTML, pdf, PDF, etc.) and filter out any URL that doesn't match one of the patterns.

2. Exclude the files I don't want. I can drop URLs matching regexes like /cgi-bin/, .ico, .doc, etc., and keep everything else.

The problem with the first approach is that lots of HTML files don't end in .html; often there is no file name at all. The home page of a site may just be http://foo.bar, so the first approach will miss lots of HTML pages. The second approach is fine until I forget a pattern I really should have excluded.

I'm wondering whether using the MIME type in conjunction with the first approach would work well: accept URLs whose responses have MIME type text/html, accept URLs that match the patterns I want to include, and exclude everything else (rough sketch in the P.S. below).

I suppose I could also just use approach #2 and not worry, since files that don't contain text won't produce any searchable text in the index. And I'm not too worried about having some junk in the index, as I'm not crawling a huge number of pages.

Thoughts? What do folks generally do?

Thanks.

Sol
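P.S. For concreteness, here's a rough sketch in Python of the hybrid approach I have in mind. The pattern lists, the accepted MIME types, and the timeout are all made up for illustration; a real crawler would use its own filter plugin mechanism rather than a standalone function like this.

import re
import urllib.request

# Hypothetical patterns -- adjust for your own crawl.
INCLUDE_PATTERNS = [re.compile(r"\.(html?|pdf)$", re.IGNORECASE)]
EXCLUDE_PATTERNS = [re.compile(r"/cgi-bin/"),
                    re.compile(r"\.(ico|doc)$", re.IGNORECASE)]
ACCEPTED_MIME_TYPES = {"text/html", "application/pdf"}

def should_index(url):
    """Return True if the URL should be fetched and indexed."""
    # Hard excludes first: cheap, and catches known junk.
    if any(p.search(url) for p in EXCLUDE_PATTERNS):
        return False
    # Obvious includes by extension.
    if any(p.search(url) for p in INCLUDE_PATTERNS):
        return True
    # No recognizable extension (e.g. http://foo.bar): fall back to a
    # HEAD request and check the Content-Type header.
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=10) as resp:
            content_type = resp.headers.get_content_type()
    except OSError:
        return False
    return content_type in ACCEPTED_MIME_TYPES

The extra HEAD request only happens for URLs the regexes can't decide, so the common cases stay cheap while extensionless home pages still get picked up by their MIME type.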

