Like most of you, I imagine, I want to crawl and index only files from a particular set of types. I want to index HTML, but I may or may not want to index cgi-bin output or PDFs. There seem to be two general approaches for selecting what to include and exclude, and neither seems ideal.
1. Include the files I care about by matching URLs against regexes. I can keep a list of extensions (html, HTML, pdf, PDF, etc.) and filter out any URL that doesn't match one of the patterns.

2. Exclude the files I don't want. I can drop URLs matching regexes like /cgi-bin/, .ico, .doc, etc., and keep everything else.

The problem with the first approach is that lots of HTML files don't end in .html; often there is no file name at all. The home page of a site may just be http://foo.bar, so the first approach will miss lots of HTML pages. The second approach is fine until I forget a pattern I really should have excluded.

I'm wondering whether using the MIME type in conjunction with the first approach would work well: accept URLs whose responses have MIME type text/html, accept URLs that match the patterns I want to include, and exclude everything else (rough sketch in the P.S. below).

I suppose I could also just use approach #2 and not worry, since files that don't contain text won't produce any searchable text in the index. And I'm not too worried about having some junk in the index, as I'm not crawling a huge number of pages.

Thoughts? What do folks generally do?

Thanks.

Sol
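P.S. For concreteness, here's a rough sketch in Python of the hybrid approach I have in mind. The pattern lists, the accepted MIME types, and the timeout are all made up for illustration; a real crawler would use its own filter plugin mechanism rather than a standalone function like this.

import re
import urllib.request

# Hypothetical patterns -- adjust for your own crawl.
INCLUDE_PATTERNS = [re.compile(r"\.(html?|pdf)$", re.IGNORECASE)]
EXCLUDE_PATTERNS = [re.compile(r"/cgi-bin/"),
                    re.compile(r"\.(ico|doc)$", re.IGNORECASE)]
ACCEPTED_MIME_TYPES = {"text/html", "application/pdf"}

def should_index(url):
    """Return True if the URL should be fetched and indexed."""
    # Hard excludes first: cheap, and catches known junk.
    if any(p.search(url) for p in EXCLUDE_PATTERNS):
        return False
    # Obvious includes by extension.
    if any(p.search(url) for p in INCLUDE_PATTERNS):
        return True
    # No recognizable extension (e.g. http://foo.bar): fall back to a
    # HEAD request and check the Content-Type header.
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=10) as resp:
            content_type = resp.headers.get_content_type()
    except OSError:
        return False
    return content_type in ACCEPTED_MIME_TYPES

The extra HEAD request only happens for URLs the regexes can't decide, so the common cases stay cheap while extensionless home pages still get picked up by their MIME type.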

