Hi Sol,
Note that you do not need a regular expression to filter by file suffix; the
urlfilter-suffix plugin (configured in suffix-urlfilter.txt) does that.
Obviously, if the URL does not contain the file type, you have to fetch it
anyway to get the MIME type. If there is no parser for this file type, it will
not be parsed or indexed. If there is a parser and you want to disable it, I
think you can do it in parse-plugins.xml (remove the * rule, and map only the
MIME types you do want).
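
To make the two stages concrete, here is a rough Python sketch of the decision
logic only. It is not Nutch code and does not use any Nutch API; the function
names and the suffix and MIME lists are made up for illustration. Stage 1 runs
on the URL alone, before fetching. Stage 2 runs on the Content-Type you get
back after fetching.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical lists for illustration; not taken from any Nutch configuration.
REJECTED_SUFFIXES = {".ico", ".doc"}
ACCEPTED_MIME_TYPES = {"text/html"}


def url_suffix(url):
    """Return the lowercased extension of the URL path, or '' if there is none."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lower()


def should_fetch(url):
    """Stage 1 (before fetching): drop URLs that are clearly unwanted."""
    path = urlparse(url).path
    if "/cgi-bin/" in path:
        return False
    # URLs with no suffix at all (e.g. a site's home page) pass through.
    return url_suffix(url) not in REJECTED_SUFFIXES


def should_index(content_type):
    """Stage 2 (after fetching): keep only MIME types on the allow-list."""
    # Strip parameters such as "; charset=UTF-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in ACCEPTED_MIME_TYPES
```

With this split, http://foo.bar passes stage 1 because it has no suffix to
reject, and is kept in stage 2 only if the server reports text/html.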
Yossi.
> -----Original Message-----
> From: Sol Lederman [mailto:[email protected]]
> Sent: 25 November 2017 18:57
> To: [email protected]
> Subject: General question on dealing with file types
>
> Like most of you I imagine, I want to capture and index file types from a
> particular set of types. I want to index HTML but I may or may not want to
> index cgi-bin or PDFs. It seems that there are two general approaches for
> selecting what to include and exclude and neither seems ideal.
>
> 1. I can include files I care about based on the URL matching a reg ex. So,
> I can have a list: html, HTML, pdf, PDF, etc. and filter out URLs that don't
> match the pattern.
>
> 2. I can exclude files I don't want. I can exclude files with reg exes that
> match /cgi-bin/, .ico, .doc, etc and keep everything else.
>
> The problem with the first approach is that lots of HTML files don't end in
> .html. Often there is no file name. The home page of a site may just be
> http://foo.bar. So, the first approach will miss lots of HTML files.
>
> The second approach is ok until I forget a file pattern that I really want
> to exclude.
>
> I'm wondering if using the MIME type in conjunction with the first approach
> would work well. So, accept URLs with MIME type text/html, accept URLs that
> match some URL patterns I want to include and exclude the rest.
>
> I can, I suppose, use approach #2 and not worry since files that don't have
> text won't produce any searchable text in the index. I'm not too worried
> about having some junk in the index as I'm not crawling a huge number of
> pages.
>
> Thoughts? What do folks generally do?
>
> Thanks.
>
> Sol