Hi Sol,

Note that you do not need a regular expression to filter by file suffix;
the urlfilter-suffix plugin does that.
Obviously, if the URL does not contain the file type, you have to fetch it
anyway to get the MIME type. If there is no parser for this file type, it will
not be parsed or indexed in any case. If there is a parser and you want to
disable it, I think you can do it in parse-plugins.xml (remove the * rule, and
map only the MIME types you do want).
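
To make the two stages concrete, here is a rough sketch in plain Python. It is
not Nutch code and does not use any Nutch API; the suffix list, the MIME type
list and the function names are all made up for illustration. The idea is only
that the suffix check runs on the URL before fetching, while the MIME type
check can only run once the response has come back.

```python
from urllib.parse import urlparse
import posixpath

# Hypothetical lists, for illustration only (not taken from any Nutch config).
BLOCKED_SUFFIXES = {".ico", ".doc"}
WANTED_MIME_TYPES = {"text/html", "application/pdf"}


def url_suffix(url):
    """Return the lowercased file extension of the URL path, or "" if none."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lower()


def should_fetch(url):
    """Pre-fetch check: reject only URLs that are known to be unwanted."""
    if "/cgi-bin/" in urlparse(url).path:
        return False
    # A URL with no suffix (e.g. a site home page) passes, since its type
    # is unknown until it has been fetched.
    return url_suffix(url) not in BLOCKED_SUFFIXES


def should_parse(content_type):
    """Post-fetch check: keep only documents whose MIME type is wanted."""
    # Drop parameters such as "; charset=UTF-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in WANTED_MIME_TYPES
```

With these assumed lists, http://foo.bar passes the first check because it has
no suffix, and is then kept or dropped according to the Content-Type the
server returns.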

        Yossi.

> -----Original Message-----
> From: Sol Lederman [mailto:[email protected]]
> Sent: 25 November 2017 18:57
> To: [email protected]
> Subject: General question on dealing with file types
> 
> Like most of you I imagine, I want to capture and index file types from a
> particular set of types. I want to index HTML but I may or may not want to
> index cgi-bin or PDFs. It seems that there are two general approaches for
> selecting what to include and exclude and neither seems ideal.
> 
> 1. I can include files I care about based on the URL matching a reg ex. So,
> I can have a list: html, HTML, pdf, PDF, etc. and filter out URLs that don't
> match the pattern.
> 
> 2. I can exclude files I don't want. I can exclude files with reg exes that
> match /cgi-bin/, .ico, .doc, etc and keep everything else.
> 
> The problem with the first approach is that lots of HTML files don't end in
> .html. Often there is no file name. The home page of a site may just be
> http://foo.bar. So, the first approach will miss lots of HTML files.
> 
> The second approach is ok until I forget a file pattern that I really want to
> exclude.
> 
> I'm wondering if using the MIME type in conjunction with the first approach
> would work well. So, accept URLs with MIME type text/html, accept URLs that
> match some URL patterns I want to include and exclude the rest.
> 
> I can, I suppose, use approach #2 and not worry since files that don't have
> text won't produce any searchable text in the index. I'm not too worried
> about having some junk in the index as I'm not crawling a huge number of
> pages.
> 
> Thoughts? What do folks generally do?
> 
> Thanks.
> 
> Sol
