I found the reason the MCF job does not recognize the file name to exclude from crawling: you need to put a slash character in front of the file name.
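To illustrate why the leading slash matters, here is a minimal sketch in Python. It is not the connector's actual matching code: `relative_path` and `rule_matches` are hypothetical helper names, and Python's `fnmatchcase` stands in for the connector's wildcard matcher. The only assumption taken from the debug log is that the rule is compared against the path below the start point, with a leading slash.

```python
from fnmatch import fnmatchcase


def relative_path(start_point: str, actual: str) -> str:
    """Return the part of `actual` below `start_point`, with a leading slash.

    Hypothetical helper: mimics the '/phs.txt' form seen in the debug log.
    """
    if not actual.startswith(start_point):
        raise ValueError("actual path is not under the start point")
    return "/" + actual[len(start_point):].lstrip("/")


def rule_matches(rule: str, start_point: str, actual: str) -> bool:
    """Check a crawling rule against the slash-prefixed relative path.

    fnmatchcase is a stand-in for the connector's own matcher (assumption).
    """
    return fnmatchcase(relative_path(start_point, actual), rule)


start = "smb://xxxxx/SharePrjG2/xxxxx/sug/"
target = start + "phs.txt"

print(rule_matches("phs.txt", start, target))    # rule without the slash
print(rule_matches("/phs.txt", start, target))   # rule with the leading slash
print(rule_matches("*phs.txt", start, target))   # wildcard prefix, as Ahmet suggested
```

Under these assumptions only the first rule fails to match, because the candidate string is "/phs.txt" rather than "phs.txt".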
Here is the log I obtained. This time I had a root directory, //xxxxx/SharePrjG2/xxxxx/sug/, and placed a file named "phs.txt" in it. In the job settings, I entered "phs.txt" to exclude the file from crawling, so the crawling rule became the following:

1. Exclude file(s) matching phs.txt

DEBUG 2012-09-12 09:56:35,829 (Worker thread '31') - JCIFS: Matching startpoint 'smb://xxxxx/SharePrjG2/xxxxx/sug/' against actual 'smb://xxxxx/SharePrjG2/xxxxx/sug/'
DEBUG 2012-09-12 09:56:35,829 (Worker thread '31') - JCIFS: Startpoint found!
DEBUG 2012-09-12 09:56:35,829 (Worker thread '31') - JCIFS: Checking 'phs.txt' against '/phs.txt'
DEBUG 2012-09-12 09:56:35,829 (Worker thread '31') - JCIFS: No match!

The third line above shows that the rule "phs.txt" does not match "/phs.txt".

Well, I feel it is rather hard for users to figure out that a leading slash is needed. If this behavior is going to be kept, I think it would be nice to describe this rule in the user documentation.

Thanks for your help.

Regards,
Shigeki

2012/9/11 Karl Wright <[email protected]>

> I am wondering if there might be another locale-specific toLowerCase()
> issue like we saw in Turkey...
>
> I've asked Shigeki to turn on connector debugging and send us the log.
> That should demonstrate if the rule is not matching due to case
> reasons.
>
> Karl
>
> On Tue, Sep 11, 2012 at 7:44 AM, Ahmet Arslan <[email protected]> wrote:
> > Hi Shigeki
> >
> > Can you try entering "*text.txt" in the text box?
> >
> > Ahmet
> > --- On Tue, 9/11/12, Shigeki Kobayashi <[email protected]> wrote:
> >
> > From: Shigeki Kobayashi <[email protected]>
> > Subject: Rules of excluding specific files in Windows file server are not recognized
> > To: [email protected]
> > Date: Tuesday, September 11, 2012, 1:46 PM
> >
> > Hi guys.
> > I need some help in excluding specific files from crawling.
> > I am trying to crawl Windows file server using Windows shares connector to index to Solr.
> >
> > There are some files I do not want to index so I set paths to exclude them from crawling, but the job crawls them.
> > For example, I do NOT want to index "text.txt" in a directory D which is a root path.
> >
> > In "Paths" tab:
> > - Set D as the root path.
> > - To create crawling rules, from pulldown, chose "exclude" and "file", and enter "text.txt" in a text box.
> > - The list of crawling rules is created as following:
> >     1. Exclude file(s) matching text.txt
> >     2. Include indexable file(s) matching *
> >     3. Include directory(s) matching *
> > - Save the job setting
> >
> > As the result, the job still tries to crawl the file. I wonder why "text.txt" does not match in the crawling rule.
> >
> > Anyone knows what I did wrong?
> >
> > Version: MCF 0.5 Solr 3.5 MySql 5.5
> >
> > Regards,
> > Shigeki

--
SoftBank Mobile Corp.
Information Systems Division, System Services Business Unit, Service Planning Department
Shigeki Kobayashi
[email protected]
