Re: Rules of excluding specific files in Windows file server are not recognized

Karl Wright Wed, 12 Sep 2012 01:51:32 -0700

Hi Shigeki,

The issue is that the "file part" of a rule is matched against whatever
is left of the path after the start point part has matched.  That
remainder always begins with a "/", so the file part of the rule must
begin with a "/" as well.
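
To illustrate the matching described above, here is a minimal Python
sketch.  It is not the connector's actual code: the helper names are
hypothetical, and Python's fnmatch wildcards stand in for the connector's
own matcher, whose exact semantics are an assumption here.  The paths are
the ones from the debug log quoted below.

```python
from fnmatch import fnmatchcase


def remainder_after_startpoint(startpoint: str, actual: str) -> str:
    """Return what is left of `actual` after the start point.

    The trailing '/' of the start point is kept as the leading '/' of
    the remainder, mirroring the '/phs.txt' seen in the debug log.
    """
    base = startpoint.rstrip("/")
    if actual != base and not actual.startswith(base + "/"):
        raise ValueError("path is not under the start point")
    return actual[len(base):]


def file_rule_matches(rule: str, startpoint: str, actual: str) -> bool:
    """Match the rule against the whole remainder (hypothetical helper)."""
    return fnmatchcase(remainder_after_startpoint(startpoint, actual), rule)


START = "smb://xxxxx/SharePrjG2/xxxxx/sug/"
FILE = START + "phs.txt"

print(remainder_after_startpoint(START, FILE))     # /phs.txt
print(file_rule_matches("phs.txt", START, FILE))   # False: no leading "/"
print(file_rule_matches("/phs.txt", START, FILE))  # True
print(file_rule_matches("*phs.txt", START, FILE))  # True under fnmatch rules
```

Under these assumptions a rule of "phs.txt" fails only because the text
it is compared against starts with "/".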


There are two things we could do: (1) Document it, or (2) change it
(by removing the starting "/").  But remember that if there is a path
before the filename it will also look funny, e.g.:

/my/path/and/file.txt

would become

my/path/and/file.txt

Furthermore, if we change the behavior, some people's existing jobs may
no longer work correctly.
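
The compatibility concern can be sketched in Python as follows.  This is
only an illustration under stated assumptions: both helper functions are
hypothetical, and fnmatch stands in for the connector's real matcher.

```python
from fnmatch import fnmatchcase


def current_file_part(remainder: str) -> str:
    """Current behavior: the remainder keeps its leading '/'."""
    return remainder


def proposed_file_part(remainder: str) -> str:
    """Option (2): drop a single leading '/' before matching."""
    return remainder[1:] if remainder.startswith("/") else remainder


existing_rule = "/phs.txt"  # a rule written for the current behavior
new_style_rule = "phs.txt"  # what a user would naturally type

# The existing rule matches today but would stop matching after the change.
print(fnmatchcase(current_file_part("/phs.txt"), existing_rule))    # True
print(fnmatchcase(proposed_file_part("/phs.txt"), existing_rule))   # False

# The natural rule fails today and would start matching after the change.
print(fnmatchcase(current_file_part("/phs.txt"), new_style_rule))   # False
print(fnmatchcase(proposed_file_part("/phs.txt"), new_style_rule))  # True

# A nested path loses its leading '/' as well.
print(proposed_file_part("/my/path/and/file.txt"))  # my/path/and/file.txt
```

In this sketch any job whose rules already start with "/" would silently
change behavior, which is the risk described above.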

I will open a ticket to track discussion of this issue.  CONNECTORS-526.

Thanks,
Karl

On Tue, Sep 11, 2012 at 10:11 PM, Shigeki Kobayashi
<[email protected]> wrote:
> I found the reason the MCF job does not recognize the file name to exclude
> from crawling.
> You need to enter a slash character followed by the file name.
>
> I obtained the log below. This time I had a root directory,
> //xxxxx/SharePrjG2/xxxxx/sug/, and placed a file named "phs.txt" in it.
> In the job settings, I entered "phs.txt" to exclude the file from crawling,
> so the crawling rule became the following:
>
>   1. Exclude file(s) matching phs.txt
>
> DEBUG 2012-09-12 09:56:35,829 (Worker thread '31') - JCIFS: Matching
> startpoint 'smb://xxxxx/SharePrjG2/xxxxx/sug/' against actual
> 'smb://xxxxx/SharePrjG2/xxxxx/sug/'
> DEBUG 2012-09-12 09:56:35,829 (Worker thread '31') - JCIFS: Startpoint
> found!
> DEBUG 2012-09-12 09:56:35,829 (Worker thread '31') - JCIFS: Checking
> 'phs.txt' against '/phs.txt'
> DEBUG 2012-09-12 09:56:35,829 (Worker thread '31') - JCIFS: No match!
>
>
> The third log entry above shows that "phs.txt" does not match "/phs.txt".
>
> I feel it is hard for users to discover that a slash is needed.
> If this behavior is going to be kept, I think it would be nice to
> describe this rule in the user documentation.
>
> Thanks for your help.
>
>
> Regards,
>
>
> Shigeki
>
> 2012/9/11 Karl Wright <[email protected]>
>>
>> I am wondering if there might be another locale-specific toLowerCase()
>> issue like we saw in Turkey...
>>
>> I've asked Shigeki to turn on connector debugging and send us the log.
>>  That should demonstrate if the rule is not matching due to case
>> reasons.
>>
>> Karl
>>
>> On Tue, Sep 11, 2012 at 7:44 AM, Ahmet Arslan <[email protected]> wrote:
>> > Hi Shigeki
>> >
>> > Can you try entering "*text.txt" in the text box?
>> >
>> > Ahmet
>> > --- On Tue, 9/11/12, Shigeki Kobayashi
>> > <[email protected]> wrote:
>> >
>> > From: Shigeki Kobayashi <[email protected]>
>> > Subject: Rules of excluding specific files in Windows file server are
>> > not recognized
>> > To: [email protected]
>> > Date: Tuesday, September 11, 2012, 1:46 PM
>> >
>> > Hi guys.
>> > I need some help excluding specific files from crawling.
>> > I am trying to crawl a Windows file server using the Windows shares
>> > connector and index it into Solr.
>> >
>> > There are some files I do not want to index, so I set paths to exclude
>> > them from crawling, but the job still crawls them.
>> > For example, I do NOT want to index "text.txt" in a directory D, which
>> > is the root path.
>> >
>> >
>> > In the "Paths" tab:
>> > - Set D as the root path.
>> > - To create a crawling rule, choose "exclude" and "file" from the
>> >   pulldown, and enter "text.txt" in the text box.
>> > - The list of crawling rules is created as follows:
>> >   1. Exclude file(s) matching text.txt
>> >   2. Include indexable file(s) matching *
>> >   3. Include directory(s) matching *
>> > - Save the job settings.
>> >
>> > As a result, the job still tries to crawl the file. I wonder why
>> > "text.txt" does not match in the crawling rule.
>> >
>> >
>> > Does anyone know what I did wrong?
>> > Versions: MCF 0.5, Solr 3.5, MySQL 5.5
>> >
>> > Regards,
>> > Shigeki
>> >
>> >
>
>
>
>
> --
> 〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜
>  SoftBank Mobile Corp.
>  Information Systems Division
>  System Service Business Management Department
>  Service Planning Department
>
>  Shigeki Kobayashi
>  [email protected]
> 〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜
>
>
>
