Hi Julien,

Thanks for the message. I think you have found part of the problem; I have
this in regex-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

I will try modifying this and re-running the crawl.
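A minimal sketch of one possible change (assuming the medialibrary.axd path is stable for these documents; in regex-urlfilter.txt the first matching rule wins, so the accept rule must come before the skip rule):

```
# accept medialibrary URLs explicitly, before the query-character skip rule
+medialibrary\.axd\?id=
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
```

A broader alternative would be to remove the '?' from the skip rule's character class, but that would let through every URL with a query string, not just these.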


Ian.
--

On 23 Jan 2012, at 16:04, Julien Nioche wrote:

> Hi Ian
> 
> 
>> The problem I'm finding is that the crawler is apparently not visiting or
>> indexing the content of these URLs. The document at the far end of the link,
>> which has this URL
>> 
>> http://[domain]/medialibrary.axd?id=414405745
>> 
>> is actually a PDF. I am using the Tika plugin, which I thought would allow
>> for indexing PDFs.
>> 
>> 
> Don't blame parse-tika: if the URL is not fetched, then it has no chance of
> being parsed and then indexed.
> 
> Check your URL filter: the link above contains a '?', which by default
> would cause the URL to be filtered out.
> 
> 
> 
> -- 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
