Hi Julien,

Thanks for the message. I think you have found part of the problem. I have this in regex-urlfilter.txt:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

I will try modifying this and re-running the crawl.

Ian.

--
On 23 Jan 2012, at 16:04, Julien Nioche wrote:

> Hi Ian
>
>> The problem I'm finding is that the crawler is apparently not visiting or
>> indexing the content of these URLs. The document at the far end of the link
>> has this URL:
>>
>> http://[domain]/medialibrary.axd?id=414405745
>>
>> It is actually a PDF. I am using the Tika plugin, which I thought would
>> allow for indexing PDFs.
>>
> Don't blame parse-tika: if the URL is not fetched, it has no chance of
> being parsed and then indexed.
>
> Check your URL filter: the link above contains a '?', which by default
> causes the URL to be filtered out.
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
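[Editor's note: since Nutch evaluates regex-urlfilter.txt rules top to bottom and the first matching rule wins, one way to fix this is to add an accept rule for the medialibrary.axd URLs above the generic skip rule. This is only a sketch; the exact pattern is an assumption, and "[domain]" is a placeholder from the original message.]

```
# Accept the medialibrary.axd download URLs before the generic skip rule.
# "+" means accept; the first rule that matches a URL decides its fate.
+^http://\[domain\]/medialibrary\.axd\?id=

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
```

An alternative is to simply remove '?' (and '=') from the character class in the skip rule, e.g. `-[*!@]`, at the cost of letting all query-string URLs through.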

