Re: Good workflow for a regular re-indexing job

2012-04-24 Thread Ian Piper
Yes, I had read that tutorial and used it to get to where I am. It's very confusing, alas. It's unfortunate that this is the clearest description I could find, because it (like others I have found) assumes a large amount of prior knowledge. There doesn't seem to be a clear,

solution for scanned pdf parsing

2012-04-24 Thread nutchsolruser
I have some PDF files whose content is scanned articles plus some Unicode text. I am using Tika as the PDF parser, but the parser fails for PDFs with images in them. Is it possible to index only the parsable data present in such a PDF? Currently it is not indexing any data from that PDF. Thanks. -- View

Re: Good workflow for a regular re-indexing job

2012-04-24 Thread Markus Jelsma
On Mon, 23 Apr 2012 10:23:05 +0100, Ian Piper ianpi...@tellura.co.uk wrote: Hi all, I have set up a process for crawling a client's website using nutch and then creating a Solr index. I have run into a workflow problem and would appreciate some guidance - preferably a tutorial of some sort.

Re: Good workflow for a regular re-indexing job

2012-04-24 Thread Lewis John Mcgibbney
Hi Ian, There is an older, well-used script [0] which we moved to the archive section of the wiki to clean it up a bit. If you browse around there you should be able to manufacture something custom for your needs with little overhead. You may also wish to scrutinise the current command line

Re: solution for scanned pdf parsing

2012-04-24 Thread Lewis John Mcgibbney
Can you please post the output when the Tika parser plugin fails? If possible, also the URL? Thank you, Lewis. On Tue, Apr 24, 2012 at 6:38 AM, nutchsolruser nutchsolru...@gmail.com wrote: I have some PDF files whose content is scanned articles plus some Unicode text. I am

Re: solution for scanned pdf parsing

2012-04-24 Thread remi tassing
It could also be due to the file size. //Remi On Tuesday, April 24, 2012, nutchsolruser nutchsolru...@gmail.com wrote: I have some PDF files whose content is scanned articles plus some Unicode text. I am using Tika as the PDF parser, but the parser fails for PDFs with images in them. Is it

RE: Question related to NUTCH 1044 redirected URLS and invalid scores

2012-04-24 Thread Pravin Agrawal
Hi Lewis, thanks for the reply. Sorry I couldn't get back to you sooner, as I was on vacation. I tried out the NUTCH 1044 patch on Nutch 1.4 with a test website where a JSP page sends a 302 redirect to another JSP page. But here I observed that the score of the redirected URL is still set

Getting Started with NUTCH

2012-04-24 Thread eliea
I have just started picking up Nutch for a project I am working on that involves building a framework for QA teams to check site pages for broken links, expired content, etc. I have successfully set up Nutch on a Windows OS and fired up an initial crawl with a single URL in the seed file. I
