date:20120424

Re: Good workflow for a regular re-indexing job

2012-04-24 Thread Ian Piper

Yes, I had read that tutorial and used it to get to where I am. It's very confusing, alas. It's unfortunate that this is the most clear description that I could find, because it (like others that I have found) assumes a large amount of prior knowledge. There doesn't seem to be a clear,

solution for scanned pdf parsing

2012-04-24 Thread nutchsolruser

I have some PDF files; the data in them is scanned articles and some Unicode text. I am using Tika as the PDF parser, but the parser fails for PDFs with images in them. Is it possible to index only the parsable data present in such a PDF? Currently it is not indexing any data from that PDF. Thanks. -- View

Re: Good workflow for a regular re-indexing job

2012-04-24 Thread Markus Jelsma

On Mon, 23 Apr 2012 10:23:05 +0100, Ian Piper ianpi...@tellura.co.uk wrote: Hi all, I have set up a process for crawling a client's website using nutch and then creating a Solr index. I have run into a workflow problem and would appreciate some guidance - preferably a tutorial of some sort.

Re: Good workflow for a regular re-indexing job

2012-04-24 Thread Lewis John Mcgibbney

Hi Ian, There is an older, well-used script [0] which we moved to the archive section of the wiki to clean it up a bit. If you browse around there you should be able to manufacture something custom for your needs with little overhead. You may also wish to scrutinise the current command line

Re: solution for scanned pdf parsing

2012-04-24 Thread Lewis John Mcgibbney

Can you please post the output when the Tika parser plugin fails? If possible also the URL if you can...? Thank you Lewis On Tue, Apr 24, 2012 at 6:38 AM, nutchsolruser nutchsolru...@gmail.com wrote: I have some pdf files , data present in pdf is scanned articles and some unicode text. I am

Re: solution for scanned pdf parsing

2012-04-24 Thread remi tassing

It could also be due to the filesize //Remi On Tuesday, April 24, 2012, nutchsolruser nutchsolru...@gmail.com wrote: I have some pdf files , data present in pdf is scanned articles and some unicode text. I am using tika as pdf parser. but parser fails for pdf's with images in it. is it

RE: Question related to NUTCH 1044 redirected URLS and invalid scores

2012-04-24 Thread Pravin Agrawal

Hi Lewis, thanks for the reply. Sorry I couldn't get back to you soon as I was on vacation. I tried out the NUTCH 1044 patch on nutch 1.4 with a test website where a jsp page sends a 302 redirect request to another jsp page. But here I observed that the score of redirected URL is still set

Getting Started with NUTCH

2012-04-24 Thread eliea

I have just started picking up Nutch for a project that I am working on that involves building a framework for QA teams to check site pages for broken links, expired content, etc. I have successfully set up Nutch on a Windows OS and fired up an initial crawl with a single URL in the seed file. I


