Re: Exception in DeleteDuplicates.java

2008-01-15 Thread Manoj Bist
Thanks a lot, Ismael. I applied this patch to release-0.9, recompiled, and it worked. I can finally try out Nutch successfully. Thanks, - Manoj. On Jan 13, 2008 4:43 AM, Ismael [EMAIL PROTECTED] wrote: Hello. I apparently had a similar problem when trying to dedup; I solved it by updating

Re: Redirect pages in segment

2008-01-15 Thread Tomislav Poljak
Andrzej, thanks for the explanation. How can I distinguish these redirect pages from the 'normal' ones (with content)? Is there some status or flag? (With parseData.getStatus() I get success(1,0) for both redirect and normal pages.) Can I use the HTTP response code, and if so, how can I get it (I don't see it in

Customize Crawling..

2008-01-15 Thread Volkan Ebil
Hi, I am a new Nutch user. My problem is customizing the crawl process. My aim is to detect and crawl web sites written in my language. I want to crawl only the sites that contain special characters like ğ or ç, and I also want to limit the URLs to ones ending only with specific extensions like com.uk

RE: Customize Crawling..

2008-01-15 Thread kishore.krishna2
Hi, I don't know about the special character part... but you can limit the URLs using conf/urlfilter.txt... Thanks, kishore -Original Message- From: Volkan Ebil [mailto:[EMAIL PROTECTED]] Sent: Tuesday, January 15, 2008 6:13 PM To: nutch-user@lucene.apache.org Subject: Customize Crawling.. Hi, I
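For reference, a minimal sketch of what the URL filter rules might look like for the extension question. The exact file name varies by setup (the crawl command in 0.9 typically reads conf/crawl-urlfilter.txt); the patterns below are illustrative assumptions, not tested rules:

    # skip image/archive suffixes (example list, trim to taste)
    -\.(gif|GIF|jpg|JPG|png|PNG|zip|ZIP)$
    # accept only hosts ending in the desired suffix (com.uk here, per the question)
    +^http://([a-z0-9\-]+\.)*[a-z0-9\-]+\.com\.uk/
    # reject everything else
    -.

Rules are tried top to bottom and the first matching +/- prefix decides whether the URL is kept. Note that filtering on page content (special characters like ğ or ç) cannot be done in the URL filter, since it only ever sees the URL string.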

Re: Problems building the parse-rtf plugin

2008-01-15 Thread Chaz Hickman
Shi Wang wrote: Hi, Hickman. You can download it here: http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ Actually, you will have this problem if you use version 0.9, and the other plugin you may be missing is the MP3 parser; you can download it here:

Re: Redirect pages in segment

2008-01-15 Thread Andrzej Bialecki
Tomislav Poljak wrote: Andrzej, thanks for the explanation. How can I distinguish these redirect pages from the 'normal' ones (with content)? Is there some status or flag? (With parseData.getStatus() I get success(1,0) for both redirect and normal pages.) Can I use the HTTP response code and if so how can I get
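For reference, one way to inspect the per-URL fetch status from code: a segment's crawl_fetch directory holds a CrawlDatum per URL, and its status byte differs for redirects versus plain successes. A minimal sketch, assuming a Nutch 0.9 segment with a single part-00000 (which redirect status constants exist varies between versions, so check CrawlDatum in your build):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.util.NutchConfiguration;

    public class DumpFetchStatus {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // crawl_fetch holds one CrawlDatum per fetched URL (single-part assumption)
        Path data = new Path(args[0], "crawl_fetch/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        CrawlDatum datum = new CrawlDatum();
        while (reader.next(url, datum)) {
          // toString() prints the status name; redirects carry a different
          // status byte than plain STATUS_FETCH_SUCCESS
          System.out.println(url + "\t" + datum);
        }
        reader.close();
      }
    }

Alternatively, bin/nutch readseg -dump <segment> <outdir> writes the same information out as text, which is often enough for eyeballing which URLs were redirected.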

Re: 'crawled already exists' - how do I recrawl?

2008-01-15 Thread nghianghesi
The script works well (Nutch 0.9). However, I have some concerns: from the log on screen, and from reviewing the code, the script re-indexes the whole database -- low speed (it takes as long as indexing from scratch) -- is there any way to re-index only changed pages? The generate step is also long -- can it be improved? The
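For reference, the shape of the incremental loop such recrawl scripts implement, as a sketch only (paths and -topN are placeholders): generate picks only the topN URLs that are due, so the fetch stays small, but on 0.9 the Lucene index is rebuilt over the indexed segments rather than patched in place, which is the slowness observed above:

    # fetchlist of only the topN due URLs, not the whole db
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    # fold fetch results back into the crawldb
    bin/nutch updatedb crawl/crawldb $segment
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    # index the new segment only, then merge with the existing indexes
    bin/nutch index crawl/new-indexes crawl/crawldb crawl/linkdb $segment

Indexing only the new segment and then merging the result (bin/nutch merge) with the existing index avoids re-indexing everything, at the cost of periodic dedup/merge passes.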

How to use Nutch to parse Web-pages!

2008-01-15 Thread Morrowwind
Hi, my project is about web page processing, and I need to parse the web pages to get all the plain text first. Now I have finished the crawling part using Nutch, and I'm stuck on the parsing part. I have my data in the crawldb folder. How can I parse the plain text out of the web pages and

Re: partial crawling

2008-01-15 Thread mistapony
I am curious about this same question. This looks like a very old thread, but is there a way to force a timeout in the fetch at all? Thanks. Sorantis wrote: Hi all! I want to add the nutch crawl command to cron. There are a few things I need to know. When I execute ./nutch crawl ..etc. I
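For reference: as far as I can tell, 0.9 has no switch that caps the total duration of a fetch, but the per-request limits in nutch-default.xml keep any one slow host from stalling a run indefinitely. A sketch of the relevant overrides for conf/nutch-site.xml (values are examples only):

    <configuration>
      <property>
        <name>http.timeout</name>
        <value>10000</value>
        <!-- network timeout for a single request, in milliseconds -->
      </property>
      <property>
        <name>http.max.delays</name>
        <value>3</value>
        <!-- times a fetcher thread waits on a busy host before giving up -->
      </property>
    </configuration>

For cron use, bounding the crawl itself with -depth and -topN on the crawl command is usually the more predictable lever.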

Re: How to use Nutch to parse Web-pages!

2008-01-15 Thread Developer Developer
Check this out: http://kuthrax.blogspot.com/2008/01/how-to-retrieve-parsed-content-from.html On Jan 15, 2008 2:46 PM, Morrowwind [EMAIL PROTECTED] wrote: Hi, my project is about web page processing and I need to parse the web pages to get all the plain text first. Now I have finished
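For those finding this later, the gist of that approach: the extracted plain text lives in each segment's parse_text directory (not in crawldb, which only tracks URL status). A minimal sketch of reading it back, assuming a Nutch 0.9 segment with a single part-00000:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.parse.ParseText;
    import org.apache.nutch.util.NutchConfiguration;

    public class DumpParseText {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // parse_text is a MapFile of url -> ParseText (single-part assumption)
        MapFile.Reader reader =
            new MapFile.Reader(fs, args[0] + "/parse_text/part-00000", conf);
        Text url = new Text();
        ParseText parseText = new ParseText();
        while (reader.next(url, parseText)) {
          System.out.println("URL: " + url);
          System.out.println(parseText.getText()); // the extracted plain text
        }
        reader.close();
      }
    }

bin/nutch readseg -dump <segment> <outdir> does much the same from the command line if no custom code is wanted.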

Re: form-based authentication?

2008-01-15 Thread cornelius2000
Hi, this would be a great feature to have; it's just what I need! Has any progress been made? I'd be more than happy to test out a beta version if I can. Cheers, Iwan. Susam Pal wrote: Hi, indeed the answer is negative, and many people have often asked this on this list.

Need pointers regarding accessing crawled data/plugin etc.

2008-01-15 Thread Manoj Bist
Hi, I would really appreciate it if someone could provide pointers on doing the following (via plugins or otherwise). I have gone through the plugin central on the Nutch wiki. 1.) Is it possible to have control over the 'policy' that decides how soon a URL is fetched? E.g., if a document does not change
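On point 1, for reference: the knob that decides how soon a URL is due again is the fetch interval. In 0.9 there is, as far as I know, only a global default rather than per-page adaptive scheduling; a sketch for conf/nutch-site.xml (property name as in the 0.9 nutch-default.xml, where the value is in days; later versions renamed it):

    <configuration>
      <property>
        <name>db.default.fetch.interval</name>
        <value>7</value>
        <!-- refetch pages every 7 days instead of the 30-day default -->
      </property>
    </configuration>

Per-URL control beyond that generally means adjusting the interval on the CrawlDatum records yourself, e.g. in a custom job over the crawldb.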