Problem in crawling windows shared folder using Nutch's SMB protocol plugin

2009-12-21 Thread Rupesh Mankar
Hi, I downloaded the SMB protocol plugin from the following location: http://issues.apache.org/jira/browse/NUTCH-427 I configured it with Nutch (as described in read.txt). But when I try to crawl, nothing gets crawled and I get the following exception in the Hadoop log. 2009-12-21 16:25:04,728 FATAL

Large files - nutch failing to fetch

2009-12-21 Thread Sundara Kaku
Hi, Nutch is throwing errors while fetching large files (files larger than 100 MB). I have a website with pages that point to large files (file sizes vary from 10 MB to 500 MB), and there are several large files on that website. I want to fetch all the files using Nutch, but Nutch is

Re: Large files - nutch failing to fetch

2009-12-21 Thread Andrzej Bialecki
On 2009-12-21 17:15, Sundara Kaku wrote: Hi, Nutch is throwing errors while fetching large files (files larger than 100 MB). I have a website with pages that point to large files (file sizes vary from 10 MB to 500 MB), and there are several large files on that website. I want to fetch

domain crawl using bin/nutch

2009-12-21 Thread Ted Yu
Hi, I found the db.ignore.external.links property. How do I limit the crawl by also excluding links within the same domain? Thanks

unicode 2029 paragraph separator

2009-12-21 Thread reinhard schwab
http://www.fileformat.info/info/unicode/char/2029/index.htm I have found that this Unicode character (U+2029, paragraph separator) breaks JSON deserialization when using Solr and AJAX. It comes from text extracted from a PDF. Where should this character be filtered out or replaced: in the PDF parser/text extractor, or in the Solr indexer? Regards, Reinhard
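Wherever the filtering ends up living, the replacement itself is small. The following is only a sketch, not an existing Nutch or Solr hook: the class name `UnicodeSeparatorFilter` and the method `replaceSeparators` are hypothetical, and it assumes that turning U+2028 and U+2029 into a plain newline is acceptable for the indexed text. U+2028 is included because it tends to cause the same kind of trouble when JSON is evaluated as JavaScript.

```java
public final class UnicodeSeparatorFilter {

    // U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR
    private static final char LINE_SEPARATOR = (char) 0x2028;
    private static final char PARAGRAPH_SEPARATOR = (char) 0x2029;

    private UnicodeSeparatorFilter() {
    }

    /**
     * Returns a copy of the text in which both separator characters
     * are replaced by a plain newline. A null input is returned as null.
     */
    public static String replaceSeparators(String text) {
        if (text == null) {
            return null;
        }
        return text.replace(LINE_SEPARATOR, '\n')
                   .replace(PARAGRAPH_SEPARATOR, '\n');
    }

    public static void main(String[] args) {
        String extracted = "first paragraph" + PARAGRAPH_SEPARATOR + "second paragraph";
        String cleaned = replaceSeparators(extracted);
        // After cleaning, the separator is no longer present in the string.
        System.out.println(cleaned.indexOf(PARAGRAPH_SEPARATOR) < 0);
    }
}
```

A method like this could be called on the extracted text before it is handed to the indexer, so the stored field never contains the separator.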

Re: domain crawl using bin/nutch

2009-12-21 Thread Jesse Hires
You should be able to do this using one of the variations of *-urlfilter.txt files. Instead of using + in front of the regex, you can tell it to exclude lines that match the regex with a -. Just a guess, I haven't actually tried it, but you could probably use something like the following. (I'm
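As an illustration of the + and - prefixes mentioned above (not the example from the original message, which is cut off here), the sketch below emulates that style of rule list in plain Java. It assumes that rules are checked in order, that the first matching rule decides, and that a URL matching no rule is excluded. The class `UrlFilterSketch`, the method `accepts`, and the example.com rules are all hypothetical and are not part of Nutch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public final class UrlFilterSketch {

    private UrlFilterSketch() {
    }

    /**
     * Checks the URL against each rule in order. A rule is a '+' or '-'
     * followed by a regex. The first rule whose regex is found in the URL
     * decides: '+' accepts, '-' rejects. If nothing matches, the URL is rejected.
     */
    public static boolean accepts(List<String> rules, String url) {
        for (String rule : rules) {
            boolean include = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return include;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical rules: skip one section of the site, keep the rest of it.
        List<String> rules = Arrays.asList(
                "-^http://www\\.example\\.com/private/",
                "+^http://www\\.example\\.com/");

        System.out.println(accepts(rules, "http://www.example.com/index.html"));
        System.out.println(accepts(rules, "http://www.example.com/private/a.html"));
    }
}
```

With rules ordered like this, the exclusion has to come before the broader inclusion, otherwise the inclusion would match first.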

RE: domain crawl using bin/nutch

2009-12-21 Thread Jun Mao
But how can we tell Nutch to crawl this way every time? I do not want to edit *-filter.txt every time. Thanks, Jun -Original Message- From: Jesse Hires [mailto:jhi...@gmail.com] Sent: 2009-12-22 9:23 To: nutch-user@lucene.apache.org Subject: Re: domain crawl using