Re: How fetcher works

2009-07-30 Thread reinhard schwab
Saurabh Suman wrote: > Hi > I have some confusion regarding Fetcher.java. Does Fetcher fetch the HTML page, store it first, and then parse it? > Can I just store the HTML if I don't want to parse it? It can; it has a -noParsing option: bin/nutch fetch Usage: Fetcher [-threads n] [-noParsing]
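A minimal usage sketch based on the usage line above (the segment path and thread count are illustrative, not from the thread):

  # fetch a generated segment without parsing it; parsing can be run later with bin/nutch parse
  bin/nutch fetch crawl/segments/20090730123456 -threads 10 -noParsing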

Nutch and Solr

2009-07-30 Thread Paul Tomblin
I'm trying to follow the example in the Wiki, but it's corrupt. It has a bunch of garbage in the part you're supposed to paste into solrconfig.xml - I don't know if something got interpreted as wiki markup when it shouldn't have, or what, but I doubt superscripts are a normal part of the configuration.

Meaning of ProtocolStatus.ACCESS_DENIED

2009-07-30 Thread Saurabh Suman
Hi. In Fetcher.java, if the protocol status of a URL is ProtocolStatus.ACCESS_DENIED, will Nutch try to crawl it again after a certain time interval? If yes, how can I prevent Nutch from recrawling it when its protocol status is ProtocolStatus.ACCESS_DENIED?

Re: Dumping what I have?

2009-07-30 Thread schroedi
Hi Paul, yeah, there is a dump command: bin/nutch readlinkdb crawl/linkdb/ -dump dumpdir You can also dump the CrawlDB, but I don't know whether the complete data is dumpable or whether this is useful for you... HTH Mario Paul Tomblin wrote: > The nutch data files are pretty opaque, and even "strings" can'
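A sketch of both dumps as plain-text files (assuming the crawl directory is named crawl; the output directory names are illustrative):

  # dump the link database to text
  bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump
  # dump the crawl database to text (readdb also has a -stats option for a quick summary)
  bin/nutch readdb crawl/crawldb -dump crawldb_dump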

Dumping Crawl DB with XML

2009-07-30 Thread schroedi
I found out how to dump the crawldb/linkdb into a txt file. Are there other formats to dump into? Maybe XML? I am also thinking of reading about a CSV format. Thanks in advance, Mario -- Mario Schröder | http://www.finanz-checks.de Phone: +49 34464 62301 Cell: +49 163 27 09 807 http://www.xing.com/go/invite/6035

Nutch in C++

2009-07-30 Thread alxsss
Hi, As I understand it, only the indexing part of Nutch exists in C++, as CLucene. I want to code Nutch in C++, but only if it is worth doing. I wondered whether it is worth coding the remaining parts of Nutch in C++, say the crawler. Can someone give me directions on where to start? Thanks, Alex.

how to exclude some external links

2009-07-30 Thread alxsss
Hi, I would like to know how I can modify the Nutch code to exclude external links with certain extensions. For example, if I have mydomain.com in my urls file and mydomain.com has a lot of links like mydomain.com/mylink.shtml, then I want Nutch not to fetch (crawl) these kinds of URLs at all. Thanks

Re: how to exclude some external links

2009-07-30 Thread Paul Tomblin
On Thu, Jul 30, 2009 at 9:15 PM, wrote: > I would like to know how I can modify the Nutch code to exclude external links > with certain extensions. For example, if I have mydomain.com in my urls file and > mydomain.com has a lot of links like mydomain.com/mylink.shtml, then I want > Nutch not to fetch (crawl
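The usual mechanism for this is Nutch's URL filtering rather than a code change. A hedged sketch: add an exclusion pattern to conf/regex-urlfilter.txt, assuming the urlfilter-regex plugin is enabled in plugin.includes (the .shtml pattern is just an example matching the case above):

  # skip URLs ending in .shtml
  -\.shtml$
  # keep the default catch-all accept rule last
  +.

Rules are applied top to bottom and the first match wins, so the exclusion has to come before the final accept-everything rule.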

Plugin development

2009-07-30 Thread Paul Tomblin
How do I develop a plugin that isn't in the nutch source tree? I want to keep all my project's source code together, and not put the project-specific plugin in with the nutch code. Do I just have my plugin's build.xml include $NUTCH_HOME/src/plugin/build-plugin.xml? (I'm a little shaky on Ant syntax.)

denied by robots.txt rules

2009-07-30 Thread Saurabh Suman
Hi, if a URL is denied once by robots.txt rules, is it crawled again by Nutch?

Re: Plugin development

2009-07-30 Thread Alexander Aristov
This is a simple HowTo: http://wiki.apache.org/nutch/WritingPluginExample-0.9 Best Regards, Alexander Aristov 2009/7/31 Paul Tomblin > How do I develop a plugin that isn't in the nutch source tree? I want > to keep all my project's source code together, and not put the > project-specific plugin
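The wiki example builds the plugin inside the Nutch source tree. A hedged sketch of keeping it in your own project instead: build the jar wherever your sources live, then deploy it into Nutch's runtime plugin folder as a directory containing plugin.xml plus the jar (all names and paths below are hypothetical):

  # build the plugin jar in your own project, with whatever build setup you use
  cd ~/myproject/nutch-plugins/myplugin && ant jar
  # Nutch discovers plugins as subdirectories of its plugin folder (the plugin.folders property),
  # each holding a plugin.xml descriptor and the jar it references
  mkdir -p $NUTCH_HOME/plugins/myplugin
  cp plugin.xml build/myplugin.jar $NUTCH_HOME/plugins/myplugin/

The plugin id also has to be listed in the plugin.includes property in conf/nutch-site.xml, otherwise Nutch will not activate it.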