Compiling Nutch

2010-01-08 Thread Allan Baquerizo
Hello! I've been trying Nutch on Cygwin and it seems really interesting and useful. I am a newbie to Nutch and also a beginner Java programmer, and I would like to know how to recompile and execute Nutch after making some changes in the source files (e.g. Crawl.java). Thanks ;)
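
In a Nutch 1.0 source checkout the usual cycle is to rebuild with Ant and then run the bin/ scripts, which pick up the freshly built classes. A minimal sketch, assuming Ant is installed (the checkout path and crawl arguments below are only examples):

    cd /path/to/nutch            # your source checkout (example path)
    # edit e.g. src/java/org/apache/nutch/crawl/Crawl.java, then rebuild:
    ant clean
    ant                          # recompiles the classes and the job jar under build/
    # the bin/nutch script uses build/ when it exists, so the rebuilt code runs directly:
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50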

Bad connection to FS. command aborted.

2010-01-08 Thread vishnukumar
Hi, I am new to Nutch and Hadoop. I followed the NutchHadoopTutorial from http://wiki.apache.org/nutch/NutchHadoopTutorial. At the step in the tutorial where the urls directory is put into DFS, I got the error: -bash-3.2$ bin/hadoop dfs -put urls urls Bad connection to FS. command aborted. I Googled but have not found any
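
"Bad connection to FS" generally means the client cannot reach the NameNode: either the Hadoop daemons are not running or fs.default.name does not point at them. A rough checklist, assuming the single-node layout from the tutorial (host names and ports are examples):

    bin/hadoop namenode -format     # only once, before the very first start
    bin/start-all.sh                # starts namenode, datanode, jobtracker, tasktrackers
    bin/hadoop dfs -ls /            # should answer without errors if the namenode is up
    # also check that fs.default.name in conf/hadoop-site.xml matches the
    # running namenode, e.g. hdfs://namenode-host:9000
    bin/hadoop dfs -put urls urls   # then retry the failing command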

Nutch

2010-01-08 Thread Dhamodharan
nutch-user@lucene.apache.org

Re: Nutch

2010-01-08 Thread dhamu
Dhamodharan wrote: nutch-user@lucene.apache.org

Adding additional metadata

2010-01-08 Thread Erlend Garåsen
Hello, I have tried to add additional metadata by changing the code in HtmlParser.java and MoreIndexingFilter.java without any luck. Do I really have to do something which is mentioned on the following wiki in order to fetch the content of the metadata, i.e. write my own parser, filter and

Crawling only specific urls and depth

2010-01-08 Thread Kumar Krishnasami
Hi, I am a newbie to nutch. Just started looking at it. I have a requirement to crawl and index only the urls that are specified under the urls folder. I do not want nutch to crawl to any depth beyond the ones that are listed in the urls folder. Can I accomplish this by setting the depth argument

Crawl specific urls and depth argument

2010-01-08 Thread Kumar Krishnasami
Hi, I am a newbie to nutch. Just started looking at it. I have a requirement to crawl and index only the urls that are specified under the urls folder. I do not want nutch to crawl to any depth beyond the ones that are listed in the urls folder. Can I accomplish this by setting the depth argument

Re: Crawl specific urls and depth argument

2010-01-08 Thread Mischa Tuffield
Hello Kumar, There is a config property you can set in conf/nutch-site.xml, as follows: <property> <name>db.max.outlinks.per.page</name> <value>0</value> <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most
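
For readers in the archive, the full property to put in conf/nutch-site.xml looks like the following (description paraphrased from nutch-default.xml). With the value set to 0, no outlinks are processed at all, so only the injected URLs are ever fetched:

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>0</value>
      <description>The maximum number of outlinks that we'll process for a page.
      If this value is nonnegative, at most that many outlinks are kept per page;
      a value of 0 therefore keeps none, and a negative value keeps them all.</description>
    </property>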

Re: Crawl specific urls and depth argument

2010-01-08 Thread Kumar Krishnasami
Thanks, Mischa. That worked!! So, it looks like once this config property is set, crawl ignores the 'depth' argument. Even if I set 'depth' to 2, 3 etc., it will never crawl any of the outlinks. Is that correct? Regards, Kumar. Mischa Tuffield wrote: Hello Kumar, There is a config

Re: Crawl specific urls and depth argument

2010-01-08 Thread Mischa Tuffield
Hi Kumar, I'm happy that that was of use to you. Sadly I have no feel for what the depth argument does; I don't tend to ever use it. I tend to use nutch's more specific commands: inject, generate, fetch, updatedb, merge, etc. Perhaps someone else could shed light on the crawl command.

Enabling Query Strings in *filter.txt files

2010-01-08 Thread Kumar Krishnasami
Hi All, I have some urls that need to be crawled that have a query string in them. I've commented out the appropriate line in crawl-urlfilter.txt and regex-urlfilter.txt to enable crawling of urls that contain a '?' in them. If I crawl urls like: http://queue.acm.org/detail.cfm?id=988409
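
For reference, the rule being commented out is the "probable queries" pattern present in both default filter files. Note that the one-shot crawl command reads conf/crawl-urlfilter.txt while the individual inject/generate/fetch commands read conf/regex-urlfilter.txt, so the change usually has to be made in both:

    # skip URLs containing certain characters as probable queries, etc.
    # -[?*!@=]        <-- left commented out so URLs with a '?' are allowed through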

Re: Enabling Query Strings in *filter.txt files

2010-01-08 Thread Mischa Tuffield
Hi Kumar, You could try using curl and sending the accept headers your nutch installation exposes. These are set in conf/nutch-site.xml; this would at least help you rule out the possibility that techcrunch is blocking your instance of nutch. Mischa On 8 Jan 2010, at 13:01, Kumar Krishnasami
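
Something along these lines, for example. The user-agent string below is made up and should be replaced with whatever http.agent.name/http.agent.version are set to in conf/nutch-site.xml; the Accept-Language value mirrors Nutch's default http.accept.language:

    curl -I \
      -A "MyNutchCrawler/1.0 (http://example.com/bot)" \
      -H "Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3" \
      "http://queue.acm.org/detail.cfm?id=988409"

A normal response here with the same headers Nutch sends suggests the site is not blocking the crawler and the problem lies elsewhere.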

Re: Crawl specific urls and depth argument

2010-01-08 Thread MilleBii
The depth argument is only used by the crawl command and is basically the number of run cycles (crawl/fetch/update/index). 2010/1/8, Mischa Tuffield mischa.tuffi...@garlik.com: Hi Kumar, Am happy that that was of use to you. Sadly I have no feel for what the depth argument does, I don't tend to ever
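
In other words, one unit of depth corresponds to one generate/fetch/updatedb round. A sketch of the equivalent step-by-step commands, using the conventional directory names from the tutorial:

    bin/nutch inject crawl/crawldb urls
    # each round of the following three commands is one "depth" level:
    bin/nutch generate crawl/crawldb crawl/segments
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment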

Re: Adding additional metadata

2010-01-08 Thread MilleBii
For lastModified, just enable the index-more|query-more plugins; they will do the job for you. For other metadata, search the mailing list; it's explained many times how to do it. 2010/1/8, Erlend Garåsen e.f.gara...@usit.uio.no: Hello, I have tried to add additional metadata by changing the code in
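
Enabling them means adding index-more and query-more to the plugin.includes property in conf/nutch-site.xml. The value below is only an illustration; start from the plugin.includes value in your own nutch-default.xml and add the two plugins to it:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Adds index-more and query-more so fields such as lastModified
      are indexed and searchable.</description>
    </property>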

Re: Crawling only specific urls and depth

2010-01-08 Thread Godmar Back
Have you tried using Peano's sixth axiom? On Fri, Jan 8, 2010 at 5:41 AM, Kumar Krishnasami kumara...@vembu.com wrote: Hi, I am a newbie to nutch. Just started looking at. I have a requirement to crawl and index only urls that are specified under the urls folder. I do not want nutch to crawl

Purging from Nutch after indexing with Solr

2010-01-08 Thread Ulysses Rangel Ribeiro
I'm crawling with Nutch 1.0 and indexing with Solr 1.4, and came up with some questions regarding data redundancy with this setup. Considering the following sample segment: 2.0G content, 196K crawl_fetch, 152K crawl_generate, 376K crawl_parse, 392K parse_data, 441M parse_text. 1. From

Re: Purging from Nutch after indexing with Solr

2010-01-08 Thread Andrzej Bialecki
On 2010-01-08 19:07, Ulysses Rangel Ribeiro wrote: I'm crawling with Nutch 1.0 and indexing with Solr 1.4, and came up with some questions regarding data redundancy with this setup. Considering the following sample segment: 2.0G content, 196K crawl_fetch, 152K crawl_generate, 376K

Re: Adding additional metadata

2010-01-08 Thread J.G.Konrad
Something like this may work for your filter. I have not tested this but maybe it will give you a better idea of what you need to do for the author data. This is based on nutch-1.0 so I'm not sure if this would work for the trunk version. public class AuthorFilter implements HtmlParseFilter {
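
An untested sketch of what such a filter could look like against the Nutch 1.0 plugin API (Konrad's original code is truncated above; the package name and the "author" metadata key here are illustrative):

    package org.example.nutch.parse;               // hypothetical package

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class AuthorFilter implements HtmlParseFilter {

      private Configuration conf;

      public ParseResult filter(Content content, ParseResult parseResult,
                                HTMLMetaTags metaTags, DocumentFragment doc) {
        // getGeneralTags() collects <meta name="..." content="..."> pairs found
        // by the HTML parser; "author" is only present if the page declares it.
        String author = metaTags.getGeneralTags().getProperty("author");
        if (author != null) {
          Parse parse = parseResult.get(content.getUrl());
          // stash the value in the parse metadata so an indexing filter can
          // later turn it into an index field
          parse.getData().getParseMeta().set("author", author);
        }
        return parseResult;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }

To actually index the value, a matching IndexingFilter (and a plugin.xml entry registering both extension points) is still needed, as the plugin-writing example on the wiki describes.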

Re: Crawl specific urls and depth argument

2010-01-08 Thread Kumar Krishnasami
Thanks, MilleBii. That explains it. All the docs I came across mentioned something like '-depth depth indicates the link depth from the root page that should be crawled' (from http://lucene.apache.org/nutch/tutorial8.html). MilleBii wrote: Depth argument is only used for the crawl command

Re: Crawling only specific urls and depth

2010-01-08 Thread Kumar Krishnasami
Not sure if Peano's sixth axiom has any specific meaning in the context of nutch. I did try using a depth of 1 and it retrieved the root url as well as urls under subfolders of the root url. Godmar Back wrote: Have you tried using Peano's sixth axiom? On Fri, Jan 8, 2010 at 5:41 AM, Kumar