Hello!
I've been trying Nutch on Cygwin and it seems really interesting and useful. I am
a newbie to Nutch and also a beginner Java programmer. How do I recompile and
execute Nutch after making some changes to the source files (e.g. Crawl.java)?
Thanks ;)
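A minimal sketch of the usual edit/rebuild/run loop, assuming a Nutch source
checkout with the stock Ant build (bin/nutch should pick up freshly compiled
classes from build/; paths and arguments below are illustrative):

  cd $NUTCH_HOME                 # root of the Nutch source tree
  # ... edit src/java/org/apache/nutch/crawl/Crawl.java ...
  ant                            # recompile; classes land under build/
  bin/nutch crawl urls -dir crawl -depth 2   # run with the rebuilt classes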
Hi,
I am new to Nutch and Hadoop. I followed the NutchHadoopTutorial from
http://wiki.apache.org/nutch/NutchHadoopTutorial. At the step in the tutorial
where the urls directory is put into DFS, I got this error:
-bash-3.2$ bin/hadoop dfs -put urls urls
Bad connection to FS. command aborted.
I Googled, but have not found any
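That error usually means the DFS client cannot reach a running NameNode. Two
things worth checking, assuming the single-master layout from that tutorial
(hostname and port below are illustrative): fs.default.name in
conf/hadoop-site.xml must point at the NameNode, and the daemons must actually
be running:

  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>  <!-- must match the NameNode host:port -->
  </property>

  bin/start-all.sh      # starts NameNode, DataNodes, JobTracker, TaskTrackers
  bin/hadoop dfs -ls    # should now list DFS contents instead of aborting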
Hello,
I have tried to add additional metadata by changing the code in
HtmlParser.java and MoreIndexingFilter.java, without any luck. Do I
really have to do what is described on the following wiki in
order to fetch the content of the metadata, i.e. write my own parser,
filter and
Hi,
I am a newbie to nutch. Just started looking at it. I have a requirement to
crawl and index only the urls that are specified under the urls folder. I do
not want nutch to crawl to any depth beyond the ones that are listed in
the urls folder.
Can I accomplish this by setting the depth argument
Hello Kumar,
There is a config property you can set in conf/nutch-site.xml, as follows:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>0</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise all outlinks will be
  processed.</description>
</property>
Thanks, Mischa. That worked!!
So, it looks like once this config property is set, crawl ignores the
'depth' argument. Even if I set 'depth' to 2, 3 etc., it will never
crawl any of the outlinks. Is that correct?
Regards,
Kumar.
Mischa Tuffield wrote:
Hello Kumar,
There is a config
Hi Kumar,
I'm happy that was of use to you. Sadly I have no feel for what the depth
argument does; I don't tend to ever use it. I tend to use nutch's more specific
commands: inject, generate, fetch, updatedb, merge, etc.
Perhaps someone else could shed light on the crawl command.
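For reference, one round of that manual pipeline looks roughly like this (a
sketch only; the directory layout is illustrative, and whether you need an
explicit parse step depends on your fetcher.parse setting):

  bin/nutch inject crawl/crawldb urls               # seed the crawldb
  bin/nutch generate crawl/crawldb crawl/segments   # create a fetch list
  s=`ls -d crawl/segments/* | tail -1`              # newest segment
  bin/nutch fetch $s                                # fetch (and parse, by default)
  bin/nutch updatedb crawl/crawldb $s               # fold results into the crawldb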
Hi All,
I have some urls that need to be crawled that contain a query string. I've
commented out the appropriate line in crawl-urlfilter.txt and
regex-urlfilter.txt to enable crawling of urls that contain a '?'.
If I crawl urls like: http://queue.acm.org/detail.cfm?id=988409
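For reference, the stock rule being commented out is the 'probable queries'
filter (this is the nutch-1.0 default; your copy may differ slightly):

  # skip URLs containing certain characters as probable queries, etc.
  # -[?*!@=]        <-- commented out so urls with '?' pass the filter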
Hi Kumar,
You could try using curl and sending the accept headers your nutch installation
exposes. These are set in conf/nutch-site.xml. This would at least help you
rule out the idea that techcrunch is blocking your instance of nutch.
Mischa
On 8 Jan 2010, at 13:01, Kumar Krishnasami
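For example, something like this (header values are illustrative; substitute
whatever http.agent.name and the accept headers are set to in your
conf/nutch-site.xml):

  curl -v \
    -A 'MyNutchCrawler/1.0 (http://example.com/bot)' \
    -H 'Accept: text/html,application/xhtml+xml' \
    http://www.techcrunch.com/
  # If this is refused while a browser user-agent gets content, the site is
  # probably blocking on the user-agent string.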
The depth argument is only used for the crawl command; basically it is the
number of crawl/fetch/update/index run cycles.
2010/1/8, Mischa Tuffield mischa.tuffi...@garlik.com:
Hi Kumar,
Am happy that that was of use to you. Sadly I have no feel for what the
depth argument does, I don't tend to ever
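For concreteness, an illustrative crawl invocation (not from the thread):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
  # -depth 3 simply runs three of those cycles starting from the seed urls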
For lastModified, just enable the index-more|query-more plugins; they will do
the job for you.
For other metadata, search the mailing list; how to do it has been explained
many times.
2010/1/8, Erlend Garåsen e.f.gara...en@usit.uio.no:
Hello,
I have tried to add additional metadata by changing the code in
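Concretely, that means adding index-more and query-more to plugin.includes in
conf/nutch-site.xml. A sketch, assuming the nutch-1.0 default plugin list
(merge the two plugins into whatever list you already have):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>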
Have you tried using Peano's sixth axiom?
On Fri, Jan 8, 2010 at 5:41 AM, Kumar Krishnasami kumara...@vembu.com wrote:
Hi,
I am a newbie to nutch. Just started looking at. I have a requirement to
crawl and index only urls that are specified under the urls folder. I do not
want nutch to crawl
I'm crawling with Nutch 1.0 and indexing with Solr 1.4, and came up with some
questions regarding data redundancy with this setup.
Considering the following sample segment:

  2.0G  content
  196K  crawl_fetch
  152K  crawl_generate
  376K  crawl_parse
  392K  parse_data
  441M  parse_text

1. From
Something like this may work for your filter. I have not tested this but
maybe it will give you a better idea of what you need to do for the author
data. This is based on nutch-1.0, so I'm not sure if this would work for the
trunk version.
public class AuthorFilter implements HtmlParseFilter {
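The code is cut off here in the archive. For completeness, a hedged sketch of
how such a filter is typically fleshed out against the nutch-1.0
HtmlParseFilter interface; the 'author' meta-tag handling below is an
illustrative assumption, not the original poster's code:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseResult;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;

  public class AuthorFilter implements HtmlParseFilter {

    private Configuration conf;

    // Called for every parsed HTML page; copies the <meta name="author">
    // value (if present) into the parse metadata so an indexing filter
    // can later add it to the index.
    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
      String author = metaTags.getGeneralTags().getProperty("author");
      if (author != null) {
        Parse parse = parseResult.get(content.getUrl());
        parse.getData().getParseMeta().set("author", author);
      }
      return parseResult;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }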
Thanks, MilleBii. That explains it. All the docs I came across mentioned
something like "-depth depth indicates the link depth from the root page
that should be crawled" (from http://lucene.apache.org/nutch/tutorial8.html).
MilleBii wrote:
Depth argument is only used for the crawl command
Not sure if Peano's sixth axiom has any specific meaning in the
context of nutch.
I did try using a depth of 1 and it retrieved the root url as well as
urls under subfolders of the root url.
Godmar Back wrote:
Have you tried using Peano's sixth axiom?
On Fri, Jan 8, 2010 at 5:41 AM, Kumar