On 23.01.11 22.32, Markus Jelsma wrote:
Nutch can detect 404s by recrawling existing URLs. The mutation, however, is
not pushed to Solr at the moment.
Any registered Jira issue for this feature? If not, maybe I should create
one. This is crucial functionality if you want to build a search
Each item in the CrawlDB carries a status field. Reading the CrawlDB will
return this information as well; the same goes for a complete dump, from which
you could create the appropriate delete statements for your Solr instance.
/** Page no longer exists. */
public static final byte STATUS_DB_GONE = 0x03;
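To act on those statuses, a rough sketch (paths and the Solr URL are examples, and it assumes the stock Nutch Solr schema where the document id is the URL): dump the CrawlDB, pick out the gone pages, and delete them from Solr:

bin/nutch readdb crawl/crawldb -dump crawldb-dump
# entries marked "Status: 3 (db_gone)" are pages that no longer exist
grep -B1 "db_gone" crawldb-dump/part-00000
curl "http://127.0.0.1:8080/solr/update?commit=true" \
  -H "Content-Type: text/xml" \
  --data-binary "<delete><id>http://example.com/gone-page</id></delete>"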
No, Nutch 1.x cannot resume an interrupted fetch job.
On Monday 24 January 2011 10:39:37 Amna Waqar wrote:
Dear all,
I am using Nutch 1.2. The problem I'm facing is that I am unable to resume the
crawl after a power failure, network disconnection, or some other sort of
interruption from the same
Hello list,
I have been using Nutch 1.2 to crawl the web for a small number of very
relevant HTML pages and associated URLs containing PDF documents. I have then
been using Luke v1.0.1 to look inside my index to guarantee I have indexed
specific PDF documents which reside on these web
I might not understand your question correctly, but it looks like you
can send your data to Solr and issue your queries there. You'll ask Solr
to return the snippet of the content that matches the query.
If your question relates to how to extract the data from the PDF, you
can configure Nutch to
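(If PDF extraction is the issue, a minimal sketch, assuming Nutch 1.2: make sure plugin.includes in conf/nutch-site.xml covers a PDF-capable parser such as parse-pdf, or parse-tika in later 1.x releases; the value below is abridged, not the full default list:)

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>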
Hi list,
I am using Nutch 1.2 and currently working my way through the Nutch and Hadoop
tutorial on the wiki for the first time. Not having much luck to date, and I
have reached the following part, "So log into the master nodes and all of the
slave nodes as root", which I do not understand. Under
Hi
Can you verify that you set the job tracker entry in the right format in
mapred-config.xml?
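(For reference: with the Hadoop 0.20.x that ships with Nutch 1.2, the entry normally lives in conf/mapred-site.xml; the host and port below are only examples:)

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>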
Thanks,
Charan
Sent from my iPhone
On Jan 24, 2011, at 9:42 AM, McGibbney, Lewis John
lewis.mcgibb...@gcu.ac.uk wrote:
Hi list,
I am using Nutch 1.2 and currently working my way through the Nutch
In addition, my Cygwin output is as follows:
Mcgibbney@Mcgibbney-PC /cygdrive/c/nutch-1.2/bin
$ ./start-all.sh ssh -1 root Mcgibbney-PC
starting namenode, logging to /cygdrive/c/nutch-1.2/bin/../logs/hadoop-Mcgibbney
-namenode-Mcgibbney-PC.out
localhost: /cygdrive/c/nutch-1.2/bin/slaves.sh: line
Hi Charan
I have not touched mapred-config.xml and don't appear to have this file in my
conf directory within the 1.2 distribution!
From: Charan K [charan.ku...@gmail.com]
Sent: 24 January 2011 17:49
To: user@nutch.apache.org
Subject: Re: Hadoop Tutorial
Hi
Can
Hi all,
I am very new to Nutch and Lucene as well. I have a few questions about
Nutch; I know they are very basic, but I could not get clear-cut answers
out of googling for this. The questions are,
- If I have to crawl just 5-6 web sites or URLs, should I use an intranet
crawl or
1. To crawl just 5 to 6 websites, you can use either approach, but an intranet
crawl gives you more control and speed.
2. After the first crawl, the interval before the same sites are recrawled is
30 days by default (db.fetch.interval.default); you can change it to suit your
own needs (see the snippet after this list).
3. I've no idea about the third question.
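(A minimal sketch of the override in conf/nutch-site.xml, assuming Nutch 1.2, where the value is in seconds and 2592000 corresponds to the stock 30 days:)

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
</property>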
Refer to NutchBean.java for the third question. You can run it from the command
line to test the index.
If you use Solr indexing, it is going to be much simpler; they have a Solr
Java client (SolrJ).
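(For example, a quick sanity check of a local Lucene index, assuming the crawl output is in ./crawl or wherever searcher.dir points, and "apache" is just a sample query:)

bin/nutch org.apache.nutch.searcher.NutchBean apache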
Sent from my iPhone
On Jan 24, 2011, at 8:07 PM, Amna Waqar amna.waqar...@gmail.com wrote:
1. To crawl
To use solr:
bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb
crawl/segments/*
assuming the crawl dir is crawl
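(To verify the documents arrived, a quick sanity check against the same Solr instance; the catch-all query is just an example, look at numFound in the response:)

curl "http://127.0.0.1:8080/solr/select?q=*:*&rows=0"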
From: alx...@aim.com [mailto:alx...@aim.com]
Sent: Mon 1/24/2011 9:23 PM
To: user@nutch.apache.org
Subject: Re: Few
NoSQL technology scales better, but for a reasonable volume MySQL
will do the job fine and faster.
Sorry, it was not working that well in my tests with the Gora code as-is
and a MySQL backend, because of the broad SELECT statement. The issue is
described here: