Re: Can Nutch detect modified and deleted URLs?

2011-01-24 Thread Erlend Garåsen
On 23.01.11 22.32, Markus Jelsma wrote: Nutch can detect 404s by recrawling existing URLs. The mutation, however, is not pushed to Solr at the moment. Is there a registered Jira issue for this feature? If not, maybe I should create one. This is a crucial functionality if you want to build a search
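Pushing such a deletion to Solr by hand would look roughly like the following sketch, assuming the index stores the page URL in a field called url and that http://example.com/gone/ is a hypothetical vanished page (the field name depends on your schema.xml):

  curl 'http://localhost:8983/solr/update?commit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>url:"http://example.com/gone/"</query></delete>'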

Re: Can Nutch detect modified and deleted URLs?

2011-01-24 Thread Markus Jelsma
Each item in the CrawlDB carries a status field. Reading the CrawlDB returns this information; the same goes for a complete dump, from which you could create the appropriate delete statements for your Solr instance.

  /** Page no longer exists. */
  public static final
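A rough sketch of that dump-and-filter approach, assuming the crawl directory is crawl (the exact layout of the dump output may differ between versions):

  # dump the CrawlDB as plain text
  bin/nutch readdb crawl/crawldb -dump crawldb-dump
  # pages whose status is db_gone no longer exist; the URL is printed
  # a line or two above the Status line in the dump
  grep -B2 'db_gone' crawldb-dump/part-00000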

Re: resuming the nutch crawl after interruption

2011-01-24 Thread Markus Jelsma
No, Nutch 1.x cannot resume an interrupted fetch job. On Monday 24 January 2011 10:39:37 Amna Waqar wrote: Dear all, I am using Nutch 1.2. The problem I'm facing is that I am unable to resume the crawl after a power failure, network disconnection, or some other sort of interruption from the same
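Since a half-fetched segment cannot be resumed, the usual workaround is to run the crawl step by step rather than with the all-in-one crawl command, discard the interrupted segment, and re-run generate/fetch. A sketch, assuming the crawl directory is crawl and a seed directory urls:

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  s=`ls -d crawl/segments/* | tail -1`   # the newest segment
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s

After an interruption, delete the incomplete segment and repeat from the generate step; the CrawlDB still remembers what was fetched in earlier cycles.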

PDF Content Extraction

2011-01-24 Thread McGibbney, Lewis John
Hello list, I have been using Nutch 1.2 to crawl the web for a small number of very relevant HTML pages and associated URLs containing PDF documents. I have then been using Luke v1.0.1 to look inside my index to guarantee I have indexed specific PDF documents which reside on these web

Re: PDF Content Extraction

2011-01-24 Thread Claudio Martella
I might not understand your question correctly, but it looks like you can send your data to Solr and issue your queries there. You'll ask Solr to return the snippet of the content that matches the query. If your question relates to how to extract the data from the PDF, you can configure Nutch to
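Enabling PDF parsing comes down to adding the PDF parser to plugin.includes in conf/nutch-site.xml. A sketch based on the Nutch 1.2 defaults (your existing value may differ; the key point is the pdf entry in the parse-(...) group):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

On the Solr side, snippets (highlighting) are requested with hl=true and hl.fl=content on the query.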

Hadoop Tutorial

2011-01-24 Thread McGibbney, Lewis John
Hi list, I am using Nutch 1.2 and currently working my way through the Nutch and Hadoop tutorial on the wiki for the first time. Not having much luck to date, and I have reached the following part: "So log into the master nodes and all of the slave nodes as root", which I do not understand. Under
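That step of the tutorial boils down to being able to ssh between the nodes without a password prompt. A sketch of the usual key setup, assuming a single user runs Hadoop on every node:

  # on the master, generate a passwordless key pair
  ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
  # authorize it locally; append id_rsa.pub to ~/.ssh/authorized_keys
  # on every slave as well
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  ssh localhost    # should now log in without asking for a password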

Re: Hadoop Tutorial

2011-01-24 Thread Charan K
Hi, can you verify that you set the job tracker entry in the right format in mapred-config.xml? Thanks, Charan. Sent from my iPhone On Jan 24, 2011, at 9:42 AM, McGibbney, Lewis John lewis.mcgibb...@gcu.ac.uk wrote: Hi list, I am using Nutch 1.2 and currently working my way through the Nutch
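For reference, the job tracker entry being asked about is usually the mapred.job.tracker property; in Hadoop 0.20 the file is conventionally named conf/mapred-site.xml. A sketch, with a placeholder host and port:

  <property>
    <name>mapred.job.tracker</name>
    <value>master-host:9001</value>
  </property>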

RE: Hadoop Tutorial

2011-01-24 Thread McGibbney, Lewis John
In addition my Cygwin output is as follows:

  Mcgibbney@Mcgibbney-PC /cygdrive/c/nutch-1.2/bin
  $ ./start-all.sh
  ssh -1 root Mcgibbney-PC
  starting namenode, logging to /cygdrive/c/nutch-1.2/bin/../logs/hadoop-Mcgibbney-namenode-Mcgibbney-PC.out
  localhost: /cygdrive/c/nutch-1.2/bin/slaves.sh: line

RE: Hadoop Tutorial

2011-01-24 Thread McGibbney, Lewis John
Hi Charan, I have not touched mapred-config.xml and don't appear to have this file in my conf directory within the 1.2 dist! From: Charan K [charan.ku...@gmail.com] Sent: 24 January 2011 17:49 To: user@nutch.apache.org Subject: Re: Hadoop Tutorial Hi Can

Few questions from a newbie

2011-01-24 Thread .: Abhishek :.
Hi all, I am very new to Nutch and Lucene as well. I have a few questions about Nutch; I know they are very basic, but I could not get clear-cut answers out of googling for this. The questions are: - If I have to crawl just 5-6 web sites or URLs, should I use intranet crawl or

Re: Few questions from a newbie

2011-01-24 Thread Amna Waqar
1. To crawl just 5 to 6 websites, you can use both approaches, but an intranet crawl gives you more control and speed. 2. After the first crawl, the recrawl interval for the same sites is 30 days by default (db.fetcher.interval); you can change it to whatever suits you. 3. I've no idea about the third question
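In Nutch 1.x the interval referred to appears in nutch-default.xml as db.fetch.interval.default, expressed in seconds (the exact property name may vary by version); override it in conf/nutch-site.xml:

  <property>
    <name>db.fetch.interval.default</name>
    <!-- 30 days, in seconds -->
    <value>2592000</value>
  </property>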

Re: Few questions from a newbie

2011-01-24 Thread Charan K
Refer to NutchBean.java for the third question. You can run it from the command line to test the index. If you use Solr indexing, it is going to be much simpler; they have a Solr Java client. Sent from my iPhone On Jan 24, 2011, at 8:07 PM, Amna Waqar amna.waqar...@gmail.com wrote: 1,to crawl
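The Solr Java client mentioned is SolrJ. A minimal query sketch against a 2011-era (1.4/3.x) SolrJ, assuming Nutch indexed into http://localhost:8983/solr and the schema has url and content fields:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  public class QueryTest {
    public static void main(String[] args) throws Exception {
      // point at the Solr instance Nutch indexes into (assumed URL)
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrQuery query = new SolrQuery("content:nutch"); // hypothetical query
      query.setRows(10);
      QueryResponse rsp = server.query(query);
      for (SolrDocument doc : rsp.getResults()) {
        System.out.println(doc.getFieldValue("url"));
      }
    }
  }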

RE: Few questions from a newbie

2011-01-24 Thread Chris Woolum
To use Solr:

  bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

assuming the crawl dir is crawl. From: alx...@aim.com [mailto:alx...@aim.com] Sent: Mon 1/24/2011 9:23 PM To: user@nutch.apache.org Subject: Re: Few

Re: Database data storage question

2011-01-24 Thread Alexis
NoSQL technology scales better, but for a reasonable volume MySQL will do the job fine, and faster. Sorry, it was not working that well in my tests with the Gora code as-is and a MySQL backend, because of the broad SELECT statement. The issue is described here: