Just add scoring-opic to your plugin.includes in nutch-site.xml.
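For reference, the nutch-site.xml entry would look something like this (the other plugin names in the list are illustrative defaults; keep whatever plugins you already have and append scoring-opic):

```xml
<!-- nutch-site.xml: append scoring-opic to the plugin list -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|scoring-opic</value>
</property>
```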
Rgrds, Thomas
On 9/1/06, Cam Bazz [EMAIL PROTECTED] wrote:
Hello,
I wanted to index my files, so I followed the instructions at
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
I get: Exception in
Hi Doug,
is it possible you could post your hadoop-site.xml? I would like to
accomplish the same.
Rgrds. Thomas
On 7/21/06, Doug Cook [EMAIL PROTECTED] wrote:
Thanks, Håvard (and Doug, in the original email).
Those pointers, plus a few other tips from elsewhere, did the trick. I'm now up
Eric,
you should set up the searcher.dir property in nutch-site.xml to point
to the crawl directory. See nutch-default.xml for an explanation of
this config property.
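The entry would look something like this (the path below is an example; use your own crawl directory):

```xml
<!-- nutch-site.xml: point the search webapp at the crawl output -->
<property>
  <name>searcher.dir</name>
  <value>/path/to/crawl</value>
</property>
```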
Rgrds, Thomas
On 7/22/06, Eric Wu [EMAIL PROTECTED] wrote:
Hi,
I am new to Nutch and I got a null pointer exception when I try
Matt,
it's the index that is used for searching, not the webdb.
What is the status of these pages in webdb? Likely they are not
fetched yet (DB_UNFETCHED), and thus can never be in your index.
These articles give very nice basic explanation of different concepts:
There's the 'nutch readdb' command:
[EMAIL PROTECTED]:~$ nutch readdb
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <out_dir> [<min>] | -url <url>)
	<crawldb>	directory name where crawldb is located
	-stats	print overall statistics to System.out
	-dump <out_dir>
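For example, a stats run would look roughly like this (the crawl directory name is an assumption, and this of course requires a Nutch installation):

```
bin/nutch readdb crawl/crawldb -stats   # overall counts per page status
bin/nutch readdb crawl/crawldb -url http://example.com/   # status of one URL
```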
For stuff like this best use whole web concepts as explained in the tutorial.
Rgrds, Thomas
On 7/25/06, Robert Sanford [EMAIL PROTECTED] wrote:
I'm running version 0.7.2 and I'm using the Intranet crawl where I
specify a list of site root URIs in a text file along with a list of
regex for
As far as I can tell from the src (0.7.1), it is either calculated from the
last-modified metadata property, or, when it is not available, from the
fetchDate.
See org.apache.nutch.indexer.more.MoreIndexingFilter.
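The fallback can be sketched like this (class and method names here are illustrative, not Nutch's actual code):

```java
// Sketch of the date-selection logic in MoreIndexingFilter:
// prefer the Last-Modified metadata when present, otherwise
// fall back to the fetch date.
public class DateForIndex {
    static long lastModifiedOrFetchDate(Long lastModified, long fetchDate) {
        return (lastModified != null) ? lastModified : fetchDate;
    }

    public static void main(String[] args) {
        // No Last-Modified header: the fetch date wins.
        System.out.println(lastModifiedOrFetchDate(null, 1100000000000L));
    }
}
```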
This also answers my own question
I will try to answer your questions. If I am wrong, I am sure one of the
more experienced developers can correct me ...:)
- How do I update/refresh the index? There is no explanation or example
about the intranet crawl!
The main index (in crawldir/index) is updated by the CrawlTool after every
I think the http://wiki.apache.org/nutch/WritingPluginExample tutorial shows
how to implement the Filter - you would be filtering the 'content' meta tag
instead of 'recommended'. Then it is up to you what other Filters you
enable/disable. Also look at the
I have worked through the WritingPluginExample
(http://wiki.apache.org/nutch/WritingPluginExample) example.
Now I am wondering if the following makes any sense. I would like
to store the date (mmdd) the first time a Page was added to the Index. I
thought I could create a plugin that would add a
If the url is already in WebDB, it will not be added again. (WebDBInjector
calls WebDBWriter.addPageIfNotPresent(page)).
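The effect of addPageIfNotPresent can be sketched with a toy model (this is not the actual WebDBWriter code, just the idempotent-insert idea):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of WebDBWriter.addPageIfNotPresent: the URL is the key,
// so injecting the same URL twice leaves the existing entry untouched.
public class WebDbSketch {
    private final Map<String, String> pagesByUrl = new HashMap<>();

    boolean addPageIfNotPresent(String url, String page) {
        if (pagesByUrl.containsKey(url)) {
            return false; // already injected, nothing happens
        }
        pagesByUrl.put(url, page);
        return true;
    }

    public static void main(String[] args) {
        WebDbSketch db = new WebDbSketch();
        System.out.println(db.addPageIfNotPresent("http://example.com/", "p1")); // true
        System.out.println(db.addPageIfNotPresent("http://example.com/", "p2")); // false
    }
}
```

So duplicate URLs in the injected urls file are harmless.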
Rgrds, Thomas
On 2/13/06, Hasan Diwan [EMAIL PROTECTED] wrote:
I've written a perl script to build up a urls file to crawl from RSS
feeds. Will nutch handle duplicate
, Jack Tang [EMAIL PROTECTED] wrote:
Hi Thomas
I suppose the only unique key for content in the web db is the page's URL. So
why not retrieve the content by URL directly?
/Jack
On 1/8/06, Thomas Delnoij [EMAIL PROTECTED] wrote:
I am working with Nutch 0.7.1.
As far as I understand the current
Maybe one of the other developers can answer my question as well?
I want to know if I only have to change the Fetcher (
org.apache.nutch.fetcher.Fetcher), lines 236-240, to accomplish unique
MD5Hash for each Page based on their URL.
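Hashing the URL string rather than the content would look roughly like this (plain JDK code for illustration, not the actual Nutch MD5Hash class):

```java
import java.security.MessageDigest;

// Hash the URL string instead of the page content, so two pages with
// identical content but different URLs get different digests.
public class UrlMd5 {
    static String md5Hex(String url) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(url.getBytes("UTF-8"))) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same content elsewhere, different URLs: different hashes.
        System.out.println(md5Hex("http://example.com/a"));
        System.out.println(md5Hex("http://example.com/b"));
    }
}
```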
Thanks in advance,
Thomas D.
On 1/15/06, Thomas Delnoij
At 22:14, Thomas Delnoij wrote:
I am working with Nutch 0.7.1.
As far as I understand the current implementation (please correct me if I
am wrong), the MD5Hash is calculated based on the Pages' content. Pages
with the same content but identified by different URLs share the same
Andy,
you need to install the Nutch webapp as the ROOT application of your tomcat
installation, as described in the tutorial:
http://lucene.apache.org/nutch/tutorial.html
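The deployment step would look roughly like this (paths are assumptions; adjust to your Tomcat install):

```
rm -rf $CATALINA_HOME/webapps/ROOT
cp nutch-0.7.2.war $CATALINA_HOME/webapps/ROOT.war
# Tomcat expands ROOT.war on startup; launch it from the crawl
# directory so the webapp finds the index.
```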
Rgrds, Thomas
On 1/11/06, Andy Morris [EMAIL PROTECTED] wrote:
Okay I used this guy's how-to to install IBM JAVA and
-
regex.jar
On 11/29/05, Thomas Delnoij [EMAIL PROTECTED] wrote:
For the sake of the archives, I will answer my own question here: I had to
add the following line to the bin/nutch script to be able to run
org.apache.nutch.net.RegexURLFilter from the command line:
CLASSPATH=${CLASSPATH
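The quoted line is cut off above; a typical addition would follow this pattern (the jar path here is an assumption for illustration):

```
CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar
```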
Kumar,
you can use the nutch readdb [db_name] -stats command to generate statistics
for your WebDB and the nutch segread command for your segments.
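Illustrative 0.7-style invocations (directory names are assumptions, and a Nutch installation is required):

```
bin/nutch readdb crawl/db -stats          # page and link counts for the WebDB
bin/nutch segread crawl/segments/20051129 # inspect a fetched segment
```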
HTH Thomas Delnoij
On 11/29/05, Kumar Limbu [EMAIL PROTECTED] wrote:
Hi Everyone,
I am new to Nutch and I would like to know how I can know how
overrides the classpath environment variable, so adding the
jar there didn't help.
Rgrds, Thomas Delnoij
On 10/5/05, Thomas Delnoij [EMAIL PROTECTED] wrote:
All.
The problem is actually a bit different. I was a bit in a hurry when I
posted the previous message, apologies.
I added both
for 100,000,000 pages, averaging 10 KB each, I would need up to 2000 GB
storage on my datanodes?
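A quick back-of-envelope check of that figure (assuming 10 KB per page and ignoring per-page overhead):

```shell
# 100,000,000 pages * 10 KB each, converted to GB (1 GB = 1024*1024 KB)
kb=$((100000000 * 10))
gb=$((kb / 1024 / 1024))
echo "raw: ${gb} GB; with 2x DFS replication: $((gb * 2)) GB"
```

So roughly 1 TB of raw content, which lands near 2000 GB once you assume a replication factor of 2 on the datanodes.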
Thanks for your help.
Thomas Delnoij
On 11/13/05, Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi,
On 13.11.2005 at 12:58, Thomas Delnoij wrote:
I have studied the available documentation and the mailing
What happens to Pages that cannot be parsed (for instance content-type:
image/jpg)? Are they kept in the WebDB or are they removed?
Thanks for your help. Nutch is a great tool!
- Thomas Delnoij
for testing the regex-urlfilter.
Secondly, I want to tune my regex-urlfilter for maximum relevancy of the
crawl results. By now I have around 50 entries. My second question is
whether I can expect any performance impact.
Your help is greatly appreciated.
Kind regards, Thomas Delnoij.
I was a bit in a hurry when I posted this message, apologies.
The problem is actually a bit different.
I added both urlfilter-regex.jar and nutch-0.7.1.jar to my classpath.
When I run java org.apache.nutch.net.RegexURLFilter,
On 10/5/05, Thomas Delnoij [EMAIL PROTECTED] wrote:
All.
I want
help is really appreciated.
Kind regards, Thomas Delnoij
On 10/5/05, Thomas Delnoij [EMAIL PROTECTED] wrote:
I don't think it is atypical, because I had similar
effects with crawl depth = 10.
Rgrds, Thomas
--- blackwater dev [EMAIL PROTECTED] wrote:
Over a gig now, 18 hours running and still going... might just have to
kill it unless this is typical.
On 7/27/05, blackwater dev [EMAIL PROTECTED]