Re: how to manipulate with MapWritable metaData in CrawlDatum structure

2006-06-12 Thread Stefan Groschupf
Hi Feng, MapWritable is a kind of hash map. You can put in any key-value pair, but the keys and values need to be Writables: http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/Writable.html You can use UTF8 as both the string key and the value, or ByteWritable as the key and UTF8 as the value, etc.

nutch-default.xml configuration

2006-06-12 Thread Lourival Júnior
Hi all! I have a question about the nutch-default.xml configuration file. There is a parameter db.default.fetch.interval that is set by default to 30. It means that pages from the webdb are recrawled every 30 days. http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02058.html I want to know

Re: nutch-default.xml configuration

2006-06-12 Thread Dima Mazmanov
Hi, Lourival. You wrote on 12 June 2006, 19:33:15: Hi all! I have a question about the nutch-default.xml configuration file. There is a parameter db.default.fetch.interval that is set by default to 30. It means that pages from the webdb are recrawled every 30

Re: nutch-default.xml configuration

2006-06-12 Thread Stefan Groschupf
Hi Lourival, this means all pages older than 30 days are potential candidates for a fetch list that is created by the segment generation process. Stefan On 12.06.2006 at 16:33, Lourival Júnior wrote: Hi all! I have a question about the nutch-default.xml configuration file. There is a

Re: nutch-default.xml configuration

2006-06-12 Thread Lourival Júnior
OK. So, do you have any solution to do this job automatically? I have a shell script, but I don't know yet whether it really works. Sorry if I'm being redundant. I'm learning about this tool and I have a lot of questions :). Thanks! On 6/12/06, Dima Mazmanov [EMAIL PROTECTED] wrote: Hi, Lourival. You

Re[2]: nutch-default.xml configuration

2006-06-12 Thread Dima Mazmanov
Hi, Lourival. What kind of shell script do you have? You wrote on 12 June 2006, 19:51:06: OK. So, do you have any solution to do this job automatically? I have a shell script, but I don't know yet whether it really works. Sorry if I'm being redundant. I'm learning about this tool and I have a lot of

Re: nutch-default.xml configuration

2006-06-12 Thread Stefan Groschupf
OK. So, do you have any solution to do this job automatically? I have a shell script, but I don't know yet whether it really works. Shell scripts are the best solution. Sorry if I'm being redundant. I'm learning about this tool and I have a lot of questions :). No problem, but the nutch user

Re: Re[2]: nutch-default.xml configuration

2006-06-12 Thread Lourival Júnior
Let me explain the problem. I have this shell script:
    #!/bin/bash
    # A simple script to run a Nutch re-crawl
    # Arguments are quoted so that an empty or missing value is detected.
    if [ -n "$1" ]
    then
      crawl_dir=$1
    else
      echo "Usage: recrawl crawl_dir [depth] [adddays]"
      exit 1
    fi
    if [ -n "$2" ]
    then
      depth=$2
    else
      depth=5
    fi
    if [ -n "$3" ]
    then
      adddays=$3
    else
      adddays=0
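The argument handling in the script above can be sketched more compactly with shell parameter expansion. This is only an illustration, not the poster's script: `parse_recrawl_args` is a hypothetical helper name, and the defaults (depth 5, adddays 0) are taken from the snippet above.

```shell
# Hypothetical helper (not part of Nutch): parse the re-crawl arguments.
# Sets crawl_dir, depth and adddays; returns 1 if crawl_dir is missing.
parse_recrawl_args() {
  if [ -z "$1" ]; then
    echo "Usage: recrawl crawl_dir [depth] [adddays]" >&2
    return 1
  fi
  crawl_dir=$1
  depth=${2:-5}      # default depth from the original script
  adddays=${3:-0}    # default adddays from the original script
}
```

In a script it would be called as `parse_recrawl_args "$@" || exit 1`. Quoting `"$1"` matters: an unquoted `[ -n $1 ]` collapses to `[ -n ]` when no argument is given, which is always true, so the usage message would never be printed.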

Re[4]: nutch-default.xml configuration

2006-06-12 Thread Dima Mazmanov
Hi, Lourival. OK, after the first indexing you must merge segments, and if you want to reindex your db, you have to delete segments which are older than a predefined date, in your case 30 days. This is my solution; if someone has a better one, please share your experience! Let me explain the problem. I have
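A rough sketch of the "find segments older than 30 days" step described above. It rests on assumptions that are not in the thread: that each segment is a directory directly under a segments directory, and that the directory's modification time is a good enough proxy for its age. `list_old_segments` is a hypothetical helper name. It only lists candidates and deletes nothing.

```shell
# Hypothetical helper (not part of Nutch): print segment directories whose
# modification time is more than max_age_days days in the past.
# -mindepth/-maxdepth are supported by GNU and BSD find.
list_old_segments() {
  segments_dir=$1
  max_age_days=${2:-30}   # 30 days, matching the interval in the thread
  find "$segments_dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$max_age_days"
}
```

Calling `list_old_segments crawl/segments 30` prints one path per line. Review the output before passing it to anything that deletes, for example `rm -r`.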

[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-12 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: --- Attachment: ipInCrawlDatumDraftV5.patch Release Candidate 1 of this patch. This patch contains: + add IP Address to CrawlDatum Version 5 (as byte[4]) +

[jira] Resolved: (NUTCH-303) logging improvements

2006-06-12 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-303?page=all ] Jerome Charron resolved NUTCH-303: -- Resolution: Fixed Nutch now uses the Commons Logging API and log4j as the default implementation. There are 3 log4j.properties configuration files:

Cached.jsp to show images

2006-06-12 Thread Marco Pereira
Hi everybody, As I said in another message, I'm trying to get Nutch to search for images. So far it searches alt and title tags and indexes the image content (what you see when you open an image in Notepad, for example). Now that I've indexed almost 3 million images, I am trying to