Re: Adding additional metadata

2010-01-08 Thread J.G.Konrad
Something like this may work for your filter. I have not tested this but
maybe it will give you a better idea of what you need to do for the author
data. This is based on nutch-1.0 so I'm not sure if this would work for the
trunk version.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class AuthorFilter implements HtmlParseFilter {

  private Configuration conf;
  public ParseResult filter(Content content, ParseResult parse,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // <meta name="author"> content ends up in the general meta tags.
    String author = metaTags.getGeneralTags().getProperty("author");
    if (author != null) {
      parse.get(content.getUrl()).getData().getParseMeta().set("author", author);
    }
    return parse;
  }

  // HtmlParseFilter extends Configurable, so setConf/getConf are required.
  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

You will also need an indexing filter that will store the author data in the
index.
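
Something along these lines might work for the indexing side. It is equally
untested and assumes the Nutch 1.0 IndexingFilter interface; the class name
AuthorIndexingFilter and the "author" index field are just placeholders, and
the plugin still needs its own plugin.xml entry to be registered as an
IndexingFilter extension.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class AuthorIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) {
    // Copy the value the parse filter stored under "author" into the document.
    String author = parse.getData().getParseMeta().get("author");
    if (author != null) {
      doc.add("author", author);
    }
    return doc;
  }

  // Part of the 1.0 interface (Lucene field options); left as a no-op in this sketch.
  public void addIndexBackendOptions(Configuration conf) {
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}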

-Jason

On Fri, Jan 8, 2010 at 6:00 AM, MilleBii  wrote:

> For lastModified, just enable the index|query-more plugins; they will do
> the job for you.
>
> For other metadata, search the mailing list; it's explained many times how to do it.
>
> 2010/1/8, Erlend Garåsen :
> >
> > Hello,
> >
> > I have tried to add additional metadata by changing the code in
> > HtmlParser.java and MoreIndexingFilter.java without any luck. Do I
> > really have to do what is described on the following page in order to
> > fetch the content of the metadata, i.e. write my own parser, filter and
> > a plugin.xml file:
> >
> http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
> >
> > I find the plugin examples complicated and difficult to understand. What
> > the existing HtmlParser does is good for me as long as I am able to
> > fetch two additional metadata fields (author and lastModified), which are
> > included in many of my university's webpages.
> >
> > The last thing I tried to do was to make HtmlParser implement the
> > HtmlParseFilter interface, but the required method I implemented never
> > runs.
> >
> > My hope was that we could use Solr/Nutch instead of Ultraseek, but it
> > requires that we are able to parse our metadata successfully.
> >
> > Erlend
> > --
> > Erlend Garåsen
> > Center for Information Technology Services
> > University of Oslo
> > P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> > Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
> 31050
> >
>
>
> --
> -MilleBii-
>


Re: crawl-urlfilter.txt & regex-urlfilter.txt

2010-01-06 Thread J.G.Konrad
Hi Ken,
  If you are doing a whole-web crawl then only the regex-urlfilter.txt file
is used. It is applied during the inject and fetchlist generation steps. You
can also have it applied when updating the crawldb, but you will have to add
the '-filter' option when running the updatedb command.
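
A sketch of that invocation (the crawldb path and segment name here are just
illustrative):

  bin/nutch updatedb crawl/crawldb crawl/segments/20100106120000 -filter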

  -Jason
On Wed, Jan 6, 2010 at 7:04 AM, Godmar Back  wrote:

> On Wed, Jan 6, 2010 at 5:36 AM, Ken Ken  wrote:
>
> >
> >
> > Hello,
> >
> > I have a few questions on these two files in nutch-1.0/conf
> >
> >
>
> Ken,
>
> I can't answer your question, but I'm struggling with a similar one;
> perhaps
> both questions can be answered at the same time.
> My question is which crawl-urlfilter.txt is being used when, and if it
> matters.
>
> If you have your nutch in say /opt/nutch, and your tomcat in /opt/tomcat,
> then after deploying the .war you have /opt/nutch/conf/crawl-urlfilter.txt
> and /opt/tomcat/webapps/nutch/WEB-INF/classes/crawl-urlfilter.txt
>
> I'd assume that bin/nutch will look at the former, and Tomcat will look at
> the latter, but how do you keep the two in sync?
>
> Specifically, if you rebuild nutch.war and redeploy, won't that override
> the
> crawl-urlfilter.txt in the WEB-INF tree?
>
>  - Godmar
>


Re: Multiple Nutch instances for crawling?

2009-12-18 Thread J.G.Konrad
The crawl directory ( crawldb, linkdb, etc. ) can be shared between jobs,
although depending on the implementation you may run into locking issues.
For example, if a 'generate' job is running while an 'updatedb' job is
executed on the same crawldb, the 'updatedb' job will fail because the
'generate' job holds a lock on the crawldb.


On Thu, Dec 17, 2009 at 10:15 PM, Jun Mao  wrote:
> Is that still true if I start two jobs (they will not share crawldb, linkdb)
> and write the index to two different locations?
>
> Thanks,
>
> Jun
>
> -Original Message-
> From: MilleBii [mailto:mille...@gmail.com]
> Sent: December 17, 2009 16:57
> To: nutch-user@lucene.apache.org
> Subject: Re: Multiple Nutch instances for crawling?
>
> I guess it won't work because of the different nutch-site.xml & url filters
> that you want to use... but you could try installing nutch twice and running
> the crawl/fetch/parse from those two locations, then joining the segments to
> recreate a unified searchable index (make sure you put all your segments
> under the same location).
>
> Just one comment though: I think hadoop will serialize your jobs anyhow, so
> you won't get parallel execution of your hadoop jobs unless you run them
> from different hardware.
>
> 2009/12/16 Christopher Bader 
>
>> Felix,
>>
>> I've had trouble running multiple instances.  I would be interested in
>> hearing from anyone who has done it successfully.
>>
>> CB
>>
>>
>> On Wed, Dec 16, 2009 at 4:26 PM, Felix Zimmermann 
wrote:
>>
>> > Hi,
>> >
>> > I would like to run at least two instances of nutch ONLY for crawling
at
>> > one time; one for very frequently updated sites and one for other
sites.
>> > Will the nutch instances get in trouble when running several
>> > crawlscripts, especially the nutch confdir variable?
>> >
>> > Thanks!
>> > Felix.
>> >
>> >
>> >
>> >
>>
>
>
>
> --
> -MilleBii-
>


Re: Multiple Nutch instances for crawling?

2009-12-18 Thread J.G.Konrad
Nope. I do not know how that version of nutch works. I am not familiar
with that project.

On Thu, Dec 17, 2009 at 3:39 PM, Felix Zimmermann
 wrote:
> Hi Jason,
>
> thank you very much for your detailed description, I'll give this a
> try. Do you know how the multiple instances are realised in the
> nutch-gui on http://github.com/101tec/nutch ?
>
> Felix.
>
> Am Donnerstag, den 17.12.2009, 15:11 -0800 schrieb J.G.Konrad:
>> I have upgraded to 0.19.2 but the capacity scheduler feature is
>> available in 0.19.1. You will need to download the hadoop common
>> package ( http://hadoop.apache.org/common/releases.html ) to get the
>> capacity scheduler jar. It is not included in the Nutch releases.
>>
>>   After downloading, you will want to copy
>> contrib/capacity-scheduler/hadoop-0.19.x-capacity-scheduler.jar into the
>> lib directory where you have your nutch code. There is also an example
>> configuration file that comes with the hadoop package and is quite
>> self-explanatory ( conf/capacity-scheduler ). The docs can be found here:
>> http://hadoop.apache.org/common/docs/r0.19.1/capacity_scheduler.html
>>
>>  The first thing to do is to set the scheduler and define the queues.
>> This is done in the hadoop-site.xml file that is used to start the
>> jobtracker. This is what mine looks like for defining two queues (plus
>> the default).
>>
>> <property>
>>   <name>mapred.jobtracker.taskScheduler</name>
>>   <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
>> </property>
>>
>> <property>
>>   <name>mapred.queue.names</name>
>>   <value>nutchFetchCycle,nutchIndex,default</value>
>> </property>
>>
>> The other file to add is conf/capacity-scheduler.conf. This is where
>> the properties of the queues are defined. Here is some of my scheduler
>> conf:
>>
>>   <property>
>>     <name>mapred.capacity-scheduler.queue.nutchFetchCycle.guaranteed-capacity</name>
>>     <value>50</value>
>>     <description>Percentage of the number of slots in the cluster that are
>>       guaranteed to be available for jobs in this queue.
>>     </description>
>>   </property>
>>   <property>
>>     <name>mapred.capacity-scheduler.queue.nutchIndex.guaranteed-capacity</name>
>>     <value>50</value>
>>     <description>Percentage of the number of slots in the cluster that are
>>       guaranteed to be available for jobs in this queue.
>>     </description>
>>   </property>
>>   <property>
>>     <name>mapred.capacity-scheduler.queue.default.guaranteed-capacity</name>
>>     <value>0</value>
>>     <description>Percentage of the number of slots in the cluster that are
>>       guaranteed to be available for jobs in this queue.
>>     </description>
>>   </property>
>>
>>   You will need to restart your jobtracker in order for these changes
>> to be applied. You will be able to see the queues if you visit the web
>> interface of the jobtracker.
>>
>> To use the different queues you will need to set the
>> mapred.job.queue.name property. To accomplish this I have two
>> directories, nutch-fetch and nutch-index. Each directory has its own
>> conf/hadoop-site.xml file with a different queue name ( I also have
>> different numbers of map/reduce tasks ).
>>
>> nutch-index:
>>   <property>
>>     <name>mapred.job.queue.name</name>
>>     <value>nutchIndex</value>
>>   </property>
>>
>> nutch-fetch:
>>   <property>
>>     <name>mapred.job.queue.name</name>
>>     <value>nutchFetchCycle</value>
>>   </property>
>>
>>
>> When a job is started from one of the directories the job will be
>> placed in the corresponding queue and will be able to run
>> simultaneously with jobs in the other queue. In order for this to work
>> there needs to be at least the capacity for 2 map and 2 reduce tasks
>> so each queue will be guaranteed 1 of each ( in this example since
>> it's a 50/50 distribution).
>>
>> Good luck with your concurrent fetches and don't forget to set the
>> generate.update.crawldb property to 'true' so that you will generate
>> different fetch lists for each instance.
>>
>> Enjoy,
>>   Jason
>>
>>
>> On Thu, Dec 17, 2009 at 1:03 PM, Yves Petinot  wrote:
>> > Jason,
>> >
>> > that sounds really good! ... did you have to upgrade the default version of
>> > Hadoop or were you able to get the distro that comes with Nutch (0.19.1 for
>> > me, which I assume is the standard) to accept it? If the latter worked for
>> > you, can you take us through your configuration changes?
>> >
>> > thanks a bunch ;-)
>> >
>> > -y
>> >
>> > J.G.Konrad wrote:
>> >>
>> >> I have integrated the CapacityTaskScheduler into my Nutch 1.0 setup
>> >> although it is not for doing concurren

Re: Multiple Nutch instances for crawling?

2009-12-17 Thread J.G.Konrad
I have upgraded to 0.19.2 but the capacity scheduler feature is
available in 0.19.1. You will need to download the hadoop common
package ( http://hadoop.apache.org/common/releases.html ) to get the
capacity scheduler jar. It is not included in the Nutch releases.

  After downloading, you will want to copy
contrib/capacity-scheduler/hadoop-0.19.x-capacity-scheduler.jar into the
lib directory where you have your nutch code. There is also an example
configuration file that comes with the hadoop package and is quite
self-explanatory ( conf/capacity-scheduler ). The docs can be found here:
http://hadoop.apache.org/common/docs/r0.19.1/capacity_scheduler.html
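
The copy itself is just something like this (assuming a hadoop-0.19.2 download
and a nutch-1.0 checkout; adjust the paths to wherever yours live):

  cp hadoop-0.19.2/contrib/capacity-scheduler/hadoop-0.19.2-capacity-scheduler.jar nutch-1.0/lib/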

 The first thing to do is to set the scheduler and define the queues.
This is done in the hadoop-site.xml file that is used to start the
jobtracker. This is what mine looks like for defining two queues (plus
the default).

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>

<property>
  <name>mapred.queue.names</name>
  <value>nutchFetchCycle,nutchIndex,default</value>
</property>

The other file to add is conf/capacity-scheduler.conf. This is where
the properties of the queues are defined. Here is some of my scheduler
conf:

  <property>
    <name>mapred.capacity-scheduler.queue.nutchFetchCycle.guaranteed-capacity</name>
    <value>50</value>
    <description>Percentage of the number of slots in the cluster that are
      guaranteed to be available for jobs in this queue.
    </description>
  </property>
  <property>
    <name>mapred.capacity-scheduler.queue.nutchIndex.guaranteed-capacity</name>
    <value>50</value>
    <description>Percentage of the number of slots in the cluster that are
      guaranteed to be available for jobs in this queue.
    </description>
  </property>
  <property>
    <name>mapred.capacity-scheduler.queue.default.guaranteed-capacity</name>
    <value>0</value>
    <description>Percentage of the number of slots in the cluster that are
      guaranteed to be available for jobs in this queue.
    </description>
  </property>

  You will need to restart your jobtracker in order for these changes
to be applied. You will be able to see the queues if you visit the web
interface of the jobtracker.
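
On a stock Hadoop install the restart is just the usual scripts, run from the
hadoop directory (adjust if you start the daemons some other way):

  bin/stop-mapred.sh
  bin/start-mapred.sh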

To use the different queues you will need to set the
mapred.job.queue.name property. To accomplish this I have two
directories, nutch-fetch and nutch-index. Each directory has its own
conf/hadoop-site.xml file with a different queue name ( I also have
different numbers of map/reduce tasks ).

nutch-index:
  <property>
    <name>mapred.job.queue.name</name>
    <value>nutchIndex</value>
  </property>

nutch-fetch:
  <property>
    <name>mapred.job.queue.name</name>
    <value>nutchFetchCycle</value>
  </property>


When a job is started from one of the directories the job will be
placed in the corresponding queue and will be able to run
simultaneously with jobs in the other queue. In order for this to work
there needs to be at least the capacity for 2 map and 2 reduce tasks
so each queue will be guaranteed 1 of each ( in this example since
it's a 50/50 distribution).

Good luck with your concurrent fetches and don't forget to set the
generate.update.crawldb property to 'true' so that you will generate
different fetch lists for each instance.
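
In nutch-site.xml that is just the following (the property ships in
nutch-default.xml and, as far as I remember, defaults to false):

  <property>
    <name>generate.update.crawldb</name>
    <value>true</value>
  </property>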

Enjoy,
  Jason


On Thu, Dec 17, 2009 at 1:03 PM, Yves Petinot  wrote:
> Jason,
>
> that sounds really good! ... did you have to upgrade the default version of
> Hadoop or were you able to get the distro that comes with Nutch (0.19.1 for
> me, which I assume is the standard) to accept it? If the latter worked for
> you, can you take us through your configuration changes?
>
> thanks a bunch ;-)
>
> -y
>
> J.G.Konrad wrote:
>>
>> I have integrated the CapacityTaskScheduler into my Nutch 1.0 setup;
>> I am not using it for concurrent fetching, but it could be used
>> for that purpose.  Using the capacity scheduler you set up two
>> separate queues and allocate half of the available resources
>> (map/reduce tasks) to each queue. Technically there are three queues
>> in my setup, two specialized ones and the 'default'. The capacity
>> scheduler requires the 'default' queue to be defined, although you
>> don't have to send any jobs to it.
>>
>> The only catch is you are not guaranteed symmetrical distribution if
>> you have multiple machines in the cluster. That may or may not be an
>> issue depending on your requirements.
>>
>> -Jason
>>
>>
>> On Thu, Dec 17, 2009 at 11:00 AM, Yves Petinot  wrote:
>>
>>>>>
>>>>> Just one comment though I think hadoop will serialize your jobs any how
>>>>> so
>>>>> you won't get a parallel execution of your hadoop jobs unless you run
>>>>> them
>>>>> from different hardware.
>>>>>
>>>
>>> I'm actually wondering if someone on this list has been able to use
>>> Hadoop's
>>> Fair Scheduler (or any other scheduler for that matter). This would
>>> definitely solve this problem (which i'm experiencing too). Is it at all
>>> possible to chan

Re: Multiple Nutch instances for crawling?

2009-12-17 Thread J.G.Konrad
I have integrated the CapacityTaskScheduler into my Nutch 1.0 setup;
I am not using it for concurrent fetching, but it could be used
for that purpose.  Using the capacity scheduler you set up two
separate queues and allocate half of the available resources
(map/reduce tasks) to each queue. Technically there are three queues
in my setup, two specialized ones and the 'default'. The capacity
scheduler requires the 'default' queue to be defined, although you
don't have to send any jobs to it.

The only catch is you are not guaranteed symmetrical distribution if
you have multiple machines in the cluster. That may or may not be an
issue depending on your requirements.

-Jason


On Thu, Dec 17, 2009 at 11:00 AM, Yves Petinot  wrote:
>>> Just one comment though I think hadoop will serialize your jobs any how
>>> so
>>> you won't get a parallel execution of your hadoop jobs unless you run
>>> them
>>> from different hardware.
>
> I'm actually wondering if someone on this list has been able to use Hadoop's
> Fair Scheduler (or any other scheduler for that matter). This would
> definitely solve this problem (which I'm experiencing too). Is it at all
> possible to change the default scheduler with Nutch 1.0 (and the version of
> Hadoop that comes with it) or do we have to wait until the next hadoop
> upgrade?
>
> hopefully someone on the list can shed some light on this issue,
>
> cheers,
>
> -y
>
>
> MilleBii wrote:
>>
>> I guess it won't work because of the different nutch-site.xml & url filters
>> that you want to use... but you could try installing nutch twice and running
>> the crawl/fetch/parse from those two locations, then joining the segments to
>> recreate a unified searchable index (make sure you put all your segments
>> under the same location).
>>
>> Just one comment though: I think hadoop will serialize your jobs anyhow, so
>> you won't get parallel execution of your hadoop jobs unless you run them
>> from different hardware.
>>
>> 2009/12/16 Christopher Bader 
>>
>>
>>>
>>> Felix,
>>>
>>> I've had trouble running multiple instances.  I would be interested in
>>> hearing from anyone who has done it successfully.
>>>
>>> CB
>>>
>>>
>>> On Wed, Dec 16, 2009 at 4:26 PM, Felix Zimmermann 
>>> wrote:
>>>
>>>

 Hi,

 I would like to run at least two instances of nutch ONLY for crawling at
 one time; one for very frequently updated sites and one for other sites.
 Will the nutch instances get in trouble when running several
 crawlscripts, especially the nutch confdir variable?

 Thanks!
 Felix.





>>
>>
>>
>>
>
>


Why does a url with a fetch status of 'fetch_gone' show up as 'db_unfetched'?

2009-12-03 Thread J.G.Konrad
Why does a url with a fetch status of 'fetch_gone' show up as
'db_unfetched'? Shouldn't the crawldb entry have a status of
'db_gone'? This is happening in nutch-1.0.

Here is one example of what I'm talking about
=
[jkon...@rampage search]$ ./bin/nutch readseg -get
testParseSegment/20091202111849
"http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s";
Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Fri Nov 27 16:28:09 PST 2009
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 7776000 seconds (90 days)
Score: 7.535359E-10
Signature: null
Metadata: _ngt_: 1259781530311

Crawl Fetch::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Wed Dec 02 12:25:21 PST 2009
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 2.47059988E10
Signature: null
Metadata: _ngt_: 1259781530311_pst_: notfound(14), lastModified=0:
http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s

[jkon...@rampage search]$ ./bin/nutch readdb testParseSegment/c -url
"http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s";
URL: http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Apr 03 01:25:21 PDT 2010
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 2.47059988E10
Signature: null
Metadata: _pst_: notfound(14), lastModified=0:
http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s
=


Thanks,
  Jason


support for robot rules that include a wild card

2009-11-19 Thread J.G.Konrad
I'm using nutch-1.0 and have noticed after running some tests that the
robot rules parser does not support wildcards (a.k.a. globbing) in
rules. This means such a rule will not work the way the person who
wrote the robots.txt file expected it to.  For example:

User-Agent: *
Disallow: /somepath/*/someotherpath

Even yahoo has one such rule ( http://m.www.yahoo.com/robots.txt ):
User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?

With the popularity of the wildcard (*) in robots.txt files these days,
what are the plans/thoughts on adding support for it in Nutch?

Thanks,
  Jason