How to control contents to be indexed?

2006-02-10 Thread Elwin
In the process of crawling and indexing, some pages are just used as
temporary links to the pages I want to index, so how can I keep those
kinds of pages from being indexed? Or which part of Nutch should I extend?


Bug in closing the database?

2006-02-10 Thread Nguyen Ngoc Giang
  Hi everyone,

  I constantly encounter this problem when Nutch reaches the database closing
stage. The crawler causes my system to hang and it has to be restarted. Can
anyone help me figure this out? Here is my log file before the hang:

060210 161454 Finishing update
060210 161959 Processing pagesByURL: Sorted 25968131 instructions in 305.23 seconds.
060210 161959 Processing pagesByURL: Sorted 85077.25649510205 instructions/second
060210 162403 Processing pagesByURL: Merged to new DB containing 4088028 records in 133.31 seconds
060210 162403 Processing pagesByURL: Merged 30665.576475883277 records/second
060210 162428 Processing pagesByMD5: Sorted 3479369 instructions in 25.146 seconds.
060210 162428 Processing pagesByMD5: Sorted 138366.6984808717 instructions/second
060210 162529 Processing pagesByMD5: Merged to new DB containing 4088028 records in 56.115 seconds
060210 162529 Processing pagesByMD5: Merged 72850.89548249131 records/second
060210 163426 Processing linksByMD5: Sorted 25926879 instructions in 536.227 seconds.
060210 163426 Processing linksByMD5: Sorted 48350.56608488569 instructions/second
060210 164415 Processing linksByMD5: Merged to new DB containing 50840622 records in 465.897 seconds
060210 164415 Processing linksByMD5: Merged 109124.16692960032 records/second
060210 164840 Processing linksByURL: Sorted 22072525 instructions in 264.309 seconds.
060210 164840 Processing linksByURL: Sorted 83510.30422724916 instructions/second
060210 165830 Processing linksByURL: Merged to new DB containing 50840622 records in 478.304 seconds
060210 165830 Processing linksByURL: Merged 106293.53298320733 records/second
060210 170433 Processing linksByMD5: Sorted 23422502 instructions in 362.723 seconds.
060210 170433 Processing linksByMD5: Sorted 64574.0744314532 instructions/second
060210 171409 Processing linksByMD5: Merged to new DB containing 50840622 records in 453.612 seconds
060210 171409 Processing linksByMD5: Merged 112079.53493293827 records/second

  The weird thing is that Nutch processed linksByMD5 twice, and the numbers
of instructions are not the same.

  Regards,
   Giang


Server list

2006-02-10 Thread Andy Morris
 
Is there any way to have a list of servers that you want to crawl?  I
want to crawl only three servers on campus.  Do I just add them to the
domain list?
Andy


Re: Server list

2006-02-10 Thread Raghavendra Prabhu
Hi

Add the domains to the url file
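
For example (a minimal sketch; the host names below are placeholders for
the three campus servers), the url file is just one URL per line:

    http://server1.example.edu/
    http://server2.example.edu/
    http://server3.example.edu/

and, if you use the one-step crawl command, conf/crawl-urlfilter.txt can be
edited to restrict the crawl to those hosts:

    # accept the three campus servers
    +^http://server1\.example\.edu/
    +^http://server2\.example\.edu/
    +^http://server3\.example\.edu/
    # reject everything else
    -.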

Rgds
Prabhu


On 2/10/06, Andy Morris [EMAIL PROTECTED] wrote:


 Is there any way to have a list of servers that you want to crawl?  I
 want to crawl only three servers on campus.  Do I just add them to the
 domain list?
 Andy



RE: How to control contents to be indexed?

2006-02-10 Thread Vanderdray, Jacob
If you control the temporary links pages, then just add a
robots meta tag.  Take a look at
http://www.robotstxt.org/wc/meta-user.html to see what your options are.
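
For example, a page that should pass its links on without itself being
indexed could carry this in its head (a sketch, assuming you can edit the
page):

    <meta name="robots" content="noindex,follow">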

Jake.

-Original Message-
From: Elwin [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 10, 2006 4:38 AM
To: nutch-user@lucene.apache.org
Subject: How to control contents to be indexed?

In the process of crawling and indexing, some pages are just used as
temporary links to the pages I want to index, so how can I keep those
kinds of pages from being indexed? Or which part of Nutch should I
extend?


Re: How to control contents to be indexed?

2006-02-10 Thread Elwin
Thank you.
But the pages I want to crawl come from the internet, and certainly I can't
control them.


2006/2/10, Vanderdray, Jacob [EMAIL PROTECTED]:

If you control the temporary links pages, then just add a
 robots meta tag.  Take a look at
 http://www.robotstxt.org/wc/meta-user.html to see what your options are.

 Jake.

 -Original Message-
 From: Elwin [mailto:[EMAIL PROTECTED]
 Sent: Friday, February 10, 2006 4:38 AM
 To: nutch-user@lucene.apache.org
 Subject: How to control contents to be indexed?

 In the process of crawling and indexing, some pages are just used as
 temporary links to the pages I want to index, so how can I keep those
 kinds of pages from being indexed? Or which part of Nutch should I
 extend?




--
《盖世豪侠》 drew rave reviews and kept TVB's ratings high, yet
for all its delight TVB still failed to make real use of him.
Stephen Chow was hardly one to stay a small fish in a pond: once
his comic talent had shown itself he would not settle for being
sidelined, so he turned to the film industry and shone on the big
screen. TVB had found its prize horse only to lose it, and by then
regret was of no use.


local in hadoop-default.xml

2006-02-10 Thread Michael Nebel

Hi,

When I don't overwrite hadoop-default.xml and try to use the default
settings, I get errors like:

- bad mapred.job.tracker: local
or
- Not a host:port pair: local

Is anybody using nutch/hadoop without using map-reduce?

Regards

Michael

--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/



Error while indexing (mapred)

2006-02-10 Thread Florent Gluck
Hi,

I have 4 boxes (1 master, 3 slaves), about 33GB worth of segment data
and 4.6M fetched urls in my crawldb.  I'm using the mapred code from
trunk  (revision 374061, Wed, 01 Feb 2006).
I was able to generate the indexes from the crawldb and linkdb, but I
started to see this error recently while  running a dedup on my indexes:


060210 061707  reduce 9%
060210 061710  reduce 10%
060210 061713  reduce 11%
060210 061717  reduce 12%
060210 061719  reduce 11%
060210 061723  reduce 10%
060210 061725  reduce 11%
060210 061726  reduce 10%
060210 061729  reduce 11%
060210 061730  reduce 9%
060210 061732  reduce 10%
060210 061736  reduce 11%
060210 061739  reduce 12%
060210 061742  reduce 10%
060210 061743  reduce 9%
060210 061745  reduce 10%
060210 061746  reduce 100%
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:310)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:329)
        at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:349)

I can see a lot of these messages in the jobtracker log on the master:
...
060210 061743 Task 'task_r_4t50k4' has been lost.
060210 061743 Task 'task_r_79vn7i' has been lost.
...

On every single slave, I get this file not found exception in the
tasktracker log:
060210 061749 Server handler 0 on 50040 caught: java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_273opj/part-4.out
java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_273opj/part-4.out
        at org.apache.nutch.fs.LocalFileSystem.openRaw(LocalFileSystem.java:121)
        at org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
        at org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:226)
        at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
        at org.apache.nutch.mapred.MapOutputFile.write(MapOutputFile.java:93)
        at org.apache.nutch.io.ObjectWritable.writeObject(ObjectWritable.java:121)
        at org.apache.nutch.io.ObjectWritable.write(ObjectWritable.java:68)
        at org.apache.nutch.ipc.Server$Handler.run(Server.java:215)

I used to be able to complete the index dedupping successfully when my
segments/crawldb were smaller, but I don't see why this would be related
to the FileNotFoundException.  I'm nowhere near running out of disk space,
and my hard disks work properly.

Has anyone encountered a similar issue or has a clue about what's happening?

Thanks,
Florent


JobTracker does not start properly

2006-02-10 Thread Mr. Udatny

hello,

JobTracker doesn't want to start.
I am using a current version from the trunk.

When trying to start the jobtracker with

./bin/hadoop jobtracker

I get the following stack trace after some operations.

Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.mapred.JobTrackerInfoServer.<init>(JobTrackerInfoServer.java:56)
        at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:303)
        at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:50)
        at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:813)

any ideas?

regards,

ud




nutch configuration

2006-02-10 Thread carmmello
After some time, I downloaded the latest nightly version of Nutch
(2006-02-10).  Going through nutch-default.xml I could no longer find
the fs.default or the mapred.reduce properties. Where can I find
them?  Is the FAQ that deals with mapreduce still valid?
Thanks




Re: Which version of rss does parse-rss plugin support?

2006-02-10 Thread Chris Mattmann
Hi,


   the contentTitle will be a concatenation of the titles of the RSS Channels
 that we've parsed.
   So the titles of the RSS Channels are what is delivered for indexing, right?

They're certainly part of it, but not the only part. The concatenation of
the titles of the RSS Channels is what is delivered for the title portion
of indexing.

   If I want the indexer to include more information about an rss file (such
 as item descriptions), can I just concatenate them to the contentTitle?

They're already there. There is a variable called indexText: ultimately
that variable includes the item descriptions, along with the channel
descriptions. That, along with the title portion of indexing, is the full
set of textual data delivered by the parser for indexing. So it already
includes that information. Check out lines 137 and 161 in the parser to see
what I mean. Also, check out lines 204-207, which are:

ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
    contentTitle.toString(), outlinks, content.getMetadata());
parseData.setConf(this.conf);
return new ParseImpl(indexText.toString(), parseData);

You can see that the return from the Parser, i.e., the ParseImpl, includes
both the indexText and the parse data (which contains the title text).

Now, if you wanted to add any other metadata gleaned from the RSS to the
title text, or the content text, you can always modify the code to do that
in your own environment. The RSS Parser plugin returns a full channel model
and item model that can be extended and used for those purposes.
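
For instance, a rough sketch of what that modification could look like (a
fragment only; "items", "RSSItem", and "contentTitle" are assumed names for
illustration and may not match the plugin's source exactly):

    // hypothetical: inside the parser, once the RSS items have been parsed
    for (int i = 0; i < items.size(); i++) {
        RSSItem item = (RSSItem) items.get(i);   // parse-rss item model
        if (item.getDescription() != null) {
            // fold each item description into the title text as well
            contentTitle.append(" ").append(item.getDescription());
        }
    }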

Hope that helps!

Cheers,
  Chris


 
 
 On 06-2-6, Chris Mattmann [EMAIL PROTECTED] wrote:
 
 Hi there,
 
   That should work: however, the biggest problem will be making sure that
 text/xml is actually the content type of the RSS that you are parsing,
 which you'll have little or no control over.
 
 Check out this previous post of mine on the list to get a better idea of
 what the real issue is:
 
 http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html
 
 G'luck!
 
 Cheers,
 Chris
 
 
 __
 Chris A. Mattmann
 [EMAIL PROTECTED]
 Staff Member
 Modeling and Data Management Systems Section (387)
 Data Management Systems and Technologies Group
 
 _
 Jet Propulsion Laboratory    Pasadena, CA
 Office: 171-266B    Mailstop: 171-246
 Phone:  818-354-8810
 ___
 
 Disclaimer:  The opinions presented within are my own and do not reflect
 those of either NASA, JPL, or the California Institute of Technology.
 
 -Original Message-
 From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
 Sent: Saturday, February 04, 2006 11:40 PM
 To: nutch-user@lucene.apache.org
 Subject: Re: Which version of rss does parse-rss plugin support?
 
 Hi Chris
 
 
  How do I change the plugin.xml? For example, if I want to crawl rss
  files ending with xml, do I just add a new element?
 
    <implementation id="org.apache.nutch.parse.rss.RSSParser"
        class="org.apache.nutch.parse.rss.RSSParser"
        contentType="application/rss+xml"
        pathSuffix="rss"/>
    <implementation id="org.apache.nutch.parse.rss.RSSParser"
        class="org.apache.nutch.parse.rss.RSSParser"
        contentType="application/rss+xml"
        pathSuffix="xml"/>
 
 Am I right?
 
 
 
  On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:
 
  Hi there,
  Sure it will, you just have to configure it to do that. Pop over to
  $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there, there
  is an attribute called pathSuffix. Change that to handle whatever type of
  rss file you want to crawl. That will work locally. For web-based crawls,
  you need to make sure that the content type being returned for your RSS
  content matches the content type specified in the plugin.xml file that
  parse-rss claims to support.
 
  Note that you might not have *a lot* of success with being able to
  control the content type for rss files returned by web servers. I've seen
  a LOT of inconsistency among the way that they're configured by the
  administrators, etc. However, just to let you know, there are some people
  in the group that are working on a solution to addressing this.
 
 Hope that helps.
 
 Cheers,
 Chris
 
 
 
 On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:
 
  Hi Chris,

  The files of RSS 1.0 have a postfix of rdf. So will the parser recognize
  it automatically as an rss file?
 
 
  On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:
 
 Hi there,
 
  parse-rss is based on commons-feedparser
  (http://jakarta.apache.org/commons/sandbox/feedparser). From the
  feedparser website:

  ...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0,
  and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension
  and RSS 1.0 modules capability...
 
 Hope that helps.
 
 

Re: nutch configuration

2006-02-10 Thread Stefan Groschupf

They are in the Apache Hadoop configuration, which is inside the hadoop jar.

Am 10.02.2006 um 17:33 schrieb carmmello:


After some time, I downloaded the latest nightly version of Nutch
(2006-02-10).  Going through nutch-default.xml I could no longer find
the fs.default or the mapred.reduce properties. Where can I find
them?  Is the FAQ that deals with mapreduce still valid?
Thanks





---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: nutch configuration

2006-02-10 Thread Michael Nebel

Hi,

and if you want to change any of the parameters, just create a
hadoop-site.xml. And attention: ndfs.* is now called dfs.* :-)
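
For example, a minimal hadoop-site.xml sketch (the host name and ports are
placeholders; adjust them to your setup):

    <?xml version="1.0"?>
    <configuration>
      <!-- a real host:port pair here avoids "Not a host:port pair: local" -->
      <property>
        <name>mapred.job.tracker</name>
        <value>master.example.com:9001</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>master.example.com:9000</value>
      </property>
    </configuration>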


Regards

Michael

Stefan Groschupf wrote:


They are in the Apache Hadoop configuration, which is inside the hadoop jar.

Am 10.02.2006 um 17:33 schrieb carmmello:


After some time, I downloaded the latest nightly version of Nutch
(2006-02-10).  Going through nutch-default.xml I could no longer find
the fs.default or the mapred.reduce properties. Where can I find
them?  Is the FAQ that deals with mapreduce still valid?
Thanks





---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net






Re: The latest svn version is not stable

2006-02-10 Thread Doug Cutting

Rafit Izhak_Ratzin wrote:

I just checked out the latest svn version (376446) and built it from scratch.

When I tried to run the jobtracker I got the following message in the
jobtracker log file:


060209 164707 Property 'sun.cpu.isalist' is
Exception in thread "main" java.lang.NullPointerException


Okay.  I think I just fixed this.  Please give it a try.

Thanks,

Doug


Re: nutch inject problem with hadoop

2006-02-10 Thread Doug Cutting

Michael Nebel wrote:
I upgraded to the latest version from svn today, after making some
nuts-and-bolts fixes (missing hadoop-site.xml, webapps-dir).


I just fixed these issues.

I finally 
tried to inject a new set of urls. Doing so, I get the exception below.


I am not seeing this.  Are you still seeing it, with the current 
sources?  If so, can you provide more details?  What OS, JVM?


Thanks,

Doug


Re: nutch inject problem with hadoop

2006-02-10 Thread Doug Cutting

Michael Nebel wrote:

Now it's complaining about a missing class

org/apache/nutch/util/LogFormatter :-(


That's been moved to Hadoop: org.apache.hadoop.util.LogFormatter.
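
So code that used to do

    import org.apache.nutch.util.LogFormatter;

now needs

    import org.apache.hadoop.util.LogFormatter;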

Doug


Re: nutch inject problem with hadoop

2006-02-10 Thread Michael Nebel
Have you ever heard of someone who forgot to run ant clean and to
update the plugin directory? No? Now you have! My mistake. Sorry.


Michael

Doug Cutting wrote:

Michael Nebel wrote:


Now it's complaining about a missing class

org/apache/nutch/util/LogFormatter :-(



That's been moved to Hadoop: org.apache.hadoop.util.LogFormatter.

Doug





Need nutch 0.5

2006-02-10 Thread Scott Simpson
I need a copy of nutch 0.5 which is no longer on the Apache servers. Can
someone send me a copy? Thanks a million.

Scott

 



Re: Which version of rss does parse-rss plugin support?

2006-02-10 Thread Elwin
According to the code:

    theOutlinks.add(new Outlink(r.getLink(), r.getDescription()));

I can see that the item description is also included.

However, when I tried with this feed:
http://kgrimm.bravejournal.com/feed.rss
I could only get the title and description for the channel, and failed to
find the words from the item descriptions when searching.

From the above code, the item description is combined with the outlink url;
is it used as the contentTitle for that url? When the outlink is fetched and
parsed, I think new data about that url will be generated.


On 06-2-11, Chris Mattmann [EMAIL PROTECTED] wrote:

 Hi,


the contentTitle will be a concatenation of the titles of the RSS
 Channels that we've parsed.
   So the titles of the RSS Channels are what is delivered for indexing,
 right?

 They're certainly part of it, but not the only part. The concatenation of
 the titles of the RSS Channels is what is delivered for the title portion
 of indexing.

   If I want the indexer to include more information about an rss file
 (such as item descriptions), can I just concatenate them to the contentTitle?

 They're already there. There is a variable called indexText: ultimately
 that variable includes the item descriptions, along with the channel
 descriptions. That, along with the title portion of indexing, is the full
 set of textual data delivered by the parser for indexing. So it already
 includes that information. Check out lines 137 and 161 in the parser to
 see what I mean. Also, check out lines 204-207, which are:

ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
     contentTitle.toString(), outlinks, content.getMetadata());
 parseData.setConf(this.conf);
 return new ParseImpl(indexText.toString(), parseData);

 You can see that the return from the Parser, i.e., the ParseImpl, includes
 both the indexText and the parse data (which contains the title text).

 Now, if you wanted to add any other metadata gleaned from the RSS to the
 title text, or the content text, you can always modify the code to do that
 in your own environment. The RSS Parser plugin returns a full channel
 model
 and item model that can be extended and used for those purposes.

 Hope that helps!

 Cheers,
 Chris


 
 
  On 06-2-6, Chris Mattmann [EMAIL PROTECTED] wrote:
 
  Hi there,
 
   That should work: however, the biggest problem will be making sure that
  text/xml is actually the content type of the RSS that you are parsing,
  which you'll have little or no control over.

  Check out this previous post of mine on the list to get a better idea of
  what the real issue is:
 
  http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html
 
  G'luck!
 
  Cheers,
  Chris
 
 
  __
  Chris A. Mattmann
  [EMAIL PROTECTED]
  Staff Member
  Modeling and Data Management Systems Section (387)
  Data Management Systems and Technologies Group
 
  _
  Jet Propulsion Laboratory    Pasadena, CA
  Office: 171-266B    Mailstop: 171-246
  Phone:  818-354-8810
  ___
 
  Disclaimer:  The opinions presented within are my own and do not reflect
  those of either NASA, JPL, or the California Institute of Technology.
 
  -Original Message-
  From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
  Sent: Saturday, February 04, 2006 11:40 PM
  To: nutch-user@lucene.apache.org
  Subject: Re: Which version of rss does parse-rss plugin support?
 
  Hi Chris
 
 
  How do I change the plugin.xml? For example, if I want to crawl rss
  files ending with xml, do I just add a new element?
 
    <implementation id="org.apache.nutch.parse.rss.RSSParser"
        class="org.apache.nutch.parse.rss.RSSParser"
        contentType="application/rss+xml"
        pathSuffix="rss"/>
    <implementation id="org.apache.nutch.parse.rss.RSSParser"
        class="org.apache.nutch.parse.rss.RSSParser"
        contentType="application/rss+xml"
        pathSuffix="xml"/>
 
  Am I right?
 
 
 
  On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:
 
  Hi there,
  Sure it will, you just have to configure it to do that. Pop over to
  $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there, there
  is an attribute called pathSuffix. Change that to handle whatever type of
  rss file you want to crawl. That will work locally. For web-based crawls,
  you need to make sure that the content type being returned for your RSS
  content matches the content type specified in the plugin.xml file that
  parse-rss claims to support.

  Note that you might not have *a lot* of success with being able to
  control the content type for rss files returned by web servers. I've seen
  a LOT of inconsistency among the way that they're configured by the
  administrators, etc. However, just to