Re: Nutch-general digest, Vol 1 #935 - 8 msgs

2006-02-07 Thread Saravanaraj Duraisamy
Hi David,

Thanks... Is there a way in Nutch to reindex files based on their
last-modified date?
I have a large number of PDFs and DOCs in a folder. Do I need to reindex
all the files every time I want to update my index?
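Nutch itself did not offer incremental reindexing by timestamp at the time, but a small wrapper outside Nutch can select only the files whose timestamps changed since the last crawl and feed just those paths back in. A sketch (the class and argument handling are illustrative, not a Nutch API):

```java
import java.io.File;

// Illustrative helper, not part of Nutch: list files in a directory whose
// last-modified time is newer than a stored "last crawl" timestamp, so only
// those need to be re-fetched and re-indexed.
public class ModifiedSince {

    // True if the file changed after the given epoch-millis timestamp.
    static boolean needsReindex(File f, long lastCrawlMillis) {
        return f.lastModified() > lastCrawlMillis;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        long lastCrawl = args.length > 1 ? Long.parseLong(args[1]) : 0L;
        File[] files = dir.listFiles();
        if (files == null) return; // not a directory
        for (File f : files) {
            if (f.isFile() && needsReindex(f, lastCrawl)) {
                System.out.println(f.getPath()); // candidate for reindexing
            }
        }
    }
}
```

The printed paths could then serve as the input list for an incremental fetch, leaving unchanged segments alone.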

On 2/8/06, David Wallace <[EMAIL PROTECTED]> wrote:
>
> Hi Saravanaraj,
> For each URL, Nutch reads your filter file from top to bottom, until it
> finds a line (+ or -) that matches the URL.  Then it stops reading.
> Therefore, any files inside E:/Index Samples/Index/ will be INCLUDED,
> because they match the line that says +^file:/E:/Index Samples/.
>
> I suggest you swap over the two lines in the filter file: put
> -^file:/E:/Index Samples/Index/ BEFORE +^file:/E:/Index Samples/; so
> that Nutch encounters it first, when deciding whether to include files
> in that directory.
>
> Regards,
> David.
>
>
> On Mon, 2006-02-06 at 09:03 +0530, Saravanaraj Duraisamy wrote:
> > Hi i am using nutch to index files in local FS and FTP.
> >
> > my filter file is
> >
> > -^(http|ftp|mailto):
> >
>
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|png|PNG|jar)$
> > [EMAIL PROTECTED]
> > -.*(/.+?)/.*?\1/.*?\1/
> > +^file:/E:/Index Samples/
> > -^file:/E:/Index Samples/Index/
> >
> > but nutch crawls the forbidden folders also. Is there a webdb kind of thing
> > for files also? Is it possible to make Nutch index files based on the
> > last-modified date?
> >
> > Can anybody suggest the data structure for a webdb (filedb?) for files. It
> > would be good to group files and create separate segments for each group, so
> > if some files are changed, only those segments need to be replaced.
> >
> > Rgds,
> > D.Saravanaraj
>
>
>
>
> 
> This email may contain legally privileged information and is intended only
> for the addressee. It is not necessarily the official view or
> communication of the New Zealand Qualifications Authority. If you are not
> the intended recipient you must not use, disclose, copy or distribute this
> email or
> information in it. If you have received this email in error, please
> contact the sender immediately. NZQA does not accept any liability for
> changes made to this email or attachments after sending by NZQA.
>
> All emails have been scanned for viruses and content by MailMarshal.
> NZQA reserves the right to monitor all email communications through its
> network.
>
>
> 
>
>


Re: Nutch-general digest, Vol 1 #935 - 8 msgs

2006-02-07 Thread David Wallace
Hi Saravanaraj,
For each URL, Nutch reads your filter file from top to bottom, until it
finds a line (+ or -) that matches the URL.  Then it stops reading. 
Therefore, any files inside E:/Index Samples/Index/ will be INCLUDED,
because they match the line that says +^file:/E:/Index Samples/.  
 
I suggest you swap over the two lines in the filter file: put
-^file:/E:/Index Samples/Index/ BEFORE +^file:/E:/Index Samples/; so
that Nutch encounters it first, when deciding whether to include files
in that directory.
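Concretely, swapping the two lines gives a filter file like this (the redacted [EMAIL PROTECTED] line from the original post is left out):

```
# Nutch stops at the first matching +/- line, so the narrower
# exclusion must come before the broader include:
-^(http|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|png|PNG|jar)$
-.*(/.+?)/.*?\1/.*?\1/
-^file:/E:/Index Samples/Index/
+^file:/E:/Index Samples/
```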
 
Regards,
David.
 
 
On Mon, 2006-02-06 at 09:03 +0530, Saravanaraj Duraisamy wrote:
> Hi i am using nutch to index files in local FS and FTP.
> 
> my filter file is
> 
> -^(http|ftp|mailto):
>
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|png|PNG|jar)$
> [EMAIL PROTECTED]
> -.*(/.+?)/.*?\1/.*?\1/
> +^file:/E:/Index Samples/
> -^file:/E:/Index Samples/Index/
> 
> but nutch crawls the forbidden folders also. is there a web db kind
of thing
> for files also. is it possible to make nutch to index files based on
the
> last modified date.
> 
> can anybody suggest the datastructure for webdb (filedb??) for files.
it
> will be good to group files and create seperate segements for each
group. so
> if some files are changed, only those segments can be replaced.
> 
> Rgds,
> D.Saravanaraj








Re: Speeding up initial searches using cache

2006-02-07 Thread Chris Lamprecht
Just out of curiosity, does anyone here know how well query caching
works in general with an extremely high-volume search engine?

It seems like as your search volume goes up, and the number of unique
queries goes up with it, the cache hit rate would go down, and caching
would help less and less.  Urs Hoelzle (Google) mentioned this in a
talk he gave at UW in 2002:

http://rakaposhi.eas.asu.edu/f02-cse494-mailarchive/msg00138.html
(link to video on this page)

-chris
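For what it's worth, the duplicate-query effect Byron describes below is why even a small bounded cache helps: repeated queries hit the map, while one-off queries age out. A minimal LRU sketch (the class is illustrative, not OSCache's API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal bounded LRU cache for query results; illustrative only, not
// OSCache's API. Access-order iteration makes the eldest entry the
// least-recently-used one, which is evicted once capacity is exceeded.
public class QueryCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public QueryCache(int capacity) {
        super(16, 0.75f, true); // true = access order, i.e. LRU
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

As the share of unique queries rises, more entries are evicted before a repeat arrives, which is exactly the falling hit rate described above.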

On 2/7/06, Byron Miller <[EMAIL PROTECTED]> wrote:
> I use OSCache with great success.
>
> I found that an amazing number (more than I assumed) of the
> queries we get are duplicates of one fashion or another,
> so on top of warming things up as much as possible in
> the OS buffer cache we use OSCache as well.
>
> You could also use Squid to cache pages for x amount
> of time to offload your hotspots to free up cpu time
> for those ad-hoc/random queries. (as long as you
> aren't forcing content expire in your headers)
>
> -byron
>
>
> --- "Insurance Squared Inc."
> <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > Running Nutch 0.7.1 on Mandrake Linux 2006 (P4 with
> > 2 SATA drives in
> > RAID 0, 2 GB of RAM, about 4 million pages, but
> > expecting to hit 10+ million),
> > and finding that our initial queries take up to
> > 15-20 seconds to return
> > results.  I'd like to get that sped up and am
> > seeking thoughts on how
> > to do so.
> >
>


Re: new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Rafit Izhak_Ratzin

After copying the build directory from Hadoop to Nutch,
I can run the crawl cycle; however, I get the following exception (in the
jobtracker log file) many times:


060207 215603 Server connection on port 50020 from IP...
caught: java.lang.RuntimeException: java.lang.ClassNotFoundException: 
org.apache.nutch.mapred.TaskTrackerStatus
java.lang.RuntimeException: java.lang.ClassNotFoundException: 
org.apache.nutch.mapred.TaskTrackerStatus
   at 
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:183)

   at org.apache.hadoop.ipc.RPC$Invocation.readFields(RPC.java:88)
   at org.apache.hadoop.ipc.Server$Connection.run(Server.java:138)
060207 215603 Server connection on port 50020 from IP... : exiting
060207 215708 Server connection on port 50020 from IP...: starting




From: Mike Smith <[EMAIL PROTECTED]>
Reply-To: nutch-user@lucene.apache.org
To: nutch-user@lucene.apache.org
Subject: Re: new svn version:NoClassDefFoundError - JobTracker
Date: Tue, 7 Feb 2006 15:45:26 -0800

I finally finished my first successful experience with Nutch/Hadoop. I
started from 70,000 seeds and these are the results of a one-cycle crawl:

060207 153956 Statistics for CrawlDb: t1/crawldb
060207 153956 TOTAL urls:   850726
060207 153956 avg score:1.037
060207 153956 max score:133.269
060207 153956 min score:1.0
060207 153956 retry 0:  849588
060207 153956 retry 1:  1138
060207 153956 status 1 (DB_unfetched):  788522
060207 153956 status 2 (DB_fetched):60703
060207 153956 status 3 (DB_gone):   1501
060207 153956 CrawlDb statistics: done

But I still have a problem with hadoop.jar; I have to use the built classes
instead!

Thanks, Mike





On 2/7/06, Mike Smith <[EMAIL PROTECTED]> wrote:
>
> Rafit,
>
> get the hadoop project, build it, and use the build folder instead of the
> jar file. It will work fine then. Something is probably missing from the
> hadoop jar.
>
> M
>
>
> On 2/7/06, Rafit Izhak_Ratzin <[EMAIL PROTECTED]> wrote:
> >
> > I am still getting the following exception:
> >
> > Exception in thread "main" java.lang.NullPointerException
> >    at org.apache.hadoop.mapred.JobTrackerInfoServer.<init>(JobTrackerInfoServer.java:56)
> >    at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:303)
> >    at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:50)
> >    at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:813)

> > 060207 161923 Server handler 9 on 50020: starting
> >
> >
> >
> >
> >
> > >From: Doug Cutting < [EMAIL PROTECTED]>
> > >Reply-To: nutch-user@lucene.apache.org
> > >To: nutch-user@lucene.apache.org
> > >Subject: Re: new svn version:NoClassDefFoundError - JobTracker
> > >Date: Tue, 07 Feb 2006 13:04:37 -0800
> > >
> > >Mike Smith wrote:
> > >>The problem is that jetty jar files are missing from the SVN., I
> > replaced
> > >>the Jetty jar files but I get another exception:
> > >
> > >I just restored the jetty lib and the jetty-ext libs.  Does that 
help?

> > >
> > >Doug
> >
> > _
> > FREE pop-up blocking with the new MSN Toolbar - get it now!
> > http://toolbar.msn.click-url.com/go/onm00200415ave/direct/01/
> >
> >
>


_
Express yourself instantly with MSN Messenger! Download today it's FREE! 
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/




Re: new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Mike Smith
I finally finished my first successful experience with Nutch/Hadoop. I
started from 70,000 seeds and these are the results of a one-cycle crawl:

060207 153956 Statistics for CrawlDb: t1/crawldb
060207 153956 TOTAL urls:   850726
060207 153956 avg score:1.037
060207 153956 max score:133.269
060207 153956 min score:1.0
060207 153956 retry 0:  849588
060207 153956 retry 1:  1138
060207 153956 status 1 (DB_unfetched):  788522
060207 153956 status 2 (DB_fetched):60703
060207 153956 status 3 (DB_gone):   1501
060207 153956 CrawlDb statistics: done

But I still have a problem with hadoop.jar; I have to use the built classes
instead!

Thanks, Mike





On 2/7/06, Mike Smith <[EMAIL PROTECTED]> wrote:
>
> Rafit,
>
> get the hadoop project, build it, and use the build folder instead of the
> jar file. It will work fine then. Something is probably missing from the
> hadoop jar.
>
> M
>
>
> On 2/7/06, Rafit Izhak_Ratzin <[EMAIL PROTECTED]> wrote:
> >
> > I am still getting the following exception:
> >
> > Exception in thread "main" java.lang.NullPointerException
> >    at org.apache.hadoop.mapred.JobTrackerInfoServer.<init>(JobTrackerInfoServer.java:56)
> >    at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:303)
> >    at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:50)
> >    at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:813)
> > 060207 161923 Server handler 9 on 50020: starting
> >
> >
> >
> >
> >
> > >From: Doug Cutting < [EMAIL PROTECTED]>
> > >Reply-To: nutch-user@lucene.apache.org
> > >To: nutch-user@lucene.apache.org
> > >Subject: Re: new svn version:NoClassDefFoundError - JobTracker
> > >Date: Tue, 07 Feb 2006 13:04:37 -0800
> > >
> > >Mike Smith wrote:
> > >>The problem is that jetty jar files are missing from the SVN., I
> > replaced
> > >>the Jetty jar files but I get another exception:
> > >
> > >I just restored the jetty lib and the jetty-ext libs.  Does that help?
> > >
> > >Doug
> >
> >
> >
>


Re: new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Mike Smith
Rafit,

get the hadoop project, build it, and use the build folder instead of the jar
file. It will work fine then. Something is probably missing from the hadoop
jar.

M


On 2/7/06, Rafit Izhak_Ratzin <[EMAIL PROTECTED]> wrote:
>
> I am still getting the following exception:
>
> Exception in thread "main" java.lang.NullPointerException
>    at org.apache.hadoop.mapred.JobTrackerInfoServer.<init>(JobTrackerInfoServer.java:56)
>    at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:303)
>    at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:50)
>    at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:813)
> 060207 161923 Server handler 9 on 50020: starting
>
>
>
>
>
> >From: Doug Cutting <[EMAIL PROTECTED]>
> >Reply-To: nutch-user@lucene.apache.org
> >To: nutch-user@lucene.apache.org
> >Subject: Re: new svn version:NoClassDefFoundError - JobTracker
> >Date: Tue, 07 Feb 2006 13:04:37 -0800
> >
> >Mike Smith wrote:
> >>The problem is that jetty jar files are missing from the SVN., I
> replaced
> >>the Jetty jar files but I get another exception:
> >
> >I just restored the jetty lib and the jetty-ext libs.  Does that help?
> >
> >Doug
>
>
>


Re: Please remove NUTCH149 as bug

2006-02-07 Thread Chris Mattmann
Hi Prabhu,

 I've closed the bug in JIRA.

Thanks!

Cheers,
  Chris



On 2/7/06 1:08 PM, "Raghavendra Prabhu" <[EMAIL PROTECTED]> wrote:

> Hi
>
> Can anyone with JIRA access close NUTCH-149?
>
> The bug was filed by me.
>
> The problem is no longer there and the bug is fixed.
>
> That removes one bug from the Open list.
>
> I will also try to find other invalid bugs, if any.
>
> Rgds
> Prabhu

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Rafit Izhak_Ratzin

I am still getting the following exception:

Exception in thread "main" java.lang.NullPointerException
   at org.apache.hadoop.mapred.JobTrackerInfoServer.<init>(JobTrackerInfoServer.java:56)
   at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:303)
   at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:50)
   at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:813)
060207 161923 Server handler 9 on 50020: starting






From: Doug Cutting <[EMAIL PROTECTED]>
Reply-To: nutch-user@lucene.apache.org
To: nutch-user@lucene.apache.org
Subject: Re: new svn version:NoClassDefFoundError - JobTracker
Date: Tue, 07 Feb 2006 13:04:37 -0800

Mike Smith wrote:

The problem is that jetty jar files are missing from the SVN., I replaced
the Jetty jar files but I get another exception:


I just restored the jetty lib and the jetty-ext libs.  Does that help?

Doug






bug fixes

2006-02-07 Thread Raghavendra Prabhu
Hi

I think even

NUTCH-94 and NUTCH-96 have been fixed

Rgds
Prabhu


Please remove NUTCH149 as bug

2006-02-07 Thread Raghavendra Prabhu
Hi

Can anyone with JIRA access close NUTCH-149?

The bug was filed by me.

The problem is no longer there and the bug is fixed.

That removes one bug from the Open list.

I will also try to find other invalid bugs, if any.

Rgds
Prabhu


Re: new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Mike Smith
I could make it work in a strange way; there seems to be a problem with the
hadoop jar file.

I downloaded the hadoop project and built it using Ant. Then I could
start the jobtracker successfully, but when I removed the build folder and
just used the hadoop jar file, it failed again. So I copied all the
hadoop/lib files into nutch/lib and I also copied the build folder of
hadoop into the nutch folder. Now Nutch is using the hadoop build (the
classes) instead of its jar file!!! But it worked fine.

I just started a crawl over 7 seeds and it is going through! I will let
you know the results.

Thanks, Mike



On 2/7/06, Mike Smith <[EMAIL PROTECTED]> wrote:
>
> The problem is that jetty jar files are missing from the SVN., I replaced
> the Jetty jar files but I get another exception:
>
> 060207 123447 Property 'sun.cpu.isalist' is
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.hadoop.mapred.JobTrackerInfoServer.<init>(JobTrackerInfoServer.java:56)
> at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:303)
> at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:50)
> at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:813)
>
> Thanks, Mike
>
>
>  On 2/7/06, Rafit Izhak_Ratzin <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > I am trying to run the new svn version (375414); I am working under the
> > nutch/trunk directory.
> >
> > When I ran the command "bin/hadoop jobtracker" or
> > "bin/hadoop-daemon.sh start jobtracker",
> > I got the following message:
> >
> > Exception in thread "main" java.lang.NoClassDefFoundError:
> > org/mortbay/http/HttpListener
> >    at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:303)
> >    at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:50)
> >    at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:813)
> >
> > Thanks,
> > Rafit
> >
> >
> >
>


Re: new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Doug Cutting

Mike Smith wrote:

The problem is that jetty jar files are missing from the SVN., I replaced
the Jetty jar files but I get another exception:


I just restored the jetty lib and the jetty-ext libs.  Does that help?

Doug


Re: new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Mike Smith
The problem is that jetty jar files are missing from the SVN., I replaced
the Jetty jar files but I get another exception:

060207 123447 Property 'sun.cpu.isalist' is
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.mapred.JobTrackerInfoServer.<init>(JobTrackerInfoServer.java:56)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:303)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:50)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:813)

Thanks, Mike


On 2/7/06, Rafit Izhak_Ratzin <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I am trying to run the new svn version (375414); I am working under the
> nutch/trunk directory.
>
> When I ran the command "bin/hadoop jobtracker" or
> "bin/hadoop-daemon.sh start jobtracker",
> I got the following message:
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/mortbay/http/HttpListener
>    at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:303)
>    at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:50)
>    at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:813)
>
> Thanks,
> Rafit
>
>
>


new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Rafit Izhak_Ratzin

Hi,

I am trying to run the new svn version (375414); I am working under the
nutch/trunk directory.

When I ran the command "bin/hadoop jobtracker" or "bin/hadoop-daemon.sh
start jobtracker",

I got the following message:

Exception in thread "main" java.lang.NoClassDefFoundError:
org/mortbay/http/HttpListener
   at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:303)
   at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:50)
   at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:813)

Thanks,
Rafit





Hadoop Jobtracker fails

2006-02-07 Thread Mike Smith
I am using the new Nutch with Hadoop; the jobtracker fails at initialization
with this exception:

060207 121953 Property 'file.separator' is /
060207 121953 Property 'java.vendor.url.bug' is
http://java.sun.com/cgi-bin/bugreport.cgi
060207 121953 Property 'sun.io.unicode.encoding' is UnicodeLittle
060207 121953 Property 'sun.cpu.endian' is little
060207 121953 Property 'sun.cpu.isalist' is
Exception in thread "main" java.lang.NoClassDefFoundError:
org/mortbay/http/HttpListener
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:303)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:50)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:813)
060207 121953 Server handler 9 on 50010: starting




Thank, Mike


Re: Categorizing content

2006-02-07 Thread Byron Miller


--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> There is - if it's an HTML page, add HTMLFilter. If
> it's other type of 
> content, I'm afraid there is no general
> post-processing hook to add plugins.

I'll check that out! Thanks for pointing me to this.


> > I'd like to also look at Bayesian filtering during the
> > parse phase to look for hidden fonts (text the same color
> > as the background) and spammy pages, or for sites with 3+
> > AdSense ads or other particulars, and score
> > appropriately.
> >
> > Has anyone experimented with this?
> >   
> 
> Again, HTMLFilters is the place to add such things.
> 
> Now, an interesting thing would be to keep this
> categorization around, 
> so that next time you can skip/demote pages, which
> are known as spam. 
> This is the purpose of the "CrawlDatum metadata"
> patch... coming soon, I 
> hope :-)

That's what i'm waiting (Rather excited) for :)

Looking to initially flag adult related pages, but use
existing filtering processing to look for patterns to
flag as spam as well.

-byron


Re: Categorizing content

2006-02-07 Thread Andrzej Bialecki

Byron Miller wrote:
Is there an easy way to categorize content on parse?
I have an extensive list of adult terms and I would
like to update meta info on the page if the
combination of terms exists, to flag it as adult content
so I can exclude it from the search results unless
people opt in.


There is - if it's an HTML page, add HTMLFilter. If it's other type of 
content, I'm afraid there is no general post-processing hook to add plugins.



I'd like to also look at Bayesian filtering during the
parse phase to look for hidden fonts (text the same color
as the background) and spammy pages, or for sites with 3+
AdSense ads or other particulars, and score
appropriately.

Has anyone experimented with this?


Again, HTMLFilters is the place to add such things.

Now, an interesting thing would be to keep this categorization around, 
so that next time you can skip/demote pages, which are known as spam. 
This is the purpose of the "CrawlDatum metadata" patch... coming soon, I 
hope :-)


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: hadoop-default.xml

2006-02-07 Thread Doug Cutting
The file packaged in the jar is used for the defaults.  It is read from 
the jar file.  So it should not need to be committed to Nutch.


Mike Smith wrote:

There is no settings file for Hadoop in conf/. Should it be
hadoop-default.xml?
It seems this file is not committed, but it is packaged into the hadoop jar file.


Thanks, Mike.
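Doug's point above — the defaults ship inside the jar and are read from there — comes down to a classpath-resource lookup rather than a filesystem read. A sketch (the helper is illustrative, not Hadoop's actual config-loading code):

```java
import java.io.IOException;
import java.io.InputStream;

// Illustrative helper, not Hadoop's loader: a file bundled inside any jar on
// the classpath can be opened as a resource stream, so no copy in conf/ is
// required for the defaults to be found.
public class JarResource {

    public static boolean onClasspath(String name) {
        try (InputStream in =
                 JarResource.class.getClassLoader().getResourceAsStream(name)) {
            return in != null; // null means the resource is not on the classpath
        } catch (IOException e) {
            return false;
        }
    }
}
```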



hadoop-default.xml

2006-02-07 Thread Mike Smith
There is no settings file for Hadoop in conf/. Should it be
hadoop-default.xml?
It seems this file is not committed, but it is packaged into the hadoop jar file.


Thanks, Mike.


Re: Categorizing content

2006-02-07 Thread 盖世豪侠
That sounds OK.
But I think if you don't check it online, you may end up with a lot of
unwanted content in your index.


2006/2/8, Jack Tang <[EMAIL PROTECTED]>:
>
> Hi Byron
>
> I am wondering whether it would be faster to do this offline? I mean, you can
> revisit the webdb and linkdb and generate the index.
>
> /Jack
>
> On 2/8/06, Byron Miller <[EMAIL PROTECTED]> wrote:
> > Is there an easy way to categorize content on parse?
> > I have an extensive list of adult terms and i would
> > like to update meta info on the page if the
> > combination of terms exist to flag it as adult content
> > so i can exclude it from the search results unless
> > people opt in.
> >
> > I'd like to also look at bayesian filtering during the
> > parse phase to look for hidden font (text same color
> > as background) and spammy pages or for sites with 3+
> > adsense ads or other particulars and score
> > appropriately.
> >
> > Has anyone experimented with this?
> >
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>



--
《盖世豪侠》 won rave reviews and kept TVB's (无线) ratings consistently high, yet for all its delight the station still did not give him a major role. 周星驰 (Stephen Chow) was hardly one to stay in the pond: once his comedic talent had emerged he was naturally unwilling to be left out in the cold, so he moved into film and showed his flair on the big screen. TVB had gained a thousand-li horse only to lose it, and regretted it too late.


opensearch support

2006-02-07 Thread Geraint Williams
Is OpenSearch being developed?

I am using nutch 0.7 and it seems to have some opensearch support.

However, I failed to get either a Python or a Perl OpenSearch client
library working (admittedly these are also in early development).  The Perl
library seemed to choke on not finding the OpenSearchDescription; I
didn't have enough time to investigate.

I can of course, just post and parse the xml search results manually.

Thanks,
Geraint


Re: Categorizing content

2006-02-07 Thread Jack Tang
Hi Byron

I am wondering whether it would be faster to do this offline? I mean, you can
revisit the webdb and linkdb and generate the index.

/Jack

On 2/8/06, Byron Miller <[EMAIL PROTECTED]> wrote:
> Is there an easy way to categorize content on parse?
> I have an extensive list of adult terms and i would
> like to update meta info on the page if the
> combination of terms exist to flag it as adult content
> so i can exclude it from the search results unless
> people opt in.
>
> I'd like to also look at bayesian filtering during the
> parse phase to look for hidden font (text same color
> as background) and spammy pages or for sites with 3+
> adsense ads or other particulars and score
> appropriately.
>
> Has anyone experimented with this?
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: Categorizing content

2006-02-07 Thread 盖世豪侠
Hi
I think you have to hook into the parsed content from the parse-html plugin and
filter the string against your terms.
That will, of course, involve modifying or adding some code.


2006/2/8, Byron Miller <[EMAIL PROTECTED]>:
>
> Is there an easy way to categorize content on parse?
> I have an extensive list of adult terms and i would
> like to update meta info on the page if the
> combination of terms exist to flag it as adult content
> so i can exclude it from the search results unless
> people opt in.
>
> I'd like to also look at bayesian filtering during the
> parse phase to look for hidden font (text same color
> as background) and spammy pages or for sites with 3+
> adsense ads or other particulars and score
> appropriately.
>
> Has anyone experimented with this?
>





Categorizing content

2006-02-07 Thread Byron Miller
Is there an easy way to categorize content on parse?
I have an extensive list of adult terms and I would
like to update meta info on the page if the
combination of terms exists, to flag it as adult content
so I can exclude it from the search results unless
people opt in.

I'd like to also look at Bayesian filtering during the
parse phase to look for hidden fonts (text the same color
as the background) and spammy pages, or for sites with 3+
AdSense ads or other particulars, and score
appropriately.

Has anyone experimented with this?
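As a sketch of the term-matching idea (illustrative only — in Nutch this logic would have to live inside a parse-time plugin, and the term list and threshold here are placeholders):

```java
import java.util.Set;

// Illustrative parse-time flagger: count hits against a term list and mark
// the page as adult content once a threshold is reached. A real version
// would set a metadata field on the parse output instead of returning a flag.
public class AdultFlagger {
    private final Set<String> terms;
    private final int threshold;

    public AdultFlagger(Set<String> terms, int threshold) {
        this.terms = terms;
        this.threshold = threshold;
    }

    // Count occurrences of flagged terms in the page text.
    public boolean isAdult(String pageText) {
        int hits = 0;
        for (String token : pageText.toLowerCase().split("\\W+")) {
            if (terms.contains(token) && ++hits >= threshold) {
                return true;
            }
        }
        return false;
    }
}
```

A search front end could then honor an opt-in setting by filtering out documents whose metadata carries the flag.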


Re: Speeding up initial searches using cache

2006-02-07 Thread Byron Miller
I use OSCache with great success.

I found that an amazing number (more than I assumed) of the
queries we get are duplicates of one fashion or another,
so on top of warming things up as much as possible in
the OS buffer cache we use OSCache as well.

You could also use Squid to cache pages for x amount
of time to offload your hotspots to free up cpu time
for those ad-hoc/random queries. (as long as you
aren't forcing content expire in your headers)

-byron


--- "Insurance Squared Inc."
<[EMAIL PROTECTED]> wrote:

> Hi,
> 
> Running Nutch 0.7.1 on Mandrake Linux 2006 (P4 with
> 2 SATA drives in
> RAID 0, 2 GB of RAM, about 4 million pages, but
> expecting to hit 10+ million),
> and finding that our initial queries take up to
> 15-20 seconds to return
> results.  I'd like to get that sped up and am
> seeking thoughts on how
> to do so.
>


Re: Installing nutch

2006-02-07 Thread Bernd Fehling

Zaheed Haque schrieb:

Sorry you should update your nutch

svn update

to revision 375624

Cheers



Thanks, will do that and redo everything from scratch.

Bernd


Re: Installing nutch

2006-02-07 Thread Zaheed Haque
Sorry you should update your nutch

svn update

to revision 375624

Cheers

On 2/7/06, Zaheed Haque <[EMAIL PROTECTED]> wrote:
> Hi:
>
> Have you looked at the searcher.dir property in the nutch-default.xml
> config file?
> You need to modify this to point at the DFS location of your crawl directory; I
> think you will have something like /user/nutch etc. You can find
> it by trying the following:
>
> bin/hadoop dfs
>
> and
>
> bin/hadoop dfs -ls
>
> do you see anything there?? (Previously NDFS)
>
> I am not sure this will help...
>
> On 2/7/06, Bernd Fehling <[EMAIL PROTECTED]> wrote:
> > For those of you who are also reinventing the wheel like me
> > getting nutch-0.8-dev with MapReduce running on a single box
> > here are some updates.
> > This is about revision #374443.
> >
> > The DmozParser class mentioned in "quick tutorial for nutch
> > 0.8 and later" seems to be in "org.apache.nutch.tools.DmozParser"
> > and not "org.apache.nutch.crawl.DmozParser".
> >
> > Against all odds I managed to get a single web page fetched,
> > as the log from my web server tells me, and also the tasktracker
> > log.
> >
> > Set all named properties in file nutch-default.xml containing
> > the substring "verbose" to "true" to get more info from the
> > log files.
> >
> > As far as I could figure out, there will be no index under
> > the "/tmp/nutch/mapred/local/index/" directory.
> > I think it will be included in a file named "/tmp/nutch/ndfs/name/edits".
> >
> > The user interface is running and I keep the ROOT/WEB-INF/classes
> > in sync with nutch/conf/ directory. The footer.html file
> > is missing in each language directory. So copy it from e.g.
> > include/footer.html to en/include/footer.html.
> >
> > What I didn't manage is getting access to the index from the
> > user interface. How does the user interface know that I
> > named my index "myindexTargetFolder" as in the tutorial?
> > Mystery...
> > Maybe a property to set somewhere...
> >
> > Regards,
> > Bernd
> >
> >
> > Bernd Fehling schrieb:
> > > Went through the tutorial for nutch 0.8.
> > > No further error messages.
> > > All seems to run fine, but where is the index?
> > >
> > > Used a single URL to start with but searching for
> > > any term from that site gives no results.
> > > I guess there is no index at all?
> > >
> > > Where to find a crawler log file?
> > >
> > > Bernd
> > >
> > > Stefan Groschupf schrieb:
> > >
> > >>> Is it just a simple text file with one URL per line?
> > >>
> > >>
> > >> Yes
> > >>
> > >
> > >
> >
>


RE: How deep to go

2006-02-07 Thread Vanderdray, Jacob
If you only want to crawl www.woodward.edu, then change

+^http://([a-z0-9]*\.)*woodward.edu/

To:

+^http://www.woodward.edu/

Jake.
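Putting that change together with a final catch-all, the whole filter for a single-host crawl would look something like this (the trailing "-." reject and the escaped dots are additions, not from the thread; in these regexes an unescaped "." matches any character):

```
# accept only the main www host
+^http://www\.woodward\.edu/
# reject everything else, including the calendar server
-.
```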

-Original Message-
From: Andy Morris [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 06, 2006 9:00 PM
To: nutch-user@lucene.apache.org
Subject: RE: How deep to go

Stefan,
Thanks for the info.
I want to limit the crawl to our new site. I have changed
+^http://([a-z0-9]*\.)*apache.org/ to
+^http://([a-z0-9]*\.)*woodward.edu/, but I get to one server and it is
stuck in a never-ending crawl; it's a calendar server.  I want to limit
the crawl to one server and that's all.  I need to only crawl the main
site of "www", which is www.woodward.edu, and not everything with
*.woodward.edu; we have too many servers that I don't need to get info
from. Any suggestions on this?
andy
-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 06, 2006 6:58 PM
To: nutch-user@lucene.apache.org
Subject: Re: How deep to go

Instead of using the crawl command I personally prefer the manual
commands.
I use a small script that runs
http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling
in a never-ending loop, waiting a day between iterations.
This will make sure that you have all links that match your URL filter.
Just don't forget to remove old segments and merge indexes together; more
about such things can be found in the mail archive.
Also don't forget to add the plugins (e.g. the PDF parser).
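Sketched out, that loop might look like the following. This is an illustration, not the exact script from the tutorial: the generate/fetch/updatedb command names follow the 0.8-era whole-web tutorial, the paths are placeholders, and `bin/nutch` is stubbed with a shell function so the structure can be read (and run) on its own:

```shell
# Stub standing in for bin/nutch; replace with the real launcher.
nutch() { echo "nutch $*"; }

crawldb=crawl/crawldb      # placeholder crawldb location
segments=crawl/segments    # placeholder segments location

# In production this would be `while true` with a `sleep 86400` per pass.
for pass in 1 2 3; do
  nutch generate "$crawldb" "$segments"   # select URLs due for fetching
  seg="$segments/$(date +%Y%m%d)-$pass"   # newest segment
                                          # (normally: ls -d "$segments"/* | tail -1)
  nutch fetch "$seg"                      # fetch the selected pages
  nutch updatedb "$crawldb" "$seg"        # merge results back into the crawldb
done
# Afterwards: invert links, index the new segments, dedup, merge indexes,
# and delete segments that are no longer needed.
```

The cleanup steps in the trailing comment are the "remove old segments and merge indexes" housekeeping mentioned above.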

HTH
Stefan

Am 05.02.2006 um 19:54 schrieb Andy Morris:

> How deep should a good intranet crawl be...10-20?
> I still can't get all of my site searchable..
>
> Here is my situation...
> I want to crawl just a local site for our intranet.   We have just
> rolled out an ASP-only website from a pure HTML site.  I ran nutch on
> the old site and got great results.  Since moving to this new site I am
> having a devil of a time retrieving good information and missing a ton
> of info altogether.  I am not sure what settings I need to change to
> get good results.  One setting that I have set does produce good
> results, but it seems to crawl other websites and not just my domain.
> In the last line of the crawl-urlfilter file I just replaced the - with
> + so it does not ignore other information.  Our site is
> www.woodward.edu; I was wondering if someone on this list can crawl
> this site and only this domain and see what they come up with.
> Woodward.edu is the domain.  I am just stumped as to what to do next.
> I am running a nightly build from January 26th 2006.
>
> My criteria for our local search is to be able to search PDF, images,
> doc, and web content.  You can go here and see what the search page
> pulls up http://search.woodward.edu .
>
> Thanks for any help this list can provide.
> Andy Morris
>

---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: Installing nutch

2006-02-07 Thread Zaheed Haque
Hi:

Have you looked at the searcher.dir property in the nutch-default.xml
config file?
You need to modify this to reflect where your crawl directory is in DFS;
I think you will have something like /user/nutch etc.  You can find
it by trying the following:

bin/hadoop dfs

and

bin/hadoop dfs -ls

Do you see anything there?  (The DFS was previously called NDFS.)

I am not sure this will help...
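For reference, pointing the web UI at the crawl output would be an override like the following in conf/nutch-site.xml. The searcher.dir property is the one defined in nutch-default.xml; the path shown is only a placeholder for wherever your crawl directory actually lives:

```xml
<property>
  <name>searcher.dir</name>
  <!-- Directory containing the index/ and segments/ produced by the
       crawl; replace with your own crawl output directory. -->
  <value>/user/nutch/crawl</value>
</property>
```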

On 2/7/06, Bernd Fehling <[EMAIL PROTECTED]> wrote:
> For those of you who are also reinventing the wheel like me
> getting nutch-0.8-dev with MapReduce running on a single box
> here are some updates.
> This is about revision #374443.
>
> The DmozParser class mentioned in "quick tutorial for nutch
> 0.8 and later" seems to be in "org.apache.nutch.tools.DmozParser"
> and not "org.apache.nutch.crawl.DmozParser".
>
> Against all odds I managed to get a single web page fetched
> as the log from my web server tells and also the tasktracker
> log.
>
> Set all named properties in file nutch-default.xml containing
> the substring "verbose" to "true" to get more info from the
> log files.
>
> As far as I could figure out, there will be no index under
> "/tmp/nutch/mapred/local/index/" directory.
> I think it will be included in a file named "/tmp/nutch/ndfs/name/edits"
>
> The user interface is running and I keep the ROOT/WEB-INF/classes
> in sync with nutch/conf/ directory. The footer.html file
> is missing in each language directory. So copy it from e.g.
> include/footer.html to en/include/footer.html.
>
> What I didn't manage is getting access to the index from the
> user interface. How does the user interface know that I
> named my index "myindexTargetFolder" as in the tutorial?
> Mystery...
> Maybe a property to set somewhere...
>
> Regards,
> Bernd
>
>
> Bernd Fehling schrieb:
> > Went through the tutorial for nutch 0.8.
> > No further error messages.
> > All seems to run fine but where is the index?
> >
> > Used a single URL to start with but searching for
> > any term from that site gives no results.
> > I guess there is no index at all?
> >
> > Where to find a crawler log file?
> >
> > Bernd
> >
> > Stefan Groschupf schrieb:
> >
> >>> Is it just a simple text file with one URL per line?
> >>
> >>
> >> Yes
> >>
> >
> >
>


Re: Installing nutch

2006-02-07 Thread Bernd Fehling

For those of you who are also reinventing the wheel like me
getting nutch-0.8-dev with MapReduce running on a single box
here are some updates.
This is about revision #374443.

The DmozParser class mentioned in "quick tutorial for nutch
0.8 and later" seems to be in "org.apache.nutch.tools.DmozParser"
and not "org.apache.nutch.crawl.DmozParser".

Against all odds I managed to get a single web page fetched
as the log from my web server tells and also the tasktracker
log.

Set all named properties in file nutch-default.xml containing
the substring "verbose" to "true" to get more info from the
log files.

As far as I could figure out, there will be no index under
"/tmp/nutch/mapred/local/index/" directory.
I think it will be included in a file named "/tmp/nutch/ndfs/name/edits"

The user interface is running and I keep the ROOT/WEB-INF/classes
in sync with nutch/conf/ directory. The footer.html file
is missing in each language directory. So copy it from e.g.
include/footer.html to en/include/footer.html.

What I didn't manage is getting access to the index from the
user interface. How does the user interface know that I
named my index "myindexTargetFolder" as in the tutorial?
Mystery...
Maybe a property to set somewhere...

Regards,
Bernd


Bernd Fehling schrieb:

Went through the tutorial for nutch 0.8.
No further error messages.
All seems to run fine but where is the index?

Used a single URL to start with but searching for
any term from that site gives no results.
I guess there is no index at all?

Where to find a crawler log file?

Bernd

Stefan Groschupf schrieb:


Is it just a simple text file with one URL per line?



Yes








Re: nutch 0.8-devel and url redirect

2006-02-07 Thread Enrico Triolo
Thank you for your reply.

My crawl-urlfilter.txt file allows any URL, since I set only this rule:

+.

By the way, this is the same rule I set for the 0.7 version.


On 2/7/06, Raghavendra Prabhu <[EMAIL PROTECTED]> wrote:
>
> Check the URL filters in
>
> crawl-urlfilter.txt
>
> and see whether the rule is allowed, i.e. whether the link below
> matches a URL pattern in the crawl-urlfilter.txt file:
>
> http://punto-informatico.it
>
>
>
> On 2/7/06, Enrico Triolo <[EMAIL PROTECTED]> wrote:
> >
> > I'm switching to nutch-0.8 but I'm facing a problem with url redirects.
> > To let you understand better I'll explain my problem with a real
> example:
> >
> > I created an 'urls' directory and inside it I created an 'urls.txt' file
> > containing only this line: "http://www.punto-informatico.it".
> > If pointed to this url the webserver sends a 30x response redirecting
> > to "http://punto-informatico.it".
> >
> > If I run nutch 0.8 with this command:
> >
> > nutch urls/ -dir pi -depth 2 -threads 1
> >
> > it can't retrieve any page...
> >
> > I tried the same command with nutch-0.7 and it retrieved 41 pages.
> >
> > Is it an issue or am I missing something?
> >
> > Thanks,
> > Enrico
> >
> >
>
>


Re: nutch 0.8-devel and url redirect

2006-02-07 Thread Raghavendra Prabhu
Check the URL filters in

crawl-urlfilter.txt

and see whether the rule is allowed, i.e. whether the link below matches
a URL pattern in the crawl-urlfilter.txt file:

http://punto-informatico.it



On 2/7/06, Enrico Triolo <[EMAIL PROTECTED]> wrote:
>
> I'm switching to nutch-0.8 but I'm facing a problem with url redirects.
> To let you understand better I'll explain my problem with a real example:
>
> I created an 'urls' directory and inside it I created an 'urls.txt' file
> containing only this line: "http://www.punto-informatico.it".
> If pointed to this url the webserver sends a 30x response redirecting to
> "http://punto-informatico.it".
>
> If I run nutch 0.8 with this command:
>
> nutch urls/ -dir pi -depth 2 -threads 1
>
> it can't retrieve any page...
>
> I tried the same command with nutch-0.7 and it retrieved 41 pages.
>
> Is it an issue or am I missing something?
>
> Thanks,
> Enrico
>
>


nutch 0.8-devel and url redirect

2006-02-07 Thread Enrico Triolo
I'm switching to nutch-0.8 but I'm facing a problem with url redirects.
To let you understand better I'll explain my problem with a real example:

I created an 'urls' directory and inside it I created an 'urls.txt' file
containing only this line: "http://www.punto-informatico.it".
If pointed to this url the webserver sends a 30x response redirecting to
"http://punto-informatico.it".

If I run nutch 0.8 with this command:

nutch urls/ -dir pi -depth 2 -threads 1

it can't retrieve any page...

I tried the same command with nutch-0.7 and it retrieved 41 pages.

Is it an issue or am I missing something?

Thanks,
Enrico
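One configuration knob worth checking here (an assumption about the cause, not a confirmed fix): nutch-default.xml defines an http.redirect.max property that caps how many redirects a single fetch will follow. If it is 0 in your build, a 30x response ends the fetch rather than being followed immediately. An override in conf/nutch-site.xml might look like:

```xml
<property>
  <name>http.redirect.max</name>
  <!-- Number of redirects the fetcher may follow within one fetch;
       with 0 the redirect target is not followed in the same fetch. -->
  <value>3</value>
</property>
```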


Re: Plugins: directory not found: plugins

2006-02-07 Thread Raghavendra Prabhu
SET NUTCH_HOME=F:\nutch-0.7
SET LIB=%NUTCH_HOME%\lib
java -classpath %NUTCH_HOME%\conf\;%NUTCH_HOME%\nutch-0.7.jar
;%NUTCH_HOME%\plugins\;%LIB%\concurrent-1.3.4.jar;%LIB%\jakarta-
oro-2.0.7.jar;%LIB%\jetty-5.1.2.jar;%LIB%\junit-3.8.1.jar;%LIB%\lucene-
1.9-rc1-dev.jar;%LIB%\lucene-misc-1.9-rc1-dev.jar;%LIB%\servlet-api.jar
;%LIB%\taglibs-i18n.jar;%LIB%\xerces-2_6_2-apis.jar;%LIB%\xerces-2_6_2.jar
org.apache.nutch.tools.CrawlTool urls.txt -dir test.out -depth 2 -threads 3



In the above command, I guess it should be
;%NUTCH_HOME%\build\plugins\
instead.

Rgds
Prabhu
On 2/7/06, Saravanaraj Duraisamy <[EMAIL PROTECTED]> wrote:
>
> You build the application and you will get a build folder; add that
> folder to your classpath.
>
> On 2/7/06, 盖世豪侠 <[EMAIL PROTECTED]> wrote:
> >
> > Hi
> >
> > Do you mean I should create a dir called build and move dir plugins in?
> > It seems it doesn't work either
> >
> >
> > 2006/2/7, Saravanaraj Duraisamy <[EMAIL PROTECTED]>:
> > >
> > > Add build\plugins
> > > to your classpath
> > >
> > > On 2/7/06, 盖世豪侠 <[EMAIL PROTECTED]> wrote:
> > > >
> > > > I try to run nutch using command line and I've add the plugins dir
> to
> > > the
> > > > classpath.
> > > >
> > > > SET NUTCH_HOME=F:\nutch-0.7
> > > > SET LIB=%NUTCH_HOME%\lib
> > > > java -classpath %NUTCH_HOME%\conf\;%NUTCH_HOME%\nutch-0.7.jar
> > > > ;%NUTCH_HOME%\plugins\;%LIB%\concurrent-1.3.4.jar;%LIB%\jakarta-
> > > > oro-2.0.7.jar;%LIB%\jetty-5.1.2.jar;%LIB%\junit-3.8.1.jar
> > ;%LIB%\lucene-
> > > > 1.9-rc1-dev.jar;%LIB%\lucene-misc-1.9-rc1-dev.jar;%LIB%\servlet-
> > api.jar
> > > > ;%LIB%\taglibs-i18n.jar;%LIB%\xerces-2_6_2-apis.jar
> > > ;%LIB%\xerces-2_6_2.jar
> > > > org.apache.nutch.tools.CrawlTool urls.txt -dir test.out -depth 2
> > > -threads
> > > > 3
> > > >
> > > > But I get the following error:
> > > > 060207 141014 Plugins: directory not found: plugins
> > > >
> > > >
> > >
> > >
> >
> >
> > --
> >
> >
> 《盖世豪侠》 (The Final Combat) drew rave reviews and kept TVB's ratings riding high, yet TVB, pleased as it was, still did not give him major roles. Stephen Chow was hardly one to stay a small fish in a pond; with his comedic talent now evident, he would not accept being left out in the cold, so he turned to the film industry and shone on the big screen. TVB had gained a thousand-li horse only to lose it, and could only regret it too late.
> >
>


Re: Plugins: directory not found: plugins

2006-02-07 Thread Saravanaraj Duraisamy
You build the application and you will get a build folder; add that
folder to your classpath.

On 2/7/06, 盖世豪侠 <[EMAIL PROTECTED]> wrote:
>
> Hi
>
> Do you mean I should create a dir called build and move dir plugins in?
> It seems it doesn't work either
>
>
> 2006/2/7, Saravanaraj Duraisamy <[EMAIL PROTECTED]>:
> >
> > Add build\plugins
> > to your classpath
> >
> > On 2/7/06, 盖世豪侠 <[EMAIL PROTECTED]> wrote:
> > >
> > > I try to run nutch using command line and I've add the plugins dir to
> > the
> > > classpath.
> > >
> > > SET NUTCH_HOME=F:\nutch-0.7
> > > SET LIB=%NUTCH_HOME%\lib
> > > java -classpath %NUTCH_HOME%\conf\;%NUTCH_HOME%\nutch-0.7.jar
> > > ;%NUTCH_HOME%\plugins\;%LIB%\concurrent-1.3.4.jar;%LIB%\jakarta-
> > > oro-2.0.7.jar;%LIB%\jetty-5.1.2.jar;%LIB%\junit-3.8.1.jar
> ;%LIB%\lucene-
> > > 1.9-rc1-dev.jar;%LIB%\lucene-misc-1.9-rc1-dev.jar;%LIB%\servlet-
> api.jar
> > > ;%LIB%\taglibs-i18n.jar;%LIB%\xerces-2_6_2-apis.jar
> > ;%LIB%\xerces-2_6_2.jar
> > > org.apache.nutch.tools.CrawlTool urls.txt -dir test.out -depth 2
> > -threads
> > > 3
> > >
> > > But I get the following error:
> > > 060207 141014 Plugins: directory not found: plugins
> > >
> > >
> >
> >
>
>
> --
>
>


Re: Plugins: directory not found: plugins

2006-02-07 Thread Jack Tang
Please set "plugin.folders" (in nutch-default.xml or nutch-site.xml) to
the real directory the plugins were built into. Of course, you can use
an absolute path.
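A hedged example of such an override in conf/nutch-site.xml, assuming the plugins were built to F:\nutch-0.7\build\plugins as suggested elsewhere in this thread (adjust the path to your own layout):

```xml
<property>
  <name>plugin.folders</name>
  <!-- Directories to scan for plugins; an absolute path avoids
       depending on the working directory the JVM was started from. -->
  <value>F:/nutch-0.7/build/plugins</value>
</property>
```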

/Jack

On 2/7/06, 盖世豪侠 <[EMAIL PROTECTED]> wrote:
> Hi
>
> Do you mean I should create a dir called build and move dir plugins in?
> It seems it doesn't work either
>
>
> 2006/2/7, Saravanaraj Duraisamy <[EMAIL PROTECTED]>:
> >
> > Add build\plugins
> > to your classpath
> >
> > On 2/7/06, 盖世豪侠 <[EMAIL PROTECTED]> wrote:
> > >
> > > I try to run nutch using command line and I've add the plugins dir to
> > the
> > > classpath.
> > >
> > > SET NUTCH_HOME=F:\nutch-0.7
> > > SET LIB=%NUTCH_HOME%\lib
> > > java -classpath %NUTCH_HOME%\conf\;%NUTCH_HOME%\nutch-0.7.jar
> > > ;%NUTCH_HOME%\plugins\;%LIB%\concurrent-1.3.4.jar;%LIB%\jakarta-
> > > oro-2.0.7.jar;%LIB%\jetty-5.1.2.jar;%LIB%\junit-3.8.1.jar;%LIB%\lucene-
> > > 1.9-rc1-dev.jar;%LIB%\lucene-misc-1.9-rc1-dev.jar;%LIB%\servlet-api.jar
> > > ;%LIB%\taglibs-i18n.jar;%LIB%\xerces-2_6_2-apis.jar
> > ;%LIB%\xerces-2_6_2.jar
> > > org.apache.nutch.tools.CrawlTool urls.txt -dir test.out -depth 2
> > -threads
> > > 3
> > >
> > > But I get the following error:
> > > 060207 141014 Plugins: directory not found: plugins
> > >
> > >
> >
> >
>
>
> --
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: Plugins: directory not found: plugins

2006-02-07 Thread 盖世豪侠
Hi

Do you mean I should create a dir called build and move the plugins dir
into it?  That doesn't seem to work either.


2006/2/7, Saravanaraj Duraisamy <[EMAIL PROTECTED]>:
>
> Add build\plugins
> to your classpath
>
> On 2/7/06, 盖世豪侠 <[EMAIL PROTECTED]> wrote:
> >
> > I try to run nutch using command line and I've add the plugins dir to
> the
> > classpath.
> >
> > SET NUTCH_HOME=F:\nutch-0.7
> > SET LIB=%NUTCH_HOME%\lib
> > java -classpath %NUTCH_HOME%\conf\;%NUTCH_HOME%\nutch-0.7.jar
> > ;%NUTCH_HOME%\plugins\;%LIB%\concurrent-1.3.4.jar;%LIB%\jakarta-
> > oro-2.0.7.jar;%LIB%\jetty-5.1.2.jar;%LIB%\junit-3.8.1.jar;%LIB%\lucene-
> > 1.9-rc1-dev.jar;%LIB%\lucene-misc-1.9-rc1-dev.jar;%LIB%\servlet-api.jar
> > ;%LIB%\taglibs-i18n.jar;%LIB%\xerces-2_6_2-apis.jar
> ;%LIB%\xerces-2_6_2.jar
> > org.apache.nutch.tools.CrawlTool urls.txt -dir test.out -depth 2
> -threads
> > 3
> >
> > But I get the following error:
> > 060207 141014 Plugins: directory not found: plugins
> >
> >
>
>


--


Re: Installing nutch

2006-02-07 Thread Bernd Fehling

Hi Andy,
use svn command to download it.
The following command should be on a single line. So ignore
the line break after trunk but put a space after it.

svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk 
nutch_mapred_sources


It is hard to find a way through all of this nutch stuff and
also hard to get it somehow running. Too many changes everywhere.
No working tutorial. They just ripped MapReduce out of
Nutch and made a new separate Lucene sub-project from it.
The NDFS is now HDFS. And so on and so on...

I really would like to help and write a really working tutorial
but currently I would be happy to have it running and get a
single page indexed.

Regards,
Bernd

Andy Morris schrieb:

I see the contents; how can I download it?
andy


-Original Message-
From: Bernd Fehling [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 06, 2006 10:15 AM

To: nutch-user@lucene.apache.org
Subject: Re: Installing nutch

I used this one:
http://svn.apache.org/repos/asf/lucene/nutch/trunk

Has revision #374883 with age of 47 hours.

Regards,
Bernd


Andy Morris schrieb:


Where did you find nutch 0.8 version?
http://cvs.apache.org/dist/lucene/nutch/nightly/   ???

Thanks,
Andy

-Original Message-
From: Bernd Fehling [mailto:[EMAIL PROTECTED]
Sent: Monday, February 06, 2006 9:47 AM
To: nutch-user@lucene.apache.org
Subject: Re: Installing nutch

Went through the tutorial for nutch 0.8.
No further error messages.
All seems to run fine but where is the index?

Used a single URL to start with but searching for any term from that 
site gives no results.

I guess there is no index at all?

Where to find a crawler log file?

Bernd

Stefan Groschupf schrieb:



Is it just a simple text file with one URL per line?


Yes