Re: Query Error

2006-10-26 Thread Matthew Holt

Howie/Andrezj,
 Thanks a bunch for the quick response. It turns out that the analyzer 
was splitting the word on non-alphabetical characters. Unless there is a 
fix out any time soon, I'll keep it on my todo list to look into 
solving the issue as soon as my school work lightens up. Thanks again.
 Matt  


On 10/19/2006 07:07 PM, Howie Wang wrote:

Sorry I'm sending this privately, but my messages were getting bounced from
the list.
 
I think there's a problem indexing or querying words that have numbers 
in them.

I wouldn't be surprised if the analyzer splits the word when it sees a
non-alphabetical character in a word.
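A minimal way to check this (a sketch only, assuming the Lucene 1.9-era API bundled with Nutch 0.8; StandardAnalyzer below is just a stand-in for whatever analyzer your index actually uses) is to print the tokens produced for a string like "W3X Team":

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenDump {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer();   // substitute the analyzer you index with
    TokenStream ts = analyzer.tokenStream("content", new StringReader("W3X Team"));
    Token t;
    // If "W3X" comes out split into "w", "3", "x" (or dropped entirely),
    // the analyzer is breaking the word on the digit.
    while ((t = ts.next()) != null) {
      System.out.println(t.termText());
    }
  }
}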
 
I'm not sure if this is happening in the indexer or when parsing the query,
or both. Like Andrzej said, you can look at your index to see if your words
are being indexed as you want.
 
You can also print out the Query object in search.jsp and see if it's getting
parsed correctly. If this is your problem, you might have to write your own
query plugin.
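If it helps, one low-tech way to do that is a println in a scriptlet right after the query is parsed (the variable name "query" is an assumption; check what search.jsp calls its parsed org.apache.nutch.searcher.Query in your release):

// Inside a <% ... %> scriptlet in search.jsp, after the Query has been parsed.
System.out.println("Parsed query: " + query);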

Howie







[Fwd: Query Error?]

2006-10-19 Thread Matthew Holt
Does anyone have any ideas? Or is more clarification needed? This 
problem has been bugging me for a good while now...


Thanks,
  Matt
--- Begin Message ---

Hi all,
 I have a large index and am having several querying issues. Can anyone 
shed some light on why it's not returning the below expected results?


1.
   Query: W3X
   Expected results: Numerous pages containing the term W3X.
   Actual results: JPG images that do not even have the term W3X in 
their URL.


2.
   Query: W3X Team
   Expected results: A page entitled "W3X:Team - W3XWiki" which also 
contains the individual terms W3X and Team. Also numerous other pages 
with the terms should have been returned.

   Actual results: None.


3.
   Query: "W3X Team"
   Expected: A page entitled "W3X:Team - W3XWiki" which also contains 
the individual terms W3X and Team. Also numerous other pages with the 
terms should have been returned.
   Actual: 872 results with the expected page ("W3X:Team - W3XWiki") 
being the second result.


   What I'm really trying to figure out is why the general search didn't 
work but the quoted search did. The individual words both exist in 
numerous pages. Any ideas?

  Thanks,
Matt

--- End Message ---


Query Error?

2006-10-18 Thread Matthew Holt

Hi all,
 I have a large index and am having several querying issues. Can anyone 
shed some light on why it's not returning the below expected results?


1.
   Query: W3X
   Expected results: Numerous pages containing the term W3X.
   Actual results: JPG images that do not even have the term W3X in 
their URL.


2.
   Query: W3X Team
   Expected results: A page entitled "W3X:Team - W3XWiki" which also 
contains the individual terms W3X and Team. Also numerous other pages 
with the terms should have been returned.

   Actual results: None.


3.
   Query: "W3X Team"
   Expected: A page entitled "W3X:Team - W3XWiki" which also contains 
the individual terms W3X and Team. Also numerous other pages with the 
terms should have been returned.
   Actual: 872 results with the expected page ("W3X:Team - W3XWiki") 
being the second result.


   What I'm really trying to figure out is why the general search didn't 
work but the quoted search did. The individual words both exist in 
numerous pages. Any ideas?

  Thanks,
Matt


Re: Querying Fields

2006-08-09 Thread Matthew Holt
Ignore the last email. I ended up doing the same as Benjamin Higgins. 
Works great, use his email for reference if you are trying to accomplish 
the same thing.

Matt

Matthew Holt wrote:
Thanks for the reply. I've added the plugins you suggested. However, 
some of the plugins need to be modified to search for fields such as 
date (see previous email from Benjamin Higgins). I am currently 
modifying the query-basic DateQueryFilter.java so one is allowed to 
add query.date.boost to the nutch-site.xml to enable the date field 
search.


I'll try and post my results, or commit them.
Matt

Lukas Vlcek wrote:

Hi,

To allow more formats to be indexed you need to modify nutch-site.xml
and update/add plugin.includes property (see nutch-default.xml for
default settings). The following is what I have in nutch-site.xml:


<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
</property>




[parse-*] is used to parse various formats, [query-more] allows you to
use [type:] filter in nutch queries.

Regards,
Lukas

On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:

Hi Lukas and everybody!

Do you know which file in nutch 0.7.2 I should edit to add some
field to my index (i.e. file type - PDF, Word or HTML)?

On 8/8/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I am not sure if I can give you any useful hint, but the following is
> what once worked for me.
> Example of query: url:http date:20060801
>
> date: and type: options can be used in combination with url:
> The filter url:http should select all documents (unless you allowed the file
> or ftp protocols). A plain date or type filter selects nothing if used
> alone.
>
> And be sure you don't introduce any space between filter name and its
> value ([date: 20060801] is not the same as [date:20060801])
>
> Lukas
>
> On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> > Howie,
> >I inspected my index using Luke and 20060801 shows up several 
times
> > in the index. I'm unable to query pretty much any field. Several 
people
> > seem to be having the same problem. Does anyone know whats going 
on?

> >
> > This is one of the last things I have to resolve to have Nutch 
deployed
> > successfully at my organization. Unfortunately, Friday is my 
last day.

> > Can anyone offer any assistance??
> > Thanks,
> >   Matt
> >
> > Howie Wang wrote:
> > > I think that I have problems querying for numbers and
> > > words with digits in them. Now that I think of it, is it
> > > possible it has something to do with the stemming in
> > > either the query filter or indexing? In either case, I would
> > > print out the text that is being indexed and the phrases
> > > added to the query. You could also using luke to inspect
> > > your index and see whether 20060801 shows up anywhere.
> > >
> > > Howie
> > >
> > >> I tried looked for a page that had the date 20060801 and the 
text

> > >> "test" in the page. I tried the following:
> > >>
> > >> date: 20060801 test
> > >>
> > >> and
> > >>
> > >> date 20060721-20060803 test
> > >>
> > >> Neither worked, any ideas??
> > >>
> > >> Matt
> > >>
> > >> Matthew Holt wrote:
> > >>> Thanks Jake,
> > >>>   However, it seems to me that it makes most sense that a query
> > >>> should return all pages that match the query, instead of 
acting as a
> > >>> content filter. However, I know its something easy to 
suggest when

> > >>> you're not having to implement it, so just a suggestion.
> > >>>
> > >>> Matt
> > >>>
> > >>> Vanderdray, Jacob wrote:
> > >>>> Try querying with both the date and something you'd expect 
to find
> > >>>> in the content.  The field query filter is just a filter.  
It only
> > >>>> restricts your results to things that match the basic query 
and has

> > >>>> the contents you require in the field.  So if you query for
> > >>>> "date:2006080 text" you'll be searching for documents that 
contain
> > >>>> "text" in one of the default query fields and has the value 
2006080

> > >>>> in the date field.  Leaving out text in that example would
> > >>>> essentially be asking for nothing in the default fields and 
2006080


Re: Querying Fields

2006-08-09 Thread Matthew Holt
Thanks for the reply. I've added the plugins you suggested. However, 
some of the plugins need to be modified to search for fields such as 
date (see previous email from Benjamin Higgins). I am currently 
modifying the query-basic DateQueryFilter.java so one is allowed to add 
query.date.boost to the nutch-site.xml to enable the date field search.
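For reference, a rough sketch of the kind of filter change being described here. It assumes Nutch 0.8's RawFieldQueryFilter base class and a Hadoop-style Configuration with getFloat(); the property name query.date.boost comes from the paragraph above, but the class body is an illustration rather than the actual patch:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.RawFieldQueryFilter;

// Sketch only: translate "date:..." query clauses against the indexed
// date field, with a boost read from nutch-site.xml.
public class DateQueryFilter extends RawFieldQueryFilter {
  private Configuration conf;

  public DateQueryFilter() {
    super("date");                                   // field handled by this filter
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // query.date.boost is the property proposed above; 1.0f is an assumed default
    setBoost(conf.getFloat("query.date.boost", 1.0f));
  }

  public Configuration getConf() {
    return this.conf;
  }
}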


I'll try and post my results, or commit them.
Matt

Lukas Vlcek wrote:

Hi,

To allow more formats to be indexed you need to modify nutch-site.xml
and update/add plugin.includes property (see nutch-default.xml for
default settings). The following is what I have in nutch-site.xml:


<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
</property>




[parse-*] is used to parse various formats, [query-more] allows you to
use [type:] filter in nutch queries.

Regards,
Lukas

On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:

Hi Lukas and everybody!

Do you know which file in nutch 0.7.2 I should edit to add some field
to my index (i.e. file type - PDF, Word or HTML)?

On 8/8/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I am not sure if I can give you any useful hint, but the following is
> what once worked for me.
> Example of query: url:http date:20060801
>
> date: and type: options can be used in combination with url:
> The filter url:http should select all documents (unless you allowed the file
> or ftp protocols). A plain date or type filter selects nothing if used
> alone.
>
> And be sure you don't introduce any space between filter name and its
> value ([date: 20060801] is not the same as [date:20060801])
>
> Lukas
>
> On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> > Howie,
> >I inspected my index using Luke and 20060801 shows up several 
times
> > in the index. I'm unable to query pretty much any field. Several 
people

> > seem to be having the same problem. Does anyone know whats going on?
> >
> > This is one of the last things I have to resolve to have Nutch 
deployed
> > successfully at my organization. Unfortunately, Friday is my last 
day.

> > Can anyone offer any assistance??
> > Thanks,
> >   Matt
> >
> > Howie Wang wrote:
> > > I think that I have problems querying for numbers and
> > > words with digits in them. Now that I think of it, is it
> > > possible it has something to do with the stemming in
> > > either the query filter or indexing? In either case, I would
> > > print out the text that is being indexed and the phrases
> > > added to the query. You could also using luke to inspect
> > > your index and see whether 20060801 shows up anywhere.
> > >
> > > Howie
> > >
> > >> I tried looked for a page that had the date 20060801 and the text
> > >> "test" in the page. I tried the following:
> > >>
> > >> date: 20060801 test
> > >>
> > >> and
> > >>
> > >> date 20060721-20060803 test
> > >>
> > >> Neither worked, any ideas??
> > >>
> > >> Matt
> > >>
> > >> Matthew Holt wrote:
> > >>> Thanks Jake,
> > >>>   However, it seems to me that it makes most sense that a query
> > >>> should return all pages that match the query, instead of 
acting as a
> > >>> content filter. However, I know its something easy to suggest 
when

> > >>> you're not having to implement it, so just a suggestion.
> > >>>
> > >>> Matt
> > >>>
> > >>> Vanderdray, Jacob wrote:
> > >>>> Try querying with both the date and something you'd expect 
to find
> > >>>> in the content.  The field query filter is just a filter.  
It only
> > >>>> restricts your results to things that match the basic query 
and has

> > >>>> the contents you require in the field.  So if you query for
> > >>>> "date:2006080 text" you'll be searching for documents that 
contain
> > >>>> "text" in one of the default query fields and has the value 
2006080

> > >>>> in the date field.  Leaving out text in that example would
> > >>>> essentially be asking for nothing in the default fields and 
2006080

> > >>>> in the date field which is why it doesn't return any results.
> > >>>>
> > >>>> Hope that helps,
> > >>>> Jake.

Error in 0.8 regex-urlfilter.txt

2006-08-09 Thread Matthew Holt
I was doing a search and noticed that a 'png' file was indexed. I 
checked crawl-urlfilter.txt, and it had the following line preventing 
png files from being indexed:

-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

I then looked at regex-urlfilter.txt, and the line was similar, but 
lacked the 'png' definition. So apparently the recrawl was indexing the 
png files. The original regex-urlfilter.txt line is below:

-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

It needs to be modified in trunk to match the line from crawl-urlfilter.txt.
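For clarity, the regex-urlfilter.txt line with the missing alternative added, so that it matches the crawl-urlfilter.txt line quoted above:

-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$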

Matt



Re: [Fwd: Re: 0.8 Recrawl script updated]

2006-08-08 Thread Matthew Holt

Lukas,
 Not stupid at all. I was already experiencing some issues with the 
script due to tomcat not releasing its lock on some of the directories I 
was trying to delete. This isn't the most efficient solution, but I 
believe it to be the most stable.


Thanks for bringing the issues to my attention and keep me posted on the 
change. The job at which I'm using Nutch is about to end here on Friday, 
but I will try to join the nutch-user mailing list using my personal 
email address and keep up with development. If you don't hear from me 
and need me for any reason, my personal email address is mholtATelonDOTedu


Take care,
 Matt

Lukas Vlcek wrote:

Hi Matthew,

Thanks for your work on this! And I do apologize if my stupid
questions caused your discomfort. Anyway, I already started testing
former version of your script so I will look closer at your updated
version as well and will keep you posted.

As for the segment merging, the more I search/read on web about it the
more I think it should work as you expected, on the other hand I know
I can't be sure until I hack the source code (hacking nutch is still a
pain for me).

Thanks and regards!
Lukas

On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:

It's not needed.. you use the bin/nutch script to generate the initial
crawl..

details here:
http://lucene.apache.org/nutch/tutorial8.html#Intranet+Crawling

Fred Tyre wrote:
> First of all, thanks for the recrawl script.
> I believe it will save me a few headaches.
>
> Secondly, is there a reason that there isn't a crawl script posted 
on the

> FAQ?
>
> As far as I can tell, you could take your recrawl script and add in 
the

> following line after you setup the crawl subdirectories.
>$FT_NUTCH_BIN/nutch crawl urls -dir $crawl_dir -threads 2 -depth 
3 -topN

> 50
>
> Obviously, the threads, depth and topN could be parameters as well.
>
> Thanks again.
>
> -Original Message-
> From: Matthew Holt [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 08, 2006 2:00 PM
> To: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org
> Subject: Re: [Fwd: Re: 0.8 Recrawl script updated]
>
>
> Since it wasn't really clear whether my script approached the 
problem of

> deleting segments correctly, I refactored it so it generates the new
> number of segments, merges them into one, then deletes the "new"
> segments. Not as efficient disk space wise, but still removes a large
> number of the segments that are not being referenced by anything 
due to

> not being indexed yet.
>
> I reupdated the wiki. Unless there is any more clarification regarding
> the issue, hopefully I won't have to bombard your inbox with any more
> emails regarding this.
>
> Matt
>
> Lukas Vlcek wrote:
>
>> Hi again,
>>
>> I just found related discussion here:
>> http://www.nabble.com/NullPointException-tf2045994r1.html
>>
>> I think these guys are discussing similar problem and if I understood
>> the conclusion correctly then the only solution right now is to write
>> some code and test which segments are used in index and which are 
not.

>>
>> Regards,
>> Lukas
>>
>> On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>>
>>> Matthew,
>>>
>>> In fact I didn't realize you are doing merge stuff (sorry for that)
>>> but frankly I don't know how exactly merging works and if this
>>> strategy would work in the long time perspective and whether it is
>>> universal approach in all variability of cases which may occur 
during
>>> crawling (-topN, threads frozen, pages unavailable, crawling 
dies, ...

>>> etc), may be it is correct path. I would appreciate if anybody can
>>> answer this question precisely.
>>>
>>> Thanks,
>>> Lukas
>>>
>>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>>
>>>> If anyone doesnt mind taking a look...
>>>>
>>>>
>>>>
>>>> -- Forwarded message --
>>>> From: Matthew Holt <[EMAIL PROTECTED]>
>>>> To: nutch-user@lucene.apache.org
>>>> Date: Fri, 04 Aug 2006 10:07:57 -0400
>>>> Subject: Re: 0.8 Recrawl script updated
>>>> Lukas,
>>>>Thanks for your e-mail. I assumed I could drop the $depth 
number of

>>>> oldest segments because I first merged them all into one segment
>>>>
>>> (which
>>>
>>>> I don't drop). Am I incorrect in my assumption and can this cause
>>>> problems in the future? If so, then I'll go back to the original
>>>>
>>> version
>>>
>

Re: Querying Fields

2006-08-08 Thread Matthew Holt

Howie,
  I inspected my index using Luke and 20060801 shows up several times 
in the index. I'm unable to query pretty much any field. Several people 
seem to be having the same problem. Does anyone know what's going on?


This is one of the last things I have to resolve to have Nutch deployed 
successfully at my organization. Unfortunately, Friday is my last day. 
Can anyone offer any assistance??

Thanks,
 Matt

Howie Wang wrote:

I think that I have problems querying for numbers and
words with digits in them. Now that I think of it, is it
possible it has something to do with the stemming in
either the query filter or indexing? In either case, I would
print out the text that is being indexed and the phrases
added to the query. You could also use Luke to inspect
your index and see whether 20060801 shows up anywhere.
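A small sketch of doing that check without Luke (assuming the Lucene 1.9-era API bundled with Nutch 0.8; the index path "crawl/index" is an assumption, adjust it to your layout). It walks the term dictionary and prints every term stored in the date field:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class DumpDateTerms {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("crawl/index");
    // Position the enumeration at the first term of the "date" field.
    TermEnum terms = reader.terms(new Term("date", ""));
    while (terms.term() != null && "date".equals(terms.term().field())) {
      System.out.println(terms.term().text() + "  docFreq=" + terms.docFreq());
      if (!terms.next()) {
        break;
      }
    }
    terms.close();
    reader.close();
  }
}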

Howie

I tried looked for a page that had the date 20060801 and the text 
"test" in the page. I tried the following:


date: 20060801 test

and

date 20060721-20060803 test

Neither worked, any ideas??

Matt

Matthew Holt wrote:

Thanks Jake,
  However, it seems to me that it makes most sense that a query 
should return all pages that match the query, instead of acting as a 
content filter. However, I know its something easy to suggest when 
you're not having to implement it, so just a suggestion.


Matt

Vanderdray, Jacob wrote:
Try querying with both the date and something you'd expect to find 
in the content.  The field query filter is just a filter.  It only 
restricts your results to things that match the basic query and has 
the contents you require in the field.  So if you query for 
"date:2006080 text" you'll be searching for documents that contain 
"text" in one of the default query fields and has the value 2006080 
in the date field.  Leaving out text in that example would 
essentially be asking for nothing in the default fields and 2006080 
in the date field which is why it doesn't return any results.


Hope that helps,
Jake.


-Original Message-
From: Matthew Holt [mailto:[EMAIL PROTECTED]
Sent: Wed 8/2/2006 4:58 PM
To: nutch-user@lucene.apache.org
Subject: Querying Fields
 I am unable to query fields in my index in the method that has 
been suggested. I used Luke to examine my index and the following 
field types exist:
anchor, boost, content, contentLength, date, digest, host, 
lastModified, primaryType, segment, site, subType, title, type, url


However, when I do a search using one of the fields, followed by a 
colon, an incorrect result is returned. I used Luke to find the top 
term in the date field which is '20060801'. I then searched using 
the following query:

date: 20060801

Unfortunately, nothing was returned. The correct plugins are 
enabled, here is an excerpt from my nutch-site.xml:



<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>



Any ideas? I'm not the only one having the same problem, I saw an 
earlier mailing list post but couldn't find any resolve... Thanks,


   Matt












parse-oo plugin

2006-08-08 Thread Matthew Holt

Hey there,
 Hope all has been going well for you. I noticed a small issue with the 
parse-oo plugin. It parses the documents correctly; however, when you 
find an OpenOffice document as a result and click "cached", it returns 
a NullPointerException. I looked into it, and the line in 
cached.jsp that throws the NPE is below:


String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);

So apparently the parse-oo plugin does not store the CONTENT_TYPE of the 
document. I modified the code around line 100, changing:


   Outlink[] links = (Outlink[]) outlinks.toArray(new Outlink[outlinks.size()]);
   ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title, links, metadata);

   return new ParseImpl(text, parseData);

to:

   Outlink[] links = (Outlink[]) outlinks.toArray(new Outlink[outlinks.size()]);
   ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title, links,
       content.getMetadata(), metadata);

   parseData.setConf(this.conf);
   return new ParseImpl(text, parseData);

This fixes the problem of the cached.jsp throwing an exception, but 
instead it displays every document type as either [octet-stream] or 
[oleobject].


So it seems as if it's not interpreting the mime types correctly. Do you 
know how to fix both the cached.jsp issue and the mime-type issue 
concurrently??
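One possible direction (a sketch, not a tested fix; it assumes the Nutch 0.8 Metadata class offers set(String, String) and that Content.getContentType() returns the detected mime type) would be to stamp the content type into the parse metadata before the ParseData is built, so cached.jsp finds it:

// Hypothetical addition just before the ParseData is constructed in parse-oo;
// assumes these accessors exist with these signatures.
metadata.set(Metadata.CONTENT_TYPE, content.getContentType());

Outlink[] links = (Outlink[]) outlinks.toArray(new Outlink[outlinks.size()]);
ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title, links,
    content.getMetadata(), metadata);
parseData.setConf(this.conf);
return new ParseImpl(text, parseData);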

Thanks,
 Matt


Re: [Fwd: Re: 0.8 Recrawl script updated]

2006-08-08 Thread Matthew Holt
It's not needed; you use the bin/nutch script to generate the initial 
crawl.


details here: 
http://lucene.apache.org/nutch/tutorial8.html#Intranet+Crawling


Fred Tyre wrote:

First of all, thanks for the recrawl script.
I believe it will save me a few headaches.

Secondly, is there a reason that there isn't a crawl script posted on the
FAQ?

As far as I can tell, you could take your recrawl script and add in the
following line after you setup the crawl subdirectories.
   $FT_NUTCH_BIN/nutch crawl urls -dir $crawl_dir -threads 2 -depth 3 -topN
50

Obviously, the threads, depth and topN could be parameters as well.

Thanks again.

-Original Message-----
From: Matthew Holt [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 08, 2006 2:00 PM
To: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org
Subject: Re: [Fwd: Re: 0.8 Recrawl script updated]


Since it wasn't really clear whether my script approached the problem of
deleting segments correctly, I refactored it so it generates the new
number of segments, merges them into one, then deletes the "new"
segments. Not as efficient disk space wise, but still removes a large
number of the segments that are not being referenced by anything due to
not being indexed yet.

I reupdated the wiki. Unless there is any more clarification regarding
the issue, hopefully I won't have to bombard your inbox with any more
emails regarding this.

Matt

Lukas Vlcek wrote:
  

Hi again,

I just found related discussion here:
http://www.nabble.com/NullPointException-tf2045994r1.html

I think these guys are discussing a similar problem, and if I understood
the conclusion correctly, the only solution right now is to write
some code to test which segments are used in the index and which are not.

Regards,
Lukas

On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:


Matthew,

In fact I didn't realize you are doing the merge stuff (sorry for that),
but frankly I don't know exactly how merging works, whether this
strategy would work over the long term, and whether it is a
universal approach across all the variability of cases which may occur during
crawling (-topN, frozen threads, unavailable pages, crawling dying,
etc.); maybe it is the correct path. I would appreciate it if anybody could
answer this question precisely.

Thanks,
Lukas

On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
  

If anyone doesnt mind taking a look...



-- Forwarded message --
From: Matthew Holt <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Date: Fri, 04 Aug 2006 10:07:57 -0400
Subject: Re: 0.8 Recrawl script updated
Lukas,
   Thanks for your e-mail. I assumed I could drop the $depth number of
oldest segments because I first merged them all into one segment


(which
  

I don't drop). Am I incorrect in my assumption and can this cause
problems in the future? If so, then I'll go back to the original


version
  

of my script when I kept all the segments without merging. However, it
just seemed like if that is the case, it will be a problem after


enough
  

number of recrawls due to the large amount of segments being kept.

 Thanks,
  Matt

Lukas Vlcek wrote:


Hi Matthew,

I am surious about one thing. How do you know you can just drop
  

$depth
  

number of the most oldest segments in the end? I haven't studied
  

nutch
  

code regarding this topic yet but I thought that segment can be
dropped once you are sure that all its content is already crawled in
some newer segments (which should be checked somehow via some
function/script - which hasen't been yet implemented to my
  

knowledge).
  

Also I don't think this question has been discussed on dev/user
  

lists
  

in detail yet so I just wanted to ask you about your opinion. The
situation could get even more complicated if people add -topN
parameter into script (which can happen because some might prefer
crawling in ten smaller bunches over to two huge crawls due to
  

various
  

technical reasons).

Anyway, never mind if you don't want to bother about my silly
  

question
  

:-)

Regards,
Lukas

On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
  

Last email regarding this script. I found a bug in it that is


sporadic
  

(i think it only affected different setups). However, since it


would be
  

a problem sometimes, I refactored the script. I'd suggest you


redownload
  

the script if you are using it.

Matt

Matthew Holt wrote:


I'm currently pretty busy at work. If I have I'll do it later.

The version 0.8 recrawl script has a working version online
  

now. I
  

temporarily modified it on the website yesterday when I ran
  

into some
  

problems, but I further tested it and the

Re: [Fwd: Re: 0.8 Recrawl script updated]

2006-08-08 Thread Matthew Holt
Since it wasn't really clear whether my script approached the problem of 
deleting segments correctly, I refactored it so it generates the new 
number of segments, merges them into one, then deletes the "new" 
segments. Not as efficient disk space wise, but still removes a large 
number of the segments that are not being referenced by anything due to 
not being indexed yet.


I reupdated the wiki. Unless there is any more clarification regarding 
the issue, hopefully I won't have to bombard your inbox with any more 
emails regarding this.


Matt

Lukas Vlcek wrote:

Hi again,

I just found related discussion here:
http://www.nabble.com/NullPointException-tf2045994r1.html

I think these guys are discussing a similar problem, and if I understood
the conclusion correctly, the only solution right now is to write
some code to test which segments are used in the index and which are not.

Regards,
Lukas

On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:

Matthew,

In fact I didn't realize you are doing the merge stuff (sorry for that),
but frankly I don't know exactly how merging works, whether this
strategy would work over the long term, and whether it is a
universal approach across all the variability of cases which may occur during
crawling (-topN, frozen threads, unavailable pages, crawling dying,
etc.); maybe it is the correct path. I would appreciate it if anybody could
answer this question precisely.

Thanks,
Lukas

On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> If anyone doesnt mind taking a look...
>
>
>
> ------ Forwarded message --
> From: Matthew Holt <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
> Date: Fri, 04 Aug 2006 10:07:57 -0400
> Subject: Re: 0.8 Recrawl script updated
> Lukas,
>Thanks for your e-mail. I assumed I could drop the $depth number of
> oldest segments because I first merged them all into one segment 
(which

> I don't drop). Am I incorrect in my assumption and can this cause
> problems in the future? If so, then I'll go back to the original 
version

> of my script when I kept all the segments without merging. However, it
> just seemed like if that is the case, it will be a problem after 
enough

> number of recrawls due to the large amount of segments being kept.
>
>  Thanks,
>   Matt
>
> Lukas Vlcek wrote:
> > Hi Matthew,
> >
> > I am surious about one thing. How do you know you can just drop 
$depth
> > number of the most oldest segments in the end? I haven't studied 
nutch

> > code regarding this topic yet but I thought that segment can be
> > dropped once you are sure that all its content is already crawled in
> > some newer segments (which should be checked somehow via some
> > function/script - which hasen't been yet implemented to my 
knowledge).

> >
> > Also I don't think this question has been discussed on dev/user 
lists

> > in detail yet so I just wanted to ask you about your opinion. The
> > situation could get even more complicated if people add -topN
> > parameter into script (which can happen because some might prefer
> > crawling in ten smaller bunches over to two huge crawls due to 
various

> > technical reasons).
> >
> > Anyway, never mind if you don't want to bother about my silly 
question

> > :-)
> >
> > Regards,
> > Lukas
> >
> > On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >> Last email regarding this script. I found a bug in it that is 
sporadic
> >> (i think it only affected different setups). However, since it 
would be
> >> a problem sometimes, I refactored the script. I'd suggest you 
redownload

> >> the script if you are using it.
> >>
> >> Matt
> >>
> >> Matthew Holt wrote:
> >> > I'm currently pretty busy at work. If I have I'll do it later.
> >> >
> >> > The version 0.8 recrawl script has a working version online 
now. I
> >> > temporarily modified it on the website yesterday when I ran 
into some

> >> > problems, but I further tested it and the actual working code is
> >> > modified now. So if you got it off the web site any time 
yesterday, I

> >> > would redownload the script.
> >> >
> >> > Matt
> >> >
> >> > Lourival Júnior wrote:
> >> >> Hi Matthew!
> >> >>
> >> >> Could you update the script to the version 0.7.2 with the same
> >> >> functionalities? I write a scritp that do this, but it don't 
work

> >> very
> >> >> well...
> >> >>
> >> >> Regards!
> >> >>
> >> >> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >> >>>
> >> >>> Just letting everyone know that I updated the recrawl script 
on the

> >> >>> Wiki. It now merges the created segments them deletes the old
> >> segs to
> >> >>> prevent a lot of unneeded data remaining/growing on the hard 
drive.

> >> >>>   Matt
> >> >>>
> >> >>>
> >> >>>
> >> 
http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 


> >>
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >
>
>
>
>





boosting

2006-08-08 Thread Matthew Holt
So I am attempting to modify the boosts so more weight is put on the 
actual content within the documents (instead of url, title, host). In 
the servlet instance, I modified the query.phrase.boost and restarted 
tomcat.  However, none of my results have changed. Am I going about this 
right??


Thanks,
 Matt


Re: NullPointException

2006-08-07 Thread Matthew Holt
Couldn't you just merge all the segments into one, then reindex the one 
segment and delete the rest?


Or does reindexing not generate a fresh index, but just add the new 
indexed material to the old index (which still contains references to the 
old segments)?


Marko Bauhardt wrote:


Am 03.08.2006 um 18:52 schrieb Lourival Júnior:




My questions:
Why it occurs? How can I know which segments can be deleted?



You must know which segments are indexed. You cannot index all 
segments and then delete some of those segments afterwards.
The Indexer stores the name of the segment so that the searcher knows 
where the parsed text for a given indexed Term is. But the segment 
name is only stored; you cannot search on a segment name. You can, however, 
write some lines of code that print the segment names which are 
stored in the index. After that you know which segments are indexed. But I 
think this way is not really cool.
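For what it's worth, a rough sketch of those "few lines of code" (assuming the Lucene API bundled with Nutch 0.8, that the merged index lives at crawl/index, and that the segment name is stored in a field called "segment", as the Luke field lists elsewhere in this archive suggest):

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class ListIndexedSegments {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("crawl/index");   // assumed index path
    Set segments = new HashSet();
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;        // skip deleted (e.g. deduped) docs
      Document doc = reader.document(i);
      segments.add(doc.get("segment"));         // stored segment name per document
    }
    reader.close();
    // segments now holds every segment name still referenced by the index;
    // anything under the segments directory not in this set is a deletion candidate.
    System.out.println(segments);
  }
}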


Sorry
Marko





Re: Querying Fields

2006-08-04 Thread Matthew Holt
I tried looking for a page that had the date 20060801 and the text "test" 
in the page. I tried the following:


date: 20060801 test

and

date 20060721-20060803 test

Neither worked, any ideas??

Matt

Matthew Holt wrote:

Thanks Jake,
  However, it seems to me that it makes most sense that a query should 
return all pages that match the query, instead of acting as a content 
filter. However, I know its something easy to suggest when you're not 
having to implement it, so just a suggestion.


Matt

Vanderdray, Jacob wrote:
Try querying with both the date and something you'd expect to find in 
the content.  The field query filter is just a filter.  It only 
restricts your results to things that match the basic query and has 
the contents you require in the field.  So if you query for 
"date:2006080 text" you'll be searching for documents that contain 
"text" in one of the default query fields and has the value 2006080 
in the date field.  Leaving out text in that example would 
essentially be asking for nothing in the default fields and 2006080 
in the date field which is why it doesn't return any results.


Hope that helps,
Jake.


-Original Message-
From: Matthew Holt [mailto:[EMAIL PROTECTED]
Sent: Wed 8/2/2006 4:58 PM
To: nutch-user@lucene.apache.org
Subject: Querying Fields
 
I am unable to query fields in my index in the method that has been 
suggested. I used Luke to examine my index and the following field 
types exist:
anchor, boost, content, contentLength, date, digest, host, 
lastModified, primaryType, segment, site, subType, title, type, url


However, when I do a search using one of the fields, followed by a 
colon, an incorrect result is returned. I used Luke to find the top 
term in the date field which is '20060801'. I then searched using the 
following query:

date: 20060801

Unfortunately, nothing was returned. The correct plugins are enabled, 
here is an excerpt from my nutch-site.xml:



<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>



Any ideas? I'm not the only one having the same problem, I saw an 
earlier mailing list post but couldn't find any resolve... Thanks,


   Matt



  




Re: Querying Fields

2006-08-04 Thread Matthew Holt

Thanks Jake,
  However, it seems to me that it makes most sense that a query should 
return all pages that match the query, instead of acting as a content 
filter. However, I know it's something easy to suggest when you're not 
having to implement it, so just a suggestion.


Matt

Vanderdray, Jacob wrote:

Try querying with both the date and something you'd expect to find in the content.  The field query 
filter is just a filter.  It only restricts your results to things that match the basic query and 
has the contents you require in the field.  So if you query for "date:2006080 text" 
you'll be searching for documents that contain "text" in one of the default query fields 
and has the value 2006080 in the date field.  Leaving out text in that example would essentially be 
asking for nothing in the default fields and 2006080 in the date field which is why it doesn't 
return any results.

Hope that helps,
Jake.


-Original Message-
From: Matthew Holt [mailto:[EMAIL PROTECTED]
Sent: Wed 8/2/2006 4:58 PM
To: nutch-user@lucene.apache.org
Subject: Querying Fields
 
I am unable to query fields in my index in the method that has been 
suggested. I used Luke to examine my index and the following field types 
exist:
anchor, boost, content, contentLength, date, digest, host, lastModified, 
primaryType, segment, site, subType, title, type, url


However, when I do a search using one of the fields, followed by a 
colon, an incorrect result is returned. I used Luke to find the top term 
in the date field which is '20060801'. I then searched using the 
following query:

date: 20060801

Unfortunately, nothing was returned. The correct plugins are enabled, 
here is an excerpt from my nutch-site.xml:



<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>



Any ideas? I'm not the only one having the same problem, I saw an 
earlier mailing list post but couldn't find any resolve... Thanks,


   Matt



  


Re: 0.8 Recrawl script updated

2006-08-04 Thread Matthew Holt

Lukas,
  Thanks for your e-mail. I assumed I could drop the $depth number of 
oldest segments because I first merged them all into one segment (which 
I don't drop). Am I incorrect in my assumption and can this cause 
problems in the future? If so, then I'll go back to the original version 
of my script, where I kept all the segments without merging. However, it 
just seems like, if that is the case, it will become a problem after enough 
recrawls due to the large number of segments being kept.


Thanks,
 Matt

Lukas Vlcek wrote:

Hi Matthew,

I am curious about one thing. How do you know you can just drop the $depth
oldest segments at the end? I haven't studied the nutch
code regarding this topic yet, but I thought that a segment can be
dropped once you are sure that all its content is already crawled in
some newer segments (which should be checked somehow via some
function/script - which hasn't been implemented yet, to my knowledge).

Also I don't think this question has been discussed on dev/user lists
in detail yet so I just wanted to ask you about your opinion. The
situation could get even more complicated if people add a -topN
parameter to the script (which can happen because some might prefer
crawling in ten smaller batches rather than two huge crawls, due to various
technical reasons).

Anyway, never mind if you don't want to bother about my silly question 
:-)


Regards,
Lukas

On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:

Last email regarding this script. I found a bug in it that is sporadic
(i think it only affected different setups). However, since it would be
a problem sometimes, I refactored the script. I'd suggest you redownload
the script if you are using it.

Matt

Matthew Holt wrote:
> I'm currently pretty busy at work. If I have I'll do it later.
>
> The version 0.8 recrawl script has a working version online now. I
> temporarily modified it on the website yesterday when I ran into some
> problems, but I further tested it and the actual working code is
> modified now. So if you got it off the web site any time yesterday, I
> would redownload the script.
>
> Matt
>
> Lourival Júnior wrote:
>> Hi Matthew!
>>
>> Could you update the script to the version 0.7.2 with the same
>> functionalities? I write a scritp that do this, but it don't work 
very

>> well...
>>
>> Regards!
>>
>> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>>
>>> Just letting everyone know that I updated the recrawl script on the
>>> Wiki. It now merges the created segments them deletes the old 
segs to

>>> prevent a lot of unneeded data remaining/growing on the hard drive.
>>>   Matt
>>>
>>>
>>> 
http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 


>>>
>>>
>>
>>
>>
>





Re: 0.8 Recrawl script updated

2006-08-03 Thread Matthew Holt
Last email regarding this script. I found a bug in it that is sporadic 
(I think it only affected certain setups). However, since it would 
sometimes be a problem, I refactored the script. I'd suggest you redownload 
the script if you are using it.


Matt

Matthew Holt wrote:

I'm currently pretty busy at work. If I have time I'll do it later.

The version 0.8 recrawl script has a working version online now. I 
temporarily modified it on the website yesterday when I ran into some 
problems, but I further tested it and the actual working code is 
modified now. So if you got it off the web site any time yesterday, I 
would redownload the script.


Matt

Lourival Júnior wrote:

Hi Matthew!

Could you update the script to version 0.7.2 with the same
functionality? I wrote a script that does this, but it doesn't work very
well...

Regards!

On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:


Just letting everyone know that I updated the recrawl script on the
Wiki. It now merges the created segments then deletes the old segs to
prevent a lot of unneeded data remaining/growing on the hard drive.
  Matt


http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 











Re: 0.8 Recrawl script updated

2006-08-03 Thread Matthew Holt

I'm currently pretty busy at work. If I have time I'll do it later.

The version 0.8 recrawl script has a working version online now. I 
temporarily modified it on the website yesterday when I ran into some 
problems, but I further tested it and the actual working code is 
modified now. So if you got it off the web site any time yesterday, I 
would redownload the script.


Matt

Lourival Júnior wrote:

Hi Matthew!

Could you update the script to version 0.7.2 with the same
functionality? I wrote a script that does this, but it doesn't work very
well...

Regards!

On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:


Just letting everyone know that I updated the recrawl script on the
Wiki. It now merges the created segments then deletes the old segs to
prevent a lot of unneeded data remaining/growing on the hard drive.
  Matt


http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 









0.8 Recrawl script updated

2006-08-02 Thread Matthew Holt
Just letting everyone know that I updated the recrawl script on the 
Wiki. It now merges the created segments, then deletes the old segments, to 
prevent a lot of unneeded data remaining/growing on the hard drive.

 Matt

http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03


Querying Fields

2006-08-02 Thread Matthew Holt
I am unable to query fields in my index using the method that has been 
suggested. I used Luke to examine my index and the following field types 
exist:
anchor, boost, content, contentLength, date, digest, host, lastModified, 
primaryType, segment, site, subType, title, type, url


However, when I do a search using one of the fields, followed by a 
colon, an incorrect result is returned. I used Luke to find the top term 
in the date field which is '20060801'. I then searched using the 
following query:

date: 20060801

Unfortunately, nothing was returned. The correct plugins are enabled, 
here is an excerpt from my nutch-site.xml:



<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>



Any ideas? I'm not the only one having this problem; I saw an 
earlier mailing list post but couldn't find any resolution... Thanks,


  Matt




Re: Questions about (re)crawling

2006-08-01 Thread Matthew Holt
If you look at my script, the way I accomplished this without 
restarting Tomcat was by touching the web.xml file within the Nutch 
servlet directory.


Benjamin Higgins wrote:

On 8/1/06, Matthew Holt <[EMAIL PROTECTED]> wrote:



> Also, can Nutch still service search queries while it is going through
> the
> whole generate, fetch, updatedb, index, dedup process?  At what point
> do the
> new segments become searchable -- right after indexing?
>
Yes.. however the new segments only become searchable once the recrawl
process is done (indexes are merged i believe) and the tomcat instance
reloads the webapp.



Ah - so Tomcat *must* be reloaded?  I was hoping to avoid that.  Oh well.

Ben



Re: Questions about (re)crawling

2006-08-01 Thread Matthew Holt



Benjamin Higgins wrote:

Hello,

Is it neccesary or desirable to run updatesegs and/or merge for a
single-machine setup that crawls ~1 million pages?



Merging you do need, I know; you have to sync the segments with the crawldb, 
if that is what you are referring to.
I ask because it appears that the 'crawl' tool, specifically for 
intranets,
runs these commands, but they aren't included in the whole-web 
instructions

in the tutorial.

Also, can Nutch still service search queries while it is going through 
the
whole generate, fetch, updatedb, index, dedup process?  At what point 
do the

new segments become searchable -- right after indexing?

Yes; however, the new segments only become searchable once the recrawl 
process is done (indexes are merged, I believe) and the Tomcat instance 
reloads the webapp.



Thanks!

Ben



Re: 0.8 much slower than 0.7

2006-07-31 Thread Matthew Holt
The fetcher, for one, and the mapreduce takes forever... i.e. the mapreduce is 
kind of annoying. Is it possible to disable it if I'm not running on a 
DFS?

Matt

06/07/25 20:59:12 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:14 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:19 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:23 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:29 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:33 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:34 INFO mapred.JobClient:  map 100%  reduce 96%
06/07/25 20:59:40 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:41 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:42 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:47 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:48 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:52 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:53 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:00:05 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:00:22 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:00:29 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:00:39 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:01:07 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:01:08 INFO mapred.JobClient:  map 100%  reduce 97%
06/07/25 21:01:16 INFO mapred.LocalJobRunner: reduce > reduce


Sami Siren wrote:
Are you experiencing slowness in general or just in some parts of the 
process?

The current fetcher is dead slow and it should be given immediate 
attention. There has been some talk about the issue but I haven't seen 
any code yet.


--
 Sami Siren

Matthew Holt wrote:
I agree. Is there anyway to disable something to speed it up? IE is 
the map reduce currently needed if we're not on a DFS?


Matt

Vasja Ocvirk wrote:


Hello,

I'm wondering if anyone can help. We injected 1000 seed URLs into 
Nutch 0.7.2 (basic configuration + 1000 URLs in regexp filter) and 
it processed them in just few hours. We just switched to 0.8 with 
same configuration, same URLs, but it seems everything slowed down 
significantly. Crawl script has 60 threads -- same as before but now 
it works much slower.


Thanks!

Best,
Vasja













[Fwd: Recrawling... methodology?]

2006-07-31 Thread Matthew Holt
Can anyone offer any insight into this? If I am correct and the recrawl 
script is currently not working properly, I will update the script and 
make it available to the community. Thanks..

Matt
--- Begin Message ---
I need some  help clarifying if recrawling is doing exactly what I think 
it is.  Here's the current scenario of how I think a recrawl should work:


I crawl my intranet with a depth of 2. Later, I recrawl using the script 
found below: 
http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 



In my recrawl, I also specify a depth of 2. It re-fetches each of the 
pages crawled before and, if they have changed, updates the pages' content. If they 
have changed and new links exist, the links are followed to a maximum 
depth of 2.


This is how I think a typical recrawl should work. However, when I 
recrawl using the script linked to above, tons of new pages are indexed, 
whether they have changed or not. It seems as if I crawl the content 
with a depth of 2, and then come back and recrawl with a depth of 2, it 
really adds a couple of crawl depth levels and the outcome is that I 
have done a crawl with a depth of 4 (instead of crawl with a depth of 2 
and then just a recrawl to catch any new pages).


The current steps of the recrawl are as follows:
for (how many depth levels specified)

$nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays
segment=`ls -d $segments_dir/* | tail -1`
$nutch_dir/nutch fetch $segment
$nutch_dir/nutch updatedb $webdb_dir $segment

invertlinks
index
dedup
merge

Basically what made me wonder is that it took me 2 minutes to do the 
crawl, while the recrawl (same depth levels specified) has taken me over 3 
hours and is still going. After I recrawl once, I believe it then 
speeds up.


Thanks for any feedback you can offer,
Matt




--- End Message ---


Re: stemming - RESOLVED

2006-07-31 Thread Matthew Holt

We could, although other than readability, it won't make any difference.

[EMAIL PROTECTED] wrote:

Hi, Matthew

I think we should use fieldName instead of field, or not...

===stemming code begin===

public TokenStream tokenStream(String field, Reader reader) {
    Analyzer analyzer;
    if ("anchor".equals(field)) {
        analyzer = ANCHOR_ANALYZER;
    }
    else {
        analyzer = CONTENT_ANALYZER;
    }

    TokenStream ts = analyzer.tokenStream(field, reader);
    if (field.equals("content") || field.equals("title")) {
        ts = new LowerCaseFilter(ts);
        return new PorterStemFilter(ts);
    }
    else {
        return ts;
    }
}

===stemming code end===

P.S. This patch doesn't have any effect on the Russian language.

Regards,
Alexey

--

Howie,
   Thanks for all the help configuring your stemming addon for version 
0.8. I compared query-basic and query-stemmer and the only new feature 
that was added is a "host" boost. I made the changes and everything 
works perfectly.


I uploaded the code to the wiki for both version 0.7.2 and 0.8. You can 
access it at the below URL..


http://wiki.apache.org/nutch/FAQ#head-fa0c678473eeecf3771e490b22d385054697232c

Take care,
  Matt

Howie Wang wrote:
  

Hi, Matt,

In 0.7, you wouldn't miss anything. That code was written to
replace the basic query filter, and handled all the fields that
basic query filter was handling. For 0.8, I'm really not sure.
I'm guessing the code is fairly simple still in 0.8. You can probably
figure out if query-basic in 0.8 is doing something appreciably different
than query-stemmer by just visually comparing the files.

Howie



Howie,
 The query-stemmer works great as long as query-basic is not enabled. 
However, if I don't have query-basic enabled, won't I be missing some 
needed functionality?

 Matt

Howie Wang wrote:
  

Hi,

The settings look reasonable. But for testing purposes, I would get 
rid of

the other query filters and put in some print statements in the
query-stemmer to see what's happening.

Howie



In my nutch-site.xml I overrode the plugin.includes property as below:


<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>



However, it is still only letting me search for the stemmed term 
(e.g. "Interview" returns results but "interviewed" doesn't, even 
though that's the word that's actually on the page).


I tried a different approach and removed the query-stemmer value 
from nutch-site.xml to attempt to disable the plugin. I reran the 
crawl and it didn't load the plugin. However, it still had the same 
stemming functionality. I'm guessing this is due to editing the 
main files such as CommonGrams.java and NutchDocumentAnalyzer.java. 
Should I attempt to copy the needed methods into 
StemmerQueryFilter.java and try to isolate all functionality to the 
plugin alone?


Thanks,
   Matt

Howie Wang wrote:
  

It sounds like the query-stemmer is not being called.
The query string "interviews" needs to be processed
into "interview". Are you sure that your nutch-default.xml
is including the query-stemmer correctly? Put print statements
in to see if it's getting there.
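As a tiny illustration of what "processed into" means here, a sketch using the Lucene analysis classes Nutch already depends on (the plain WhitespaceTokenizer stands in for the real query-time analyzer, which is an assumption):

import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class StemQueryTerm {
  public static void main(String[] args) throws Exception {
    // Run the raw query word through the same lowercase + Porter stem chain
    // used at index time, so "interviews" becomes "interview" before searching.
    TokenStream ts = new PorterStemFilter(
        new LowerCaseFilter(
            new WhitespaceTokenizer(new StringReader("interviews"))));
    Token t;
    while ((t = ts.next()) != null) {
      System.out.println(t.termText());   // prints "interview"
    }
  }
}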

By the way, someone recently told me that they
were able to put all the stemming code into an indexing
filter without touching any of the main code. All they
did was to copy some of the code that is being done
in NutchDocumentAnalyzer and CommonGrams into
their custom index filter. Haven't tried it myself.

HTH
Howie


Ok. I did this for Nutch 0.8 (had to edit the listed code some to 
make up for changes from .7.2 to .8 - mostly having to do with 
the Configuration type being needed).


It partially works.

If the page I'm trying to index contains the word "interviews" 
and I type in the search engine "interview", the stemming takes 
place and the page with the word "interviews" is returned.
However, if I type in the word "interviews" no page is returned. 
(The page with the word interviews on it should be returned).


Any ideas??
Matt

Dima Mazmanov wrote:
  

Hi, .

I've gotten a couple of questions offlist about stemming
so I thought I'd just post here with my changes. Sorry that
some of the changes are in the main code and not in a plugin. It
seemed that it's more efficient to put in the main analyzer. It
would be nice if later releases could add support for plugging
in a custom stemmer/analyzer.

The first change I made is in NutchDocumen

Recrawling... methodology?

2006-07-28 Thread Matthew Holt
I need some  help clarifying if recrawling is doing exactly what I think 
it is.  Here's the current scenario of how I think a recrawl should work:


I crawl my intranet with a depth of 2. Later, I recrawl using the script 
found below: 
http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 



In my recrawl, I also specify a depth of 2. It re-fetches each of the 
pages crawled before and, if they have changed, updates the pages' content. If they 
have changed and new links exist, the links are followed to a maximum 
depth of 2.


This is how I think a typical recrawl should work. However, when I 
recrawl using the script linked to above, tons of new pages are indexed, 
whether they have changed or not. It seems as if I crawl the content 
with a depth of 2, and then come back and recrawl with a depth of 2, it 
really adds a couple of crawl depth levels and the outcome is that I 
have done a crawl with a depth of 4 (instead of crawl with a depth of 2 
and then just a recrawl to catch any new pages).


The current steps of the recrawl are as follows:
for (how many depth levels specified)

$nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays
segment=`ls -d $segments_dir/* | tail -1`
$nutch_dir/nutch fetch $segment
$nutch_dir/nutch updatedb $webdb_dir $segment

invertlinks
index
dedup
merge

Basically, what made me wonder is that it took me 2 minutes to do the 
crawl, while the recrawl has taken over 3 hours and is still going 
(same depth levels specified). After I recrawl once, I believe it then 
speeds up.


Thanks for any feedback you can offer,
Matt
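
One way to check whether a recrawl is really expanding the crawl rather than 
just refreshing it is to compare the crawldb statistics before and after. A 
rough sketch (assuming the crawl lives in ./crawl and this is run from the 
Nutch 0.8 install directory; the exact stats output differs between builds):

# snapshot the crawldb stats before the recrawl
bin/nutch readdb crawl/crawldb -stats > stats-before.txt

# ... run the recrawl script here ...

# snapshot again and compare; a big jump in the total number of urls means
# new pages were discovered and followed, i.e. the effective depth grew
bin/nutch readdb crawl/crawldb -stats > stats-after.txt
diff stats-before.txt stats-after.txt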





Re: 0.8 much slower than 0.7

2006-07-28 Thread Matthew Holt
I agree. Is there any way to disable something to speed it up? I.e., is the 
MapReduce layer currently needed if we're not on a DFS?


Matt

Vasja Ocvirk wrote:

Hello,

I'm wondering if anyone can help. We injected 1000 seed URLs into 
Nutch 0.7.2 (basic configuration + 1000 URLs in regexp filter) and it 
processed them in just few hours. We just switched to 0.8 with same 
configuration, same URLs, but it seems everything slowed down 
significantly. Crawl script has 60 threads -- same as before but now 
it works much slower.


Thanks!

Best,
Vasja

__ NOD32 1.1533 (20060512) Information __

This message was checked by NOD32 antivirus system.
http://www.eset.com
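
For what it's worth, in 0.8 the MapReduce layer can't really be switched off; 
without a DFS it just runs in-process through the LocalJobRunner (the default 
mapred.job.tracker value of "local"). Two knobs that are sometimes tuned when 
local crawls feel slow are the fetcher thread count and the per-server delay. 
A sketch only - the property names are the stock 0.8 ones, the values are just 
examples, and lowering the delay hits the crawled servers harder:

# write a minimal conf/nutch-site.xml overlay (overwrites existing overrides)
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>60</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>
</configuration>
EOF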







Re: missing, but declared functionality

2006-07-28 Thread Matthew Holt
I'm having similar problems. Luke says a search of date:20051125 works 
and gives me results within Luke. However, when I try the same search in 
Nutch, nothing comes back. Does Nutch handle query searches differently? 
Or, to phrase my question better, how should I be searching based on 
dates in Nutch (the proper query plugins are enabled)?


Thanks,
Matt
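
A quick way to take the web app out of the picture is to run the same query 
through the NutchBean from the command line; if that returns hits, the index 
is fine and the problem is on the servlet side. A rough sketch (assuming 
index-more/query-more are in plugin.includes; NutchBean reads searcher.dir 
from the config, which defaults to "crawl", so run it from the directory 
that contains the crawl folder or set searcher.dir accordingly):

# query the index directly, bypassing Tomcat and search.jsp
bin/nutch org.apache.nutch.searcher.NutchBean "date:20051125"

# compare with a plain term query that is known to return hits
bin/nutch org.apache.nutch.searcher.NutchBean someknownterm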

Tomi NA wrote:

Sorry for the long silence and thanks for the help.
I've found the plugins you mentioned and set up nutch to use them. The
result is somewhat confusing, though. For one thing, my date: and
type: queries still returned no results. Weirder still, using luke to
inspect the index contents, I saw the new fields, luke would display
the top ranking terms by both "date" and "type" fields, a search like
"date:20051030" would yield dozens of results, but the "string value"
of the "date" and "type" fields was not availableeven thought I
found the documents in question using that exact field as a key.

I'll see what I come up with using 0.8 as I need the .xls and .zip
support, anyway.

t.n.a.

On 7/20/06, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote:

You'd have to enable index-more and query-more plugins, I believe.

> -Original Message-
> From: Tomi NA [mailto:[EMAIL PROTECTED]
> Sent: 2006-7-19 10:01
> To: nutch-user@lucene.apache.org
> Subject: missing, but declared functionality
>
> These kinds of queries return no results:
>
> date:19980101-20061231
> type:pdf
> type:application/pdf
>
> From the release changes documents (0.7-0.7.2), I assumed
> these would work.
> Upon index inspection (using the luke tool), I see there are no fields
> marked "date" or "type" (althought I gather this is interpreted as
> url:*.pdf). The fields I have are:
> anchor
> boost
> content
> digest
> docNo
> host
> segment
> site
> title
> url
>
> I ran the index process with very little special configuration - some
> filetype filtering and the like.
> Am I missing something?
> The files are served over a samba share: I plan to serve them through
> a web server because of security implications of using the file://
> protocol. Can the creation and last modification date be retrieved
> over http:// at all?
>
> TIA,
> t.n.a.
>





Re: Starting Nutch in init.d?

2006-07-28 Thread Matthew Holt
You don't need to cd to the nutch directory for the startup script. All 
you need to do is edit the nutch-site.xml that is found within the nutch 
servlet and include a "searcher directory" property that tells tomcat 
where to look for the crawl db.


So if you have nutch 0.8, edit the file 
TOMCAT_PATH/webapps/NUTCH_DIR/WEB-INF/classes/nutch-site.xml and include 
the following:



 <property>
   <name>searcher.dir</name>
   <value>/your_index_folder_path</value>
 </property>


I believe the "your_index_folder_path" is the path to your crawl 
directory.  However, if that doesn't work, make it the path to the index 
folder within your crawl directory.


Now, save that and make sure your script just starts tomcat on init and 
everything should work fine for you.


Matt


Bill Goffe wrote:

I'd like to start Nutch automatically when I reboot. I wrote a real rough
script (see below) that works on my Debian system when the system is up,
but I get nothing on a reboot (and the links are set to the
/etc/init.d/nutch).  Any hints, ideas, or suggestions? I checked the FAQ
and the archive but didn't see anything. In addition, it would be great to
get messages going into /var/log to help figure out what is going on but
I've had no luck doing that.

Thanks,

   Bill

## Start and stop Nutch. Note how specific it is to
## (i) Tomcat (typically $CATALINA_HOME/bin/shutdown.sh
## or $CATALINA_HOME/bin/startup.sh) and (ii) the
## directory with the most recent fetch results.

## PATH stuff
PATH="/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games"
PATH=$PATH:/usr/local/SUNWappserver/bin
CLASSPATH=/usr/local/SUNWappserver/jdk/jre/lib
JAVA_HOME=/usr/local/SUNWappserver/jdk
CATALINA_HOME=/usr/local/jakarta-tomcat-5
JAVA_OPTS="-Xmx1024m -Xms512m"

case "$1" in
start)
  cd /home/bgoffe/nc/40  ## start in correct directory
  /usr/local/jakarta-tomcat-5/bin/startup.sh
  ;;

stop)
 /usr/local/jakarta-tomcat-5/bin/shutdown.sh
 ;;

force-reload|restart)
  /usr/local/jakarta-tomcat-5/bin/shutdown.sh
  cd /home/bgoffe/nc/40
  /usr/local/jakarta-tomcat-5/bin/startup.sh
  ;;

*)
echo "Usage: /etc/init.d/nutch {start|stop|force-reload|restart}"
exit 1
;;

esac

exit 0
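
On the /var/log side question: redirecting the startup output is often enough 
to see what happens at boot. A rough sketch of the start) branch with logging 
added (the log file name is just an example; Tomcat's own output still goes 
to $CATALINA_HOME/logs/catalina.out):

start)
  cd /home/bgoffe/nc/40  ## start in correct directory
  /usr/local/jakarta-tomcat-5/bin/startup.sh >> /var/log/nutch-tomcat.log 2>&1
  ;;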

  


Re: stemming - RESOLVED

2006-07-28 Thread Matthew Holt

Howie,
  Thanks for all the help configuring your stemming addon for version 
0.8. I compared query-basic and query-stemmer and the only new feature 
that was added is a "host" boost. I made the changes and everything 
works perfect.


I uploaded the code to the wiki for both version 0.7.2 and 0.8. You can 
access it at the below URL..


http://wiki.apache.org/nutch/FAQ#head-fa0c678473eeecf3771e490b22d385054697232c

Take care,
 Matt

Howie Wang wrote:

Hi, Matt,

In 0.7, you wouldn't miss anything. That code was written to
replace the basic query filter, and handled all the fields that
basic query filter was handling. For 0.8, I'm really not sure.
I'm guessing the code is fairly simple still in 0.8. You can probably
figure out if query-basic in 0.8 is doing something appreciably different
than query-stemmer by just visually comparing the files.

Howie


Howie,
 The query-stemmer works great as long as query-basic is not enabled. 
However, if I don't have query-basic enabled, won't I be missing some 
needed functionality?

 Matt

Howie Wang wrote:

Hi,

The settings look reasonable. But for testing purposes, I would get 
rid of

the other query filters and put in some print statements in the
query-stemmer to see what's happening.

Howie


In my nutch-site.xml I overrode the plugin.includes property as below:


 <property>
   <name>plugin.includes</name>
   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins.
   </description>
 </property>



However, it is still only letting me search for the stemmed term 
(IE "Interview" returns results but "interviewed" doesnt, even 
though thats the word thats actually on the page).


I tried a different approach and removed the query-stemmer value 
from nutch-site.xml to attempt to disable the plugin. I reran the 
crawl and it didn't load the plugin. However, it still had the same 
stemming functionality. I'm guessing this is due to editing the 
main files such as CommonGrams.java and NutchDocumentAnalyzer.java. 
Should I attempt to copy the needed methods into 
StemmerQueryFilter.java and try to isolate all functionality to the 
plugin alone?


Thanks,
   Matt

Howie Wang wrote:

It sounds like the query-stemmer is not being called.
The query string "interviews" needs to be processed
into "interview". Are you sure that your nutch-default.xml
is including the query-stemmer correctly? Put print statements
in to see if it's getting there.

By the way, someone recently told me that they
were able to put all the stemming code into an indexing
filter without touching any of the main code. All they
did was to copy some of the code that is being done
in NutchDocumentAnalyzer and CommonGrams into
their custom index filter. Haven't tried it myself.

HTH
Howie

Ok. I did this for Nutch 0.8 (had to edit the listed code some to 
make up for changes from .7.2 to .8 - mostly having to do with 
the Configuration type being needed).


It partially works.

If the page I'm trying to index contains the word "interviews" 
and I type in the search engine "interview", the stemming takes 
place and the page with the word "interviews" is returned.
However, if I type in the word "interviews" no page is returned. 
(The page with the word interviews on it should be returned).


Any ideas??
Matt

Dima Mazmanov wrote:

Hi, .

I've gotten a couple of questions offlist about stemming
so I thought I'd just post here with my changes. Sorry that
some of the changes are in the main code and not in a plugin. It
seemed that it's more efficient to put in the main analyzer. It
would be nice if later releases could add support for plugging
in a custom stemmer/analyzer.

The first change I made is in NutchDocumentAnalyzer.java.

Import the following classes at the top of the file:
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;

Change tokenStream to:

   public TokenStream tokenStream(String field, Reader reader) {
     TokenStream ts = CommonGrams.getFilter(new NutchDocumentTokenizer(reader),
                                            field);
     if (field.equals("content") || field.equals("title")) {
       ts = new LowerCaseFilter(ts);
       return new PorterStemFilter(ts);
     } else {
       return ts;
     }
   }

The second change is in CommonGrams.java.
Import the following classes near the top:

import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;

In optimizePhrase, after this line:

   TokenStream ts = getFilter(new ArrayTokens(phrase), field);

Add:

   ts = new PorterStemFilter(new LowerCaseFilter(ts));

Plugin Documentation

2006-07-27 Thread Matthew Holt

Hey All,
 I was looking through the wiki plugin page and noticed that a number 
of the plugins didn't have much documentation. I was trying to find help 
on how to query using the query-basic plugin. If anyone can reply with 
the list of queries that this plugin supports, I'll update the wiki.


The same goes for all the other plugins that aren't self-explanatory. If 
you created the plugin, it'd be great if you could update it on the wiki 
with a page describing functionality. If you don't want to go to the 
wiki, just reply to the list and I'll add it to the wiki for you.


Thanks...
Matt
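
For what it's worth, the query plugins named elsewhere in these threads mostly 
just map a query prefix onto an index field. A rough summary of the query 
syntax each one accepts, as I understand it (worth double-checking against 
each plugin's source before putting it on the wiki):

# query-basic: plain terms and quoted phrases against url/anchor/content/title
bin/nutch org.apache.nutch.searcher.NutchBean 'apache tomcat'
bin/nutch org.apache.nutch.searcher.NutchBean '"apache tomcat"'

# query-site: restrict results to a single host
bin/nutch org.apache.nutch.searcher.NutchBean 'tomcat site:wiki.apache.org'

# query-url: require a term to appear in the URL
bin/nutch org.apache.nutch.searcher.NutchBean 'tomcat url:wiki'

# query-more (with index-more at index time): date ranges and MIME types
bin/nutch org.apache.nutch.searcher.NutchBean 'tomcat date:20060101-20061231'
bin/nutch org.apache.nutch.searcher.NutchBean 'tomcat type:application/pdf'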


Re: stemming

2006-07-27 Thread Matthew Holt
Actually, ignore my earlier posts. Thanks for your help Howie, I found a 
dumb mistake on my end. I had the parse-stemmer plugin activated in my 
local directory but not in my servlet directory.. Thanks!!

Matt

[EMAIL PROTECTED] wrote:

Hi,
I think we should wait until Eugen can share his code. In his version
of stemming everything works, and the pagination is implemented too.
The best way is to develop Eugen's code - this is my opinion. I think
that Jerome Charron is also interested in that code - because of the
highlighting of results.

What is your opinion about the above?

To Eugene: can you tell us when you will be able to share your code with us?

Regards,
Alexey
  


Re: stemming

2006-07-27 Thread Matthew Holt

In my nutch-site.xml I overrode the plugin.includes property as below:


 <property>
   <name>plugin.includes</name>
   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins.
   </description>
 </property>



However, it is still only letting me search for the stemmed term (i.e. 
"Interview" returns results but "interviewed" doesn't, even though that's 
the word that's actually on the page).


I tried a different approach and removed the query-stemmer value from 
nutch-site.xml to attempt to disable the plugin. I reran the crawl and 
it didn't load the plugin. However, it still had the same stemming 
functionality. I'm guessing this is due to editing the main files such 
as CommonGrams.java and NutchDocumentAnalyzer.java. Should I attempt to 
copy the needed methods into StemmerQueryFilter.java and try to isolate 
all functionality to the plugin alone?


Thanks,
   Matt

Howie Wang wrote:

It sounds like the query-stemmer is not being called.
The query string "interviews" needs to be processed
into "interview". Are you sure that your nutch-default.xml
is including the query-stemmer correctly? Put print statements
in to see if it's getting there.

By the way, someone recently told me that they
were able to put all the stemming code into an indexing
filter without touching any of the main code. All they
did was to copy some of the code that is being done
in NutchDocumentAnalyzer and CommonGrams into
their custom index filter. Haven't tried it myself.

HTH
Howie

Ok. I did this for Nutch 0.8 (had to edit the listed code some to 
make up for changes from .7.2 to .8 - mostly having to do with the 
Configuration type being needed).


It partially works.

If the page I'm trying to index contains the word "interviews" and I 
type in the search engine "interview", the stemming takes place and 
the page with the word "interviews" is returned.
However, if I type in the word "interviews" no page is returned. (The 
page with the word interviews on it should be returned).


Any ideas??
Matt

Dima Mazmanov wrote:

Hi, .

I've gotten a couple of questions offlist about stemming
so I thought I'd just post here with my changes. Sorry that
some of the changes are in the main code and not in a plugin. It
seemed that it's more efficient to put in the main analyzer. It
would be nice if later releases could add support for plugging
in a custom stemmer/analyzer.

The first change I made is in NutchDocumentAnalyzer.java.

Import the following classes at the top of the file:
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;

Change tokenStream to:

   public TokenStream tokenStream(String field, Reader reader) {
     TokenStream ts = CommonGrams.getFilter(new NutchDocumentTokenizer(reader),
                                            field);
     if (field.equals("content") || field.equals("title")) {
       ts = new LowerCaseFilter(ts);
       return new PorterStemFilter(ts);
     } else {
       return ts;
     }
   }

The second change is in CommonGrams.java.
Import the following classes near the top:

import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;

In optimizePhrase, after this line:

   TokenStream ts = getFilter(new ArrayTokens(phrase), field);

Add:

   ts = new PorterStemFilter(new LowerCaseFilter(ts));

And the rest is a new QueryFilter plugin that I'm calling 
query-stemmer.

Here's the full source for the Java file. You can copy the build.xml
and plugin.xml from query-basic, and alter the names for query-stemmer.

/* Copyright (c) 2003 The Nutch Organization.  All rights 
reserved.   */
/* Use subject to the conditions in 
http://www.nutch.org/LICENSE.txt. */


package org.apache.nutch.searcher.stemmer;

import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;

import org.apache.nutch.analysis.NutchDocumentAnalyzer;
import org.apache.nutch.analysis.CommonGrams;

import org.apache.nutch.searcher.QueryFilter;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.searcher.Query.*;

import java.io.IOException;
import java.util.HashSet;
import java.io.StringReader;

/** The default query filter.  Query terms in the default query 
field are

* expanded to search the url, anchor and content document fields.*/

Re: stemming

2006-07-26 Thread Matthew Holt
Ok. I did this for Nutch 0.8 (had to edit the listed code some to make 
up for changes from .7.2 to .8 - mostly having to do with the 
Configuration type being needed).


It partially works.

If the page I'm trying to index contains the word "interviews" and I 
type in the search engine "interview", the stemming takes place and the 
page with the word "interviews" is returned.
However, if I type in the word "interviews" no page is returned. (The 
page with the word interviews on it should be returned).


Any ideas??
Matt

Dima Mazmanov wrote:

Hi, .

I've gotten a couple of questions offlist about stemming
so I thought I'd just post here with my changes. Sorry that
some of the changes are in the main code and not in a plugin. It
seemed that it's more efficient to put in the main analyzer. It
would be nice if later releases could add support for plugging
in a custom stemmer/analyzer.

The first change I made is in NutchDocumentAnalyzer.java.

Import the following classes at the top of the file:
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;

Change tokenStream to:

   public TokenStream tokenStream(String field, Reader reader) {
TokenStream ts = CommonGrams.getFilter(new NutchDocumentTokenizer(reader),
field);
if (field.equals("content") || field.equals("title")) {
ts = new LowerCaseFilter(ts);
return new PorterStemFilter(ts);
} else {
return ts;
}
   }

The second change is in CommonGrams.java.
Import the following classes near the top:

import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;

In optimizePhrase, after this line:

   TokenStream ts = getFilter(new ArrayTokens(phrase), field);

Add:

   ts = new PorterStemFilter(new LowerCaseFilter(ts));

And the rest is a new QueryFilter plugin that I'm calling query-stemmer.
Here's the full source for the Java file. You can copy the build.xml
and plugin.xml from query-basic, and alter the names for query-stemmer.

/* Copyright (c) 2003 The Nutch Organization.  All rights reserved.   */
/* Use subject to the conditions in http://www.nutch.org/LICENSE.txt. */

package org.apache.nutch.searcher.stemmer;

import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;

import org.apache.nutch.analysis.NutchDocumentAnalyzer;
import org.apache.nutch.analysis.CommonGrams;

import org.apache.nutch.searcher.QueryFilter;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.searcher.Query.*;

import java.io.IOException;
import java.util.HashSet;
import java.io.StringReader;

/** The default query filter.  Query terms in the default query field are
* expanded to search the url, anchor and content document fields.*/
public class StemmerQueryFilter implements QueryFilter {

  private static float URL_BOOST = 4.0f;
  private static float ANCHOR_BOOST = 2.0f;

  private static int SLOP = Integer.MAX_VALUE;
  private static float PHRASE_BOOST = 1.0f;

  private static final String[] FIELDS = {"url", "anchor", "content",
"title"};
  private static final float[] FIELD_BOOSTS = {URL_BOOST, ANCHOR_BOOST,
1.0f, 2.0f};

  /** Set the boost factor for url matches, relative to content and anchor
   * matches */
  public static void setUrlBoost(float boost) { URL_BOOST = boost; }

  /** Set the boost factor for title/anchor matches, relative to url and
   * content matches. */
  public static void setAnchorBoost(float boost) { ANCHOR_BOOST = boost; }

  /** Set the boost factor for sloppy phrase matches relative to unordered
term
   * matches. */
  public static void setPhraseBoost(float boost) { PHRASE_BOOST = boost; }

  /** Set the maximum number of terms permitted between matching terms in a
   * sloppy phrase match. */
  public static void setSlop(int slop) { SLOP = slop; }

  public BooleanQuery filter(Query input, BooleanQuery output) {
addTerms(input, output);
addSloppyPhrases(input, output);
return output;
  }

  private static void addTerms(Query input, BooleanQuery output) {
Clause[] clauses = input.getClauses();
for (int i = 0; i < clauses.length; i++) {
  Clause c = clauses[i];

  if (!c.getField().equals(Clause.DEFAULT_FIELD))
continue; // skip non-default fields

  BooleanQuery out = new BooleanQuery();
  for (int f = 0; f < FIELDS.length; f++) {

Clause o = c;
String[] opt;

// TODO: I'm a little nervous about stemming for all default fields.
//   Should keep an eye on th

Re: Two Errors in Nutch 0.8 Tutorial?

2006-07-25 Thread Matthew Holt

n/m it's there now..
Matt

Matthew Holt wrote:
If you download the latest trunk copy of 0.8, bin/nutch will not even 
be available.. is this supposed to be this way?

Matt

Bryan Woliner wrote:
I am certainly far from a nutch expert, but it appears to me that 
there are

two errors in the current Nutch 0.8 tutorial.

First off, here is the version of Nutch 0.8 that I am using, in case there
have been changes made in newer versions that invalidate my comments:

-bash-2.05b$ svn info
Path: .
URL: http://svn.apache.org/repos/asf/lucene/nutch/trunk
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 414318
Node Kind: directory
Schedule: normal
Last Changed Author: siren
Last Changed Rev: 414306
Last Changed Date: 2006-06-14 11:08:28 -0500 (Wed, 14 Jun 2006)
Properties Last Updated: 2006-06-14 12:00:57 -0500 (Wed, 14 Jun 2006)

Error #1:

Towards the end of the tutorial, the following command is found:

bin/nutch invertlinks crawl/linkdb crawl/segments


When I call this command verbatim, I get the following error:

2006-07-25 08:44:40,503 WARN  mapred.LocalJobRunner
(LocalJobRunner.java:run(119))
- job_8ly5hf
java.io.IOException: No input directories specified in: Configuration:
defaults: hadoop-default.xml , mapred-default.xml ,
/home/bryan/nutch-8d/hadoop/mapred/local/localRunner/job_8ly5hf.xmlfinal: 


hadoop-site.xml
   at org.apache.hadoop.mapred.InputFormatBase.listPaths(
InputFormatBase.java:96)
   at org.apache.hadoop.mapred.SequenceFileInputFormat.listPaths(
SequenceFileInputFormat.java:37)
   at org.apache.hadoop.mapred.InputFormatBase.getSplits(
InputFormatBase.java:106)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(
LocalJobRunner.java:80)
Exception in thread "main" java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:342)
   at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:203)
   at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:305)

I think the correct syntax for the command should be:

bin/nutch invertlinks crawl/linkdb crawl/segments/* (with the /* added
to the end).

Error #2:

The tutorial says that to index, the following command should be called:

bin/nutch index indexes crawl/linkdb crawl/segments/*

However, when I call that command I get the following error:

Usage: ...

I believe the correct syntax should be:

bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb 
crawl/segments/*


If these are indeed errors in the tutorial, perhaps someone with the
authority to do so would be kind enough to make the necessary
changes.

My two cents,
Bryan





Re: Two Errors in Nutch 0.8 Tutorial?

2006-07-25 Thread Matthew Holt
If you download the latest trunk copy of 0.8, bin/nutch will not even be 
available.. is this supposed to be this way?

Matt

Bryan Woliner wrote:
I am certainly far from a nutch expert, but it appears to me that 
there are

two errors in the current Nutch 0.8 tutorial.

First off, here is the version of Nutch 0.8 that I am using, in case there
have been changes made in newer versions that invalidate my comments:

-bash-2.05b$ svn info
Path: .
URL: http://svn.apache.org/repos/asf/lucene/nutch/trunk
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 414318
Node Kind: directory
Schedule: normal
Last Changed Author: siren
Last Changed Rev: 414306
Last Changed Date: 2006-06-14 11:08:28 -0500 (Wed, 14 Jun 2006)
Properties Last Updated: 2006-06-14 12:00:57 -0500 (Wed, 14 Jun 2006)

Error #1:

Towards the end of the tutorial, the following command is found:

bin/nutch invertlinks crawl/linkdb crawl/segments


When I call this command verbatim, I get the following error:

2006-07-25 08:44:40,503 WARN  mapred.LocalJobRunner
(LocalJobRunner.java:run(119))
- job_8ly5hf
java.io.IOException: No input directories specified in: Configuration:
defaults: hadoop-default.xml , mapred-default.xml ,
/home/bryan/nutch-8d/hadoop/mapred/local/localRunner/job_8ly5hf.xmlfinal:
hadoop-site.xml
   at org.apache.hadoop.mapred.InputFormatBase.listPaths(
InputFormatBase.java:96)
   at org.apache.hadoop.mapred.SequenceFileInputFormat.listPaths(
SequenceFileInputFormat.java:37)
   at org.apache.hadoop.mapred.InputFormatBase.getSplits(
InputFormatBase.java:106)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(
LocalJobRunner.java:80)
Exception in thread "main" java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:342)
   at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:203)
   at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:305)

I think the correct syntax for the command should be:

bin/nutch invertlinks crawl/linkdb crawl/segments/* (with the /* added
to the end).

Error #2:

The tutorial says that to index, the following command should be called:

bin/nutch index indexes crawl/linkdb crawl/segments/*

However, when I call that command I get the following error:

Usage: ...

I believe the correct syntax should be:

bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

If these are indeed errors in the tutorial, perhaps someone with the
authority to do so would be kind enough to make the necessary
changes.

My two cents,
Bryan
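
Putting Bryan's two corrections together, the step-by-step sequence from the 
0.8 tutorial comes out roughly like this (a sketch only, for a single 
generate/fetch/update round on a local checkout, with the data under ./crawl):

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment
# the two corrected commands from this thread:
bin/nutch invertlinks crawl/linkdb crawl/segments/*
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*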



Re: Nutch 0.8-dev?

2006-07-24 Thread Matthew Holt

[EMAIL PROTECTED] trunk]$ svn info
Path: .
URL: http://svn.apache.org/repos/asf/lucene/nutch/trunk
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 425089
Node Kind: directory
Schedule: normal
Last Changed Author: ab
Last Changed Rev: 425087
Last Changed Date: 2006-07-24 11:19:33 -0400 (Mon, 24 Jul 2006)
Properties Last Updated: 2006-07-24 11:30:01 -0400 (Mon, 24 Jul 2006)


Andrzej Bialecki wrote:

Matthew Holt wrote:
Just checked out a copy of nutch-0.8, and it no longer includes the 
bin/nutch script. The directory structure seems as if it has changed 
significantly from a week ago and my recrawl script no longer works. 
What's going on? Will there be an update to the tutorial soon..


When you execute 'svn info', which revision do you see?



Nutch 0.8-dev?

2006-07-24 Thread Matthew Holt
Just checked out a copy of nutch-0.8, and it no longer includes the 
bin/nutch script. The directory structure seems as if it has changed 
significantly from a week ago and my recrawl script no longer works. 
What's going on? Will there be an update to the tutorial soon..


Thanks,
 Matt


Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt

Lourival Júnior wrote:

I think it won't work for me because I'm using Nutch version 0.7.2.
Actually I use this script:

#!/bin/bash

# A simple script to run a Nutch re-crawl
# Script source:
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

#{

if [ -n "$1" ]
then
 crawl_dir=$1
else
 echo "Usage: recrawl crawl_dir [depth] [adddays]"
 exit 1
fi

if [ -n "$2" ]
then
 depth=$2
else
 depth=5
fi

if [ -n "$3" ]
then
 adddays=$3
else
 adddays=0
fi

webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
index_dir=$crawl_dir/index

# Stop the Tomcat service
#net stop "Apache Tomcat"

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
 bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
 segment=`ls -d $segments_dir/* | tail -1`
 bin/nutch fetch $segment
 bin/nutch updatedb $webdb_dir $segment
 echo
 echo "Fim do ciclo $i."
 echo
done

# Update segments
echo
echo "Atualizando os Segmentos..."
echo
mkdir tmp
bin/nutch updatesegs $webdb_dir $segments_dir tmp
rm -R tmp

# Index segments
echo "Indexando os segmentos..."
echo
for segment in `ls -d $segments_dir/* | tail -$depth`
do
 bin/nutch index $segment
done

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $segments_dir bogus

# Merge indexes
#echo "Unindo os segmentos..."
#echo
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir

chmod 777 -R $index_dir

# Start the Tomcat service
#net start "Apache Tomcat"

echo "Fim."

#} > recrawl.log 2>&1

As you suggested, I used the touch command instead of stopping Tomcat. However,
I get the error posted in the previous message. I'm running Nutch on the Windows
platform with Cygwin. I only get no errors when I stop Tomcat. I use
this command to call the script:

./recrawl crawl-legislacao 1

Could you give me more clarifications?

Thanks a lot!

On 7/21/06, Matthew Holt <[EMAIL PROTECTED]> wrote:


Lourival Júnior wrote:
> Hi Renaud!
>
> I'm a newbie with shell scripts and I know stopping the Tomcat service is
> not the best way to do this. The problem is, when I run the re-crawl script
> with Tomcat started I get this error:
>
> 060721 132224 merging segment indexes to: crawl-legislacao2\index
> Exception in thread "main" java.io.IOException: Cannot delete _0.f0
>at
> org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
>at 
org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)

>at
> org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java
> :141)
>at
> org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
>at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java
:92)
>at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java
:160)
>
> So, I want another way to re-crawl my pages without this error and
> without
> restarting the tomcat. Could you suggest one?
>
> Thanks a lot!
>
>
Try this updated script and tell me what command exactly you run to call
the script. Let me know the error message then.

Matt


#!/bin/bash

# Nutch recrawl script.
# Based on 0.7.2 script at
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html 


# Modified by Matthew Holt

if [ -n "$1" ]
then
  nutch_dir=$1
else
  echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
  echo "servlet_path - Path of the nutch servlet (i.e.
/usr/local/tomcat/webapps/ROOT)"
  echo "crawl_dir - Name of the directory the crawl is located in."
  echo "[depth] - The link depth from the root page that should be
crawled."
  echo "[adddays] - Advance the clock # of days for fetchlist 
generation."

  exit 1
fi

if [ -n "$2" ]
then
  crawl_dir=$2
else
  echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
  echo "servlet_path - Path of the nutch servlet (i.e.
/usr/local/tomcat/webapps/ROOT)"
  echo "crawl_dir - Name of the directory the crawl is located in."
  echo "[depth] - The link depth from the root page that should be
crawled."
  echo "[adddays] - Advance the clock # of days for fetchlist 
generation."

  exit 1
fi

if [ -n "$3" ]
then
  depth=$3
else
  depth=5
fi

if [ -n "$4" ]
then
  adddays=$4
else
  adddays=0
fi

# Only change if your crawl subdirectories are named something different
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt

Lourival Júnior wrote:

Hi Renaud!

I'm a newbie with shell scripts and I know stopping the Tomcat service is not the
best way to do this. The problem is, when I run the re-crawl script with
Tomcat started I get this error:

060721 132224 merging segment indexes to: crawl-legislacao2\index
Exception in thread "main" java.io.IOException: Cannot delete _0.f0
   at 
org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)

   at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
   at 
org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java

:141)
   at 
org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)

   at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
   at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)

So, I want another way to re-crawl my pages without this error and 
without

restarting the tomcat. Could you suggest one?

Thanks a lot!


Try this updated script and tell me what command exactly you run to call 
the script. Let me know the error message then.


Matt


#!/bin/bash

# Nutch recrawl script.
# Based on 0.7.2 script at 
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

# Modified by Matthew Holt

if [ -n "$1" ]
then
 nutch_dir=$1
else
 echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
 echo "servlet_path - Path of the nutch servlet (i.e. 
/usr/local/tomcat/webapps/ROOT)"

 echo "crawl_dir - Name of the directory the crawl is located in."
 echo "[depth] - The link depth from the root page that should be crawled."
 echo "[adddays] - Advance the clock # of days for fetchlist generation."
 exit 1
fi

if [ -n "$2" ]
then
 crawl_dir=$2
else
 echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
 echo "servlet_path - Path of the nutch servlet (i.e. 
/usr/local/tomcat/webapps/ROOT)"

 echo "crawl_dir - Name of the directory the crawl is located in."
 echo "[depth] - The link depth from the root page that should be crawled."
 echo "[adddays] - Advance the clock # of days for fetchlist generation."
 exit 1
fi

if [ -n "$3" ]
then
 depth=$3
else
 depth=5
fi

if [ -n "$4" ]
then
 adddays=$4
else
 adddays=0
fi

# Only change if your crawl subdirectories are named something different
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
 bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
 segment=`ls -d $segments_dir/* | tail -1`
 bin/nutch fetch $segment
 bin/nutch updatedb $webdb_dir $segment
done

# Update segments
bin/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
#ls -d $segments_dir/* | tail -$depth | xargs
bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*

# De-duplicate indexes
bin/nutch dedup $new_indexes

# Merge indexes
bin/nutch merge $index_dir $new_indexes

# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml

# Clean up
rm -rf $new_indexes



Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt

Renaud Richardet wrote:

Hi Matt and Lourival,

Matt, thank you for the recrawl script. Any plans to commit it to trunk?

Lourival, here's in the script what "reloads Tomcat", not the 
cleanest, but it should work

# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml

HTH,
Renaud


Lourival Júnior wrote:

Hi Matt!

In the article found at
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html you
said the re-crawl script has a problem with updating the live search
index. In my tests with Nutch version 0.7.2, when I run the script the
index could not be updated because Tomcat loads it into memory. Could you
suggest a modification to this script or to the NutchBean that accepts
modifications to the index without restarting Tomcat (actually, I use
net stop "Apache Tomcat" before the index update...)?

Thanks

On 7/21/06, Matthew Holt <[EMAIL PROTECTED]> wrote:


Thanks for putting up with all the messages to the list... Here is the
recrawl script for 0.8.0 if anyone is interested.
Matt
---

#!/bin/bash

# Nutch recrawl script.
# Based on 0.7.2 script at
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html 


# Modified by Matthew Holt

if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi

if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi


# EDIT THIS - List the location to your nutch servlet container.
nutch_dir=/usr/local/apache-tomcat-5.5.17/webapps/nutch/

# No need to edit anything past this line #
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

# Update segments
bin/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
#ls -d $segments_dir/* | tail -$depth | xargs
bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*

# De-duplicate indexes
bin/nutch dedup $new_indexes

# Merge indexes
bin/nutch merge $index_dir $new_indexes

# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml

# Clean up
rm -rf $new_indexes







I'll commit it to trunk, just have to modify it a little so users don't 
have to edit the tomcat location in their file and can do it through the 
command line.. Kinda busy @ work with this right now, so I'll follow up 
later regarding the commit.

Matt


Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt
Thanks for putting up with all the messages to the list... Here is the 
recrawl script for 0.8.0 if anyone is interested.

   Matt
---

#!/bin/bash

# Nutch recrawl script.
# Based on 0.7.2 script at 
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

# Modified by Matthew Holt

if [ -n "$1" ]
then
 crawl_dir=$1
else
 echo "Usage: recrawl crawl_dir [depth] [adddays]"
 exit 1
fi

if [ -n "$2" ]
then
 depth=$2
else
 depth=5
fi

if [ -n "$3" ]
then
 adddays=$3
else
 adddays=0
fi


# EDIT THIS - List the location to your nutch servlet container.
nutch_dir=/usr/local/apache-tomcat-5.5.17/webapps/nutch/

# No need to edit anything past this line #
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
 bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
 segment=`ls -d $segments_dir/* | tail -1`
 bin/nutch fetch $segment
 bin/nutch updatedb $webdb_dir $segment
done

# Update segments
bin/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
#ls -d $segments_dir/* | tail -$depth | xargs
bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*

# De-duplicate indexes
bin/nutch dedup $new_indexes

# Merge indexes
bin/nutch merge $index_dir $new_indexes

# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml

# Clean up
rm -rf $new_indexes
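
A typical invocation, run from the Nutch install directory after editing 
nutch_dir above (arguments as in the usage line - a sketch, assuming the 
crawl data lives in ./crawl):

# recrawl the existing ./crawl data, depth 2, clock advanced 31 days
./recrawl crawl 2 31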



[Fwd: Reworked recrawl script for 0.8.0]

2006-07-20 Thread Matthew Holt

Stefan,
  The nutch-user mailing list seems to be down, or at least unavailable 
to my personal account. I have spent several hours looking into 
creating/modifying an intranet recrawl script for 0.8.0. I have it to where 
it does not error out; however, when I search for something using the 
recrawled database, no page is returned (and no error is received either). Can 
you look at the script (included in the attached email) and see if you 
notice any steps I'm missing or incorrectly ordering? Thanks.

 Matt
--- Begin Message ---

Hi all,
I reworked the recrawl script for 0.7.2 
(http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html) 
for nutch-0.8.0-dev.


I thought I had it refactored completely, and it doesn't error out, but 
I must be calling some of the commands in the improper order. Can you 
please take a look at it and see if you can spot what is wrong?? Thanks.


   Matt

#!/bin/bash

# A simple script to run a Nutch re-crawl

if [ -n "$1" ]
then
 crawl_dir=$1
else
 echo "Usage: recrawl crawl_dir [depth] [adddays]"
 exit 1
fi

if [ -n "$2" ]
then
 depth=$2
else
 depth=5
fi

if [ -n "$3" ]
then
 adddays=$3
else
 adddays=0
fi

webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/newsegs
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

mkdir $segments_dir

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
 bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
 segment=`ls -d $segments_dir/* | tail -1`
 bin/nutch fetch $segment
 bin/nutch updatedb $webdb_dir $segment
done

# Update segments
bin/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
ls -d $segments_dir/* | tail -$depth | xargs bin/nutch index 
$new_indexes $webdb_dir $linkdb_dir


# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of argsexpected
bin/nutch dedup $new_indexes

# Merge indexes
bin/nutch merge $index_dir $new_indexes

--- End Message ---


Please Help.. recrawl script.. will send out to the list when finished for 0.8.0

2006-07-20 Thread Matthew Holt
I sent out a few emails regarding a recrawl script I wrote. However, if 
it'd be easier for anyone to help, can you please check that all of the 
below steps are the only ones that need to be taken to recrawl? Or if 
there is a resource online that describes manually recrawling, that'd be 
great as well. Thanks.


Matt

Recrawl Steps:
generate
fetch
updatedb
invertlinks
index
dedup
merge


Reworked recrawl script for 0.8.0

2006-07-19 Thread Matthew Holt

Hi all,
I reworked the recrawl script for 0.7.2 
(http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html) 
for nutch-0.8.0-dev.


I thought I had it refactored completely, and it doesn't error out, but 
I must be calling some of the commands in the improper order. Can you 
please take a look at it and see if you can spot what is wrong?? Thanks.


   Matt

#!/bin/bash

# A simple script to run a Nutch re-crawl

if [ -n "$1" ]
then
 crawl_dir=$1
else
 echo "Usage: recrawl crawl_dir [depth] [adddays]"
 exit 1
fi

if [ -n "$2" ]
then
 depth=$2
else
 depth=5
fi

if [ -n "$3" ]
then
 adddays=$3
else
 adddays=0
fi

webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/newsegs
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

mkdir $segments_dir

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
 bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
 segment=`ls -d $segments_dir/* | tail -1`
 bin/nutch fetch $segment
 bin/nutch updatedb $webdb_dir $segment
done

# Update segments
bin/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
ls -d $segments_dir/* | tail -$depth | xargs bin/nutch index 
$new_indexes $webdb_dir $linkdb_dir


# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $new_indexes

# Merge indexes
bin/nutch merge $index_dir $new_indexes


Not crawling certain directories.

2006-07-15 Thread Matthew Holt
One more question.. I'm using nutch-0.8.0 and trying to index a domain 
and want to exclude a certain directory from the crawl. In the 
crawl-urlfilter.txt I have defined the following:


# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/
-^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy

However, the /yummy directory is still crawled. Any ideas as to what is 
going on? Thanks..

Matt
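
One thing to keep in mind: the first matching pattern in crawl-urlfilter.txt 
decides whether a URL is kept, so the accept rule above matches the /yummy 
URLs before the exclude rule is ever reached. A sketch of the reordering that 
is usually intended (note also that the dots are unescaped and the trailing 
"com*/" in the original patterns may not match what you expect):

# skip the excluded directory first - the first match wins
-^http://([a-z0-9]*\.)*wwwapps\.mywebsite\.com/yummy
# then accept everything else on the host
+^http://([a-z0-9]*\.)*wwwapps\.mywebsite\.com/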


Built-in Recrawl

2006-07-15 Thread Matthew Holt
I'm sure there is a good answer for this, whether it be lack of time, or 
not enough demand, but I was just wondering why there is not a 'recrawl' 
option that goes with the intranet crawl. I'm looking into making one 
for myself, and was just wondering if one is in development or there are 
other reasons that I don't understand that are preventing it from being 
created...


Thanks!

Matt


Re: Intranet Recrawl Script for 0.8.0

2006-07-15 Thread Matthew Holt

kevin wrote:

Where can I download Nutch version 0.8? I can't find it on the Nutch website.

Matthew Holt wrote:
Does anyone have a good Intranet recrawl script for nutch-0.8.0? 
Thanks..

Matt





From trunk in the SVN repository.


Intranet Recrawl Script for 0.8.0

2006-07-14 Thread Matthew Holt

Does anyone have a good Intranet recrawl script for nutch-0.8.0? Thanks..
Matt


0.8.0 stable enough to use?

2006-07-13 Thread Matthew Holt
Just wondering what the general consensus is on using 0.8.0 in 
production. Do you think it's stable enough to use?? I would ideally 
want to use 0.7.2, but it is missing the parse-oo plugin that 0.8.0 has. 
I attempted to port the parse-oo plugin to 0.7.2, but ran into some 
complications due to a large number of dependencies being different on 
0.7.2.


Thanks,
 Matt



Re: nutch-0.8.0-dev search error

2006-07-13 Thread Matthew Holt

Timo Scheuer wrote:

On Thursday, 13 July 2006 at 18:47, Matthew Holt wrote:
  

I successfully ran the intranet crawl and my nutch/crawl dir was
generated. I then deployed the war file and stopped/started tomcat from
within the crawl directory. However, when I attempt to actually run a
search, a page with the following error is returned. Any ideas?
...



It seems that some classes cannot be found. Have you set up the Java CLASSPATH 
correctly in your environment or your start script? Does it contain rt.jar? The 
complete log should tell you which class is missing. That helps you to find 
out which lib is not in the correct Java search path.



  

Ok got it. Thanks a bunch.


nutch-0.8.0-dev search error

2006-07-13 Thread Matthew Holt
I successfully ran the intranet crawl and my nutch/crawl dir was 
generated. I then deployed the war file and stopped/started tomcat from 
within the crawl directory. However, when I attempt to actually run a 
search, a page with the following error is returned. Any ideas?


Matt


*type* Exception report

*message*

*description* _The server encountered an internal error () that 
prevented it from fulfilling this request._


*exception*

javax.servlet.ServletException
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:272)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

*root cause*

java.lang.NoClassDefFoundError
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
java.lang.reflect.Constructor.newInstance(Constructor.java:494)
java.lang.Class.newInstance0(Class.java:350)
java.lang.Class.newInstance(Class.java:303)

org.apache.jasper.servlet.JspServletWrapper.getServlet(JspServletWrapper.java:148)

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:315)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:314)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:264)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

*note* _The full stack trace of the root cause is available in the 
Apache Tomcat/5.5.17 logs._





parse-oo plugin

2006-07-12 Thread Matthew Holt

Hi all,
 I saw that there is a parse-oo plugin available for nutch-0.8. This 
plugin parses OpenOffice documents, etc. I'm attempting to extract it 
and use it in the latest stable release (0.7.2). Has anyone already 
successfully done this? Having some issues with it looking for files 
that are in 0.8 only. Thanks...

Matt


Re: Customizing Search Results

2006-07-12 Thread Matthew Holt

Jayant Kumar Gandhi wrote:

You can modify/remove those in search.jsp .

On 7/12/06, Matthew Holt <[EMAIL PROTECTED]> wrote:

Is there any documentation available regarding the customization of the
search page/results?

I am interested in removing the (explain) and (anchors) links.

Thanks.
 Matt






Thanks a bunch.


Customizing Search Results

2006-07-12 Thread Matthew Holt
Is there any documentation available regarding the customization of the 
search page/results?


I am interested in removing the (explain) and (anchors) links.

Thanks.
Matt



Eclipse IDE

2006-07-11 Thread Matthew Holt
Can someone who has Nutch development configured for Eclipse please 
paste their .project and .classpath files? Thanks.

 Matt


Re: Adddays confusion - easy question for the experts

2006-07-11 Thread Matthew Holt

Honda-Search Administrator wrote:

Reader's Digest version:
How can I ensure that nutch only crawls the urls I inject into the 
fetchlist and not recrawl the entire webdb?

Can anyone explain to me (in simple terms) exactly what adddays does?

Long version:
My setup is simple.  I crawl a number of internet forums.  This 
requires me to scan new posts every night to stay on top of things.


I crawled all of the older posts on these forums a while ago, and now 
have to just worry about newer posts.  I have written a small script 
that injects the pages that have changed or the new pages each night.


When I run the recrawl script, I only want to crawl the pages that are 
injected into the fetchlist (via bin/nutch inject).  I have also 
changed the default nutch recrawl time interval (normally 30 days)  to 
a VERY large number to ensure that nutch will not recrawl old pages 
for a very long time.


Anyway, back to my original question.

I recrawled today hoping that nutch would ONLY recrawl the 3000 
documents I injected (via bin/nutch inject).  I used depth of 1 and 
left the adddays parameter blank (because I really can't get a clear 
idea of what it does). Depth of 1 is used because I only want to crawl 
the URLs I have injected into the fetchlist and not have nutch go 
crazy on other domains, documents, etc.  Using the regex-urlfilter I 
have also ensured that it will only crawl the domains I want it to crawl.


So my command looks something like this:

/home/nutch/recrawl.sh /home/nutch/database 1

my recrawl script can be seen here:  
http://www.honda-search.com/script.html


Much to my surprise Nutch is recrawling EVERY document in my webdb 
(plus, I assume, the newly injected documents).  Is this because the 
adddays variable is left blank?  Should I set the adddays variable 
really high?  How can I ensure that it only crawls the urls that are 
injected?


Can anyone explain what adddays does (in easy to understand terms?)  
The wiki isn't very clear for a newbie like myself.


I was looking for similar info. The adddays option advances the clock 
however many days you specify. The default for page reindexing is 30 
days, so every 30 days the page will expire and nutch will reindex it. 
However, if you pass the param -adddays 31, it will advance the clock 31 
days and cause every page to be reindexed.


If you pass the param -adddays 27 and you have the default reindexing 
set to be 30 days, nutch will reindex all pages older than 3 days. 
Correct me if I'm wrong.

 Matt
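
As a concrete illustration of the arithmetic (assuming the default 
db.default.fetch.interval of 30 days hasn't been changed):

# only pages whose 30-day interval has expired are put in the fetchlist
bin/nutch generate crawl/crawldb crawl/segments

# clock advanced 31 days: every page looks due, so everything is refetched
bin/nutch generate crawl/crawldb crawl/segments -adddays 31

# clock advanced 27 days: only pages last fetched more than 3 days ago are due
bin/nutch generate crawl/crawldb crawl/segments -adddays 27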


OpenOffice Support?

2006-07-11 Thread Matthew Holt
Just wondering, has anyone done any work on a plugin (or aware of a 
plugin) that supports the indexing of open office documents? Thanks.

Matt


Nutch 0.8

2006-06-15 Thread Matthew Holt

Any estimate on the release date for nutch 0.8? Just wondering..
Thanks.
Matt


Re: HTTPS

2006-06-12 Thread Matthew Holt
Modify your conf/nutch-site.xml file to enable support for 
protocol-httpclient.

http://wiki.apache.org/nutch/PluginCentral
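
A sketch of what that override can look like - the plugin list below is only 
an example (the 0.8 defaults with protocol-http swapped for 
protocol-httpclient); start from the plugin.includes value in your own 
nutch-default.xml:

# enable protocol-httpclient (which supports https) in conf/nutch-site.xml
# note: this overwrites any existing local overrides
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>
</configuration>
EOF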
Steele, Aaron wrote:

IS there a way to enable https searches? 



Thank You,

Aaron Steele
YRI Enterprise Solutions
https://ris.yumnet.com
w: 972.338.6862
c: 817.401.0831


This communication is confidential and may be legally privileged.  If you are 
not the intended recipient, (i) please do not read or disclose to others, (ii) 
please notify the sender by reply mail, and (iii) please delete this 
communication from your system.  Failure to follow this process may be 
unlawful.  Thank you for your cooperation.

 



Re: [Fwd: Re: intranet crawl issue]

2006-06-12 Thread Matthew Holt

Stefan,
 Thanks a bunch, turns out the ? mark was the cause of the problems. 
Able to run the search now and not having any problems with it. Thanks a 
ton!

Matt
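
For anyone hitting the same thing: the stock filter line that skips URLs 
containing ?, *, !, @ and = is what keeps a CMS or wiki with query-string 
URLs from being crawled. A sketch of the usual adjustment to 
conf/crawl-urlfilter.txt, keeping the other characters excluded:

# skip URLs containing certain characters as probable queries, etc.
# original line: -[?*!@=]
# allow '?' and '=' so dynamic (query-string) pages are fetched:
-[*!@]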

Stefan Groschupf wrote:

Hi Matt, 
the first impression I have is that your segment has only 6 pages. 
You generated many empty pages. 
Is the website a CMS? Does it have question marks in the URLs? You exclude 
all pages with question marks. 
Check how many pages are in your web db after injecting.

Check how many pages are in your segment fetch list.

HTH
Stefan 


-.


Stefan Neufeind wrote:


Matthew Holt wrote:

Just FYI, both of the sites I am trying to crawl are under the same 
domain; the sub-domains just differ. It works for one; the other only 
appears to fetch 6 or so pages and then doesn't fetch anymore. Do you 
need any more information to solve the problem? I've tried everything 
and haven't had any luck. Thanks.




What does your crawl-urlfilter.txt look like?

 Stefan









[Fwd: Re: intranet crawl issue]

2006-06-12 Thread Matthew Holt

Any ideas? I've exhausted all the ends I know to take a look at.
 matt
--- Begin Message ---

Here is my crawl-urlfilter.txt file.
Matt

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*corp.mydomain.com/

# skip everything else
-.


Stefan Neufeind wrote:


Matthew Holt wrote:

Just FYI, both of the sites I am trying to crawl are under the same 
domain; the sub-domains just differ. It works for one; the other only 
appears to fetch 6 or so pages and then doesn't fetch anymore. Do you 
need any more information to solve the problem? I've tried everything 
and haven't had any luck. Thanks.



What does your crawl-urlfilter.txt look like?

 Stefan



--- End Message ---


Re: intranet crawl issue

2006-06-09 Thread Matthew Holt

Here is my crawl-urlfilter.txt file.
Matt

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*corp.mydomain.com/

# skip everything else
-.


Stefan Neufeind wrote:


Matthew Holt wrote:

Just FYI, both of the sites I am trying to crawl are under the same 
domain; the sub-domains just differ. It works for one, but the other 
only appears to fetch 6 or so pages and then doesn't fetch anymore. Do you 
need any more information to solve the problem? I've tried everything 
and haven't had any luck. Thanks.



What does your crawl-urlfilter.txt look like?

 Stefan



intranet crawl issue

2006-06-08 Thread Matthew Holt
Just FYI, both of the sites I am trying to crawl are under the same 
domain; the sub-domains just differ. It works for one, but the other 
only appears to fetch 6 or so pages and then doesn't fetch anymore. Do you 
need any more information to solve the problem? I've tried everything and 
haven't had any luck. Thanks.

Matt


--


Hey guys,
Alright, so I've been able to successfully crawl a wiki on the domain 
name wwaps.blah.domain.com/.


However, when I configure it to crawl all intranet web pages at 
intranet.corp.domain.com and start the crawl, it only seems to 
fetch 6 pages or so and then says that the indexing is completed. Any ideas? 
Thanks!

Matt.


The output of the crawl is below:

060608 112059 parsing 
file:/home/rdu/mholt/Desktop/nutch-0.7.2/conf/nutch-default.xml
060608 112059 parsing 
file:/home/mholt/Desktop/nutch-0.7.2/conf/crawl-tool.xml
060608 112059 parsing 
file:/home/mholt/Desktop/nutch-0.7.2/conf/nutch-site.xml

060608 112059 No FS indicated, using default:local
060608 112059 crawl started in: crawl/
060608 112059 rootUrlFile = urls/redhat
060608 112059 threads = 10
060608 112059 depth = 10
060608 112059 Created webdb at 
LocalFS,/home/mholt/Desktop/nutch-0.7.2/crawl/db

060608 112059 Starting URL processing
060608 112059 Plugins: looking in: /home/mholt/Desktop/nutch-0.7.2/plugins
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/clustering-carrot2
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/creativecommons
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/index-basic/plugin.xml
060608 112059 impl: point=org.apache.nutch.indexer.IndexingFilter 
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/index-more
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/language-identifier
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/nutch-extensionpoints/plugin.xml
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/ontology
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-ext
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-html/plugin.xml
060608 112059 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.html.HtmlParser
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-js
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-msword
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-pdf
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-rss
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-text/plugin.xml
060608 112059 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.text.TextParser
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/protocol-file
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/protocol-ftp
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/protocol-http/plugin.xml
060608 112059 impl: point=org.apache.nutch.protocol.Protocol 
class=org.apache.nutch.protocol.http.Http
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/protocol-httpclient
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/query-basic/plugin.xml
060608 112059 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.basic.BasicQueryFilter
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/query-more
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/query-site/plugin.xml
060608 112059 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.site.SiteQueryFilter
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/query-url/plugin.xml
060608 112059 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.url.URLQueryFilter
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/urlfilter-prefix
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/urlfilter-regex/plugin.xml
060608 112059 impl: point=org.apache.nutch.net.URLFilter 
class=org.apache.nutch.net.RegexURLFilter
060608 112059 found resource crawl-urlfilter.txt at 
file:/home/mholt/Desktop/nutch-0.7.2/conf/crawl-urlfilter.txt

060608 112059 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060608 112059 Added 1 pages
060608 112059 Processing pagesByURL: Sorted 1 instructions in 0.0060 
seconds.
060608 112059 Processing pagesByURL: Sorted 166.66 
instructions/second
060608 112059 Processing pagesByURL: Merged to new DB containing 1 
records in 0.0010 seconds

060608 112059 Processing pagesByURL: Merged 1000.0 records/second
060608 112059 Processing pagesByMD5: Sorted 1 instructions in 0.0050 
seconds.

060608 112059 Processing pagesByMD5: Sorted 200.0 instructions/second
060608 112059 Processing page

Intranet Crawling

2006-06-08 Thread Matthew Holt

Hey guys,
 Alright, so I've been able to successfully crawl a wiki on the domain 
name wwaps.blah.domain.com/.


However, when I configure it to crawl all intranet web pages at 
intranet.corp.domain.com and start the crawl, it only seems to 
fetch 6 pages or so and then says that the indexing is completed. Any ideas? 
Thanks!

Matt.


The output of the crawl is below:

060608 112059 parsing 
file:/home/rdu/mholt/Desktop/nutch-0.7.2/conf/nutch-default.xml
060608 112059 parsing 
file:/home/mholt/Desktop/nutch-0.7.2/conf/crawl-tool.xml
060608 112059 parsing 
file:/home/mholt/Desktop/nutch-0.7.2/conf/nutch-site.xml

060608 112059 No FS indicated, using default:local
060608 112059 crawl started in: crawl/
060608 112059 rootUrlFile = urls/redhat
060608 112059 threads = 10
060608 112059 depth = 10
060608 112059 Created webdb at 
LocalFS,/home/mholt/Desktop/nutch-0.7.2/crawl/db

060608 112059 Starting URL processing
060608 112059 Plugins: looking in: /home/mholt/Desktop/nutch-0.7.2/plugins
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/clustering-carrot2
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/creativecommons
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/index-basic/plugin.xml
060608 112059 impl: point=org.apache.nutch.indexer.IndexingFilter 
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/index-more
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/language-identifier
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/nutch-extensionpoints/plugin.xml
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/ontology
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-ext
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-html/plugin.xml
060608 112059 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.html.HtmlParser
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-js
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-msword
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-pdf
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-rss
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/parse-text/plugin.xml
060608 112059 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.text.TextParser
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/protocol-file
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/protocol-ftp
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/protocol-http/plugin.xml
060608 112059 impl: point=org.apache.nutch.protocol.Protocol 
class=org.apache.nutch.protocol.http.Http
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/protocol-httpclient
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/query-basic/plugin.xml
060608 112059 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.basic.BasicQueryFilter
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/query-more
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/query-site/plugin.xml
060608 112059 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.site.SiteQueryFilter
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/query-url/plugin.xml
060608 112059 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.url.URLQueryFilter
060608 112059 not including: 
/home/mholt/Desktop/nutch-0.7.2/plugins/urlfilter-prefix
060608 112059 parsing: 
/home/mholt/Desktop/nutch-0.7.2/plugins/urlfilter-regex/plugin.xml
060608 112059 impl: point=org.apache.nutch.net.URLFilter 
class=org.apache.nutch.net.RegexURLFilter
060608 112059 found resource crawl-urlfilter.txt at 
file:/home/mholt/Desktop/nutch-0.7.2/conf/crawl-urlfilter.txt

060608 112059 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060608 112059 Added 1 pages
060608 112059 Processing pagesByURL: Sorted 1 instructions in 0.0060 
seconds.
060608 112059 Processing pagesByURL: Sorted 166.66 
instructions/second
060608 112059 Processing pagesByURL: Merged to new DB containing 1 
records in 0.0010 seconds

060608 112059 Processing pagesByURL: Merged 1000.0 records/second
060608 112059 Processing pagesByMD5: Sorted 1 instructions in 0.0050 
seconds.

060608 112059 Processing pagesByMD5: Sorted 200.0 instructions/second
060608 112059 Processing pagesByMD5: Merged to new DB containing 1 
records in 0.0 seconds

060608 112059 Processing pagesByMD5: Merged Infinity records/second
060608 112059 Processing linksByMD5: Copied file (4096 bytes) in 0.0090 
secs.
060608 112059 Processing linksByURL: Copied file (4096 bytes) in 0.0080 
secs.

060608 112059 FetchListTool started
060608 112100 Process

Re: Recrawling question

2006-06-06 Thread Matthew Holt
It's writing the segments to a new directory and then, I believe, merging them 
and the index... or am I reading the script wrong?


Stefan Neufeind wrote:


Oh sorry, I didn't look up the script again from your earlier mail. Hmm,
I guess you can live fine without the invertlinks (if I'm right). Are
you sure that your indexing works fine? I think if an index exists nutch
complains. See if there is any error with indexing. Also maybe try to
delete your current index before indexing again.

Still doesn't work?


Regards,
Stefan

Matthew Holt wrote:
 


Sorry to be asking so many questions.. Below is the current script I'm
using. It's indexing the segments.. so do I use invertlinks directly
after the fetch? I'm kind of confused.. thanks.
matt
   



[...]

 


---

Stefan Neufeind wrote:

   


You're missing the actual indexing of the pages :-) This is done inside the
"crawl" command, which does everything in one go. After you've fetched
everything, use:

nutch invertlinks ...
nutch index ...

Hope that helps. Otherwise let me know and I'll dig out the complete
command lines for you.


Regards,
Stefan

Matthew Holt wrote:


 


Just FYI: after I do the recrawl, I stop and start tomcat, and the newly
created page still cannot be found.

Matthew Holt wrote:

 
   


The recrawl worked this time, and I recrawled the entire db using the
-adddays argument (in my case ./recrawl crawl 10 31). However, it
didn't find a newly created page.

If I delete the database and do the initial crawl over again, the new
page is found. Any idea what I'm doing wrong or why it isn't finding
it?

Thanks!
Matt

Matthew Holt wrote:

   
 


Stefan,
Thanks a bunch! I see what you mean..
matt

Stefan Neufeind wrote:

 
   


Matthew Holt wrote:


   
 


Hi all,
I have already successfully indexed all the files on my domain only
(as
specified in the conf/crawl-urlfilter.txt file).

Now when I use the below script (./recrawl crawl 10 31) to
recrawl the
domain, it begins indexing pages off of my domain (such as
wikipedia,
etc). How do I prevent this? Thanks!

 
   


Hi Matt,

have a look at regex-urlfilter. "crawl" is special in some ways.
Actually it's "shortcut" for several steps. And it has a special
urlfilter-file. But if you do it in several steps that
urlfilter-file is
no longer used.
 



 



Re: Recrawling question

2006-06-06 Thread Matthew Holt
Sorry to be asking so many questions.. Below is the current script I'm 
using. It's indexing the segments.. so do I use invertlinks directly 
after the fetch? I'm kind of confused.. thanks.

matt

---
#!/bin/bash

# A simple script to run a Nutch re-crawl

if [ -n "$1" ]
then
 crawl_dir=$1
else
 echo "Usage: recrawl crawl_dir [depth] [adddays]"
 exit 1
fi

if [ -n "$2" ]
then
 depth=$2
else
 depth=5
fi

if [ -n "$3" ]
then
 adddays=$3
else
 adddays=0
fi

webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
 bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
 segment=`ls -d $segments_dir/* | tail -1`
 bin/nutch fetch $segment
 bin/nutch updatedb $webdb_dir $segment
done

# Update segments
mkdir tmp
bin/nutch updatesegs $webdb_dir $segments_dir tmp
rm -R tmp

# Index segments
for segment in `ls -d $segments_dir/* | tail -$depth`
do
 bin/nutch index $segment
done

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $segments_dir bogus

# Merge indexes
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir

---
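
Following up on Stefan's suggestion above about deleting the current index 
before indexing again: a minimal sketch of what that could look like in this 
script, reusing its own $index_dir variable (not part of the script above):

# Remove the previously merged index so the final merge step rebuilds it
# from the freshly indexed segments instead of choking on the stale one
rm -rf $index_dir

# Merge indexes
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir

After the merge, restarting Tomcat (as Matt already does) should pick up the 
rebuilt index.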

Stefan Neufeind wrote:


You're missing the actual indexing of the pages :-) This is done inside the
"crawl" command, which does everything in one go. After you've fetched
everything, use:

nutch invertlinks ...
nutch index ...

Hope that helps. Otherwise let me know and I'll dig out the complete
command lines for you.


Regards,
Stefan

Matthew Holt wrote:
 


Just FYI: after I do the recrawl, I stop and start tomcat, and the newly
created page still cannot be found.

Matthew Holt wrote:

   


The recrawl worked this time, and I recrawled the entire db using the
-adddays argument (in my case ./recrawl crawl 10 31). However, it
didn't find a newly created page.

If I delete the database and do the initial crawl over again, the new
page is found. Any idea what I'm doing wrong or why it isn't finding it?

Thanks!
Matt

Matthew Holt wrote:

     


Stefan,
Thanks a bunch! I see what you mean..
matt

Stefan Neufeind wrote:

   


Matthew Holt wrote:


 


Hi all,
I have already successfully indexed all the files on my domain only
(as
specified in the conf/crawl-urlfilter.txt file).

Now when I use the below script (./recrawl crawl 10 31) to recrawl the
domain, it begins indexing pages off of my domain (such as wikipedia,
etc). How do I prevent this? Thanks!
 
   



Hi Matt,

have a look at regex-urlfilter. "crawl" is special in some ways.
Actually it's "shortcut" for several steps. And it has a special
urlfilter-file. But if you do it in several steps that
urlfilter-file is
no longer used.
 



 



Re: Recrawling question

2006-06-06 Thread Matthew Holt
Just FYI: after I do the recrawl, I stop and start tomcat, and the newly 
created page still cannot be found.


Matthew Holt wrote:

The recrawl worked this time, and I recrawled the entire db using the 
-adddays argument (in my case ./recrawl crawl 10 31). However, it 
didn't find a newly created page.


If I delete the database and do the initial crawl over again, the new 
page is found. Any idea what I'm doing wrong or why it isn't finding it?


Thanks!
Matt

Matthew Holt wrote:


Stefan,
 Thanks a bunch! I see what you mean..
matt

Stefan Neufeind wrote:


Matthew Holt wrote:
 


Hi all,
 I have already successfully indexed all the files on my domain only 
(as

specified in the conf/crawl-urlfilter.txt file).

Now when I use the below script (./recrawl crawl 10 31) to recrawl the
domain, it begins indexing pages off of my domain (such as wikipedia,
etc). How do I prevent this? Thanks!
  




Hi Matt,

have a look at regex-urlfilter. "crawl" is special in some ways.
Actually it's "shortcut" for several steps. And it has a special
urlfilter-file. But if you do it in several steps that 
urlfilter-file is

no longer used.


Regards,
Stefan

 







Re: Recrawling question

2006-06-06 Thread Matthew Holt
The recrawl worked this time, and I recrawled the entire db using the 
-adddays argument (in my case ./recrawl crawl 10 31). However, it didn't 
find a newly created page.


If I delete the database and do the initial crawl over again, the new 
page is found. Any idea what I'm doing wrong or why it isn't finding it?


Thanks!
Matt

Matthew Holt wrote:


Stefan,
 Thanks a bunch! I see what you mean..
matt

Stefan Neufeind wrote:


Matthew Holt wrote:
 


Hi all,
 I have already successfully indexed all the files on my domain only (as
specified in the conf/crawl-urlfilter.txt file).

Now when I use the below script (./recrawl crawl 10 31) to recrawl the
domain, it begins indexing pages off of my domain (such as wikipedia,
etc). How do I prevent this? Thanks!
  



Hi Matt,

have a look at regex-urlfilter. "crawl" is special in some ways.
Actually it's "shortcut" for several steps. And it has a special
urlfilter-file. But if you do it in several steps that urlfilter-file is
no longer used.


Regards,
Stefan

 





Re: Recrawling question

2006-06-06 Thread Matthew Holt

Stefan,
 Thanks a bunch! I see what you mean..
matt

Stefan Neufeind wrote:


Matthew Holt wrote:
 


Hi all,
 I have already successfully indexed all the files on my domain only (as
specified in the conf/crawl-urlfilter.txt file).

Now when I use the below script (./recrawl crawl 10 31) to recrawl the
domain, it begins indexing pages off of my domain (such as wikipedia,
etc). How do I prevent this? Thanks!
   



Hi Matt,

have a look at regex-urlfilter. "crawl" is special in some ways.
Actually it's "shortcut" for several steps. And it has a special
urlfilter-file. But if you do it in several steps that urlfilter-file is
no longer used.


Regards,
Stefan

 



Recrawling question

2006-06-06 Thread Matthew Holt

Hi all,
  I have already successfully indexed all the files on my domain only 
(as specified in the conf/crawl-urlfilter.txt file).


Now when I use the below script (./recrawl crawl 10 31) to recrawl the 
domain, it begins indexing pages off of my domain (such as wikipedia, 
etc). How do I prevent this? Thanks!

Matt



#!/bin/bash

# A simple script to run a Nutch re-crawl

if [ -n "$1" ]
then
 crawl_dir=$1
else
 echo "Usage: recrawl crawl_dir [depth] [adddays]"
 exit 1
fi

if [ -n "$2" ]
then
 depth=$2
else
 depth=5
fi

if [ -n "$3" ]
then
 adddays=$3
else
 adddays=0
fi

webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
 bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
 segment=`ls -d $segments_dir/* | tail -1`
 bin/nutch fetch $segment
 bin/nutch updatedb $webdb_dir $segment
done

# Update segments
mkdir tmp
bin/nutch updatesegs $webdb_dir $segments_dir tmp
rm -R tmp

# Index segments
for segment in `ls -d $segments_dir/* | tail -$depth`
do
 bin/nutch index $segment
done

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $segments_dir bogus

# Merge indexes
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
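
For reference, the invocation mentioned elsewhere in the thread maps onto 
this script's arguments as follows:

# crawl_dir=crawl, depth=10 generate/fetch/update rounds, adddays=31
./recrawl crawl 10 31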



Re: Intranet Crawling

2006-06-05 Thread Matthew Holt
OK, thanks. As far as crawling the entire subdomain goes, what exact command 
would I use?


Since depth says how many pages deep to go, is there any way to hit 
every single page without specifying depth? Or should I just say 
depth=10? Also, topN is no longer used, correct?


Stefan Neufeind wrote:


Matthew Holt wrote:
 


Question,
  I'm trying to index a subdomain of my intranet. How do I make it
index the entire subdomain, but not index any pages off of the
subdomain? Thanks!
   



Have a look at crawl-urlfilter.txt in the conf/ directory.

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.


Regards,
Stefan
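
On the depth question: depth just bounds how many generate/fetch/update 
rounds the crawl runs, so the usual approach is to pick a depth at least as 
large as the longest click path from the root URL rather than look for a 
"fetch everything" switch. A sketch of the one-shot invocation, with the 
values taken from the crawl log earlier in this archive (treat the exact 
flags as illustrative rather than authoritative):

# root URL file, crawl directory, 10 rounds, 10 fetcher threads
bin/nutch crawl urls/redhat -dir crawl -depth 10 -threads 10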

 



Intranet Crawling

2006-06-05 Thread Matthew Holt

Question,
   I'm trying to index a subdomain of my intranet. How do I make it 
index the entire subdomain, but not index any pages off of the 
subdomain? Thanks!

Matt


Re: Intranet Crawl Demo

2006-06-05 Thread Matthew Holt
Thanks for the help. It turns out the local Tomcat installation I was 
using was configured differently. I installed a standard config of 
Tomcat and everything works great.


Matt

Sami Siren wrote:


I am confused,

you state that you go to the url https://localhost:8080/ and see the default 
apache installation page

- the default tomcat page, that is?

and then you try http://localhost/nutch-0.7.2/ (notice that's http and 
port 80)


By default tomcat listens on port 8080 with the http protocol; did you 
configure it to do
something else?

If you have the access log enabled in tomcat you could check that the 
requests are in fact
reaching tomcat and not something else.

reaching the tomcat and not something else.

--
Sami Siren

Matthew Holt wrote:

I checked and the war is unpacked.. I even went into the nutch-0.7.2 
directory that was unpacked by tomcat and created a test index.html 
file to see if that worked. I then went to 
https://localhost/nutch-0.7.2/index.html and the error I got was: 
"The requested URL /nutch-0.7.2/index.html was not found on this 
server."


any ideas?
thanks.
matt

TDLN wrote:


I cannot see any other likely cause than that you did not configure
Tomcat to unpack WARs.

Rgrds. Thomas

On 6/5/06, Matthew Holt <[EMAIL PROTECTED]> wrote:


Hi all,
  Just attempting to install a demo Intranet crawl on my local 
machine.

I followed the tutorial directions step by step and ran the crawl.

I then followed the directions in the "Searching section", stopped
tomcat, installed the war file as 'ROOT.war' in the webapps directory
(both manually and by using the tomcat web manager application). I 
then

started tomcat from the current nutch install directory and tried
navigating to https://localhost:8080/ (I get the default apache
installation page).

That didn't work, so I uploaded another copy of the war file as
nutch-0.7.2.war and tried navigating to http://localhost/nutch-0.7.2/
and the error I received was "The requested URL /nutch-0.7.2 was not
found on this server."

Any suggestions? Thanks!

Matt
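
To double-check the two points Sami raises above (Tomcat answers plain http 
on port 8080, and the WAR really was unpacked), something along these lines 
may help; CATALINA_HOME and the nutch-0.7.2 path are assumptions here, not 
details from the thread:

$CATALINA_HOME/bin/shutdown.sh
rm -rf $CATALINA_HOME/webapps/ROOT $CATALINA_HOME/webapps/ROOT.war   # clear any old root app
cp nutch-0.7.2.war $CATALINA_HOME/webapps/ROOT.war
cd /path/to/nutch-0.7.2            # start from the directory that holds the crawl/ index
$CATALINA_HOME/bin/startup.sh
curl -I http://localhost:8080/     # plain http on port 8080, not https; expect the Nutch search page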







Re: Intranet Crawl Demo

2006-06-05 Thread Matthew Holt
I checked and the war is unpacked.. I even went into the nutch-0.7.2 
directory that was unpacked by tomcat and created a test index.html file 
to see if that worked. I then went to 
https://localhost/nutch-0.7.2/index.html and the error I got was: "The 
requested URL /nutch-0.7.2/index.html was not found on this server."


any ideas?
thanks.
matt

TDLN wrote:


I cannot see any other likely cause than that you did not configure
Tomcat to unpack WARs.

Rgrds. Thomas

On 6/5/06, Matthew Holt <[EMAIL PROTECTED]> wrote:


Hi all,
  Just attempting to install a demo Intranet crawl on my local machine.
I followed the tutorial directions step by step and ran the crawl.

I then followed the directions in the "Searching section", stopped
tomcat, installed the war file as 'ROOT.war' in the webapps directory
(both manually and by using the tomcat web manager application). I then
started tomcat from the current nutch install directory and tried
navigating to https://localhost:8080/ (I get the default apache
installation page).

That didn't work, so I uploaded another copy of the war file as
nutch-0.7.2.war and tried navigating to http://localhost/nutch-0.7.2/
and the error I received was "The requested URL /nutch-0.7.2 was not
found on this server."

Any suggestions? Thanks!

Matt



Intranet Crawl Demo

2006-06-05 Thread Matthew Holt

Hi all,
 Just attempting to install a demo Intranet crawl on my local machine. 
I followed the tutorial directions step by step and ran the crawl.


I then followed the directions in the "Searching section", stopped 
tomcat, installed the war file as 'ROOT.war' in the webapps directory 
(both manually and by using the tomcat web manager application). I then 
started tomcat from the current nutch install directory and tried 
navigating to https://localhost:8080/ (I get the default apache 
installation page).


That didn't work, so I uploaded another copy of the war file as 
nutch-0.7.2.war and tried navigating to http://localhost/nutch-0.7.2/ 
and the error I received was "The requested URL /nutch-0.7.2 was not 
found on this server."


Any suggestions? Thanks!

Matt