[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453820 ] 

Andrzej Bialecki  commented on NUTCH-339:
-

This looks weird; if anything, it seems to be caused by a bug in Hadoop. Are 
you able to run readseg -dump on this fetchlist?

Another idea: do you have any lease expired messages in your log around that 
time? It looks as though the underlying input stream may have been closed.

 Refactor nutch to allow fetcher improvements
 

 Key: NUTCH-339
 URL: http://issues.apache.org/jira/browse/NUTCH-339
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 0.8
 Environment: n/a
Reporter: Sami Siren
 Assigned To: Andrzej Bialecki 
 Fix For: 0.9.0

 Attachments: patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, 
 patch4-trunk.txt


 As I (and Stefan?) see it, there are two major areas in which the current
 fetcher could be improved (as in speed):
 1. The politeness code and how it is implemented is the biggest problem of
 the current fetcher (together with robots.txt handling). A simple code
 change, such as replacing it with a PriorityQueue-based solution (see the
 sketch below), showed very promising results in increased IO.
 2. Changing the fetcher to use non-blocking IO (this requires a great
 amount of work, as we would need to implement the protocols from scratch
 again).
 I would like to start working towards #1 by first refactoring the current
 code (the plugins, actually) in the following way:
 1. Move robots.txt handling out of the (lib-http) plugin. Even though this
 is relevant only to HTTP, leaving it in lib-http does not allow other kinds
 of scheduling strategies to be implemented (it is hardcoded to fetch
 robots.txt from the same thread when requesting a page from a site from
 which it hasn't yet tried to load robots.txt).
 2. Move the politeness code out of the (lib-http) plugin. It is really
 usable outside HTTP, and the current design also limits changing the
 implementation (to a queue-based one).
 As for where to move these: my suggestion is the nutch core - does anybody
 see problems with this?
 These refactoring activities are to be done in a way that none of the
 current functionality is (at least deliberately) changed, leaving current
 functionality as is and thus leaving room to build the next-generation
 fetcher(s) without destroying the old one at the same time.
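
For illustration only, here is a minimal sketch of the kind of
PriorityQueue-based politeness scheduling hinted at in point 1 above. All
names are hypothetical and this is not the actual Nutch fetcher code;
per-host entries are simply ordered by the earliest time each host may
politely be contacted again.

import java.util.PriorityQueue;

/** Hypothetical sketch of PriorityQueue-based politeness scheduling;
 *  not the actual Nutch fetcher code. */
public class PoliteScheduler {

    static class HostSlot implements Comparable<HostSlot> {
        final String host;
        long nextFetchTime; // epoch millis when this host is polite again

        HostSlot(String host, long nextFetchTime) {
            this.host = host;
            this.nextFetchTime = nextFetchTime;
        }

        public int compareTo(HostSlot other) {
            if (nextFetchTime < other.nextFetchTime) return -1;
            if (nextFetchTime > other.nextFetchTime) return 1;
            return 0;
        }
    }

    private final PriorityQueue<HostSlot> queue = new PriorityQueue<HostSlot>();
    private final long crawlDelayMs;

    public PoliteScheduler(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    /** Registers a host as immediately fetchable. */
    public synchronized void addHost(String host) {
        queue.add(new HostSlot(host, System.currentTimeMillis()));
    }

    /** Returns a host that may be fetched now, or null if none is due;
     *  the returned host is re-queued with the politeness delay applied. */
    public synchronized String nextHost() {
        HostSlot slot = queue.peek();
        if (slot == null || slot.nextFetchTime > System.currentTimeMillis()) {
            return null; // nothing is polite to fetch yet
        }
        queue.poll();
        slot.nextFetchTime = System.currentTimeMillis() + crawlDelayMs;
        queue.add(slot);
        return slot.host;
    }
}

A fetcher thread would call nextHost() in a loop and fetch one URL for
whichever host comes back, so no host is hit more often than the crawl
delay allows.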

-- 
This message is automatically generated by JIRA.
If you think it was sent incorrectly, contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-11-28 Thread Sean Dean (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12453919 ] 

Sean Dean commented on NUTCH-233:
-

Could I suggest that this change, from .*(/.+?)/.*?\1/.*?\1/ to 
.*(/[^/]+)/[^/]+\1/[^/]+\1/, be committed to at least trunk for the time being?

I recently created a segment with exactly 1M URLs. I ran the fetch, and it did 
indeed stall in the reduce part of the operation because of the regex filter. 
This was verified with a thread dump (kill -3 pid) on FreeBSD.

I then made the suggested change in the config file and re-fetched the exact 
same segment. It completed without issue.

I'm aware we might lose some filtering functionality with this new expression, 
but is that not better than knowing there is always a chance your whole-web 
crawl fetch will fail because of this?

 wrong regular expression hang reduce process for ever
 -

 Key: NUTCH-233
 URL: http://issues.apache.org/jira/browse/NUTCH-233
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.9.0


 Looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt
 isn't compatible with java.util.regex, which is what the regex URL filter
 actually uses.
 Maybe changing it was missed when the regular expression package was
 changed.
 The problem was that, while reducing a fetch map output, the reducer hung
 forever, because the output format was applying the URL filter to a URL
 that causes the hang:
 060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335)
 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java:
 I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the
 fetch job works. (Thanks to Grant and Chris B. for helping to find the new
 regex.) However, people should review it and suggest improvements. The old
 regex would match:
 abcd/foo/bar/foo/bar/foo/ and so will the new one. But the old regex would
 also match:
 abcd/foo/bar/xyz/foo/bar/foo/ which the new regex will not match.
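
For illustration, here is a small self-contained check of both expressions
against the two example paths above, using java.util.regex as the filter
does (find() semantics are an assumption about how the filter applies its
patterns):

import java.util.regex.Pattern;

/** Compares the old and new loop-detection expressions from
 *  regex-urlfilter.txt on the two example paths quoted above. */
public class LoopRegexCheck {
    public static void main(String[] args) {
        Pattern oldPat = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
        Pattern newPat = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

        String direct = "abcd/foo/bar/foo/bar/foo/";
        String spread = "abcd/foo/bar/xyz/foo/bar/foo/";

        System.out.println(oldPat.matcher(direct).find());  // true
        System.out.println(newPat.matcher(direct).find());  // true
        System.out.println(oldPat.matcher(spread).find());  // true
        // false: [^/]+ cannot stretch across the intervening xyz segment,
        // which also bounds the backtracking that caused the hang.
        System.out.println(newPat.matcher(spread).find());
    }
}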





updating index without refetching

2006-11-28 Thread DS jha

Hi All,

Is it possible to update the index without refetching everything?  I have
changed the logic of one of my plugins (which also sets a custom field in the
index), and I would like this field to get updated without refetching
everything - is it doable?


Thanks,


RE: updating index without refetching

2006-11-28 Thread Gal Nitzan
Hi,

You do not mention whether the new field's data is stored as metadata. Is the
value created during parse, or is it added only during the index phase?

If your new field is created during the parse process, then you could delete
only the parse folders (segment/crawl_parse, segment/parse_data,
segment/parse_text) and re-run the parse process: bin/nutch parse segment

Or, if your field data is added during the index process, then re-create your
index.

In any case it doesn't seem to me that you would need to re-fetch.

HTH

Gal





[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

2006-11-28 Thread Alan Tanaman (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453932 ] 

Alan Tanaman commented on NUTCH-407:


In our team we feel that this patch would have been beneficial in practical 
terms.  In the context of the enterprise intelligence solution which we are 
gradually porting over to Nutch, the emphasis is on ease of configuration.  We 
try to avoid exposing features such as the regex filter, which, although very 
powerful for a more experienced user, are perhaps confusing to the novice.  
This is because we are primarily focused on the enterprise and less on the WWW.

This is why we preconfigure the db.ignore.external.links property to true, 
and then only the urls file is used to seed the crawl.

Our ideal is to have a collection of predefined configuration settings for 
specific scenarios -- e.g. Enterprise-XML, Enterprise-Documents, 
Enterprise-Database, Internet-News etc.  We have a script that generates 
multiple crawlers, each one with different sources to be crawled, and although 
possible, it isn't the most practical to change the filters for each one 
manually based on the individual user requirements.

I realise this patch is closed, but how about another approach, in which 
FileResponse.java looks at db.ignore.external.links and decides based on that 
whether to go up the tree?  A rough sketch of the idea follows below.
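
A rough sketch of where such a check could sit (a hypothetical helper, not
the actual FileResponse code; Configuration is Hadoop's):

import java.io.File;

import org.apache.hadoop.conf.Configuration;

/** Hypothetical helper illustrating the suggestion above: reuse
 *  db.ignore.external.links to decide whether a file: directory
 *  listing should link to its parent. Not the actual Nutch code. */
public class ParentLinkPolicy {
    public static boolean includeParentLink(Configuration conf, File dir) {
        boolean ignoreExternal =
            conf.getBoolean("db.ignore.external.links", false);
        // Treat the parent directory like an external link: skip it
        // whenever external links are being ignored.
        return !ignoreExternal && dir.getParentFile() != null;
    }
}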

Obviously, this would also prevent you from crawling outlinks to the WWW 
embedded in documents, but when crawling an enterprise file system, you usually 
don't want to go all over the place anyway.  As I see it, file systems are 
different from the web in that they are inherently hierarchical, whereas the 
web is, as its name implies, non-hierarchical.  Therefore, when crawling a 
file system, going up the tree is just as much an external URI (so to speak) 
as a link to a web site.

*Ducks for cover*

Alan

 Make Nutch crawling parent directories for file protocol configurable
 -

 Key: NUTCH-407
 URL: http://issues.apache.org/jira/browse/NUTCH-407
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Thorsten Scherler
 Assigned To: Andrzej Bialecki 
 Attachments: 407.fix.diff


 http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06698.html
 I am looking into fixing some very weird behavior of the file protocol.
 I am using 0.8.
 Researching this topic I found 
 http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06536.html
 and
 http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
 I am on Ubuntu, but I have the same problem: nutch is going down the
 tree (including parents) and not up (including children from the root
 url).
 Further, I would vote to make the fetching of parents optional, controlled
 by a property, since it is not a very intuitive feature.





[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

2006-11-28 Thread Chris A. Mattmann (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453934 ] 

Chris A. Mattmann commented on NUTCH-407:
-

I'm not entirely sure what the right answer to this is. One thing that I do 
know is that a colleague at my own work ran into this exact same issue while 
first attempting to use Nutch in his enterprise search application. It 
confused the heck out of him, and he ended up including in urlfilter-regex 
what Andrzej mentions above, i.e., only crawling from the top level down. He 
mentioned to me that he thought this was a kludge, and I can't say that I 
disagreed with him. My +1 for figuring out a better way to solve this 
problem...





Re: updating index without refetching

2006-11-28 Thread DS jha

The new field's data is also stored as metadata - the value is assigned
during the parse process, and then during indexing the metadata field value
is read and added to the index. Looks like I will have to run parse and index
again.

Thanks much.






[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Sami Siren (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453975 ] 

Sami Siren commented on NUTCH-339:
--

Perhaps that exception is just a consequence of something else, like this:

2006-11-27 07:35:09,434 INFO  fetcher.Fetcher2 - -activeThreads=296,
spinWaiting=204, fetchQueues.totalSize=0
2006-11-27 07:35:09,434 WARN  fetcher.Fetcher2 - Aborting with 296 hung
threads.
2006-11-27 07:35:09,434 INFO  mapred.LocalJobRunner - 3821 pages, 207
errors, 5.5 pages/s, 780 kb/s,

and the next log entry is:

2006-11-27 07:35:15,443 INFO  mapred.JobClient -  map 100% reduce 0%








[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Sami Siren (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12454045 ] 

Sami Siren commented on NUTCH-339:
--

I am running with 300 threads, in parsing mode.

The thread dump shows:

191 threads waiting on condition
at java.lang.Thread.sleep(Native Method)
at 
org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:422)

71 waiting for monitor entry
at 
org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.getFetchItem(Fetcher2.java:306)
- waiting to lock 0x52fa7328 (a 
org.apache.nutch.fetcher.Fetcher2$FetchItemQueues)
at 
org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:415)

the rest are runnable

CPU usage starts low, but it ramps up very quickly and the machine becomes
almost unresponsive.

Fetching speed is low because all the CPU goes to something else.
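
Schematically, the dump looks consistent with a pattern like the following
toy reconstruction (not the actual Fetcher2 code): every thread funnels
through a single synchronized method, and threads that come away empty sleep
and retry, so with 300 threads the lock itself can become the hot spot.

/** Toy reconstruction of the contention pattern in the thread dump;
 *  not the actual Fetcher2 implementation. */
public class QueueContentionDemo {

    static class FetchItemQueues {
        // One coarse lock: all fetcher threads serialize here.
        public synchronized Object getFetchItem() {
            return null; // pretend the queues are empty (totalSize=0)
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final FetchItemQueues queues = new FetchItemQueues();
        for (int i = 0; i < 300; i++) {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    while (true) {
                        Object item = queues.getFetchItem(); // "waiting to lock"
                        if (item == null) {
                            try {
                                Thread.sleep(500); // "waiting on condition"
                            } catch (InterruptedException e) {
                                return;
                            }
                        }
                    }
                }
            });
            t.setDaemon(true); // let the JVM exit when main returns
            t.start();
        }
        Thread.sleep(5000); // take a dump (kill -3) while this runs
    }
}

A dump of this toy program shows the same two states as above: most threads
sleeping ("waiting on condition") and the rest blocked on the queue's
monitor ("waiting for monitor entry").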






Indexing and Re-crawling site

2006-11-28 Thread Armel T. Nene
Hi guys,

 

I have a few questions regarding the way nutch indexes and the best way a
recrawl can be implemented.

1.  Why does nutch have to create a new index every time it indexes, when it
could just merge with the old existing index? I tried changing the value in
the IndexMerger class to 'false' when creating an index, so that Lucene
doesn't recreate a new index each time it indexes. The problem with this is
that I keep getting exceptions when it tries to merge the indexes: a lock
timeout exception is thrown by the IndexMerger, and consequently the index
doesn't get created properly. Is it possible to let nutch index by merging
with an existing index? I have to crawl about 100GB of data, and if only a
few documents have changed, I don't want nutch to recreate a new index
because of that, but rather update the existing index by merging it with the
new one. I need some light on this.
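
For what it's worth, here is a minimal sketch of appending freshly built
indexes to an existing one with the Lucene API of that era (class names and
signatures from memory; the paths are made up). Note that write.lock on the
target index must be free, otherwise you get exactly the lock timeout
described above.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/** Sketch: merge newly built indexes into an existing one instead of
 *  recreating it from scratch. Paths are hypothetical. */
public class IncrementalMerge {
    public static void main(String[] args) throws Exception {
        // create = false: open the existing index for appending.
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("crawl/index", false),
            new StandardAnalyzer(), false);
        Directory[] fresh = new Directory[] {
            FSDirectory.getDirectory("crawl/indexes/part-00000", false)
        };
        writer.addIndexes(fresh); // merges the new segments in place
        writer.close();           // releases write.lock; a stale lock is
                                  // what produces the lock timeout
    }
}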

 

2.  What is the best way to make nutch re-crawl? I have implemented a class
that loops the crawl process; it has a crawl interval, which is set in a
property file, and a running status. The running status is a boolean
variable, set to true while the re-crawl process is ongoing and false when it
should stop. But with this approach, it seems that the index is not being
fully generated: the values in the index cannot be queried. The re-crawl is
written in Java and calls an underlying Ant script to run nutch. I know most
re-crawls are written as batch scripts, but which would you recommend: a
batch script or a loop-based Java program?

 

3.  What is the best way of implementing nutch as a Windows service or a
Unix daemon?

 

Thanks,

 

Armel



Re: implement thai language indexing and search

2006-11-28 Thread Jérôme Charron

i used an existing ThaiAnalyzer which was in the lucene package.
ok - i renamed lucene.analysis.th.* to nutch.analysis.th.*, compiled, and
placed all class files in a jar - analysis-th.jar (do i need to bundle the
ngp file in the jar as well?)


1. You don't have to refactor the lucene analyzer. Just wrap it like I do
with the french and german analyzers (they both use some analyzers from
lucene) - a minimal sketch follows below.
2. The analyzer doesn't need NGP files... I think you misunderstood something:
2.1 On one side there is the language identifier, which uses NGP files to
identify the language of a document.
2.2 On the other side, if a suitable analyzer is found for the identified
language, it is used to analyze the document.
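
For illustration, a minimal sketch of such a wrapper in the style of the
french analyzer plugin (package and class names here are assumptions, and it
is assumed that NutchAnalyzer only requires tokenStream to be implemented,
as the french wrapper suggests):

package org.apache.nutch.analysis.th;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.nutch.analysis.NutchAnalyzer;

/** Hypothetical Nutch analyzer plugin that simply delegates to
 *  Lucene's contrib ThaiAnalyzer, in the style of analysis-fr. */
public class ThaiAnalyzer extends NutchAnalyzer {

    private static final Analyzer ANALYZER =
        new org.apache.lucene.analysis.th.ThaiAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return ANALYZER.tokenStream(fieldName, reader);
    }
}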

Regards

Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/