[jira] Commented: (NUTCH-266) hadoop bug when doing updatedb

2006-08-08 Thread Renaud Richardet (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-266?page=comments#action_12426579 ] 

Renaud Richardet commented on NUTCH-266:


KuroSaka, yes, you can download the Hadoop jar (release 0.5.0) from the project 
website: http://lucene.apache.org/hadoop/ and 
http://www.apache.org/dyn/closer.cgi/lucene/hadoop/

 hadoop bug when doing updatedb
 --

 Key: NUTCH-266
 URL: http://issues.apache.org/jira/browse/NUTCH-266
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
 Environment: windows xp, JDK 1.4.2_04
Reporter: Eugen Kochuev
 Fix For: 0.9.0, 0.8.1

 Attachments: patch.diff, patch_hadoop-0.5.0.diff


 I constantly get the following error message
 060508 230637 Running job: job_pbhn3t
 060508 230637 
 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
 060508 230637 
 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
 060508 230637 
 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
 060508 230637 job_pbhn3t
 java.io.IOException: Target 
 /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
 at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
 at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
 at 
 org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
 at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
 Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
 at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)





[jira] Commented: (NUTCH-330) command line tool to search a Lucene index

2006-08-08 Thread Renaud Richardet (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-330?page=comments#action_12426629 ] 

Renaud Richardet commented on NUTCH-330:


This issue is obsolete; I just found out that Nutch already lets you search from 
the command line via
bin/nutch org.apache.nutch.searcher.NutchBean [searchterm]. It assumes that you 
call it from the base of your crawl directory.
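
For anyone who would rather call it from Java code, a rough sketch along these 
lines should also work (this is from memory of the 0.8 searcher API, so the exact 
signatures of Query.parse, NutchBean.search, etc. may be slightly off, and the 
SimpleSearch class name is just for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class SimpleSearch {
  public static void main(String[] args) throws IOException {
    // Same assumption as above: run this from the directory that contains your crawl.
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);
    Query query = Query.parse(args[0], conf);
    Hits hits = bean.search(query, 10);          // top 10 hits
    for (int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      HitDetails details = bean.getDetails(hit);
      System.out.println(details.getValue("url"));
    }
  }
}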


 command line tool to search a Lucene index
 --

 Key: NUTCH-330
 URL: http://issues.apache.org/jira/browse/NUTCH-330
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.8
 Environment: ubuntu
Reporter: Renaud Richardet
Priority: Minor
 Attachments: clSearch.diff, clSearch.diff


 Tool that allows searching a Lucene index from the command line; makes 
 development and testing faster
 usage:   bin/nutch searchindex [index dir] [searchkeyword]
 example: bin/nutch searchindex crawl/index flowers





[jira] Resolved: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

2006-08-08 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]

Sami Siren resolved NUTCH-344.
--

    Fix Version/s: 0.8.1
                   0.9.0
        Resolution: Fixed

I just committed this to 0.8 branch and trunk, thanks Greg!

 Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
 -

 Key: NUTCH-344
 URL: http://issues.apache.org/jira/browse/NUTCH-344
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.9.0, 0.8.1
 Environment: All
Reporter: Greg Kim
 Fix For: 0.8.1, 0.9.0

 Attachments: cleanExpiredServerBlocks.patch


 The recent change to the following code in HttpBase.java tends to block 
 fetcher threads while one thread busy-waits... 
   private static void cleanExpiredServerBlocks() {
     synchronized (BLOCKED_ADDR_TO_TIME) {
       while (!BLOCKED_ADDR_QUEUE.isEmpty()) {   // <== LINE 3
         String host = (String) BLOCKED_ADDR_QUEUE.getLast();
         long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
         if (time <= System.currentTimeMillis()) {
           BLOCKED_ADDR_TO_TIME.remove(host);
           BLOCKED_ADDR_QUEUE.removeLast();
         }
       }
     }
   }
 LINE 3: As long as there are *any* entries in the BLOCKED_ADDR_QUEUE, the 
 thread that first enters this block busy-waits until the queue becomes empty, 
 while all other threads block on the synchronized block.  This leads to 
 extremely poor fetcher performance.  
 Since the check-in to respect crawlDelay in robots.txt, we are no longer 
 guaranteed that the BLOCKED_ADDR_TO_TIME queue is a FIFO list. The simple fix is 
 to iterate the queue once rather than busy-waiting...
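
 A rough sketch of that single-pass idea (not necessarily what the attached 
 cleanExpiredServerBlocks.patch actually does) could look like this, assuming 
 java.util.Iterator is imported in HttpBase:

   private static void cleanExpiredServerBlocks() {
     long now = System.currentTimeMillis();
     synchronized (BLOCKED_ADDR_TO_TIME) {
       // Walk the queue once and drop only the entries whose block has
       // expired, instead of spinning until the queue is completely empty.
       for (Iterator i = BLOCKED_ADDR_QUEUE.iterator(); i.hasNext();) {
         String host = (String) i.next();
         long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
         if (time <= now) {
           BLOCKED_ADDR_TO_TIME.remove(host);
           i.remove();
         }
       }
     }
   }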





[jira] Resolved: (NUTCH-266) hadoop bug when doing updatedb

2006-08-08 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-266?page=all ]

Sami Siren resolved NUTCH-266.
--

Resolution: Fixed

I just updated the Hadoop versions: trunk now contains 0.5.0, and the 0.8 branch 
contains a patched 0.4.0.

 hadoop bug when doing updatedb
 --

 Key: NUTCH-266
 URL: http://issues.apache.org/jira/browse/NUTCH-266
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
 Environment: windows xp, JDK 1.4.2_04
Reporter: Eugen Kochuev
 Fix For: 0.8.1, 0.9.0

 Attachments: patch.diff, patch_hadoop-0.5.0.diff


 I constantly get the following error message
 060508 230637 Running job: job_pbhn3t
 060508 230637 
 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
 060508 230637 
 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
 060508 230637 
 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
 060508 230637 job_pbhn3t
 java.io.IOException: Target 
 /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
 at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
 at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
 at 
 org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
 at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
 Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
 at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)





Re: [Fwd: Re: 0.8 Recrawl script updated]

2006-08-08 Thread Matthew Holt
Since it wasn't really clear whether my script handled the problem of 
deleting segments correctly, I refactored it so it generates the new 
segments, merges them into one, then deletes the newly generated 
segments. Not as efficient disk-space-wise, but it still removes a large 
number of the segments that are not referenced by anything because they 
have not been indexed yet.


I updated the wiki again. Unless any more clarification is needed regarding 
the issue, hopefully I won't have to bombard your inboxes with any more 
emails about this.


Matt

Lukas Vlcek wrote:

Hi again,

I just found a related discussion here:
http://www.nabble.com/NullPointException-tf2045994r1.html

I think these guys are discussing a similar problem, and if I understood
the conclusion correctly, the only solution right now is to write
some code to test which segments are used in the index and which are not.

Regards,
Lukas

On 8/4/06, Lukas Vlcek [EMAIL PROTECTED] wrote:

Matthew,

In fact I didn't realize you were doing the merge step (sorry for that),
but frankly I don't know exactly how merging works, whether this
strategy would work over the long term, and whether it is a
universal approach across all the variability of cases that may occur during
crawling (-topN, threads frozen, pages unavailable, crawling dies, ...
etc.); maybe it is the correct path. I would appreciate it if anybody could
answer this question precisely.

Thanks,
Lukas

On 8/4/06, Matthew Holt [EMAIL PROTECTED] wrote:
 If anyone doesn't mind taking a look...



 -- Forwarded message --
 From: Matthew Holt [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Date: Fri, 04 Aug 2006 10:07:57 -0400
 Subject: Re: 0.8 Recrawl script updated
 Lukas,
 Thanks for your e-mail. I assumed I could drop the $depth oldest
  segments because I first merged them all into one segment (which
  I don't drop). Am I incorrect in my assumption, and can this cause
  problems in the future? If so, I'll go back to the original version
  of my script, which kept all the segments without merging. However, it
  just seemed like, if that were the case, it would become a problem after
  enough recrawls due to the large number of segments being kept.

  Thanks,
   Matt

 Lukas Vlcek wrote:
  Hi Matthew,
 
  I am curious about one thing. How do you know you can just drop the $depth
  oldest segments at the end? I haven't studied the Nutch
  code regarding this topic yet, but I thought that a segment can be
  dropped only once you are sure that all its content is already crawled in
  some newer segment (which should be checked somehow via some
  function/script - which hasn't been implemented yet, to my knowledge).

 
  Also, I don't think this question has been discussed on the dev/user lists
  in detail yet, so I just wanted to ask your opinion. The
  situation could get even more complicated if people add the -topN
  parameter to the script (which can happen because some might prefer
  crawling in ten smaller batches over two huge crawls due to various
  technical reasons).
 
  Anyway, never mind if you don't want to bother with my silly question :-)
 
  Regards,
  Lukas
 
  On 8/4/06, Matthew Holt [EMAIL PROTECTED] wrote:
  Last email regarding this script. I found a bug in it that is sporadic
  (I think it only affected some setups). However, since it would be
  a problem sometimes, I refactored the script. I'd suggest you re-download
  the script if you are using it.
 
  Matt
 
  Matthew Holt wrote:
   I'm currently pretty busy at work. If I have time I'll do it later.
  
   The 0.8 recrawl script has a working version online now. I
   temporarily modified it on the website yesterday when I ran into some
   problems, but I have since tested it further and the actual working code
   is posted now. So if you got it off the web site any time yesterday, I
   would re-download the script.
  
   Matt
  
   Lourival Júnior wrote:
   Hi Matthew!
  
    Could you update the script to version 0.7.2 with the same
    functionality? I wrote a script that does this, but it doesn't work
    very well...
  
   Regards!
  
   On 8/2/06, Matthew Holt [EMAIL PROTECTED] wrote:
  
    Just letting everyone know that I updated the recrawl script on the
    Wiki. It now merges the created segments then deletes the old segs to
    prevent a lot of unneeded data remaining/growing on the hard drive.

 Matt

http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03








parse-oo plugin

2006-08-08 Thread Matthew Holt

Hey there,
 Hope all has been going well for you. I noticed a small issue with the 
parse-oo plugin. It parses the documents correctly; however, when you 
find an OpenOffice document in the results and click "cached", it returns 
a NullPointerException. I looked into it, and the line in 
cached.jsp that throws the NPE is below:


String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);

So apparently the parse-oo plugin does not store the CONTENT_TYPE of the 
document. I looked around line 100 of the plugin and changed:


   Outlink[] links = (Outlink[]) outlinks.toArray(new Outlink[outlinks.size()]);
   ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
       title, links, metadata);

   return new ParseImpl(text, parseData);

to:

   Outlink[] links = (Outlink[]) outlinks.toArray(new Outlink[outlinks.size()]);
   ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
       title, links, content.getMetadata(), metadata);

   parseData.setConf(this.conf);
   return new ParseImpl(text, parseData);

This fixes the problem of cached.jsp throwing an exception, but now it 
displays every document type as either [octet-stream] or [oleobject].


So it seems as if it's not interpreting the MIME types correctly. Do you 
know how to fix both the cached.jsp issue and the MIME-type issue 
at the same time?
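
The only untested idea I have so far is to also copy the detected content type 
into the parse metadata before building the ParseData, along these lines (just a 
sketch; I'm assuming content.getContentType() returns the detected MIME type 
here, and I'm not sure it would fix the [octet-stream] display either):

   // Untested sketch: record the content type explicitly so that
   // cached.jsp can read Metadata.CONTENT_TYPE from the parse metadata.
   metadata.set(Metadata.CONTENT_TYPE, content.getContentType());

   Outlink[] links = (Outlink[]) outlinks.toArray(new Outlink[outlinks.size()]);
   ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
       title, links, content.getMetadata(), metadata);

   parseData.setConf(this.conf);
   return new ParseImpl(text, parseData);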

Thanks,
 Matt