[jira] Commented: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-21 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-354?page=comments#action_12429496 ] 

Stefan Groschupf commented on NUTCH-354:


Since this issue is already closed I cannot attach the patch file, so I am 
attaching it as text within this comment.
If you need the file, let me know and I will send you an off-list mail. 


Index: src/test/org/apache/nutch/crawl/TestMapWritable.java
===================================================================
--- src/test/org/apache/nutch/crawl/TestMapWritable.java (revision 432325)
+++ src/test/org/apache/nutch/crawl/TestMapWritable.java (working copy)
@@ -180,6 +180,31 @@
     assertEquals(before, after);
   }
 
+  public void testRecycling() throws Exception {
+    UTF8 value = new UTF8("value");
+    UTF8 key1 = new UTF8("a");
+    UTF8 key2 = new UTF8("b");
+
+    MapWritable writable = new MapWritable();
+    writable.put(key1, value);
+    assertEquals(writable.get(key1), value);
+    assertNull(writable.get(key2));
+
+    DataOutputBuffer dob = new DataOutputBuffer();
+    writable.write(dob);
+    writable.clear();
+    writable.put(key1, value);
+    writable.put(key2, value);
+    assertEquals(writable.get(key1), value);
+    assertEquals(writable.get(key2), value);
+
+    DataInputBuffer dib = new DataInputBuffer();
+    dib.reset(dob.getData(), dob.getLength());
+    writable.readFields(dib);
+    assertEquals(writable.get(key1), value);
+    assertNull(writable.get(key2));
+  }
+
   public static void main(String[] args) throws Exception {
     TestMapWritable writable = new TestMapWritable();
     writable.testPerformance();


 MapWritable,  nextEntry is not reset when Entries are recycled
 --

 Key: NUTCH-354
 URL: http://issues.apache.org/jira/browse/NUTCH-354
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.9.0, 0.8.1

 Attachments: resetNextEntryInMapWritableV1.patch


 MapWritable recycles entries from its internal linked list for performance 
 reasons. The nextEntry of an entry is not reset when a recyclable entry 
 is found. This can cause wrong data in a MapWritable. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

2006-08-21 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429534 ] 

Stefan Groschupf commented on NUTCH-356:


Hi Enrico, 
there will be as many PluginRepositories as Configuration objects. 
So if you create many Configuration objects, you will run into memory 
problems. 
There is no way around having a singleton PluginRepository. However, you can 
reset the PluginRepository by removing the cached object from the 
Configuration object. 
In any case, not caching the PluginRepository is a bad idea; think about 
writing your own plugin that solves your problem, which should be a cleaner 
solution. 

Would you agree to close this issue, since we will not be able to commit your 
changes? 
Stefan  
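
One way to keep one repository per Configuration without pinning every
Configuration in memory is a WeakHashMap-backed cache. A minimal sketch
(hypothetical names, not the actual Nutch PluginRepository code):

import java.util.Map;
import java.util.WeakHashMap;

// Sketch of a per-Configuration cache that does not leak: once the
// caller drops its Configuration, the cached repository becomes
// collectable too. Caveat: the cached value must not hold a strong
// reference back to the key, or the entry never clears.
public class RepositoryCache<K, V> {
  public interface Factory<K, V> {
    V create(K key);
  }

  private final Map<K, V> cache = new WeakHashMap<K, V>();

  public synchronized V get(K conf, Factory<K, V> factory) {
    V repo = cache.get(conf);
    if (repo == null) {
      repo = factory.create(conf);
      cache.put(conf, repo);
    }
    return repo;
  }
}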

 Plugin repository cache can lead to memory leak
 ---

 Key: NUTCH-356
 URL: http://issues.apache.org/jira/browse/NUTCH-356
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Enrico Triolo
 Attachments: NutchTest.java, patch.txt


 While I was trying to solve a problem I reported a while ago (see NUTCH-314), 
 I found out that the problem was actually related to the plugin cache used in 
 class PluginRepository.java.
 As I said in NUTCH-314, I think I somehow 'force' the way Nutch is meant to 
 work, since I need to frequently submit new URLs and append their contents to 
 the index; I don't (and can't) have a urls.txt file with all the URLs I'm 
 going to fetch, but recreate it each time a new URL is submitted.
 Thus, I think in most cases you won't have problems using Nutch as-is, since 
 the problem I found occurs only if Nutch is used in a way similar to mine.
 To simplify your test I'm attaching a class that performs something similar 
 to what I need. It fetches and indexes some sample URLs; to avoid webmasters' 
 complaints I left the sample URL list empty, so you should modify the source 
 code and add some URLs.
 Then you only have to run it and watch your memory consumption with top. In 
 my experience I get an OutOfMemoryException after a couple of minutes, but it 
 clearly depends on your heap settings and on the plugins you are using (I'm 
 using 
 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
 The problem is bound to the PluginRepository 'singleton' instance, since it 
 never gets released. It seems that some class maintains a reference to it, 
 and this class is never released since it is cached somewhere in the 
 configuration.
 So I modified the PluginRepository's 'get' method so that it never uses the 
 cache and always returns a new instance (you can find the patch in the 
 attachment). This way the memory consumption stays stable and I get no 
 OOM anymore.
 Clearly this is not the solution, since I guess there are many performance 
 issues involved, but for the moment it works.





[jira] Created: (NUTCH-357) crawling simulation

2006-08-21 Thread Stefan Groschupf (JIRA)
crawling simulation
---

 Key: NUTCH-357
 URL: http://issues.apache.org/jira/browse/NUTCH-357
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
 Fix For: 0.9.0


We recently discovered some serious issues related to crawling and scoring. 
Reproducing these problems is difficult: first, it is not polite to re-crawl a 
set of pages again and again; second, it is difficult to catch the page that 
causes a problem. 
Therefore it would be very useful to have a testbed to simulate crawls where 
we can control the responses of the web servers. 
For a start, simulating very basic situations like a page that points to 
itself, link chains, or internal links would already be very useful. 

Later on, simulating crawls against existing data collections like TREC or 
a webgraph would be much more interesting, for instance to calculate the 
quality of the Nutch OPIC implementation against PageRank scores of the 
webgraph, or to evaluate crawling strategies.





[jira] Updated: (NUTCH-357) crawling simulation

2006-08-21 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-357?page=all ]

Stefan Groschupf updated NUTCH-357:
---

Attachment: protocol-simulation-pluginV1.patch

A very first preview of a plugin that helps to simulate crawls. This protocol 
plugin can be used to replace the http protocol plugin and return predefined 
content during a fetch. To simulate custom scenarios, an interface named 
Simulator can be implemented with just one method. 
The plugin comes with a very simple basic Simulator implementation; however, 
this already allows simulating the Nutch scoring problems known today, like 
pages pointing to themselves or link chains. 
For more details see the javadoc; I plan to improve the javadoc together with 
a native speaker. 

Feedback is welcome. 
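
The attached patch is the authoritative source; purely as an illustration, a
one-method Simulator interface for a simulated protocol plugin could look
like this (hypothetical names, not necessarily what the patch defines):

// Hypothetical one-method simulator: given the URL being "fetched",
// return the HTML the fake web server should respond with.
public interface Simulator {
  String getContent(String url);
}

// Example scenario from above: every page links back to itself.
class SelfLinkSimulator implements Simulator {
  public String getContent(String url) {
    return "<html><body><a href=\"" + url + "\">self</a></body></html>";
  }
}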

 crawling simulation
 ---

 Key: NUTCH-357
 URL: http://issues.apache.org/jira/browse/NUTCH-357
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
 Fix For: 0.9.0

 Attachments: protocol-simulation-pluginV1.patch


 We recently discovered some serious issues related to crawling and scoring. 
 Reproducing these problems is difficult: first, it is not polite to re-crawl 
 a set of pages again and again; second, it is difficult to catch the page 
 that causes a problem. 
 Therefore it would be very useful to have a testbed to simulate crawls where 
 we can control the responses of the web servers. 
 For a start, simulating very basic situations like a page that points to 
 itself, link chains, or internal links would already be very useful. 
 Later on, simulating crawls against existing data collections like TREC 
 or a webgraph would be much more interesting, for instance to calculate the 
 quality of the Nutch OPIC implementation against PageRank scores of the 
 webgraph, or to evaluate crawling strategies.





[jira] Created: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
MapWritable,  nextEntry is not reset when Entries are recycled 
---

 Key: NUTCH-354
 URL: http://issues.apache.org/jira/browse/NUTCH-354
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8.1, 0.9.0


MapWritable recycles entries from its internal linked list for performance 
reasons. The nextEntry of an entry is not reset when a recyclable entry is 
found. This can cause wrong data in a MapWritable. 






[jira] Updated: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-354?page=all ]

Stefan Groschupf updated NUTCH-354:
---

Attachment: resetNextEntryInMapWritableV1.patch

Resets the nextEntry of a recycled entry.
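
The effect of a missing reset can be reproduced outside of MapWritable; a
minimal self-contained illustration of the bug class (all names hypothetical,
not the actual MapWritable internals):

// If a node taken from a recycling free list keeps its old "next"
// pointer, a later traversal sees stale entries.
public class RecycleDemo {
  static class Entry {
    String key;
    Entry next;
  }

  public static void main(String[] args) {
    Entry b = new Entry();
    b.key = "b";
    Entry a = new Entry();
    a.key = "a";
    a.next = b;                   // old list: a -> b

    Entry head = a;               // recycle 'a' for a new one-element list
    head.key = "a2";
    // head.next = null;          // <-- the missing reset
    for (Entry e = head; e != null; e = e.next) {
      System.out.println(e.key);  // prints a2 AND the stale b
    }
  }
}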

 MapWritable,  nextEntry is not reset when Entries are recycled
 --

 Key: NUTCH-354
 URL: http://issues.apache.org/jira/browse/NUTCH-354
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.9.0, 0.8.1

 Attachments: resetNextEntryInMapWritableV1.patch


 MapWritable recycles entries from its internal linked list for performance 
 reasons. The nextEntry of an entry is not reset when a recyclable entry 
 is found. This can cause wrong data in a MapWritable. 





[jira] Commented: (NUTCH-343) Index MP3 SHA1 hashes

2006-08-18 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ] 

Stefan Groschupf commented on NUTCH-343:


Thanks for the contribution, and also for including a test with your patch. :-)
Just a small comment from a first look at the patch file: 
my personal experience is that some Nutch developers have strong opinions 
about code formatting, so you may want to check your formatting. :-)

 Index MP3 SHA1 hashes
 -

 Key: NUTCH-343
 URL: http://issues.apache.org/jira/browse/NUTCH-343
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 0.8, 0.9.0, 0.8.1
Reporter: Hasan Diwan
 Attachments: parsemp3.pat


 Add indexing of the MP3's SHA1 hash.





[jira] Updated: (NUTCH-341) IndexMerger now deletes entire workingdir after completing

2006-08-18 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-341?page=all ]

Stefan Groschupf updated NUTCH-341:
---

Attachment: doNotDeleteTmpIndexMergeDirV1.patch

+1. 
I agree it makes no sense at all to require the user to create a tmp folder 
manually, only for Nutch to then delete it with all its content. 
This is very dangerous if a user provides / as the tmp folder. The attached 
patch rolls back the missing line, and I would love for a developer with 
write access to roll this in asap!
THANKS!


 IndexMerger now deletes entire workingdir after completing
 

 Key: NUTCH-341
 URL: http://issues.apache.org/jira/browse/NUTCH-341
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.8
Reporter: Chris Schneider
Priority: Critical
 Attachments: doNotDeleteTmpIndexMergeDirV1.patch


 Change 383304 deleted the following line near line 117 (see 
 http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexMerger.java?r1=383304&r2=405204&diff_format=h
  for details):
 workDir = new File(workDir, "indexmerger-workingdir");
 Previously, if no -workingdir workingdir parameter was specified, 
 IndexMerger.main() would place an indexmerger-workingdir directory into the 
 default directory and then delete the former after completing. Now, 
 IndexMerger.main() defaults the value of its workDir to indexmerger within 
 the default directory, and deletes this workDir afterward.
 However, if -workingdir workingdir _is_ specified, IndexMerger.main() will 
 now set workDir to _this_ path and delete the _entire_ workingdir 
 afterward. Previously, IndexMerger.main() would only delete 
 workingdir/indexmerger-workingdir, without deleting workingdir itself. 
 This is because the line mentioned above always appended 
 indexmerger-workingdir to workDir.
 Our hardware configuration on the jobtracker/namenode box attempts to keep 
 all large datasets on a separate, large hard drive. Accordingly, we were 
 keeping dfs.name.dir, dfs.data.dir, mapred.system.dir, and mapred.local.dir 
 on this drive. Unfortunately, we were passing the folder containing these 
 folders in the workingdir parameter to the IndexMerger. As a result, the 
 first time we ran the IndexMerger, we ended up trashing our entire DFS!
 Perhaps the way that the IndexMerger handles its workingdir parameter now 
 is an acceptable design. However, given the way it handled this parameter in 
 the past, I feel that the current implementation is unacceptably dangerous.
 More importantly, perhaps there's some way that we could make Hadoop more 
 robust in handling its critical data files. I plan to place a directory owned 
 by root with dr permissions into each of these critical directories 
 in order to prevent any of them from suffering the fate of our DFS. This 
 could become part of a standard Hadoop installation.





[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]

Stefan Groschupf updated NUTCH-337:
---

Attachment: respectFetcherParsePropertyV1.patch

Hi Jeremy, thanks for catching this. Attached is a fix. It should be easy for 
a committer to commit this to trunk.
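
In outline, the fix reads the default from the configuration instead of
hard-coding true; a simplified sketch of the intended logic (not the literal
patch):

import org.apache.hadoop.conf.Configuration;

// Simplified sketch: derive the parsing flag from fetcher.parse and
// let -noParsing override it, instead of always passing true to fetch().
public class FetcherArgsSketch {
  static boolean resolveParsing(Configuration conf, String[] args) {
    boolean parsing = conf.getBoolean("fetcher.parse", true);
    for (String arg : args) {
      if ("-noParsing".equals(arg)) {
        parsing = false;
      }
    }
    return parsing;
  }
}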

 Fetcher ignores the fetcher.parse value configured in config file
 -

 Key: NUTCH-337
 URL: http://issues.apache.org/jira/browse/NUTCH-337
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.9.0
Reporter: Jeremy Huylebroeck
Priority: Trivial
 Attachments: respectFetcherParsePropertyV1.patch


 Using the command-line call to Fetcher, if the -noParsing parameter is given, 
 everything is fine.
 If -noParsing is not given, the value from nutch-site.xml (or 
 nutch-default.xml) should be used, but 'true' is always passed to 
 the call to fetch.
 It should be the value from the conf.





[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]

Stefan Groschupf updated NUTCH-337:
---

Priority: Major  (was: Trivial)

 Fetcher ignores the fetcher.parse value configured in config file
 -

 Key: NUTCH-337
 URL: http://issues.apache.org/jira/browse/NUTCH-337
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.9.0
Reporter: Jeremy Huylebroeck
 Attachments: respectFetcherParsePropertyV1.patch


 Using the command-line call to Fetcher, if the -noParsing parameter is given, 
 everything is fine.
 If -noParsing is not given, the value from nutch-site.xml (or 
 nutch-default.xml) should be used, but 'true' is always passed to 
 the call to fetch.
 It should be the value from the conf.





[jira] Created: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE

2006-08-17 Thread Stefan Groschupf (JIRA)
urls blocked db.fetch.retry.max * http.max.delays times during fetching are 
marked as STATUS_DB_GONE  
--

 Key: NUTCH-350
 URL: http://issues.apache.org/jira/browse/NUTCH-350
 Project: Nutch
  Issue Type: Bug
Reporter: Stefan Groschupf
Priority: Critical


Intranet crawls or focused crawls will fetch many pages from the same host. 
This causes a thread to be blocked when another thread is already fetching 
from the same host. It is very likely that threads are blocked more often than 
http.max.delays allows. In such a case the HttpBase.blockAddr method throws an 
HttpException. This is handled in the fetcher by incrementing the CrawlDatum 
retries and setting the status to STATUS_FETCH_RETRY. That means you have only 
db.fetch.retry.max * http.max.delays chances to fetch a URL. But in intranet 
or focused crawls it is very likely that this is not enough, and increasing 
one of the involved properties dramatically slows down the fetch. 
I suggest not incrementing the CrawlDatum retriesSinceFetch in case the 
problem was caused by a blocked thread.
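
A sketch of the suggested behavior with stand-in types (not the real fetcher
code): distinguish "host was busy, thread blocked" from a real fetch failure,
and only count the latter against db.fetch.retry.max:

// Stand-in types only; the real CrawlDatum and fetcher differ.
public class RetrySketch {
  static final byte STATUS_FETCH_RETRY = 3;

  static class Datum {
    byte status;
    int retriesSinceFetch;
  }

  static void handleRetry(Datum datum, boolean causedByBlockedThread) {
    datum.status = STATUS_FETCH_RETRY;   // retry either way
    if (!causedByBlockedThread) {
      datum.retriesSinceFetch++;         // real failures still count
    }
    // a blocked thread gets another chance without burning a retry
  }
}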





[jira] Updated: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE

2006-08-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-350?page=all ]

Stefan Groschupf updated NUTCH-350:
---

Attachment: protocolRetryV5.patch

This patch will dramatically increase the number of successfully fetched pages 
of an intranet crawl over time. 

 urls blocked db.fetch.retry.max * http.max.delays times during fetching are 
 marked as STATUS_DB_GONE
 

 Key: NUTCH-350
 URL: http://issues.apache.org/jira/browse/NUTCH-350
 Project: Nutch
  Issue Type: Bug
Reporter: Stefan Groschupf
Priority: Critical
 Attachments: protocolRetryV5.patch


 Intranet crawls or focused crawls will fetch many pages from the same host. 
 This causes a thread to be blocked when another thread is already fetching 
 from the same host. It is very likely that threads are blocked more often 
 than http.max.delays allows. In such a case the HttpBase.blockAddr method 
 throws an HttpException. This is handled in the fetcher by incrementing the 
 CrawlDatum retries and setting the status to STATUS_FETCH_RETRY. That means 
 you have only db.fetch.retry.max * http.max.delays chances to fetch 
 a URL. But in intranet or focused crawls it is very likely that this is not 
 enough, and increasing one of the involved properties dramatically slows 
 down the fetch. 
 I suggest not incrementing the CrawlDatum retriesSinceFetch in case the 
 problem was caused by a blocked thread.





[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12428858 ] 

Stefan Groschupf commented on NUTCH-322:


I think this is a serious problem. Page A does a server-side redirect to page 
B. Page A is never written to the output. As a result, page A never changes 
its state or its next fetch time, which means page A is fetched again, again, 
again ... ∞

I suggest that we write out page A with a status change to STATUS_DB_GONE.
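
What that would mean in the fetcher's output step, as a sketch with stand-in
types (not the real Fetcher API):

import java.util.HashMap;
import java.util.Map;

// On a permanent redirect, emit the original URL with STATUS_DB_GONE so
// its state and next fetch time are updated, and emit the target so it
// gets fetched in the next round. Stand-in types only.
public class RedirectOutputSketch {
  static final Byte STATUS_DB_GONE = 1;
  static final Byte STATUS_DB_UNFETCHED = 2;

  static void onPermanentRedirect(String fromUrl, String toUrl,
                                  Map<String, Byte> output) {
    output.put(fromUrl, STATUS_DB_GONE);      // page A: never refetch
    output.put(toUrl, STATUS_DB_UNFETCHED);   // page B: fetch next round
  }

  public static void main(String[] args) {
    Map<String, Byte> out = new HashMap<String, Byte>();
    onPermanentRedirect("http://a.example/", "http://b.example/", out);
    System.out.println(out);                  // prints both entries
  }
}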


 Fetcher discards ProtocolStatus, doesn't store redirected pages
 ---

 Key: NUTCH-322
 URL: http://issues.apache.org/jira/browse/NUTCH-322
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Andrzej Bialecki 
 Fix For: 0.9.0


 Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus 
 contains important information, such as protocol-level response code, 
 lastModified time, and possibly other messages.
 I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, 
 which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In 
 addition, if ProtocolStatus contains a valid lastModified time, that 
 CrawlDatum's modified time should also be set to this value.
 Additionally, Fetcher doesn't store redirected pages. Content of such pages 
 is silently discarded. When Fetcher translates from protocol-level status to 
 crawldb-level status it should probably store such pages with the following 
 translation of status codes:
 * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code 
 indicates a transient change, so we probably shouldn't mark the initial URL 
 as bad.
 * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a 
 permanent change, so the initial URL is no longer valid, i.e. it will always 
 result in redirects.





[jira] Created: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-08-17 Thread Stefan Groschupf (JIRA)
pages that serverside forwards will be refetched every time
---

 Key: NUTCH-353
 URL: http://issues.apache.org/jira/browse/NUTCH-353
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8.1
 Attachments: doNotRefecthForwarderPagesV1.patch

Pages that do a server-side forward are not written back into the crawlDb 
with a status change. The nextFetchTime is not changed either. 
This causes a refetch of the same page again and again. The result is that 
Nutch is not polite, refetching the forwarding page and the target page in 
each segment iteration. It also affects the scoring, since the forwarding 
page contributes its score to all outlinks.







[jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-08-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-353?page=all ]

Stefan Groschupf updated NUTCH-353:
---

Attachment: doNotRefecthForwarderPagesV1.patch

Since we discussed that Nutch needs to be more polite, we should fix this asap. 

 pages that serverside forwards will be refetched every time
 ---

 Key: NUTCH-353
 URL: http://issues.apache.org/jira/browse/NUTCH-353
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8.1

 Attachments: doNotRefecthForwarderPagesV1.patch


 Pages that do a server-side forward are not written back into the crawlDb 
 with a status change. The nextFetchTime is not changed either. 
 This causes a refetch of the same page again and again. The result is that 
 Nutch is not polite, refetching the forwarding page and the target page in 
 each segment iteration. It also affects the scoring, since the forwarding 
 page contributes its score to all outlinks.





[jira] Resolved: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]

Stefan Groschupf resolved NUTCH-322.


Resolution: Duplicate

duplicate of NUTCH-353

 Fetcher discards ProtocolStatus, doesn't store redirected pages
 ---

 Key: NUTCH-322
 URL: http://issues.apache.org/jira/browse/NUTCH-322
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Andrzej Bialecki 
 Fix For: 0.9.0


 Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus 
 contains important information, such as protocol-level response code, 
 lastModified time, and possibly other messages.
 I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, 
 which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In 
 addition, if ProtocolStatus contains a valid lastModified time, that 
 CrawlDatum's modified time should also be set to this value.
 Additionally, Fetcher doesn't store redirected pages. Content of such pages 
 is silently discarded. When Fetcher translates from protocol-level status to 
 crawldb-level status it should probably store such pages with the following 
 translation of status codes:
 * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code 
 indicates a transient change, so we probably shouldn't mark the initial URL 
 as bad.
 * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a 
 permanent change, so the initial URL is no longer valid, i.e. it will always 
 result in redirects.





[jira] Commented: (NUTCH-347) Build: plugins' Jars not found

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-347?page=comments#action_12428915 ] 

Stefan Groschupf commented on NUTCH-347:


Please submit this patch! 
Thanks!

 Build: plugins' Jars not found
 --

 Key: NUTCH-347
 URL: http://issues.apache.org/jira/browse/NUTCH-347
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Otis Gospodnetic
 Attachments: nutch_build_plugins_patch.txt


 While building Nutch, I noticed several places where various Jars from 
 plugins' lib directories could not be found, for example:
 $ ant package
 ...
 deploy:
  [copy] Warning: Could not find file 
 /home/otis/dev/repos/lucene/nutch/trunk/build/lib-log4j/lib-log4j.jar to copy.
 init:
 init-plugin:
 compile:
 jar:
 deps-test:
 deploy:
  [copy] Warning: Could not find file 
 /home/otis/dev/repos/lucene/nutch/trunk/build/lib-nekohtml/lib-nekohtml.jar 
 to copy.
 ...
 The problem is, these lib-*.jar files do not exist. Instead, those JARs 
 are typically named with a version in the name, like log4j-1.2.11.jar. I 
 could not find where this lib- prefix comes from, nor where the version is 
 dropped from the name. Anyone know?
 In order to avoid these errors I had to make symbolic links and fake things, 
 e.g.
   ln -s log4j-1.2.11.jar lib-log4j.jar
 But this should really be fixed somewhere, I just can't see where... :(
 Note that this doesn't completely break the build, but missing JARs can't be 
 a good thing.





[jira] Commented: (NUTCH-346) Improve readability of logs/hadoop.log

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-346?page=comments#action_12428917 ] 

Stefan Groschupf commented on NUTCH-346:


+1
I agree. Can you please create a patch file and attach it to this bug? 
Thanks

 Improve readability of logs/hadoop.log
 --

 Key: NUTCH-346
 URL: http://issues.apache.org/jira/browse/NUTCH-346
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: ubuntu dapper
Reporter: Renaud Richardet
Priority: Minor

 adding
 log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
 to conf/log4j.properties
 dramatically improves the readability of the logs in logs/hadoop.log (removes 
 all INFO)





[jira] Commented: (NUTCH-345) Add support for Content-Encoding: deflated

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428918 ] 

Stefan Groschupf commented on NUTCH-345:


Shouldn't the DeflateUtils also be part of the protocol-http plugin? 
Also, since this is a larger contribution and not just a small bug fix, it 
would be great to have a JUnit test within the patch. 
Thanks for the contribution.



 Add support for Content-Encoding: deflated
 --

 Key: NUTCH-345
 URL: http://issues.apache.org/jira/browse/NUTCH-345
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Pascal Beis
Priority: Minor
 Attachments: nutch-deflate.patch


 Add support for the deflate content-encoding, next to the already 
 implemented GZIP content-encoding. Patch attached. See also the 
 "Patch: deflate encoding" thread on nutch-dev on August 7/8, 2006.
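
For readers unfamiliar with the encoding: deflate decompression is covered by
java.util.zip.Inflater in the JDK. A minimal sketch of the mechanics (this is
not the attached patch's DeflateUtils):

import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Inflates "Content-Encoding: deflate" data with the plain JDK. Note
// that some servers send raw deflate without the zlib wrapper, which
// would need new Inflater(true) instead.
public class DeflateSketch {
  public static byte[] inflate(byte[] compressed) throws DataFormatException {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!inflater.finished()) {
      int n = inflater.inflate(buf);
      if (n == 0 && inflater.needsInput()) {
        break;                  // truncated input: return what we have
      }
      out.write(buf, 0, n);
    }
    inflater.end();
    return out.toByteArray();
  }
}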





[jira] Commented: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8

2006-08-16 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428537 ] 

Stefan Groschupf commented on NUTCH-349:


My vote goes to #2.
Having a tool that needs to be started manually would be better than 
complicating the already fragile code, from my point of view. 

 Port Nutch to use Hadoop Text instead of UTF8
 -

 Key: NUTCH-349
 URL: http://issues.apache.org/jira/browse/NUTCH-349
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Andrzej Bialecki 

 Currently Nutch uses the org.apache.hadoop.io.UTF8 class to store/read 
 Strings. This class has been deprecated in Hadoop 0.5.0, and the Text class 
 should be used instead. Sooner or later we will need to move Nutch to this 
 class instead of UTF8.
 This raises numerous issues regarding the compatibility of existing data in 
 CrawlDB, LinkDB and segments. I can see two ways to solve this:
 * add code in the readers of the respective formats to convert UTF8 -> Text 
 on the fly. New writers would only use Text. This is less than ideal, because 
 it complicates the code, and also at some point in time the UTF8 class will 
 be removed.
 * create a converter (to be maintained as long as UTF8 exists), which 
 converts existing data in bulk from UTF8 to Text. This requires an additional 
 processing step when upgrading, to convert all existing data to the new format.
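
Option two would amount to an identity job whose mapper rewrites the key
type. A sketch against the Hadoop 0.5-era mapred API (job setup, input/output
formats, and paths omitted; this assumes sequence files keyed by UTF8 with
values passed through unchanged):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.UTF8;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Bulk converter mapper: UTF8 keys in, Text keys out, values untouched.
public class Utf8ToTextMapper extends MapReduceBase implements Mapper {
  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    output.collect(new Text(((UTF8) key).toString()), value);
  }
}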





[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-08-16 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ] 

Stefan Groschupf commented on NUTCH-233:


Hi Otis, 
yes, for a serious whole-web crawl I need to change this regex first.
It only hangs on some random URLs that, for example, come from link farms 
the crawler runs into. 

 wrong regular expression hang reduce process for ever
 -

 Key: NUTCH-233
 URL: http://issues.apache.org/jira/browse/NUTCH-233
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.9.0


 It looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt 
 isn't compatible with java.util.regex, which is what the regex URL filter 
 actually uses. 
 Maybe it was missed when the regular expression package was changed.
 The problem was that while reducing a fetch map output, the reducer hung 
 forever, since the output format was applying the URL filter to a URL that 
 caused the hang.
 060315 230823 task_r_3n4zga at 
 java.lang.Character.codePointAt(Character.java:2335)
 060315 230823 task_r_3n4zga at 
 java.util.regex.Pattern$Dot.match(Pattern.java:4092)
 060315 230823 task_r_3n4zga at 
 java.util.regex.Pattern$Curly.match1(Pattern.java:
 I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the 
 fetch job works. (Thanks to Grant and Chris B. for helping to find the new 
 regex.)
 However, maybe people can review it and suggest improvements: the old 
 regex would match 
 abcd/foo/bar/foo/bar/foo/ and so will the new one. But the 
 old regex would also match 
 abcd/foo/bar/xyz/foo/bar/foo/ which the new regex will not match.
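
The difference between the two expressions can be checked directly; the test
strings are the ones from the description (don't run the old pattern against
untrusted input, since its nested unbounded quantifiers are exactly what
allows the catastrophic backtracking):

import java.util.regex.Pattern;

// Old vs. new urlfilter expression on the examples above.
public class RegexCheck {
  public static void main(String[] args) {
    Pattern oldP = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    Pattern newP = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    String repeated = "abcd/foo/bar/foo/bar/foo/";
    String interleaved = "abcd/foo/bar/xyz/foo/bar/foo/";

    System.out.println(oldP.matcher(repeated).find());     // true
    System.out.println(newP.matcher(repeated).find());     // true
    System.out.println(oldP.matcher(interleaved).find());  // true
    System.out.println(newP.matcher(interleaved).find());  // false
  }
}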





[jira] Updated: (NUTCH-348) Generator is building fetch list using *lowest* scoring URLs

2006-08-16 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-348?page=all ]

Stefan Groschupf updated NUTCH-348:
---

Attachment: sortPatchV1.patch

What do people think about this kind of solution?

 Generator is building fetch list using *lowest* scoring URLs
 

 Key: NUTCH-348
 URL: http://issues.apache.org/jira/browse/NUTCH-348
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Chris Schneider
 Attachments: sortPatchV1.patch


 Ever since revision 391271, when the CrawlDatum key was replaced by a 
 FloatWritable key, the Generator.Selector.reduce method has been outputting 
 the *lowest* scoring URLs! The CrawlDatum class has a Comparator that 
 essentially treats higher scoring CrawlDatum objects as less than lower 
 scoring CrawlDatum objects, so the higher scoring ones would appear first in 
 a sequence file sorted using this as the key.
 When a FloatWritable based on the score itself (as returned from 
 scfilters.generatorSortValue) became the sort key, it should have been 
 negated in Generator.Selector.map to have the same result. Curiously, there 
 is a comment to this effect immediately before the FloatWritable is set:
   // sort by decreasing score
   sortValue.set(sort);
 It seems like the simplest way to fix this is to just negate the score, and 
 this seems to work for me:
   // sort by decreasing score
   // 2006-08-15 CSc REALLY sort by decreasing score
   sortValue.set(-sort);
 Unfortunately, this means that any crawls that have been done using 
 Generator.java after revision 391271 should be discarded, as they were 
 focused on fetching the lowest scoring unfetched URLs in the crawldb, 
 essentially pointing the crawler 180 degrees from its intended direction.
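
The reason a simple negation suffices: the shuffle sorts FloatWritable keys
in ascending order, so storing -score makes the highest-scoring URLs come out
first. In miniature (a plain ascending sort standing in for the shuffle):

import java.util.Arrays;

// Ascending sort over negated scores yields descending score order.
public class SortDirectionDemo {
  public static void main(String[] args) {
    float[] sortKeys = { -0.2f, -1.5f, -0.7f };  // -score per URL
    Arrays.sort(sortKeys);                       // ascending, like the shuffle
    for (float k : sortKeys) {
      System.out.println(-k);                    // 1.5, 0.7, 0.2
    }
  }
}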





[jira] Created: (NUTCH-332) doubling score caused by page internal anchors.

2006-07-28 Thread Stefan Groschupf (JIRA)
doubling score caused by page internal anchors.
---

 Key: NUTCH-332
 URL: http://issues.apache.org/jira/browse/NUTCH-332
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8-dev


When a page has no outlinks but several links to itself, e.g. a set of 
anchors, the score of the page is distributed to its outlinks. But all these 
outlinks point back to the page. This causes the page score to be doubled. 
I'm not sure, but this may also cause a never-ending fetching loop for this 
page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to 
CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer line 107. 
So the status fetched may be overwritten with unfetched. 
In such a case we fetch the page every time again and also double the score 
of the page every time, which causes very high scores without any reason.





[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-26 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423539 ] 

Stefan Groschupf commented on NUTCH-318:


Yes, this happens only in a distributed environment. Please also see my last 
mail on the hadoop-dev list. I think there are more general logging problems 
that only occur in a distributed environment, so you will not track them down 
using the local runner.

 log4j not proper configured, readdb doesnt give any information
 ---

 Key: NUTCH-318
 URL: http://issues.apache.org/jira/browse/NUTCH-318
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.9-dev


 In the latest 0.8 sources the readdb command doesn't dump any information 
 anymore. 
 This is related to the misconfigured log4j.properties file. 
 Changing:
 log4j.rootLogger=INFO,DRFA
 to:
 log4j.rootLogger=INFO,DRFA,stdout
 dumps the information to the console, but not in a nice way. 
 What makes me wonder is that this information should also be in the log 
 file, but it isn't, so there may be problems here as well.
 Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and 
 hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?





[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-25 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423433 ] 

Stefan Groschupf commented on NUTCH-318:


Shouldn't that be fixed in 0.8, since as of today this tool just produces no output?!


 log4j not proper configured, readdb doesnt give any information
 ---

 Key: NUTCH-318
 URL: http://issues.apache.org/jira/browse/NUTCH-318
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.9-dev


 In the latest 0.8 sources the readdb command doesn't dump any information 
 anymore. 
 This is related to the misconfigured log4j.properties file. 
 Changing:
 log4j.rootLogger=INFO,DRFA
 to:
 log4j.rootLogger=INFO,DRFA,stdout
 dumps the information to the console, but not in a nice way. 
 What makes me wonder is that this information should also be in the log 
 file, but it isn't, so there may be problems here as well.
 Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and 
 hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?





[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-07-25 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ] 

Stefan Groschupf commented on NUTCH-233:


I think this should be fixed in 0.8 too, since everybody who does a real 
whole-web crawl with over 100 million pages will run into this problem. The 
problems come, for example, from URLs generated by spam bots. 



 wrong regular expression hang reduce process for ever
 -

 Key: NUTCH-233
 URL: http://issues.apache.org/jira/browse/NUTCH-233
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.9-dev


 It looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt 
 isn't compatible with java.util.regex, which is what the regex URL filter 
 actually uses. 
 Maybe it was missed when the regular expression package was changed.
 The problem was that while reducing a fetch map output, the reducer hung 
 forever, since the output format was applying the URL filter to a URL that 
 caused the hang.
 060315 230823 task_r_3n4zga at 
 java.lang.Character.codePointAt(Character.java:2335)
 060315 230823 task_r_3n4zga at 
 java.util.regex.Pattern$Dot.match(Pattern.java:4092)
 060315 230823 task_r_3n4zga at 
 java.util.regex.Pattern$Curly.match1(Pattern.java:
 I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the 
 fetch job works. (Thanks to Grant and Chris B. for helping to find the new 
 regex.)
 However, maybe people can review it and suggest improvements: the old 
 regex would match 
 abcd/foo/bar/foo/bar/foo/ and so will the new one. But the 
 old regex would also match 
 abcd/foo/bar/xyz/foo/bar/foo/ which the new regex will not match.





[jira] Created: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2006-07-20 Thread Stefan Groschupf (JIRA)
UrlFilters.java throws NPE in case urlfilter.order contains Filters that are 
not in plugin.includes
---

 Key: NUTCH-325
 URL: http://issues.apache.org/jira/browse/NUTCH-325
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 0.8-dev


In the URLFilters constructor we use an array as long as the number of filters 
defined in the urlfilter.order property. 
In case those filters are not included in the plugin.includes property, we end 
up putting null entries into the array.

This causes an NPE in URLFilters line 82.







[jira] Updated: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2006-07-20 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-325?page=all ]

Stefan Groschupf updated NUTCH-325:
---

Attachment: UrlFiltersNPE.patch

A patch that uses an ArrayList instead of an array and only puts entries into 
the list when the entry is not null. This means only URL filters that were 
actually loaded will be stored in the filters array that is cached in the 
Configuration object. 
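
In outline, as reconstructed from the description (not the verbatim patch):
collect only the filters that were actually loaded, then convert to an array,
so no null slots remain:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Skip order entries whose plugin was not loaded instead of leaving
// null slots in the filters array. URLFilter is a stand-in interface.
public class FilterListSketch {
  interface URLFilter {
    String filter(String url);
  }

  static URLFilter[] buildFilters(String[] orderedNames,
                                  Map<String, URLFilter> loaded) {
    List<URLFilter> filters = new ArrayList<URLFilter>();
    for (String name : orderedNames) {
      URLFilter f = loaded.get(name);
      if (f != null) {               // only filters that actually loaded
        filters.add(f);
      }
    }
    return filters.toArray(new URLFilter[filters.size()]);
  }
}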

 UrlFilters.java throws NPE in case urlfilter.order contains Filters that are 
 not in plugin.includes
 ---

 Key: NUTCH-325
 URL: http://issues.apache.org/jira/browse/NUTCH-325
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 0.8-dev

 Attachments: UrlFiltersNPE.patch


 In the URLFilters constructor we use an array as long as the number of 
 filters defined in the urlfilter.order property. 
 In case those filters are not included in the plugin.includes property, we 
 end up putting null entries into the array.
 This causes an NPE in URLFilters line 82.





[jira] Updated: (NUTCH-323) CrawlDatum.set just references a MapWritable of another object but does not copy it.

2006-07-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-323?page=all ]

Stefan Groschupf updated NUTCH-323:
---

Attachment: MapWritableCopyConstructor.patch

The attached patch adds a copy constructor to MapWritable and uses it in the 
CrawlDatum.set method. However, there are more places in the code where 
metadata is passed from one CrawlDatum to another, but I don't see any risk 
of concurrent usage of the MapWritable there. 
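
The shape of the fix, sketched on a plain map (the real MapWritable manages
its own Writable entry list, so the actual copy constructor differs):

import java.util.HashMap;

// set() copies the metadata instead of aliasing one instance between
// two datums; stand-in types only.
public class CopyConstructorSketch {
  static class MetaData extends HashMap<String, String> {
    MetaData() {
    }

    MetaData(MetaData other) {
      super(other);                       // copies the key-value tuples
    }
  }

  static class Datum {
    MetaData meta = new MetaData();

    void set(Datum that) {
      this.meta = new MetaData(that.meta);  // copy, don't share
    }
  }
}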


 CrawlDatum.set just references a MapWritable of another object but does not copy it.
 --

 Key: NUTCH-323
 URL: http://issues.apache.org/jira/browse/NUTCH-323
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.8-dev

 Attachments: MapWritableCopyConstructor.patch


 Using CrawlDatum.set(aOtherCrawlDatum) copies the data from one CrawlDatum 
 to another. 
 However, only a reference to the MapWritable is passed, meaning both objects 
 share the same MapWritable and its content. 
 This causes problems with concurrently manipulated MapWritables and their 
 key-value tuples. 





[jira] Created: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-19 Thread Stefan Groschupf (JIRA)
db.score.link.internal and db.score.link.external are ignored
-

 Key: NUTCH-324
 URL: http://issues.apache.org/jira/browse/NUTCH-324
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Stefan Groschupf
Priority: Critical


Configuration properties db.score.link.external and db.score.link.internal are 
ignored.
In the case of, e.g., message-board web pages, or pages that have large 
navigation menus on each page, giving internal links a lower impact makes a 
lot of sense for scoring.
Also for web spam this is a serious problem, since spammers can now set up 
just one domain with dynamically generated pages and thereby highly manipulate 
the Nutch scores. 
So I also suggest that we give db.score.link.internal a default value of 
something like 0.25. 






[jira] Updated: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-324?page=all ]

Stefan Groschupf updated NUTCH-324:
---

Attachment: InternalAndExternalLinkScoreFactor.patch

Multiplies the score of a page during distributeScoreToOutlink by 
db.score.link.internal or db.score.link.external.
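
In outline (reconstructed from the summary above, not the literal patch):

import org.apache.hadoop.conf.Configuration;

// Scale the score a page passes to an outlink by the configured
// internal/external link factor; a factor below 1.0 damps the weight
// of a site's links to itself.
public class LinkScoreSketch {
  static float outlinkScore(float pageScore, boolean internalLink,
                            Configuration conf) {
    float factor = internalLink
        ? conf.getFloat("db.score.link.internal", 1.0f)
        : conf.getFloat("db.score.link.external", 1.0f);
    return pageScore * factor;
  }
}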

 db.score.link.internal and db.score.link.external are ignored
 -

 Key: NUTCH-324
 URL: http://issues.apache.org/jira/browse/NUTCH-324
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Stefan Groschupf
Priority: Critical
 Attachments: InternalAndExternalLinkScoreFactor.patch


 Configuration properties db.score.link.external and db.score.link.internal 
 are ignored.
 In the case of, e.g., message-board web pages, or pages that have large 
 navigation menus on each page, giving internal links a lower impact makes a 
 lot of sense for scoring.
 Also for web spam this is a serious problem, since spammers can now set up 
 just one domain with dynamically generated pages and thereby highly 
 manipulate the Nutch scores. 
 So I also suggest that we give db.score.link.internal a default value of 
 something like 0.25. 





[jira] Resolved: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace

2006-07-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-319?page=all ]

Stefan Groschupf resolved NUTCH-319.


Resolution: Won't Fix

Sorry, this report is bogus, since it is written to the logging stream.

 OPICScoringFilter should use logging API instead of printStackTrace
 ---

 Key: NUTCH-319
 URL: http://issues.apache.org/jira/browse/NUTCH-319
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
 Assigned To: Andrzej Bialecki 
Priority: Trivial
 Fix For: 0.8-dev


 OPICScoringFilter line 107 should be a logging call, not 
 e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it?





[jira] Created: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace

2006-07-15 Thread Stefan Groschupf (JIRA)
OPICScoringFilter should use logging API instead of printStackTrace
---

 Key: NUTCH-319
 URL: http://issues.apache.org/jira/browse/NUTCH-319
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
 Assigned To: Andrzej Bialecki 
Priority: Trivial
 Fix For: 0.8-dev


OPICScoringFilter line 107 should be a logging call, not 
e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it?





[jira] Created: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-10 Thread Stefan Groschupf (JIRA)
log4j not proper configured, readdb doesnt give any information
---

 Key: NUTCH-318
 URL: http://issues.apache.org/jira/browse/NUTCH-318
 Project: Nutch
Type: Bug

Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.8-dev


In the latest 0.8 sources the readdb command doesn't dump any information 
anymore. 
This is related to the misconfigured log4j.properties file. 
Changing:
log4j.rootLogger=INFO,DRFA
to:
log4j.rootLogger=INFO,DRFA,stdout
dumps the information to the console, but not in a nice way. 

What makes me wonder is that this information should also be in the log file, 
but it isn't, so there may be problems here as well.
Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and 
hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?





[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-12 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]

Stefan Groschupf updated NUTCH-289:
---

Attachment: ipInCrawlDatumDraftV5.patch

Release candidate 1 of this patch.

This patch contains:
+ adds the IP address to CrawlDatum version 5 (as byte[4]) 
+ an IpAddress resolver (MapRunnable) tool to look up the IPs multithreaded
+ a property to define whether the IpAddress resolver should be started as 
part of the crawlDb update tool, to update the parse output folder (containing 
CrawlDatum status linked) of a segment before updating the crawlDb
+ use of the cached IP during generation

Please review this patch and give me any improvement suggestions; I think this 
is a very important issue, since it helps to do _real_ whole-web crawls and 
not end up in a honey pot after some fetch iterations.
Also, if you like, please vote for this issue. :-) Thanks.
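
For the byte[4] representation, the plain JDK already provides the pieces; a
sketch of resolving a host to the stored form (this is not the patch itself):

import java.net.InetAddress;
import java.net.UnknownHostException;

// Resolve a host name and keep the IPv4 address as byte[4], the
// representation stored in CrawlDatum; 16-byte IPv6 results are
// ignored, matching the patch's IPv4-only design.
public class IpLookupSketch {
  public static byte[] resolveIPv4(String host) throws UnknownHostException {
    byte[] addr = InetAddress.getByName(host).getAddress();
    return addr.length == 4 ? addr : new byte[4];  // ignore IPv6 for now
  }
}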

 CrawlDatum should store IP address
 --

  Key: NUTCH-289
  URL: http://issues.apache.org/jira/browse/NUTCH-289
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Doug Cutting
  Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch, 
 ipInCrawlDatumDraftV5.patch

 If the CrawlDatum stored the IP address of the host of its URL, then one 
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just hostname. This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.




[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-07 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]

Stefan Groschupf updated NUTCH-289:
---

Attachment: ipInCrawlDatumDraftV4.patch

Attached is a patch that only ever uses 4 bytes for the IP, meaning we ignore 
IPv6. This saves us 4 bytes in each CrawlDatum for now.
I tested the resolver tool with a 200++ million crawldb, and on average a 
performance of 500 IP lookups/sec per box is possible using 1000 threads.

I really would love to get this into the sources as the basic version of 
having the IP address in the CrawlDatum, since I'm working on a tool set of 
spam detectors that all need IP addresses somehow.
Maybe let's exclude the tool but start with the CrawlDatum? :-?
Any improvement suggestions?
Thanks.


 CrawlDatum should store IP address
 --

  Key: NUTCH-289
  URL: http://issues.apache.org/jira/browse/NUTCH-289
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Doug Cutting
  Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch

 If the CrawlDatum stored the IP address of the host of it's URL, then one 
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just hostname.  This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-302) java doc of CrawlDb is wrong

2006-06-07 Thread Stefan Groschupf (JIRA)
java doc of CrawlDb is wrong


 Key: NUTCH-302
 URL: http://issues.apache.org/jira/browse/NUTCH-302
 Project: Nutch
Type: Bug

Reporter: Stefan Groschupf
Priority: Trivial
 Fix For: 0.8-dev


CrawlDb has the same java doc as Injector. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-301) CommonGrams loads analysis.common.terms.file for each query

2006-06-07 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-301?page=all ]

Stefan Groschupf updated NUTCH-301:
---

Attachment: CommonGramsCacheV1.patch

Caches the HashMap COMMON_TERMS in the configuration instance.
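
For illustration, a minimal sketch of the caching idea: parse 
analysis.common.terms.file once and reuse the result per configuration. The 
WeakHashMap keyed on the configuration instance is an assumption made for the 
sketch; the attached patch stores the map with the configuration itself:

import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

public class CommonTermsCache {
  // weak keys: a cache entry goes away together with its configuration
  private static final Map CACHE = new WeakHashMap();

  public static synchronized HashMap getCommonTerms(Object conf) {
    HashMap terms = (HashMap) CACHE.get(conf);
    if (terms == null) {
      terms = parseCommonTermsFile(conf); // expensive, now done once per conf
      CACHE.put(conf, terms);
    }
    return terms;
  }

  private static HashMap parseCommonTermsFile(Object conf) {
    return new HashMap(); // placeholder for reading analysis.common.terms.file
  }
}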

 CommonGrams loads analysis.common.terms.file for each query
 ---

  Key: NUTCH-301
  URL: http://issues.apache.org/jira/browse/NUTCH-301
  Project: Nutch
 Type: Improvement

   Components: searcher
 Versions: 0.8-dev
 Reporter: Chris Schneider
  Attachments: CommonGramsCacheV1.patch

 The move away from static objects toward instance variables has resulted in 
 CommonGrams constructor parsing its analysis.common.terms.file for each 
 query. I'm not certain how large a performance impact this really is, but it 
 seems like something you'd want to avoid doing for each query. Perhaps the 
 solution is to keep around an instance of the CommonGrams object itself?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-07 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415171 ] 

Stefan Groschupf commented on NUTCH-293:


Any comments? There was already a posting on the nutch-agent mailing list 
where someone had banned nutch since nutch does not support Crawl-delay.
Because nutch tries to be polite, from my point of view this is a small but 
important change.
If there are no improvement suggestions, can one of the committers take care 
of that _please_? :-) 
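
For illustration, a minimal sketch of the parsing side of such support - 
reading a Crawl-delay value from a robots.txt line. The method shape is an 
assumption for the example, not the patch's code:

public class CrawlDelaySketch {
  // returns the crawl delay in milliseconds, or -1 if the line carries none
  static long parseCrawlDelay(String robotsLine) {
    String line = robotsLine.trim().toLowerCase();
    if (!line.startsWith("crawl-delay:"))
      return -1;
    try {
      long seconds = Long.parseLong(line.substring("crawl-delay:".length()).trim());
      return seconds * 1000L; // would override the configured fetcher delay
    } catch (NumberFormatException e) {
      return -1; // malformed value: keep the default delay
    }
  }
}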

 support for Crawl-delay in Robots.txt
 -

  Key: NUTCH-293
  URL: http://issues.apache.org/jira/browse/NUTCH-293
  Project: Nutch
 Type: Improvement

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Critical
  Attachments: crawlDelayv1.patch

 Nutch needs support for Crawl-delay as defined in robots.txt; it is not a 
 standard but a de-facto standard.
 See:
 http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
 Webmasters start blocking nutch since we do not support it.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-07 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415236 ] 

Stefan Groschupf commented on NUTCH-293:


Hi Andrzej, 
I agree, but writing a queue-based fetcher is a big step. I already have some 
basic code (nio based).
Also, I don't think that a new fetcher would be stable enough to put into a .8 
release. Since we plan to have a .8 release, I think it is a good idea for now 
to add this functionality. Maybe we make it configurable and switch it off by 
default?

In any case I suggest that we solve NUTCH-289 first and then get the new 
fetcher done.


 support for Crawl-delay in Robots.txt
 -

  Key: NUTCH-293
  URL: http://issues.apache.org/jira/browse/NUTCH-293
  Project: Nutch
 Type: Improvement

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Critical
  Attachments: crawlDelayv1.patch

 Nutch needs support for Crawl-delay as defined in robots.txt; it is not a 
 standard but a de-facto standard.
 See:
 http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
 Webmasters start blocking nutch since we do not support it.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ] 

Stefan Groschupf commented on NUTCH-258:


Scott, 
I agree with you. However, we need a clean patch to solve the problem; we 
cannot just comment things out of the code.
So I vote for this issue and I vote to reopen it.

 Once Nutch logs a SEVERE log item, Nutch fails forevermore
 --

  Key: NUTCH-258
  URL: http://issues.apache.org/jira/browse/NUTCH-258
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
  Environment: All
 Reporter: Scott Ganyo
 Priority: Critical
  Attachments: dumbfix.patch

 Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. 
  This is from the run() method in Fetcher.java:
 public void run() {
   synchronized (Fetcher.this) {activeThreads++;} // count threads

   try {
     UTF8 key = new UTF8();
     CrawlDatum datum = new CrawlDatum();

     while (true) {
       if (LogFormatter.hasLoggedSevere()) // something bad happened
         break;                            // exit
   
 Notice the last 2 lines.  This will prevent Nutch from ever Fetching again 
 once this is hit as LogFormatter is storing this data as a static.
 (Also note that LogFormatter.hasLoggedSevere() is also checked in 
 org.apache.nutch.net.URLFilterChecker and will disable this class as well.)
 This must be fixed or Nutch cannot be run as any kind of long-running 
 service.  Furthermore, I believe it is a poor decision to rely on a logging 
 event to determine the state of the application - this could have any number 
 of side-effects that would be extremely difficult to track down.  (As it has 
 already for me.)
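
A clean patch would probably replace the global flag with per-fetcher state 
along these lines - a sketch of the idea only, with illustrative names, not 
the eventual fix:

public class FetcherErrorState {
  // one flag per Fetcher instance instead of the static LogFormatter state
  private volatile boolean fatalError = false;

  public void reportFatal(Throwable t) {
    fatalError = true; // set where the SEVERE message used to be logged
  }

  public boolean shouldStop() {
    return fatalError; // replaces LogFormatter.hasLoggedSevere() in run()
  }
}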

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-05 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]

Stefan Groschupf updated NUTCH-289:
---

Attachment: ipInCrawlDatumDraftV1.patch

To keep the discussion alive, attached is a _first draft_ for storing the ip in 
the crawlDatum, for public discussion.

Some notes: 
The IP is stored as byte[] in the crawlDatum itself, not in the metadata.
There is an IpAddressResolver MapRunnable tool to update a crawlDb using 
multithreaded ip lookups.
In case an IP is available in the crawlDatum, the Generator uses the cached ip. 

To discuss:
I don't like the idea of post-processing the complete crawlDb every time after 
an update. 
Processing the crawlDb is expensive in storage usage and time. 
We could have a property ipLookups with the possible values 
never|duringParsing|postUpdateDb.
Then we could also add some code to look up the IP in the ParseOutputFormat as 
discussed, or we start the IpAddressResolver as a job in the updateDb tool code.

At the moment I write the ip address bytes like this:
out.writeInt(ipAddress.length);
out.write(ipAddress); 
I think for now we can define that byte[] ipAddress is always 4 bytes long, 
or should we be IPv6 compatible by today?

Please give me some comments; I have a strong interest in getting this issue 
fixed asap and I'm willing to improve things as required. :-)
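
For comparison, a minimal sketch of the fixed-width alternative: if byte[4] is 
guaranteed, no length prefix is needed at all. Field names are illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class IpFieldSketch {
  private byte[] ipAddress = new byte[4]; // IPv4 only, as proposed

  public void write(DataOutput out) throws IOException {
    out.write(ipAddress); // fixed 4 bytes, no writeInt(length) needed
  }

  public void readFields(DataInput in) throws IOException {
    in.readFully(ipAddress); // reads back exactly 4 bytes
  }
}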

 CrawlDatum should store IP address
 --

  Key: NUTCH-289
  URL: http://issues.apache.org/jira/browse/NUTCH-289
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Doug Cutting
  Attachments: ipInCrawlDatumDraftV1.patch

 If the CrawlDatum stored the IP address of the host of it's URL, then one 
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just hostname.  This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown

2006-06-04 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]

Stefan Groschupf updated NUTCH-298:
---

Summary: if a 404 for a robots.txt is returned a NPE is thrown  (was: if a 
404 for a robots.txt is returned no page is fetched at all from the host)

Sorry, wrong description.

 if a 404 for a robots.txt is returned a NPE is thrown
 -

  Key: NUTCH-298
  URL: http://issues.apache.org/jira/browse/NUTCH-298
  Project: Nutch
 Type: Bug

 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: fixNpeRobotRuleSet.patch

 What happens:
 If no RobotRuleSet is in the cache for a host, we try to fetch the 
 robots.txt.
 In case the http response code is not 200 or 403 but for example 404, we do 
 robotRules = EMPTY_RULES;  (line: 402)
 EMPTY_RULES is a RobotRuleSet created with the default constructor.
 tmpEntries and entries are null and will never be changed.
 If we now try to fetch a page from that host, EMPTY_RULES is used 
 and we call isAllowed on the RobotRuleSet.
 In this case a NPE is thrown in this line:
  if (entries == null) {
    entries = new RobotsEntry[tmpEntries.size()];
 possible solution:
 We can initialize tmpEntries by default and also remove the other null checks 
 and initialisations.
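
For illustration, a sketch of the proposed fix - initializing tmpEntries by 
default so an empty rule set can answer isAllowed safely. Names follow the 
description above; the actual patch may differ:

import java.util.ArrayList;

class RobotRuleSetSketch {
  private ArrayList tmpEntries = new ArrayList(); // initialized, never null
  private RobotsEntry[] entries = null;

  public boolean isAllowed(String path) {
    if (entries == null) {
      // size() is 0 for an empty rule set, so this no longer NPEs
      entries = (RobotsEntry[]) tmpEntries.toArray(new RobotsEntry[tmpEntries.size()]);
    }
    return entries.length == 0; // simplified: no rules means everything allowed
  }
}
class RobotsEntry {}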

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-03 Thread Stefan Groschupf (JIRA)
if a 404 for a robots.txt is returned no page is fetched at all from the host
-

 Key: NUTCH-298
 URL: http://issues.apache.org/jira/browse/NUTCH-298
 Project: Nutch
Type: Bug

Reporter: Stefan Groschupf
 Fix For: 0.8-dev


What happens:

If no RobotRuleSet is in the cache for a host, we try to fetch the 
robots.txt.
In case the http response code is not 200 or 403 but for example 404, we do 
robotRules = EMPTY_RULES;  (line: 402)
EMPTY_RULES is a RobotRuleSet created with the default constructor.
tmpEntries and entries are null and will never be changed.
If we now try to fetch a page from that host, EMPTY_RULES is used 
and we call isAllowed on the RobotRuleSet.
In this case a NPE is thrown in this line:
 if (entries == null) {
   entries = new RobotsEntry[tmpEntries.size()];

possible solution:
We can initialize tmpEntries by default and also remove the other null checks 
and initialisations.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-03 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]

Stefan Groschupf updated NUTCH-298:
---

Attachment: fixNpeRobotRuleSet.patch

fixes the NPE in RobotRuleSet that happens in case we use an empty RuleSet

 if a 404 for a robots.txt is returned no page is fetched at all from the host
 -

  Key: NUTCH-298
  URL: http://issues.apache.org/jira/browse/NUTCH-298
  Project: Nutch
 Type: Bug

 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: fixNpeRobotRuleSet.patch

 What happens:
 If no RobotRuleSet is in the cache for a host, we try to fetch the 
 robots.txt.
 In case the http response code is not 200 or 403 but for example 404, we do 
 robotRules = EMPTY_RULES;  (line: 402)
 EMPTY_RULES is a RobotRuleSet created with the default constructor.
 tmpEntries and entries are null and will never be changed.
 If we now try to fetch a page from that host, EMPTY_RULES is used 
 and we call isAllowed on the RobotRuleSet.
 In this case a NPE is thrown in this line:
  if (entries == null) {
    entries = new RobotsEntry[tmpEntries.size()];
 possible solution:
 We can initialize tmpEntries by default and also remove the other null checks 
 and initialisations.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-282) Showing too few results on a page (Paging not correct)

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414435 ] 

Stefan Groschupf commented on NUTCH-282:


Is that related to the host grouping we discussed? Can we close this bug in 
that case?

 Showing too few results on a page (Paging not correct)
 --

  Key: NUTCH-282
  URL: http://issues.apache.org/jira/browse/NUTCH-282
  Project: Nutch
 Type: Bug

   Components: web gui
 Versions: 0.8-dev
 Reporter: Stefan Neufeind


 I did a search and got back the value itemsPerPage from opensearch. But 
 the output shows results 1-8, and I have a total of 46 search results.
 The same happens for the web interface.
 Why aren't enough results fetched?
 The problem might be somewhere in the area where Nutch should only display 
 a certain number of results per site.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ] 

Stefan Groschupf commented on NUTCH-286:


This is difficult to realize, since the http error code is read from the 
response in the fetcher and set into the protocol status; content analysis can 
only be done during parsing. 
Also, normally such pages do not get a high OPIC score and should not be in the 
top search results. 
However, this is a misconfigured http server response, so you may want to open 
a bug in the typo3 issue tracking. 
Should we close this issue?

 Handling common error-pages as 404
 --

  Key: NUTCH-286
  URL: http://issues.apache.org/jira/browse/NUTCH-286
  Project: Nutch
 Type: Improvement

 Reporter: Stefan Neufeind


 Idea: Some pages from some software packages/scripts report an http 200 ok 
 even though a specific page could not be found. An example I just found is:
 http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
 That's a typo3 page explaining in its standard layout and wording: The 
 requested page did not exist or was inaccessible.
 So I had the idea that somebody might create a plugin that could find commonly 
 used formulations for page does not exist etc. and turn the page into a 404 
 before feeding it into the nutch index - although the server responded 
 with status 200 ok.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ] 

Stefan Groschupf commented on NUTCH-292:


+1, Can someone create a clean patch file?

 OpenSearchServlet: OutOfMemoryError: Java heap space
 

  Key: NUTCH-292
  URL: http://issues.apache.org/jira/browse/NUTCH-292
  Project: Nutch
 Type: Bug

   Components: web gui
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
 Priority: Critical
  Attachments: summarizer.diff

 java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
   
 org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
   org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
   
 org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
   javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
   javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 The URL I use is:
 [...]something[...]/opensearch?query=mysearchstart=0hitsPerSite=3hitsPerPage=20sort=url
 It seems to be a problem specific to the date I'm working with. Moving the 
 start from 0 to 10 or changing the query works fine.
 Or maybe it doesn't have to do with sorting but it's just that I hit one bad 
 search-result that has a broken summary?
 !! The problem is repeatable. So if anybody has an idea where to search / 
 what to fix, I can easily try that out !!

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-291) OpenSearchServlet should return date as well as lastModified

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414445 ] 

Stefan Groschupf commented on NUTCH-291:


lastModified will only be indexed if you switch on the index-more plugin.
If you think the way lastModified and date are stored in the index should be 
changed, please submit a patch for MoreIndexingFilter.

 OpenSearchServlet should return date as well as lastModified
 

  Key: NUTCH-291
  URL: http://issues.apache.org/jira/browse/NUTCH-291
  Project: Nutch
 Type: Improvement

   Components: web gui
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
  Attachments: NUTCH-291-unfinished.patch

 Currently lastModified is provided by OpenSearchServlet - but only in case 
 the date lastModified-date is known.
 Since you can sort by date (which is lastModified or if not present the 
 fetchdate), it might be useful if OpenSearchServlet could provide date as 
 well.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ] 

Stefan Groschupf commented on NUTCH-290:


If a parser throws an exception (Fetcher, line 261):

try {
  parse = this.parseUtil.parse(content);
  parseStatus = parse.getData().getStatus();
} catch (Exception e) {
  parseStatus = new ParseStatus(e);
}
if (!parseStatus.isSuccess()) {
  LOG.warning("Error parsing: " + key + ": " + parseStatus);
  parse = parseStatus.getEmptyParse(getConf());
}

then we use the empty parse object, and an empty parse contains just no text; 
see getText:

private static class EmptyParseImpl implements Parse {

  private ParseData data = null;

  public EmptyParseImpl(ParseStatus status, Configuration conf) {
    data = new ParseData(status, "", new Outlink[0],
                         new Metadata(), new Metadata());
    data.setConf(conf);
  }

  public ParseData getData() {
    return data;
  }

  public String getText() {
    return "";
  }
}

So the problem should be somewhere else.

 parse-pdf: Garbage indexed when text-extraction not allowed
 ---

  Key: NUTCH-290
  URL: http://issues.apache.org/jira/browse/NUTCH-290
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
  Attachments: NUTCH-290-canExtractContent.patch

 It seems that garbage (or undecoded text?) is indexed when text-extraction 
 for a PDF is not allowed.
 Example-PDF:
 http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-287) Exception when searching with sort

2006-06-02 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-287?page=all ]
 
Stefan Groschupf closed NUTCH-287:
--

Resolution: Won't Fix

http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04696.html

 Exception when searching with sort
 --

  Key: NUTCH-287
  URL: http://issues.apache.org/jira/browse/NUTCH-287
  Project: Nutch
 Type: Bug

   Components: searcher
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
 Priority: Critical


 Running a search with sort=url works.
 But when using sort=title I get the following exception.
 2006-05-25 14:04:25 StandardWrapperValve[jsp]: Servlet.service() for servlet 
 jsp threw exception
 java.lang.RuntimeException: Unknown sort value type!
 at 
 org.apache.nutch.searcher.IndexSearcher.translateHits(IndexSearcher.java:157)
 at 
 org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95)
 at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:239)
 at org.apache.jsp.search_jsp._jspService(search_jsp.java:257)
 at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
 at 
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
 at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
 at 
 org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
 at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
 at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
 at 
 org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
 at 
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
 at java.lang.Thread.run(Thread.java:595)
 What is in those lines is:
   WritableComparable sortValue;   // convert value to writable
   if (sortField == null) {
     sortValue = new FloatWritable(scoreDocs[i].score);
   } else {
     Object raw = ((FieldDoc)scoreDocs[i]).fields[0];
     if (raw instanceof Integer) {
       sortValue = new IntWritable(((Integer)raw).intValue());
     } else if (raw instanceof Float) {
       sortValue = new FloatWritable(((Float)raw).floatValue());
     } else if (raw instanceof String) {
       sortValue = new UTF8((String)raw);
     } else {
       throw new RuntimeException("Unknown sort value type!");
     }
   }
 So I thought that maybe raw is an instance of something strange and tried 
 raw.getClass().getName() or also raw.toString() to track the cause down - but 
 that always resulted in a NullPointerException. So it seems I have raw 
 being null for some strange reason.
 When I try with title2 (or something non-existing) I get a different error 
 that title2 is unknown / not indexed. So I suspect that title 

[jira] Closed: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-284?page=all ]
 
Stefan Groschupf closed NUTCH-284:
--

Resolution: Won't Fix

Yes, I was missing index-basic.

 NullPointerException during index
 -

  Key: NUTCH-284
  URL: http://issues.apache.org/jira/browse/NUTCH-284
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Stefan Neufeind


 For quite a while this reduce > sort has been going on. Then it fails. What 
 could be wrong with this?
 060524 212613 reduce > sort
 060524 212614 reduce > sort
 060524 212615 reduce > sort
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212619 Optimizing index.
 060524 212619 job_jlbhhm
 java.lang.NullPointerException
 at 
 org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
 at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
 at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
 Exception in thread main java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
 at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
 at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12414453 ] 

Stefan Groschupf commented on NUTCH-284:


Please try to discuss such things first on the user mailing list before opening 
an issue. 
Maintaining the issue tracking is very time consuming. But if there is a bug, 
please do continue to open bug reports. :)
Thanks.


 NullPointerException during index
 -

  Key: NUTCH-284
  URL: http://issues.apache.org/jira/browse/NUTCH-284
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Stefan Neufeind


 For quite a while this reduce > sort has been going on. Then it fails. What 
 could be wrong with this?
 060524 212613 reduce > sort
 060524 212614 reduce > sort
 060524 212615 reduce > sort
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212619 Optimizing index.
 060524 212619 job_jlbhhm
 java.lang.NullPointerException
 at 
 org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
 at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
 at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
 Exception in thread main java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
 at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
 at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-281) cached.jsp: base-href needs to be outside comments

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-281?page=comments#action_12414454 ] 

Stefan Groschupf commented on NUTCH-281:


Can you submit a patch file?

 cached.jsp: base-href needs to be outside comments
 --

  Key: NUTCH-281
  URL: http://issues.apache.org/jira/browse/NUTCH-281
  Project: Nutch
 Type: Bug

   Components: web gui
 Reporter: Stefan Neufeind
 Priority: Trivial


 see cached.jsp
 base href=...
 does not take effect when showing a cached page because of the comments 
 around it

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-274) Empty row in/at end of URL-list results in error

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ] 

Stefan Groschupf commented on NUTCH-274:


Should we fix this in TextInputFormat of Hadoop, to ignore empty lines, or in 
the Injector?
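
Either way, the fix itself is small; a sketch of the Injector-side variant, 
with an illustrative method shape:

public class SeedLineCheck {
  // drop blank seed lines before any URL parsing or protocol detection
  static boolean isUsableSeed(String line) {
    return line != null && line.trim().length() > 0;
  }
}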

 Empty row in/at end of URL-list results in error
 

  Key: NUTCH-274
  URL: http://issues.apache.org/jira/browse/NUTCH-274
  Project: Nutch
 Type: Bug

 Versions: 0.8-dev
  Environment: nightly-2006-05-20
 Reporter: Stefan Neufeind
 Priority: Minor


 This is minor - but it's a little unclean :-)
 Reproduce: Have a URL file with one URL followed by a newline, thus producing 
 an empty line.
 Outcome: Fetcher threads try to fetch two URLs at the same time. The first one 
 is fine - but the second is empty and therefore fails proper protocol detection.
 060521 022639   Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
 060521 022639   Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
 060521 022639 found resource parse-plugins.xml at 
 file:/home/mm/nutch-nightly/conf/parse-plugins.xml
 060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
 060521 022639 fetching http://www.bild.de/
 060521 022639 fetching 
 060521 022639 fetch of  failed with: 
 org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: 
 no protocol: 
 060521 022639 http.proxy.host = null
 060521 022639 http.proxy.port = 8080
 060521 022639 http.timeout = 1
 060521 022639 http.content.limit = 65536
 060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; 
 http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
 060521 022639 fetcher.server.delay = 1000
 060521 022639 http.max.delays = 1000
 060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser 
 mapped to contentType text/xml via parse-plugins.xml, but
  its plugin.xml file does not claim to support contentType: text/xml
 060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser 
 mapped to contentType text/xml via parse-plugins.xml, but
  its plugin.xml file does not claim to support contentType: text/xml
 060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser 
 mapped to contentType text/xml via parse-plugins.xml, but 
 not enabled via plugin.includes in nutch-default.xml
 060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
 060521 022640  map 0%  reduce 0%
 060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 
 060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ] 

Stefan Groschupf commented on NUTCH-290:


As far as I understand the code, the next parser is only used if the previous 
parser returns an unsuccessful parsing status. If the parser throws an 
exception, this exception is not caught in the ParseUtil at all.
So the pdf parser should throw an exception, and not report an unsuccessful 
status, to solve this problem, shouldn't it?


 parse-pdf: Garbage indexed when text-extraction not allowed
 ---

  Key: NUTCH-290
  URL: http://issues.apache.org/jira/browse/NUTCH-290
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
  Attachments: NUTCH-290-canExtractContent.patch

 It seems that garbage (or undecoded text?) is indexed when text-extraction 
 for a PDF is not allowed.
 Example-PDF:
 http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-286?page=all ]
 
Stefan Groschupf closed NUTCH-286:
--

Resolution: Won't Fix

I hope everybody agrees with the statement: we cannot detect http response 
codes based on the returned html content.
Pruning the index is a good way to solve the problem.

 Handling common error-pages as 404
 --

  Key: NUTCH-286
  URL: http://issues.apache.org/jira/browse/NUTCH-286
  Project: Nutch
 Type: Improvement

 Reporter: Stefan Neufeind


 Idea: Some pages from some software packages/scripts report an http 200 ok 
 even though a specific page could not be found. An example I just found is:
 http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
 That's a typo3 page explaining in its standard layout and wording: The 
 requested page did not exist or was inaccessible.
 So I had the idea that somebody might create a plugin that could find commonly 
 used formulations for page does not exist etc. and turn the page into a 404 
 before feeding it into the nutch index - although the server responded 
 with status 200 ok.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-01 Thread Stefan Groschupf (JIRA)
support for Crawl-delay in Robots.txt
-

 Key: NUTCH-293
 URL: http://issues.apache.org/jira/browse/NUTCH-293
 Project: Nutch
Type: Improvement

  Components: fetcher  
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical


Nutch needs support for Crawl-delay as defined in robots.txt; it is not a 
standard but a de-facto standard.
See:
http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
Webmasters start blocking nutch since we do not support it.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-01 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-293?page=all ]

Stefan Groschupf updated NUTCH-293:
---

Attachment: crawlDelayv1.patch

A first draft of crawl-delay support for nutch. The problem I see is that in 
case ip-based delay is configured, it can happen that we use the crawl delay of 
one host for another host running on the same ip.
Feedback is welcome.

 support for Crawl-delay in Robots.txt
 -

  Key: NUTCH-293
  URL: http://issues.apache.org/jira/browse/NUTCH-293
  Project: Nutch
 Type: Improvement

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Critical
  Attachments: crawlDelayv1.patch

 Nutch needs support for Crawl-delay as defined in robots.txt; it is not a 
 standard but a de-facto standard.
 See:
 http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
 Webmasters start blocking nutch since we do not support it.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-05-30 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413940 ] 

Stefan Groschupf commented on NUTCH-289:


+1
Andrzej, I agree that looking up the ip in ParseOutputFormat would be the best, 
as Doug suggested.
The biggest problem nutch has at the moment is spam. The most often seen spam 
method is to set up a dns that returns the same ip for all subdomains and then 
deliver dynamically generated content. 
Spammers then just randomly generate subdomains within the content. It also 
happens often that they have many urls, but all of them point to the same 
server == ip. 
Buying more ip addresses is possible, but at the moment more expensive than 
buying more domains. 

Limiting the urls by ip is a great approach to prevent the crawler from staying 
in honey pots with tens of thousands of urls pointing to the same ip. 
However, to do so we need to have the ip already at generation time, and not 
look it up when fetching. 
We would be able to reuse the ip in the fetcher; also, we can try/catch the 
relevant parts in the fetcher, and in case the ip is not available we can 
re-resolve it. 
I don't think round-robin dns is a huge problem, since only large sites have 
it, and in such a case each ip is able to handle the requests.
In any case, storing the ip in the crawl-datum and using it for urls-per-ip 
limitations will be a big step forward in the fight against web spam.

 CrawlDatum should store IP address
 --

  Key: NUTCH-289
  URL: http://issues.apache.org/jira/browse/NUTCH-289
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Doug Cutting


 If the CrawlDatum stored the IP address of the host of it's URL, then one 
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just hostname.  This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-249) black- white list url filtering

2006-04-26 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-249?page=comments#action_12376477 ] 

Stefan Groschupf commented on NUTCH-249:


I mean the class and method naming isn't very good.
Blacklist or blocklist? Whitelist or positivelist?
Does this answer the question?

 black- white list url filtering
 ---

  Key: NUTCH-249
  URL: http://issues.apache.org/jira/browse/NUTCH-249
  Project: Nutch
 Type: Improvement

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Trivial
  Fix For: 0.8-dev
  Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch

 Existing url filter mechanisms need to process each url against each filter 
 pattern. For very large filter sets this may not scale very well.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-251) Administration GUI

2006-04-21 Thread Stefan Groschupf (JIRA)
Administration GUI
--

 Key: NUTCH-251
 URL: http://issues.apache.org/jira/browse/NUTCH-251
 Project: Nutch
Type: Improvement

Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 0.8-dev


Having a web based administration interface would help to make nutch 
administration and management much more user friendly.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-249) black- white list url filtering

2006-04-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-249?page=all ]

Stefan Groschupf updated NUTCH-249:
---

Attachment: blackWhiteListV2.patch

A concept tryout of black-/white-list filtering. I'm looking for beta testers 
and improvement suggestions. (Especially I'm looking for terminology 
suggestions.)
Such a filter mechanism can be very useful for vertical search deployments of 
nutch with very large filter sets.

A black/white url pattern database can be created and used to filter urls when 
updating a crawldb, so the crawlDb contains only urls that pass the black-white 
list. In case a url matches a black url prefix, it will not be written to the 
crawlDb. In case a url matches a white prefix, it is written to the crawlDb. 
In case a url matches neither a white nor a black prefix, it is also not 
written to the crawlDb.

Url filtering happens on a host level, so a url only needs to be filtered by 
the patterns for the same host. A sketch of the decision rule follows below.

Usage: 
// inject prefix url patterns (a text file in a folder) that a url should not 
match
bin/nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/projects/negativeUrls/ 
-black 
// inject prefix url patterns that a url is allowed to match
bin/nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/projects/positiveUrls/ 
-white 
// update a fetched segment into a database (only urls that pass the black- 
white filter will be added to the db)
bin/nutch org.apache.nutch.crawl.bw.BWUpdateDb testCrawlDb bwdb 
segments/20060416181635/ 

Known issues:
Hadoop does not allow different formats for one job, so some format-converting 
overhead is required that currently slows down the processing. 

Any comments are welcome!
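
The decision rule sketched (illustrative names, not the patch's classes):

public class BWFilterSketch {
  static boolean passes(String url, String[] blackPrefixes, String[] whitePrefixes) {
    for (int i = 0; i < blackPrefixes.length; i++)
      if (url.startsWith(blackPrefixes[i]))
        return false; // black match: never written to the crawlDb
    for (int i = 0; i < whitePrefixes.length; i++)
      if (url.startsWith(whitePrefixes[i]))
        return true;  // white match: written to the crawlDb
    return false;     // matches neither list: also dropped
  }
}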

 black- white list url filtering
 ---

  Key: NUTCH-249
  URL: http://issues.apache.org/jira/browse/NUTCH-249
  Project: Nutch
 Type: Improvement

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Trivial
  Fix For: 0.8-dev
  Attachments: blackWhiteListV2.patch

 Existing url filter mechanisms need to process each url against each filter 
 pattern. For very large filter sets this may not scale very well.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2006-04-13 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-246?page=all ]

Stefan Groschupf updated NUTCH-246:
---

Attachment: injectWithCurTimeMapper.patch

setFetchTime moved to Mapper.

 segment size is never as big as topN or crawlDB size in a distributed 
 deployement
 -

  Key: NUTCH-246
  URL: http://issues.apache.org/jira/browse/NUTCH-246
  Project: Nutch
 Type: Bug

 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Minor
  Fix For: 0.8-dev
  Attachments: injectWithCurTime.patch, injectWithCurTimeMapper.patch

 I didn't reopen NUTCH-136 since it may be related to the hadoop split.
 I tested this on two different deployments (with 10 tasktrackers + 1 
 jobtracker, and with 9 tasktrackers and 1 jobtracker).
 Defining the map and reduce task numbers in a mapred-default.xml does not 
 solve the problem (it is in nutch/conf on all boxes).
 We verified that it is not a problem of maximum urls per host and also not 
 a problem of the url filter.
 It looks like the first job of the Generator (Selector) already gets too few 
 entries to process. 
 Maybe this is somehow related to split generation or configuration inside 
 the distributed jobtracker, since it runs in a different jvm than the 
 jobclient. However, we were not able to find the source of this problem.
 I think this should be fixed before publishing a nutch 0.8. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2006-04-11 Thread Stefan Groschupf (JIRA)
segment size is never as big as topN or crawlDB size in a distributed 
deployement
-

 Key: NUTCH-246
 URL: http://issues.apache.org/jira/browse/NUTCH-246
 Project: Nutch
Type: Bug

Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8-dev


I didn't reopen NUTCH-136 since it may be related to the hadoop split.
I tested this on two different deployments (with 10 tasktrackers + 1 jobtracker, 
and with 9 tasktrackers and 1 jobtracker).
Defining the map and reduce task numbers in a mapred-default.xml does not solve 
the problem (it is in nutch/conf on all boxes).
We verified that it is not a problem of maximum urls per host and also not a 
problem of the url filter.

It looks like the first job of the Generator (Selector) already gets too few 
entries to process. 
Maybe this is somehow related to split generation or configuration inside 
the distributed jobtracker, since it runs in a different jvm than the jobclient.
However, we were not able to find the source of this problem.

I think this should be fixed before publishing a nutch 0.8. 




-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-247) robot parser to restrict.

2006-04-11 Thread Stefan Groschupf (JIRA)
robot parser to restrict.
-

 Key: NUTCH-247
 URL: http://issues.apache.org/jira/browse/NUTCH-247
 Project: Nutch
Type: Bug

  Components: fetcher  
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 0.8-dev


If the agent name and the robots agents are not properly configured, the 
RobotRulesParser uses LOG.severe to log the problem, but it also solves it. 
Later on, the fetcher thread checks for severe errors and stops if there is one.


RobotRulesParser:

if (agents.size() == 0) {
  agents.add(agentName);
  LOG.severe("No agents listed in 'http.robots.agents' property!");
} else if (!((String)agents.get(0)).equalsIgnoreCase(agentName)) {
  agents.add(0, agentName);
  LOG.severe("Agent we advertise (" + agentName
             + ") not listed first in 'http.robots.agents' property!");
}

Fetcher.FetcherThread:
 if (LogFormatter.hasLoggedSevere()) // something bad happened
   break;  

I suggest using warn or something similar instead of severe to log this 
problem.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-03-16 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12370686 ] 

Stefan Groschupf commented on NUTCH-233:


Sorry, I don't have such a url, since this happens while reducing a fetch. 
Reducing provides no logging, and the map data will be deleted if the job 
fails because of a timeout. :(


 wrong regular expression hang reduce process for ever
 -

  Key: NUTCH-233
  URL: http://issues.apache.org/jira/browse/NUTCH-233
  Project: Nutch
 Type: Bug
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Blocker
  Fix For: 0.8-dev


 Looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt 
 isn't compatible with java.util.regex, which is actually used in the regex url 
 filter. 
 Maybe changing it was missed when the regular expression package was changed.
 The problem was that while reducing a fetch map output the reducer hung 
 forever, since the outputformat was applying the urlfilter to a url that 
 caused the hang.
 060315 230823 task_r_3n4zga at 
 java.lang.Character.codePointAt(Character.java:2335)
 060315 230823 task_r_3n4zga at 
 java.util.regex.Pattern$Dot.match(Pattern.java:4092)
 060315 230823 task_r_3n4zga at 
 java.util.regex.Pattern$Curly.match1(Pattern.java:
 I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the 
 fetch job works. (Thanks to Grant and Chris B. for helping to find the new 
 regex.)
 However, maybe people can review it and suggest improvements: the old regex 
 would match abcd/foo/bar/foo/bar/foo/, and so will the new one. But the old 
 regex would also match abcd/foo/bar/xyz/foo/bar/foo/, which the new regex 
 will not match.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-03-15 Thread Stefan Groschupf (JIRA)
wrong regular expression hang reduce process for ever 
--

 Key: NUTCH-233
 URL: http://issues.apache.org/jira/browse/NUTCH-233
 Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8-dev


Looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt 
isn't compatible with java.util.regex, which is actually used in the regex url 
filter. 
Maybe changing it was missed when the regular expression package was changed.
The problem was that while reducing a fetch map output the reducer hung 
forever, since the outputformat was applying the urlfilter to a url that caused 
the hang.
060315 230823 task_r_3n4zga at 
java.lang.Character.codePointAt(Character.java:2335)
060315 230823 task_r_3n4zga at 
java.util.regex.Pattern$Dot.match(Pattern.java:4092)
060315 230823 task_r_3n4zga at 
java.util.regex.Pattern$Curly.match1(Pattern.java:

I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the 
fetch job works. (Thanks to Grant and Chris B. for helping to find the new 
regex.)
However, maybe people can review it and suggest improvements: the old regex 
would match abcd/foo/bar/foo/bar/foo/, and so will the new one. But the old 
regex would also match abcd/foo/bar/xyz/foo/bar/foo/, which the new regex 
will not match.
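
The behaviour described above can be checked directly against java.util.regex; 
the two urls are the examples from the report:

import java.util.regex.Pattern;

public class RegexCheck {
  public static void main(String[] args) {
    Pattern oldP = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    Pattern newP = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");
    String a = "abcd/foo/bar/foo/bar/foo/";
    String b = "abcd/foo/bar/xyz/foo/bar/foo/";
    System.out.println(oldP.matcher(a).find()); // true
    System.out.println(newP.matcher(a).find()); // true
    System.out.println(oldP.matcher(b).find()); // true
    System.out.println(newP.matcher(b).find()); // false: extra segment rejected
  }
}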


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-229) improved handling of plugin folder configuration

2006-03-12 Thread Stefan Groschupf (JIRA)
improved handling of plugin folder configuration


 Key: NUTCH-229
 URL: http://issues.apache.org/jira/browse/NUTCH-229
 Project: Nutch
Type: Improvement
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.8-dev


Currently nutch only supports absolute paths, or relative paths that are part 
of the classpath. 
There are cases where it would be useful to be able to use relative paths that 
are not in the classpath, for example to have a centralized plugin repository 
on a shared hdd in a cluster, or to run nutch inside an ide, etc.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-229) improved handling of plugin folder configuration

2006-03-12 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-229?page=all ]

Stefan Groschupf updated NUTCH-229:
---

Attachment: pluginFolder.patch

A patch to be able to use relative paths that are not in the classpath.

 improved handling of plugin folder configuration
 

  Key: NUTCH-229
  URL: http://issues.apache.org/jira/browse/NUTCH-229
  Project: Nutch
 Type: Improvement
 Reporter: Stefan Groschupf
 Priority: Critical
  Fix For: 0.8-dev
  Attachments: pluginFolder.patch

 Currently nutch only supports absolute paths, or relative paths that are part 
 of the classpath. 
 There are cases where it would be useful to be able to use relative paths 
 that are not in the classpath, for example to have a centralized plugin 
 repository on a shared hdd in a cluster, or to run nutch inside an ide, etc.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-226) CrawlDb Filter tool

2006-03-08 Thread Stefan Groschupf (JIRA)
CrawlDb Filter tool
---

 Key: NUTCH-226
 URL: http://issues.apache.org/jira/browse/NUTCH-226
 Project: Nutch
Type: Improvement
Reporter: Stefan Groschupf
Priority: Minor


A tool to filter an existing crawlDb

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-226) CrawlDb Filter tool

2006-03-08 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-226?page=all ]

Stefan Groschupf updated NUTCH-226:
---

Attachment: crawlDbFilter.patch

Patch with a tool to filter an existing crawlDb. In any case, back up your 
crawlDb first.
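
For readers curious how such a pass works, a rough sketch of the idea 
(hypothetical mapper written against the 0.8-era mapred API; the actual code 
is in crawlDbFilter.patch): each <url, CrawlDatum> entry is run through the 
configured URLFilters and dropped when rejected.

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.nutch.net.URLFilters;

public class CrawlDbFilterSketch implements Mapper {
  private URLFilters filters;

  public void configure(JobConf job) {
    filters = new URLFilters(job);
  }

  public void map(WritableComparable key, Writable value,
      OutputCollector output, Reporter reporter) throws IOException {
    String url = key.toString();
    try {
      url = filters.filter(url); // null if any filter rejects the URL
    } catch (Exception e) {
      url = null;
    }
    if (url != null) {
      output.collect(key, value); // keep only entries that pass all filters
    }
  }

  public void close() {}
}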

 CrawlDb Filter tool
 ---

  Key: NUTCH-226
  URL: http://issues.apache.org/jira/browse/NUTCH-226
  Project: Nutch
 Type: Improvement
 Reporter: Stefan Groschupf
 Priority: Minor
  Attachments: crawlDbFilter.patch

 A tool to filter a existing crawlDb

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-222) Exception in thread main java.lang.NoClassDefFoundError: invertlink

2006-03-04 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-222?page=all ]
 
Stefan Groschupf closed NUTCH-222:
--

Resolution: Fixed

Hi, 
I guess it is a typo; try invertlinks. When the nutch script does not know 
a command, as with invertlink in your case, it tries to execute it as a class.
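For example, the command from the report would then read (same paths, only 
the command name corrected):
$ bin/nutch invertlinks taxcrawl/db/ -dir taxcrawl/segments/*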

 Exception in thread main java.lang.NoClassDefFoundError: invertlink
 -

  Key: NUTCH-222
  URL: http://issues.apache.org/jira/browse/NUTCH-222
  Project: Nutch
 Type: Bug
   Components: fetcher
 Versions: 0.7.1
  Environment: Windows, Cygwin, etc.
 Reporter: Richard Braman


 When trying to invertlinks before indexing, following the tutorial, I get the 
 following error.
 [EMAIL PROTECTED] /cygdrive/t/nutch-0.7.1
 $ bin/nutch invertlink taxcrawl/db/ -dir taxcrawl/segments/*
 run java in C:\Program Files\Java\jdk1.5.0_04
 Exception in thread main java.lang.NoClassDefFoundError: invertlink

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-27 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367991 ] 

Stefan Groschupf commented on NUTCH-204:


Jérôme,
After taking a look at the HitDetails object again - after some time - I 
noticed I had completely overlooked that all values are already there as 
key:value tuples in the HitDetails object. 
The problem is rather that public String getValue(String field) just returns 
the first field matching the field name. Accessing all values is already 
possible using getLength, getField and getValue.
Isn't it?

From my point of view we should keep things as lightweight as possible and 
maybe just add one method getValues to the HitDetails object that could look 
like this:
public String[] getValues(String field) {
  // collect the values of every field whose name matches
  ArrayList arrayList = new ArrayList();
  for (int i = 0; i < length; i++) {
    if (fields[i].equals(field))
      arrayList.add(values[i]);
  }
  if (arrayList.size() > 0) {
    return (String[]) arrayList.toArray(new String[arrayList.size()]);
  }
  return null;
}
So I think introducing a new Property object that needs to be instantiated 
and serialized every time is just more overhead we should not introduce. 
HitDetails influences the search performance, and with one more object 
instantiated for each HitDetails we would slow this down by invoking the gc 
twice as often as before.
Would you agree to just adding a method getValues to the HitDetails object?
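
A hypothetical call site for illustration (the field name anchor is only an 
example):

String[] getAnchors(HitDetails details) {
  return details.getValues("anchor");
}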



 multiple field values in HitDetails
 ---

  Key: NUTCH-204
  URL: http://issues.apache.org/jira/browse/NUTCH-204
  Project: Nutch
 Type: Improvement
   Components: searcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: DetailGetValues070206.patch, NUTCH-204.jc.060227.patch

 Improvement as Howie Wang suggested.
 http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL 
 PROTECTED]

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-27 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12368038 ] 

Stefan Groschupf commented on NUTCH-204:


Yes that is a good idea. Thanks for getting this into the sources.
Cheers, 
Stefan

 multiple field values in HitDetails
 ---

  Key: NUTCH-204
  URL: http://issues.apache.org/jira/browse/NUTCH-204
  Project: Nutch
 Type: Improvement
   Components: searcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: DetailGetValues070206.patch, NUTCH-204.jc.060227.patch

 Improvement as Howie Wang suggested.
 http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL 
 PROTECTED]

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-23 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367520 ] 

Stefan Groschupf commented on NUTCH-204:


There is something I don't understand with this patch. The way Lucene manages 
multi-valued fields is to have many mono-valued Field objects with the same 
name. My question is, why not keep this logic? 

Sure, that would be possible. My idea was that we don't need these many 
identical keys; they just eat some bytes we do not really need to transfer 
over the network. 
HitDetails is a Writable, and in the case of multiple search servers 
distributed in a network it makes sense to minimize the network io, since 
getting details should be as fast as possible. 
Would you agree? However, I agree there are other ways to realize that; if 
you see space for improvements feel free, in any case I really would love to 
see the feature in the sources. 

 multiple field values in HitDetails
 ---

  Key: NUTCH-204
  URL: http://issues.apache.org/jira/browse/NUTCH-204
  Project: Nutch
 Type: Improvement
   Components: searcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: DetailGetValues070206.patch

 Improvement as Howie Wang suggested.
 http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL 
 PROTECTED]

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-23 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367539 ] 

Stefan Groschupf commented on NUTCH-204:


Wouldn't you end up with something very similar to what there is now, having 
one key and multiple values per key?
The Lucene Document provides a getValues, so I do not see any changes to the 
lucene API concepts as you mentioned in your first post.
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html#getValues(java.lang.String)
Sorry, I still do not understand your improvement suggestion; can you give 
some more details?

 multiple field values in HitDetails
 ---

  Key: NUTCH-204
  URL: http://issues.apache.org/jira/browse/NUTCH-204
  Project: Nutch
 Type: Improvement
   Components: searcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: DetailGetValues070206.patch

 Improvement as Howie Wang suggested.
 http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL 
 PROTECTED]

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-23 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367552 ] 

Stefan Groschupf commented on NUTCH-204:


Makes sense, I see, thanks for the clarification.

 multiple field values in HitDetails
 ---

  Key: NUTCH-204
  URL: http://issues.apache.org/jira/browse/NUTCH-204
  Project: Nutch
 Type: Improvement
   Components: searcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: DetailGetValues070206.patch

 Improvement as Howie Wang suggested.
 http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL 
 PROTECTED]

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-213) checkstyle

2006-02-18 Thread Stefan Groschupf (JIRA)
checkstyle
--

 Key: NUTCH-213
 URL: http://issues.apache.org/jira/browse/NUTCH-213
 Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor


Adding a checkstyle target to the ant build file to support contributors in 
verifying whitespace problems.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-213) checkstyle

2006-02-18 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-213?page=all ]

Stefan Groschupf updated NUTCH-213:
---

Attachment: checkstyle.patch
checkstyle-all-4.1.jar

As part of my learning lesson 'whitespace' I added a checkstyle target to the 
build script. The checkstyle setup by now only checks whitespace, but other 
checks can be added later. This target can be helpful for contributors to 
verify that new code is correctly formatted.
It is its own target that can be called by 'ant checkstyle'. The result is 
rendered to build/checkstyle/checkstyle_report.html
The patch file contains the text changes and text documents; the jar needs to 
be copied to the lib folder.

 checkstyle
 --

  Key: NUTCH-213
  URL: http://issues.apache.org/jira/browse/NUTCH-213
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Minor
  Attachments: checkstyle-all-4.1.jar, checkstyle.patch

 Adding a checkstyle target to the ant build file to support contributors in 
 verifying whitespace problems.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-211) FetchedSegments leave readers open

2006-02-16 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-211?page=comments#action_12366645 ] 

Stefan Groschupf commented on NUTCH-211:


Raghavendra, I'm not sure if I also close the linkDB reader; maybe I missed 
that. I will check this later today and may come up with an improved version 
if I missed it. Thanks for catching this.

 FetchedSegments leave readers open
 --

  Key: NUTCH-211
  URL: http://issues.apache.org/jira/browse/NUTCH-211
  Project: Nutch
 Type: Bug
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Assignee: Stefan Groschupf
 Priority: Critical
  Fix For: 0.8-dev
  Attachments: closeFetchSegments.patch

 I have a case here where the NutchBean is instantiated more than once; I do 
 cache the nutch bean, but in some situations the bean needs to be 
 re-created. The problem is that FetchedSegments leaves open all readers it 
 uses, so a nio exception is thrown as soon as I try to create the NutchBean 
 again. 
 I would suggest adding a close method to FetchedSegments and all involved 
 objects, to be able to cleanly shut down the NutchBean.
 Any comments? Would a patch be welcome?
 Caused by: java.nio.channels.ClosedChannelException
 at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:89)
 at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:272)
 at 
 org.apache.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.seek(LocalFileSystem.java:83)
 at 
 org.apache.nutch.fs.NFSDataInputStream$Checker.seek(NFSDataInputStream.java:66)
 at 
 org.apache.nutch.fs.NFSDataInputStream$PositionCache.seek(NFSDataInputStream.java:162)
 at 
 org.apache.nutch.fs.NFSDataInputStream$Buffer.seek(NFSDataInputStream.java:191)
 at org.apache.nutch.fs.NFSDataInputStream.seek(NFSDataInputStream.java:241)
 at org.apache.nutch.io.SequenceFile$Reader.seek(SequenceFile.java:403)
 at org.apache.nutch.io.MapFile$Reader.seek(MapFile.java:329)
 at org.apache.nutch.io.MapFile$Reader.get(MapFile.java:374)
 at 
 org.apache.nutch.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:76)
 at 
 org.apache.nutch.searcher.FetchedSegments$Segment.getEntry(FetchedSegments.java:93)
 at 
 org.apache.nutch.searcher.FetchedSegments$Segment.getParseText(FetchedSegments.java:84)
 at 
 org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:147)
 at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:321)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-211) FetchedSegments leave readers open

2006-02-16 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-211?page=all ]

Stefan Groschupf updated NUTCH-211:
---

Attachment: closeable160206.patch

Now also closing linkdb reader and file system, thanks to Raghavendra.

 FetchedSegments leave readers open
 --

  Key: NUTCH-211
  URL: http://issues.apache.org/jira/browse/NUTCH-211
  Project: Nutch
 Type: Bug
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Assignee: Stefan Groschupf
 Priority: Critical
  Fix For: 0.8-dev
  Attachments: closeFetchSegments.patch, closeable160206.patch

 I have a case here where the NutchBean is instantiated more than once; I do 
 cache the nutch bean, but in some situations the bean needs to be 
 re-created. The problem is that FetchedSegments leaves open all readers it 
 uses, so a nio exception is thrown as soon as I try to create the NutchBean 
 again. 
 I would suggest adding a close method to FetchedSegments and all involved 
 objects, to be able to cleanly shut down the NutchBean.
 Any comments? Would a patch be welcome?
 Caused by: java.nio.channels.ClosedChannelException
 at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:89)
 at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:272)
 at 
 org.apache.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.seek(LocalFileSystem.java:83)
 at 
 org.apache.nutch.fs.NFSDataInputStream$Checker.seek(NFSDataInputStream.java:66)
 at 
 org.apache.nutch.fs.NFSDataInputStream$PositionCache.seek(NFSDataInputStream.java:162)
 at 
 org.apache.nutch.fs.NFSDataInputStream$Buffer.seek(NFSDataInputStream.java:191)
 at org.apache.nutch.fs.NFSDataInputStream.seek(NFSDataInputStream.java:241)
 at org.apache.nutch.io.SequenceFile$Reader.seek(SequenceFile.java:403)
 at org.apache.nutch.io.MapFile$Reader.seek(MapFile.java:329)
 at org.apache.nutch.io.MapFile$Reader.get(MapFile.java:374)
 at 
 org.apache.nutch.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:76)
 at 
 org.apache.nutch.searcher.FetchedSegments$Segment.getEntry(FetchedSegments.java:93)
 at 
 org.apache.nutch.searcher.FetchedSegments$Segment.getParseText(FetchedSegments.java:84)
 at 
 org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:147)
 at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:321)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-211) FetchedSegments leave readers open

2006-02-15 Thread Stefan Groschupf (JIRA)
FetchedSegments leave readers open 
---

 Key: NUTCH-211
 URL: http://issues.apache.org/jira/browse/NUTCH-211
 Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.8-dev


I have a case here where the NutchBean is instantiated more than once; I do 
cache the nutch bean, but in some situations the bean needs to be re-created. 
The problem is that FetchedSegments leaves open all readers it uses, so a nio 
exception is thrown as soon as I try to create the NutchBean again. 
I would suggest adding a close method to FetchedSegments and all involved 
objects, to be able to cleanly shut down the NutchBean.
Any comments? Would a patch be welcome?

Caused by: java.nio.channels.ClosedChannelException
at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:89)
at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:272)
at 
org.apache.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.seek(LocalFileSystem.java:83)
at 
org.apache.nutch.fs.NFSDataInputStream$Checker.seek(NFSDataInputStream.java:66)
at 
org.apache.nutch.fs.NFSDataInputStream$PositionCache.seek(NFSDataInputStream.java:162)
at 
org.apache.nutch.fs.NFSDataInputStream$Buffer.seek(NFSDataInputStream.java:191)
at org.apache.nutch.fs.NFSDataInputStream.seek(NFSDataInputStream.java:241)
at org.apache.nutch.io.SequenceFile$Reader.seek(SequenceFile.java:403)
at org.apache.nutch.io.MapFile$Reader.seek(MapFile.java:329)
at org.apache.nutch.io.MapFile$Reader.get(MapFile.java:374)
at 
org.apache.nutch.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:76)
at 
org.apache.nutch.searcher.FetchedSegments$Segment.getEntry(FetchedSegments.java:93)
at 
org.apache.nutch.searcher.FetchedSegments$Segment.getParseText(FetchedSegments.java:84)
at 
org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:147)
at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:321)


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-15 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12366472 ] 

Stefan Groschupf commented on NUTCH-204:


Any improvment suggestions or negative comments? If not it would be great if 
one with write access to the svn can commit this since I have a meta data 
related patch I want to contribute that depends on this patch. Also this was a 
user request.
Thanks!

 multiple field values in HitDetails
 ---

  Key: NUTCH-204
  URL: http://issues.apache.org/jira/browse/NUTCH-204
  Project: Nutch
 Type: Improvement
   Components: searcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: DetailGetValues070206.patch

 Improvement as Howie Wang suggested.
 http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL 
 PROTECTED]

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Assigned: (NUTCH-211) FetchedSegments leave readers open

2006-02-15 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-211?page=all ]

Stefan Groschupf reassigned NUTCH-211:
--

Assign To: Stefan Groschupf

 FetchedSegments leave readers open
 --

  Key: NUTCH-211
  URL: http://issues.apache.org/jira/browse/NUTCH-211
  Project: Nutch
 Type: Bug
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Assignee: Stefan Groschupf
 Priority: Critical
  Fix For: 0.8-dev


 I have a case here where the NutchBean is instantiated more than once; I do 
 cache the nutch bean, but in some situations the bean needs to be 
 re-created. The problem is that FetchedSegments leaves open all readers it 
 uses, so a nio exception is thrown as soon as I try to create the NutchBean 
 again. 
 I would suggest adding a close method to FetchedSegments and all involved 
 objects, to be able to cleanly shut down the NutchBean.
 Any comments? Would a patch be welcome?
 Caused by: java.nio.channels.ClosedChannelException
 at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:89)
 at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:272)
 at 
 org.apache.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.seek(LocalFileSystem.java:83)
 at 
 org.apache.nutch.fs.NFSDataInputStream$Checker.seek(NFSDataInputStream.java:66)
 at 
 org.apache.nutch.fs.NFSDataInputStream$PositionCache.seek(NFSDataInputStream.java:162)
 at 
 org.apache.nutch.fs.NFSDataInputStream$Buffer.seek(NFSDataInputStream.java:191)
 at org.apache.nutch.fs.NFSDataInputStream.seek(NFSDataInputStream.java:241)
 at org.apache.nutch.io.SequenceFile$Reader.seek(SequenceFile.java:403)
 at org.apache.nutch.io.MapFile$Reader.seek(MapFile.java:329)
 at org.apache.nutch.io.MapFile$Reader.get(MapFile.java:374)
 at 
 org.apache.nutch.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:76)
 at 
 org.apache.nutch.searcher.FetchedSegments$Segment.getEntry(FetchedSegments.java:93)
 at 
 org.apache.nutch.searcher.FetchedSegments$Segment.getParseText(FetchedSegments.java:84)
 at 
 org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:147)
 at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:321)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-211) FetchedSegments leave readers open

2006-02-15 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-211?page=all ]

Stefan Groschupf updated NUTCH-211:
---

Attachment: closeFetchSegments.patch

NutchBean, FetchedSegments, FetchedSegments.Segment, IndexSearcher and 
HitContent now extend / implement the hadoop Closeable interface.
A NutchBean should now be able to shut down cleanly without leaving open file 
handles or socket clients.
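
A minimal sketch of the close-cascading idea (not the actual patch; 
java.io.Closeable is used here to keep the sketch self-contained, and the 
class names are made up):

import java.io.Closeable;
import java.io.IOException;

// each holder of readers closes exactly what it opened
class SegmentSketch implements Closeable {
  private final Closeable[] readers; // e.g. content and parse-text readers
  SegmentSketch(Closeable... readers) { this.readers = readers; }
  public void close() throws IOException {
    for (Closeable r : readers) {
      if (r != null) r.close();
    }
  }
}

class BeanSketch implements Closeable {
  private final SegmentSketch segments;
  BeanSketch(SegmentSketch segments) { this.segments = segments; }
  public void close() throws IOException {
    // the top-level bean cascades: segments, linkdb reader, searchers...
    segments.close();
  }
}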

 FetchedSegments leave readers open
 --

  Key: NUTCH-211
  URL: http://issues.apache.org/jira/browse/NUTCH-211
  Project: Nutch
 Type: Bug
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Assignee: Stefan Groschupf
 Priority: Critical
  Fix For: 0.8-dev
  Attachments: closeFetchSegments.patch

 I have a case here where the NutchBean is instantiated more than once; I do 
 cache the nutch bean, but in some situations the bean needs to be 
 re-created. The problem is that FetchedSegments leaves open all readers it 
 uses, so a nio exception is thrown as soon as I try to create the NutchBean 
 again. 
 I would suggest adding a close method to FetchedSegments and all involved 
 objects, to be able to cleanly shut down the NutchBean.
 Any comments? Would a patch be welcome?
 Caused by: java.nio.channels.ClosedChannelException
 at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:89)
 at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:272)
 at 
 org.apache.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.seek(LocalFileSystem.java:83)
 at 
 org.apache.nutch.fs.NFSDataInputStream$Checker.seek(NFSDataInputStream.java:66)
 at 
 org.apache.nutch.fs.NFSDataInputStream$PositionCache.seek(NFSDataInputStream.java:162)
 at 
 org.apache.nutch.fs.NFSDataInputStream$Buffer.seek(NFSDataInputStream.java:191)
 at org.apache.nutch.fs.NFSDataInputStream.seek(NFSDataInputStream.java:241)
 at org.apache.nutch.io.SequenceFile$Reader.seek(SequenceFile.java:403)
 at org.apache.nutch.io.MapFile$Reader.seek(MapFile.java:329)
 at org.apache.nutch.io.MapFile$Reader.get(MapFile.java:374)
 at 
 org.apache.nutch.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:76)
 at 
 org.apache.nutch.searcher.FetchedSegments$Segment.getEntry(FetchedSegments.java:93)
 at 
 org.apache.nutch.searcher.FetchedSegments$Segment.getParseText(FetchedSegments.java:84)
 at 
 org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:147)
 at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:321)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

2006-02-08 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]

Stefan Groschupf updated NUTCH-192:
---

Attachment: metadata08_02_06.patch

Doug, I'm afraid there is a misunderstanding, or maybe I just do not 
understand your comments.
A plugin never needs to add a class - id mapping anymore. The later patches 
(after Andrzej's suggestions) can handle any kind of writables. In case the 
class is not known in a mapping, we create an internal id - class tuple and 
write it to or read it from the 'header' of each MapWritable. So users can 
use any kind of custom writables; this just takes some more space in the file 
(one byte for the id and a UTF8 for the classname). In case there is a 
frequently used new writable we can add it to the mapping. 

So as suggested I moved the mapping from WritableName into a static block of 
MapWritable, and in case unknown writables are used we read/write a header 
containing these id - class tuples. From my point of view this is the best 
solution for now, and I don't think we will get new and frequently used 
writables that often. 
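
A minimal, self-contained sketch of the id-mapping scheme described above 
(this is not the Nutch code; it only demonstrates the header idea, and all 
names in it are made up):

import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class ClassIdHeaderSketch {
  // frequently used classes get fixed ids from a static table
  private static final Map<String, Byte> FIXED = new HashMap<String, Byte>();
  static {
    FIXED.put("org.apache.nutch.io.LongWritable", (byte) 1);
    FIXED.put("org.apache.nutch.io.UTF8", (byte) 2);
  }

  // ids assigned on the fly for classes missing from the fixed table
  private final Map<String, Byte> dynamic = new LinkedHashMap<String, Byte>();
  private byte nextId = 100; // ids below 100 reserved for the fixed table

  byte idFor(String className) {
    Byte id = FIXED.get(className);
    if (id != null) return id;
    id = dynamic.get(className);
    if (id == null) {
      id = nextId++;
      dynamic.put(className, id);
    }
    return id;
  }

  // writes only the header: one (id, classname) tuple per unknown class
  void writeHeader(DataOutput out) throws IOException {
    out.writeByte(dynamic.size());
    for (Map.Entry<String, Byte> e : dynamic.entrySet()) {
      out.writeByte(e.getValue());
      out.writeUTF(e.getKey()); // one byte for the id plus the class name
    }
  }

  public static void main(String[] args) throws IOException {
    ClassIdHeaderSketch sketch = new ClassIdHeaderSketch();
    sketch.idFor("org.apache.nutch.io.LongWritable"); // fixed id, no header
    sketch.idFor("com.example.MyCustomWritable");     // unknown, header entry
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    sketch.writeHeader(new DataOutputStream(buf));
    System.out.println("header bytes: " + buf.size());
  }
}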


 meta data support for CrawlDatum
 

  Key: NUTCH-192
  URL: http://issues.apache.org/jira/browse/NUTCH-192
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: metadata010206.patch, metadata060206.patch, 
 metadata08_02_06.patch, metadata300106.patch, metadata310106.patch

 Supporting meta data in CrawlDatum would help to get a set of new nutch 
 features realized and would make a lot possible for smaller, specially 
 focused search engines.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-204) multiple field values in HitDetails

2006-02-06 Thread Stefan Groschupf (JIRA)
multiple field values in HitDetails
---

 Key: NUTCH-204
 URL: http://issues.apache.org/jira/browse/NUTCH-204
 Project: Nutch
Type: Improvement
  Components: searcher  
Versions: 0.8-dev
Reporter: Stefan Groschupf
 Fix For: 0.8-dev


Improvement as Howie Wang suggested.
http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL 
PROTECTED]


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-204) multiple field values in HitDetails

2006-02-06 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-204?page=all ]

Stefan Groschupf updated NUTCH-204:
---

Attachment: DetailGetValues070206.patch

Patch that adding getValues to HitDetails.

 multiple field values in HitDetails
 ---

  Key: NUTCH-204
  URL: http://issues.apache.org/jira/browse/NUTCH-204
  Project: Nutch
 Type: Improvement
   Components: searcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: DetailGetValues070206.patch

 Improvement as Howie Wang suggested.
 http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL 
 PROTECTED]

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-01 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364788 ] 

Stefan Groschupf commented on NUTCH-192:


That's true. In any case I don't want to store the class id map, since if we 
do that, you are right, we could use strings. 
What do you think about having a map in the MapWritable itself where we 
manually assign ids? This was my plan in the very beginning, but I was 
thinking that using WritableName would be better; of course I overlooked the 
problems you mentioned.
Do you think having a static block in the MapWritable like this will solve 
our problems?
CACHE.put(LongWritable.class, new Byte((byte) 1));

Thanks for taking time to discuss this.

 meta data support for CrawlDatum
 

  Key: NUTCH-192
  URL: http://issues.apache.org/jira/browse/NUTCH-192
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: metadata300106.patch, metadata310106.patch

 Supporting meta data in CrawlDatum would help to get a set of new nutch 
 features realized and would make a lot possible for smaller, specially 
 focused search engines.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-01 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364795 ] 

Stefan Groschupf commented on NUTCH-192:


A perfect plan; I will do it that way and commit a new patch. :) 
THANKS!

 meta data support for CrawlDatum
 

  Key: NUTCH-192
  URL: http://issues.apache.org/jira/browse/NUTCH-192
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: metadata300106.patch, metadata310106.patch

 Supporting meta data in CrawlDatum would help to get a set of new nutch 
 features realized and would make a lot possible for smaller, specially 
 focused search engines.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

2006-02-01 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]

Stefan Groschupf updated NUTCH-192:
---

Attachment: metadata010206.patch

As discussed...

 meta data support for CrawlDatum
 

  Key: NUTCH-192
  URL: http://issues.apache.org/jira/browse/NUTCH-192
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: metadata010206.patch, metadata300106.patch, metadata310106.patch

 Supporting meta data in CrawlDatum would help to get a set of new nutch 
 features realized and would make a lot possible for smaller, specially 
 focused search engines.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364683 ] 

Stefan Groschupf commented on NUTCH-192:


Andrzej, Doug. I'm not sure if I understand you correctly: do you suggest 
having string keys and values, or just string keys?
It confuses me a bit, but I'm afraid of misunderstanding things because of my 
english, since I remember that one reason to have no meta data until today 
was performance and the size of the data. 
In one of my personal use-cases I have a set of meta data that is definitely 
smaller than 255 entries, and I only need to store some long values.
So I would love to use key:ByteWritable and value:LongWritable. 

Storing new LongWritable(23) versus new UTF8("23") should make a significant 
difference in size. Also, parsing a byte, int or long from a string takes 
some time.
There is a nice side effect: since this map is itself a Writable we can store 
a Map in a Map, which allows hierarchical meta data.

I fully agree with having a manually created mapping table stored in the 
MapWritable class, and I will change this and commit a new patch.
Thanks for your comments!
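
For illustration, the nesting trick mentioned above might look like this (a 
fragment only; import paths for the writables differ across 0.8-dev 
revisions):

MapWritable inner = new MapWritable();
inner.put(new UTF8("retries"), new LongWritable(3));

MapWritable outer = new MapWritable();
// MapWritable is itself a Writable, so it can be stored as a value
outer.put(new UTF8("fetch"), inner);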

 meta data support for CrawlDatum
 

  Key: NUTCH-192
  URL: http://issues.apache.org/jira/browse/NUTCH-192
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: metadata300106.patch

 Supporting meta data in CrawlDatum would help to get a set of new nutch 
 features realized and would make a lot possible for smaller, specially 
 focused search engines.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364699 ] 

Stefan Groschupf commented on NUTCH-192:


* plus whatever it takes to put the class name-id mapping in the MapWritable 
header (the mapping table): let's assume 40 bytes. 

I do not write the mapping table in any form to the output stream; right now 
the id is calculated as a hash of the class name. 
I will change this so that it will be part of the class, where I will 
manually assign LongWritable id = (byte)1, UTF8 id = (byte)2, etc.

For example, writing a long (e.g. a timestamp) as UTF8 requires 15 bytes, 
while writing it as a LongWritable takes 8 bytes.
8 bytes plus 1 byte for the class type is 60% of the space required when 
using a string (a 13-digit timestamp as UTF8 costs 2 length bytes plus 13 
character bytes = 15 bytes; 9/15 = 60%). 

I guess the main misunderstanding is that I do not write the class - id map 
into the stream at any time.
Does that make sense?
 


 meta data support for CrawlDatum
 

  Key: NUTCH-192
  URL: http://issues.apache.org/jira/browse/NUTCH-192
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: metadata300106.patch

 Supporting meta data in CrawlDatum would help to get a set of new nutch 
 features realized and would make a lot possible for smaller, specially 
 focused search engines.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]

Stefan Groschupf updated NUTCH-192:
---

Attachment: metadata310106.patch

Now it is 1 byte for the class type plus the size of the value itself; this 
means keys and values can be as small as 2 bytes each in the map. 

 meta data support for CrawlDatum
 

  Key: NUTCH-192
  URL: http://issues.apache.org/jira/browse/NUTCH-192
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: metadata300106.patch, metadata310106.patch

 Supporting meta data in CrawlDatum would help to get a set of new nutch 
 features realized and would make a lot possible for smaller, specially 
 focused search engines.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

2006-01-30 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]

Stefan Groschupf updated NUTCH-192:
---

Attachment: metadata300106.patch

Attached is a first suggestion for a patch adding meta data support to 
CrawlDatum. 
In general I created a MapWritable and added this to the CrawlDatum. If no 
meta data are added to the CrawlDatum, only one more int is written to the 
output stream. The MapWritable works like a HashMap but requires Writables as 
key and value. Besides the key and the value themselves, it writes two 
additional ints into the stream to identify the classes of key and value. If 
we change WritableName a bit more, we can minimize that to two additional 
bytes for storing the classes (this would limit us, but I guess we will never 
have that many writable object types :-o). However, I started with a patch 
that changes as little as possible, and I'm sure there is space for 
improvements. So feedback and improvement suggestions are welcome.
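
A hypothetical usage sketch of what is described above (fragment; the 
setMetaData accessor name is an assumption, not necessarily what the patch 
calls it):

MapWritable meta = new MapWritable();              // works like a HashMap
meta.put(new UTF8("segment"), new UTF8("20060130"));
meta.put(new UTF8("fetchTime"), new LongWritable(1138662000000L));

CrawlDatum datum = new CrawlDatum();
datum.setMetaData(meta); // hypothetical accessor added by the patch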



 meta data support for CrawlDatum
 

  Key: NUTCH-192
  URL: http://issues.apache.org/jira/browse/NUTCH-192
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: metadata300106.patch

 Supporting meta data in CrawlDatum would help to get a set of new nutch 
 features realized and would make a lot possible for smaller, specially 
 focused search engines.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-14) NullPointerException NutchBean.getSummary

2006-01-29 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-14?page=comments#action_12364401 ] 

Stefan Groschupf commented on NUTCH-14:
---

I haven't seen that anymore, but I haven't made any newer heavy-load tests. 
We may be able to close this for now.

 NullPointerException NutchBean.getSummary
 -

  Key: NUTCH-14
  URL: http://issues.apache.org/jira/browse/NUTCH-14
  Project: Nutch
 Type: Bug
   Components: searcher
 Reporter: Stefan Groschupf
 Priority: Minor


 In heavy-load scenarios this may happen when a connection breaks.
 java.lang.NullPointerException
 at java.util.Hashtable.get(Hashtable.java:333)
 at net.nutch.ipc.Client.getConnection(Client.java:276)
 at net.nutch.ipc.Client.call(Client.java:251)
 at 
 net.nutch.searcher.DistributedSearch$Client.getSummary(DistributedSearch.java:418)
 at net.nutch.searcher.NutchBean.getSummary(NutchBean.java:236)
 at 
 org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:396)
 at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:99)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:325)
 at 
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
 at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
 at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
 at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:825)
 at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:738)
 at 
 org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:526)
 at 
 org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
 at 
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
 at java.lang.Thread.run(Thread.java:552)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-59) meta data support in webdb

2006-01-26 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12364136 ] 

Stefan Groschupf commented on NUTCH-59:
---

Nutch 0.8 is very different from 0.7 in the way it stores page data and the 
linkgraph. Therefore a reimplementation of meta data support for nutch 0.8 is 
on my todo list. It will be a simple HashMap-style api to store and retrieve 
key-value tuples. Data will be stored in an extra file.

 

 meta data support in webdb
 --

  Key: NUTCH-59
  URL: http://issues.apache.org/jira/browse/NUTCH-59
  Project: Nutch
 Type: New Feature
 Reporter: Stefan Groschupf
 Priority: Minor
  Attachments: webDBMetaDataPatch.txt

 Meta data support in the web db would be very useful for a new set of nutch 
 features that need long-lived meta data. 
 Currently page meta data need to be regenerated or looked up every 30 days 
 when a page is re-fetched; in a long-term context web db meta data would 
 bring a dramatic performance improvement for such tasks.
 Furthermore, storage of meta data in the webdb would make a new generation 
 of linklist generation filters possible.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Resolved: (NUTCH-127) uncorrect values using -du, or ls does not return items

2006-01-23 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-127?page=all ]
 
Stefan Groschupf resolved NUTCH-127:


Resolution: Fixed

I guess it is solved, thanks. If I am able to reproduce it again I will just 
reopen this or file a new report. 
Thanks!

 uncorrect values using -du, or ls does not return items
 ---

  Key: NUTCH-127
  URL: http://issues.apache.org/jira/browse/NUTCH-127
  Project: Nutch
 Type: Bug
   Components: ndfs
 Versions: 0.8-dev, 0.7.2-dev
 Reporter: Stefan Groschupf
 Priority: Blocker


 The ndfs client returns incorrect values when using du, or ls does not 
 return items.
 It looks like there is a problem with the virtual file structure, since -du 
 only reads the meta data, doesn't it?
 We had moved some data from folder to folder, and after that we noticed that 
 a folder with zero items has a size.
 [EMAIL PROTECTED] bin/nutch ndfs -du indexes/
 051118 092409 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-default.xml
 051118 092409 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-site.xml
 051118 092409 No FS indicated, using default:192.168.200.3:5
 051118 092409 Client connection to 192.168.200.3:5: starting
 Found 1 items
 /user/nutch/indexes/20051022033721  974606348
 [EMAIL PROTECTED] bin/nutch ndfs -du indexes/20051022033721/
 051118 092416 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-default.xml
 051118 092416 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-site.xml
 051118 092416 No FS indicated, using default:192.168.200.3:5
 051118 092416 Client connection to 192.168.200.3:5: starting
 Found 0 items
 [EMAIL PROTECTED] bin/nutch ndfs -ls indexes/20051022033721
 051118 093331 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-default.xml
 051118 093332 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-site.xml
 051118 093332 No FS indicated, using default:192.168.200.3:5
 051118 093332 Client connection to 192.168.200.3:5: starting
 Found 0 items
 So maybe the mv tool has a problem, or the du or ls tool. :-O Any ideas 
 where to search for the problem? Debugging ndfs is tricky.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-18 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12363116 ] 

Stefan Groschupf commented on NUTCH-169:


Thanks, we will fix this at the beginning of next week.

 remove static NutchConf
 ---

  Key: NUTCH-169
  URL: http://issues.apache.org/jira/browse/NUTCH-169
  Project: Nutch
 Type: Improvement
 Reporter: Stefan Groschupf
 Priority: Critical
  Fix For: 0.8-dev
  Attachments: NutchConf.367837.patch, NutchConf.Fetcher.060111.patch, 
 NutchConf.Http.060111.patch, NutchConf.RegexURLFilter.060111.patch, 
 nutchConf.patch

 Removing the static NutchConf.get is required for a set of improvements and 
 new features.
 + it allows a better integration of nutch into j2ee or other systems.
 + it allows the management of nutch from a web-based gui (a kind of nutch 
 appliance), which will improve the usability and also increase the user 
 acceptance of nutch
 + it allows changing configuration properties at runtime
 + it allows implementing NutchConf as an abstract class or interface to 
 provide configuration value sources other than xml files. (community request)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


