Re: Next Nutch release
Hi, I just finished reading all the source code for the nutch gui. Personally I don't like putting a lot of code snippets into jsp files, since it takes a lot of time when refactoring. So how about adopting velocity/freemarker with a servlet? In general I agree it is the view layer and should have as little code as possible; however, the idea was to have as few dependencies as possible on third-party tools and libraries, and to get things realized with low tech (jsp). Stefan
Re: Next Nutch release
The old hadoop patch is here: https://issues.apache.org/jira/browse/NUTCH-251 Also we had this conversation: http://www.mail-archive.com/hadoop-dev@lucene.apache.org/msg00314.html I guess after this we neglected to post the patches we use internally. If someone feels strongly about getting the gui working with hadoop, he/she should feel free to update the patch and post it in the hadoop jira. Stefan On 18.01.2007, at 15:39, Doug Cutting wrote: Stefan Groschupf wrote: We run the gui in several production environments with patched hadoop code - since this is from our point of view the clean approach. Everything else feels like a workaround to fix some strange hadoop behaviors. Are there issues in Hadoop's Jira for these? If so, do they have patches attached? Are they linked to the corresponding issue in Nutch? Doug ~~~ 101tec Inc. Menlo Park, California http://www.101tec.com
Re: Next Nutch release
Hi Scott, feel free - I have no opinion on that. From my very little point of view the nutch .8 source stream is a one-way street. In all my projects we move as far as possible away from nutch. I like hadoop a lot, and writing custom tools on top of it is just that easy. But nutch .8 was a proof of concept for the early hadoop. There is only one serious developer left - and wow, how great he does his job - but nutch .8 is just too monolithic, too difficult to extend, too difficult to debug, too difficult to integrate for a serious mission-critical application. I spend a significant part of my life daily working with nutch, but if someone asked, I would answer: don't use it. Maybe one day we can get some developers together to first think about a good extendable design and then start a 2.x stream or a new project. And ... yes, no opic, and yes, definitely no plugin architecture (I feel very sorry for all who wasted so much lifetime because of my terribly complicated plugin system) but a clean IoC design with lightweight default interface implementations and great test coverage. Anyway, just my *very little* point of view based on 3.5 years of nutch experience. Stefan On 18.01.2007, at 21:33, Scott Green wrote: Stefan, I also dived into contrib/web2 in nutch. It and the admin-gui both own some plugins based on the nutch plugin architecture. So I think it would be great if we extracted something at a high level, since they should have a lot in common. Well, I don't know if it is the right time to do this job. On 1/19/07, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, I just finished reading all the source code for the nutch gui. Personally I don't like putting a lot of code snippets into jsp files, since it takes a lot of time when refactoring. So how about adopting velocity/freemarker with a servlet? In general I agree it is the view layer and should have as little code as possible; however, the idea was to have as few dependencies as possible on third-party tools and libraries, and to get things realized with low tech (jsp). Stefan ~~~ 101tec Inc. Menlo Park, California http://www.101tec.com
Re: Next Nutch release
Hi, great to hear people are still working on these things. It shows once more that getting something in early would save some effort. :) Just some random comments. We run the gui in several production environments with patched hadoop code - since this is from our point of view the clean approach. Everything else feels like a workaround to fix some strange hadoop behaviors. It may be a long time ago that I spoke to Doug and some other Hadoop developers, but at that time I understood that there was a general interest in having a nutch gui and supporting the required functionality in hadoop. I'm not sure if that is still the case or if I had a wrong impression. In any case, from my p.o.v. the clean way would be getting the required minor changes into hadoop (not critical, simple stuff from my point of view) instead of implementing workarounds in nutch. Since hadoop is a kind of child of nutch there should be a close relation, at least to discuss things. Anyway, no strong opinion, just my 2 cents. In any case I'm very happy that people now see the need for a gui as well and that someone is working on it, since I'm kind of busy with other projects. Thanks. Stefan On 17.01.2007, at 06:42, Enis Soztutar wrote: Hi all, for NUTCH-251: I suppose that NUTCH-251 is a relatively significant issue, judging by the votes. Stefan has written a good plugin for the admin gui and I have updated it to work with nutch-0.8, hadoop 0.4. Some of the features in the patch are not appropriate for our use cases and it requires hadoop changes, thus I am currently working on an alternative implementation of the administration gui, which runs a hadoop server (like JobTracker) to listen for submitted jobs, a web GUI to submit and track the jobs from the browser, and a job runner. The architecture details of the patch are as follows (a rough sketch of the queue follows after this message): - an AdminJob abstraction, an abstract class representing a job in nutch - various classes extending AdminJob, for example FetchAdminJob, IndexAdminJob - a queue which sorts the jobs in priority order, by a modified topological sort (jobs can be dependent) - an interface to submit jobs - an rpc server to listen for job submissions - an extension point (basically the same as the previous one) - a web server to serve the plugins' jsps. The features will be: submitting jobs from code, the command line, or the web interface; tracking jobs from the command line or web interface; scheduling jobs. I could send the code or details if anyone is interested in pretesting. And I will appreciate any comments and suggestions on this. I am planning to complete the patch and submit it to Jira ASAP. Sami Siren wrote: Hello, It has been a while since the previous release (0.8.1) and looking at the great fixes done in trunk I'd start thinking about baking a new release soon. Looking at the jira roadmaps there is 1 blocking issue (fixing the license headers) for 0.8.2 and two other blocking issues for 0.9.0, of which I think NUTCH-233 is safe to put in. The top 10 voted issues are currently: NUTCH-61 Adaptive re-fetch interval. Detecting unmodified content NUTCH-48 Did you mean query enhancement/refinement feature NUTCH-251 Administration GUI NUTCH-289 CrawlDatum should store IP address NUTCH-36 Chinese in Nutch NUTCH-185 XMLParser is configurable xml parser plugin.
NUTCH-59 meta data support in webdb NUTCH-92 DistributedSearch incorrectly scores results NUTCH-68 A tool to generate arbitrary fetchlists NUTCH-87 Efficient site-specific crawling for a large number of sites Are there any opinions about issues that should go in before the next release? (Answering yes means that you are willing to provide a patch for it.) -- Sami Siren ~~~ 101tec Inc. Menlo Park, California http://www.101tec.com
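A rough sketch of the queue Enis describes above - priority ordering over jobs that may depend on one another - could look like the following. All names here are illustrative and not taken from the actual patch:

import java.util.*;

// Illustrative sketch of a job queue that respects dependencies: a job
// becomes eligible only once every job it depends on has completed, and
// among eligible jobs the highest-priority one runs first.
public class AdminJobQueue {

  public static class Job {
    final String name;
    final int priority;                          // higher runs first
    final Set<Job> dependsOn = new HashSet<Job>();
    boolean done = false;

    Job(String name, int priority) {
      this.name = name;
      this.priority = priority;
    }
  }

  private final List<Job> jobs = new ArrayList<Job>();

  public synchronized void submit(Job job) {
    jobs.add(job);
  }

  // Returns the next runnable job, or null if nothing is eligible yet.
  public synchronized Job next() {
    Job best = null;
    for (Job j : jobs) {
      if (j.done) continue;
      boolean ready = true;
      for (Job dep : j.dependsOn) {
        if (!dep.done) { ready = false; break; }
      }
      if (ready && (best == null || j.priority > best.priority)) {
        best = j;
      }
    }
    return best;
  }
}

An IndexAdminJob would then simply declare its FetchAdminJob as a dependency, and both could be submitted in any order.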
Re: What's the status of Nutch-GUI?
Hi Sami, I guess you refer to these: • LocalJobRunner: • Run as a kind of singleton • Have a kind of jobQueue • Implement the JobSubmissionProtocol status-report methods • Implement the killJob method Right! - how about writing a nutchrunner that just extends the functionality of localjobrunner? That would be one solution; however, I still hope that the hadoop developers understand that it would be of general benefit to improve the local jobrunner. Since it would be somewhat duplicated code it does not feel right, but I also think better this way than never getting this issue solved. - scheduling (jobQueue) could be completely outside of the jobrunner? We solved that with Quartz and a file-based JobStore we implemented back then. Stefan
Re: [jira] Created: (NUTCH-408) Plugin development documentation
did you ever browse this: http://wiki.media-style.com/display/nutchDocu/Home Nothing big, but it will give you some ideas, also about plugins. On 25.11.2006, at 06:32, Armel T. Nene wrote: I agree with you that documentation is vital, not just for extending the current version but also for any plugins and patches created. I have been spending almost two weeks trying to adapt nutch to my project, but I spend more time reading code and trying to understand what it does before I can even start to fix a problem. Come on guys, documentation is good coding practice; we can't read your minds to know exactly what you were trying to achieve by just looking at the implementation code. This is just good constructive criticism. :) Armel -Original Message- From: nutch.newbie (JIRA) [mailto:[EMAIL PROTECTED] Sent: 25 November 2006 03:45 To: nutch-dev@lucene.apache.org Subject: [jira] Created: (NUTCH-408) Plugin development documentation Plugin development documentation Key: NUTCH-408 URL: http://issues.apache.org/jira/browse/NUTCH-408 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1 Environment: Linux Fedora Reporter: nutch.newbie Documentation is rare! But very vital for extending current (0.9) nutch. The current docs on the wiki for 0.7 plugin development were good, but they don't apply to 0.9, and new developers who are joining directly at 0.9 find the 0.7 documentation not enough. A more practical plugin-writing documentation for 0.9 is desired, also explaining the plugin principles in practical terms, i.e. extension points and libs etc. Furthermore it would be good to provide some best-practice examples, i.e. look whether the lib you are planning to use is already in the lib folder, and maybe that version of the external lib is good enough for the plugin dev rather than using another version - things like that. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira ~~~ 101tec Inc. search tech for web 2.1 Menlo Park, California http://www.101tec.com
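Until such documentation exists, a concrete starting point may help. The smallest useful plugin is typically a URL filter; below is a minimal sketch assuming the 0.8-era org.apache.nutch.net.URLFilter extension point, where filter() returns the URL to keep it or null to drop it. Verify the exact interface against your checkout, and remember the class still has to be declared in the plugin's plugin.xml and wired into the plugin build:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Minimal sketch of a URL filter extension: drops any URL containing
// "calendar" and passes everything else through unchanged.
public class NoCalendarURLFilter implements URLFilter {

  private Configuration conf;

  // Return the URL to keep it, or null to filter it out.
  public String filter(String urlString) {
    if (urlString != null && urlString.indexOf("calendar") >= 0) {
      return null;
    }
    return urlString;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}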
Re: Fetcher freezes
Hi, try running with no regular expression filter and check if this helps. Let me know if this solves the problem. You may also want to do a thread dump and send the log to the list to check where exactly the fetcher freezes. Stefan On 03.11.2006, at 15:53, Aisha wrote: Hi, I don't know why, but I have no answer on the 3 forums where I sent my problem. As the problem of Fetcher freezes occurs every time I try to fetch my file system, I can't imagine that I am the only one who has this problem, and as I said in my last e-mail, I found many mails about this problem but no solution seems to have been found. It is a big problem, so I don't understand why nobody seems interested in it. I try to crawl over my file system but the crawl never finishes; it aborts with the message Aborting with 3 hung threads. The number of hung threads is not the same if I retry. I modified the configuration, growing the number of threads, but it doesn't solve the problem. Please could somebody help me, I can't crawl my file system.. thanks in advance. Aïcha -- View this message in context: http://www.nabble.com/Fetcher-freezes-tf2568287.html#a7158776 Sent from the Nutch - Dev mailing list archive at Nabble.com. ~~~ 101tec Inc. search tech for web 2.1 Menlo Park, California http://www.101tec.com
Re: How could I test my modify to NutchAnalysis.jj?
There is an Eclipse JavaCC plugin. It compiles your grammar and you can easily write test code. However, it has its own issues, so you may just want to generate the java files with the nutch ant script and then write unit tests against these files. HTH Stefan On 10.09.2006, at 00:49, heack wrote: I made some changes to this file (with a main func), and I want to test it. What should I do? I use ant to build, but it builds everything. Maybe I could write an ant xml to run it, but is there any easier way to do that? Thank you! ~~~ 101tec Inc. search tech for web 2.1 Menlo Park, California http://www.101tec.com
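For the second approach, the test itself can be very small. The sketch below assumes the 0.8-era entry point NutchAnalysis.parseQuery(String, Configuration) and a Query.getTerms() accessor; both are assumptions to verify against the classes your grammar actually generates:

import junit.framework.TestCase;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.analysis.NutchAnalysis;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

// Sketch of a unit test run against the JavaCC-generated parser classes
// after "ant compile". Adjust the calls to whatever your modified
// grammar actually generates.
public class TestNutchAnalysisChanges extends TestCase {

  public void testQueryParsing() throws Exception {
    Configuration conf = NutchConfiguration.create();
    Query query = NutchAnalysis.parseQuery("nutch javacc", conf);
    assertNotNull(query);
    assertEquals(2, query.getTerms().length);  // two plain terms expected
  }
}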
Re: Patch Available status?
Another alternative would be to construct a new workflow that just adds the Patch Available status and still permits issues to be re-opened. +1
Re: Missing pages anchor text
Hi Doug, I'm pretty sure that your problem is related to the deduping of your index. In general the hash of the content of a page is used as the key for the dedup tool. We also ran into the forwarding problem in another case: https://issues.apache.org/jira/browse/NUTCH-353 So maybe we should think about a general solution to the forwarding problem. Greetings, Stefan On 28.08.2006, at 11:33, Doug Cook wrote: Hi, folks, I have just started digging into relevance issues with Nutch, and I'm running into some mysteries. Before I dig too deep, I wanted to check to see if these were known issues (a quick search of the email archives and of JIRA didn't turn up anything). I'm running 0.8 with a handful of patches. I'm frequently finding root pages of sites missing from my index, despite the fact that they have been fetched. In my admittedly short investigation I have found two classes of cases: 1. Root URL is not a redirect, but there is a root-level index.html page. The index.html page is in the index, but the root page is not. Unfortunately, most of the anchor text points to the root page, not the /index.html page, and the anchor text has gone missing along with its associated page, so relevance is poor. 2. Root URL is a redirect to another page. Again, this other page is in the index, but the root page, along with its anchor text, has gone missing. I have a deduped index. Both of these cases could result from dedup throwing out the wrong URL, i.e. the one with more anchor text, although one might expect dedup to merge the two anchor texts (at least in the case of pages which commonly normalize to the same URL, e.g. / and /index.html). The second case might result from the root URL somehow being normalized to its redirect target, but in that case (incorrect, in any case) I would expect the anchor text to also be attached to the redirect target, and it is not. I'm about to rebuild with no deduping and see what I find. Thanks for your help and comments - Doug -- View this message in context: http://www.nabble.com/Missing-pages---anchor-text-tf2179049.html#a6025652 Sent from the Nutch - Dev forum at Nabble.com. ~~~ 101tec Inc. Menlo Park, California http://www.101tec.com
Re: [Nutch Wiki] Update of RunNutchInEclipse by UrosG
Hi, + You may have problems with some imports in the parse-mp3 and parse-rtf plugins. Because of incompatibility with the apache licence they were left out of the sources. You can find them here: + + http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ + + http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ + + You need to copy the jar files into the plugin lib path and refresh the project. Isn't the mp3 plugin deactivated? I suggest we remove it and put it in a kind of sandbox together with the jars. However, I think the sandbox has to be outside of apache. Stefan
Re: Checking if crawl dir exists ...
Hi Michi, what is your motivation for that? Stefan On 25.08.2006, at 06:52, Michael Wechner wrote: Hi I think it would be very useful if the NutchBean would check if the crawl dir exists and throw at least a warning in case it doesn't: Index: nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java === --- nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java (Revision 436787) +++ nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java (Arbeitskopie) @@ -95,6 +95,9 @@ if (dir == null) { dir = new Path(this.conf.get("searcher.dir", "crawl")); } + if (!new java.io.File(dir.toString()).exists()) { + LOG.warn("No such directory: " + new java.io.File(dir.toString())); + } Path servers = new Path(dir, "search-servers.txt"); if (fs.exists(servers)) { if (LOG.isInfoEnabled()) { WDYT? Thanks Michi -- Michael Wechner Wyona - Open Source Content Management - Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED][EMAIL PROTECTED] +41 44 272 91 61 ~~~ 101tec Inc. Menlo Park, California http://www.101tec.com
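One caveat about the patch: it checks the directory through java.io.File, i.e. on the local filesystem, while the line right below it goes through the Hadoop FileSystem handle, so the warning would fire spuriously when the crawl directory lives in DFS. A hypothetical variant of the same check using the fs handle already in scope in NutchBean:

// Hypothetical variant of the patched check: use the Hadoop FileSystem
// handle (fs) that NutchBean already holds, so the warning is also
// correct when the crawl directory lives in DFS rather than locally.
if (!fs.exists(dir)) {
  LOG.warn("No such directory: " + dir);
}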
Re: [Fwd: Re: [Nutch Wiki] Update of RenaudRichardet by RenaudRichardet]
Hi Renaud, I think you meant editing http://wiki.apache.org/nutch/RunNutchInEclipse , not http://wiki.apache.org/nutch/RenaudRichardet , right? Right! Sorry for the misunderstanding. I had no idea it was your personal page, so it was a bad move to edit it. :-) Thanks again for creating the debugging-nutch-within-eclipse page. Stefan
Re: [Nutch Wiki] Update of RenaudRichardet by RenaudRichardet
Hi Renaud, I updated your page with some more details; I hope that is ok with you. Thanks for creating it. Stefan On 23.08.2006, at 11:51, Apache Wiki wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by RenaudRichardet: http://wiki.apache.org/nutch/RenaudRichardet New page: {{{ Renaud Richardet COO America Wyona Inc. - Open Source Content Management - Apache Lenya office +1 857 776-3195 mobile +1 617 230 9112 renaud.richardet at wyona.com http://www.wyona.com }}}
Re: Junit testing, was: Re: [jira] Updated: (NUTCH-357) crawling simulation
One must also remember that proper junit testing can be used to verify functionality. There's a lot of code currently that is not guarded by unit tests, and I hereby invite everybody to participate in this endless effort and make Nutch unit tests better ;) I completely agree!!! Nutch has more bugs than ever before, since most of the .8 code was developed without tests. Stefan
[jira] Commented: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled
[ http://issues.apache.org/jira/browse/NUTCH-354?page=comments#action_12429496 ] Stefan Groschupf commented on NUTCH-354: Since this issue is already closed I can not attach the patch file, so I attach it as text within this comment. If you need the file, let me know and I will send you an offlist mail. Index: src/test/org/apache/nutch/crawl/TestMapWritable.java === --- src/test/org/apache/nutch/crawl/TestMapWritable.java (revision 432325) +++ src/test/org/apache/nutch/crawl/TestMapWritable.java (working copy) @@ -180,6 +180,31 @@ assertEquals(before, after); } + public void testRecycling() throws Exception { + UTF8 value = new UTF8("value"); + UTF8 key1 = new UTF8("a"); + UTF8 key2 = new UTF8("b"); + + MapWritable writable = new MapWritable(); + writable.put(key1, value); + assertEquals(writable.get(key1), value); + assertNull(writable.get(key2)); + + DataOutputBuffer dob = new DataOutputBuffer(); + writable.write(dob); + writable.clear(); + writable.put(key1, value); + writable.put(key2, value); + assertEquals(writable.get(key1), value); + assertEquals(writable.get(key2), value); + + DataInputBuffer dib = new DataInputBuffer(); + dib.reset(dob.getData(), dob.getLength()); + writable.readFields(dib); + assertEquals(writable.get(key1), value); + assertNull(writable.get(key2)); + } + public static void main(String[] args) throws Exception { TestMapWritable writable = new TestMapWritable(); writable.testPerformance(); MapWritable, nextEntry is not reset when Entries are recycled -- Key: NUTCH-354 URL: http://issues.apache.org/jira/browse/NUTCH-354 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.9.0, 0.8.1 Attachments: resetNextEntryInMapWritableV1.patch MapWritable recycles entries from its internal linked list for performance reasons. The nextEntry of an entry is not reset in case a recyclable entry is found. This can cause wrong data in a MapWritable. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Fwd: [webspam-announces] Web Spam Collection Announced
Hi, maybe some people will find this posting interesting. Webspam is one of the biggest issues for nutch whole-web crawls, from my POV. Greetings, Stefan During AIRWeb'06 we announced the availability of the collection. We are currently planning a Web Spam challenge based on the dataset we have built. I assume most of you will be interested in this, so I have moved the webspam-volunteers list to webspam-announces. If you do not want to be on this new webspam-announces list, please send me an e-mail. This was shown during AIRWeb in Seattle: . Web Spam Collection Available August 10th, 2006 We are pleased to announce the availability of a public collection for research on Web spam. This collection is the result of efforts by a team of volunteers: Thiago Alves, Antonio Gulli, Tamas Sarlos, Luca Becchetti, Zoltan Gyongyi, Mike Thelwall, Paolo Boldi, Thomas Lavergn, Belle Tseng, Paul Chirita, Alex Ntoulas, Tanguy Urvoy, Mirel Cosulschi, Josiane-Xavier Parreira, Wenzhong Zhao, Brian Davison, Xiaoguang Qi, Pascal Filoche, Massimo Santini. The corpus is a large set of Web pages in 11,000 .uk hosts downloaded in May 2006 by the Laboratory of Web Algorithmics, Università degli Studi di Milano. The labelling process was coordinated by Carlos Castillo working at the Algorithmic Engineering group at Università di Roma "La Sapienza". The project was funded by the DELIS project (Dynamically Evolving, Large Scale Information Systems). Volunteers were provided with a set of guidelines and were asked to mark a set of hosts as either normal, spam, or borderline. The collection includes about 6,700 judgments done by the volunteers and can be used for testing link-based and content-based Web spam detection and demotion techniques. More information is available on our Web page, including the guidelines given to the human judges, the instructions for obtaining the links and contents of the pages in this collection, and the contact information for questions and comments. http://aeserver.dis.uniroma1.it/webspam/ If you use this data set please subscribe to our mailing list by sending an e-mail to [EMAIL PROTECTED] -- Carlos Castillo Universita di Roma La Sapienza Rome, ITALY
[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak
[ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429534 ] Stefan Groschupf commented on NUTCH-356: Hi Enrico, there will be as many PluginRepositories as there are Configuration objects. So in case you create many Configuration objects you will have a problem with memory. There is no way around having a singleton pluginrepository. However, you can reset the pluginRepository by removing the cached object from the configuration object. In any case, not caching the pluginrepository is a bad idea; think about writing your own plugin that solves your problem - that should be a cleaner solution. Would you agree to close this issue, since we will not be able to commit your changes? Stefan Plugin repository cache can lead to memory leak --- Key: NUTCH-356 URL: http://issues.apache.org/jira/browse/NUTCH-356 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Enrico Triolo Attachments: NutchTest.java, patch.txt While I was trying to solve a problem I reported a while ago (see Nutch-314), I found out that the problem was actually related to the plugin cache used in the PluginRepository.java class. As I said in Nutch-314, I think I somehow 'force' the way nutch is meant to work, since I need to frequently submit new urls and append their contents to the index; I don't (and can't) have an urls.txt file with all the urls I'm going to fetch, but I recreate it each time a new url is submitted. Thus, I think in the majority of cases you won't have problems using nutch as-is, since the problem I found occurs only if nutch is used in a way similar to mine. To simplify your test I'm attaching a class that performs something similar to what I need. It fetches and indexes some sample urls; to avoid webmasters' complaints I left the sample urls list empty, so you should modify the source code and add some urls. Then you only have to run it and watch your memory consumption with top. In my experience I get an OutOfMemoryException after a couple of minutes, but it clearly depends on your heap settings and on the plugins you are using (I'm using 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic'). The problem is bound to the PluginRepository 'singleton' instance, since it never gets released. It seems that some class maintains a reference to it, and this class is never released since it is cached somewhere in the configuration. So I modified the PluginRepository's 'get' method so that it never uses the cache and always returns a new instance (you can find the patch in the attachment). This way the memory consumption is always stable and I get no OOM anymore. Clearly this is not the solution, since I guess there are many performance issues involved, but for the moment it works. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
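The practical consequence for API users is to create one Configuration and reuse it across requests, rather than creating one per submitted url. A minimal sketch of the two patterns; the loops exist only to make the leak visible:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.plugin.PluginRepository;
import org.apache.nutch.util.NutchConfiguration;

public class PluginCacheDemo {
  public static void main(String[] args) {
    // Leak pattern: every fresh Configuration gets its own cached
    // PluginRepository, and none of them is ever released.
    for (int i = 0; i < 1000; i++) {
      Configuration conf = NutchConfiguration.create();
      PluginRepository.get(conf);       // builds a new repository each time
    }

    // Fix pattern: one Configuration shared by all requests means one
    // cached PluginRepository for the lifetime of the application.
    Configuration shared = NutchConfiguration.create();
    for (int i = 0; i < 1000; i++) {
      PluginRepository.get(shared);     // same repository reused
    }
  }
}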
[jira] Created: (NUTCH-357) crawling simulation
crawling simulation --- Key: NUTCH-357 URL: http://issues.apache.org/jira/browse/NUTCH-357 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Fix For: 0.9.0 We recently discovered some serious issues related to crawling and scoring. Reproducing these problems is kind of difficult, since first of all it is not polite to re-crawl a set of pages again and again, and secondly it is difficult to catch the page that causes a problem. Therefore it would be very useful to have a testbed to simulate crawls where we can control the responses of web servers. For the very beginning, simulating very basic situations like a page pointing to itself, link chains, or internal links would already be very useful. However, later on simulating crawls against existing data collections like TREC or a webgraph would be much more interesting, for instance to calculate the quality of the nutch OPIC implementation against page rank scores of the webgraph or to evaluate crawling strategies. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-357) crawling simulation
[ http://issues.apache.org/jira/browse/NUTCH-357?page=all ] Stefan Groschupf updated NUTCH-357: --- Attachment: protocol-simulation-pluginV1.patch A very first preview of a plugin that helps to simulate crawls. This protocol plugin can be used to replace the http protocol plugin and return defined content during a fetch. To simulate custom scenarios, an interface named Simulator can be implemented with just one method. The plugin comes with a very simple basic Simulator implementation; however, this already allows simulating the nutch scoring problems known today, like pages pointing to themselves or link chains. For more details see the java doc; I plan to improve the java doc with a native speaker. Feedback is welcome. crawling simulation --- Key: NUTCH-357 URL: http://issues.apache.org/jira/browse/NUTCH-357 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Fix For: 0.9.0 Attachments: protocol-simulation-pluginV1.patch We recently discovered some serious issues related to crawling and scoring. Reproducing these problems is kind of difficult, since first of all it is not polite to re-crawl a set of pages again and again, and secondly it is difficult to catch the page that causes a problem. Therefore it would be very useful to have a testbed to simulate crawls where we can control the responses of web servers. For the very beginning, simulating very basic situations like a page pointing to itself, link chains, or internal links would already be very useful. However, later on simulating crawls against existing data collections like TREC or a webgraph would be much more interesting, for instance to calculate the quality of the nutch OPIC implementation against page rank scores of the webgraph or to evaluate crawling strategies. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
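The patch itself is not reproduced here; going by the description, the extension surface is a single-method interface, roughly of the following hypothetical shape (names and types are guesses for illustration only, the actual patch may differ):

// Hypothetical shape of the single-method Simulator interface described
// above: given a requested url, return the content the simulated web
// server should serve for it.
public interface Simulator {
  String getContentFor(String url);
}

// Trivial implementation simulating a page that links only to itself,
// one of the scoring problems mentioned in the issue.
class SelfLinkSimulator implements Simulator {
  public String getContentFor(String url) {
    return "<html><body><a href=\"" + url + "\">self</a></body></html>";
  }
}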
[jira] Created: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled
MapWritable, nextEntry is not reset when Entries are recycled --- Key: NUTCH-354 URL: http://issues.apache.org/jira/browse/NUTCH-354 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8.1, 0.9.0 MapWritable recycles entries from its internal linked list for performance reasons. The nextEntry of an entry is not reset in case a recyclable entry is found. This can cause wrong data in a MapWritable. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled
[ http://issues.apache.org/jira/browse/NUTCH-354?page=all ] Stefan Groschupf updated NUTCH-354: --- Attachment: resetNextEntryInMapWritableV1.patch Resets the nextEntry of a recycled entry. MapWritable, nextEntry is not reset when Entries are recycled -- Key: NUTCH-354 URL: http://issues.apache.org/jira/browse/NUTCH-354 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.9.0, 0.8.1 Attachments: resetNextEntryInMapWritableV1.patch MapWritable recycles entries from its internal linked list for performance reasons. The nextEntry of an entry is not reset in case a recyclable entry is found. This can cause wrong data in a MapWritable. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-343) Index MP3 SHA1 hashes
[ http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ] Stefan Groschupf commented on NUTCH-343: Thanks for the contribution, and also for including a test with your patch. :-) Just a small comment from taking a first look at the patch file: my personal experience is that some nutch developers have strong opinions about code formatting, so you may want to check your code formatting. :-) Index MP3 SHA1 hashes - Key: NUTCH-343 URL: http://issues.apache.org/jira/browse/NUTCH-343 Project: Nutch Issue Type: New Feature Affects Versions: 0.8, 0.9.0, 0.8.1 Reporter: Hasan Diwan Attachments: parsemp3.pat Add indexing of the mp3's sha1 hash. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
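For context on what such a patch computes, hashing a file's bytes with the JDK's MessageDigest looks roughly like this. This is a generic sketch, not the code from the attached parsemp3.pat:

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

// Generic sketch: compute the SHA-1 hash of a file (e.g. an mp3) as a
// hex string, which an indexing filter could then store as a field.
public class Sha1Hash {
  public static String sha1Of(String path) throws Exception {
    MessageDigest digest = MessageDigest.getInstance("SHA-1");
    InputStream in = new FileInputStream(path);
    try {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        digest.update(buf, 0, n);
      }
    } finally {
      in.close();
    }
    StringBuffer hex = new StringBuffer();
    for (byte b : digest.digest()) {
      hex.append(Integer.toHexString((b >> 4) & 0xf));
      hex.append(Integer.toHexString(b & 0xf));
    }
    return hex.toString();
  }
}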
[jira] Updated: (NUTCH-341) IndexMerger now deletes entire workingdir after completing
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ] Stefan Groschupf updated NUTCH-341: --- Attachment: doNotDeleteTmpIndexMergeDirV1.patch +1. I agree it makes no sense at all to require creating a tmp folder manually only for nutch to delete it afterwards with all its content. Very dangerous if a user provides / as the tmp folder. The attached patch rolls back the missing line, and I would love to ask that a developer with write access roll this in asap! THANKS! IndexMerger now deletes entire workingdir after completing Key: NUTCH-341 URL: http://issues.apache.org/jira/browse/NUTCH-341 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.8 Reporter: Chris Schneider Priority: Critical Attachments: doNotDeleteTmpIndexMergeDirV1.patch Change 383304 deleted the following line near Line 117 (see http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexMerger.java?r1=383304&r2=405204&diff_format=h for details): workDir = new File(workDir, "indexmerger-workingdir"); Previously, if no -workingdir workingdir parameter was specified, IndexMerger.main() would place an indexmerger-workingdir directory into the default directory and then delete the former after completing. Now, IndexMerger.main() defaults the value of its workDir to indexmerger within the default directory, and deletes this workDir afterward. However, if -workingdir workingdir _is_ specified, IndexMerger.main() will now set workDir to _this_ path and delete the _entire_ workingdir afterward. Previously, IndexMerger.main() would only delete workingDir/indexmerger-workingdir, without deleting workingdir itself. This is because the line mentioned above always appended indexmerger-workingdir to workDir. Our hardware configuration on the jobtracker/namenode box attempts to keep all large datasets on a separate, large hard drive. Accordingly, we were keeping dfs.name.dir, dfs.data.dir, mapred.system.dir, and mapred.local.dir on this drive. Unfortunately, we were passing the folder containing these folders in the workingdir parameter to the IndexMerger. As a result, the first time we ran the IndexMerger, we ended up trashing our entire DFS! Perhaps the way that the IndexMerger handles its workingdir parameter now is an acceptable design. However, given the way it handled this parameter in the past, I feel that the current implementation is unacceptably dangerous. More importantly, perhaps there's some way that we could make hadoop more robust in handling its critical data files. I plan to place a directory owned by root with dr permissions into each of these critical directories in order to prevent any of them from suffering the fate of our DFS. This could become part of a standard hadoop installation. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Attachment: respectFetcherParsePropertyV1.patch Hi Jeremy, thanks for catching this. Attached is a fix. Should be easy for a committer to commit this to trunk. Fetcher ignores the fetcher.parse value configured in config file - Key: NUTCH-337 URL: http://issues.apache.org/jira/browse/NUTCH-337 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8, 0.9.0 Reporter: Jeremy Huylebroeck Priority: Trivial Attachments: respectFetcherParsePropertyV1.patch Using the command line call to Fetcher, if the noParsing parameter is given, everything is fine. If noParsing is not given, the value from nutch-site.xml (or nutch-default.xml) should be taken, but true is always passed to the call to fetch. It should be the value from the conf. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
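The gist of the fix: when -noParsing is absent, fall back to the configured value instead of hard-coding true. A sketch of the intent inside Fetcher.main(), not the exact patch (the fetch() signature is assumed from the 0.8 sources):

// Sketch: default to the configured fetcher.parse value and only
// override it when -noParsing is given on the command line.
boolean parsing = conf.getBoolean("fetcher.parse", true);
for (int i = 0; i < args.length; i++) {
  if ("-noParsing".equals(args[i])) {
    parsing = false;
  }
}
fetcher.fetch(segment, threads, parsing);  // pass the resolved value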
[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Priority: Major (was: Trivial) Fetcher ignores the fetcher.parse value configured in config file - Key: NUTCH-337 URL: http://issues.apache.org/jira/browse/NUTCH-337 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8, 0.9.0 Reporter: Jeremy Huylebroeck Attachments: respectFetcherParsePropertyV1.patch Using the command line call to Fetcher, if the noParsing parameter is given, everything is fine. If noParsing is not given, the value from nutch-site.xml (or nutch-default.xml) should be taken, but true is always passed to the call to fetch. It should be the value from the conf. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE
urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE -- Key: NUTCH-350 URL: http://issues.apache.org/jira/browse/NUTCH-350 Project: Nutch Issue Type: Bug Reporter: Stefan Groschupf Priority: Critical Intranet crawls or focused crawls will fetch many pages from the same host. This means a thread will often be blocked because another thread is already fetching from the same host. It is very likely that threads are blocked more often than http.max.delays allows. In such a case the HttpBase.blockAddr method throws an HttpException. This is handled in the fetcher by incrementing the crawlDatum retries and setting the status to STATUS_FETCH_RETRY. That means you have at most db.fetch.retry.max * http.max.delays chances to fetch a url. But in intranet or focused crawls it is very likely that this is not enough, and increasing one of the involved properties dramatically slows down the fetch. I suggest not increasing the CrawlDatum retriesSinceFetch in case the problem was caused by a blocked thread. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
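The suggestion expressed in code terms, roughly (illustrative only; wasBlockedByOtherThread is a hypothetical flag standing in for however the fetcher distinguishes a politeness block from a real failure):

// Only count real fetch failures against db.fetch.retry.max; a thread
// blocked by per-host politeness gets requeued without using up a retry.
if (wasBlockedByOtherThread) {
  datum.setStatus(CrawlDatum.STATUS_FETCH_RETRY);
  // retriesSinceFetch deliberately left unchanged
} else {
  datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
  datum.setStatus(CrawlDatum.STATUS_FETCH_RETRY);
}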
[jira] Updated: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE
[ http://issues.apache.org/jira/browse/NUTCH-350?page=all ] Stefan Groschupf updated NUTCH-350: --- Attachment: protocolRetryV5.patch This patch will dramatically increase the number of successfully fetched pages of an intranet crawl over time. urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE Key: NUTCH-350 URL: http://issues.apache.org/jira/browse/NUTCH-350 Project: Nutch Issue Type: Bug Reporter: Stefan Groschupf Priority: Critical Attachments: protocolRetryV5.patch Intranet crawls or focused crawls will fetch many pages from the same host. This means a thread will often be blocked because another thread is already fetching from the same host. It is very likely that threads are blocked more often than http.max.delays allows. In such a case the HttpBase.blockAddr method throws an HttpException. This is handled in the fetcher by incrementing the crawlDatum retries and setting the status to STATUS_FETCH_RETRY. That means you have at most db.fetch.retry.max * http.max.delays chances to fetch a url. But in intranet or focused crawls it is very likely that this is not enough, and increasing one of the involved properties dramatically slows down the fetch. I suggest not increasing the CrawlDatum retriesSinceFetch in case the problem was caused by a blocked thread. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages
[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12428858 ] Stefan Groschupf commented on NUTCH-322: I think this is a serious problem. Page A does a server-side redirect to Page B. Page A is never written to the output. This causes Page A never to change its state or next fetch time, which means that page A is fetched again, again, again ... ∞ I suggest that we write out Page A with a status change to STATUS_DB_GONE. Fetcher discards ProtocolStatus, doesn't store redirected pages --- Key: NUTCH-322 URL: http://issues.apache.org/jira/browse/NUTCH-322 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8 Reporter: Andrzej Bialecki Fix For: 0.9.0 Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus contains important information, such as the protocol-level response code, lastModified time, and possibly other messages. I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In addition, if ProtocolStatus contains a valid lastModified time, that CrawlDatum's modified time should also be set to this value. Additionally, Fetcher doesn't store redirected pages. Content of such pages is silently discarded. When Fetcher translates from protocol-level status to crawldb-level status it should probably store such pages with the following translation of status codes: * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code indicates a transient change, so we probably shouldn't mark the initial URL as bad. * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a permanent change, so the initial URL is no longer valid, i.e. it will always result in redirects. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
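Combining the suggestion above with the translation Andrzej proposes gives roughly the following shape in the fetcher's output path (illustrative only; the constants are the ones named in the issue, the variable names are not from the real code):

// Illustrative: when the protocol layer reports a redirect, write the
// original url back with a translated status so it is not refetched
// endlessly. Crucially, page A gets written out at all.
if (protocolStatusCode == ProtocolStatus.MOVED) {             // permanent
  datum.setStatus(CrawlDatum.STATUS_DB_GONE);
} else if (protocolStatusCode == ProtocolStatus.TEMP_MOVED) { // transient
  datum.setStatus(CrawlDatum.STATUS_DB_RETRY);
}
output.collect(url, datum);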
[jira] Created: (NUTCH-353) pages that serverside forwards will be refetched every time
pages that serverside forwards will be refetched every time --- Key: NUTCH-353 URL: http://issues.apache.org/jira/browse/NUTCH-353 Project: Nutch Issue Type: Bug Affects Versions: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8.1 Attachments: doNotRefecthForwarderPagesV1.patch Pages that do a serverside forward are not written back into the crawlDb with a status change. Also, the nextFetchTime is not changed. This causes a refetch of the same page again and again. The result is that nutch is not polite, refetching the forwarding and target page in each segment iteration. It also affects the scoring, since the forwarding page contributes its score to all outlinks. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time
[ http://issues.apache.org/jira/browse/NUTCH-353?page=all ] Stefan Groschupf updated NUTCH-353: --- Attachment: doNotRefecthForwarderPagesV1.patch Since we discussed that nutch needs to be more polite, we should fix that asap. pages that serverside forwards will be refetched every time --- Key: NUTCH-353 URL: http://issues.apache.org/jira/browse/NUTCH-353 Project: Nutch Issue Type: Bug Affects Versions: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8.1 Attachments: doNotRefecthForwarderPagesV1.patch Pages that do a serverside forward are not written back into the crawlDb with a status change. Also, the nextFetchTime is not changed. This causes a refetch of the same page again and again. The result is that nutch is not polite, refetching the forwarding and target page in each segment iteration. It also affects the scoring, since the forwarding page contributes its score to all outlinks. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ] Stefan Groschupf resolved NUTCH-322. Resolution: Duplicate duplicate of NUTCH-353 Fetcher discards ProtocolStatus, doesn't store redirected pages --- Key: NUTCH-322 URL: http://issues.apache.org/jira/browse/NUTCH-322 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8 Reporter: Andrzej Bialecki Fix For: 0.9.0 Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus contains important information, such as the protocol-level response code, lastModified time, and possibly other messages. I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In addition, if ProtocolStatus contains a valid lastModified time, that CrawlDatum's modified time should also be set to this value. Additionally, Fetcher doesn't store redirected pages. Content of such pages is silently discarded. When Fetcher translates from protocol-level status to crawldb-level status it should probably store such pages with the following translation of status codes: * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code indicates a transient change, so we probably shouldn't mark the initial URL as bad. * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a permanent change, so the initial URL is no longer valid, i.e. it will always result in redirects. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-347) Build: plugins' Jars not found
[ http://issues.apache.org/jira/browse/NUTCH-347?page=comments#action_12428915 ] Stefan Groschupf commented on NUTCH-347: Please submit this patch! Thanks! Build: plugins' Jars not found -- Key: NUTCH-347 URL: http://issues.apache.org/jira/browse/NUTCH-347 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Otis Gospodnetic Attachments: nutch_build_plugins_patch.txt While building Nutch, I noticed several places where various Jars from plugins' lib directories could not be found, for example: $ ant package ... deploy: [copy] Warning: Could not find file /home/otis/dev/repos/lucene/nutch/trunk/build/lib-log4j/lib-log4j.jar to copy. init: init-plugin: compile: jar: deps-test: deploy: [copy] Warning: Could not find file /home/otis/dev/repos/lucene/nutch/trunk/build/lib-nekohtml/lib-nekohtml.jar to copy. ... The problem is, these lib-.jar files do not exist. Instead, those Jars are typically named with a version in the name, like log4j-1.2.11.jar. I could not find where this lib- prefix comes from, nor where the version is dropped from the name. Anyone knows? In order to avoid these errors I had to make symbolic links and fake things: e.g. ln -s log4j-1.2.11.jar lib-log4j.jar But this should really be fixed somewhere, I just can't see where... :( Note that this doesn't completely break the build, but missing Jars can't be a good thing. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-346) Improve readability of logs/hadoop.log
[ http://issues.apache.org/jira/browse/NUTCH-346?page=comments#action_12428917 ] Stefan Groschupf commented on NUTCH-346: +1 I agree; can you please create a patch file and attach it to this bug? Thanks Improve readability of logs/hadoop.log -- Key: NUTCH-346 URL: http://issues.apache.org/jira/browse/NUTCH-346 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Environment: ubuntu dapper Reporter: Renaud Richardet Priority: Minor Adding log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN to conf/log4j.properties dramatically improves the readability of the logs in logs/hadoop.log (it removes all the PluginRepository INFO messages). -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-345) Add support for Content-Encoding: deflated
[ http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428918 ] Stefan Groschupf commented on NUTCH-345: Shouldn't the DeflateUtils also be part of the protocol-http plugin? Also, since it is a larger contribution and not just a small bug fix, it would be great to have a junit test within the patch. Thanks for the contribution. Add support for Content-Encoding: deflated -- Key: NUTCH-345 URL: http://issues.apache.org/jira/browse/NUTCH-345 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.9.0 Reporter: Pascal Beis Priority: Minor Attachments: nutch-deflate.patch Add support for the deflate content-encoding, next to the already implemented GZIP content-encoding. Patch attached. See also the 'Patch: deflate encoding' thread on nutch-dev on August 7/8 2006. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
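For reference, inflating a deflate-encoded body needs only java.util.zip. A generic sketch, not the attached patch; note that some servers send raw deflate data without the zlib header, which requires new Inflater(true) instead of the default:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.InflaterInputStream;

// Generic sketch: decompress a "Content-Encoding: deflate" body.
public class DeflateDemo {
  public static byte[] inflate(byte[] compressed) throws Exception {
    InflaterInputStream in =
        new InflaterInputStream(new ByteArrayInputStream(compressed));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    in.close();
    return out.toByteArray();
  }
}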
[jira] Commented: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8
[ http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428537 ] Stefan Groschupf commented on NUTCH-349: My vote goes to #2. Having a tool that needs to be started manually would be better than complicating the already fragile code, from my point of view. Port Nutch to use Hadoop Text instead of UTF8 - Key: NUTCH-349 URL: http://issues.apache.org/jira/browse/NUTCH-349 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Reporter: Andrzej Bialecki Currently Nutch uses the org.apache.hadoop.io.UTF8 class to store/read Strings. This class has been deprecated in Hadoop 0.5.0, and the Text class should be used instead. Sooner or later we will need to move Nutch to use this class instead of UTF8. This raises numerous issues regarding the compatibility of existing data in CrawlDB, LinkDB and segments. I can see two ways to solve this: * add code in the readers of the respective formats to convert UTF8 to Text on the fly. New writers would only use Text. This is less than ideal, because it complicates the code, and also at some point in time the UTF8 class will be removed. * create a converter (to be maintained as long as UTF8 exists), which converts existing data in bulk from UTF8 to Text. This requires an additional processing step when upgrading, to convert all existing data to the new format. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
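Option #2 amounts to a one-shot rewrite of every SequenceFile keyed by UTF8. A rough sketch, assuming the direct SequenceFile.Reader/Writer constructors of that Hadoop generation; check the exact constructors or factory methods of your Hadoop version before relying on this, and note that a real tool would have to walk the crawldb, linkdb, and segment directories:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.UTF8;
import org.apache.hadoop.io.Writable;

// Sketch of a one-shot converter: rewrite a SequenceFile keyed by the
// deprecated UTF8 class into one keyed by Text. The value class is
// passed in because it differs per data structure (CrawlDatum, ...).
public class Utf8ToTextConverter {
  public static void convert(FileSystem fs, Path in, Path out,
      Class valueClass, Configuration conf) throws Exception {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, out, Text.class, valueClass);
    UTF8 oldKey = new UTF8();
    Writable value = (Writable) valueClass.newInstance();
    while (reader.next(oldKey, value)) {
      writer.append(new Text(oldKey.toString()), value);  // re-key as Text
    }
    writer.close();
    reader.close();
  }
}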
[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ] Stefan Groschupf commented on NUTCH-233: Hi Otis, yes, for a serious whole-web crawl I need to change this regex first. It only hangs on some random urls that come, for example, from link farms the crawler runs into. wrong regular expression hang reduce process for ever - Key: NUTCH-233 URL: http://issues.apache.org/jira/browse/NUTCH-233 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.9.0 It looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt wasn't compatible with java.util.regex, which is actually used in the regex url filter. Maybe it was missed when the regular expression package was changed. The problem was that while reducing a fetch map output the reducer hung forever, since the outputformat was applying the urlfilter to a url that caused the hang. 060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335) 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092) 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java: I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the fetch job works. (Thanks to Grant and Chris B. for helping to find the new regex.) However, maybe people can review it and suggest improvements: the old regex would match abcd/foo/bar/foo/bar/foo/ and so will the new one. But the old regex would also match abcd/foo/bar/xyz/foo/bar/foo/, which the new regex will not match. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
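The difference between the two expressions can be checked concretely on the examples from the description (runnable as-is; note that feeding the old pattern a long, repetitive link-farm url can take practically forever due to catastrophic backtracking, which is exactly the reported hang):

import java.util.regex.Pattern;

// Compares the old and the new urlfilter expression on the two example
// urls from the description. Beware: the OLD pattern can take an
// astronomical time on long repetitive urls (catastrophic backtracking).
public class RegexFilterDemo {
  public static void main(String[] args) {
    Pattern oldPattern = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    Pattern newPattern = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    String a = "abcd/foo/bar/foo/bar/foo/";      // repeated /foo/bar/
    String b = "abcd/foo/bar/xyz/foo/bar/foo/";  // repetition broken by /xyz/

    System.out.println(oldPattern.matcher(a).find());  // true
    System.out.println(newPattern.matcher(a).find());  // true
    System.out.println(oldPattern.matcher(b).find());  // true
    System.out.println(newPattern.matcher(b).find());  // false - stricter
  }
}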
[jira] Updated: (NUTCH-348) Generator is building fetch list using *lowest* scoring URLs
[ http://issues.apache.org/jira/browse/NUTCH-348?page=all ] Stefan Groschupf updated NUTCH-348: --- Attachment: sortPatchV1.patch What do people think about this kind of solution? Generator is building fetch list using *lowest* scoring URLs Key: NUTCH-348 URL: http://issues.apache.org/jira/browse/NUTCH-348 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Chris Schneider Attachments: sortPatchV1.patch Ever since revision 391271, when the CrawlDatum key was replaced by a FloatWritable key, the Generator.Selector.reduce method has been outputting the *lowest* scoring URLs! The CrawlDatum class has a Comparator that essentially treats higher scoring CrawlDatum objects as less than lower scoring CrawlDatum objects, so the higher scoring ones would appear first in a sequence file sorted using this as the key. When a FloatWritable based on the score itself (as returned from scfilters.generatorSortValue) became the sort key, it should have been negated in Generator.Selector.map to achieve the same result. Curiously, there is a comment to this effect immediately before the FloatWritable is set: // sort by decreasing score sortValue.set(sort); It seems like the simplest way to fix this is to just negate the score, and this seems to work for me: // sort by decreasing score // 2006-08-15 CSc REALLY sort by decreasing score sortValue.set(-sort); Unfortunately, this means that any crawls that have been done using Generator.java after revision 391271 should be discarded, as they were focused on fetching the lowest scoring unfetched URLs in the crawldb, essentially pointing the crawler 180 degrees from its intended direction. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-332) doubling score causes by page internal anchors.
doubling score causes by page internal anchors. --- Key: NUTCH-332 URL: http://issues.apache.org/jira/browse/NUTCH-332 Project: Nutch Issue Type: Bug Affects Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8-dev When a page has no outlinks but several links to itself, e.g. it has a set of anchors, the score of the page is distributed to its outlinks. But all these outlinks point back to the page. This causes the page score to be doubled. I'm not sure, but this may also cause a never-ending fetching loop for this page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer line 107. So the status fetched may be overwritten with unfetched. In such a case we fetch the page again every time and also double the score of this page every time, which causes very high scores without any reason. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
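One fix direction implied by the description is to skip self-referential outlinks when distributing score. An illustrative fragment; contribute() and the surrounding variables are hypothetical placeholders, only Outlink.getToUrl() is the real accessor:

// Illustrative: when distributing a page's score to its outlinks, skip
// links that point back to the page itself (in-page anchors normalize
// to the page url), so a page cannot feed score back to itself.
float share = pageScore / outlinks.length;
for (Outlink link : outlinks) {
  if (link.getToUrl().equals(pageUrl)) {
    continue;  // self-link: do not contribute score back to the page
  }
  contribute(link.getToUrl(), share);  // contribute() is a placeholder
}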
[jira] Commented: (NUTCH-318) log4j not properly configured, readdb doesn't give any information
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423539 ] Stefan Groschupf commented on NUTCH-318: Yes, this happens only in a distributed environment. Please also see my last mail on the hadoop-dev list. I think there are more general logging problems that only occur in a distributed environment, so you will not track them down using the local runner.

log4j not properly configured, readdb doesn't give any information

Key: NUTCH-318
URL: http://issues.apache.org/jira/browse/NUTCH-318
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Fix For: 0.9-dev

In the latest 0.8 sources the readdb command doesn't dump any information anymore. This is related to the misconfigured log4j.properties file. Changing

    log4j.rootLogger=INFO,DRFA

to

    log4j.rootLogger=INFO,DRFA,stdout

dumps the information to the console, but not in a nice way. What makes me wonder is that this information should also be in the log file, but it isn't, so there may be further problems here. Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?
[jira] Commented: (NUTCH-318) log4j not properly configured, readdb doesn't give any information
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423433 ] Stefan Groschupf commented on NUTCH-318: Shouldn't that be fixed in 0.8, since as of today this tool just produces no output?!

log4j not properly configured, readdb doesn't give any information

Key: NUTCH-318
URL: http://issues.apache.org/jira/browse/NUTCH-318
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Fix For: 0.9-dev

In the latest 0.8 sources the readdb command doesn't dump any information anymore. This is related to the misconfigured log4j.properties file. Changing log4j.rootLogger=INFO,DRFA to log4j.rootLogger=INFO,DRFA,stdout dumps the information to the console, but not in a nice way. What makes me wonder is that this information should also be in the log file, but it isn't, so there may be further problems here. Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?
[jira] Commented: (NUTCH-233) wrong regular expression hangs reduce process forever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ] Stefan Groschupf commented on NUTCH-233: I think this should be fixed in 0.8 too, since everybody who does a real whole-web crawl with over 100 million pages will run into this problem. The problematic URLs come, for example, from spam-bot-generated pages.

wrong regular expression hangs reduce process forever

Key: NUTCH-233
URL: http://issues.apache.org/jira/browse/NUTCH-233
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.9-dev

It looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt is not compatible with java.util.regex, which is what the regex URL filter actually uses. It was probably missed when the regular expression package was changed. The symptom was that the reducer of a fetch job hung forever, because the output format applied the URL filter to a URL that triggered the hang:

060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java:

I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the fetch job works (thanks to Grant and Chris B. for helping to find the new regex). However, it would be good if people reviewed it and suggested improvements. Both the old and the new regex match abcd/foo/bar/foo/bar/foo/, but the old regex also matches abcd/foo/bar/xyz/foo/bar/foo/, which the new one does not.
Re: segread vs. readseg
I like it! On 24.07.2006, at 16:10, Andrzej Bialecki wrote: Stefan Neufeind wrote: Andrzej Bialecki wrote: Stefan Groschupf wrote: Hi developers, we have commands like readdb and readlinkdb, but segread. Wouldn't it be more consistent to name the command readseg instead of segread? ... just a thought. Yes, it seems more consistent. However, if we change it then scripts people wrote would break. We could support both aliases in 0.8 and give a deprecation message. What do others think? Same feeling here. Agreed. What about the following?

Index: bin/nutch
===
--- bin/nutch (revision 424960)
+++ bin/nutch (working copy)
@@ -40,7 +40,7 @@
   echo generate    generate new segments to fetch
   echo fetch       fetch a segment's pages
   echo parse       parse a segment's pages
-  echo segread     read / dump segment data
+  echo readseg     read / dump segment data
   echo mergesegs   merge several segments, with optional filtering and slicing
   echo updatedb    update crawl db from segments after fetching
   echo invertlinks create a linkdb from parsed segments
@@ -158,7 +158,10 @@
   CLASS=org.apache.nutch.crawl.CrawlDbMerger
 elif [ $COMMAND = readlinkdb ] ; then
   CLASS=org.apache.nutch.crawl.LinkDbReader
+elif [ $COMMAND = readseg ] ; then
+  CLASS=org.apache.nutch.segment.SegmentReader
 elif [ $COMMAND = segread ] ; then
+  echo [DEPRECATED] Command 'segread' is deprecated, use 'readseg' instead.
   CLASS=org.apache.nutch.segment.SegmentReader
 elif [ $COMMAND = mergesegs ] ; then
   CLASS=org.apache.nutch.segment.SegmentMerger

-- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web, Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
result comparison tool?
Hi, I remember there was a search result comparison tool within nutch. Is that still alive? How do I use it / find it? I was not able to find it by browsing the trunk sources. Is there any such tool people can suggest for comparing search results with Yahoo or Google results, to play with configuration properties and scoring mechanisms? Thanks for any hints. Stefan
nutch-extensionpoints not in plugin.includes
Hi developers, in the nutch-default.xml property plugin.includes we say: "In any case you need at least include the nutch-extensionpoints plugin." But we do not include it by default:

    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>

We should either update the text or include the plugin; anything else may confuse users. Should I open a bug, or can someone with write access just jump in and fix that? Thanks, Stefan
Re: nutch-extensionpoints not in plugin.includes
I may - but since you know the details of the plugin subsystem, tell me what _should_ be there? I.e. should we really include it in the plugin.includes list, or not? This is a philosophical question. I personally prefer strict definitions, since the application's behavior is better traceable. That was a reason I implemented the plugin system in a strict way. Later on this was washed out by the plugin auto-activation mechanism, which I still think was not a good move. However, at the moment we have the situation that nutch-extensionpoints is not included, but the auto-activation mechanism includes this plugin since it is used by all other plugins. So if you switch off auto-activation today with the default configured plugin.includes, nutch will crash. My personal point of view is to add nutch-extensionpoints and switch off auto-activation... but this is just my personal point of view... Stefan
[jira] Created: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes
UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

Key: NUTCH-325
URL: http://issues.apache.org/jira/browse/NUTCH-325
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
Fix For: 0.8-dev

In the URLFilters constructor we use an array as long as the number of filters defined in the urlfilter.order property. In case those filters are not included in the plugin.includes property, we end up putting null entries into the array. This causes an NPE in URLFilters line 82.
[jira] Updated: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes
[ http://issues.apache.org/jira/browse/NUTCH-325?page=all ] Stefan Groschupf updated NUTCH-325: Attachment: UrlFiltersNPE.patch A patch that uses an ArrayList instead of an array and only puts entries into the list when the entry is not null. This means only URL filters that were actually loaded are stored in the filters array that is cached in the Configuration object.

UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

Key: NUTCH-325
URL: http://issues.apache.org/jira/browse/NUTCH-325
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
Fix For: 0.8-dev
Attachments: UrlFiltersNPE.patch

In the URLFilters constructor we use an array as long as the number of filters defined in the urlfilter.order property. In case those filters are not included in the plugin.includes property, we end up putting null entries into the array. This causes an NPE in URLFilters line 82.
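The gist of the patch, as a hedged sketch (the real URLFilters constructor differs in detail; the interface below stands in for org.apache.nutch.net.URLFilter):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Hedged sketch: collect only filters that actually resolved, instead of
    // pre-sizing an array from urlfilter.order and leaving null slots behind.
    interface URLFilter { String filter(String url); }

    class URLFiltersSketch {
      static URLFilter[] build(String[] orderedNames, Map<String, URLFilter> loaded) {
        List<URLFilter> filters = new ArrayList<URLFilter>();
        for (String name : orderedNames) {
          URLFilter f = loaded.get(name);  // null if not in plugin.includes
          if (f != null) {
            filters.add(f);                // skip missing filters -> no NPE later
          }
        }
        return filters.toArray(new URLFilter[filters.size()]);
      }
    }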
log when blocked by robots.txt
Hi Developers, another thing in the discussion about being more polite. I suggest that we log a message in case a requested URL was blocked by a robots.txt. Optimal would be to log this message only in case the currently used agent name is specifically blocked, and not when it is a general blocking of all agents. Should I create a patch? Stefan
[jira] Updated: (NUTCH-323) CrawlDatum.set just references the MapWritable of another object but does not copy it
[ http://issues.apache.org/jira/browse/NUTCH-323?page=all ] Stefan Groschupf updated NUTCH-323: Attachment: MapWritableCopyConstructor.patch The attached patch adds a copy constructor to MapWritable and uses it in the CrawlDatum.set method. There are more places in the code where metadata is passed from one CrawlDatum to another, but I don't see any risk of concurrent usage of the MapWritable there.

CrawlDatum.set just references the MapWritable of another object but does not copy it

Key: NUTCH-323
URL: http://issues.apache.org/jira/browse/NUTCH-323
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Fix For: 0.8-dev
Attachments: MapWritableCopyConstructor.patch

Using CrawlDatum.set(aOtherCrawlDatum) copies the data from one CrawlDatum to another. But only a reference to the MapWritable is passed, which means both objects share the same MapWritable and its content. This causes problems with concurrently manipulated MapWritables and their key-value tuples.
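The shape of the fix, sketched below. Note the assumption: Nutch 0.8 shipped its own MapWritable class; the sketch uses the later org.apache.hadoop.io.MapWritable as a stand-in, so treat it as illustrative rather than the actual patch:

    import java.util.Map;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Writable;

    // Hedged sketch: take a private copy of the metadata in CrawlDatum.set()
    // instead of sharing one MapWritable instance between two datums.
    class MapWritableCopySketch {
      static MapWritable copyOf(MapWritable original) {
        MapWritable copy = new MapWritable();
        for (Map.Entry<Writable, Writable> e : original.entrySet()) {
          // a shallow copy of the entries is enough to decouple the two maps
          copy.put(e.getKey(), e.getValue());
        }
        return copy;
      }
    }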
[jira] Created: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored
db.score.link.internal and db.score.link.external are ignored

Key: NUTCH-324
URL: http://issues.apache.org/jira/browse/NUTCH-324
Project: Nutch
Issue Type: Improvement
Components: fetcher
Reporter: Stefan Groschupf
Priority: Critical

The configuration properties db.score.link.external and db.score.link.internal are ignored. For message-board web pages, or pages that have large navigation menus on each page, giving internal links a lower impact makes a lot of sense for scoring. This is also a serious problem for web spam, since spammers can set up just one domain with dynamically generated pages and thereby heavily manipulate the nutch scores. So I also suggest we give db.score.link.internal a default value of something like 0.25.
[jira] Updated: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored
[ http://issues.apache.org/jira/browse/NUTCH-324?page=all ] Stefan Groschupf updated NUTCH-324: Attachment: InternalAndExternalLinkScoreFactor.patch The patch multiplies the score of a page during distributeScoreToOutlink with db.score.link.internal or db.score.link.external.

db.score.link.internal and db.score.link.external are ignored

Key: NUTCH-324
URL: http://issues.apache.org/jira/browse/NUTCH-324
Project: Nutch
Issue Type: Improvement
Components: fetcher
Reporter: Stefan Groschupf
Priority: Critical
Attachments: InternalAndExternalLinkScoreFactor.patch

The configuration properties db.score.link.external and db.score.link.internal are ignored. For message-board web pages, or pages that have large navigation menus on each page, giving internal links a lower impact makes a lot of sense for scoring. This is also a serious problem for web spam, since spammers can set up just one domain with dynamically generated pages and thereby heavily manipulate the nutch scores. So I also suggest we give db.score.link.internal a default value of something like 0.25.
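In spirit, the patch does something like the following (a hedged sketch; the property names are the real ones from nutch-default.xml, the surrounding code is simplified and not the actual plugin):

    import org.apache.hadoop.conf.Configuration;

    // Hedged sketch: weight the score passed to an outlink by whether the
    // link stays on the same host.
    class LinkScoreSketch {
      static float outlinkScore(Configuration conf, float pageScore,
                                String fromHost, String toHost) {
        float internal = conf.getFloat("db.score.link.internal", 1.0f);
        float external = conf.getFloat("db.score.link.external", 1.0f);
        boolean sameHost = fromHost.equalsIgnoreCase(toHost);
        return pageScore * (sameHost ? internal : external);
      }
    }

With db.score.link.internal set to the suggested 0.25, self-hosted link farms would pass on only a quarter of the page's score per internal link, while cross-host links keep their full weight.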
[jira] Resolved: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace
[ http://issues.apache.org/jira/browse/NUTCH-319?page=all ] Stefan Groschupf resolved NUTCH-319. Resolution: Won't Fix Sorry, that is bogus, since it is written to the logging stream.

OPICScoringFilter should use logging API instead of printStackTrace

Key: NUTCH-319
URL: http://issues.apache.org/jira/browse/NUTCH-319
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Assigned To: Andrzej Bialecki
Priority: Trivial
Fix For: 0.8-dev

OPICScoringFilter line 107 should be a log call, not e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it?
db.max.inlinks
Hi, shouldn't db.max.inlinks be in the nutch-default.xml configuration? Stefan
OPICScoringFilter Metadata transport scores as String
Hi, OPICScoringFilter line 91:

    content.getMetadata().set(Fetcher.SCORE_KEY, "" + datum.getScore());

and in lines 96 and 102 we set and get the fetch score as Strings. :-o Wouldn't it be better to have Metadata support floats as well, instead of serializing and parsing strings? In general, wouldn't it be a good idea to have Metadata as a child of MapWritable? OO design? Any thoughts? Stefan
[jira] Created: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace
OPICScoringFilter should use logging API instead of printStackTrace

Key: NUTCH-319
URL: http://issues.apache.org/jira/browse/NUTCH-319
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Assigned To: Andrzej Bialecki
Priority: Trivial
Fix For: 0.8-dev

OPICScoringFilter line 107 should be a log call, not e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it?
Re: [Nutch-dev] Crawl error
As mentioned, set the environment variables that bin/nutch sets for eclipse as well, especially the logging-related variables! On 10.07.2006, at 00:05, AJ Chen wrote: My classpath has the conf folder. NUTCH_JAVA_HOME is set. In fact, nutch 0.7.1 is working well from my eclipse. I suspect the error comes from changes in version 0.8. The problem is that the log message does not say which file was not found, so it's hard to debug. Any idea? Thanks, AJ On 7/9/06, Stefan Groschupf wrote: Try to put the conf folder on your classpath in eclipse and set the environment variables that are set in bin/nutch. Btw, please do not crosspost. Thanks. Stefan On 09.07.2006, at 21:47, AJ Chen wrote: I checked out the 0.8 code from trunk and tried to set it up in eclipse. When trying to run Crawl from Eclipse using the args "urls -dir crawl -depth 3 -topN 50", I got the following error, which started from LogFactory.getLog(Crawl.class). Any idea what file was not found? There is a url file under the directory urls. Thanks,

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path specified)
    at java.io.FileOutputStream.openAppend(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
    at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
    at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
    at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
    at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
    at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
    at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
    at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
    at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:209)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:351)
    at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:38)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].

-AJ
[jira] Created: (NUTCH-318) log4j not properly configured, readdb doesn't give any information
log4j not properly configured, readdb doesn't give any information

Key: NUTCH-318
URL: http://issues.apache.org/jira/browse/NUTCH-318
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Fix For: 0.8-dev

In the latest 0.8 sources the readdb command doesn't dump any information anymore. This is related to the misconfigured log4j.properties file. Changing log4j.rootLogger=INFO,DRFA to log4j.rootLogger=INFO,DRFA,stdout dumps the information to the console, but not in a nice way. What makes me wonder is that this information should also be in the log file, but it isn't, so there may be further problems here. Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?
Re: [Nutch-dev] Crawl error
Try to put the conf folder on your classpath in eclipse and set the environment variables that are set in bin/nutch. Btw, please do not crosspost. Thanks. Stefan On 09.07.2006, at 21:47, AJ Chen wrote: I checked out the 0.8 code from trunk and tried to set it up in eclipse. When trying to run Crawl from Eclipse using the args "urls -dir crawl -depth 3 -topN 50", I got the following error, which started from LogFactory.getLog(Crawl.class). Any idea what file was not found? There is a url file under the directory urls. Thanks,

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path specified)
    at java.io.FileOutputStream.openAppend(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
    at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
    at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
    at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
    at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
    at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
    at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
    at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
    at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:209)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:351)
    at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:38)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].

-AJ
Re: Nutch based directory and crawler based on keyword
Hi, this question is difficult to answer, and there may be more experts on the nutch user list than on the developer list. In nutch 0.8 you can use the new scoring API to change the scoring of a page for being scheduled for crawling, based on its scores. Have a look at the OPIC score plugin and at the CrawlDatum metadata. The metadata can be used to transport information like custom category weighting scores that take effect in the CrawlDatum score calculation. Attention: this is not scoring at search time, this is scoring for crawl scheduling. Besides that, maybe the simplest way is to write an index plugin that tags a page (keywordMatch:true / false) depending on whether a keyword occurs or not. During search you then extend the search string behind the scenes with something like: yourSearchString + keywordMatch:true. Stefan On 08.07.2006, at 07:03, Syed Kamran Ali wrote: Hi, I have successfully configured nutch 0.7.2. Ran the crawler a few times, all working fine. Now I wanted to know whether there is a way to run the crawler so that it indexes a website only if it finds a certain keyword in it. Also, after the index is created, is it possible to create a categorized directory, like the yahoo and google directories? -- Thanks Kamran
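A hedged sketch of the index-plugin approach described above (the field name and keyword are invented; the real 0.8-era IndexingFilter interface takes more arguments, so treat this as a loose illustration using the Lucene API of that time):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Hedged sketch: tag each page with keywordMatch:true/false so queries
    // can be silently extended with "keywordMatch:true".
    class KeywordMatchFilter {
      private static final String KEYWORD = "nutch"; // hypothetical keyword

      public Document filter(Document doc, String pageText) {
        boolean match = pageText != null
            && pageText.toLowerCase().indexOf(KEYWORD) >= 0;
        // store untokenized so "keywordMatch:true" matches exactly
        doc.add(new Field("keywordMatch", Boolean.toString(match),
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
      }
    }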
Re: Error with Hadoop-0.4.0
Hi Jérôme, I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug. We should fix that. Stefan On 06.07.2006, at 08:54, Jérôme Charron wrote: Hi, I encountered some problems with the Nutch trunk version. In fact it seems to be related to changes for Hadoop-0.4.0 and JDK 1.5 (more precisely, since HADOOP-129 and the replacement of File by Path). In my environment, the crawl command terminates with the following error:

2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
Exception in thread "main" java.io.IOException: Input directory /localpath/crawl/crawldb/current in local is invalid.
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

By looking at the Nutch code, and simply changing line 145 of the Injector to mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)), all is working fine. Taking a closer look at the CrawlDb code, I finally don't understand why the following line is in the createJob method: job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME)); Out of curiosity, maybe a hadoop guru can explain why there is such a regression... Does somebody have the same error? Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: Error with Hadoop-0.4.0
We tried your suggested fix in the Injector: mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)), and it worked without any problem. Thanks for catching that; this saved us a lot of time. Stefan On 07.07.2006, at 16:08, Jérôme Charron wrote: I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug. Thanks for this feedback, Stefan. We should fix that. What I suggest is simply to remove line 75 in the createJob method of CrawlDb: job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME)); In fact, this method is only used by Injector.inject() and CrawlDb.update(), and the input path set in createJob is needed neither by Injector.inject() nor by CrawlDb.update(). If there is no objection, I will commit this change tomorrow. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: 0.8 release
+1, but I really would love to see NUTCH-293 as part of nutch 0.8, since this is all about being more polite. Thanks. Stefan On 05.07.2006, at 03:46, Doug Cutting wrote: +1 Piotr Kosiorowski wrote: +1. P. Andrzej Bialecki wrote: Sami Siren wrote: How would folks feel about releasing 0.8 now? There have been quite a lot of improvements and new features since the 0.7 series, and I strongly feel we should push the first 0.8 series release (alpha/beta) out the door now. It would IMO lower the barrier for first-timers to try the 0.8 series, and that would give us more feedback about the overall quality. Definitely +1. Let's do some testing, however, after the upgrade to hadoop 0.3.2 - hadoop had many, many changes, so we just need to make sure it's stable when used with Nutch... We should also check JIRA and apply any trivial fixes before the release. If there is a consensus about this, I can volunteer to be the RM. That would be great, thanks!
noindex / do not index
Hi, as far as I can see, nutch's html parser only supports the meta tag noindex (<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">), but there is an unofficial html noindex tag: http://www.webmasterworld.com/forum10003/2703.htm Maybe this would be another thing to make nutch more polite. Also, please remember my patch to support the crawl-delay property in robots.txt. That would also be something important to make nutch more polite, and maybe a better way than removing the nutch crawler identification. Thoughts? Stefan
Re: how to manipulate with MapWritable metaData in CrawlDatum structure
Hi Feng, MapWritable is a kind of hashmap. You can put in any key-value pair, but the keys and values need to be Writables: http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/Writable.html You can use UTF8 as string key and value, or ByteWritable as key and UTF8 as value, etc. Does this answer your question? Stefan On 12.06.2006, at 04:15, Feng Ji wrote: hi, I wonder how to use the MapWritable metaData in CrawlDatum.java. The API gives us some function calls, but I still don't know how to put information (String) into metaData and retrieve it, or how to convert a MapWritable variable to other types like MetaData or String. Any good samples in Nutch's java classes? thanks, Feng
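For example, a minimal sketch of the pattern Stefan describes (note the assumption: Nutch 0.8 used its own MapWritable class; the later org.apache.hadoop.io.MapWritable stands in for it here, and UTF8 was the era's string Writable, later replaced by Text):

    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.UTF8;
    import org.apache.hadoop.io.Writable;

    // Hedged sketch: writing and reading a string key/value pair in a
    // CrawlDatum-style metadata map.
    public class MetaDataExample {
      public static void main(String[] args) {
        MapWritable meta = new MapWritable();
        meta.put(new UTF8("category"), new UTF8("sports"));  // store
        Writable value = meta.get(new UTF8("category"));     // retrieve
        System.out.println(value);                            // prints: sports
      }
    }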
Re: nutch-default.xml configuration
Hi Lourival, this means all pages older than 30 days are potential candidates for a fetch list created by the segment generation process. Stefan On 12.06.2006, at 16:33, Lourival Júnior wrote: Hi all! I have a question about the nutch-default.xml configuration file. There is a parameter db.default.fetch.interval that is set by default to 30. It means that pages from the webdb are recrawled every 30 days. http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02058.html I want to know whether "recrawled" here means automatic recrawl, or whether I have to execute some shell script before this period to make updates to my WebDB possible. I really want to know this because so far I have not obtained an update in fact. Thanks a lot! -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Re: nutch-default.xml configuration
Ok. So, do you have any solution to do this job automatically? I have a shell script, but I don't know if it really works yet. Shell scripts are the best solution. Sorry if I'm being redundant. I'm learning about this tool and I have a lot of questions :). No problem, but the nutch user mailing list would be a better list for such questions. Thanks! Stefan Thanks! On 6/12/06, Dima Mazmanov wrote: Hi, Lourival. You wrote on 12 June 2006, 19:33:15: Hi all! I have a question about the nutch-default.xml configuration file. There is a parameter db.default.fetch.interval that is set by default to 30. It means that pages from the webdb are recrawled every 30 days. http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02058.html I want to know whether "recrawled" here means automatic recrawl, or whether I have to execute some shell script before this period to make updates to my WebDB possible. I really want to know this because so far I have not obtained an update in fact. Thanks a lot! You have to recrawl the db manually. -- Regards, Dima mailto:[EMAIL PROTECTED] -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
[jira] Updated: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: Attachment: ipInCrawlDatumDraftV5.patch Release candidate 1 of this patch. This patch contains:
+ adding the IP address to CrawlDatum version 5 (as byte[4])
+ an IpAddressResolver (MapRunnable) tool to look up the IPs multithreaded
+ a property to define whether the IpAddressResolver should be started as part of the crawlDb update tool, to update the parse output folder of a segment (which contains CrawlDatum status linked) before updating the crawlDb
+ use of the cached IP during generation
Please review this patch and give me any improvement suggestions. I think this is a very important issue, since it helps to do _real_ whole-web crawls and not end up in a honey pot after some fetch iterations. Also, if you like, please vote for this issue. :-) Thanks.

CrawlDatum should store IP address

Key: NUTCH-289
URL: http://issues.apache.org/jira/browse/NUTCH-289
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Reporter: Doug Cutting
Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.patch

If the CrawlDatum stored the IP address of its URL's host, then one could:
- partition fetch lists on the basis of IP address, for better politeness;
- truncate pages to fetch per IP address, rather than just per hostname. This would be a good way to limit the impact of domain spammers.
The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDb update.
[jira] Updated: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: Attachment: ipInCrawlDatumDraftV4.patch Attached is a patch that only ever uses 4 bytes for the IP. This means we ignore IPv6, which saves us 4 bytes in each CrawlDatum for now. I tested the resolver tool with a 200++ million crawldb, and on average a throughput of 500 IP lookups/sec per box is possible using 1000 threads. I really would love to get this into the sources as the basic version of having the IP address in the CrawlDatum, since I'm working on a tool set of spam detectors that all need IP addresses somehow. Maybe we exclude the tool but start with the CrawlDatum change? :-? Any improvement suggestions? Thanks.

CrawlDatum should store IP address

Key: NUTCH-289
URL: http://issues.apache.org/jira/browse/NUTCH-289
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Reporter: Doug Cutting
Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch

If the CrawlDatum stored the IP address of its URL's host, then one could:
- partition fetch lists on the basis of IP address, for better politeness;
- truncate pages to fetch per IP address, rather than just per hostname. This would be a good way to limit the impact of domain spammers.
The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDb update.
[jira] Created: (NUTCH-302) java doc of CrawlDb is wrong
java doc of CrawlDb is wrong

Key: NUTCH-302
URL: http://issues.apache.org/jira/browse/NUTCH-302
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Priority: Trivial
Fix For: 0.8-dev

CrawlDb has the same java doc as Injector.
[jira] Updated: (NUTCH-301) CommonGrams loads analysis.common.terms.file for each query
[ http://issues.apache.org/jira/browse/NUTCH-301?page=all ] Stefan Groschupf updated NUTCH-301: Attachment: CommonGramsCacheV1.patch Caches the HashMap COMMON_TERMS in the Configuration instance.

CommonGrams loads analysis.common.terms.file for each query

Key: NUTCH-301
URL: http://issues.apache.org/jira/browse/NUTCH-301
Project: Nutch
Type: Improvement
Components: searcher
Versions: 0.8-dev
Reporter: Chris Schneider
Attachments: CommonGramsCacheV1.patch

The move away from static objects toward instance variables has resulted in the CommonGrams constructor parsing its analysis.common.terms.file for each query. I'm not certain how large a performance impact this really is, but it seems like something you'd want to avoid doing for each query. Perhaps the solution is to keep around an instance of the CommonGrams object itself?
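The shape of the caching idea, sketched. Note the assumptions: the era's Configuration offered setObject/getObject for per-instance caching (the same pattern NUTCH-325 mentions for the filters array), and the parse method below is a placeholder, not the real CommonGrams code:

    import java.util.HashMap;
    import org.apache.hadoop.conf.Configuration;

    // Hedged sketch: parse analysis.common.terms.file once per Configuration
    // and stash the result, instead of re-reading the file on every query.
    class CommonTermsCache {
      @SuppressWarnings("unchecked")
      static HashMap<String, String> getCommonTerms(Configuration conf) {
        HashMap<String, String> terms =
            (HashMap<String, String>) conf.getObject("commongrams.common.terms");
        if (terms == null) {
          terms = parseTermsFile(conf);   // expensive: reads and parses the file
          conf.setObject("commongrams.common.terms", terms);
        }
        return terms;
      }

      private static HashMap<String, String> parseTermsFile(Configuration conf) {
        return new HashMap<String, String>(); // placeholder for the real parse
      }
    }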
[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt
[ http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415171 ] Stefan Groschupf commented on NUTCH-293: Any comments? There was already a posting on the nutch-agent mailing list where someone had banned nutch because nutch does not support crawl-delay. Because nutch tries to be polite, from my point of view this is a small but important change. If there are no improvement suggestions, can one of the committers take care of that, _please_? :-)

support for Crawl-delay in Robots.txt

Key: NUTCH-293
URL: http://issues.apache.org/jira/browse/NUTCH-293
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Attachments: crawlDelayv1.patch

Nutch needs support for Crawl-delay as defined in robots.txt; it is not an official standard, but a de-facto standard. See: http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html Webmasters have started blocking nutch since we do not support it.
resolving IP in...
Hi, after playing around to figure out the best place to resolve the IPs of freshly discovered urls, I agree with Andrzej that the ParseOutputFormat isn't the best place. The problem here: ParseOutputFormat is not multithreaded, and we definitely need many threads for IP lookup. I think an IP-resolving MapRunnable that preprocesses segment data (after fetching) before the crawldb update would be a better place:
+ less data to process (as opposed to processing a complete crawldb)
+ good dns cache usage, since many new urls will point to the same host (internal links)
- we may look up urls we already have in the crawldb
Any thoughts? Stefan
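A minimal sketch of the lookup-with-shared-cache pattern such a MapRunnable would use (thread count, cache type, and the unresolved marker are illustrative choices, not from the actual patch):

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.Hashtable;
    import java.util.Map;

    // Hedged sketch: resolve hosts from many threads against a shared cache.
    // Hashtable gives cheap thread safety; internal links hit the cache.
    class IpResolverSketch {
      private final Map<String, byte[]> cache = new Hashtable<String, byte[]>();

      byte[] resolve(String host) {
        byte[] ip = cache.get(host);
        if (ip == null) {
          try {
            ip = InetAddress.getByName(host).getAddress(); // blocking DNS lookup
          } catch (UnknownHostException e) {
            ip = new byte[4];                              // unresolved marker
          }
          cache.put(host, ip); // later URLs on the same host skip the lookup
        }
        return ip;
      }
    }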
[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt
[ http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415236 ] Stefan Groschupf commented on NUTCH-293: Hi Andrzej, I agree, but writing a queue-based fetcher is a big step. I already have some basic code (nio based). Also, I don't think a new fetcher would be stable enough to put into a 0.8 release. Since we plan to have a 0.8 release, I think it is a good idea for now to add this functionality. Maybe we make it configurable and switch it off by default? In any case I suggest we solve NUTCH-289 first and then get the fetcher done.

support for Crawl-delay in Robots.txt

Key: NUTCH-293
URL: http://issues.apache.org/jira/browse/NUTCH-293
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Attachments: crawlDelayv1.patch

Nutch needs support for Crawl-delay as defined in robots.txt; it is not an official standard, but a de-facto standard. See: http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html Webmasters have started blocking nutch since we do not support it.
Re: svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar
As far as I understand, hadoop uses commons logging. Should we switch to using commons logging as well? On 06.06.2006, at 11:02, Jérôme Charron wrote: URL: http://svn.apache.org/viewvc?rev=411943&view=rev Log: Updating to Hadoop release 0.3.1. Hadoop now uses Jakarta Commons Logging, configured for log4j by default. If log4j is now included in the core, we can remove the lib-log4j plugin. If no objection, I will do it. Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ] Stefan Groschupf commented on NUTCH-258: Scott, I agree with you. However, we need a clean patch to solve the problem; we cannot just comment things out of the code. So I vote for the issue and I vote to reopen this issue.

Once Nutch logs a SEVERE log item, Nutch fails forevermore

Key: NUTCH-258
URL: http://issues.apache.org/jira/browse/NUTCH-258
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: All
Reporter: Scott Ganyo
Priority: Critical
Attachments: dumbfix.patch

Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. This is from the run() method in Fetcher.java:

    public void run() {
      synchronized (Fetcher.this) {activeThreads++;} // count threads
      try {
        UTF8 key = new UTF8();
        CrawlDatum datum = new CrawlDatum();
        while (true) {
          if (LogFormatter.hasLoggedSevere())   // something bad happened
            break;                              // exit

Notice the last two lines. This will prevent Nutch from ever fetching again once this is hit, as LogFormatter stores this data in a static. (Also note that LogFormatter.hasLoggedSevere() is also checked in org.apache.nutch.net.URLFilterChecker and will disable this class as well.) This must be fixed or Nutch cannot be run as any kind of long-running service. Furthermore, I believe it is a poor decision to rely on a logging event to determine the state of the application - this could have any number of side effects that would be extremely difficult to track down. (As it already has for me.)
[jira] Updated: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: Attachment: ipInCrawlDatumDraftV1.patch To keep the discussion alive, attached is a _first draft_ for storing the IP in the CrawlDatum, for public discussion. Some notes: the IP is stored as byte[] in the CrawlDatum itself, not in the metadata. There is an IpAddressResolver MapRunnable tool to update a crawlDb using multithreaded IP lookups. In case an IP is available in the CrawlDatum, the Generator uses the cached IP. To discuss: I don't like the idea of post-processing the complete crawlDb every time after an update; processing the crawlDb is expensive in storage usage and time. We could have a property ipLookups with the possible values never|duringParsing|postUpdateDb. Then we could also add some code to look up the IP in the ParseOutputFormat, as discussed, or start the IpAddressResolver as a job in the updateDb tool code. At the moment I write the IP address bytes like this:

    out.writeInt(ipAddress.length);
    out.write(ipAddress);

I think for now we can define that byte[] ipAddress is always 4 bytes long - or should we be IPv6-compatible today? Please give me some comments; I have a strong interest in getting this issue fixed asap and I'm willing to improve things as required. :-)

CrawlDatum should store IP address

Key: NUTCH-289
URL: http://issues.apache.org/jira/browse/NUTCH-289
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Reporter: Doug Cutting
Attachments: ipInCrawlDatumDraftV1.patch

If the CrawlDatum stored the IP address of its URL's host, then one could:
- partition fetch lists on the basis of IP address, for better politeness;
- truncate pages to fetch per IP address, rather than just per hostname. This would be a good way to limit the impact of domain spammers.
The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDb update.
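With the fixed 4-byte IPv4 convention that the V4 patch later adopted, the length prefix becomes unnecessary; a hedged sketch of the read/write pair under that assumption (field and class names are invented, not from the patch):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    // Hedged sketch: fixed-width IPv4 serialization for a CrawlDatum field --
    // always 4 bytes, no length prefix, IPv6 deliberately ignored.
    class IpField {
      private byte[] ipAddress = new byte[4];

      void write(DataOutput out) throws IOException {
        out.write(ipAddress);        // exactly 4 bytes, no writeInt(length)
      }

      void readFields(DataInput in) throws IOException {
        in.readFully(ipAddress);     // read the same fixed 4 bytes back
      }
    }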
Re: [Nutch-cvs] svn commit: r411594 - /lucene/nutch/trunk/contrib/web2/plugins/build.xml
hmm... I didn't think about that; are there more opinions on this? I don't believe in this "don't be evil" thing at all. I think it is just a question of time until google feels we are attacking the appliance-server market, and I believe nutch has a serious chance to do so (some time in the far future :-) ). Stefan -- Sami Siren wrote: Are you sure there is no trademark infringement here? Perhaps we should call it something else, just to avoid any potential legal unpleasantries ...
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
I have a proposal for a simple solution: set a flag in the current Configuration instance, and check for this flag. The Configuration instance provides a task-specific context persisting throughout the lifetime of a task - but limited only to that task. Voila - problem solved. We get rid of the dubious use of LogFormatter (I hope, Chris, that even you would agree that this pattern is slightly... unusual ;) ), and we gain a flexible mechanism limited in scope to the current task, which ensures isolation from other tasks in the same JVM. How about that? Wonderful idea :-D +1
[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ] Stefan Groschupf updated NUTCH-298: Summary: if a 404 for a robots.txt is returned a NPE is thrown (was: if a 404 for a robots.txt is returned no page is fetched at all from the host) Sorry, wrong description.

if a 404 for a robots.txt is returned a NPE is thrown

Key: NUTCH-298
URL: http://issues.apache.org/jira/browse/NUTCH-298
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Fix For: 0.8-dev
Attachments: fixNpeRobotRuleSet.patch

What happens: if no RobotRuleSet is in the cache for a host, we try to fetch the robots.txt. In case the http response code is not 200 or 403 but, for example, 404, we do robotRules = EMPTY_RULES; (line 402). EMPTY_RULES is a RobotRuleSet created with the default constructor, so tmpEntries and entries are null and are never changed. If we now try to fetch a page from that host, EMPTY_RULES is used and we call isAllowed on the RobotRuleSet. In this case an NPE is thrown in this line:

    if (entries == null) {
      entries = new RobotsEntry[tmpEntries.size()];

Possible solution: we can initialize tmpEntries by default and also remove other null checks and initializations.
Re: search engine spam detector
The idea to have something like this as a nutch module (dropping pages or ranking them very low) might come up :-) That will be a very long way. I collected some thoughts and a list of web-spam-related papers in my blog: http://www.find23.net/Web-Site/blog/521BA1CD-14C4-4E84-A072-F98E13CAEFE1.html Feedback is welcome. Stefan
[jira] Created: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host
if a 404 for a robots.txt is returned no page is fetched at all from the host

Key: NUTCH-298
URL: http://issues.apache.org/jira/browse/NUTCH-298
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Fix For: 0.8-dev

What happens: if no RobotRuleSet is in the cache for a host, we try to fetch the robots.txt. In case the http response code is not 200 or 403 but, for example, 404, we do robotRules = EMPTY_RULES; (line 402). EMPTY_RULES is a RobotRuleSet created with the default constructor, so tmpEntries and entries are null and are never changed. If we now try to fetch a page from that host, EMPTY_RULES is used and we call isAllowed on the RobotRuleSet. In this case an NPE is thrown in this line:

    if (entries == null) {
      entries = new RobotsEntry[tmpEntries.size()];

Possible solution: we can initialize tmpEntries by default and also remove other null checks and initializations.
[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ] Stefan Groschupf updated NUTCH-298: Attachment: fixNpeRobotRuleSet.patch Fixes the NPE in RobotRuleSet that happens when an empty rule set is used.

if a 404 for a robots.txt is returned no page is fetched at all from the host

Key: NUTCH-298
URL: http://issues.apache.org/jira/browse/NUTCH-298
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Fix For: 0.8-dev
Attachments: fixNpeRobotRuleSet.patch

What happens: if no RobotRuleSet is in the cache for a host, we try to fetch the robots.txt. In case the http response code is not 200 or 403 but, for example, 404, we do robotRules = EMPTY_RULES; (line 402). EMPTY_RULES is a RobotRuleSet created with the default constructor, so tmpEntries and entries are null and are never changed. If we now try to fetch a page from that host, EMPTY_RULES is used and we call isAllowed on the RobotRuleSet. In this case an NPE is thrown in this line:

    if (entries == null) {
      entries = new RobotsEntry[tmpEntries.size()];

Possible solution: we can initialize tmpEntries by default and also remove other null checks and initializations.
RobotRuleSet
Hi, I just posted a fix for an NPE in case an empty RobotRuleSet is used. The patch only contains a two-line fix, since I learned that this is the best way to get things committed sooner. :) However, I really don't like the RobotRuleSet implementation, since entries are copied between an ArrayList and an array for no reason, from my point of view. I would love to change that to just use the ArrayList. Any thoughts? Can I get a vote from one committer who would commit this to the sources in case I make this change? :-) Thanks. Stefan
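The simplification Stefan proposes would look roughly like this (a hedged sketch, not the actual class; RobotsEntry is reduced to a prefix/allowed pair for illustration):

    import java.util.ArrayList;
    import java.util.List;

    // Hedged sketch: keep RobotsEntry objects in one list for their whole
    // lifetime, removing the tmpEntries-to-array copy (and, as a side effect,
    // the NPE on the empty rule set).
    class RobotRuleSetSketch {
      static class RobotsEntry {
        final String prefix; final boolean allowed;
        RobotsEntry(String prefix, boolean allowed) {
          this.prefix = prefix; this.allowed = allowed;
        }
      }

      private final List<RobotsEntry> entries = new ArrayList<RobotsEntry>();

      void addEntry(String prefix, boolean allowed) {
        entries.add(new RobotsEntry(prefix, allowed));
      }

      boolean isAllowed(String path) {
        for (RobotsEntry e : entries) {   // empty list -> allowed, no NPE
          if (path.startsWith(e.prefix)) return e.allowed;
        }
        return true;
      }
    }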
[jira] Commented: (NUTCH-282) Showing too few results on a page (Paging not correct)
[ http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414435 ] Stefan Groschupf commented on NUTCH-282: Is that related to the host grouping we discussed? Can we close this bug in that case?

Showing too few results on a page (Paging not correct)

Key: NUTCH-282
URL: http://issues.apache.org/jira/browse/NUTCH-282
Project: Nutch
Type: Bug
Components: web gui
Versions: 0.8-dev
Reporter: Stefan Neufeind

I did a search and got back the value itemsPerPage from opensearch. But the output shows results 1-8, and I have a total of 46 search results. The same happens for the web interface. Why aren't enough results fetched? The problem might be somewhere in the area where Nutch should only display a certain number of pages per site.
[jira] Commented: (NUTCH-286) Handling common error-pages as 404
[ http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ] Stefan Groschupf commented on NUTCH-286: This is difficult to realize, since the http error code is read from the response in the fetcher and set into the protocol status; content analysis can only be done during parsing. Also, such pages normally do not get a high OPIC score and should not appear in the top search results. However, this is a misconfigured http server response, so you may want to open a bug in the typo3 issue tracker. Should we close this issue?

Handling common error-pages as 404

Key: NUTCH-286
URL: http://issues.apache.org/jira/browse/NUTCH-286
Project: Nutch
Type: Improvement
Reporter: Stefan Neufeind

Idea: some pages from some software packages/scripts report an http 200 OK even though a specific page could not be found. An example I just found is: http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef That's a typo3 page explaining, in its standard layout and wording: the requested page did not exist or was inaccessible. So I had the idea that somebody might create a plugin that could find commonly used formulations for "page does not exist" etc. and turn the page into a 404 before feeding it into the nutch index - although the server responded with status 200 OK.
[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space
[ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ] Stefan Groschupf commented on NUTCH-292: +1. Can someone create a clean patch file?

OpenSearchServlet: OutOfMemoryError: Java heap space

Key: NUTCH-292
URL: http://issues.apache.org/jira/browse/NUTCH-292
Project: Nutch
Type: Bug
Components: web gui
Versions: 0.8-dev
Reporter: Stefan Neufeind
Priority: Critical
Attachments: summarizer.diff

java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
    org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
    org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

The URL I use is: [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url It seems to be a problem specific to the data I'm working with. Moving the start from 0 to 10, or changing the query, works fine. Or maybe it doesn't have to do with sorting, but I'm just hitting one bad search result that has a broken summary? !! The problem is repeatable, so if anybody has an idea where to search / what to fix, I can easily try that out !!
[jira] Commented: (NUTCH-291) OpenSearchServlet should return date as well as lastModified
[ http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414445 ] Stefan Groschupf commented on NUTCH-291:

lastModified will only be indexed if you enable the index-more plugin. If you think the way lastModified and date are stored in the index should change, please submit a patch for MoreIndexingFilter.

OpenSearchServlet should return date as well as lastModified
------------------------------------------------------------
Key: NUTCH-291
URL: http://issues.apache.org/jira/browse/NUTCH-291
Project: Nutch
Type: Improvement
Components: web gui
Versions: 0.8-dev
Reporter: Stefan Neufeind
Attachments: NUTCH-291-unfinished.patch

Currently lastModified is provided by OpenSearchServlet - but only in case the lastModified date is known. Since you can sort by date (which is lastModified, or the fetch date if lastModified is not present), it might be useful if OpenSearchServlet could provide date as well.
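A minimal sketch of the fallback rule described above - index lastModified when known, otherwise the fetch date. This is not the actual MoreIndexingFilter code; the field name and the Lucene field flags are assumptions:

    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DateFieldSketch {
      /** Adds a sortable "date" field: lastModified if known, else the fetch time. */
      public static void addDateField(Document doc, long lastModified, long fetchTime) {
        long time = (lastModified > 0) ? lastModified : fetchTime;
        // Day resolution keeps the term space small while still allowing date sorting.
        String value = DateTools.timeToString(time, DateTools.Resolution.DAY);
        doc.add(new Field("date", value, Field.Store.YES, Field.Index.UN_TOKENIZED));
      }
    }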
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ] Stefan Groschupf commented on NUTCH-290:

If a parser throws an exception (Fetcher, line 261):

    try {
      parse = this.parseUtil.parse(content);
      parseStatus = parse.getData().getStatus();
    } catch (Exception e) {
      parseStatus = new ParseStatus(e);
    }
    if (!parseStatus.isSuccess()) {
      LOG.warning("Error parsing: " + key + ": " + parseStatus);
      parse = parseStatus.getEmptyParse(getConf());
    }

then we use the empty parse object, and an empty parse contains no text at all - see getText():

    private static class EmptyParseImpl implements Parse {
      private ParseData data = null;
      public EmptyParseImpl(ParseStatus status, Configuration conf) {
        data = new ParseData(status, "", new Outlink[0], new Metadata(), new Metadata());
        data.setConf(conf);
      }
      public ParseData getData() { return data; }
      public String getText() { return ""; }
    }

So the problem should be somewhere else.

parse-pdf: Garbage indexed when text-extraction not allowed
-----------------------------------------------------------
Key: NUTCH-290
URL: http://issues.apache.org/jira/browse/NUTCH-290
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind
Attachments: NUTCH-290-canExtractContent.patch

It seems that garbage (or undecoded text?) is indexed when text extraction for a PDF is not allowed. Example PDF: http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
[jira] Closed: (NUTCH-287) Exception when searching with sort
[ http://issues.apache.org/jira/browse/NUTCH-287?page=all ] Stefan Groschupf closed NUTCH-287:
Resolution: Won't Fix
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04696.html

Exception when searching with sort
----------------------------------
Key: NUTCH-287
URL: http://issues.apache.org/jira/browse/NUTCH-287
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.8-dev
Reporter: Stefan Neufeind
Priority: Critical

Running a search with sort=url works. But when using sort=title I get the following exception.

2006-05-25 14:04:25 StandardWrapperValve[jsp]: Servlet.service() for servlet jsp threw exception
java.lang.RuntimeException: Unknown sort value type!
    at org.apache.nutch.searcher.IndexSearcher.translateHits(IndexSearcher.java:157)
    at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95)
    at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:239)
    at org.apache.jsp.search_jsp._jspService(search_jsp.java:257)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
    at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
    at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
    at java.lang.Thread.run(Thread.java:595)

What is in those lines is:

    WritableComparable sortValue; // convert value to writable
    if (sortField == null) {
      sortValue = new FloatWritable(scoreDocs[i].score);
    } else {
      Object raw = ((FieldDoc)scoreDocs[i]).fields[0];
      if (raw instanceof Integer) {
        sortValue = new IntWritable(((Integer)raw).intValue());
      } else if (raw instanceof Float) {
        sortValue = new FloatWritable(((Float)raw).floatValue());
      } else if (raw instanceof String) {
        sortValue = new UTF8((String)raw);
      } else {
        throw new RuntimeException("Unknown sort value type!");
      }
    }

So I thought that maybe raw is an instance of something strange and tried raw.getClass().getName() or also raw.toString() to track the cause down - but that always resulted in a NullPointerException. So it seems I'm having raw being null for some strange reason. When I try with title2 (or something non-existing) I get a different error that title2 is unknown / not indexed. So I suspect that title
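Since raw turned out to be null, a hypothetical defensive variant of the type dispatch above would at least name the offending field instead of throwing the generic "Unknown sort value type!". Sketch only - it does not fix the underlying cause (sorting on a field like title that has no single sortable term):

    public class SortValueCheck {
      /** Fails with a descriptive message when a sort field yields no value. */
      public static Object requireSortable(Object raw, String sortField) {
        if (raw == null) {
          throw new RuntimeException("Field '" + sortField + "' has no sortable value;"
              + " tokenized fields like 'title' cannot be sorted on directly.");
        }
        return raw;
      }
    }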
[jira] Closed: (NUTCH-284) NullPointerException during index
[ http://issues.apache.org/jira/browse/NUTCH-284?page=all ] Stefan Groschupf closed NUTCH-284:
Resolution: Won't Fix
Yes, I was missing index-basic.

NullPointerException during index
---------------------------------
Key: NUTCH-284
URL: http://issues.apache.org/jira/browse/NUTCH-284
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind

For quite a while this reduce sort has been going on. Then it fails. What could be wrong with this?

060524 212613 reduce sort
060524 212614 reduce sort
060524 212615 reduce sort
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212619 Optimizing index.
060524 212619 job_jlbhhm
java.lang.NullPointerException
    at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
    at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
    at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)
[jira] Commented: (NUTCH-284) NullPointerException during index
[ http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12414453 ] Stefan Groschupf commented on NUTCH-284:

Please try to discuss such things on the user mailing list first before opening an issue; maintaining the issue tracker is very time consuming. But if there is a bug, please do continue to open bug reports. :) Thanks.

NullPointerException during index
---------------------------------
Key: NUTCH-284
URL: http://issues.apache.org/jira/browse/NUTCH-284
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind

For quite a while this reduce sort has been going on. Then it fails. What could be wrong with this?

060524 212613 reduce sort
060524 212614 reduce sort
060524 212615 reduce sort
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212619 Optimizing index.
060524 212619 job_jlbhhm
java.lang.NullPointerException
    at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
    at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
    at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)
[jira] Commented: (NUTCH-281) cached.jsp: base-href needs to be outside comments
[ http://issues.apache.org/jira/browse/NUTCH-281?page=comments#action_12414454 ] Stefan Groschupf commented on NUTCH-281:

Can you submit a patch file?

cached.jsp: base-href needs to be outside comments
--------------------------------------------------
Key: NUTCH-281
URL: http://issues.apache.org/jira/browse/NUTCH-281
Project: Nutch
Type: Bug
Components: web gui
Reporter: Stefan Neufeind
Priority: Trivial

See cached.jsp: the base href=... tag does not take effect when showing a cached page because of the comments around it.
[jira] Commented: (NUTCH-274) Empty row in/at end of URL-list results in error
[ http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ] Stefan Groschupf commented on NUTCH-274:

Should we fix this in Hadoop's TextInputFormat, so it ignores empty lines, or in the Injector?

Empty row in/at end of URL-list results in error
------------------------------------------------
Key: NUTCH-274
URL: http://issues.apache.org/jira/browse/NUTCH-274
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: nightly-2006-05-20
Reporter: Stefan Neufeind
Priority: Minor

This is minor - but it's a little unclean :-) Reproduce: have a URL file with one URL followed by a newline, thus producing an empty line. Outcome: fetcher threads try to fetch two URLs at the same time. The first one is fine - but the second is empty and therefore fails proper protocol detection.

060521 022639 Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
060521 022639 Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
060521 022639 found resource parse-plugins.xml at file:/home/mm/nutch-nightly/conf/parse-plugins.xml
060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060521 022639 fetching http://www.bild.de/
060521 022639 fetching
060521 022639 fetch of failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: no protocol:
060521 022639 http.proxy.host = null
060521 022639 http.proxy.port = 8080
060521 022639 http.timeout = 1
060521 022639 http.content.limit = 65536
060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 022639 fetcher.server.delay = 1000
060521 022639 http.max.delays = 1000
060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 022640 map 0% reduce 0%
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
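Either way, the guard itself is trivial. A self-contained sketch of the Injector-side option (a hypothetical helper, not the actual Injector code): trim seed lines and drop the blank ones before they can become fetch entries:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    public class SeedListCleaner {
      /** Returns the non-empty, trimmed lines of a URL seed list. */
      public static List<String> cleanLines(String seedList) throws IOException {
        List<String> urls = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new StringReader(seedList));
        String line;
        while ((line = reader.readLine()) != null) {
          String url = line.trim();
          if (url.length() > 0) { // skip empty rows so no "no protocol" error occurs
            urls.add(url);
          }
        }
        return urls;
      }
    }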
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ] Stefan Groschupf commented on NUTCH-290:

As far as I understand the code, the next parser is only used if the previous parser returns an unsuccessful parsing status. If the parser throws an exception, that exception is not caught in ParseUtil at all. So to solve this problem, the PDF parser should throw an exception rather than report an unsuccessful status, shouldn't it?

parse-pdf: Garbage indexed when text-extraction not allowed
-----------------------------------------------------------
Key: NUTCH-290
URL: http://issues.apache.org/jira/browse/NUTCH-290
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind
Attachments: NUTCH-290-canExtractContent.patch

It seems that garbage (or undecoded text?) is indexed when text extraction for a PDF is not allowed. Example PDF: http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
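A hedged sketch of what the attached NUTCH-290-canExtractContent.patch presumably does (assumed, not copied from the patch): check the PDF's extraction permission and throw, so the parse never falls through to indexing undecoded bytes. Package names follow the old org.pdfbox PDFBox layout in use around Nutch 0.8:

    import java.io.IOException;
    import java.io.InputStream;

    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class PdfTextGuard {
      /** Extracts PDF text, failing loudly when extraction is forbidden. */
      public static String extractText(InputStream in) throws IOException {
        PDDocument doc = PDDocument.load(in);
        try {
          if (!doc.getCurrentAccessPermission().canExtractContent()) {
            // Throwing here, rather than returning an unsuccessful status,
            // matches the suggestion in the comment above.
            throw new IOException("PDF permissions forbid text extraction");
          }
          return new PDFTextStripper().getText(doc);
        } finally {
          doc.close();
        }
      }
    }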
[jira] Closed: (NUTCH-286) Handling common error-pages as 404
[ http://issues.apache.org/jira/browse/NUTCH-286?page=all ] Stefan Groschupf closed NUTCH-286:
Resolution: Won't Fix

I hope everybody agrees with the statement: we cannot detect HTTP response codes based on the returned HTML content. Pruning the index is a good way to solve the problem.

Handling common error-pages as 404
----------------------------------
Key: NUTCH-286
URL: http://issues.apache.org/jira/browse/NUTCH-286
Project: Nutch
Type: Improvement
Reporter: Stefan Neufeind

Idea: some pages from some software packages/scripts report an HTTP 200 OK even though the requested page could not be found. An example I just found: http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef That's a TYPO3 page explaining, in its standard layout and wording: "The requested page did not exist or was inaccessible." So I had the idea that somebody might create a plugin that finds commonly used formulations for "page does not exist" etc. and turns the page into a 404 before it is fed into the Nutch index - even though the server responded with status 200 OK.
[jira] Created: (NUTCH-293) support for Crawl-delay in Robots.txt
support for Crawl-delay in Robots.txt
-------------------------------------
Key: NUTCH-293
URL: http://issues.apache.org/jira/browse/NUTCH-293
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical

Nutch needs support for the Crawl-delay directive defined in robots.txt; it is not a standard, but it is a de-facto standard. See: http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html Webmasters have started blocking Nutch because we do not support it.
[jira] Updated: (NUTCH-293) support for Crawl-delay in Robots.txt
[ http://issues.apache.org/jira/browse/NUTCH-293?page=all ] Stefan Groschupf updated NUTCH-293:
Attachment: crawlDelayv1.patch

A first draft of crawl-delay support for Nutch. The problem I see: when IP-based delay is configured, it can happen that we use the crawl delay of one host for another host running on the same IP. Feedback is welcome.

support for Crawl-delay in Robots.txt
-------------------------------------
Key: NUTCH-293
URL: http://issues.apache.org/jira/browse/NUTCH-293
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Attachments: crawlDelayv1.patch

Nutch needs support for the Crawl-delay directive defined in robots.txt; it is not a standard, but it is a de-facto standard. See: http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html Webmasters have started blocking Nutch because we do not support it.
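For illustration, a minimal, hypothetical Crawl-delay parser - this is not the attached crawlDelayv1.patch. It ignores User-agent scoping, which a real implementation must honor, and simply takes the first directive it finds:

    public class CrawlDelayParser {
      /** Returns the crawl delay in milliseconds, or -1 if none is present. */
      public static long parseCrawlDelay(String robotsTxt) {
        String[] lines = robotsTxt.split("\r?\n");
        for (int i = 0; i < lines.length; i++) {
          String line = lines[i].trim().toLowerCase();
          if (line.startsWith("crawl-delay:")) {
            try {
              double seconds =
                  Double.parseDouble(line.substring("crawl-delay:".length()).trim());
              return (long) (seconds * 1000); // fetcher delays are in milliseconds
            } catch (NumberFormatException e) {
              return -1; // malformed value: fall back to the configured default
            }
          }
        }
        return -1;
      }
    }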
Re: JVM error while parsing
Hi, I heard there is a bug in JVM 1.5.0_06; can you try an older JVM, or maybe a 1.4 JVM, and report whether this happens with another JVM as well? Thanks, Stefan

On 30.05.2006, at 14:14, Uygar Yüzsüren wrote:

Hi everyone, I am using Hadoop 0.2.0 and Nutch 0.8, and at the moment I am trying to complete a 1-depth crawl using DFS and the MapReduce structures. However, after a fetch step I encounter the JVM error below at one or more task trackers during the parsing step. It makes no difference whether I use only the default parsers or also the additional ones (PDF, Excel etc.). My task trackers run on AMD X2 64-bit machines and my JVM version is 1.5.0_06. Have you ever faced such a problem at the parse stage? Or how do you think I can track down the cause of this JVM error? The error report is:

060530 144113 task_0007_m_10_0 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060530 144113 task_0007_m_10_0 5.0391704E-6% /crawl/segments/20060521171305/content/part-4/data: 0+12303612
060530 144114 task_0007_m_10_0 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060530 144114 task_0007_m_07_0 0.084114% /crawl/segments/20060521171305/content/part-00011/data:0+12493176
060530 144115 task_0007_m_07_0 0.09551566% /crawl/segments/20060521171305/content/part-00011/data:0+12493176
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # An unexpected error has been detected by HotSpot Virtual Machine:
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # SIGSEGV (0xb) at pc=0x003d1d247c10, pid=25093, tid=182894086496
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # Java VM: Java HotSpot(TM) 64-Bit Server VM (1.5.0_06-b05 mixed mode)
060530 144115 task_0007_m_07_0 # Problematic frame:
060530 144115 task_0007_m_07_0 # C [libc.so.6+0x47c10] printf_size+0x740
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # An error report file with more information is saved as hs_err_pid25093.log
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # If you would like to submit a bug report, please visit:
060530 144115 task_0007_m_07_0 # http://java.sun.com/webapps/bugreport/crash.jsp
060530 144115 task_0007_m_07_0 #
060530 144115 Server connection on port 51950 from 192.168.15.61: exiting
060530 144115 task_0007_m_07_0 Child Error
java.io.IOException: Task process exit with nonzero status of 134.
    at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:242)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)

Thank you very much.
Re: Extract infos from documents and query external sites
Think about using the Google API. However, the way to go could be:
+ fetch your pages
+ do not parse the pages
+ write a MapReduce job that extracts your data
++ build an XHTML DOM from the HTML, e.g. using neko
++ use XPath queries to extract your data
++ also check out gate as a named-entity extraction tool, to extract names based on patterns and heuristics
++ write the names to a file
+ build your query URLs
+ inject the query URLs into an empty crawl db
+ create a segment, fetch it, and update it against a second, empty crawl database
+ remove the first segment and db
+ create a segment with your second db and fetch it.
Your second segment will then contain only the paper pages. A sketch of the XPath extraction step follows at the end of this message. HTH, Stefan

On 30.05.2006, at 12:14, HellSpawn wrote:

I'm working on a search engine for my university, and they want me to create a repository of scientific articles from the web :D I read something about XPath for extracting exact parts of a document; once that is done, building the query is very easy. But my doubts are about how to insert all of this into the Nutch crawler... Thank you
--
View this message in context: http://www.nabble.com/Extract+infos+from+documents+and+query+external+sites-t1675003.html#a4624272
Sent from the Nutch - Dev forum at Nabble.com.
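A sketch of the neko + XPath step from the list above (the input HTML and the XPath expression are illustrative; the gate/named-entity part is not shown): parse fetched HTML into a DOM with NekoHTML and pull out the target strings.

    import java.io.StringReader;

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;

    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class PaperTitleExtractor {
      public static void main(String[] args) throws Exception {
        String html = "<html><body><h2>An Example Paper Title</h2></body></html>";
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(html)));
        Document dom = parser.getDocument();
        XPath xpath = XPathFactory.newInstance().newXPath();
        // NekoHTML upper-cases element names in the DOM it builds.
        NodeList hits = (NodeList) xpath.evaluate("//H2", dom, XPathConstants.NODESET);
        for (int i = 0; i < hits.getLength(); i++) {
          System.out.println(hits.item(i).getTextContent()); // feed these into query URLs
        }
      }
    }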