Re: Next Nutch release

2007-01-18 Thread Stefan Groschupf

Hi,

I just finished reading all the source code for the Nutch GUI, and
personally I don't like putting a lot of code snippets into JSP files,
since it costs a lot of time when refactoring. So how about adopting
Velocity/FreeMarker with servlets?



In general I agree: it is the view layer and should contain as little
code as possible. However, the idea was to have as few dependencies on
third-party tools and libraries as possible, and also to get things
realized with low tech (JSP).


Stefan
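
For illustration, a minimal sketch of the servlet-plus-template approach suggested above, using the Velocity engine; the class name, template name, request parameter, and configuration keys are hypothetical placeholders, not code from the Nutch GUI.

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.velocity.Template;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

// Hypothetical servlet: keeps the Java code out of the view and renders a
// Velocity template instead of a JSP full of scriptlets.
public class SearchResultServlet extends HttpServlet {

  private final VelocityEngine engine = new VelocityEngine();

  public void init() throws ServletException {
    try {
      // Load templates from the classpath (configuration is illustrative only).
      engine.setProperty("resource.loader", "class");
      engine.setProperty("class.resource.loader.class",
          "org.apache.velocity.runtime.resource.loader.ClasspathResourceLoader");
      engine.init();
    } catch (Exception e) {
      throw new ServletException(e);
    }
  }

  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws ServletException, IOException {
    try {
      // The servlet (controller) prepares the model ...
      VelocityContext context = new VelocityContext();
      context.put("query", req.getParameter("query"));

      // ... and the template only renders it, so refactoring stays in Java code.
      Template template = engine.getTemplate("search.vm");
      resp.setContentType("text/html");
      template.merge(context, resp.getWriter());
    } catch (Exception e) {
      throw new ServletException(e);
    }
  }
}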





Re: Next Nutch release

2007-01-18 Thread Stefan Groschupf

The old hadoop patch is here:
https://issues.apache.org/jira/browse/NUTCH-251
We also had this conversation:
http://www.mail-archive.com/hadoop-dev@lucene.apache.org/msg00314.html
I guess after this we forgot to post the patches we use internally.

If someone feels strongly about getting the GUI working with Hadoop, he/she
should feel free to update the patch and post it in the Hadoop Jira.


Stefan







On 18.01.2007, at 15:39, Doug Cutting wrote:


Stefan Groschupf wrote:
We run the GUI in several production environments with patched
Hadoop code, since from our point of view this is the clean
approach. Everything else feels like a workaround for some
strange Hadoop behaviors.


Are there issues in Hadoop's Jira for these?  If so, do they have  
patches attached?  Are they linked to the corresponding issue in  
Nutch?


Doug



~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com





Re: Next Nutch release

2007-01-18 Thread Stefan Groschupf

Hi Scott,

feel free, I have no opinion on that.

From my very limited point of view the Nutch .8 source stream is a one-way street.
In all my projects we move as far as possible away from Nutch. I like
Hadoop a lot, and writing custom tools on top of it is that easy.
But Nutch .8 was a proof of concept for the early Hadoop. There is
only one serious developer left, and wow, how great he does his job,
but Nutch .8 is just too monolithic, too difficult to extend, too
difficult to debug, too difficult to integrate for a serious mission-critical
application.
I spend a significant part of my life daily working with Nutch, but
if someone asked, I would answer: don't use it.
Maybe one day we can get some developers together to first think about a
good extensible design and then start a 2.x stream or a new project.
And ... yes, no OPIC, and yes, definitely no plugin architecture (I feel
very sorry for all who wasted so much lifetime because of my terribly
complicated plugin system), but a clean IoC design with lightweight
default interface implementations and great test coverage.
Anyway, just my *very limited* point of view based on 3.5 years of Nutch
experience.


Stefan





On 18.01.2007, at 21:33, Scott Green wrote:


Stefan,

I also dived into contrib/web2 in Nutch. That one and the admin GUI both
own some plugins based on the Nutch plugin architecture. So I think it
would be great if we extracted something at a higher level, as they should
have a lot in common. Well, I don't know whether it is the right time to do this job.

On 1/19/07, Stefan Groschupf [EMAIL PROTECTED] wrote:

Hi,
 I just finished reading all the source code for the Nutch GUI, and
 personally I don't like putting a lot of code snippets into JSP files,
 since it costs a lot of time when refactoring. So how about adopting
 Velocity/FreeMarker with servlets?


In general I agree: it is the view layer and should contain as little
code as possible. However, the idea was to have as few dependencies on
third-party tools and libraries as possible, and also to get things
realized with low tech (JSP).

Stefan








~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com





Re: Next Nutch release

2007-01-17 Thread Stefan Groschupf

Hi,

great to hear people are still working on things. It shows once more
that getting something in early would save some effort. :)

Just some random comments.

We run the GUI in several production environments with patched Hadoop
code, since from our point of view this is the clean approach.
Everything else feels like a workaround for some strange Hadoop
behaviors. It may be a long time ago that I spoke to Doug and some
other Hadoop developers, but at that time my understanding was that
there is a general interest in having a Nutch GUI and supporting the
required functionality in Hadoop.

I'm not sure if that is still the case or if I had the wrong impression.
In any case, from my p.o.v. the clean way would be getting the
required minor changes into Hadoop (not critical, simple stuff from my
point of view) instead of implementing workarounds in Nutch. Since
Hadoop is a kind of child of Nutch, there should be a close enough
relationship at least to discuss things.
Anyway, no strong opinion, just my 2 cents. In any case I'm very happy
if people now see the need for a GUI as well and someone is working
on it, since I'm kind of busy with other projects.


Thanks.
Stefan


On 17.01.2007, at 06:42, Enis Soztutar wrote:


Hi all, for NUTCH-251:

I suppose that NUTCH-251 is a relatively significant issue, judging by the
votes. Stefan has written a good plugin for the admin GUI and I
have updated it to work with nutch-0.8 and hadoop 0.4.


Some of the features in the patch are not appropriate for our use
cases, and it requires Hadoop changes, so I am currently working
on an alternative implementation of the administration GUI, which
runs a Hadoop server (like the JobTracker) to listen for submitted jobs,
a web GUI to submit and track the jobs from the browser, and a job
runner.


The architecture details of the patch are as follows:

 - an AdminJob abstraction, an abstract class representing a job in Nutch
 - various classes extending AdminJob, e.g. FetchAdminJob, IndexAdminJob
 - a queue which sorts the jobs in priority order by a modified
   topological sort (jobs can depend on each other)

 - an interface to submit jobs
 - an RPC server to listen for job submissions
 - an extension point (basically the same as the previous one)
 - a web server to serve the plugins' JSPs

On top of that, the features will be:
   - submitting jobs from code, the command line, or the web interface
   - tracking jobs from the command line or the web interface
   - scheduling jobs

I could send the code or details if anyone is interested in
pre-testing, and I would appreciate any comments and suggestions on
this. I am planning to complete the patch and submit it to Jira ASAP.
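
As a rough illustration of the design described above (an abstract job type plus a dependency-aware priority queue), a hedged sketch follows; all class names and signatures are hypothetical and not taken from the actual patch.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical base class for administration jobs: a priority, a list of
// jobs it depends on, and a hook that runs the underlying Nutch tool.
public abstract class AdminJob {
  private final String name;
  private final int priority;
  private final List<AdminJob> dependencies = new ArrayList<AdminJob>();

  protected AdminJob(String name, int priority) {
    this.name = name;
    this.priority = priority;
  }

  public String getName() { return name; }
  public int getPriority() { return priority; }
  public List<AdminJob> getDependencies() { return dependencies; }
  public void dependsOn(AdminJob other) { dependencies.add(other); }

  /** Runs the underlying step, e.g. a generate, fetch, or index tool. */
  public abstract void run() throws Exception;
}

// Hypothetical queue: prerequisites are enqueued first, then jobs are
// handed out in decreasing priority order.
class AdminJobQueue {
  private final PriorityQueue<AdminJob> queue =
      new PriorityQueue<AdminJob>(11, new Comparator<AdminJob>() {
        public int compare(AdminJob a, AdminJob b) {
          return b.getPriority() - a.getPriority();
        }
      });

  public void submit(AdminJob job) {
    // Naive dependency handling; a real implementation would do a proper
    // topological sort and avoid enqueuing a job twice.
    for (AdminJob dep : job.getDependencies()) {
      submit(dep);
    }
    queue.add(job);
  }

  public AdminJob next() {
    return queue.poll();
  }
}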


Sami Siren wrote:

Hello,

It has been a while since the previous release (0.8.1), and looking at the
great fixes done in trunk I'd start thinking about baking a new release
soon.

Looking at the Jira roadmaps there is one blocking issue (fixing the
license headers) for 0.8.2 and two other blocking issues for 0.9.0,
of which I think NUTCH-233 is safe to put in.

The top 10 voted issues are currently:

NUTCH-61   Adaptive re-fetch interval. Detecting unmodified content
NUTCH-48   Did you mean query enhancement/refinement feature
NUTCH-251  Administration GUI
NUTCH-289  CrawlDatum should store IP address
NUTCH-36   Chinese in Nutch
NUTCH-185  XMLParser is configurable xml parser plugin
NUTCH-59   meta data support in webdb
NUTCH-92   DistributedSearch incorrectly scores results
NUTCH-68   A tool to generate arbitrary fetchlists
NUTCH-87   Efficient site-specific crawling for a large number of sites

Are there any opinions about issues that should go in before the next
release? (Answering yes means that you are willing to provide a patch
for it.)

--
 Sami Siren







~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com





Re: What's the status of Nutch-GUI?

2006-12-02 Thread Stefan Groschupf

Hi Sami,

I guess you refer to these:
• LocalJobRunner:
  • run as a kind of singleton
  • have a kind of job queue
  • implement the JobSubmissionProtocol status-report methods
  • implement a killJob method

Right!



- how about writing a NutchRunner that just extends the
functionality of LocalJobRunner?
That would be one solution; however, I still hope that the Hadoop
developers understand that it would be of general benefit to improve the
LocalJobRunner.
Since it would be somewhat duplicated code it does not feel right, but
I also think better this way than never getting this issue solved.




- scheduling (jobQueue) could be completely outside of the job runner?


We solved that with Quartz and a file-based JobStore we implemented back
then.
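
For reference, a minimal sketch of scheduling a recurring job with Quartz; it is written against the current Quartz 2.x builder API and the default in-memory job store as an assumption, so the job class and identities are hypothetical and the custom file-based JobStore mentioned above is not shown.

import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.Scheduler;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

// Hypothetical Quartz job that would kick off a Nutch admin task.
public class GenerateFetchJob implements Job {
  public void execute(JobExecutionContext context) throws JobExecutionException {
    // Here the real code would submit a Nutch job (generate, fetch, ...).
    System.out.println("running scheduled Nutch job");
  }
}

class SchedulerExample {
  public static void main(String[] args) throws Exception {
    Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

    JobDetail job = JobBuilder.newJob(GenerateFetchJob.class)
        .withIdentity("generate-fetch", "nutch-admin")
        .build();

    // Re-run the job every six hours.
    Trigger trigger = TriggerBuilder.newTrigger()
        .withIdentity("every-six-hours", "nutch-admin")
        .startNow()
        .withSchedule(SimpleScheduleBuilder.simpleSchedule()
            .withIntervalInHours(6)
            .repeatForever())
        .build();

    scheduler.scheduleJob(job, trigger);
    scheduler.start();
  }
}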


Stefan 

Re: [jira] Created: (NUTCH-408) Plugin development documentation

2006-11-25 Thread Stefan Groschupf
Did you ever browse this: http://wiki.media-style.com/display/nutchDocu/Home

Nothing big, but it will give you some ideas, also about plugins.

On 25.11.2006, at 06:32, Armel T. Nene wrote:

I agree with you that documentation is vital, not just for extending the
current version but also for any plugins and patches created. I have been
spending almost two weeks trying to adapt Nutch to my project, but I spend
more time reading code and trying to understand what it does before I can

even start to fix problems. Come on guys, documentation is good coding
practice; we can't read your mind to know exactly what you were trying to

achieve by just looking at the implementation code.

This is just a good constructive criticism.

:) Armel

-Original Message-
From: nutch.newbie (JIRA) [mailto:[EMAIL PROTECTED]
Sent: 25 November 2006 03:45
To: nutch-dev@lucene.apache.org
Subject: [jira] Created: (NUTCH-408) Plugin development documentation

Plugin development documentation


 Key: NUTCH-408
 URL: http://issues.apache.org/jira/browse/NUTCH-408
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1
 Environment: Linux Fedora
Reporter: nutch.newbie


Documentation is rare! But very vital for extending the current (0.9) Nutch.
The current docs on the wiki for 0.7 plugin development were good, but they don't
apply to 0.9, and new developers who join directly at 0.9 find the 0.7
documentation not enough. A more practical plugin-writing documentation for
0.9 is desired, also exposing the plugin principles in practical terms, i.e.
extension points and libs etc. Furthermore it would be good to provide some

best-practice examples, i.e.:

check whether the lib you are planning to use is already in the lib folder, and
maybe that version of the external lib is good for the plugin dev rather

than using another version; things like that..

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the  
administrators:

http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/ 
software/jira








~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com





Re: Fetcher freezes

2006-11-03 Thread Stefan Groschupf

Hi,

Try running with no regular expression filter and check if this helps.
Let me know if this solves the problem.
You may also want to do a thread dump and send the log to the list to
check where exactly the fetcher freezes.


Stefan

On 03.11.2006, at 15:53, Aisha wrote:



Hi,

I don't know why, but I have had no answer on the 3 forums where I sent my
problem.
As the Fetcher freeze occurs every time I try to fetch my file system,
I can't imagine that I am the only one who has this problem, and as I

said in my last e-mail, I found many mails about this problem but no
solution seems to have been found.
It is a big problem, so I don't understand why nobody seems interested in

it.

I try to crawl over my file system but the crawl never finishes; it aborts

with the message "Aborting with 3 hung threads".

The number of hung threads is not the same if I retry

I modified the configuration, increasing the number of threads, but it doesn't

solve the problem.

Please could somebody help me,
I can't crawl my file system..

thanks in advance.
Aïcha

--
View this message in context: http://www.nabble.com/Fetcher-freezes- 
tf2568287.html#a7158776

Sent from the Nutch - Dev mailing list archive at Nabble.com.




~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com





Re: How could I test my modify to NutchAnalysis.jj?

2006-09-10 Thread Stefan Groschupf

There is an Eclipse JavaCC plugin.
It compiles your grammar and you can easily write test code.
However, it has its own issues, so you may just want to generate the
Java files with the Nutch ant script and then write unit tests against
these files.

HTH
Stefan
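
As an illustration of the second suggestion (generate the parser with the ant script, then unit-test the generated classes), a minimal JUnit 3 sketch is shown below; the parseQuery() call is a hypothetical placeholder for whatever entry production your modified grammar actually exposes, so check the generated NutchAnalysis source for the real signature.

import java.io.StringReader;
import junit.framework.TestCase;
import org.apache.nutch.analysis.NutchAnalysis;

// Sketch of a test against the JavaCC-generated parser. JavaCC generates a
// constructor taking a java.io.Reader; parseQuery() below is a hypothetical
// placeholder for the grammar's entry production.
public class TestNutchAnalysisChanges extends TestCase {

  public void testMyGrammarChange() throws Exception {
    NutchAnalysis parser = new NutchAnalysis(new StringReader("nutch hadoop"));
    Object result = parser.parseQuery();  // hypothetical entry production
    assertNotNull(result);
  }
}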

On 10.09.2006, at 00:49, heack wrote:

I made some changes to this file (with a main function), and I want to
test it. What should I do?

I use ant to build, but it builds everything.
Maybe I could write an ant XML to run it, but is there any easier
way to do that?

Thank you!


~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com





Re: Patch Available status?

2006-08-31 Thread Stefan Groschupf


Another alternative would be to construct a new workflow that just  
adds the Patch Available status and still permits issues to be re- 
opened.



+1



Re: Missing pages anchor text

2006-08-29 Thread Stefan Groschupf

Hi Doug,
I'm pretty sure that your problem is related to the deduping of your
index.
In general the hash of the content of a page is used as the key for the
dedup tool.

We also ran into the forwarding problem in another case:
https://issues.apache.org/jira/browse/NUTCH-353
So maybe we should think about a general solution to the forwarding
problem.


Greetings,
Stefan


On 28.08.2006, at 11:33, Doug Cook wrote:



Hi, folks,

I have just started digging into relevance issues with Nutch, and I'm
running into some mysteries. Before I dig too deep, I wanted to  
check to see
if these were known issues (a quick search of the email archives  
and of JIRA

didn't turn up anything). I'm running 0.8 with a handful of patches.

I'm frequently finding root pages of sites missing from my index,  
despite
the fact that they have been fetched. In my admittedly short  
investigation I

have found two classes of cases:

1. Root URL is not a redirect, but there is a root-level index.html  
page.

The index.html page is in the index, but the root page is not.
Unfortunately, most of the anchor text points to the root page, not  
the
/index.html page, and the anchor text has gone missing along with  
its

associated page, so relevance is poor.

2. Root URL is a redirect to another page. Again, this other page is in the

index, but the root page, along with its anchor text, has gone
missing.

I have a deduped index. Both of these cases could result from dedup  
throwing
out the wrong URL, i.e. the one with more anchor text, although one  
might
expect dedup to merge the two anchor texts (at least in the case of  
pages

which commonly normalize to the same URL, e.g. / and /index.html).

The second case might result from the root URL somehow being  
normalized to

its redirect target, but in that case (incorrect, in any case) I would
expect the anchor text to also be attached to the redirect target,  
and it is

not.

I'm about to rebuild with no deduping and see what I find.

Thanks for your help & comments -

Doug
--
View this message in context: http://www.nabble.com/Missing-pages--- 
anchor-text-tf2179049.html#a6025652

Sent from the Nutch - Dev forum at Nabble.com.




~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com





Re: [Nutch Wiki] Update of RunNutchInEclipse by UrosG

2006-08-29 Thread Stefan Groschupf

Hi,

+ You may have problems with some imports in the parse-mp3 and parse-rtf
plugins. Because of incompatibility with the Apache license they were left
out of the sources. You can find them here:

+
+ http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/

+
+ http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/

+
+ You need to copy the jar files into the plugin lib path and refresh the
project.


Isn't the mp3 plugin deactivated? I suggest we remove it and put it in a
kind of sandbox together with the jars. However, I think the sandbox has to
be outside of Apache.


Stefan 


Re: Checking if crawl dir exists ...

2006-08-25 Thread Stefan Groschupf

Hi Michi,
what is your motivation for that?

Stefan
On 25.08.2006, at 06:52, Michael Wechner wrote:


Hi

I think it would be very useful if the NutchBean checked whether the
crawl dir exists and at least logged a warning

in case it doesn't:

Index: nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java
===================================================================
--- nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java  (revision 436787)
+++ nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java  (working copy)
@@ -95,6 +95,9 @@
     if (dir == null) {
       dir = new Path(this.conf.get("searcher.dir", "crawl"));
     }
+    if (!new java.io.File(dir.toString()).exists()) {
+      LOG.warn("No such directory: " + new java.io.File(dir.toString()));
+    }
     Path servers = new Path(dir, "search-servers.txt");
     if (fs.exists(servers)) {
       if (LOG.isInfoEnabled()) {


WDYT?

Thanks

Michi

--
Michael Wechner
Wyona  -   Open Source Content Management   -Apache Lenya
http://www.wyona.com  http://lenya.apache.org
[EMAIL PROTECTED][EMAIL PROTECTED]
+41 44 272 91 61




~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com





Re: [Fwd: Re: [Nutch Wiki] Update of RenaudRichardet by RenaudRichardet]

2006-08-24 Thread Stefan Groschupf

Hi Renaud,
I think you meant editing http://wiki.apache.org/nutch/RunNutchInEclipse,
not http://wiki.apache.org/nutch/RenaudRichardet, right?
Right! Sorry for the misunderstanding. I have no idea about your
personal page, so it would be a bad move to edit it. :-)

Thanks again for creating the page on debugging Nutch within Eclipse.

Stefan



Re: [Nutch Wiki] Update of RenaudRichardet by RenaudRichardet

2006-08-23 Thread Stefan Groschupf

Hi Renaud,
I updated your page with some more details, I hope that is ok for you.
Thanks for creating it.
Stefan


On 23.08.2006, at 11:51, Apache Wiki wrote:


Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki  
for change notification.


The following page has been changed by RenaudRichardet:
http://wiki.apache.org/nutch/RenaudRichardet

New page:
{{{
Renaud Richardet
COO America
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195 mobile +1 617 230 9112
renaud.richardet at wyona.com  http://www.wyona.com
}}}





Re: Junit testing, was: Re: [jira] Updated: (NUTCH-357) crawling simulation

2006-08-22 Thread Stefan Groschupf
One must also remember that proper JUnit testing can be used to
verify functionality.
There's a lot of code currently that is not guarded by unit tests, and
I hereby invite everybody to participate in this endless effort and
make the Nutch unit tests better ;)

I completely agree!!!
Nutch has more bugs than ever before since most of the .8 code was  
developed without tests.


Stefan


[jira] Commented: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-21 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-354?page=comments#action_12429496 ] 

Stefan Groschupf commented on NUTCH-354:


Since this issue is already closed I can not attach the patch file, so I attach
it as text within this comment.
If you need the file, let me know and I will send you an off-list mail.


Index: src/test/org/apache/nutch/crawl/TestMapWritable.java
===================================================================
--- src/test/org/apache/nutch/crawl/TestMapWritable.java  (revision 432325)
+++ src/test/org/apache/nutch/crawl/TestMapWritable.java  (working copy)
@@ -180,6 +180,31 @@
     assertEquals(before, after);
   }
 
+  public void testRecycling() throws Exception {
+    UTF8 value = new UTF8("value");
+    UTF8 key1 = new UTF8("a");
+    UTF8 key2 = new UTF8("b");
+
+    MapWritable writable = new MapWritable();
+    writable.put(key1, value);
+    assertEquals(writable.get(key1), value);
+    assertNull(writable.get(key2));
+
+    DataOutputBuffer dob = new DataOutputBuffer();
+    writable.write(dob);
+    writable.clear();
+    writable.put(key1, value);
+    writable.put(key2, value);
+    assertEquals(writable.get(key1), value);
+    assertEquals(writable.get(key2), value);
+
+    DataInputBuffer dib = new DataInputBuffer();
+    dib.reset(dob.getData(), dob.getLength());
+    writable.readFields(dib);
+    assertEquals(writable.get(key1), value);
+    assertNull(writable.get(key2));
+  }
+
   public static void main(String[] args) throws Exception {
     TestMapWritable writable = new TestMapWritable();
     writable.testPerformance();


 MapWritable,  nextEntry is not reset when Entries are recycled
 --

 Key: NUTCH-354
 URL: http://issues.apache.org/jira/browse/NUTCH-354
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.9.0, 0.8.1

 Attachments: resetNextEntryInMapWritableV1.patch


 MapWritable recycles entries from its internal linked list for performance
 reasons. The nextEntry of an entry is not reset when a recyclable entry
 is found. This can cause wrong data in a MapWritable.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Fwd: [webspam-announces] Web Spam Collection Announced

2006-08-21 Thread Stefan Groschupf

Hi,
Maybe some people will find this posting interesting.
Web spam is one of the biggest issues of Nutch for whole-web crawls,
from my POV.


Greetings,
Stefan




During AIRWeb'06 we announced the availability of the collection.

We are currently planning a Web Spam challenge based on the dataset we
have built. I assume most of you will be interested in this, so I have
moved the webspam-volunteers list to webspam-announces. If you do
not want to be in this new webspam-announces list, please send me an
e-mail.

This was shown during AIRWeb in Seattle:

.

Web Spam Collection Available
August 10th, 2006

We are pleased to announce the availability of a public collection for
research on Web spam. This collection is the result of efforts by a
team of volunteers:

Thiago AlvesAntonio GulliTamas Sarlos
Luca Becchetti  Zoltan Gyongyi   Mike Thelwall
Paolo Boldi Thomas Lavergn   Belle Tseng
Paul ChiritaAlex Ntoulas Tanguy Urvoy
Mirel Cosulschi Josiane-Xavier Parreira  Wenzhong Zhao
Brian Davison   Xiaoguang Qi
Pascal Filoche  Massimo Santini

The corpus is a large set of Web pages in 11,000 .uk hosts
downloaded in May 2006 by the Laboratory of Web Algorithmics,
Università degli Studi di Milano. The labelling process was
coordinated by Carlos Castillo, working at the Algorithmic Engineering
group at Università di Roma "La Sapienza". The project was funded
by the DELIS project (Dynamically Evolving, Large Scale Information
Systems).

Volunteers were provided with a set of guidelines and were asked to
mark a set of hosts as either normal, spam, or borderline. The
collection includes about 6,700 judgments done by the volunteers and
can be used for testing link-based and content-based Web spam
detection and demotion techniques.

More information is available in our Web page, including the
guidelines given to the human judges, the instructions for obtaining
the links and contents of the pages in this collection, and the
contact information for questions and comments.

http://aeserver.dis.uniroma1.it/webspam/

If you use this data set please subscribe to our mailing list by
sending an e-mail to [EMAIL PROTECTED]

--
Carlos Castillo
Universita di Roma La Sapienza
Rome, ITALY













[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

2006-08-21 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429534 ] 

Stefan Groschupf commented on NUTCH-356:


Hi Enrico,
there will be as many PluginRepositories as Configuration objects.
So in case you create many Configuration objects you will have a problem with
memory.
There is no way around having a singleton PluginRepository. However, you can
reset the PluginRepository by removing the cached object from the
Configuration object.
In any case, not caching the PluginRepository is a bad idea; think about writing
your own plugin that solves your problem, that should be a cleaner solution.

Would you agree to close this issue, since we will not be able to commit your
changes?
Stefan
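
To illustrate why the number of PluginRepository instances tracks the number of Configuration objects, here is a hedged sketch of a typical per-configuration cache pattern; it is illustrative only, not the actual PluginRepository code.

import java.util.Map;
import java.util.WeakHashMap;

// Illustrative per-configuration cache: one expensive repository instance per
// configuration object, kept alive as long as that configuration is referenced.
public class CachedRepository {

  private static final Map<Object, CachedRepository> CACHE =
      new WeakHashMap<Object, CachedRepository>();

  public static synchronized CachedRepository get(Object conf) {
    CachedRepository repo = CACHE.get(conf);
    if (repo == null) {
      repo = new CachedRepository();  // expensive: scans and loads all plugins
      CACHE.put(conf, repo);
    }
    return repo;
  }
}

Under such a pattern, creating a fresh configuration for every submitted URL produces a fresh repository each time, and the repositories are only released when their configurations are garbage collected; reusing a single configuration keeps memory stable.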

 Plugin repository cache can lead to memory leak
 ---

 Key: NUTCH-356
 URL: http://issues.apache.org/jira/browse/NUTCH-356
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Enrico Triolo
 Attachments: NutchTest.java, patch.txt


 While I was trying to solve a problem I reported a while ago (see Nutch-314), 
 I found out that actually the problem was related to the plugin cache used in 
 class PluginRepository.java.
 As  I said in Nutch-314, I think I somehow 'force' the way nutch is meant to 
 work, since I need to frequently submit new urls and append their contents to 
 the index; I don't (and I can't) have an urls.txt file with all urls I'm 
 going to fetch, but I recreate it each time a new url is submitted.
 Thus, I think in the majority of cases you won't have problems using Nutch
 as-is, since the problem I found occurs only if Nutch is used in a way
 similar to mine.
 To simplify your test I'm attaching a class that performs something similar
 to what I need. It fetches and indexes some sample urls; to avoid webmasters'
 complaints I left the sample url list empty, so you should modify the source
 code and add some urls.
 Then you only have to run it and watch your memory consumption with top. In 
 my experience I get an OutOfMemoryException after a couple of minutes, but it 
 clearly depends on your heap settings and on the plugins you are using (I'm 
 using 
 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
 The problem is bound to the PluginRepository 'singleton' instance, since it
 never gets released. It seems that some class maintains a reference to it and
 this class is never released, since it is cached somewhere in the
 configuration.
 So I modified the PluginRepository's 'get' method so that it never uses the 
 cache and always returns a new instance (you can find the patch in 
 attachment). This way the memory consumption is always stable and I get no 
 OOM anymore.
 Clearly this is not the solution, since I guess there are many performance 
 issues involved, but for the moment it works.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (NUTCH-357) crawling simulation

2006-08-21 Thread Stefan Groschupf (JIRA)
crawling simulation
---

 Key: NUTCH-357
 URL: http://issues.apache.org/jira/browse/NUTCH-357
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
 Fix For: 0.9.0


We recently discovered some serious issues related to crawling and scoring.
Reproducing these problems is kind of difficult, since first of all it is not
polite to re-crawl a set of pages again and again, and secondly it is difficult to
catch the page that causes a problem.
Therefore it would be very useful to have a testbed to simulate crawls where
we can control the responses of web servers.
For the very beginning, simulating very basic situations like a page pointing to
itself, link chains, or internal links would already be very useful.

However, later on, simulating crawls against existing data collections like TREC or
a webgraph would be much more interesting, for instance to calculate the quality
of the Nutch OPIC implementation against PageRank scores of the webgraph or to
evaluate crawling strategies.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-357) crawling simulation

2006-08-21 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-357?page=all ]

Stefan Groschupf updated NUTCH-357:
---

Attachment: protocol-simulation-pluginV1.patch

A very first preview of a plugin that helps to simulate crawls. This protocol
plugin can be used to replace the http protocol plugin and return defined
content during a fetch. To simulate custom scenarios, an interface named
Simulator can be implemented with just one method.
The plugin comes with a very simple basic Simulator implementation; however,
this already allows simulating the currently known Nutch scoring problems, like
pages pointing to themselves or link chains.
For more details see the javadoc; however, I plan to improve the javadoc with
a native speaker.

Feedback is welcome. 
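
As a hedged sketch of what such a simulator hook could look like (the interface and method names here are hypothetical placeholders rather than the actual plugin code; see the attached patch and its javadoc for the real API):

// Hypothetical simulator hook: given a URL, return the page content the fake
// protocol plugin should hand to the fetcher instead of going to the network.
public interface Simulator {
  String getContentFor(String url);
}

// Simple implementation reproducing a known scoring problem: every page
// links only to itself.
class SelfLinkSimulator implements Simulator {
  public String getContentFor(String url) {
    return "<html><body><a href=\"" + url + "\">self</a></body></html>";
  }
}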

 crawling simulation
 ---

 Key: NUTCH-357
 URL: http://issues.apache.org/jira/browse/NUTCH-357
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
 Fix For: 0.9.0

 Attachments: protocol-simulation-pluginV1.patch


 We recently discovered some serious issues related to crawling and scoring.
 Reproducing these problems is kind of difficult, since first of all it is
 not polite to re-crawl a set of pages again and again, and secondly it is
 difficult to catch the page that causes a problem.
 Therefore it would be very useful to have a testbed to simulate crawls where
 we can control the responses of web servers.
 For the very beginning, simulating very basic situations like a page pointing
 to itself, link chains, or internal links would already be very useful.
 However, later on, simulating crawls against existing data collections like TREC
 or a webgraph would be much more interesting, for instance to calculate the
 quality of the Nutch OPIC implementation against PageRank scores of the
 webgraph or to evaluate crawling strategies.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
MapWritable,  nextEntry is not reset when Entries are recycled 
---

 Key: NUTCH-354
 URL: http://issues.apache.org/jira/browse/NUTCH-354
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8.1, 0.9.0


MapWritable recycles entries from its internal linked list for performance
reasons. The nextEntry of an entry is not reset when a recyclable entry is
found. This can cause wrong data in a MapWritable.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-354?page=all ]

Stefan Groschupf updated NUTCH-354:
---

Attachment: resetNextEntryInMapWritableV1.patch

Resets the next Entry of a recycled entry.

 MapWritable,  nextEntry is not reset when Entries are recycled
 --

 Key: NUTCH-354
 URL: http://issues.apache.org/jira/browse/NUTCH-354
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.9.0, 0.8.1

 Attachments: resetNextEntryInMapWritableV1.patch


 MapWritable recycles entries from its internal linked list for performance
 reasons. The nextEntry of an entry is not reset when a recyclable entry
 is found. This can cause wrong data in a MapWritable.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-343) Index MP3 SHA1 hashes

2006-08-18 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ] 

Stefan Groschupf commented on NUTCH-343:


Thanks for the contribution, and also for the fact that your patch has a test. :-)
Just a small comment from taking a first look at the patch file.
My personal experience is that some Nutch developers have strong opinions about
code formatting, so you may want to check your code formatting. :-)

 Index MP3 SHA1 hashes
 -

 Key: NUTCH-343
 URL: http://issues.apache.org/jira/browse/NUTCH-343
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 0.8, 0.9.0, 0.8.1
Reporter: Hasan Diwan
 Attachments: parsemp3.pat


 Add indexing of the mp3s sha1 hash.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-341) IndexMerger now deletes entire workingdir after completing

2006-08-18 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-341?page=all ]

Stefan Groschupf updated NUTCH-341:
---

Attachment: doNotDeleteTmpIndexMergeDirV1.patch

+1.
I agree it makes absolutely no sense to be required to create a tmp folder
manually only for Nutch to delete it afterwards with all its content.
Very dangerous if a user provides / as the tmp folder. The attached patch
rolls back the missing line, and I would love to ask that a developer with
write access roll this in asap!
THANKS!


 IndexMerger now deletes entire workingdir after completing
 

 Key: NUTCH-341
 URL: http://issues.apache.org/jira/browse/NUTCH-341
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.8
Reporter: Chris Schneider
Priority: Critical
 Attachments: doNotDeleteTmpIndexMergeDirV1.patch


 Change 383304 deleted the following line near Line 117 (see
 http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexMerger.java?r1=383304&r2=405204&diff_format=h
 for details):
 workDir = new File(workDir, "indexmerger-workingdir");
 Previously, if no -workingdir workingdir parameter was specified, 
 IndexMerger.main() would place an indexmerger-workingdir directory into the 
 default directory and then delete the former after completing. Now, 
 IndexMerger.main() defaults the value of its workDir to indexmerger within 
 the default directory, and deletes this workDir afterward.
 However, if -workingdir workingdir _is_ specified, IndexMerger.main() will 
 now set workDir to _this_ path and delete the _entire_ workingdir 
 afterward. Previously, IndexMerger.main() would only delete 
 workingDir/indexmerger-workingdir, without deleting workingdir itself. 
 This is because the line mentioned above always appended 
 indexmerger-workingdir to workDir.
 Our hardware configuration on the jobtracker/namenode box attempts to keep 
 all large datasets on a separate, large hard drive. Accordingly, we were 
 keeping dfs.name.dir, dfs.data.dir, mapred.system.dir, and mapred.local.dir 
 on this drive. Unfortunately, we were passing the folder containing these 
 folders in the workingdir parameter to the IndexMerger. As a result, the 
 first time we ran the IndexMerger, we ended up trashing our entire DFS!
 Perhaps the way that the IndexMerger handles its workingdir parameter now
 is an acceptable design. However, given the way it handled this parameter in 
 the past, I feel that the current implementation is unacceptably dangerous.
 More importantly, perhaps there's some way that we could make hadoop more 
 robust in handling its critical data files. I plan to place a directory owned 
 by root with dr permissions into each of these critical directories 
 in order to prevent any of them from suffering the fate of our DFS. This 
 could become part of a standard hadoop installation.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]

Stefan Groschupf updated NUTCH-337:
---

Attachment: respectFetcherParsePropertyV1.patch

Hi Jeremy, thanks for catching this. I attached a fix. It should be easy for a
committer to commit this to trunk.

 Fetcher ignores the fetcher.parse value configured in config file
 -

 Key: NUTCH-337
 URL: http://issues.apache.org/jira/browse/NUTCH-337
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.9.0
Reporter: Jeremy Huylebroeck
Priority: Trivial
 Attachments: respectFetcherParsePropertyV1.patch


 Using the command line call to Fetcher, if the noParsing parameter is given,
 everything is fine.
 If noParsing is not given, the value in nutch-site.xml (or
 nutch-default.xml) should be taken, but it is "true" that is always given to
 the call to fetch.
 It should be the value from the conf.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]

Stefan Groschupf updated NUTCH-337:
---

Priority: Major  (was: Trivial)

 Fetcher ignores the fetcher.parse value configured in config file
 -

 Key: NUTCH-337
 URL: http://issues.apache.org/jira/browse/NUTCH-337
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.9.0
Reporter: Jeremy Huylebroeck
 Attachments: respectFetcherParsePropertyV1.patch


 Using the command line call to Fetcher, if the noParsing parameter is given,
 everything is fine.
 If noParsing is not given, the value in nutch-site.xml (or
 nutch-default.xml) should be taken, but it is "true" that is always given to
 the call to fetch.
 It should be the value from the conf.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE

2006-08-17 Thread Stefan Groschupf (JIRA)
urls blocked db.fetch.retry.max * http.max.delays times during fetching are 
marked as STATUS_DB_GONE  
--

 Key: NUTCH-350
 URL: http://issues.apache.org/jira/browse/NUTCH-350
 Project: Nutch
  Issue Type: Bug
Reporter: Stefan Groschupf
Priority: Critical


Intranet crawls or focused crawls will fetch many pages from the same host.
This causes a thread to be blocked since another thread is already fetching
from the same host. It is very likely that threads are blocked more often than
http.max.delays allows. In such a case the HttpBase.blockAddr method throws an
HttpException. This is handled in the fetcher by incrementing the crawlDatum
retries and setting the status to STATUS_FETCH_RETRY. That means you have only
db.fetch.retry.max * http.max.delays chances to fetch a url. But in
intranet or focused crawls it is very likely that this is not enough. Increasing
one of the involved properties dramatically slows down the fetch.
I suggest not increasing the CrawlDatum retriesSinceFetch in case the problem
was caused by a blocked thread.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE

2006-08-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-350?page=all ]

Stefan Groschupf updated NUTCH-350:
---

Attachment: protocolRetryV5.patch

This patch will dramatically increase the number of successfully fetched pages
of an intranet crawl over time.

 urls blocked db.fetch.retry.max * http.max.delays times during fetching are 
 marked as STATUS_DB_GONE
 

 Key: NUTCH-350
 URL: http://issues.apache.org/jira/browse/NUTCH-350
 Project: Nutch
  Issue Type: Bug
Reporter: Stefan Groschupf
Priority: Critical
 Attachments: protocolRetryV5.patch


 Intranet crawls or focused crawls will fetch many pages from the same host.
 This causes a thread to be blocked since another thread is already fetching
 from the same host. It is very likely that threads are blocked more often
 than http.max.delays allows. In such a case the HttpBase.blockAddr method
 throws an HttpException. This is handled in the fetcher by incrementing the
 crawlDatum retries and setting the status to STATUS_FETCH_RETRY. That means
 that you have only db.fetch.retry.max * http.max.delays chances to fetch
 a url. But in intranet or focused crawls it is very likely that this is not
 enough. Increasing one of the involved properties dramatically slows down the
 fetch.
 I suggest not increasing the CrawlDatum retriesSinceFetch in case the
 problem was caused by a blocked thread.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12428858 ] 

Stefan Groschupf commented on NUTCH-322:


I think this is a serious problem. Page A does a server-side redirect to Page B.
Page A is never written to the output. This means that Page A's state and next
fetch time never change, so page A is fetched again, again, again ... ∞

I suggest that we write out Page A with a status change to STATUS_DB_GONE.


 Fetcher discards ProtocolStatus, doesn't store redirected pages
 ---

 Key: NUTCH-322
 URL: http://issues.apache.org/jira/browse/NUTCH-322
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Andrzej Bialecki 
 Fix For: 0.9.0


 Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus 
 contains important information, such as protocol-level response code, 
 lastModified time, and possibly other messages.
 I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, 
 which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In 
 addition, if ProtocolStatus contains a valid lastModified time, that 
 CrawlDatum's modified time should also be set to this value.
 Additionally, Fetcher doesn't store redirected pages. Content of such pages 
 is silently discarded. When Fetcher translates from protocol-level status to 
 crawldb-level status it should probably store such pages with the following 
 translation of status codes:
 * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code
 indicates a transient change, so we probably shouldn't mark the initial URL
 as bad.
 * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a
 permanent change, so the initial URL is no longer valid, i.e. it will always
 result in redirects.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-08-17 Thread Stefan Groschupf (JIRA)
pages that serverside forwards will be refetched every time
---

 Key: NUTCH-353
 URL: http://issues.apache.org/jira/browse/NUTCH-353
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8.1
 Attachments: doNotRefecthForwarderPagesV1.patch

Pages that do a server-side forward are not written back into the crawlDb with a
status change. The nextFetchTime is not changed either.
This causes a refetch of the same page again and again. The result is that Nutch
is not polite, refetching the forwarding and target pages in each segment
iteration. It also affects the scoring, since the forwarding page contributes its
score to all outlinks.



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-08-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-353?page=all ]

Stefan Groschupf updated NUTCH-353:
---

Attachment: doNotRefecthForwarderPagesV1.patch

Since we discussed that Nutch needs to be more polite, we should fix that asap.

 pages that serverside forwards will be refetched every time
 ---

 Key: NUTCH-353
 URL: http://issues.apache.org/jira/browse/NUTCH-353
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8.1

 Attachments: doNotRefecthForwarderPagesV1.patch


 Pages that do a server-side forward are not written back into the crawlDb with
 a status change. The nextFetchTime is not changed either.
 This causes a refetch of the same page again and again. The result is that
 Nutch is not polite, refetching the forwarding and target pages in each segment
 iteration. It also affects the scoring, since the forwarding page contributes
 its score to all outlinks.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Resolved: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]

Stefan Groschupf resolved NUTCH-322.


Resolution: Duplicate

duplicate of NUTCH-353

 Fetcher discards ProtocolStatus, doesn't store redirected pages
 ---

 Key: NUTCH-322
 URL: http://issues.apache.org/jira/browse/NUTCH-322
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Andrzej Bialecki 
 Fix For: 0.9.0


 Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus 
 contains important information, such as protocol-level response code, 
 lastModified time, and possibly other messages.
 I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, 
 which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In 
 addition, if ProtocolStatus contains a valid lastModified time, that 
 CrawlDatum's modified time should also be set to this value.
 Additionally, Fetcher doesn't store redirected pages. Content of such pages 
 is silently discarded. When Fetcher translates from protocol-level status to 
 crawldb-level status it should probably store such pages with the following 
 translation of status codes:
 * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code
 indicates a transient change, so we probably shouldn't mark the initial URL
 as bad.
 * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a
 permanent change, so the initial URL is no longer valid, i.e. it will always
 result in redirects.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-347) Build: plugins' Jars not found

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-347?page=comments#action_12428915 ] 

Stefan Groschupf commented on NUTCH-347:


Please submit this patch! 
Thanks!

 Build: plugins' Jars not found
 --

 Key: NUTCH-347
 URL: http://issues.apache.org/jira/browse/NUTCH-347
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Otis Gospodnetic
 Attachments: nutch_build_plugins_patch.txt


 While building Nutch, I noticed several places where various Jars from 
 plugins' lib directories could not be found, for example:
 $ ant package
 ...
 deploy:
  [copy] Warning: Could not find file 
 /home/otis/dev/repos/lucene/nutch/trunk/build/lib-log4j/lib-log4j.jar to copy.
 init:
 init-plugin:
 compile:
 jar:
 deps-test:
 deploy:
  [copy] Warning: Could not find file 
 /home/otis/dev/repos/lucene/nutch/trunk/build/lib-nekohtml/lib-nekohtml.jar 
 to copy.
 ...
 The problem is, these lib-.jar files do not exist.  Instead, those Jars 
 are typically named with a version in the name, like log4j-1.2.11.jar.  I 
 could not find where this lib- prefix comes from, nor where the version is 
 dropped from the name. Does anyone know?
 In order to avoid these errors I had to make symbolic links and fake things:
 e.g.
   ln -s log4j-1.2.11.jar lib-log4j.jar
 But this should really be fixed somewhere, I just can't see where... :(
 Note that this doesn't completely break the build, but missing Jars can't be 
 a good thing.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-346) Improve readability of logs/hadoop.log

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-346?page=comments#action_12428917 ] 

Stefan Groschupf commented on NUTCH-346:


+1
I agree, can you please create a patch file and attach it to this bug. 
Thanks

 Improve readability of logs/hadoop.log
 --

 Key: NUTCH-346
 URL: http://issues.apache.org/jira/browse/NUTCH-346
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: ubuntu dapper
Reporter: Renaud Richardet
Priority: Minor

 adding
 log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
 to conf/log4j.properties
 dramatically improves the readability of the logs in logs/hadoop.log (removes 
 all INFO)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-345) Add support for Content-Encoding: deflated

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428918 ] 

Stefan Groschupf commented on NUTCH-345:


Shouldn't the DeflateUtils also be part of the protocol-http plugin?
Also, since it is a larger contribution and not just a small bug fix, it would be
great to have a JUnit test within the patch.
Thanks for the contribution.



 Add support for Content-Encoding: deflated
 --

 Key: NUTCH-345
 URL: http://issues.apache.org/jira/browse/NUTCH-345
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Pascal Beis
Priority: Minor
 Attachments: nutch-deflate.patch


 Add support for the deflated content-encoding, next to the already
 implemented GZIP content-encoding. Patch attached. See also the
 Patch: deflate encoding thread on nutch-dev on August 7/8 2006.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8

2006-08-16 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428537 ] 

Stefan Groschupf commented on NUTCH-349:


My vote goes to #2.
Having a tool that needs to be started manually would be better than complicating
the already fragile code, from my point of view.

 Port Nutch to use Hadoop Text instead of UTF8
 -

 Key: NUTCH-349
 URL: http://issues.apache.org/jira/browse/NUTCH-349
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Andrzej Bialecki 

 Currently Nutch uses the org.apache.hadoop.io.UTF8 class to store/read Strings.
 This class has been deprecated in Hadoop 0.5.0, and the Text class should be
 used instead. Sooner or later we will need to move Nutch to use this class
 instead of UTF8.
 This raises numerous issues regarding the compatibility of existing data in
 CrawlDB, LinkDB and segments. I can see two ways to solve this:
 * add code in readers of the respective formats to convert UTF8 to Text on the
 fly. New writers would only use Text. This is less than ideal, because it
 complicates the code, and also at some point in time the UTF8 class will be
 removed.
 * create a converter (to be maintained as long as UTF8 exists), which
 converts existing data in bulk from UTF8 to Text. This requires an additional
 processing step when upgrading, to convert all existing data to the new format.
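
As an illustration of the second option, a hedged sketch of a bulk converter that rewrites a single SequenceFile with UTF8 keys into one with Text keys; the paths and the CrawlDatum value class are placeholders, and a real tool would walk all crawldb, linkdb, and segment parts.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.UTF8;
import org.apache.nutch.crawl.CrawlDatum;

// Sketch: copy one SequenceFile, converting UTF8 keys to Text keys in bulk.
public class Utf8ToTextConverter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path(args[0]);
    Path out = new Path(args[1]);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out,
        Text.class, CrawlDatum.class);

    UTF8 oldKey = new UTF8();
    CrawlDatum value = new CrawlDatum();
    Text newKey = new Text();
    while (reader.next(oldKey, value)) {
      // Same key bytes, new key class.
      newKey.set(oldKey.toString());
      writer.append(newKey, value);
    }
    writer.close();
    reader.close();
  }
}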

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-08-16 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ] 

Stefan Groschupf commented on NUTCH-233:


Hi Otis,
yes, for a serious whole-web crawl I need to change this regex first.
It only hangs on some random urls that, for example, come from link farms the
crawler runs into.

 wrong regular expression hang reduce process for ever
 -

 Key: NUTCH-233
 URL: http://issues.apache.org/jira/browse/NUTCH-233
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.9.0


 It looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt
 wasn't compatible with java.util.regex, which is actually used in the regex
 url filter.
 Maybe it was missed when the regular expression package was changed.
 The problem was that while reducing a fetch map output the reducer hung
 forever, since the output format was applying the urlfilter to a url that
 caused the hang.
 060315 230823 task_r_3n4zga at
 java.lang.Character.codePointAt(Character.java:2335)
 060315 230823 task_r_3n4zga at
 java.util.regex.Pattern$Dot.match(Pattern.java:4092)
 060315 230823 task_r_3n4zga at
 java.util.regex.Pattern$Curly.match1(Pattern.java:
 I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the
 fetch job works. (Thanks to Grant and Chris B. for helping to find the new
 regex.)
 However, maybe people can review it and suggest improvements. The old
 regex would match:
 abcd/foo/bar/foo/bar/foo/ and the new one will also match it. But the
 old regex would also match:
 abcd/foo/bar/xyz/foo/bar/foo/ which the new regex will not match.
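
The difference between the two expressions can be checked directly; a small self-contained sketch using the two example URLs from the description:

import java.util.regex.Pattern;

// Compare the old and new repeated-path-segment rules from regex-urlfilter.txt
// on the two example URLs mentioned above.
public class RegexFilterCheck {
  public static void main(String[] args) {
    Pattern oldRule = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    Pattern newRule = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    String repeating = "abcd/foo/bar/foo/bar/foo/";        // genuine segment loop
    String notRepeating = "abcd/foo/bar/xyz/foo/bar/foo/"; // no adjacent loop

    // Both rules flag the genuinely repeating path ...
    System.out.println(oldRule.matcher(repeating).find());     // true
    System.out.println(newRule.matcher(repeating).find());     // true

    // ... but only the old rule (the one behind the reducer hang) flags this one.
    System.out.println(oldRule.matcher(notRepeating).find());  // true
    System.out.println(newRule.matcher(notRepeating).find());  // false
  }
}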

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-348) Generator is building fetch list using *lowest* scoring URLs

2006-08-16 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-348?page=all ]

Stefan Groschupf updated NUTCH-348:
---

Attachment: sortPatchV1.patch

What do people think about this kind of solution?

 Generator is building fetch list using *lowest* scoring URLs
 

 Key: NUTCH-348
 URL: http://issues.apache.org/jira/browse/NUTCH-348
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Chris Schneider
 Attachments: sortPatchV1.patch


 Ever since revision 391271, when the CrawlDatum key was replaced by a 
 FloatWritable key, the Generator.Selector.reduce method has been outputting 
 the *lowest* scoring URLs! The CrawlDatum class has a Comparator that 
 essentially treats higher scoring CrawlDatum objects as less than lower 
 scoring CrawlDatum objects, so the higher scoring ones would appear first in 
 a sequence file sorted using this as the key.
 When a FloatWritable based on the score itself (as returned from 
 scfilters.generatorSortValue) became the sort key, it should have been 
 negated in Generator.Selector.map to have the same result. Curiously, there 
 is a comment to this effect immediately before the FloatWritable is set:
   // sort by decreasing score
   sortValue.set(sort);
 It seems like the simplest way to fix this is to just negate the score, and 
 this seems to work for me:
   // sort by decreasing score
   // 2006-08-15 CSc REALLY sort by decreasing score
   sortValue.set(-sort);
 Unfortunately, this means that any crawls that have been done using 
 Generator.java after revision 391271 should be discarded, as they were 
 focused on fetching the lowest scoring unfetched URLs in the crawldb, 
 essentially pointing the crawler 180 degrees from its intended direction.
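
 To see why negating the score works (a toy illustration, not from the patch):
 the framework sorts keys in ascending order, so sorting on -score puts the
 highest-scoring entries first.

import java.util.Arrays;

public class SortDemo {
  public static void main(String[] args) {
    // keys are the negated scores 0.2, 1.5, 0.7
    Float[] keys = { new Float(-0.2f), new Float(-1.5f), new Float(-0.7f) };
    Arrays.sort(keys);                          // ascending: -1.5, -0.7, -0.2
    System.out.println(Arrays.toString(keys));  // i.e. scores 1.5, 0.7, 0.2 - highest first
  }
}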

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (NUTCH-332) doubling score causes by page internal anchors.

2006-07-28 Thread Stefan Groschupf (JIRA)
doubling score causes by page internal anchors.
---

 Key: NUTCH-332
 URL: http://issues.apache.org/jira/browse/NUTCH-332
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8-dev


When a page has no outlinks but several links to itself (e.g. it has a set of 
anchors), the page's score is distributed to its outlinks. But all these 
outlinks point back to the page itself, so the page score is doubled. 
I'm not sure, but maybe this also causes a never-ending fetching loop for such a 
page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to 
CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer line 107. 
So the fetched status may be overwritten with unfetched. 
In that case we fetch the page again every time and also double its score 
every time, which leads to very high scores for no reason.
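
For illustration with made-up numbers: a page with score 1.0 and three anchor 
links to itself passes 1.0/3 to each "outlink"; since all three point back, the 
next crawldb update adds 3 * (1.0/3) = 1.0 back to the page, taking it from 1.0 
to 2.0, and the same doubling repeats on every iteration.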

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-26 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423539 ] 

Stefan Groschupf commented on NUTCH-318:


Yes, this happens only in a distributed environment. Please also see my last 
mail on the hadoop-dev list. I think there are more general logging problems 
that only occur in a distributed environment, so you will not track them down 
using the local runner.

 log4j not proper configured, readdb doesnt give any information
 ---

 Key: NUTCH-318
 URL: http://issues.apache.org/jira/browse/NUTCH-318
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.9-dev


 In the latest .8 sources the readdb command doesn't dump any information 
 anymore. 
 This is related to the misconfigured log4j.properties file. 
 Changing:
 log4j.rootLogger=INFO,DRFA
 to:
 log4j.rootLogger=INFO,DRFA,stdout
 dumps the information to the console, but not in a nice way. 
 What makes me wonder is that this information should also be in the log 
 file, but it isn't, so there may be problems here as well.
 Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and 
 hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?
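
 For completeness, adding stdout to the root logger only takes effect if a matching
 console appender is defined as well; a minimal sketch (appender name and layout are
 just an example, not the shipped configuration):

log4j.rootLogger=INFO,DRFA,stdout

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n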

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-25 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423433 ] 

Stefan Groschupf commented on NUTCH-318:


Shouldn't that be fixed in .8, since as of today this tool just produces no output?!


 log4j not proper configured, readdb doesnt give any information
 ---

 Key: NUTCH-318
 URL: http://issues.apache.org/jira/browse/NUTCH-318
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.9-dev


 In the latest .8 sources the readdb command doesn't dump any information 
 anymore. 
 This is related to the misconfigured log4j.properties file. 
 Changing:
 log4j.rootLogger=INFO,DRFA
 to:
 log4j.rootLogger=INFO,DRFA,stdout
 dumps the information to the console, but not in a nice way. 
 What makes me wonder is that this information should also be in the log 
 file, but it isn't, so there may be problems here as well.
 Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and 
 hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-07-25 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ] 

Stefan Groschupf commented on NUTCH-233:


I think this should be fixed in .8 too, since everybody who does a real whole-web 
crawl with over 100 million pages will run into this problem. The problematic 
URLs come, for example, from spam-bot-generated link farms. 



 wrong regular expression hang reduce process for ever
 -

 Key: NUTCH-233
 URL: http://issues.apache.org/jira/browse/NUTCH-233
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.9-dev


 It looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt 
 isn't compatible with java.util.regex, which is what the regex URL 
 filter actually uses. 
 Maybe this was missed when the regular expression package was 
 changed.
 The problem was that while reducing a fetch map output the reducer hung 
 forever, since the output format was applying the URL filter to a URL that 
 triggers the hang:
 060315 230823 task_r_3n4zga at 
 java.lang.Character.codePointAt(Character.java:2335)
 060315 230823 task_r_3n4zga at 
 java.util.regex.Pattern$Dot.match(Pattern.java:4092)
 060315 230823 task_r_3n4zga at 
 java.util.regex.Pattern$Curly.match1(Pattern.java:
 I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the 
 fetch job works. (Thanks to Grant and Chris B. for helping to find the new regex.)
 However, people may want to review it and suggest improvements. The old 
 regex would match:
 abcd/foo/bar/foo/bar/foo/ and so will the new one. But the 
 old regex would also match:
 abcd/foo/bar/xyz/foo/bar/foo/ which the new regex will not match.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: segread vs. readseg

2006-07-24 Thread Stefan Groschupf

I like it!

Am 24.07.2006 um 16:10 schrieb Andrzej Bialecki:


Stefan Neufeind wrote:

Andrzej Bialecki wrote:

Stefan Groschupf wrote:

Hi developers,

we have commands like readdb and readlinkdb, but also segread. Wouldn't  
it be more consistent to name the command readseg instead of segread?

... just a thought.


Yes, it seems more consistent. However, if we change it then  
scripts people wrote would break. We could support both aliases  
in 0.8, and give a deprecation message.


What do others think?


Same feeling here. Agreed.


What about the following?

Index: bin/nutch
===================================================================
--- bin/nutch   (revision 424960)
+++ bin/nutch   (working copy)
@@ -40,7 +40,7 @@
   echo "  generate  generate new segments to fetch"
   echo "  fetch     fetch a segment's pages"
   echo "  parse     parse a segment's pages"
-  echo "  segread   read / dump segment data"
+  echo "  readseg   read / dump segment data"
   echo "  mergesegs merge several segments, with optional filtering and slicing"
   echo "  updatedb  update crawl db from segments after fetching"
   echo "  invertlinks   create a linkdb from parsed segments"
@@ -158,7 +158,10 @@
   CLASS=org.apache.nutch.crawl.CrawlDbMerger
 elif [ "$COMMAND" = "readlinkdb" ] ; then
   CLASS=org.apache.nutch.crawl.LinkDbReader
+elif [ "$COMMAND" = "readseg" ] ; then
+  CLASS=org.apache.nutch.segment.SegmentReader
 elif [ "$COMMAND" = "segread" ] ; then
+  echo "[DEPRECATED] Command 'segread' is deprecated, use 'readseg' instead."
   CLASS=org.apache.nutch.segment.SegmentReader
 elif [ "$COMMAND" = "mergesegs" ] ; then
   CLASS=org.apache.nutch.segment.SegmentMerger


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com







result comparison tool?

2006-07-23 Thread Stefan Groschupf

Hi,

I remember there was a search result comparison tool within nutch.
Is that still alive? How do I use it / find it? I was not able to find  
it by browsing the trunk sources.
Is there any such tool people can suggest for comparing search results  
with yahoo or google results, to play with configuration properties and  
scoring mechanisms?


Thanks for any hints.
Stefan


nutch-extensionpoints not in plugin.includes

2006-07-20 Thread Stefan Groschupf

Hi developers,

in the nutch-default.xml property plugin.includes we say: "In any case
you need to at least include the nutch-extensionpoints plugin."
But we do not include it by default:
   <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-
basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
We should maybe update the text or include the plugin; everything
else may confuse users.
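
For illustration, a nutch-site.xml override that includes the plugin up front
could look like this (only a sketch of the idea, not a proposed default):

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>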

Should I open a bug or can someone with write access just jump in and
fix that.

Thanks,
Stefan 


Re: nutch-extensionpoints not in plugin.includes

2006-07-20 Thread Stefan Groschupf
I may - but since you know the details of the plugin subsystem,  
tell me what _should_ be there? I.e. should we really include it in  
the plugin.includes list, or not?


This is a philosophical question.
I personally prefer strict definitions, since the application's behavior  
is better traceable. That was one reason I implemented the plugin  
system in a strict way.
Later on this was washed out by the plugin.auto-activation mechanism,  
which I still think was not a good move.


However, at the moment we have the situation that nutch- 
extensionpoints is not included, but the auto-activation mechanism  
includes this plugin since it is used by all other plugins.
So if you switch off auto-activation today with the default configured  
plugin.includes, nutch will crash.
My personal point of view is to add nutch-extensionpoints and switch  
off auto-activation... but this is just my personal point of view.



Stefan



[jira] Created: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2006-07-20 Thread Stefan Groschupf (JIRA)
UrlFilters.java throws NPE in case urlfilter.order contains Filters that are 
not in plugin.includes
---

 Key: NUTCH-325
 URL: http://issues.apache.org/jira/browse/NUTCH-325
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 0.8-dev


In the URLFilters constructor we create an array as long as the number of filters 
defined in the urlfilter.order property. 
In case those filters are not included in the plugin.includes property, we end up 
putting null entries into the array.

This causes an NPE in URLFilters line 82.



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2006-07-20 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-325?page=all ]

Stefan Groschupf updated NUTCH-325:
---

Attachment: UrlFiltersNPE.patch

A patch that uses an ArrayList instead of an array and only puts entries into the 
list when they are not null. This means only URL filters that are actually loaded 
will be stored in the filters array that is cached in the 
Configuration object. 

 UrlFilters.java throws NPE in case urlfilter.order contains Filters that are 
 not in plugin.includes
 ---

 Key: NUTCH-325
 URL: http://issues.apache.org/jira/browse/NUTCH-325
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 0.8-dev

 Attachments: UrlFiltersNPE.patch


 In the URLFilters constructor we create an array as long as the number of filters 
 defined in the urlfilter.order property. 
 In case those filters are not included in the plugin.includes property, we end up 
 putting null entries into the array.
 This causes an NPE in URLFilters line 82.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




log when blocked by robots.txt

2006-07-20 Thread Stefan Groschupf

Hi Developers,
another thing in the discussion about being more polite.
I suggest that we log a message in case a requested URL was blocked  
by a robots.txt.
Optimally we would only log this message when just the currently used  
agent name is blocked, and not when it is a general blocking of all  
agents.


Should I create a patch?

Stefan



[jira] Updated: (NUTCH-323) CrawlDatum.set just reference a mapWritable of a other object but not copy it.

2006-07-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-323?page=all ]

Stefan Groschupf updated NUTCH-323:
---

Attachment: MapWritableCopyConstructor.patch

The attached patch adds a copy constructor to the MapWritable and uses it in the 
CrawlDatum.set method. There are more places in the code where meta 
data is passed from one CrawlDatum to another, but I can't see any risk of 
concurrent usage of the MapWritable there. 
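
The sharing problem itself can be shown with plain java.util maps standing in for
MapWritable (a toy illustration, not the patch):

import java.util.HashMap;
import java.util.Map;

public class SharedMapDemo {
  public static void main(String[] args) {
    Map original = new HashMap();
    original.put("k1", "v1");

    Map shared = original;                 // what a plain reference copy does
    Map copied = new HashMap(original);    // what a copy constructor does instead

    original.put("k2", "v2");
    System.out.println(shared.containsKey("k2"));  // true  - both objects see the change
    System.out.println(copied.containsKey("k2"));  // false - the copy stays isolated
  }
}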


 CrawlDatum.set just reference a mapWritable of a other object but not copy it.
 --

 Key: NUTCH-323
 URL: http://issues.apache.org/jira/browse/NUTCH-323
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.8-dev

 Attachments: MapWritableCopyConstructor.patch


 Using CrawlDatum.set(aOtherCrawlDatum) copies the data from one CrawlDatum to 
 another. 
 However, only a reference to the MapWritable is passed, which means both objects 
 share the same MapWritable and its content. 
 This causes problems when the MapWritable and its key-value 
 tuples are manipulated concurrently. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-19 Thread Stefan Groschupf (JIRA)
db.score.link.internal and db.score.link.external are ignored
-

 Key: NUTCH-324
 URL: http://issues.apache.org/jira/browse/NUTCH-324
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Stefan Groschupf
Priority: Critical


The configuration properties db.score.link.external and db.score.link.internal are 
ignored.
For e.g. message board webpages, or pages that have large navigation 
menus on each page, giving internal links a lower impact makes a lot of sense 
for scoring.
This is also a serious problem for web spam, since spammers can set up just 
one domain with dynamically generated pages and thereby heavily manipulate the 
nutch scores. 
So I also suggest that we give db.score.link.internal a default value of 
something like 0.25. 


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-324?page=all ]

Stefan Groschupf updated NUTCH-324:
---

Attachment: InternalAndExternalLinkScoreFactor.patch

The patch multiplies the score of a page during distributeScoreToOutlink by 
db.score.link.internal or db.score.link.external.
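
Roughly the idea, as a sketch (method and variable names are made up, not the
patch itself):

import org.apache.hadoop.conf.Configuration;

public class LinkFactorSketch {
  // Dampen the score that is passed along an internal link vs. an external one.
  static float outlinkScore(Configuration conf, boolean internal,
                            float pageScore, int outlinkCount) {
    float factor = internal
        ? conf.getFloat("db.score.link.internal", 1.0f)
        : conf.getFloat("db.score.link.external", 1.0f);
    return pageScore * factor / outlinkCount;
  }
}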

 db.score.link.internal and db.score.link.external are ignored
 -

 Key: NUTCH-324
 URL: http://issues.apache.org/jira/browse/NUTCH-324
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Stefan Groschupf
Priority: Critical
 Attachments: InternalAndExternalLinkScoreFactor.patch


 The configuration properties db.score.link.external and db.score.link.internal 
 are ignored.
 For e.g. message board webpages, or pages that have large navigation 
 menus on each page, giving internal links a lower impact makes a lot of 
 sense for scoring.
 This is also a serious problem for web spam, since spammers can set up 
 just one domain with dynamically generated pages and thereby heavily manipulate 
 the nutch scores. 
 So I also suggest that we give db.score.link.internal a default value of 
 something like 0.25. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Resolved: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace

2006-07-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-319?page=all ]

Stefan Groschupf resolved NUTCH-319.


Resolution: Won't Fix

Sorry, that is bogus since it is written to the logging stream.

 OPICScoringFilter should use logging API instead of printStackTrace
 ---

 Key: NUTCH-319
 URL: http://issues.apache.org/jira/browse/NUTCH-319
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
 Assigned To: Andrzej Bialecki 
Priority: Trivial
 Fix For: 0.8-dev


 OPICScoringFilter line 107 should be a logging call, not an   
 e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




db.max.inlinks

2006-07-18 Thread Stefan Groschupf

Hi,

shouldn't  db.max.inlinks be in the nutch-default.xml configuration?

Stefan 


OPICScoringFilter Metadata transport scores as String

2006-07-15 Thread Stefan Groschupf

Hi,

OPICScoringFilter line 91:
content.getMetadata().set(Fetcher.SCORE_KEY, "" + datum.getScore());
and in lines 96 and 102 we set and get the fetch score as Strings. :-o
Wouldn't it be better to have the Metadata support floats as well,  
instead of serializing and parsing strings?
In general, wouldn't it be a good idea to have Metadata as a child of  
MapWritable? OO design?


Any thoughts?

Stefan



[jira] Created: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace

2006-07-15 Thread Stefan Groschupf (JIRA)
OPICScoringFilter should use logging API instead of printStackTrace
---

 Key: NUTCH-319
 URL: http://issues.apache.org/jira/browse/NUTCH-319
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
 Assigned To: Andrzej Bialecki 
Priority: Trivial
 Fix For: 0.8-dev


OPICScoringFilter line 107 should be a logging call, not an   
e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: [Nutch-dev] Crawl error

2006-07-10 Thread Stefan Groschupf
As mentioned, set the environment variables that bin/nutch sets also for  
Eclipse, especially the logging-related variables!



Am 10.07.2006 um 00:05 schrieb AJ Chen:


My classpath has the conf folder. NUTCH_JAVA_HOME is set. In fact, nutch
0.7.1 is working well from my eclipse. I suspect the error comes from
changes in
version 0.8. The problem is the log message does not say what file  
is not

found. So, it's hard to debug.  Any idea?
Thanks,
AJ

On 7/9/06, Stefan Groschupf [EMAIL PROTECTED] wrote:


Try to put the conf folder on your classpath in eclipse and set the
environment variables that are set in bin/nutch.

Btw, please do not crosspost.
Thanks.
Stefan

Am 09.07.2006 um 21:47 schrieb AJ Chen:

 I checked out the 0.8 code from trunk and tried to set it up in
 eclipse.
 When trying to run Crawl from Eclipse using args urls -dir crawl
 -depth 3 -topN 50, I got the following error, which started from
 LogFactory.getLog(Crawl.class). Any idea what file was not found?  There is a url
 file under
 directory urls. Thanks,

 log4j:ERROR setFile(null,true) call failed.
 java.io.FileNotFoundException: \ (The system cannot find the path specified)
    at java.io.FileOutputStream.openAppend(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
    at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
    at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
    at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
    at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
    at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
    at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
    at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
    at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:209)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:351)
    at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:38)
 log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].

 -AJ

  






[jira] Created: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-10 Thread Stefan Groschupf (JIRA)
log4j not proper configured, readdb doesnt give any information
---

 Key: NUTCH-318
 URL: http://issues.apache.org/jira/browse/NUTCH-318
 Project: Nutch
Type: Bug

Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.8-dev


In the latest .8 sources the readdb command doesn't dump any information 
anymore. 
This is related to the misconfigured log4j.properties file. 
Changing:
log4j.rootLogger=INFO,DRFA
to:
log4j.rootLogger=INFO,DRFA,stdout
dumps the information to the console, but not in a nice way. 

What makes me wonder is that this information should also be in the log file, 
but it isn't, so there may be problems here as well.
Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and 
hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: [Nutch-dev] Crawl error

2006-07-09 Thread Stefan Groschupf
Try to put the conf folder to your classpath in eclipse and set the  
environemnt variables that are setted in  bin/nutch.


Btw, please do not crosspost.
Thanks.
Stefan

Am 09.07.2006 um 21:47 schrieb AJ Chen:

I checked out the 0.8 code from trunk and tried to set it up in  
eclipse.
When trying to run Crawl from Eclipse using args urls -dir crawl  
-depth 3 -topN 50, I got the following error, which started from  
LogFactory.getLog(Crawl.class). Any idea what file was not found?  There is a url  
file under

directory urls. Thanks,

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path specified)
   at java.io.FileOutputStream.openAppend(Native Method)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
   at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
   at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
   at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
   at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
   at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
   at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
   at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
   at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
   at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
   at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
   at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
   at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
   at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
   at org.apache.log4j.Logger.getLogger(Logger.java:104)
   at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
   at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
   at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
   at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
   at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:209)
   at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:351)
   at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:38)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].

-AJ

-- 




Re: Nutch based directory and crawler based on keyword

2006-07-09 Thread Stefan Groschupf

Hi,

this question is difficult to answer, and there may be more experts on  
the nutch user list than on the developer list.
In nutch 0.8 you can use the new scoring API to change the score with  
which a page is scheduled for crawling.  
Have a look at the OPIC scoring plugin and at the CrawlDatum meta data.  
The meta data can be used to transport information such as custom  
category weighting scores that take effect in the CrawlDatum score  
calculation.
Note that this is not scoring at search time, this is scoring for  
crawl scheduling.
Besides that, maybe the simplest way is to write an indexing plugin that  
tags a page (keywordMatch:true / false) depending on whether a keyword occurs.  
During search you extend the search string behind the scenes with  
something like: yourSearchString + keywordMatch:true
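
A rough sketch of such an indexing plugin (illustration only: the exact
IndexingFilter signature in 0.8, the Lucene Field flags, and the keyword are
assumptions, not actual Nutch code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.UTF8;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

// Sketch only: tags every document with keywordMatch=true/false.
public class KeywordMatchFilter implements IndexingFilter {
  private Configuration conf;
  private String keyword = "nutch";   // hypothetical; would come from a config property

  public Document filter(Document doc, Parse parse, UTF8 url,
                         CrawlDatum datum, Inlinks inlinks) {
    boolean match = parse.getText().toLowerCase().indexOf(keyword) >= 0;
    // untokenized marker field, so a query can match on it exactly
    doc.add(new Field("keywordMatch", match ? "true" : "false",
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}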


Stefan




Am 08.07.2006 um 07:03 schrieb Syed Kamran Ali:


Hi,

I have successfully configured nutch 0.7.2 and ran the crawler a few  
times, all
working fine. Now I wanted to know: is there a way I can run the  
crawler so

that it indexes a website only if it finds a certain keyword in it,
and otherwise not? Also, after I have the index created, is it possible  
that I can
create a categorized directory, like the yahoo and google  
directories?


--
Thanks
Kamran




Re: Error with Hadoop-0.4.0

2006-07-07 Thread Stefan Groschupf

Hi Jérôme,

I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.
We should fix that.

Stefan

On 06.07.2006, at 08:54, Jérôme Charron wrote:


Hi,

I encountered some problems with Nutch trunk version.
In fact it seems to be related to changes related to Hadoop-0.4.0  
and JDK

1.5
(more precisely since HADOOP-129 and File replacement by Path).

In my environment, the crawl command terminates with the following  
error:
2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273))
- Input directory /localpath/crawl/crawldb/current in local is invalid.

Exception in thread "main" java.io.IOException: Input directory
/localpath/crawl/crawldb/current in local is invalid.
   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
   at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

By looking at the Nutch code, and simply changing line 145 of  
Injector

to mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)),
all is working fine. By taking a closer look at the CrawlDb code, I  
finally don't

understand why there is the following line in the createJob method:
job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));

Out of curiosity, maybe a hadoop guru can explain why there is such a
regression...

Does somebody have the same error?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/




Re: Error with Hadoop-0.4.0

2006-07-07 Thread Stefan Groschupf

We tried your suggested fix:

in Injector, mergeJob.setInputPath(tempDir) (instead of
mergeJob.addInputPath(tempDir))


and this worked without any problem.

Thanks for catching that, this saved us a lot of time.
Stefan

On 07.07.2006, at 16:08, Jérôme Charron wrote:


I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.


Thanks for this feedback Stefan.



We should fix that.


What I suggest is simply to remove line 75 in the createJob method  
of

CrawlDb:
setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
In fact, this method is only used by Injector.inject() and  
CrawlDb.update(),

and
the input path set in createJob is needed neither by  
Injector.inject()

nor
by CrawlDb.update().

If no objection, I will commit this change tomorrow.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/




Re: 0.8 release

2006-07-05 Thread Stefan Groschupf
+1, but I really would love to see NUTCH-293 as part of nutch .8,  
since this is all about being more polite.

Thanks.
Stefan

On 05.07.2006, at 03:46, Doug Cutting wrote:


+1

Piotr Kosiorowski wrote:

+1.
P.
Andrzej Bialecki wrote:

Sami Siren wrote:
How would folks feel about releasing 0.8 now? There have been  
quite a lot of improvements/new features
since the 0.7 series, and I strongly feel that we should push the  
first 0.8 series release (alpha/beta)
out the door now. It would IMO lower the barrier for first-timers to  
try the 0.8 series, and that would

give us more feedback about the overall quality.


Definitely +1. Let's do some testing, however, after the upgrade  
to hadoop 0.3.2 - hadoop had many, many changes, so we just need  
to make sure it's stable when used with Nutch ...


We should also check JIRA and apply any trivial fixes before the  
release.




If there is a consensus about this I can volunteer to be the RM.


That would be great, thanks!







<noindex>do not index</noindex>

2006-06-22 Thread Stefan Groschupf

Hi,
as far as I can see, nutch's html parser only supports the meta tag  
noindex (<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">), but there  
is also an unofficial html <noindex> tag.

http://www.webmasterworld.com/forum10003/2703.htm

Maybe this would be another thing to make nutch more polite.
Also, please remember my patch to support the Crawl-delay property in  
robots.txt. That would also be something important to make nutch more  
polite, and maybe a better way than removing the nutch crawler  
identification.
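
For reference, the two mechanisms side by side (example values only):

<!-- page-level, in the HTML head -->
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
<noindex>this part should not be indexed</noindex>   <!-- the unofficial tag -->

# site-level, in robots.txt (Crawl-delay is a de-facto extension)
User-agent: *
Crawl-delay: 5
Disallow: /cgi-bin/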


Thoughts?
Stefan 


Re: how to manipulate with MapWritable metaData in CrawlDatum structure

2006-06-12 Thread Stefan Groschupf

Hi Feng,

MapWritable is a kind of hashmap.
You can put in any key/value pair, but the keys and values need to be  
Writables:
http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/ 
Writable.html


You can use UTF8 as both key and value, or ByteWritable as key and  
UTF8 as value.

Etc.
Does this answer your question?
Stefan
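
A small self-contained example of the above (the CrawlDatum accessor names and
the MapWritable package are assumptions from memory; the key name is made up):

import org.apache.hadoop.io.UTF8;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.MapWritable;

public class MetaDataExample {
  public static void main(String[] args) {
    CrawlDatum datum = new CrawlDatum();
    MapWritable meta = new MapWritable();
    meta.put(new UTF8("myCategory"), new UTF8("sports"));  // string-like key and value
    datum.setMetaData(meta);                               // attach to the CrawlDatum

    UTF8 value = (UTF8) datum.getMetaData().get(new UTF8("myCategory"));
    System.out.println(value);                             // prints: sports
  }
}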


Am 12.06.2006 um 04:15 schrieb Feng Ji:


hi,

I wonder how to use MapWritable metaData in CrawlDatum.java. The  
API gives

us some function call, but I still don't know how to
input information (String) to metaData and retrieve information;  
How to
convert MapWritable variable to other types like MetaData type or  
String

type.

Any good sample in Nutch's java class?

thanks,

Feng




Re: nutch-default.xml configuration

2006-06-12 Thread Stefan Groschupf

Hi Lourival,

this means all pages older than 30 days are potential candidates for  
a fetch list that is created by the segment generation process.


Stefan



Am 12.06.2006 um 16:33 schrieb Lourival Júnior:


Hi all!

I have a question about the nutch-default.xml configuration file. There  
is a
parameter db.default.fetch.interval that is set by default to 30.  
It means

that pages from the webdb are recrawled every 30 days
(http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02058.html). I

want to know if this "recrawled" here means an automatic recrawl, or if I
have to
execute some shell script before this period to make possible  
updates to my

WebDB.

I really want to know this because so far I did not actually obtain an  
update.

Thanks a lot!

--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]




Re: nutch-default.xml configuration

2006-06-12 Thread Stefan Groschupf
Ok. So, do you have any solution to do this job automatically? I have  
a shell

script, but I don't know yet if it really works.

Shell scripts are the best solution.


Sorry if I'm being redundant. I'm learning about this tool and I have  
a lot of

questions :).
No problem, but the nutch user mailing list would be a better place  
to ask such questions.

Thanks!
Stefan



Thanks!

On 6/12/06, Dima Mazmanov [EMAIL PROTECTED] wrote:


Hi,Lourival.


You wrote on 12 June 2006, 19:33:15:

 Hi all!

 I have a question about nutch-default.xml configuration file.  
There is a
 parameter db.default.fetch.interval that is set by default to  
30. It

means
 that pages from the webdb are recrawled every 30 days
 (http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02058.html). I

 want to know if this recrawled here means automatic recrawl or I
 have to
 execute some shell script before this period to make possible  
updates to

my
 WebDB.

 I really wanna know this because at this time I did not obtain a  
update

in
 fact.

 Thanks a lot!


You have to recrawl db manually.


--
Regards,
Dima  mailto:[EMAIL PROTECTED]





--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]




[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-12 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]

Stefan Groschupf updated NUTCH-289:
---

Attachment: ipInCrawlDatumDraftV5.patch

Release Candidate 1 of this patch.

This patch contains:
+ adds the IP address to CrawlDatum version 5 (as byte[4]) 
+ an IpAddressResolver (MapRunnable) tool to look up the IPs multithreaded
+ adds a property to define whether the IpAddressResolver should be started as 
part of the crawldb update tool, to update the parseoutput folder (containing 
CrawlDatum status linked) of a segment before updating the crawldb
+ uses the cached IP during generation

Please review this patch and give me any improvement suggestions. I think this 
is a very important issue, since it helps to do _real_ whole-web crawls and not 
end up in a honey pot after some fetch iterations.
Also, if you like, please vote for this issue. :-) Thanks.

 CrawlDatum should store IP address
 --

  Key: NUTCH-289
  URL: http://issues.apache.org/jira/browse/NUTCH-289
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Doug Cutting
  Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch, 
 ipInCrawlDatumDraftV5.patch

 If the CrawlDatum stored the IP address of the host of its URL, then one 
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just hostname.  This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-07 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]

Stefan Groschupf updated NUTCH-289:
---

Attachment: ipInCrawlDatumDraftV4.patch

Attached a patch that always uses only 4 bytes for the IP, which means we 
ignore IPv6. This saves us 4 bytes in each CrawlDatum for now.
I tested the resolver tool with a 200+ million entry crawldb; on average a performance 
of 500 IP lookups/sec per box is possible by using 1000 threads.

I really would love to get this into the sources as the basic version of having 
the IP address in the CrawlDatum, since I'm working on a tool set of spam 
detectors that all need IP addresses somehow.
Maybe let's exclude the tool but start with the CrawlDatum? :-?
Any improvement suggestions?
Thanks.


 CrawlDatum should store IP address
 --

  Key: NUTCH-289
  URL: http://issues.apache.org/jira/browse/NUTCH-289
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Doug Cutting
  Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch

 If the CrawlDatum stored the IP address of the host of its URL, then one 
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just hostname.  This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-302) java doc of CrawlDb is wrong

2006-06-07 Thread Stefan Groschupf (JIRA)
java doc of CrawlDb is wrong


 Key: NUTCH-302
 URL: http://issues.apache.org/jira/browse/NUTCH-302
 Project: Nutch
Type: Bug

Reporter: Stefan Groschupf
Priority: Trivial
 Fix For: 0.8-dev


CrawlDb has the same java doc as Injector. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-301) CommonGrams loads analysis.common.terms.file for each query

2006-06-07 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-301?page=all ]

Stefan Groschupf updated NUTCH-301:
---

Attachment: CommonGramsCacheV1.patch

Cache HashMap COMMON_TERMS in configuration instance.

 CommonGrams loads analysis.common.terms.file for each query
 ---

  Key: NUTCH-301
  URL: http://issues.apache.org/jira/browse/NUTCH-301
  Project: Nutch
 Type: Improvement

   Components: searcher
 Versions: 0.8-dev
 Reporter: Chris Schneider
  Attachments: CommonGramsCacheV1.patch

 The move away from static objects toward instance variables has resulted in 
 CommonGrams constructor parsing its analysis.common.terms.file for each 
 query. I'm not certain how large a performance impact this really is, but it 
 seems like something you'd want to avoid doing for each query. Perhaps the 
 solution is to keep around an instance of the CommonGrams object itself?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-07 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415171 ] 

Stefan Groschupf commented on NUTCH-293:


Any comments? There was already a posting on the nutch-agent mailing list 
where someone had banned nutch since nutch does not support Crawl-delay.
Because nutch tries to be polite, from my point of view this is a small but 
important change.
If there are no improvement suggestions, can one of the committers take care 
of that _please_? :-) 

 support for Crawl-delay in Robots.txt
 -

  Key: NUTCH-293
  URL: http://issues.apache.org/jira/browse/NUTCH-293
  Project: Nutch
 Type: Improvement

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Critical
  Attachments: crawlDelayv1.patch

 Nutch needs support for Crawl-delay as defined in robots.txt; it is not an 
 official standard but a de-facto standard.
 See:
 http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
 Webmasters are starting to block nutch since we do not support it.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



resolving IP in...

2006-06-07 Thread Stefan Groschupf

Hi,
after playing around to figure out the best place to resolve the IPs of  
freshly discovered urls, I agree with Andrzej that the  
ParseOutputFormat isn't the best place.


The problem is that ParseOutputFormat is not multithreaded, and we  
definitely need many threads for IP lookups.


I think an IP-resolving MapRunnable that preprocesses segment  
data (after fetching) before the crawldb update would be a  
better place.


+ less data to process (as opposed to processing the complete crawldb)
+ good dns cache usage, since many new urls will point to the same  
host (internal links)

- we may look up urls we already have in the crawldb.

Any thoughts?

Stefan
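
For the multithreaded lookup part, a bare-bones JDK-only sketch (this is not the
IpAddressResolver tool from the patch; host names and thread count are arbitrary):

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelIpLookup {
  public static void main(String[] args) {
    String[] hosts = { "lucene.apache.org", "www.apache.org" };  // example hosts
    ExecutorService pool = Executors.newFixedThreadPool(100);    // DNS is latency-bound, so many threads
    for (int i = 0; i < hosts.length; i++) {
      final String host = hosts[i];
      pool.execute(new Runnable() {
        public void run() {
          try {
            byte[] ip = InetAddress.getByName(host).getAddress();  // 4 bytes for IPv4
            System.out.println(host + " -> " + InetAddress.getByAddress(ip).getHostAddress());
          } catch (UnknownHostException e) {
            System.out.println(host + " -> unresolved");
          }
        }
      });
    }
    pool.shutdown();
  }
}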










[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-07 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415236 ] 

Stefan Groschupf commented on NUTCH-293:


Hi Andrzej, 
I agree, but writing a queue-based fetcher is a big step. I already have some 
basic code (nio based).
Also, I don't think that a new fetcher would be stable enough to put it 
into a .8 release. Since we plan to have a .8 release, I think it is a good idea 
for now to add this functionality. Maybe we make it configurable and switch it 
off by default?

In any case I suggest that we solve NUTCH-289 first and then get the  
fetcher done.


 support for Crawl-delay in Robots.txt
 -

  Key: NUTCH-293
  URL: http://issues.apache.org/jira/browse/NUTCH-293
  Project: Nutch
 Type: Improvement

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Critical
  Attachments: crawlDelayv1.patch

 Nutch needs support for Crawl-delay as defined in robots.txt; it is not an 
 official standard but a de-facto standard.
 See:
 http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
 Webmasters are starting to block nutch since we do not support it.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar

2006-06-06 Thread Stefan Groschupf
As far as I understand, hadoop uses commons logging. Should we switch to  
using commons logging as well?



Am 06.06.2006 um 11:02 schrieb Jérôme Charron:


URL: http://svn.apache.org/viewvc?rev=411943view=rev
Log:
Updating to Hadoop release 0.3.1.  Hadoop now uses Jakarta Commons
Logging, configured for log4j by default.


If log4j is now included in the core, we can remove the lib-log4j  
plugin.

If no objection, I will doing it.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/




[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ] 

Stefan Groschupf commented on NUTCH-258:


Scott, 
I agree with you. However, we need a clean patch to solve the problem; we can 
not just comment things out of the code.
So I vote for the issue and I vote to reopen it.

 Once Nutch logs a SEVERE log item, Nutch fails forevermore
 --

  Key: NUTCH-258
  URL: http://issues.apache.org/jira/browse/NUTCH-258
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
  Environment: All
 Reporter: Scott Ganyo
 Priority: Critical
  Attachments: dumbfix.patch

 Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. 
  This is from the run() method in Fetcher.java:
  public void run() {
    synchronized (Fetcher.this) {activeThreads++;} // count threads

    try {
      UTF8 key = new UTF8();
      CrawlDatum datum = new CrawlDatum();

      while (true) {
        if (LogFormatter.hasLoggedSevere())   // something bad happened
          break;                              // exit
   
 Notice the last 2 lines.  This will prevent Nutch from ever Fetching again 
 once this is hit as LogFormatter is storing this data as a static.
 (Also note that LogFormatter.hasLoggedSevere() is also checked in 
 org.apache.nutch.net.URLFilterChecker and will disable this class as well.)
 This must be fixed or Nutch cannot be run as any kind of long-running 
 service.  Furthermore, I believe it is a poor decision to rely on a logging 
 event to determine the state of the application - this could have any number 
 of side-effects that would be extremely difficult to track down.  (As it has 
 already for me.)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-05 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]

Stefan Groschupf updated NUTCH-289:
---

Attachment: ipInCrawlDatumDraftV1.patch

To keep the discussion alive, attached is a _first draft_ for storing the IP in the 
CrawlDatum, for public discussion.

Some notes: 
The IP is stored as a byte[] in the CrawlDatum itself, not in the meta data.
There is an IpAddressResolver MapRunnable tool to update a crawldb using 
multithreaded IP lookups.
In case an IP is available in the CrawlDatum, the Generator uses the cached IP. 

To discuss:
I don't like the idea of post-processing the complete crawldb every time after an 
update. 
Processing the crawldb is expensive in storage usage and time. 
We could have a property ipLookups with the possible values 
never|duringParsing|postUpdateDb.
Then we can also add some code to look up the IP in the ParseOutputFormat as 
discussed, or we start the IpAddressResolver as a job in the updatedb tool code.

At the moment I write the IP address bytes like this:
out.writeInt(ipAddress.length);
out.write(ipAddress); 
I think for now we can define that byte[] ipAddress is always 4 bytes long, 
or should we be IPv6 compatible already today?

Please give me some comments. I have a strong interest in getting this issue fixed 
asap and I'm willing to improve things as required. :-)
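
For comparison, the fixed-4-byte variant would simply drop the length prefix; a
toy illustration (not the actual CrawlDatum code):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Fixed-length IP serialization: always exactly 4 bytes, no length prefix.
public class IpBytes {
  private byte[] ipAddress = new byte[4];

  public void write(DataOutput out) throws IOException {
    out.write(ipAddress, 0, 4);
  }

  public void readFields(DataInput in) throws IOException {
    ipAddress = new byte[4];
    in.readFully(ipAddress);   // read the same fixed amount back
  }
}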

 CrawlDatum should store IP address
 --

  Key: NUTCH-289
  URL: http://issues.apache.org/jira/browse/NUTCH-289
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Doug Cutting
  Attachments: ipInCrawlDatumDraftV1.patch

 If the CrawlDatum stored the IP address of the host of its URL, then one 
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just hostname.  This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: [Nutch-cvs] svn commit: r411594 - /lucene/nutch/trunk/contrib/web2/plugins/build.xml

2006-06-05 Thread Stefan Groschupf



hmm... didn't think about that, are there more opinions about this?


I don't believe this "don't be evil" thing at all. I think it is just a  
question of time until google feels we are attacking the appliance server market,  
and I believe nutch has a serious chance to do so (some time in the  
far future. :-) )

Stefan


--
Sami Siren

Are you sure there is no trademark infringement here? Perhaps we  
should call it something else, just to avoid any potential legal  
unpleasantries ...









Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Stefan Groschupf
I have a proposal for a simple solution: set a flag in the current  
Configuration instance, and check for this flag. The Configuration  
instance provides a task-specific context persisting throughout the  
lifetime of a task - but limited only to that task. Voila - problem  
solved. We get rid of the dubious use of LogFormatter (I hope Chris  
that even you would agree that this pattern is slightly ..  
unusual ;) ), and we gain flexible mechanism limited in scope to  
the current task, which ensures isolation from other tasks in the  
same JVM. How about that?

Wonderful idea :-D
+ 1
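
A minimal sketch of what such a Configuration flag could look like (the property
name is invented, and this is not the committed fix):

import org.apache.hadoop.conf.Configuration;

public class SevereFlagSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // where the SEVERE event happens: record it in the task's own context
    conf.set("fetcher.severe.error", "true");

    // where the fetcher loop currently asks LogFormatter.hasLoggedSevere():
    if (conf.getBoolean("fetcher.severe.error", false)) {
      System.out.println("stop fetching - but only in this task");
    }
  }
}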




[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown

2006-06-04 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]

Stefan Groschupf updated NUTCH-298:
---

Summary: if a 404 for a robots.txt is returned a NPE is thrown  (was: if a 
404 for a robots.txt is returned no page is fetched at all from the host)

Sorry, wrong description.

 if a 404 for a robots.txt is returned a NPE is thrown
 -

  Key: NUTCH-298
  URL: http://issues.apache.org/jira/browse/NUTCH-298
  Project: Nutch
 Type: Bug

 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: fixNpeRobotRuleSet.patch

 What happens:
 If no RobotRuleSet is in the cache for a host, we try to fetch the 
 robots.txt.
 In case the http response code is not 200 or 403 but, for example, 404 we do  
 robotRules = EMPTY_RULES;  (line: 402)
 EMPTY_RULES is a RobotRuleSet created with the default constructor.
 tmpEntries and entries are null and will never be changed.
 If we now try to fetch a page from that host, the EMPTY_RULES object is used 
 and we call isAllowed on the RobotRuleSet.
 In this case an NPE is thrown in these lines:
  if (entries == null) {
 entries= new RobotsEntry[tmpEntries.size()];
 Possible solution:
 We can initialize tmpEntries by default and also remove other null checks 
 and initializations.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: search engine spam detector

2006-06-04 Thread Stefan Groschupf


The idea to have
someething like this as a nutch-module (dropping pages or ranking them
very low) might come up :-)


This will be a very long way.
I collect some thoughts and a list of web spam related papers in my  
blog.
http://www.find23.net/Web-Site/blog/521BA1CD-14C4-4E84-A072-F98E13CAEFE1.html

Feedback is welcome.


Stefan



[jira] Created: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-03 Thread Stefan Groschupf (JIRA)
if a 404 for a robots.txt is returned no page is fetched at all from the host
-

 Key: NUTCH-298
 URL: http://issues.apache.org/jira/browse/NUTCH-298
 Project: Nutch
Type: Bug

Reporter: Stefan Groschupf
 Fix For: 0.8-dev


What happens:

If no RobotRuleSet is in the cache for a host, we try to fetch the 
robots.txt.
In case the http response code is not 200 or 403 but for example 404, we do 
robotRules = EMPTY_RULES;  (line: 402)
EMPTY_RULES is a RobotRuleSet created with the default constructor.
tmpEntries and entries are null and will never be changed.
If we now try to fetch a page from that host, EMPTY_RULES is used 
and we call isAllowed on the RobotRuleSet.
In this case an NPE is thrown in this line:
 if (entries == null) {
entries= new RobotsEntry[tmpEntries.size()];

Possible solution:
We can initialize tmpEntries by default and also remove the other null checks 
and initialisations.
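
For illustration, a simplified sketch of that fix (hypothetical field names,
not the attached patch): initialize tmpEntries up front so the lazy copy into
the array can no longer hit a null list, even for a rule set created with the
default constructor.

import java.util.ArrayList;

class RuleSetFixSketch {
  private ArrayList tmpEntries = new ArrayList();  // was: left null by default
  private String[] entries = null;

  void addDisallowPrefix(String prefix) {
    tmpEntries.add(prefix);
    entries = null;   // rebuild the array on the next isAllowed() call
  }

  boolean isAllowed(String path) {
    if (entries == null) {
      // safe now even when no rule was ever added (the EMPTY_RULES case)
      entries = (String[]) tmpEntries.toArray(new String[tmpEntries.size()]);
    }
    for (int i = 0; i < entries.length; i++) {
      if (path.startsWith(entries[i])) {
        return false;
      }
    }
    return true;
  }
}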


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-03 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]

Stefan Groschupf updated NUTCH-298:
---

Attachment: fixNpeRobotRuleSet.patch

Fix for the NPE in RobotRuleSet that happens in case an empty RuleSet is used

 if a 404 for a robots.txt is returned no page is fetched at all from the host
 -

  Key: NUTCH-298
  URL: http://issues.apache.org/jira/browse/NUTCH-298
  Project: Nutch
 Type: Bug

 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: fixNpeRobotRuleSet.patch

 What happens:
 If no RobotRuleSet is in the cache for a host, we try to fetch the 
 robots.txt.
 In case the http response code is not 200 or 403 but for example 404, we do 
 robotRules = EMPTY_RULES;  (line: 402)
 EMPTY_RULES is a RobotRuleSet created with the default constructor.
 tmpEntries and entries are null and will never be changed.
 If we now try to fetch a page from that host, EMPTY_RULES is used 
 and we call isAllowed on the RobotRuleSet.
 In this case an NPE is thrown in this line:
  if (entries == null) {
 entries= new RobotsEntry[tmpEntries.size()];
 Possible solution:
 We can initialize tmpEntries by default and also remove the other null checks 
 and initialisations.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



RobotRuleSet

2006-06-03 Thread Stefan Groschupf

Hi,
just posted a fix for an NPE in case an empty RobotRuleSet is used.
The patch only contains a two-line fix, since I learned that this is the
best way to get things committed sooner. :)
However I really don't like the RobotRuleSet implementation, since
entries are copied between an ArrayList and an array for no
reason, from my point of view.

I would love to change that to just use the ArrayList.
Any thoughts?
Can I have a vote from one committer who would commit that to the
source in case I do this change? :-)
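
Roughly what I mean (a simplified sketch with made-up names, not a patch):

import java.util.ArrayList;
import java.util.List;

class ArrayListRuleSet {
  private final List disallowPrefixes = new ArrayList();  // always initialized

  void addDisallowPrefix(String prefix) {
    disallowPrefixes.add(prefix);
  }

  boolean isAllowed(String path) {
    for (int i = 0; i < disallowPrefixes.size(); i++) {
      if (path.startsWith((String) disallowPrefixes.get(i))) {
        return false;
      }
    }
    return true;   // an empty rule set allows everything
  }
}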


Thanks.
Stefan






[jira] Commented: (NUTCH-282) Showing too few results on a page (Paging not correct)

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414435 ] 

Stefan Groschupf commented on NUTCH-282:


Is that related to the host grouping we discussed? Can we close this 
bug in that case?

 Showing too few results on a page (Paging not correct)
 --

  Key: NUTCH-282
  URL: http://issues.apache.org/jira/browse/NUTCH-282
  Project: Nutch
 Type: Bug

   Components: web gui
 Versions: 0.8-dev
 Reporter: Stefan Neufeind


 I did a search and got back the itemsPerPage value from opensearch. But 
 the output shows results 1-8 and I have a total of 46 search results.
 The same happens for the web interface.
 Why aren't enough results fetched?
 The problem might be somewhere in the area where Nutch should only display 
 a certain number of websites per site.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ] 

Stefan Groschupf commented on NUTCH-286:


This is difficult to realize since the http error code is read from the response 
in the fetcher and set into the protocol status; content analysis can only be 
done during parsing. 
Also, normally such pages do not get a high OPIC score and should not be in the 
top search results. 
However this is a wrongly configured http server response, so you may want to open 
a bug in the typo3 issue tracker. 
Should we close this issue?

 Handling common error-pages as 404
 --

  Key: NUTCH-286
  URL: http://issues.apache.org/jira/browse/NUTCH-286
  Project: Nutch
 Type: Improvement

 Reporter: Stefan Neufeind


 Idea: Some pages from some software packages/scripts report an http 200 ok 
 even though a specific page could not be found. An example I just found is:
 http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
 That's a typo3 page explaining in its standard layout and wording: The 
 requested page did not exist or was inaccessible.
 So I had the idea that somebody might create a plugin that could find commonly 
 used formulations for "page does not exist" etc. and turn the page into a 404 
 before feeding it into the nutch index - although the server responded 
 with status 200 ok.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ] 

Stefan Groschupf commented on NUTCH-292:


+1, Can someone create a clean patch file?

 OpenSearchServlet: OutOfMemoryError: Java heap space
 

  Key: NUTCH-292
  URL: http://issues.apache.org/jira/browse/NUTCH-292
  Project: Nutch
 Type: Bug

   Components: web gui
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
 Priority: Critical
  Attachments: summarizer.diff

 java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
   
 org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
   org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
   
 org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
   javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
   javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 The URL I use is:
 [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url
 It seems to be a problem specific to the date I'm working with. Moving the 
 start from 0 to 10 or changing the query works fine.
 Or maybe it doesn't have to do with sorting but it's just that I hit one bad 
 search-result that has a broken summary?
 !! The problem is repeatable. So if anybody has an idea where to search / 
 what to fix, I can easily try that out !!

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-291) OpenSearchServlet should return date as well as lastModified

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414445 ] 

Stefan Groschupf commented on NUTCH-291:


lastModified will only be indexed if you switch on the index-more plugin.
If you think the way lastModified and date are stored in the index should be 
changed, please submit a patch for MoreIndexingFilter.

 OpenSearchServlet should return date as well as lastModified
 

  Key: NUTCH-291
  URL: http://issues.apache.org/jira/browse/NUTCH-291
  Project: Nutch
 Type: Improvement

   Components: web gui
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
  Attachments: NUTCH-291-unfinished.patch

 Currently lastModified is provided by OpenSearchServlet - but only in case 
 the lastModified date is known.
 Since you can sort by date (which is lastModified or, if not present, the 
 fetch date), it might be useful if OpenSearchServlet could provide date as 
 well.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ] 

Stefan Groschupf commented on NUTCH-290:


If a parser throws an exception:
Fetcher, 261:
 try {
  parse = this.parseUtil.parse(content);
  parseStatus = parse.getData().getStatus();
} catch (Exception e) {
  parseStatus = new ParseStatus(e);
}
if (!parseStatus.isSuccess()) {
  LOG.warning("Error parsing: " + key + ": " + parseStatus);
  parse = parseStatus.getEmptyParse(getConf());
}

then we use the empty parse object,
and an empty parse contains just no text, see getText:
private static class EmptyParseImpl implements Parse {

private ParseData data = null;

public EmptyParseImpl(ParseStatus status, Configuration conf) {
  data = new ParseData(status, "", new Outlink[0],
   new Metadata(), new Metadata());
  data.setConf(conf);
}

public ParseData getData() {
  return data;
}

public String getText() {
  return "";
}
  }
 So the problem should be somewhere else.

 parse-pdf: Garbage indexed when text-extraction not allowed
 ---

  Key: NUTCH-290
  URL: http://issues.apache.org/jira/browse/NUTCH-290
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
  Attachments: NUTCH-290-canExtractContent.patch

 It seems that garbage (or undecoded text?) is indexed when text-extraction 
 for a PDF is not allowed.
 Example-PDF:
 http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-287) Exception when searching with sort

2006-06-02 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-287?page=all ]
 
Stefan Groschupf closed NUTCH-287:
--

Resolution: Won't Fix

http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04696.html

 Exception when searching with sort
 --

  Key: NUTCH-287
  URL: http://issues.apache.org/jira/browse/NUTCH-287
  Project: Nutch
 Type: Bug

   Components: searcher
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
 Priority: Critical


 Running a search with sort=url works.
 But when using sort=title I get the following exception.
 2006-05-25 14:04:25 StandardWrapperValve[jsp]: Servlet.service() for servlet 
 jsp threw exception
 java.lang.RuntimeException: Unknown sort value type!
 at 
 org.apache.nutch.searcher.IndexSearcher.translateHits(IndexSearcher.java:157)
 at 
 org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95)
 at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:239)
 at org.apache.jsp.search_jsp._jspService(search_jsp.java:257)
 at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
 at 
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
 at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
 at 
 org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
 at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
 at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
 at 
 org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
 at 
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
 at java.lang.Thread.run(Thread.java:595)
 What is in those lines is:
   WritableComparable sortValue;   // convert value to writable
   if (sortField == null) {
 sortValue = new FloatWritable(scoreDocs[i].score);
   } else {
 Object raw = ((FieldDoc)scoreDocs[i]).fields[0];
 if (raw instanceof Integer) {
   sortValue = new IntWritable(((Integer)raw).intValue());
 } else if (raw instanceof Float) {
   sortValue = new FloatWritable(((Float)raw).floatValue());
 } else if (raw instanceof String) {
   sortValue = new UTF8((String)raw);
 } else {
    throw new RuntimeException("Unknown sort value type!");
 }
   }
 So I thought that maybe raw is an instance of something strange and tried 
 raw.getClass().getName() or also raw.toString() to track the cause down - but 
 that always resulted in a NullPointerException. So it seems I'm having raw 
 being null for some strange reason.
 When I try with title2 (or something non-existing) I get a different error 
 that title2 is unknown / not indexed. So I suspect that title

[jira] Closed: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-284?page=all ]
 
Stefan Groschupf closed NUTCH-284:
--

Resolution: Won't Fix

Yes, I was missing index-basic.

 NullPointerException during index
 -

  Key: NUTCH-284
  URL: http://issues.apache.org/jira/browse/NUTCH-284
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Stefan Neufeind


 For quite a few this "reduce > sort" has been going on. Then it fails. What 
 could be wrong with this?
 060524 212613 reduce > sort
 060524 212614 reduce > sort
 060524 212615 reduce > sort
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212619 Optimizing index.
 060524 212619 job_jlbhhm
 java.lang.NullPointerException
 at 
 org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
 at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
 at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
 Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
 at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
 at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12414453 ] 

Stefan Groschupf commented on NUTCH-284:


Please try to discuss such things first on the user mailing list before opening an 
issue. 
Maintaining the issue tracking is very time consuming. But if there is a bug, 
please continue to open bug reports. :)
Thanks.


 NullPointerException during index
 -

  Key: NUTCH-284
  URL: http://issues.apache.org/jira/browse/NUTCH-284
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Stefan Neufeind


 For quite a few this "reduce > sort" has been going on. Then it fails. What 
 could be wrong with this?
 060524 212613 reduce > sort
 060524 212614 reduce > sort
 060524 212615 reduce > sort
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212619 Optimizing index.
 060524 212619 job_jlbhhm
 java.lang.NullPointerException
 at 
 org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
 at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
 at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
 Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
 at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
 at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-281) cached.jsp: base-href needs to be outside comments

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-281?page=comments#action_12414454 ] 

Stefan Groschupf commented on NUTCH-281:


Can you submit a patch file?

 cached.jsp: base-href needs to be outside comments
 --

  Key: NUTCH-281
  URL: http://issues.apache.org/jira/browse/NUTCH-281
  Project: Nutch
 Type: Bug

   Components: web gui
 Reporter: Stefan Neufeind
 Priority: Trivial


 see cached.jsp:
 <base href="...">
 does not take effect when showing a cached page because of the comments 
 around it

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-274) Empty row in/at end of URL-list results in error

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ] 

Stefan Groschupf commented on NUTCH-274:


Should we fix this in Hadoop's TextInputFormat to ignore empty lines, or in the 
Injector?
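
For the Injector-side option, roughly something like this (hypothetical
helper, not a patch): skip blank lines before handing the URL to the
normalizer and filters.

public class SkipEmptySeedLines {

  /** Returns null for seed lines the injector should ignore. */
  static String cleanSeedLine(String line) {
    if (line == null) {
      return null;
    }
    String url = line.trim();
    if (url.length() == 0 || url.startsWith("#")) {
      return null;   // empty line or comment: nothing to inject
    }
    return url;
  }

  public static void main(String[] args) {
    System.out.println(cleanSeedLine("http://www.bild.de/"));  // kept as-is
    System.out.println(cleanSeedLine("   "));                  // null, skipped
  }
}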

 Empty row in/at end of URL-list results in error
 

  Key: NUTCH-274
  URL: http://issues.apache.org/jira/browse/NUTCH-274
  Project: Nutch
 Type: Bug

 Versions: 0.8-dev
  Environment: nightly-2006-05-20
 Reporter: Stefan Neufeind
 Priority: Minor


 This is minor - but it's a little unclean :-)
 Reproduce: Have a URL-file with one URL followed by a newline, thus producing 
 an empty line.
 Outcome: Fetcher threads try to fetch two URLs at the same time. The first one is 
 fine - but the second is empty and therefore fails proper protocol detection.
 60521 022639   Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
 060521 022639   Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
 060521 022639 found resource parse-plugins.xml at 
 file:/home/mm/nutch-nightly/conf/parse-plugins.xml
 060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
 060521 022639 fetching http://www.bild.de/
 060521 022639 fetching 
 060521 022639 fetch of  failed with: 
 org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: 
 no protocol: 
 060521 022639 http.proxy.host = null
 060521 022639 http.proxy.port = 8080
 060521 022639 http.timeout = 1
 060521 022639 http.content.limit = 65536
 060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; 
 http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
 060521 022639 fetcher.server.delay = 1000
 060521 022639 http.max.delays = 1000
 060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser 
 mapped to contentType text/xml via parse-plugins.xml, but
  its plugin.xml file does not claim to support contentType: text/xml
 060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser 
 mapped to contentType text/xml via parse-plugins.xml, but
  its plugin.xml file does not claim to support contentType: text/xml
 060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser 
 mapped to contentType text/xml via parse-plugins.xml, but 
 not enabled via plugin.includes in nutch-default.xml
 060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
 060521 022640  map 0%  reduce 0%
 060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 
 060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ] 

Stefan Groschupf commented on NUTCH-290:


As far as I understand the code, the next parser is only used if the previous 
parser returns an unsuccessful parsing status. If a parser throws an 
exception, this exception is not caught in ParseUtil at all.
So the pdf parser should throw an exception and not report an unsuccessful 
status to solve this problem, shouldn't it?


 parse-pdf: Garbage indexed when text-extraction not allowed
 ---

  Key: NUTCH-290
  URL: http://issues.apache.org/jira/browse/NUTCH-290
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
  Attachments: NUTCH-290-canExtractContent.patch

 It seems that garbage (or undecoded text?) is indexed when text-extraction 
 for a PDF is not allowed.
 Example-PDF:
 http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-286?page=all ]
 
Stefan Groschupf closed NUTCH-286:
--

Resolution: Won't Fix

I hope everybody agrees with the statement: we cannot detect http response 
codes based on the returned html content.
Pruning the index is a good way to solve the problem.

 Handling common error-pages as 404
 --

  Key: NUTCH-286
  URL: http://issues.apache.org/jira/browse/NUTCH-286
  Project: Nutch
 Type: Improvement

 Reporter: Stefan Neufeind


 Idea: Some pages from some software packages/scripts report an http 200 ok 
 even though a specific page could not be found. An example I just found is:
 http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
 That's a typo3 page explaining in its standard layout and wording: The 
 requested page did not exist or was inaccessible.
 So I had the idea that somebody might create a plugin that could find commonly 
 used formulations for "page does not exist" etc. and turn the page into a 404 
 before feeding it into the nutch index - although the server responded 
 with status 200 ok.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-01 Thread Stefan Groschupf (JIRA)
support for Crawl-delay in Robots.txt
-

 Key: NUTCH-293
 URL: http://issues.apache.org/jira/browse/NUTCH-293
 Project: Nutch
Type: Improvement

  Components: fetcher  
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical


Nutch needs support for Crawl-delay as defined in robots.txt; it is not an official 
standard but a de-facto standard.
See:
http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
Webmasters are starting to block Nutch since we do not support it.
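
For illustration only (this is not the attached patch): turning a Crawl-delay
line from robots.txt into a per-host fetch delay in milliseconds.

public class CrawlDelaySketch {

  /** Returns the delay in ms, or -1 if the line is not a Crawl-delay line. */
  static long parseCrawlDelay(String line) {
    String l = line.trim().toLowerCase();
    if (!l.startsWith("crawl-delay:")) {
      return -1;
    }
    try {
      float seconds = Float.parseFloat(l.substring("crawl-delay:".length()).trim());
      return (long) (seconds * 1000);
    } catch (NumberFormatException e) {
      return -1;   // malformed value: ignore the directive
    }
  }

  public static void main(String[] args) {
    System.out.println(parseCrawlDelay("Crawl-delay: 5"));     // 5000
    System.out.println(parseCrawlDelay("Disallow: /cgi-bin")); // -1
  }
}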

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-01 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-293?page=all ]

Stefan Groschupf updated NUTCH-293:
---

Attachment: crawlDelayv1.patch

A first draft of Crawl-delay support for Nutch. The problem I see is that in 
case IP-based delay is configured, it can happen that we use the crawl delay of 
one host for another host running on the same IP.
Feedback is welcome.

 support for Crawl-delay in Robots.txt
 -

  Key: NUTCH-293
  URL: http://issues.apache.org/jira/browse/NUTCH-293
  Project: Nutch
 Type: Improvement

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Critical
  Attachments: crawlDelayv1.patch

 Nutch needs support for Crawl-delay as defined in robots.txt; it is not an 
 official standard but a de-facto standard.
 See:
 http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
 Webmasters are starting to block Nutch since we do not support it.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: JVM error while parsing

2006-05-30 Thread Stefan Groschupf

Hi,
I heard there is a bug in JVM 1.5_06 beta; can you try an older or
maybe a 1.4 JVM and report if this happens with another JVM as well?

Thanks,
Stefan

On 30.05.2006, at 14:14, Uygar Yüzsüren wrote:


Hi everyone,

I am using Hadoop-0.2.0 and Nutch-0.8, and at the moment trying to  
complete

a 1-depth-crawl
by using DFS and mapreduce structures. However, after a fetch step, I
encounter the below JVM exception
at one or more task trackers at the parsing step. It does not  
differ whether

I use only the default parsers,
or I also use the additional ones (pdf excel etc.). My task  
trackers work on

AMD X2 64-bit machines
and my JVM version is 1.5_06.

Have you ever faced such a problem at the parse stage? Or how
do you

think I can spot the cause of
this JVM exception? The error report is:

060530 144113 task_0007_m_10_0  Using Signature impl:
org.apache.nutch.crawl.MD5Signature
060530 144113 task_0007_m_10_0
5.0391704E-6%/crawl/segments/20060521171305/content/part-4/data: 
0+12303612

060530 144114 task_0007_m_10_0  Using URL normalizer:
org.apache.nutch.net.BasicUrlNormalizer
060530 144114 task_0007_m_07_0
0.084114%/crawl/segments/20060521171305/content/part-00011/data:0 
+12493176

060530 144115 task_0007_m_07_0
0.09551566%/crawl/segments/20060521171305/content/part-00011/data:0 
+12493176

060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # An unexpected error has been  
detected

by HotSpot Virtual Machine:
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 #  SIGSEGV (0xb) at
pc=0x003d1d247c10, pid=25093, tid=182894086496
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # Java VM: Java HotSpot(TM) 64- 
Bit Server

VM (1.5.0_06-b05 mixed mode)
060530 144115 task_0007_m_07_0 # Problematic frame:
060530 144115 task_0007_m_07_0 # C  [libc.so.6+0x47c10]
printf_size+0x740
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # An error report file with more
information is saved as hs_err_pid25093.log
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # If you would like to submit a bug
report, please visit:
060530 144115 task_0007_m_07_0 #
http://java.sun.com/webapps/bugreport/crash.jsp
060530 144115 task_0007_m_07_0 #
060530 144115 Server connection on port 51950 from 192.168.15.61:  
exiting

060530 144115 task_0007_m_07_0 Child Error
java.io.IOException: Task process exit with nonzero status of 134.
   at org.apache.hadoop.mapred.TaskRunner.runChild 
(TaskRunner.java:242)

   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)


Thank you very much.




Re: Extract infos from documents and query external sites

2006-05-30 Thread Stefan Groschupf

Think about using the Google API.

However the way to go could be:

+ fetch your pages
+ do not parse the pages
+ write a map-reduce job that extracts your data
++ make an xhtml dom from the html, e.g. using neko
++ use xpath queries to extract your data (see the small sketch below)
++ also check out GATE as a named entity extraction tool to extract
names based on patterns and heuristics.

++ write the names to a file.

+ build your query urls
+ inject the query urls into an empty crawl db
+ create a segment, fetch it, and update the segment against a second
empty crawl database

+ remove the first segment and db
+ create a segment with your second db and fetch it.
Your second segment will only contain the paper pages.
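
To make the neko/xpath step a bit more concrete, a rough standalone sketch
(the file name and the xpath expression are made up, and in the real setup
this would run inside the map-reduce job over the fetched content):

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ExtractWithXPath {
  public static void main(String[] args) throws Exception {
    // neko turns (possibly broken) html into a DOM; element names come out uppercased
    DOMParser parser = new DOMParser();
    parser.parse(new InputSource("fetched-page.html"));  // hypothetical local copy of a fetched page
    Document doc = parser.getDocument();

    XPath xpath = XPathFactory.newInstance().newXPath();
    // hypothetical expression - the real one depends on the page layout
    NodeList authors = (NodeList) xpath.evaluate(
        "//SPAN[@class='author']/text()", doc, XPathConstants.NODESET);
    for (int i = 0; i < authors.getLength(); i++) {
      // in the real job these names would be written to a file for building the query urls
      System.out.println(authors.item(i).getNodeValue());
    }
  }
}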

HTH
Stefan




On 30.05.2006, at 12:14, HellSpawn wrote:



I'm working on a search engine for my university and they want me  
to do that

to create a repository of scientific articles on the web :D

I read something about xpath for extracting exact parts from a
document; once
done this, building the query is very easy, but my doubts are about
how to

insert all of this in the nutch crawler...

Thank you
--
View this message in context: http://www.nabble.com/Extract+infos+from+documents+and+query+external+sites-t1675003.html#a4624272

Sent from the Nutch - Dev forum at Nabble.com.





