[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-09 Thread Jerome Charron (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12323007 ] 

Jerome Charron commented on NUTCH-88:
-

Dawid,
Thanks for your pointers on IE MimeType resolution. We have in Nutch a MimeType 
resolver based on both file extensions and file magic sequences to find the 
content-type of a file. It is actually underused, and perhaps some enhancements 
should be added, such as content-type mapping: allow a content-type to be mapped 
to a normalized one (i.e. mapping, for instance, application/powerpoint to 
application/vnd.ms-powerpoint, so that only the normalized version needs to be 
registered in the plugin.xml file).
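
Just to illustrate the idea, here is a minimal sketch of such a mapping (the 
class, method, and alias entries below are hypothetical examples, not existing 
Nutch code):

import java.util.HashMap;
import java.util.Map;

public class ContentTypeNormalizer {

  // alias content-type -> normalized content-type; only the normalized
  // form would need to be registered in a parser's plugin.xml
  private final Map<String, String> aliases = new HashMap<String, String>();

  public ContentTypeNormalizer() {
    aliases.put("application/powerpoint", "application/vnd.ms-powerpoint");
    // further aliases would be loaded from the mime-type configuration
  }

  /** Returns the normalized content-type, or the input itself if no alias is known. */
  public String normalize(String contentType) {
    String key = contentType.toLowerCase();
    String normalized = aliases.get(key);
    return (normalized != null) ? normalized : key;
  }
}

The ParserFactory would then call normalize() before looking up a plugin, so 
plugin.xml files only ever declare the normalized types.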

Chris,
Thanks in advance for your future work. Could you please synchronize your 
efforts with Sébastien, since he seems very interested in contributing.

Andrzej,
The way to express a preference of one plugin over another, if both support the 
same content type, is to activate the plugin you want to handle that content 
type and deactivate the other ones.
No?

Note: Since the MimeResolver handles associations between file extensions and 
content-types, the path-suffix in plugin.xml (and in the ParserFactory policy for 
choosing a Parser) could certainly be removed, in order to have only one central 
point for storing this knowledge.
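
As an illustration, a single central registry could look roughly like this 
(a hypothetical helper, not the actual MimeResolver/MimeTypes API):

import java.util.HashMap;
import java.util.Map;

public class ExtensionTypeRegistry {

  // file extension -> content-type, kept in one central place
  private final Map<String, String> byExtension = new HashMap<String, String>();

  public ExtensionTypeRegistry() {
    byExtension.put("pdf", "application/pdf");
    byExtension.put("doc", "application/msword");
  }

  /** Returns the content-type guessed from the url's path suffix, or null if unknown. */
  public String forPath(String urlPath) {
    int dot = urlPath.lastIndexOf('.');
    if (dot < 0 || dot == urlPath.length() - 1) {
      return null;
    }
    return byExtension.get(urlPath.substring(dot + 1).toLowerCase());
  }
}

With something like this, the ParserFactory would only match on content-types, 
and the extension knowledge would no longer be duplicated in pathSuffix 
attributes.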

 Enhance ParserFactory plugin selection policy
 -

  Key: NUTCH-88
  URL: http://issues.apache.org/jira/browse/NUTCH-88
  Project: Nutch
 Type: Improvement
   Components: indexer
 Versions: 0.7, 0.8-dev
 Reporter: Jerome Charron
  Fix For: 0.8-dev


 The ParserFactory chooses the Parser plugin to use based on the content-types 
 and path-suffix defined in each parser's plugin.xml file.
 The selection policy is as follows:
 Content type has priority: the first plugin found whose contentType 
 attribute matches the beginning of the content's type is used. 
 If none match, then the first whose pathSuffix attribute matches the end of 
 the url's path is used.
 If neither of these match, then the first plugin whose pathSuffix is the 
 empty string is used.
 This policy has a lot of problems when no match is found, because a random 
 parser is used (and it is very likely this parser can't handle the 
 content).
 On the other hand, the content-type associated with a parser plugin is 
 specified in the plugin.xml of each plugin (this is the value used by the 
 ParserFactory), AND each parser also checks in its own code whether the 
 content-type is ok (it uses a hard-coded content-type value, not the value 
 specified in plugin.xml), so mismatches between the hard-coded content-type 
 and the content-type declared in plugin.xml are possible.
 A complete list of problems and discussion about this point is available in:
   * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
   * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
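
For readers following the discussion, here is a rough sketch of the selection 
policy described above (the types and field names are hypothetical; this is 
not the actual ParserFactory code):

import java.util.List;

public class ParserSelector {

  // hypothetical holder for the contentType / pathSuffix attributes
  // declared in a parser's plugin.xml
  static class PluginDescriptor {
    String contentType;  // e.g. "text/html"
    String pathSuffix;   // e.g. "html", or "" for the catch-all entry
  }

  /** Returns the first matching plugin descriptor, or null if nothing matches. */
  public PluginDescriptor select(List<PluginDescriptor> plugins,
                                 String contentType, String urlPath) {
    // 1. Content type has priority: first plugin whose contentType attribute
    //    matches the beginning of the content's type.
    for (PluginDescriptor p : plugins) {
      if (p.contentType != null && p.contentType.length() > 0
          && contentType.startsWith(p.contentType)) {
        return p;
      }
    }
    // 2. Otherwise, first plugin whose pathSuffix matches the end of the url's path.
    for (PluginDescriptor p : plugins) {
      if (p.pathSuffix != null && p.pathSuffix.length() > 0
          && urlPath.endsWith(p.pathSuffix)) {
        return p;
      }
    }
    // 3. Otherwise, first plugin whose pathSuffix is the empty string.
    for (PluginDescriptor p : plugins) {
      if ("".equals(p.pathSuffix)) {
        return p;
      }
    }
    // This is where the reported problem shows up: with no match at all,
    // the content ends up being handed to a parser that can't handle it.
    return null;
  }
}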

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: [jira] Created: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-09 Thread Jérôme Charron
 Jerome: Give me a shout if you need a hand on this. I'll be happy to
 help and as it happens, I'll be available in the next few weeks.

Sébastien,
Great! As I mentioned in my last comment on JIRA, please synchronize with 
Chris on this point.
I'm currently coding on other subjects and don't have time to code on this 
issue.
But I can participate in the discussion and I'm ok to review the proposal.

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/


[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-09 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12323009 ] 

Dawid Weiss commented on NUTCH-88:
--

Yep, I know about the byte-magic mime detector. I'm just pointing out that 
Internet Explorer doesn't use it... or at least, it doesn't always use it the 
way you would expect it to. Whether Nutch should mimic IE in this behaviour is 
another question.




bug in bin/nutch?

2005-09-09 Thread Earl Cahill
Trying to get mapred stuff to work, and I find it hard
to believe that this is a bug, but just trying to go
through the tutorial, I enter

bin/nutch admin db -create

and get

Exception in thread "main"
java.lang.NoClassDefFoundError: admin

Looking through bin/nutch, sure enough there isn't a
chunk for admin.  But there is in trunk.  If I add it
back in as per my patch below, then it seems to work.

But that sure seems like it would be broken for every
person that walks through the tutorial on mapred.

Earl

 ~/nutch/branches/mapred $ svn diff bin/nutch
Index: bin/nutch
===================================================================
--- bin/nutch   (revision 279726)
+++ bin/nutch   (working copy)
@@ -124,6 +124,8 @@
 # figure out which class to run
 if [ "$COMMAND" = "crawl" ] ; then
   CLASS=org.apache.nutch.crawl.Crawl
+elif [ "$COMMAND" = "admin" ] ; then
+  CLASS=org.apache.nutch.tools.WebDBAdminTool
 elif [ "$COMMAND" = "inject" ] ; then
   CLASS=org.apache.nutch.crawl.Injector
 elif [ "$COMMAND" = "generate" ] ; then




tutorial suggestion

2005-09-09 Thread Earl Cahill
Walking through the tutorial

http://lucene.apache.org/nutch/tutorial.html

and just a little suggestion.  For the 

s1=`ls -d segments/2* | tail -1`
s2=`ls -d segments/2* | tail -1`
s3=`ls -d segments/2* | tail -1`

I suggest using \ls just in case users have an alias
like

alias ls='ls -lFa'

like me.  Such an alias, without the \ls, means that

echo $s1

gives something like

drwxr-xr-x 8 nutch nutch 4096 Sep 9 03:08
segments/20050909030535/

which isn't going to work so hot.

Yeah, kind of dumb, I know, but pretty well any ls
alias would break it.  Only took me a couple minutes
to figure out, but I don't see a reason to not have
\ls.

Thanks,
Earl



Re: bug in bin/nutch?

2005-09-09 Thread Earl Cahill
 The DB format in the mapred branch is completely different. So, what you
 create with admin db -create is the old DB format, not used in the
 mapred branch.

 Please study the code of the Crawl command, this should help... Mapred
 stuff is powerful, but it is also very different from the current way of
 doing things, so there will be a lot to learn...

Guess I figured as much.  Can I suggest that someone
typing

bin/nutch admin ...

in the mapred branch should get pointed to the
proper command, or at least get a message saying that
admin doesn't exist in the mapred branch, just to save
some confusion.  There is a dumb patch below that
would change the usage line.

I think such differences are all the more reason to
have a nice mapred tutorial, which I would be more
than willing to help with.  I thought I was close, but
I have yet to get a mapred crawl/index/search
completed.  Your comment makes me think I am still
a ways off.

Thanks,
Earl

Index: bin/nutch
===================================================================
--- bin/nutch   (revision 279734)
+++ bin/nutch   (working copy)
@@ -29,7 +29,7 @@
   echo "Usage: nutch COMMAND"
   echo "where COMMAND is one of:"
   echo "  crawl             one-step crawler for intranets"
-  echo "  admin             database administration, including creation"
+  echo "  admin             not used in mapred"
   echo "  inject            inject new urls into the database"
   echo "  generate          generate new segments to fetch"
   echo "  fetch             fetch a segment's pages"




Re: bug in bin/nutch?

2005-09-09 Thread Andrzej Bialecki

Earl Cahill wrote:


Guess I figured as much.  Can I suggest that someone
typing 


bin/nutch admin ...

in the mappred branch, should get pointed to the
proper command, or at least a message saying that


There is no separate command - for now the DB is created when you run 
Injector or Crawl (which calls Injector as the first step). Other 
commands from the script should work very similarly, even though they 
now use different implementations:


* inject - runs Injector to add urls from a plaintext file (one url per 
line, there may be many input files, and they must be placed inside a 
directory). This creates the CrawlDB in the destination directory if it 
didn't exist before, or updates the existing one. Note that the new 
CrawlDB does NOT contain links - they are stored separately in a LinkDB, 
and CrawlDB just stores the equivalents of Page in the former WebDB.


* generate - runs Generate to create new fetchlists to be fetched

* fetch - runs the modified Fetcher to fetch segments

* updatedb - runs CrawlDB.update() to update the CrawlDB with new page 
information, and to add new unfetched pages.


* invertlinks - creates or updates a LinkDB, containing incoming link 
information. Note that it takes as an argument the top level dir, where 
the new segments are contained, and not the dir names of segments...


* index - runs the new modified Indexer to create an index of the 
fetched segments.


The above commands read the mapred configuration, and for now it 
defaults to local, which means that all Jobs execute within the same 
JVM, and NDFS also defaults to local. The rest of the commands in 
bin/nutch have to do with a distributed setup.



admin doesn't exist in the mapred branch, just to save
some confusion.  There is a dumb patch below that
would change the usage line.

I think such differences are all the more reason to
have a nice mapred tutorial, which I would be more
than willing to help with.  I thought I was close, but


Yes, I agree. But there are still some command-line tools missing, or 
not yet ported to use mapred. At this point a general tutorial would be 
difficult... unless it would simply be "you need to run ./nutch crawl ..."


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: nutch 0.7 bug?

2005-09-09 Thread [EMAIL PROTECTED]

Hi Michael,

I am going back to a nightly build.
I think this problem is related to the 'fetcher.threads.per.host' value, 
when it is bigger than 1.
There are other possible sources: fetcher.threads.fetch, 
fetcher.threads.per.host, or parser.threads.parse.


Best Regards,
   Ferenc


Hi Ferenc,

I see the same errors. As I saw a running installation yesterday, 
I think it's a configuration mistake. So far I have no idea where. 
Have you made any progress?


Regards

Michael


[EMAIL PROTECTED] wrote:


Dear Developers!

I tested nutch 0.7 with all the parser plugins, and found the 
following:


----------------------------------------------------------------------

The fetch was broken by, e.g., the following:
----------------------------------------------------------------------

050901 110915 fetch okay, but can't parse 
http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc, 
reason: failed
(2,200): org.apache.nutch.parse.msword.FastSavedException: Fast-saved 
files are unsupported at this time

050901 110915 fetching http://en.mimi.hu/fishing/scad.html
050901 110917 SEVERE error writing output:java.lang.NullPointerException
java.lang.NullPointerException
   at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
   at 
org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)

   at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
   at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
   at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281) 

   at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261) 

   at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110917 SEVERE error writing output:java.io.IOException: key 
out of order: 319 after 319

java.io.IOException: key out of order: 319 after 319
   at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
   at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
   at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
   at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281) 

   at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261) 

   at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Exception in thread "main" java.lang.RuntimeException: SEVERE error 
logged.  Exiting fetcher.

   at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
   at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
050901 110921 SEVERE error writing output:java.io.IOException: key 
out of order: 319 after 319

java.io.IOException: key out of order: 319 after 319
   at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
   at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
   at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
   at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281) 

   at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261) 

   at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key 
out of order: 319 after 319

java.io.IOException: key out of order: 319 after 319
   at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
   at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
   at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
   at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281) 

   at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261) 

   at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key 
out of order: 319 after 319

java.io.IOException: key out of order: 319 after 319
   at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
   at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
   at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
etc.

----------------------------------------------------------------------

These are the differences between nutch-site.xml and nutch-default.xml:
----------------------------------------------------------------------


* nutch-default.xml
 <name>http.timeout</name>
 <value>1</value>
 <description>The default network timeout, in milliseconds.</description>

* NUTCH-SITE.XML
 <name>http.timeout</name>
 <value>3</value>
 <description>The default network timeout, in milliseconds.</description>

*

* nutch-default.xml
 <name>http.max.delays</name>
 <value>3</value>
 <description>The number of times a thread will delay when trying to
* NUTCH-SITE.XML
 <name>http.max.delays</name>
 <value>6</value>
 <description>The number of times a thread will delay when trying to
*


Re: fetch performance

2005-09-09 Thread Andrzej Bialecki

AJ wrote:
I tried to run 10 cycles of fetch/updatedb.  In the 3rd cycle, the fetch 
list had 8810 urls.  Fetch ran pretty fast on my laptop before 4000 
pages were fetched. After 4000 pages, it suddenly switched to a very slow 
speed, about 30 mins for just 100 pages.  My laptop also started to run 
at 100% CPU all the time. Is there a threshold for fetch list size, 
above which fetch performance will be degraded? Or was it because of my 
laptop? I know the -topN option can control the fetch size. But topN=4000 
seems too small because it will end up with thousands of segments.  Is there 
a good rule of thumb for the topN setting?


A related question is how big a segment should be in order to keep the 
number of segments small without hurting fetch performance too much. For 
example, to crawl 1 million pages in one run (over many fetch cycles), 
what would be a good limit for each fetch list?


There are no artificial limits like that - I'm routinely fetching 
segments of 1 mln pages. Most likely what happened to you is that:

* you are using a Nutch version with PDFBox 0.7.1 or below

* you fetched a rare kind of PDF, which puts PDFBox into a tight loop

* the thread that got stuck is consuming 99% of your CPU. :-)

Solution: upgrade PDFBox to the yet-unreleased 0.7.2.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com