[jira] Resolved: (NUTCH-315) CrawlDbReader usage text - implementation mismatch

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-315?page=all ] Sami Siren resolved NUTCH-315. Resolution: Duplicate (duplicate of NUTCH-318)

[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423546 ] Sami Siren commented on NUTCH-318: I agree :) So the next thing to do is change readdb -stats to print to stdout; I'll go ahead and do that. Are there any other c

[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-25 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423542 ] Andrzej Bialecki commented on NUTCH-318: I also think that producing no output on the console is confusing to new users, especially in the "local" mode. I

[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-25 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423539 ] Stefan Groschupf commented on NUTCH-318: Yes, this happens only in a distributed environment. Please also see my last mail on the hadoop dev list. I think th

[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423531 ] Sami Siren commented on NUTCH-318: Perhaps this is happening in a distributed setup? In a single-machine setup the output goes to the log file; see NUTCH-315.

RE: How can i get a page content or parse data by the page's url

2006-07-25 Thread Aaron Tang
Is there any Nutch API that can do this? -----Original Message----- From: Lourival Júnior [mailto:[EMAIL PROTECTED]] Sent: Wednesday, July 26, 2006 1:41 AM To: nutch-dev@lucene.apache.org Subject: Re: How can i get a page content or parse data by the page's url If I'm not wrong you can't do this. The

[jira] Updated: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-07-25 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-258?page=all ] Chris A. Mattmann updated NUTCH-258: Fix Version/s: 0.8-dev

[jira] Updated: (NUTCH-330) command line tool to search a Lucene index

2006-07-25 Thread Renaud Richardet (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-330?page=all ] Renaud Richardet updated NUTCH-330: Attachment: clSearch.diff. Forgot the "echo" in sh...

[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-07-25 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ] Stefan Groschupf commented on NUTCH-233: I think this should be fixed in 0.8 too, since everybody who does a real whole-web crawl with over 100 million pages wi

[jira] Updated: (NUTCH-330) command line tool to search a Lucene index

2006-07-25 Thread Renaud Richardet (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-330?page=all ] Renaud Richardet updated NUTCH-330: Attachment: clSearch.diff. Unified diff against head.

[jira] Created: (NUTCH-330) command line tool to search a Lucene index

2006-07-25 Thread Renaud Richardet (JIRA)
command line tool to search a Lucene index. Key: NUTCH-330; URL: http://issues.apache.org/jira/browse/NUTCH-330; Project: Nutch; Issue Type: Improvement; Components: searcher; Affects Versio

[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-25 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423433 ] Stefan Groschupf commented on NUTCH-318: Shouldn't that be fixed in 0.8, since as of today this tool just produces no output?!

[jira] Updated: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-325?page=all ] Sami Siren updated NUTCH-325: Fix Version/s: 0.9-dev (was: 0.8-dev)

[jira] Updated: (NUTCH-247) robot parser to restrict.

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-247?page=all ] Sami Siren updated NUTCH-247: Fix Version/s: 0.9-dev (was: 0.8-dev)

[jira] Updated: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=all ] Sami Siren updated NUTCH-233: Fix Version/s: 0.9-dev (was: 0.8-dev)

[jira] Updated: (NUTCH-310) Review Log Levels

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-310?page=all ] Sami Siren updated NUTCH-310: Fix Version/s: 0.9-dev (was: 0.8-dev)

[jira] Updated: (NUTCH-262) Summary excerpts and highlights problems

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-262?page=all ] Sami Siren updated NUTCH-262: Fix Version/s: 0.9-dev (was: 0.8-dev)

[jira] Updated: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ] Sami Siren updated NUTCH-322: Fix Version/s: 0.9-dev (was: 0.8-dev)

[jira] Updated: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-318?page=all ] Sami Siren updated NUTCH-318: Fix Version/s: 0.9-dev (was: 0.8-dev)

[jira] Updated: (NUTCH-251) Administration GUI

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-251?page=all ] Sami Siren updated NUTCH-251: Fix Version/s: 0.9-dev (was: 0.8-dev)

[jira] Updated: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-246?page=all ] Sami Siren updated NUTCH-246: Fix Version/s: 0.9-dev (was: 0.8-dev)

[jira] Updated: (NUTCH-74) French Analyzer Plugin

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-74?page=all ] Sami Siren updated NUTCH-74: Fix Version/s: 0.9-dev (was: 0.8-dev)

[jira] Updated: (NUTCH-86) LanguageIdentifier API enhancements

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-86?page=all ] Sami Siren updated NUTCH-86: Fix Version/s: 0.9-dev (was: 0.8-dev)

[jira] Updated: (NUTCH-249) black- white list url filtering

2006-07-25 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-249?page=all ] Sami Siren updated NUTCH-249: Fix Version/s: 0.9-dev (was: 0.8-dev)

Limiting Results By Domain

2006-07-25 Thread Robert Sanford
I'm interested in a plugin to filter results so that they are limited to a collection of domains that are specified by the user at the time of the search. If such a filter does not currently exist I'm willing to work on one if someone is willing to point me in the right direction. rjsjr
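A minimal sketch of the kind of domain check described in the message above, in plain Java. It is not a Nutch plugin and uses no Nutch API: the class name DomainFilter, the method names, and the sample hosts are all hypothetical, and where such a check would hook into the search front end is left open.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, NOT part of Nutch: keeps only the result URLs whose
// host is one of the user-selected domains or a subdomain of one of them.
public class DomainFilter {

    // True if the host equals an allowed domain or ends with "." + domain,
    // so "www.example.com" matches "example.com" but "badexample.com" does not.
    public static boolean hostAllowed(String host, Set<String> allowedDomains) {
        if (host == null) {
            return false;
        }
        String h = host.toLowerCase(Locale.ROOT);
        for (String domain : allowedDomains) {
            String d = domain.toLowerCase(Locale.ROOT);
            if (h.equals(d) || h.endsWith("." + d)) {
                return true;
            }
        }
        return false;
    }

    // Returns the URLs whose host passes hostAllowed, preserving input order.
    // URLs that cannot be parsed, or that have no host, are dropped.
    public static List<String> filter(List<String> urls, Set<String> allowedDomains) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            try {
                String host = URI.create(url).getHost();
                if (hostAllowed(host, allowedDomains)) {
                    kept.add(url);
                }
            } catch (IllegalArgumentException e) {
                // Malformed URL: treat it as not matching any domain.
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample data for illustration only.
        Set<String> allowed = new HashSet<>(Arrays.asList("example.com"));
        List<String> urls = Arrays.asList(
                "http://www.example.com/a",
                "http://other.org/b");
        System.out.println(filter(urls, allowed));
    }
}
```

Matching on a "." boundary rather than a bare suffix is a deliberate choice here, so that a selected domain does not accidentally admit unrelated hosts that merely end with the same characters.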

Re: How can i get a page content or parse data by the page's url

2006-07-25 Thread Lourival Júnior
If I'm not wrong you can't do this. The segread command only accepts these arguments: SegmentReader [-fix] [-dump] [-dumpsort] [-list] [-nocontent] [-noparsedata] [-noparsetext] (-dir segments | seg1 seg2 ...) NOTE: at least one segment dir name is required, or '-dir' option. -fix

How can i get a page content or parse data by the page's url

2006-07-25 Thread Aaron Tang
Hi all, how can I get a page's content or parse data by the page's URL? Something like the command: $ bin/nutch segread crawl/segments/20060725213636/ -dump which dumps the pages in the segment. I'm using Nutch 0.7.2 on Cygwin under Windows XP. Thanks! Aaron

Re: Scanning the database

2006-07-25 Thread Stefan Neufeind
Robert Sanford wrote: > Running Nutch 0.7.2 but I'm willing to move up to 0.8 if need be. > > I have created an "Intranet" crawl using the file containing a list of > URIs and the list of regex to allow in conf/crawl-urlfilter.txt. Using > search.jsp I get lots and lots of good results so I'm quit

Indexing href attribute in links.

2006-07-25 Thread Robert Sanford
I am currently running Nutch 0.7.2 under JBoss 4.0.1 using Java 1.5.0_01 for Win32. The index runs were created under Cygwin. What I have found so far is that Nutch will not index keywords within the href attribute of an anchor tag, and I want Nutch to do so. I provide a co-branding service for cu

Scanning the database

2006-07-25 Thread Robert Sanford
Running Nutch 0.7.2 but I'm willing to move up to 0.8 if need be. I have created an "Intranet" crawl using the file containing a list of URIs and the list of regex to allow in conf/crawl-urlfilter.txt. Using search.jsp I get lots and lots of good results so I'm quite happy so far. But, I want to

Re: 0.8 release

2006-07-25 Thread Andrzej Bialecki
Sami Siren wrote: There is a package available for testing at http://people.apache.org/~siren/nutch-0.8/ - please give it some testing and post your opinion: is it good enough to be a public release? I have some doubts because of NUTCH-266, but so far only 3 people have reported this to b

Re: 0.8 release

2006-07-25 Thread Sami Siren
There is a package available for testing at http://people.apache.org/~siren/nutch-0.8/ - please give it some testing and post your opinion: is it good enough to be a public release? I have some doubts because of NUTCH-266, but so far only 3 people have reported this to be a problem (me incl