Re: Next Nutch release

2007-01-18 Thread Stefan Groschupf
Hi, I just finished reading all source code about nutch gui. And personally I don't like putting a lot of code snippets into jsp files since it takes a lot of time when refactoring. So how about adopting velocity/freemarker with servlets? In general I agree it is the view layer and should

Re: Next Nutch release

2007-01-18 Thread Stefan Groschupf
working with hadoop he/she should feel free to update the patch and post it in the hadoop jira. Stefan On 18.01.2007, at 15:39, Doug Cutting wrote: Stefan Groschupf wrote: We run the gui in several production environments with patched hadoop code - since this is from our point of view

Re: Next Nutch release

2007-01-18 Thread Stefan Groschupf
don't know if it is the right time to do this job. On 1/19/07, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, I just finished reading all source code about nutch gui. And personally I don't like putting a lot of code snippets into jsp files since it takes a lot of time when refactoring. So how about

Re: Next Nutch release

2007-01-17 Thread Stefan Groschupf
Hi, great to hear people are still working on things. It shows once more that getting something in early would save some effort. :) Just some random comments. We run the gui in several production environments with patched hadoop code - since this is from our point of view the clean approach.

Re: What's the status of Nutch-GUI?

2006-12-02 Thread Stefan Groschupf
Hi Sami, I guess you refer to these: • LocalJobRunner: • Run as kind of singleton • Have a kind of jobQueue • Implement JobSubmissionProtocol status-report methods • implement killJob method Right! -how about writing a nutchrunner that just extends the functionality of

Re: [jira] Created: (NUTCH-408) Plugin development documentation

2006-11-25 Thread Stefan Groschupf
did you ever browse this: http://wiki.media-style.com/display/nutchDocu/Home Nothing big, but it will give you some ideas, also about plugins. On 25.11.2006, at 06:32, Armel T. Nene wrote: I agree with you that documentation is vital not just for extending the current version but also for

Re: Fetcher freezes

2006-11-03 Thread Stefan Groschupf
Hi, try to have no regular expression filter and check if this helps. Let me know if this solves the problem. You may want to do a thread dump and send the log to the list to check where exactly the fetcher freezes. Stefan On 03.11.2006, at 15:53, Aisha wrote: Hi, I don't know why but

Re: How could I test my modify to NutchAnalysis.jj?

2006-09-10 Thread Stefan Groschupf
There is an eclipse java cc plugin. It compiles your grammar and you can easily write test code. However it has its own issues so you may just want to generate the java files with the nutch ant script and then write unit tests against these files. HTH Stefan On 10.09.2006, at 00:49, heack

Re: Patch Available status?

2006-08-31 Thread Stefan Groschupf
Another alternative would be to construct a new workflow that just adds the Patch Available status and still permits issues to be re-opened. +1

Re: Missing pages anchor text

2006-08-29 Thread Stefan Groschupf
Hi Doug, I'm pretty sure that your problem is related to the deduping of your index. In general the hash of the content of a page is used as key for the dedup tool. We ran into the forwarding problem also in another case. https://issues.apache.org/jira/browse/NUTCH-353 So may be we
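
The point above (the content hash is the dedup key, so two URLs with identical content collapse into one index entry) can be sketched roughly as follows. This is only an illustration, not the Nutch dedup tool: the class `ContentHashDedupSketch`, its methods, and the choice of MD5 are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Nutch.
class ContentHashDedupSketch {

    /** Hex-encoded MD5 of the page content, used here as the dedup key (MD5 is an assumption). */
    static String contentKey(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** Keeps only the first URL seen for each distinct content hash. */
    static Map<String, String> dedup(Map<String, String> urlToContent) {
        Map<String, String> keyToUrl = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : urlToContent.entrySet()) {
            keyToUrl.putIfAbsent(contentKey(e.getValue()), e.getKey());
        }
        return keyToUrl;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.com/a", "same body");  // e.g. a page that forwards to /b
        pages.put("http://example.com/b", "same body");  // forward target with identical content
        pages.put("http://example.com/c", "other body");
        // Only one of /a and /b survives, because both map to the same key.
        System.out.println(dedup(pages).values());
    }
}
```

With a key like this, which of the two URLs survives depends only on processing order, which is one way a forwarding page can displace the page it forwards to.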

Re: [Nutch Wiki] Update of RunNutchInEclipse by UrosG

2006-08-29 Thread Stefan Groschupf
Hi, + You may have problems with some imports in parse-mp3 and parse-rtf plugins. Because of incompatibility with apache licence they were left from sources. You can find it here: + + http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ + +

Re: Checking if crawl dir exists ...

2006-08-25 Thread Stefan Groschupf
Hi Michi, what is your motivation for that? Stefan On 25.08.2006, at 06:52, Michael Wechner wrote: Hi I think it would be very useful if the NutchBean would check if the crawl dir exists and throw at least a warning in case it doesn't: Index:

Re: [Fwd: Re: [Nutch Wiki] Update of RenaudRichardet by RenaudRichardet]

2006-08-24 Thread Stefan Groschupf
Hi Renaud, I think you were meaning editing: http://wiki.apache.org/nutch/RunNutchInEclipse , not http://wiki.apache.org/nutch/RenaudRichardet , right? Right! Sorry for the misunderstanding. I have no idea about your personal page so it would be a bad move to edit it. :-) Thanks again for

Re: [Nutch Wiki] Update of RenaudRichardet by RenaudRichardet

2006-08-23 Thread Stefan Groschupf
Hi Renaud, I updated your page with some more details, I hope that is ok for you. Thanks for creating it. Stefan On 23.08.2006, at 11:51, Apache Wiki wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has

Re: Junit testing, was: Re: [jira] Updated: (NUTCH-357) crawling simulation

2006-08-22 Thread Stefan Groschupf
One must also remember that proper junit testing can be used to verify functionality. There's a lot of code currently that is not guarded by unit tests and I hereby invite everybody to participate in this endless effort and make Nutch unit tests better ;) I completely agree!!! Nutch has more

[jira] Commented: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-21 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-354?page=comments#action_12429496 ] Stefan Groschupf commented on NUTCH-354: Since this issue is already closed I can not attach the patch file, so I attach it as text within this comment

Fwd: [webspam-announces] Web Spam Collection Announced

2006-08-21 Thread Stefan Groschupf
Hi, maybe some people will find that posting interesting. Webspam is one of the biggest issues for nutch for whole web crawls from my POV. Greetings, Stefan During AIRWeb'06 we announced the availability of the collection. We are currently planning a Web Spam challenge based on the

[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

2006-08-21 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429534 ] Stefan Groschupf commented on NUTCH-356: Hi Enrico, there will be as many PluginRepositories as Configuration objects. So in case you create many

[jira] Created: (NUTCH-357) crawling simulation

2006-08-21 Thread Stefan Groschupf (JIRA)
crawling simulation --- Key: NUTCH-357 URL: http://issues.apache.org/jira/browse/NUTCH-357 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Fix

[jira] Updated: (NUTCH-357) crawling simulation

2006-08-21 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-357?page=all ] Stefan Groschupf updated NUTCH-357: --- Attachment: protocol-simulation-pluginV1.patch A very first preview of a plugin that helps to simulate crawls. This protocol plugin can be used

[jira] Created: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
Versions: 0.8 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8.1, 0.9.0 MapWritables recycle entries from its internal linked list for performance reasons. The nextEntry of an entry is not reset in case a recyclable entry is found. This can cause wrong
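
A minimal Java sketch of this kind of bug, assuming a singly linked list that keeps cleared entries on a free chain for reuse. It is not the MapWritable code; the class `RecyclingListSketch` and its fields are hypothetical. The one line that matters is the reset of `next` when an entry is taken from the free chain.

```java
// Hypothetical illustration class; not the Nutch/Hadoop MapWritable.
class RecyclingListSketch {

    static final class Entry {
        String value;
        Entry next;
    }

    private Entry head;  // first live entry
    private Entry tail;  // last live entry
    private Entry free;  // recycled entries, chained through 'next'

    void add(String value) {
        Entry e;
        if (free != null) {
            e = free;
            free = free.next;
            // The reset: without it, the recycled entry would still point
            // at the rest of the free chain and drag stale entries into the list.
            e.next = null;
        } else {
            e = new Entry();
        }
        e.value = value;
        if (head == null) {
            head = e;
        } else {
            tail.next = e;
        }
        tail = e;
    }

    /** Moves all live entries onto the free chain so they can be reused. */
    void clear() {
        if (tail != null) {
            tail.next = free;
            free = head;
        }
        head = null;
        tail = null;
    }

    int size() {
        int n = 0;
        for (Entry e = head; e != null; e = e.next) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        RecyclingListSketch list = new RecyclingListSketch();
        list.add("a");
        list.add("b");
        list.add("c");
        list.clear();
        list.add("x");
        System.out.println(list.size());
    }
}
```

If the `e.next = null` line is removed, the list in `main` reports three entries after a single `add`, two of them carrying stale values from before `clear()`.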

[jira] Updated: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-354?page=all ] Stefan Groschupf updated NUTCH-354: --- Attachment: resetNextEntryInMapWritableV1.patch Resets the next Entry of a recycled entry. MapWritable, nextEntry is not reset when Entries

[jira] Commented: (NUTCH-343) Index MP3 SHA1 hashes

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ] Stefan Groschupf commented on NUTCH-343: Thanks for the contribution, also that your patch has a test. :-) Just a small comment from taking a first look

[jira] Updated: (NUTCH-341) IndexMerger now deletes entire workingdir after completing

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ] Stefan Groschupf updated NUTCH-341: --- Attachment: doNotDeleteTmpIndexMergeDirV1.patch +1. I agree it makes completely no sense to be required to create a tmp folder manually and nutch deletes

[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Attachment: respectFetcherParsePropertyV1.patch Hi Jeremy, thanks for catching this. Attached a fix. Should be easy for a contributor to commit

[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Priority: Major (was: Trivial) Fetcher ignores the fetcher.parse value configured in config file

[jira] Created: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE

2006-08-17 Thread Stefan Groschupf (JIRA)
/NUTCH-350 Project: Nutch Issue Type: Bug Reporter: Stefan Groschupf Priority: Critical Intranet crawls or focused crawls will fetch many pages from the same host. This causes a thread to be blocked since another thread already fetches from

[jira] Updated: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-350?page=all ] Stefan Groschupf updated NUTCH-350: --- Attachment: protocolRetryV5.patch This patch will dramatically increase the number of successfully fetched pages of an intranet crawl over the time

[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12428858 ] Stefan Groschupf commented on NUTCH-322: I think this is a serious problem. Page A does a server side redirect to Page B. Page A is never written to the output

[jira] Created: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-08-17 Thread Stefan Groschupf (JIRA)
: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8.1 Attachments: doNotRefecthForwarderPagesV1.patch Pages that do a serverside forward are not written with a status change back into the crawlDb. Also the nextFetchTime

[jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-353?page=all ] Stefan Groschupf updated NUTCH-353: --- Attachment: doNotRefecthForwarderPagesV1.patch Since we discussed that nutch needs to be more polite we should fix that asap. pages that serverside

[jira] Resolved: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ] Stefan Groschupf resolved NUTCH-322. Resolution: Duplicate duplicate of NUTCH-353 Fetcher discards ProtocolStatus, doesn't store redirected pages

[jira] Commented: (NUTCH-347) Build: plugins' Jars not found

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-347?page=comments#action_12428915 ] Stefan Groschupf commented on NUTCH-347: Please submit this patch! Thanks! Build: plugins' Jars not found

[jira] Commented: (NUTCH-346) Improve readability of logs/hadoop.log

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-346?page=comments#action_12428917 ] Stefan Groschupf commented on NUTCH-346: +1 I agree, can you please create a patch file and attach it to this bug. Thanks Improve readability of logs

[jira] Commented: (NUTCH-345) Add support for Content-Encoding: deflated

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428918 ] Stefan Groschupf commented on NUTCH-345: Shouldn't the DeflateUtils also be part of the protocol-http plugin? Also since it is a larger contribution

[jira] Commented: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8

2006-08-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428537 ] Stefan Groschupf commented on NUTCH-349: my vote goes to #2. Having a tool that needs to be started manually would be better than complicating the already

[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-08-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ] Stefan Groschupf commented on NUTCH-233: Hi Otis, yes for a serious whole web crawl I need to change this reg ex first. It only hangs with some random urls

[jira] Updated: (NUTCH-348) Generator is building fetch list using *lowest* scoring URLs

2006-08-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-348?page=all ] Stefan Groschupf updated NUTCH-348: --- Attachment: sortPatchV1.patch What do people think about this kind of solution? Generator is building fetch list using *lowest* scoring URLs

[jira] Created: (NUTCH-332) doubling score causes by page internal anchors.

2006-07-28 Thread Stefan Groschupf (JIRA)
Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8-dev When a page has no outlinks but several links to itself, e.g. it has a set of anchors, the scores of the page are distributed to its outlinks. But all these outlinks point back to the page. This causes

[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-26 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423539 ] Stefan Groschupf commented on NUTCH-318: Yes this happens only in a distributed environment. Please also see my last mail in the hadoop dev list. I think

[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-25 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423433 ] Stefan Groschupf commented on NUTCH-318: Shouldn't that be fixed in .8 since as of today this tool just produces no output?! log4j not proper configured

[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-07-25 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ] Stefan Groschupf commented on NUTCH-233: I think this should be fixed in .8 too, since everybody that does a real whole web crawl with over 100 million pages

Re: segread vs. readseg

2006-07-24 Thread Stefan Groschupf
I like it! On 24.07.2006, at 16:10, Andrzej Bialecki wrote: Stefan Neufeind wrote: Andrzej Bialecki wrote: Stefan Groschupf wrote: Hi developers, we have commands like readdb and readlinkdb but segread. Wouldn't it be more consistent to name the command readseg instead of segread? ... just

result comparison tool?

2006-07-23 Thread Stefan Groschupf
Hi, I remember there was a search result comparison tool within nutch. Is that still alive? How to use it / find it? I was not able to find it by browsing the trunk sources. Is there any such tool people can suggest to compare search results with yahoo or google results to play with

nutch-extensionpoints not in plugin.includes

2006-07-20 Thread Stefan Groschupf
Hi developers, in nutch-default.xml the property plugin.includes says: "In any case you need at least include the nutch-extensionpoints plugin." But we do not include it by default. <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-

Re: nutch-extensionpoints not in plugin.includes

2006-07-20 Thread Stefan Groschupf
I may - but since you know the details of the plugin subsystem, tell me what _should_ be there? I.e. should we really include it in the plugin.includes list, or not? This is a philosophical question. I personally prefer strict definitions, since application behavior is then easier to trace.

[jira] Created: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2006-07-20 Thread Stefan Groschupf (JIRA)
-325 Project: Nutch Issue Type: Bug Affects Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Minor Fix For: 0.8-dev In URLFilters constructor we use an array as long as we have filters defined in the urlfilter.order property

[jira] Updated: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2006-07-20 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-325?page=all ] Stefan Groschupf updated NUTCH-325: --- Attachment: UrlFiltersNPE.patch A patch that uses an ArrayList instead of an array and puts only entries into the list when the entry is not null. Means
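
The approach described in this patch comment can be sketched in plain Java as follows. This is not the attached patch: `OrderedFiltersSketch`, the use of `UnaryOperator<String>` as a stand-in for a URL filter, and the filter names are all assumptions for illustration. A filter returns the URL to accept it or null to reject it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical illustration class; not the Nutch URLFilters implementation.
class OrderedFiltersSketch {

    /**
     * Builds the filter chain in the configured order. Names that have no
     * loaded filter (e.g. listed in the order property but not activated as
     * plugins) are skipped, so the chain never contains a null slot.
     */
    static List<UnaryOperator<String>> orderFilters(
            String[] order, Map<String, UnaryOperator<String>> loaded) {
        List<UnaryOperator<String>> filters = new ArrayList<>();
        for (String name : order) {
            UnaryOperator<String> filter = loaded.get(name);
            if (filter != null) {
                filters.add(filter);
            }
        }
        return filters;
    }

    /** Runs the URL through every filter; null means the URL was rejected. */
    static String filter(List<UnaryOperator<String>> filters, String url) {
        for (UnaryOperator<String> f : filters) {
            if (url == null) {
                return null;
            }
            url = f.apply(url);
        }
        return url;
    }

    public static void main(String[] args) {
        Map<String, UnaryOperator<String>> loaded = new HashMap<>();
        loaded.put("RegexFilter", url -> url.contains("?") ? null : url);
        // "PrefixFilter" is named in the order but was never loaded.
        String[] order = {"PrefixFilter", "RegexFilter"};
        List<UnaryOperator<String>> filters = orderFilters(order, loaded);
        System.out.println(filters.size());
        System.out.println(filter(filters, "http://example.com/page"));
    }
}
```

With a fixed-size array sized by the order list, the missing filter would leave a null slot that is dereferenced later; the list-based version simply has fewer entries.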

log when blocked by robots.txt

2006-07-20 Thread Stefan Groschupf
Hi Developers, another thing in the discussion to be more polite. I suggest that we log a message in case a requested URL was blocked by a robots.txt. Optimal would be if we only log this message in case only the currently used agent name is blocked and it is not a general blocking of all

[jira] Updated: (NUTCH-323) CrawlDatum.set just reference a mapWritable of a other object but not copy it.

2006-07-19 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-323?page=all ] Stefan Groschupf updated NUTCH-323: --- Attachment: MapWritableCopyConstructor.patch The attached patch adds a copy constructor to the MapWritable and uses it in the CrawlDatum.set method. However

[jira] Created: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-19 Thread Stefan Groschupf (JIRA)
Components: fetcher Reporter: Stefan Groschupf Priority: Critical Configuration properties db.score.link.external and db.score.link.internal are ignored. In case of e.g. message board webpages or pages that have larger navigation menus on each page having a lower impact

[jira] Updated: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-19 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-324?page=all ] Stefan Groschupf updated NUTCH-324: --- Attachment: InternalAndExternalLinkScoreFactor.patch Multiply the score of a page during distributeScoreToOutlink with db.score.link.internal
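
As a rough sketch of the idea in this comment, the score passed to an outlink could be scaled by one factor for internal links and another for external links. This is not the attached patch: the class `LinkScoreFactorSketch`, the method signature, and the rule that "internal" means "same host" are assumptions; the two factor parameters stand in for the values of db.score.link.internal and db.score.link.external.

```java
import java.net.URI;

// Hypothetical illustration class; not the Nutch scoring code.
class LinkScoreFactorSketch {

    /**
     * Returns the score to hand to one outlink. A link counts as internal
     * when both URLs have the same host (an assumption for this sketch).
     */
    static float outlinkScore(float score, String fromUrl, String toUrl,
                              float internalFactor, float externalFactor) {
        String fromHost = URI.create(fromUrl).getHost();
        String toHost = URI.create(toUrl).getHost();
        boolean internal = fromHost != null && fromHost.equalsIgnoreCase(toHost);
        return score * (internal ? internalFactor : externalFactor);
    }

    public static void main(String[] args) {
        // Navigation link on the same host: contribution is damped.
        System.out.println(outlinkScore(2.0f, "http://a.example/x", "http://a.example/y", 0.5f, 1.0f));
        // Link to another host: contribution is kept.
        System.out.println(outlinkScore(2.0f, "http://a.example/x", "http://b.example/", 0.5f, 1.0f));
    }
}
```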

[jira] Resolved: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace

2006-07-19 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-319?page=all ] Stefan Groschupf resolved NUTCH-319. Resolution: Won't Fix Sorry, that is bogus since it is written to the logging stream. OPICScoringFilter should use logging API instead

db.max.inlinks

2006-07-18 Thread Stefan Groschupf
Hi, shouldn't db.max.inlinks be in the nutch-default.xml configuration? Stefan

OPICScoringFilter Metadata transport scores as String

2006-07-15 Thread Stefan Groschupf
Hi, OPICScoringFilter line 91: content.getMetadata().set(Fetcher.SCORE_KEY, "" + datum.getScore()); and lines 96, 102: we set and get the fetch score as Strings. :-o. Wouldn't it be better to have the Metadata support floats as well instead of serializing and parsing strings? In general wouldn't it

[jira] Created: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace

2006-07-15 Thread Stefan Groschupf (JIRA)
Affects Versions: 0.8-dev Reporter: Stefan Groschupf Assigned To: Andrzej Bialecki Priority: Trivial Fix For: 0.8-dev OPICScoringFilter line 107 should be a logging call, not an e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it? -- This message

Re: [Nutch-dev] Crawl error

2006-07-10 Thread Stefan Groschupf
changes in version 0.8. The problem is the log message does not say what file is not found. So, it's hard to debug. Any idea? Thanks, AJ On 7/9/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Try to put the conf folder on your classpath in eclipse and set the environment variables

[jira] Created: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-10 Thread Stefan Groschupf (JIRA)
: Stefan Groschupf Priority: Critical Fix For: 0.8-dev In the latest .8 sources the readdb command doesn't dump any information anymore. This is related to the misconfigured log4j.properties file. changing: log4j.rootLogger=INFO,DRFA to: log4j.rootLogger=INFO,DRFA,stdout dumps

Re: [Nutch-dev] Crawl error

2006-07-09 Thread Stefan Groschupf
Try to put the conf folder on your classpath in eclipse and set the environment variables that are set in bin/nutch. Btw, please do not crosspost. Thanks. Stefan On 09.07.2006, at 21:47, AJ Chen wrote: I checked out the 0.8 code from trunk and tried to set it up in eclipse. When trying

Re: Nutch based directory and crawler based on keyword

2006-07-09 Thread Stefan Groschupf
Hi, this question is difficult to answer and maybe there are more experts in the nutch user list than in the developer list. In nutch 0.8 you can use the new scoring api to change the scoring of a page for being scheduled for crawling based on its scores. Have a look at the opic score

Re: Error with Hadoop-0.4.0

2006-07-07 Thread Stefan Groschupf
Hi Jérôme, I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug. We should fix that. Stefan On 06.07.2006, at 08:54, Jérôme Charron wrote: Hi, I encountered some problems with Nutch trunk version. In fact it seems to be related to changes related to

Re: Error with Hadoop-0.4.0

2006-07-07 Thread Stefan Groschupf
We tried your suggested fix: Injector by mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)) and this worked without any problem. Thanks for catching that, this saved us a lot of time. Stefan On 07.07.2006, at 16:08, Jérôme Charron wrote: I have the same problem on

Re: 0.8 release

2006-07-05 Thread Stefan Groschupf
+1, but I really would love to see NUTCH-293 as part of nutch .8 since this is all about being more polite. Thanks. Stefan On 05.07.2006, at 03:46, Doug Cutting wrote: +1 Piotr Kosiorowski wrote: +1. P. Andrzej Bialecki wrote: Sami Siren wrote: How would folks feel about releasing 0.8 now,

<noindex>do not index</noindex>

2006-06-22 Thread Stefan Groschupf
Hi, as far as I can see nutch's html parser only supports the meta tag noindex (<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">) but there is an unofficial html noindex tag. http://www.webmasterworld.com/forum10003/2703.htm Maybe this would be another thing to make nutch more polite. Also

Re: how to manipulate with MapWritable metaData in CrawlDatum structure

2006-06-12 Thread Stefan Groschupf
Hi Feng, MapWritable is a kind of hashmap. You can put in any key value pair, but the keys and values need to be Writables: http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/Writable.html You can use UTF8 as String key and value or ByteWritable as key and UTF8 as values. Etc.

Re: nutch-default.xml configuration

2006-06-12 Thread Stefan Groschupf
Hi Lourival, this means all pages older than 30 days are potential candidates for a fetch list that is created by the segment generation process. Stefan On 12.06.2006, at 16:33, Lourival Júnior wrote: Hi all! I have a question about nutch-default.xml configuration file. There is a

Re: nutch-default.xml configuration

2006-06-12 Thread Stefan Groschupf
Ok. So, have you any solution to do this job automatically? I have a shell script, but I don't see if this really works yet. Shell scripts are the best solution. Sorry if I'm being redundant. I'm learning about this tool and I have a lot of questions :). No problem, but the nutch user

[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-12 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: --- Attachment: ipInCrawlDatumDraftV5.patch Release Candidate 1 of this patch. This patch contains: + add IP Address to CrawlDatum Version 5 (as byte[4

[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-07 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: --- Attachment: ipInCrawlDatumDraftV4.patch Attached a patch that always uses only 4 bytes for the IP. This means we ignore IPv6. This saves us a 4 byte
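
A small Java sketch of the "always 4 bytes, ignore IPv6" idea, using only the standard `java.net` API. It is not the attached patch: the class `Ipv4BytesSketch` and its methods are hypothetical, and returning null for IPv6 addresses is an assumption about how "ignore" might look.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical illustration class; not the Nutch CrawlDatum code.
class Ipv4BytesSketch {

    /** Returns the 4 raw bytes of an IPv4 address, or null for anything else (IPv6 is ignored). */
    static byte[] toStoredBytes(InetAddress address) {
        if (address instanceof Inet4Address) {
            return address.getAddress();  // always 4 bytes for IPv4
        }
        return null;
    }

    /** Builds an address from raw bytes without any DNS lookup (length must be 4 or 16). */
    static InetAddress fromBytes(byte[] raw) {
        try {
            return InetAddress.getByAddress(raw);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("raw address must be 4 or 16 bytes", e);
        }
    }

    public static void main(String[] args) {
        InetAddress v4 = fromBytes(new byte[] {(byte) 192, (byte) 168, 0, 1});
        System.out.println(toStoredBytes(v4).length);

        byte[] raw6 = new byte[16];
        raw6[0] = 0x20;
        raw6[1] = 0x01;  // an address in 2001::/16, i.e. a real IPv6 address
        System.out.println(toStoredBytes(fromBytes(raw6)) == null);
    }
}
```

A fixed 4-byte field needs no length prefix when serialized, which is the usual reason for choosing a fixed width over a variable-length address.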

[jira] Created: (NUTCH-302) java doc of CrawlDb is wrong

2006-06-07 Thread Stefan Groschupf (JIRA)
java doc of CrawlDb is wrong Key: NUTCH-302 URL: http://issues.apache.org/jira/browse/NUTCH-302 Project: Nutch Type: Bug Reporter: Stefan Groschupf Priority: Trivial Fix For: 0.8-dev CrawlDb has the same java doc

[jira] Updated: (NUTCH-301) CommonGrams loads analysis.common.terms.file for each query

2006-06-07 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-301?page=all ] Stefan Groschupf updated NUTCH-301: --- Attachment: CommonGramsCacheV1.patch Cache HashMap COMMON_TERMS in configuration instance. CommonGrams loads analysis.common.terms.file for each query

[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-07 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415171 ] Stefan Groschupf commented on NUTCH-293: Any comments? There was already a posting in the nutch agent mailing list, where someone had banned nutch since nutch does

resolving IP in...

2006-06-07 Thread Stefan Groschupf
Hi, after playing around to figure out the best place to resolve IPs of freshly discovered URLs I agree with Andrzej that the ParseOutputFormat isn't the best place. The problem here: ParseOutputFormat is not multithreaded and we definitely need many threads for IP lookup. I think in

[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-07 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415236 ] Stefan Groschupf commented on NUTCH-293: Hi Andrzej, I agree but writing a queue based fetcher is a big step. I already have some basic code (nio based). Also I don't

Re: svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar

2006-06-06 Thread Stefan Groschupf
As far as I understand hadoop uses commons logging. Should we switch to use commons logging as well? On 06.06.2006, at 11:02, Jérôme Charron wrote: URL: http://svn.apache.org/viewvc?rev=411943&view=rev Log: Updating to Hadoop release 0.3.1. Hadoop now uses Jakarta Commons Logging, configured

[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ] Stefan Groschupf commented on NUTCH-258: Scott, I agree with you. However we need a clean patch to solve the problem, we can not just comment things out of the code

[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-05 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: --- Attachment: ipInCrawlDatumDraftV1.patch To keep the discussion alive attached a _first draft_ for storing the ip in the crawlDatum for public discussion

Re: [Nutch-cvs] svn commit: r411594 - /lucene/nutch/trunk/contrib/web2/plugins/build.xml

2006-06-05 Thread Stefan Groschupf
hmm... didn't think about that, are there more opinions about this? I don't believe this "don't be evil" thing at all. I think it is just a question of time until google feels we attack the appliance server market and I believe nutch has a serious chance to do so (some time in the far future.

Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Stefan Groschupf
I have a proposal for a simple solution: set a flag in the current Configuration instance, and check for this flag. The Configuration instance provides a task-specific context persisting throughout the lifetime of a task - but limited only to that task. Voila - problem solved. We get rid

[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown

2006-06-04 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ] Stefan Groschupf updated NUTCH-298: --- Summary: if a 404 for a robots.txt is returned a NPE is thrown (was: if a 404 for a robots.txt is returned no page is fetched at all from the host

Re: search engine spam detector

2006-06-04 Thread Stefan Groschupf
The idea to have something like this as a nutch-module (dropping pages or ranking them very low) might come up :-) This will be a very long way. I collect some thoughts and a list of web spam related papers in my blog. http://www.find23.net/Web-Site/blog/521BA1CD-14C4-4E84-A072-

[jira] Created: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-03 Thread Stefan Groschupf (JIRA)
: Stefan Groschupf Fix For: 0.8-dev What happens: If no RobotRuleSet is in the cache for a host, we try to fetch the robots.txt. In case the http response code is not 200 or 403 but for example 404 we do robotRules = EMPTY_RULES; (line: 402) EMPTY_RULES is a RobotRuleSet created

[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-03 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ] Stefan Groschupf updated NUTCH-298: --- Attachment: fixNpeRobotRuleSet.patch Fixes the NPE in RobotRuleSet that happens in case we use an empty RuleSet. if a 404 for a robots.txt is returned no page

RobotRuleSet

2006-06-03 Thread Stefan Groschupf
Hi, just posted a fix for a NPE in case an empty RobotRuleSet is used. The patch only contains a two-line fix, since I learned that this is the best way to get things committed sooner. :) However I really don't like the RobotRuleSet implementation since entries are copied between an arraylist and a

[jira] Commented: (NUTCH-282) Showing too few results on a page (Paging not correct)

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414435 ] Stefan Groschupf commented on NUTCH-282: Is that related to host grouping we discussed? Can we in this case close this bug? Showing too few results on a page (Paging

[jira] Commented: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ] Stefan Groschupf commented on NUTCH-286: This is difficult to realize since the http error code is read from the response in the fetcher and set into the protocol

[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ] Stefan Groschupf commented on NUTCH-292: +1, Can someone create a clean patch file? OpenSearchServlet: OutOfMemoryError: Java heap space

[jira] Commented: (NUTCH-291) OpenSearchServlet should return date as well as lastModified

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414445 ] Stefan Groschupf commented on NUTCH-291: lastModified will be only indexed if you switch on the index-more plugin. If you think you should change the way lastmodified

[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ] Stefan Groschupf commented on NUTCH-290: If a parser throws an exception: Fetcher, 261: try { parse = this.parseUtil.parse(content); parseStatus

[jira] Closed: (NUTCH-287) Exception when searching with sort

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-287?page=all ] Stefan Groschupf closed NUTCH-287: -- Resolution: Won't Fix http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04696.html Exception when searching with sort

[jira] Closed: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-284?page=all ] Stefan Groschupf closed NUTCH-284: -- Resolution: Won't Fix Yes, I was missing index-basic. NullPointerException during index - Key: NUTCH-284

[jira] Commented: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12414453 ] Stefan Groschupf commented on NUTCH-284: Please try to discuss such things on the user mailing list first, and only then open an issue. Maintaining the issue tracking is very time

[jira] Commented: (NUTCH-281) cached.jsp: base-href needs to be outside comments

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-281?page=comments#action_12414454 ] Stefan Groschupf commented on NUTCH-281: Can you submit a patch file? cached.jsp: base-href needs to be outside comments

[jira] Commented: (NUTCH-274) Empty row in/at end of URL-list results in error

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ] Stefan Groschupf commented on NUTCH-274: Should we fix this in Hadoop's TextInputFormat, so that it ignores empty lines, or in the Injector? Empty row in/at end of URL-list
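Whichever component ends up carrying the fix, the core of it is skipping blank lines before they are treated as URLs. A minimal sketch in plain Java, assuming a hypothetical helper UrlListFilter that is not part of Nutch or Hadoop:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; neither Nutch's Injector nor Hadoop's TextInputFormat.
public class UrlListFilter {

    // Returns the trimmed lines of a URL list, dropping empty and
    // whitespace-only lines such as a trailing blank row.
    public static List<String> nonEmptyLines(String urlList) {
        List<String> urls = new ArrayList<>();
        for (String line : urlList.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                urls.add(trimmed);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String seeds = "http://a.example/\n\nhttp://b.example/\n";
        System.out.println(nonEmptyLines(seeds));
    }
}
```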

[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ] Stefan Groschupf commented on NUTCH-290: As far as I understand the code, the next parser is only used if the previous parser returns with an unsuccessful parsing status

[jira] Closed: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-286?page=all ] Stefan Groschupf closed NUTCH-286: -- Resolution: Won't Fix I hope everybody agrees with the statement: we cannot detect HTTP response codes based on the returned HTML content. Prune

[jira] Created: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-01 Thread Stefan Groschupf (JIRA)
support for Crawl-delay in Robots.txt - Key: NUTCH-293 URL: http://issues.apache.org/jira/browse/NUTCH-293 Project: Nutch Type: Improvement Components: fetcher Versions: 0.8-dev Reporter: Stefan Groschupf

[jira] Updated: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-01 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-293?page=all ] Stefan Groschupf updated NUTCH-293: --- Attachment: crawlDelayv1.patch A first draft of crawl-delay support for Nutch. The problem I see is that if IP-based delay is configured, it can
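Reading a Crawl-delay value out of a robots.txt body could look roughly like the sketch below. It is not the attached patch: the class CrawlDelayParser is hypothetical, the value is assumed to be given in seconds, and the sketch deliberately ignores User-agent grouping by returning the first Crawl-delay line it finds.

```java
// Hypothetical sketch; not the crawlDelayv1.patch code.
public class CrawlDelayParser {

    // Returns the first Crawl-delay value in milliseconds, or -1 if the
    // robots.txt body contains no usable Crawl-delay line.
    // Simplification: User-agent sections are not taken into account.
    public static long parseCrawlDelayMillis(String robotsTxt) {
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Strip trailing comments before looking at the field.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim();
            if (!field.equalsIgnoreCase("Crawl-delay")) {
                continue;
            }
            try {
                double seconds = Double.parseDouble(line.substring(colon + 1).trim());
                // Reject negative, NaN and infinite values.
                if (seconds >= 0 && !Double.isInfinite(seconds)) {
                    return Math.round(seconds * 1000);
                }
            } catch (NumberFormatException e) {
                // Malformed value: keep scanning the remaining lines.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(parseCrawlDelayMillis("User-agent: *\nCrawl-delay: 5\n"));
    }
}
```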

Re: JVM error while parsing

2006-05-30 Thread Stefan Groschupf
Hi, I heard there is a bug in JVM 1.5_06 beta. Can you try an older one, or maybe a 1.4 JVM, and report whether this happens with another JVM as well? Thanks, Stefan On 30.05.2006, at 14:14, Uygar Yüzsüren wrote: Hi everyone, I am using Hadoop-0.2.0 and Nutch-0.8, and at the moment trying to

Re: Extract infos from documents and query external sites

2006-05-30 Thread Stefan Groschupf
Think about using the Google API. However, the way to go could be: + fetch your pages + do not parse the pages + write a map reduce job that extracts your data ++ make an XHTML DOM from the HTML, e.g. using neko ++ use XPath queries to extract your data ++ also check out gate as a named entity
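The DOM-plus-XPath step of that outline can be sketched with the JDK alone. This is an illustration under assumptions: the class XPathExtractor is hypothetical, and the JDK's DocumentBuilder stands in for neko, so the input must already be well-formed XHTML without a DOCTYPE. Real-world HTML would need a lenient parser first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: the JDK XML parser is used here in place of neko.
public class XPathExtractor {

    // Builds a DOM from well-formed XHTML and returns the string value
    // of the first node matched by the XPath expression ("" if none).
    public static String extract(String xhtml, String expression) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, dom).trim();
        } catch (Exception e) {
            // Parsing and XPath errors are reported as one unchecked exception.
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Nutch</h1><p class=\"price\">42</p></body></html>";
        System.out.println(extract(page, "//p[@class='price']"));
    }
}
```

In a map reduce job, a method like extract would be called once per fetched page inside the map step, with the page URL as the key.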
