Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Piotr Kosiorowski
pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I have committed them a long time ago in an attempt to bring some static analysis tools to nutch sources. There was a short discussion around it and we all thought it was worth doing but it never gained enough momentum. There is a pmd

Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Piotr Kosiorowski
/20 Piotr Kosiorowski : pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I have committed them a long time ago in an attempt to bring some static analysis tools to nutch sources. There was a short discussion around it and we all thought it was worth doing but it never gained

Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Piotr Kosiorowski
2009/1/20 Piotr Kosiorowski : pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I have committed them a long time ago in an attempt to bring some static analysis tools to nutch sources. There was a short discussion around it and we all thought it was worth doing but it never

Re: FW: Nutch release process help

2007-03-06 Thread Piotr Kosiorowski
Chris, I have documented the process in the wiki. Doug has sent the links already. If you have any questions I would be willing to help. I can even do it myself if you find it difficult - I simply do not want to be the bottleneck as I am behind schedule at work and in private life. I still hope

Re: Reviving Nutch 0.7

2007-01-22 Thread Piotr Kosiorowski
Otis, Some time ago people on the list said that they are willing to at least maintain the Nutch 0.7 branch. As a committer (not very active recently) I volunteered to commit patches when they appear - I do not have enough time at the moment to do active coding. I have created a 0.7.3 release in JIRA

[jira] Closed: (NUTCH-429) Secured Searches

2007-01-11 Thread Piotr Kosiorowski (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Piotr Kosiorowski closed NUTCH-429. --- Resolution: Invalid Please use nutch-user mailing list for such questions and JIRA

Re: 0.7.3 version

2006-11-23 Thread Piotr Kosiorowski
As no objections were raised I created a 0.7.3 version in JIRA so we can start assigning current JIRA issues to it. Regards Piotr Piotr Kosiorowski wrote: Hello committers, Based on a recent discussion on nutch user list - (Strategic Direction of Nutch) I would like to prepare 0.7.3 release

0.7.3 version

2006-11-16 Thread Piotr Kosiorowski
Hello committers, Based on a recent discussion on nutch user list - (Strategic Direction of Nutch) I would like to prepare 0.7.3 release. The idea is to allow people who still use 0.7.2 to get rid of most important bugs and allow them to add some small features they would need as the claim is

Re: How to start working with MapReduce?

2006-11-11 Thread Piotr Kosiorowski
Please read the tutorial on the nutch site. I suggest posting such issues to nutch-user - you will have a much higher chance of getting a useful response there. regards Piotr On 11/9/06, kauu [EMAIL PROTECTED] wrote: or it's the same with the version 0.8.x any idea is appreciated On 11/9/06, kauu [EMAIL

Re: email to jira comments (WAS Re: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements)

2006-10-16 Thread Piotr Kosiorowski
+1 On 10/16/06, Doug Cutting [EMAIL PROTECTED] wrote: Sami Siren wrote: looks like somebody just enabled email-to-jira-comments-feature. I was just wondering would it be good to use this feature more widely. I think it would be good. That way mailing list discussion would be logged to the

Re: Nutch requires JDK 1.5 now?

2006-10-03 Thread Piotr Kosiorowski
I had a look at it and it seems I do not have enough permissions to change it. So probably this one goes to Doug... P. Chris Mattmann wrote: Hey Guys, Speaking of which, I noticed that Sami's issue below is a Task in JIRA, which reminded me of a task that I input a long time ago that would be

[jira] Assigned: (NUTCH-374) when http.content.limit be set to -1 and Response.CONTENT_ENCODING is gzip or x-gzip , it can not fetch any thing.

2006-09-30 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-374?page=all ] Piotr Kosiorowski reassigned NUTCH-374: --- Assignee: Piotr Kosiorowski when http.content.limit be set to -1 and Response.CONTENT_ENCODING is gzip or x-gzip , it can not fetch any

Re: 0.8 release

2006-07-27 Thread Piotr Kosiorowski
No objections from me. We waited long and we can fix things in a maintenance release in a few weeks. Regards Piotr On 7/26/06, Sami Siren [EMAIL PROTECTED] wrote: Andrzej Bialecki wrote: Sami Siren wrote: There is a package available for testing in http://people.apache.org/~siren/nutch-0.8/

Re: log when blocked by robots.txt

2006-07-21 Thread Piotr Kosiorowski
I think I would log in both situations but with a different message. +1 P. On 7/21/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Developers, another thing in the discussion to be more polite. I suggest that we log a message in case a requested URL was blocked by a robots.txt. Optimal would be if

Re: Nutch web site

2006-07-04 Thread Piotr Kosiorowski
, is there a reason why this (among other) documentation (for all relevant versions) could not be maintained in trunk? -- Sami Siren Piotr Kosiorowski wrote: Andrzej Bialecki wrote: +1, yes it would be really confusing. Since there are more and more people trying 0.8, could we perhaps

Re: 0.8 release

2006-07-04 Thread Piotr Kosiorowski
+1. P. Andrzej Bialecki wrote: Sami Siren wrote: How would folks feel about releasing 0.8 now, there has been quite a lot of improvements/new features since 0.7 series and I strongly feel that we should push the first 0.8 series release (alfa/beta) out the door now. It would IMO lower the

Re: 0.8 release?

2006-04-13 Thread Piotr Kosiorowski
it so many times that I want to cross check). Regards Piotr Dawid Weiss wrote: What kind of problems? If you need something, let me know. D. Piotr Kosiorowski wrote: I got some problems while applying Dawid's clustering patch (my linux environment looks not to be set up correctly) - but I switched

Re: 0.8 release?

2006-04-12 Thread Piotr Kosiorowski
I got some problems while applying Dawid's clustering patch (my linux environment looks not to be set up correctly) - but I switched to cygwin and it looks ok. I will try to commit it today/tomorrow. Regards Piotr On 4/12/06, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Guys, Any progress on the

Re: mapred branch

2006-04-10 Thread Piotr Kosiorowski
Anton Potehin wrote: Where now placed mapred branch of nutch ? it is developed in trunk now. P.

Re: PMD integration

2006-04-09 Thread Piotr Kosiorowski
Jérôme Charron wrote: 2) We do have oro 2.0.7 in dependencies (I think urlfilter and similar things). PMD requires oro 2.0.8. Do you think we can upgrade (as far as I know 2.0.7 and 2.0.8 should be compatible)? We would have only one oro jar then. Piotr, please keep oro-2.0.8 in pmd-ext I

Re: PMD integration

2006-04-07 Thread Piotr Kosiorowski
I do agree with Jérôme - plugins should be checked too. I would like to integrate PMD for core and plugins over the weekend based on Dawid's work - I will make it a totally separate target (so tests do not depend on it). The goal is to allow other developers to play with pmd easily but at the

Re: PMD integration

2006-04-07 Thread Piotr Kosiorowski
I will make it a totally separate target (so tests do not depend on it). That was actually Doug's idea (and I agree with it) to stop the build file if PMD complains about something. It's similar to testing -- if your tests fail, the entire build file fails. I totally agree with it - but I

Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Piotr Kosiorowski
Doug Cutting wrote: Piotr, would you like to make this release, or should I? I would prefer you would do it this time - I am not sure if I can find some time next week. I would like to do some things before release though: 1) Commit clustering patch from Dawid (I took it over from Andrzej).

Re: Patch to remove Nutch formating from logs

2006-04-07 Thread Piotr Kosiorowski
Hello Christopher, I personally do not like combining logging with severe error handling but it is one of the features of Nutch for some time and I do not think it causes infinite loops in normal installations. Changing it as we are preparing to release a new version is not a good idea in my

Re: PMD integration

2006-04-07 Thread Piotr Kosiorowski
think we can upgrade (as far as I know 2.0.7 and 2.0.8 should be compatible)? We would have only one oro jar then. So happy PMD-ing, Piotr Doug Cutting wrote: Piotr Kosiorowski wrote: I will make it a totally separate target (so tests do not depend on it). That was actually Doug's idea (and I

Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-06 Thread Piotr Kosiorowski
+1 - I offer my help - we can coordinate it and I can do a part of work. I will also try to commit your patches quickly. Piotr On 4/6/06, Dawid Weiss [EMAIL PROTECTED] wrote: Other options (raised on the Hadoop list) are Checkstyle: PMD seems to be the best choice for an Apache project and

PMD integration (was: Re: Add .settings to svn:ignore on root Nutch folder?)

2006-04-06 Thread Piotr Kosiorowski
=1465574&group_id=56262 D. Piotr Kosiorowski wrote: +1 - I offer my help - we can coordinate it and I can do a part of work. I will also try to commit your patches quickly. Piotr On 4/6/06, Dawid Weiss [EMAIL PROTECTED] wrote: Other options (raised on the Hadoop list) are Checkstyle: PMD

[jira] Closed: (NUTCH-239) I changed httpclient to use javax.net.ssl instead of com.sun.net.ssl

2006-03-25 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-239?page=all ] Piotr Kosiorowski closed NUTCH-239: --- Fix Version: 0.7.2-dev Resolution: Fixed Assign To: Piotr Kosiorowski Applied with JavaDoc changes. Thanks. I changed httpclient to use

[jira] Closed: (NUTCH-94) MapFile.Writer throwing 'File exists error'.

2006-03-25 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-94?page=all ] Piotr Kosiorowski closed NUTCH-94: -- Fix Version: 0.7.2-dev Resolution: Duplicate Assign To: Piotr Kosiorowski Duplicate of NUTCH-117. MapFile.Writer throwing 'File exists

[jira] Closed: (NUTCH-14) NullPointerException NutchBean.getSummary

2006-03-25 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-14?page=all ] Piotr Kosiorowski closed NUTCH-14: -- Resolution: Cannot Reproduce Closed according to Stefan suggestion NullPointerException NutchBean.getSummary

[jira] Closed: (NUTCH-117) Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL

2006-03-25 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-117?page=all ] Piotr Kosiorowski closed NUTCH-117: --- Fix Version: 0.7.2-dev Resolution: Fixed Assign To: Piotr Kosiorowski Applied fix by Mike. Also reported offlist by Michal Karwanski

Site switched to branch-0.7.

2006-03-09 Thread Piotr Kosiorowski
Hi, I have updated site in 0.7 branch with latest trunk changes. I have added both tutorials to the site so people will be aware of differences. I have also committed DOAP file in 0.7 branch. Nutch Website uses branch-0.7 now. Piotr

Nutch 0.7.2

2006-03-09 Thread Piotr Kosiorowski
Hello, I would like to release nutch 0.7.2 in a week or two. Some serious bugfixes are already covered and I have a plan to fix one or two more. I found an email from Doug with title [Fwd: Crawler submits forms?] stating: This has been fixed in the mapred branch, but that patch is not in

[jira] Closed: (NUTCH-225) Changed the links to the tutorial to point to the wiki

2006-03-09 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-225?page=all ] Piotr Kosiorowski closed NUTCH-225: --- Resolution: Won't Fix I have just updated Nutch Web site. It contains now both tutorials (for 0.7 and 0.8). I have also added a note to each

Re: Tutorial

2006-03-09 Thread Piotr Kosiorowski
Oops, sorry for ignoring this discussion - I was looking for comments in JIRA and already committed the change before reading your discussion. My motivation is to have usable version of tutorial - as simple as it is possible to be versioned with the sources - only for historical purposes - if

[jira] Closed: (NUTCH-91) empty encoding causes exception

2006-03-09 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-91?page=all ] Piotr Kosiorowski closed NUTCH-91: -- Fix Version: 0.7.2-dev 0.8-dev Resolution: Fixed Committed with small extension. Thanks. empty encoding causes exception

[jira] Commented: (NUTCH-225) Changed the links to the tutorial to point to the wiki

2006-03-07 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-225?page=comments#action_12369405 ] Piotr Kosiorowski commented on NUTCH-225: - As stated in another thread I prefer to have a simple tutorial kept in version control with releases. We already have

Nutch web site

2006-03-06 Thread Piotr Kosiorowski
Hi, It looks like Nutch web site was updated with site built from latest trunk - the only problem is it contains tutorial for unreleased (yet) version 0.8. I think we talked about it and agreed to keep tutorial for latest release on the Web. I have just updated site in svn (branch-0.7) with

Re: Nutch web site

2006-03-06 Thread Piotr Kosiorowski
Andrzej Bialecki wrote: +1, yes it would be really confusing. Since there are more and more people trying 0.8, could we perhaps include a short note that 0.8 and later is NOT compatible with this tutorial, and a reference to the tutorial for 0.8 (or the trunk/ branch in general)? I can

[jira] Commented: (NUTCH-79) Fault tolerant searching.

2006-01-30 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-79?page=comments#action_12364496 ] Piotr Kosiorowski commented on NUTCH-79: I think it should work without changes I suggested in previous comment - they would be simply useful additions. I was not using

[jira] Closed: (NUTCH-45) Log corrupt segments in SegmentMergeTool

2006-01-20 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-45?page=all ] Piotr Kosiorowski closed NUTCH-45: -- Fix Version: 0.7.2-dev Resolution: Fixed Applied. Thanks. Log corrupt segments in SegmentMergeTool

[jira] Closed: (NUTCH-174) Problem encountered with ant during compilation

2006-01-14 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-174?page=all ] Piotr Kosiorowski closed NUTCH-174: --- Fix Version: 0.7.2-dev 0.8-dev Resolution: Fixed Fixed some time ago during preparation of 0.7.2 release. Please use version

Re: test suite fails?

2006-01-09 Thread Piotr Kosiorowski
It fails on my machine on parse-ext tests. I am not sure what is causing it yet and I am afraid I do not have time to investigate it today - maybe in a few days. I did a small change to make it compile a few days ago, but all tests went ok before I committed it. Regards Piotr Stefan Groschupf

Re: no static NutchConf

2006-01-04 Thread Piotr Kosiorowski
+1 in general In fact I like the approach presented by Stefan to pass only required parameters to objects that have small number of configurable params instead of NutchConf - it makes it obvious which parameters are required for such basic objects to run and as they are usually building blocks

Re: svn commit: r365850 - in /lucene/nutch/trunk/src/plugin/protocol-httpclient: ./ lib/ src/java/org/apache/nutch/protocol/httpclient/

2006-01-04 Thread Piotr Kosiorowski
Andrzej, Do you think it would be a good idea to commit it in 0.7 branch for 0.7.2 release? I personally prefer to use released libraries instead of RC if possible. It does not require a lot of changes and you have already tested it with existing code... Piotr [EMAIL PROTECTED] wrote:

[jira] Closed: (NUTCH-142) NutchConf should use the thread context classloader

2006-01-04 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-142?page=all ] Piotr Kosiorowski closed NUTCH-142: --- Fix Version: 0.7.2-dev 0.8-dev Resolution: Fixed NutchConf should use the thread context classloader

[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361520 ] Piotr Kosiorowski commented on NUTCH-138: - I am not sure but I would suspect it is a problem of bad tomcat configuration. To handle special characters in query urls

[jira] Closed: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-138?page=all ] Piotr Kosiorowski closed NUTCH-138: --- Resolution: Invalid Setting URIEncoding in tomcat config file fixes the problem. non-Latin-1 characters cannot be submitted for search
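
Why the URIEncoding setting matters can be sketched in a few lines of Java. This is only an illustration under assumptions, not code from Nutch or Tomcat: the class and method names (QueryDecoding, encode, decode) are hypothetical, and it assumes the browser percent-encodes the query as UTF-8 while the servlet container decodes it as ISO-8859-1 unless the connector's URIEncoding attribute says otherwise.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class: shows how the same percent-encoded query
// decodes differently depending on the charset the server assumes.
public class QueryDecoding {

    // Percent-encodes a query term using the given charset name.
    static String encode(String term, String charset) {
        try {
            return URLEncoder.encode(term, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decodes a percent-encoded query term using the given charset name.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String term = "\u017c\u00f3\u0142w";   // a Polish word with non-Latin-1 letters
        String onTheWire = encode(term, "UTF-8");

        // Decoding with the wrong charset turns each UTF-8 byte into its own character.
        System.out.println(term.equals(decode(onTheWire, "ISO-8859-1")));
        // Decoding with the charset the client used restores the original term.
        System.out.println(term.equals(decode(onTheWire, "UTF-8")));
    }
}
```

If the assumption holds, the search code never sees the term the user typed, which would explain why the issue was closed as a configuration problem rather than a Nutch bug.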

[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361549 ] Piotr Kosiorowski commented on NUTCH-138: - BTW - just create a user for yourself in nutch Wiki and you should be able to add a new page with information without problems

Re: Mega-cleanup in trunk/

2006-01-01 Thread Piotr Kosiorowski
Andrzej Bialecki wrote: Hi, I just committed a large patch to cleanup the trunk/ of obsolete and broken classes remaining from the 0.7.x development line. Please test that things still work as they should ... Hi, I am not sure what is wrong but a lot of JUnit tests simply do not compile -

Re: how to add additional factor at search time to ranking score

2006-01-01 Thread Piotr Kosiorowski
AJ Chen wrote: It would be great if I can add some new functions to the nutch code to accomplish this. But, if it requires to customize lucene code, that's fine. I have tried to use the most recent release (1.4.3) of lucene source code, but it did not work. Is the lucene jar files included

[jira] Commented: (NUTCH-142) NutchConf should use the thread context classloader

2006-01-01 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-142?page=comments#action_12361492 ] Piotr Kosiorowski commented on NUTCH-142: - Thanks. Fixed in 0.7 branch. Left open to fix it in trunk after cleaning trunk JUnit test problems (in next few days

[jira] Closed: (NUTCH-42) enhance search.jsp such that it can also returns XML

2005-12-31 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-42?page=all ] Piotr Kosiorowski closed NUTCH-42: -- Fix Version: 0.7.2-dev 0.8-dev Resolution: Fixed OpenSearch implemented. enhance search.jsp such that it can also returns XML

[jira] Commented: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates

2005-12-23 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-148?page=comments#action_12361206 ] Piotr Kosiorowski commented on NUTCH-148: - 'df' command is required for NDFS operation so if you were not using NDFS in 0.7.1 and nutch shell scripts you were able

[jira] Closed: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates

2005-12-23 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-148?page=all ] Piotr Kosiorowski closed NUTCH-148: --- Resolution: Invalid org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates

[jira] Closed: (NUTCH-147) nutch map reduce does not work in windows map reduce runs in a loop

2005-12-23 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-147?page=all ] Piotr Kosiorowski closed NUTCH-147: --- Resolution: Invalid cygwin requirement on Windows is listed in nutch tutorial. Please reopen if problems persist after using it from cygwin

[jira] Commented: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates

2005-12-22 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-148?page=comments#action_12361128 ] Piotr Kosiorowski commented on NUTCH-148: - Do you have Cygwin installed? Is 'df' working in your cygwin installation? Do you run crawl from cygwin shell? Nutch

Re: [VOTE] Commiter access for Stefan Groschupf

2005-12-19 Thread Piotr Kosiorowski
+1 - especially for amount of support Stefan gives to nutch users. P. Andrzej Bialecki wrote: Hi, During the past year and more Stefan participated actively in the development, and contributed many high-quality patches. He's been spending considerable effort on addressing many issues in JIRA,

Re: svn commit: r357334 - in /lucene/nutch/trunk: conf/nutch-default.xml src/java/org/apache/nutch/protocol/Content.java src/java/org/apache/nutch/protocol/ContentProperties.java

2005-12-17 Thread Piotr Kosiorowski
Doug Cutting wrote: [EMAIL PROTECTED] wrote: +/* + * (non-Javadoc) + * + * @see org.apache.nutch.io.Writable#write(java.io.DataOutput) + */ +public final void write(DataOutput out) throws IOException { We should either include javadoc or not. In general, all

JUnit test failures

2005-12-15 Thread Piotr Kosiorowski
Hi, I have problems with JUnit tests in trunk and mapred branches. TestFetcher fails in both branches. The same test executes correctly in 0.7 branch. Is it only my problem (environment setup) or others are having it too? I would suspect some changes in redirect handling Regards Piotr

Re: [Fwd: Crawler submits forms?]

2005-12-15 Thread Piotr Kosiorowski
Doug Cutting wrote: Andrzej Bialecki wrote: Please also don't forget that the trunk/ will soon be invaded by the code from mapred, I guess some time around the middle of January (Doug?) Thinking about this more, perhaps we should do it sooner. There's already a branch for 0.7.x releases,

Re: Lucene performance bottlenecks

2005-12-08 Thread Piotr Kosiorowski
Hi, I started to think about implementing special kind of Lucene Query (if I remember correctly I would have to write my own Scorer and probably a few other classes) optimized for Nutch some time ago. I assumed having specialized query I would be able to avoid accessing some of lucene index

Re: Urlfilter Patch

2005-12-01 Thread Piotr Kosiorowski
Jérôme Charron wrote: [...] build a list of file extensions to include (other ones will be excluded) in the fetch process. [...] I would not like to exclude all others - as for example many extensions are valid for html - especially dynamically generated pages (jsp,asp,cgi just to name the easy

Re: Performance issues with ConjunctionScorer

2005-11-22 Thread Piotr Kosiorowski
On 11/22/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi, I've been profiling a Nutch installation, and to my surprise the largest amount of throwaway allocations and the most time spent was not in Nutch specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method. This method

Re: Performance issues with ConjunctionScorer

2005-11-22 Thread Piotr Kosiorowski
, Andrzej Bialecki [EMAIL PROTECTED] wrote: Piotr Kosiorowski wrote: On 11/22/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi, I've been profiling a Nutch installation, and to my surprise the largest amount of throwaway allocations and the most time spent was not in Nutch specific code

[jira] Closed: (NUTCH-99) ports are hardcoded or random

2005-11-14 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-99?page=all ] Piotr Kosiorowski closed NUTCH-99: -- Resolution: Fixed Patch committed. Thanks Stefan. ports are hardcoded or random - Key: NUTCH-99 URL

Re: suspicious outlink count

2005-11-13 Thread Piotr Kosiorowski
EM wrote: 202443 Pages consumed: 13 (at index 13). Links fetched: 233386. 202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/]. 202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315. If there is maxoutlinks already specified in the xml config, why does

Re: to many hdd reads

2005-10-11 Thread Piotr Kosiorowski
Committed in trunk and branch-0.7 (just in case we decide to make a 0.7.2 release sometime). Thanks Piotr On 10/11/05, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, don't think I'm fuddy-duddy but is it really sensible to do following in the nutchbean? File [] directories =

Nutch 0.7.1 and Nutch web site

2005-10-01 Thread Piotr Kosiorowski
Hello, I have prepared Nutch 0.7.1 release today but I had one problem. I was updating the site in branch but to deploy it one must use the version from trunk. Currently I simply committed generated site in trunk but this solution is far from perfect. Should we have version independent site -

Re: Nutch Suggestion? (Google like did you mean)

2005-09-29 Thread Piotr Kosiorowski
Have a look at http://issues.apache.org/jira/browse/NUTCH-48. I think an n-gram based approach is appropriate here. I was using it in our search engine. Regards Piotr On 9/29/05, Jack Tang [EMAIL PROTECTED] wrote: Hi I very much like Google's Did you mean and I notice that nutch now does not
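
As a rough illustration of what an n-gram based suggester does, the following Java sketch ranks dictionary words by how many character n-grams they share with the query. It is not the NUTCH-48 code: the class NgramSuggest, its method names, the boundary markers and the choice of Jaccard overlap as the score are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of n-gram based "did you mean" ranking.
public class NgramSuggest {

    // Splits a word into character n-grams, with ^ and $ marking the word boundaries.
    static Set<String> ngrams(String word, int n) {
        String padded = "^" + word.toLowerCase(Locale.ROOT) + "$";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard overlap of the two n-gram sets: shared grams divided by all distinct grams.
    static double similarity(String a, String b, int n) {
        Set<String> shared = ngrams(a, n);
        Set<String> all = new HashSet<>(shared);
        Set<String> other = ngrams(b, n);
        shared.retainAll(other);
        all.addAll(other);
        return all.isEmpty() ? 0.0 : (double) shared.size() / all.size();
    }

    // Returns the dictionary word with the highest overlap, or null if nothing overlaps.
    static String suggest(String query, List<String> dictionary, int n) {
        String best = null;
        double bestScore = 0.0;
        for (String candidate : dictionary) {
            double score = similarity(query, candidate, n);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("search", "segment", "fetcher");
        System.out.println(suggest("serach", dictionary, 2));
    }
}
```

A production suggester would more likely keep the n-grams in an index and weight candidates by term frequency; the linear scan here only keeps the example short.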

[jira] Closed: (NUTCH-89) parse-rss null pointer exception

2005-09-23 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-89?page=all ] Piotr Kosiorowski closed NUTCH-89: -- Fix Version: 0.8-dev 0.7 Resolution: Fixed Applied in trunk and 0.7 branch. Thanks. parse-rss null pointer exception

[jira] Commented: (NUTCH-95) DeleteDuplicates depends on the order of input segments

2005-09-21 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-95?page=comments#action_12330113 ] Piotr Kosiorowski commented on NUTCH-95: I was renaming segments quite often so I would vote for reading the date from the segment instead of using dir name

0.7.1 release

2005-09-20 Thread Piotr Kosiorowski
Hello, As it looks like everything that was planned was committed to the 0.7 branch I would like to prepare a 0.7.1 release in the next few days. I will change the branch name at the same time to comply with the agreed standard. Any objections? Regards Piotr

Re: svn commit: r290163 - in /lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2: ./ lib/

2005-09-19 Thread Piotr Kosiorowski
Hi Andrzej, Is anything related to clustering commits left? Or should we proceed with 0.7.1 release? Piotr [EMAIL PROTECTED] wrote: Author: ab Date: Mon Sep 19 07:11:07 2005 New Revision: 290163 URL: http://svn.apache.org/viewcvs?rev=290163&view=rev Log: Update of the clustering plugin,

Re: DistributedSearch$Client.updateSegments() blocking other threads

2005-09-16 Thread Piotr Kosiorowski
Hello Andrzej, You can also try http://issues.apache.org/jira/browse/NUTCH-79 - I think it should also help here - it is a bit complicated as it contain additional functionality but if you have any problems I am willing to help. I am going to perform some test of it again and maybe commit it

Re: Problems on Crawling

2005-09-16 Thread Piotr Kosiorowski
bin/nutch updatedb db $s1 command updates WebDB with links you fetched in segment $s1. Regards Piotr Daniele Menozzi wrote: Hi all, I have questions regarding org.apache.nutch.tools.CrawlTool: I have not really understood what the relationship is between depth,segments,fetching.. Take for

Re: Delete an entry in ArrayFile/MapFile

2005-09-06 Thread Piotr Kosiorowski
Hello, You cannot do it. These structures were not designed for it. But you can copy all the data to another ArrayFile skipping entries you want to delete. Regards Piotr On 9/6/05, Ben [EMAIL PROTECTED] wrote: Hi How can I delete an entry in the ArrayFile/MapFile if I know the id/key?
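
The copy-and-skip workaround can be sketched without touching the real file classes. In the Java example below a plain List stands in for the append-only ArrayFile; the class CopySkipping and the method copyWithout are hypothetical names, and the actual ArrayFile reader and writer calls are deliberately left out because they are not shown in this message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a List stands in for an append-only, index-addressed file.
public class CopySkipping {

    // Builds a new sequence containing every entry whose index is not marked for deletion.
    static <T> List<T> copyWithout(List<T> source, Set<Integer> indexesToDelete) {
        List<T> target = new ArrayList<>();
        for (int i = 0; i < source.size(); i++) {
            if (!indexesToDelete.contains(i)) {
                target.add(source.get(i));   // append in original order, as a writer would
            }
        }
        return target;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList("a", "b", "c", "d");
        Set<Integer> doomed = new HashSet<>(Arrays.asList(1, 3));
        System.out.println(copyWithout(entries, doomed));
    }
}
```

One consequence worth keeping in mind: entries after a skipped one move to lower indexes in the copy, so anything that stored the old positions has to be updated as well.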

Re: [Nutch Wiki] Update of Committer's Rules by AndrzejBialecki

2005-08-31 Thread Piotr Kosiorowski
Doug Cutting wrote: Glancing at other Apache projects in subversion, I see that httpd uses branch names like 2.2.x and tag names like 2.2.4. That's a little cryptic. I propose that we use branch names like branch-2.4 and tag names like release-2.4.1. What do folks think? +1 In fact I

Re: merge mapred to trunk

2005-08-31 Thread Piotr Kosiorowski
Doug Cutting wrote: Currently we have three versions of nutch: trunk, 0.7 and mapred. This increases the chances for conflicts. I would thus like to merge the mapred branch into trunk soon. The soonest I could actually start this is next week. Are there any objections? Doug +1 P.

Re: null lang bug? and patch?

2005-08-31 Thread Piotr Kosiorowski
Great - I just thought that it would be better if you look at it - instead of me digging into the code. I wanted to be on the safe side with 0.7.1 release. Regards Piotr Jérôme Charron wrote: I am a bit lost but just a quick check - shouldn't it also be committed in Release-0.7 branch? No,

Re: Analysis plugins and lucene-analyzers

2005-08-27 Thread Piotr Kosiorowski
Hello, I do not object against putting lucene-analyzers-1.9-rc1-dev.jar in nutch core but I would like to give another option. I think it is possible to create a plugin which contains and exports this library and make other analysis plugin depend on it. I am not an expert in it but I think

Re: crawl-urlfilter.txt mechanics

2005-08-22 Thread Piotr Kosiorowski
crawl-urlfilter.txt is bin/nutch crawl specific. If you want to use each step separately - you are in fact doing Whole Web crawling from tutorial - so you need to modify regex-urlfilter.txt instead. Regards Piotr On 8/22/05, Michael Ji [EMAIL PROTECTED] wrote: Hi, When I use intranet

Re: Failing JUnit test

2005-08-21 Thread Piotr Kosiorowski
Hello Jérôme, I found it and committed the fix. It was not using UTF-8 encoding sometimes. But while looking at the code I feel a little bit worried about LanguageIdentifier.identify(InputStream is) - as it reads bytes from file in chunks and converts each chunk to string separately. If multibyte
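
The concern about converting each chunk to a string separately can be reproduced with a small Java sketch. It is not the LanguageIdentifier code: the class ChunkedDecoding and both methods are hypothetical, and the chunk size of 3 is chosen only so that a two-byte UTF-8 character lands on a chunk boundary.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the chunk-boundary problem with multibyte encodings.
public class ChunkedDecoding {

    // Decodes every chunk independently: a character whose bytes are split
    // across two chunks is replaced by U+FFFD on both sides of the boundary.
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder out = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            out.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    // Decodes through a Reader, which carries incomplete byte sequences
    // over to the next read instead of decoding them prematurely.
    static String decodeStreaming(byte[] data, int chunkSize) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[chunkSize];
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3";   // three Greek letters, two bytes each in UTF-8
        byte[] bytes = greek.getBytes(StandardCharsets.UTF_8);
        System.out.println(greek.equals(decodePerChunk(bytes, 3)));
        System.out.println(greek.equals(decodeStreaming(bytes, 3)));
    }
}
```

Wrapping the stream in a Reader (or using a CharsetDecoder directly) is one common way to avoid the problem; whether that fits the identifier's sampling logic is a separate question.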

Re: Failing JUnit test

2005-08-20 Thread Piotr Kosiorowski
It works on my Linux box - with both JDK 1.4 and 1.5. I will try to track it down. Regards Piotr Jérôme Charron wrote: I am using JDK 1.5 on Windows - I can test it on 1.4,1.5 on linux tomorrow - maybe this is the problem. OK. Thanks Jérôme

Failing JUnit test

2005-08-19 Thread Piotr Kosiorowski
Hello, I have updated my local copy today and JUnit tests started to fail. expected:<el> but was:<sv> junit.framework.ComparisonFailure: expected:<el> but was:<sv> at org.apache.nutch.analysis.lang.TestLanguageIdentifier.testIdentify(Unknown Source) at

Release 0.7

2005-08-16 Thread Piotr Kosiorowski
Hello Nutch Committers, Is anyone working on preparing the release? If not I can spend some time on it in an hour or so. Regards Piotr

Release 0.7 problem

2005-08-16 Thread Piotr Kosiorowski
Hello, I have a problem related to 0.7 release. After making a tar I was trying to go through crawl tutorial. - tar xvfz nutch-0.7.tar.gz bin/nutch - is not executable (and nutch-daemon.sh too). I thought it was my mistake - I started to do it on Windows so I moved to linux, but the problem

Re: Release 0.7 problem

2005-08-16 Thread Piotr Kosiorowski
So I will move the release till tomorrow as I am a bit sleepy now. Regards Piotr Doug Cutting wrote: Piotr Kosiorowski wrote: After making a tar I was trying to go through crawl tutorial. - tar xvfz nutch-0.7.tar.gz bin/nutch - is not executable (and nutch-daemon.sh too). It is strange

Re: Release 0.7 problem

2005-08-16 Thread Piotr Kosiorowski
/*" /> <include name="${final.name}/**" /> </tarfileset> <tarfileset dir="${build.dir}" mode="755"> <include name="${final.name}/bin/*" /> </tarfileset> </tar> </target> I will commit it tomorrow and test. Regards Piotr Doug Cutting wrote: Piotr Kosiorowski wrote: After making a tar I

Re: VOTE: clustering plugin update for Rel 0.7

2005-08-15 Thread Piotr Kosiorowski
Hi, Maybe it would be a better idea to go for 0.7 branch and schedule a new 0.7.1 release in short time? It is difficult for me to judge if the patch I had not seen is good for release. So I would say 0 from me (if you think it is good enough I will not object). Regards, Piotr Andrzej

Re: FW: Fetcher, ParseText, ParseData - need to modify

2005-08-15 Thread Piotr Kosiorowski
Hello, To change nutch standard html parsing the best place to start would be probably parse-html plugin. Regards Piotr Fuad Efendi wrote: 1. This is part of ParseText: Any Accessories Backup Devices Media Barebone Systems Camcorder Accessories Camcorders Cases External Enclosures CD / DVD

Re: page ranking weights

2005-08-15 Thread Piotr Kosiorowski
Boost for the page may be calculated in a few different ways (and in a few different places in nutch): 1) PageRank based score - calculated by nutch analyze command based on WebDB - during fetchlist generation scores from WebDB are stored in segment - indexing phase uses score

Nutch versions - Was: [Nutch-cvs] svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-10 Thread Piotr Kosiorowski
Hello, I think a lot of people will wait before moving to mapreduce implementation for some time so we will have a 0.7 version to support. I was a heavy CVS branch user in my previous job taking care about common library so I fully agree that such branch would be needed for bug fixing. I would

Re: clucene-java bindings

2005-08-09 Thread Piotr Kosiorowski
Hello Ben, I personally would be interested mainly in search part of it if speed increase would be significant. I am running my indices on linux/ AMD Opterons - I hope CLucene will work well in this environment. I assume CLucene is compatible with Java lucene index format as we do have some

Re: svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-09 Thread Piotr Kosiorowski
Hello Doug, I read your email ten times and still I am not sure what the problem is. Regards, Piotr Doug Cutting wrote: [EMAIL PROTECTED] wrote: - <value>http://www.nutch.org/docs/en/bot.html</value> + <value>http://lucene.apache.org/nutch/bot.html</value> I think this should now be:

Re: svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-09 Thread Piotr Kosiorowski
No problem at all. I have a lot to learn yet and it is nice people like you check my commits for stupid mistakes. Four eyes are always better than two :). Regards, Piotr Doug Cutting wrote: Piotr Kosiorowski wrote: I read your email ten times and still I am not sure what the problem

Re: [Nutch-cvs] svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-09 Thread Piotr Kosiorowski
Will do it tomorrow - I wanted to put down a kind of release checklist in Wiki - starting with where to change numbers. But I would like to cover also a release howto - but in fact I am not sure how to make a release yet. But will try to gather this information. Regards Piotr Andrzej Bialecki

Tutorial

2005-08-08 Thread Piotr Kosiorowski
Hello, Some time ago someone mentioned on the list a problem with nutch tutorial (I cannot find this email now). I have checked it today and he/she was right. If you follow the nutch Intranet Crawling tutorial you will end up with not very interesting index. This is because it recommends users to

NUTCH 79 Fault tolerant searching.

2005-08-08 Thread Piotr Kosiorowski
Hello, I just created an issue in JIRA http://issues.apache.org/jira/browse/NUTCH-79 containing the code for fault tolerant searching. I think it is too late to include it in 0.7 release but I would wait for comments and test it in the meantime. I would like to commit it when release would be

Re: JIRA access

2005-08-08 Thread Piotr Kosiorowski
Thanks. It works. Piotr Doug Cutting wrote: Piotr Kosiorowski wrote: Looking around in JIRA I found out I cannot resolve an issue. I am not sure how it works but I suspect I lack some rights to do so. Am I right? I have added you to the nutch-developers Jira group. Now you should
