Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Piotr Kosiorowski
pmd-ext contains the PMD (http://pmd.sourceforge.net/) libraries. I
committed them a long time ago in an attempt to bring some static
analysis tools to the Nutch sources. There was a short discussion around
it and we all thought it was worth doing, but it never gained enough
momentum. There is a pmd target in the build.xml file that uses them -
they are not needed at runtime nor for standard builds.
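For readers who have not seen it, a target of this general shape is what is being described; the taskdef class name, ruleset, and paths below are illustrative assumptions, not a copy of the actual Nutch build.xml:

```xml
<!-- Sketch of a standalone PMD target. The PMD jars are assumed to live
     in pmd-ext/ and are kept off the normal build classpath. -->
<target name="pmd" description="Run PMD static analysis (optional)">
  <taskdef name="pmd" classname="net.sourceforge.pmd.ant.PMDTask">
    <classpath>
      <fileset dir="pmd-ext" includes="*.jar"/>
    </classpath>
  </taskdef>
  <pmd rulesetfiles="rulesets/basic.xml">
    <formatter type="html" toFile="build/pmd-report.html"/>
    <fileset dir="src/java" includes="**/*.java"/>
  </pmd>
</target>
```

Because the target defines its own classpath from pmd-ext, the jars stay out of the runtime and standard-build classpaths, which matches the point above.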
As Nutch is built using Hudson now, I think it would be worthwhile to
integrate pmd (checkstyle/findbugs/cobertura might also be
interesting) - Hudson has very nice plugins for such tools. I use them
in my daily job and have found them valuable.
But as I am not an active committer now (I only try to follow the
mailing lists), I do not think it is my call. But if everyone is
interested, I can try to look at the integration (though it will move
forward slowly - my youngest kid was born just 2 months ago and takes
a lot of attention).
Piotr

On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) j...@apache.org wrote:
 Update external jars to latest versions
 ---

 Key: NUTCH-680
 URL: https://issues.apache.org/jira/browse/NUTCH-680
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0


 This issue will be used to update external libraries nutch uses.

 These are the libraries that are outdated (upon a quick glance):

 nekohtml (1.9.9)
 lucene-highlighter (2.4.0)
 jdom (1.1)
 carrot2 - as mentioned in another issue
 jets3t - above
 icu4j (4.0.1)
 jakarta-oro (2.0.8)

 We should probably update tika to whatever the latest is as well before 1.0.


 Please add any I missed in the comments.

 Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen 
 there.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Piotr Kosiorowski
From what I know (from the way we use Hudson), Hudson has plugins
for presenting tool results only; the tools themselves need to be
executed during the build, and the libraries need to be included so
they are available to ant.
Piotr

On Tue, Jan 20, 2009 at 9:40 PM, Doğacan Güney doga...@gmail.com wrote:
 On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic
 ogjunk-nu...@yahoo.com wrote:
 That I don't know...

 I don't see the jars here: 
 http://svn.apache.org/viewvc/hadoop/core/trunk/lib/

 But who knows, maybe maven/ivy fetch them on demand.  I don't know.


 Hmm, does 0.19 use ivy (0.19 also doesn't have pmd)?

 http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: Doğacan Güney doga...@gmail.com
 To: nutch-dev@lucene.apache.org
 Sent: Tuesday, January 20, 2009 1:13:20 PM
 Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest 
 versions

 On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic
 wrote:
  Lucene doesn't use anything.
  Hadoop uses pmd integrated into Hudson.
 

 Does this mean we do not need the pmd jars in nutch (are they provided by 
 hudson)?

  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: Doğacan Güney
  To: nutch-dev@lucene.apache.org
  Sent: Tuesday, January 20, 2009 10:49:44 AM
  Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest
 versions
 
  2009/1/20 Piotr Kosiorowski :
   pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I have
   committed them long time ago in an attempt to bring some static
   analysis toools to nutch sources. There was a short discussion around
   it and we all thought t was worth doing but it never gained enough
   momentum.   There is a pmd target in build.xml file that uses it -
   they are not needed in runtime nor for standard builds.
   As nutch is built using hudson now I think it would be worth to
   integrate pmd (and checkstyle/findbugs/cobertura might be also
   interesting) - hudson has very nice plugins for such tools. I am using
   it in my daily job and I found it valuable.
 
  Thanks for the explanation. I am definitely +1 on having some sort of
  static analysis tools for nutch.
 
  Does anyone know what hadoop/hbase/lucene use for this? or do
  they use something at all?
 
   But as I am not active committer now (I only try to follow mailing
   lists) I do not think it is my call.  But if everyone will be
   interested I can try to look at integration (but it will move forward
   slowly - my youngest kid was born just 2 months ago and it takes a lot
   of attention).
 
  Congratulations!
 
   Piotr
  
   On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) wrote:
   Update external jars to latest versions
   ---
  
   Key: NUTCH-680
   URL: https://issues.apache.org/jira/browse/NUTCH-680
   Project: Nutch
Issue Type: Improvement
  Reporter: Doğacan Güney
  Assignee: Doğacan Güney
  Priority: Minor
   Fix For: 1.0.0
  
  
   This issue will be used to update external libraries nutch uses.
  
   These are the libraries that are outdated (upon a quick glance):
  
   nekohtml (1.9.9)
   lucene-highlighter (2.4.0)
   jdom (1.1)
   carrot2 - as mentioned in another issue
   jets3t - above
   icu4j (4.0.1)
   jakarta-oro (2.0.8)
  
   We should probably update tika to whatever the latest is as well 
   before
 1.0.
  
  
   Please add ones  I missed in comments.
  
   Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen
  there.
  
   --
   This message is automatically generated by JIRA.
   -
   You can reply to this email to add a comment to the issue online.
  
  
  
 
 
 
  --
  Doğacan Güney
 
 



 --
 Doğacan Güney





 --
 Doğacan Güney



Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Piotr Kosiorowski
I have configured Hudson for 10 or more projects and have always used
the pmd plugin to display the pmd results only - the actual pmd task
that generates the report was run from the ant script. Maybe there is a
way to run pmd reports directly in Hudson (not through project build
scripts), but I have never come across it.
Piotr

On Tue, Jan 20, 2009 at 10:39 PM, Otis Gospodnetic
ogjunk-nu...@yahoo.com wrote:
 They've had pmd integrated with Hudson for many months now, I believe.  I've 
 seen patches in JIRA that were the result of fixes for problems reported by 
 pmd.  Or maybe they run pmd by hand?

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: Doğacan Güney doga...@gmail.com
 To: nutch-dev@lucene.apache.org
 Sent: Tuesday, January 20, 2009 3:40:20 PM
 Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest 
 versions

 On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic
 wrote:
  That I don't know...
 
  I don't see the jars here: 
  http://svn.apache.org/viewvc/hadoop/core/trunk/lib/
 
  But who knows, maybe maven/ivy fetch them on demand.  I don't know.
 

 Hmm, does 0.19 use ivy(0.19 also doesn't have pmd)?

 http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/

  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: Doğacan Güney
  To: nutch-dev@lucene.apache.org
  Sent: Tuesday, January 20, 2009 1:13:20 PM
  Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest
 versions
 
  On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic
  wrote:
   Lucene doesn't use anything.
   Hadoop uses pmd integrate in Hudson.
  
 
  Does this mean we do not need pmd jars in nutch ( are they provided by
 hudson)?
 
   Otis
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
   - Original Message 
   From: Doğacan Güney
   To: nutch-dev@lucene.apache.org
   Sent: Tuesday, January 20, 2009 10:49:44 AM
   Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest
  versions
  
   2009/1/20 Piotr Kosiorowski :
pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I have
committed them long time ago in an attempt to bring some static
analysis toools to nutch sources. There was a short discussion around
it and we all thought t was worth doing but it never gained enough
momentum.   There is a pmd target in build.xml file that uses it -
they are not needed in runtime nor for standard builds.
As nutch is built using hudson now I think it would be worth to
integrate pmd (and checkstyle/findbugs/cobertura might be also
interesting) - hudson has very nice plugins for such tools. I am 
using
it in my daily job and I found it valuable.
  
   Thanks for the explanation. I am definitely +1 on having some sort of
   static analysis tools for nutch.
  
   Does anyone know what hadoop/hbase/lucene use for this? or do
   they use something at all?
  
But as I am not active committer now (I only try to follow mailing
lists) I do not think it is my call.  But if everyone will be
interested I can try to look at integration (but it will move forward
slowly - my youngest kid was born just 2 months ago and it takes a 
lot
of attention).
  
   Congratulations!
  
Piotr
   
On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) wrote:
Update external jars to latest versions
---
   
Key: NUTCH-680
URL: https://issues.apache.org/jira/browse/NUTCH-680
Project: Nutch
 Issue Type: Improvement
   Reporter: Doğacan Güney
   Assignee: Doğacan Güney
   Priority: Minor
Fix For: 1.0.0
   
   
This issue will be used to update external libraries nutch uses.
   
These are the libraries that are outdated (upon a quick glance):
   
nekohtml (1.9.9)
lucene-highlighter (2.4.0)
jdom (1.1)
carrot2 - as mentioned in another issue
jets3t - above
icu4j (4.0.1)
jakarta-oro (2.0.8)
   
We should probably update tika to whatever the latest is as well 
before
  1.0.
   
   
Please add ones  I missed in comments.
   
Also what exactly is pmd-ext? There is an extra jakarta-oro and 
jaxen
   there.
   
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
   
   
   
  
  
  
   --
   Doğacan Güney
  
  
 
 
 
  --
  Doğacan Güney
 
 



 --
 Doğacan Güney




Re: FW: Nutch release process help

2007-03-06 Thread Piotr Kosiorowski

Chris,
I have documented the process in the wiki. Doug has sent the links
already. If you have any questions, I would be willing to help. I can
even do it myself if you find it difficult - I simply do not want to be
the bottleneck, as I am behind schedule at work and in my private life.
I still hope I will be able to become more active in the Nutch
community in the future.
Regards
Piotr

On 3/6/07, Doug Cutting [EMAIL PROTECTED] wrote:

Chris Mattmann wrote:
 It's too bad that
 this has turned out to be an issue that I've handled incorrectly, and for
 that, I apologize.

Sorry if I blew this out of proportion.  We all help each other run this
project.  I don't think any grave error was made.  I just saw an
opportunity to remind folks to try to keep project discussions public,
and did not mean to rebuke you.

I am thrilled that you want to take on the responsibility of making a
release.  I very much do not want to damp your enthusiasm for that.

As you probably know, the release documentation is at:

http://wiki.apache.org/nutch/Release_HOWTO

This may need to be updated.  You might also look at the release
documentation for other projects, to get ideas.

http://wiki.apache.org/lucene-hadoop/HowToRelease
http://wiki.apache.org/solr/HowToRelease
http://wiki.apache.org/jakarta-lucene/ReleaseTodo

Cheers,

Doug



Re: Reviving Nutch 0.7

2007-01-22 Thread Piotr Kosiorowski

Otis,
Some time ago people on the list said that they were willing to at
least maintain the Nutch 0.7 branch. As a committer (not very active
recently) I volunteered to commit patches when they appear - I do not
have enough time at the moment to do active coding. I have created a
0.7.3 release in JIRA so we can start looking at it. So - we are ready
and willing to move Nutch 0.7 forward, but it looks like there is no
interest at the moment.
Regards
Piotr

On 1/22/07, Otis Gospodnetic [EMAIL PROTECTED] wrote:

Hi,

I've been meaning to write this message for a while, and Andrzej's 
StrategicGoals made me compose it, finally.

Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, 
it will be even more valuable than it is today.  However, I think there is 
still a need for something much simpler, something like what Nutch 0.7 used to 
be.  Fairly regular nutch-user inquiries confirm this.  Nutch has too few 
developers to maintain and further develop both of these concepts, and the main 
Nutch developers need the more powerful version - 0.8 and beyond.  So, what is 
going to happen to 0.7?  Maintenance mode?

I feel that there is enough need for 0.7-style Nutch that it might be worth at 
least considering and discussing the possibility of somehow branching that 
version into a parallel project that's not just in a maintenance mode, but has 
its own group of developers (not me, no time :( ) that pushes it forward.

Thoughts?

Otis






[jira] Closed: (NUTCH-429) Secured Searches

2007-01-11 Thread Piotr Kosiorowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Kosiorowski closed NUTCH-429.
---

Resolution: Invalid

Please use the nutch-user mailing list for such questions and JIRA for reporting 
issues only. I also suggest being more specific on the mailing list about what 
you mean by Secured Searches.

 Secured Searches
 

 Key: NUTCH-429
 URL: https://issues.apache.org/jira/browse/NUTCH-429
 Project: Nutch
  Issue Type: Bug
Reporter: Piyush

 Does NUTCH support secured searches? If yes, could you please point me to the 
 appropriate documentation?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: 0.7.3 version

2006-11-23 Thread Piotr Kosiorowski
As no objections were raised I created a 0.7.3 version in JIRA so we can 
start assigning current JIRA issues to it.

Regards
Piotr
Piotr Kosiorowski wrote:

Hello committers,
Based on a recent discussion on the nutch-user list (Strategic Direction
of Nutch) I would like to prepare a 0.7.3 release. The idea is to allow
people who still use 0.7.2 to get rid of the most important bugs and to
add the small features they need, as the claim is that 0.8.1
is not good for small crawls at the moment. It will also allow us to work
on the 0.8 branch to make it more small-installation friendly.
I would like to approach it this way: if no one objects, I will
create a 0.7.3 release in JIRA and ask people to assign issues with
patches to it. I do not have a lot of time personally, so I do not plan
to do any development myself - just taking care of high-quality
patches and committing them. After some time, when we have gathered some
amount of bugfixes/issues, I will prepare the 0.7.3 release. Any
objections or comments?
Regards
Piotr





0.7.3 version

2006-11-16 Thread Piotr Kosiorowski

Hello committers,
Based on a recent discussion on the nutch-user list (Strategic Direction
of Nutch) I would like to prepare a 0.7.3 release. The idea is to allow
people who still use 0.7.2 to get rid of the most important bugs and to
add the small features they need, as the claim is that 0.8.1
is not good for small crawls at the moment. It will also allow us to work
on the 0.8 branch to make it more small-installation friendly.
I would like to approach it this way: if no one objects, I will
create a 0.7.3 release in JIRA and ask people to assign issues with
patches to it. I do not have a lot of time personally, so I do not plan
to do any development myself - just taking care of high-quality
patches and committing them. After some time, when we have gathered some
amount of bugfixes/issues, I will prepare the 0.7.3 release. Any
objections or comments?
Regards
Piotr


Re: How to start working with MapReduce?

2006-11-11 Thread Piotr Kosiorowski

Please read the tutorial on the Nutch site. I suggest posting such issues
to nutch-user - you will have a much higher chance of getting a useful
response there.
regards
Piotr

On 11/9/06, kauu [EMAIL PROTECTED] wrote:

Or is it the same with version 0.8.x?
Any idea is appreciated.

On 11/9/06, kauu [EMAIL PROTECTED] wrote:


 Does anyone know the details of the process in the topic how to start working
 with MapReduce?

 I've read something in the FAQ, but I don't understand it very well; my
 version is 0.7.2, not 0.8.x.
 --
 www.babatu.com




--
www.babatu.com




Re: email to jira comments (WAS Re: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements)

2006-10-16 Thread Piotr Kosiorowski

+1

On 10/16/06, Doug Cutting [EMAIL PROTECTED] wrote:

Sami Siren wrote:
 looks like somebody just enabled email-to-jira-comments-feature. I was
 just wondering would it be good to use this feature more widely.

I think it would be good.  That way mailing list discussion would be
logged to the bug as well.

 This could be achieved by removing the replyto header from messages
 coming from jira so that replies get sent to [EMAIL PROTECTED] (i am
 assuming that is possible). So whenever somebody just hits reply
 from email client and writes the comment it would get automatically
 attached to correct issue as a comment.

I sent a message to [EMAIL PROTECTED] this morning asking about this.
If it's possible, and no one objects, I will request it for the Nutch
mailing lists.

Doug



Re: Nutch requires JDK 1.5 now?

2006-10-03 Thread Piotr Kosiorowski
I had a look at it and it seems I do not have enough permissions to 
change it. So probably this one goes to Doug...

P.
Chris Mattmann wrote:

Hey Guys,

 Speaking of which, I noticed that Sami's issue below is a Task in JIRA,
which reminded me of a task that I input a long time ago that would be nice
to fix real quick (for those with JIRA permissions to do so):

http://issues.apache.org/jira/browse/NUTCH-304

We should really change the email address for JIRA to not use the Apache
incubator one anymore, and to use to Lucene one.

Sound good? If so, could someone with permissions please take care of it?
:-)

Cheers,
  Chris



On 10/3/06 9:04 AM, Sami Siren [EMAIL PROTECTED] wrote:

  

Andrzej Bialecki wrote:


Chris Mattmann wrote:
  

Hi Folks,

 I noticed that Nutch now requires JDK 5 in order to compile, due to
recent
changes to the PluginRepository and some other classes. I think that
this is
a good move, however, I wasn't sure that I had seen any official
announcement that Nutch now requires 1.5...
  


This is a proactive change - as soon as we upgrade to Hadoop 0.6.x we
will lose 1.4 compatibility anyway, so we may as well prepare in advance.

Also, "now" refers to the unreleased 0.9; we will keep branch 0.8.x
compatible with 1.4.

  

The switch to 1.5 format was also logged on jira issue
http://issues.apache.org/jira/browse/NUTCH-360
--
 Sami Siren



__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



  




[jira] Assigned: (NUTCH-374) when http.content.limit be set to -1 and Response.CONTENT_ENCODING is gzip or x-gzip , it can not fetch any thing.

2006-09-30 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-374?page=all ]

Piotr Kosiorowski reassigned NUTCH-374:
---

Assignee: Piotr Kosiorowski

 when http.content.limit be set to -1 and  Response.CONTENT_ENCODING  is gzip 
 or x-gzip  , it can not fetch any thing.
 -

 Key: NUTCH-374
 URL: http://issues.apache.org/jira/browse/NUTCH-374
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8, 0.8.1
Reporter: King Kong
 Assigned To: Piotr Kosiorowski

 I set http.content.limit to -1 so as not to truncate content being fetched.
 However, if the response used gzip or x-gzip, it could not be uncompressed.
 I found the problem is in HttpBase.processGzipEncoded (plugin lib-http):
   ...
   byte[] content = GZIPUtils.unzipBestEffort(compressed, getMaxContent());
   ...
 Because it does not treat -1 as meaning no limit, the code must be modified to solve it:
 byte[] content;
 if (getMaxContent() >= 0) {
   content = GZIPUtils.unzipBestEffort(compressed, getMaxContent());
 } else {
   content = GZIPUtils.unzipBestEffort(compressed);
 }
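The proposed guard can be exercised with a small self-contained sketch. GzipLimitSketch and its unzipBestEffort stand-in below are illustrative and only mimic the Nutch GZIPUtils/HttpBase code; they are not the actual implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for the code discussed in NUTCH-374.
public class GzipLimitSketch {

  // Mimics GZIPUtils.unzipBestEffort(byte[], int): decompress, stopping
  // once maxBytes (if non-negative) have been collected.
  static byte[] unzipBestEffort(byte[] compressed, int maxBytes) throws IOException {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
        if (maxBytes >= 0 && out.size() >= maxBytes) break; // limit reached
      }
      return out.toByteArray();
    }
  }

  // The reporter's proposed guard: only apply the limit when it is
  // non-negative; -1 means "no limit".
  static byte[] processGzip(byte[] compressed, int maxContent) throws IOException {
    if (maxContent >= 0) {
      return unzipBestEffort(compressed, maxContent);
    }
    return unzipBestEffort(compressed, -1);
  }

  // Helper to build a gzip-encoded body for the demonstration.
  static byte[] gzip(byte[] raw) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(raw);
    }
    return bos.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] payload = "hello nutch".getBytes(StandardCharsets.UTF_8);
    byte[] full = processGzip(gzip(payload), -1); // -1: fetch everything
    System.out.println(new String(full, StandardCharsets.UTF_8)); // prints "hello nutch"
  }
}
```

With the guard in place, a limit of -1 round-trips the full content instead of failing in the size-limited decompression path.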

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: 0.8 release

2006-07-27 Thread Piotr Kosiorowski

No objections from me. We have waited long enough, and we can fix things in a
maintenance release in a few weeks.
Regards
Piotr

On 7/26/06, Sami Siren [EMAIL PROTECTED] wrote:

Andrzej Bialecki wrote:

 Sami Siren wrote:

 There is a package available for testing in
 http://people.apache.org/~siren/nutch-0.8/

 please give it some testing and post in your opinion - is it good
 enough to be a public release?

 I have some doubts because of NUTCH-266, but so far only 3 people
 have reported this to be problem
 (me included)


 This is I guess related to a very specific environment - multiple
 nodes running on cygwin. Usually people run multiple nodes on some
 flavor of Unix.

 I don't have any means to test it for this issue ...

The bug also appears in a single-node configuration, but I think that it is not
that common (guessing from the number of people who have reported it).
However, that is now fixed in hadoop trunk. Should we use a patched
version of hadoop-0.4.0 in Nutch or wait for 0.5 (which at least still
seems to be 1.4 compatible)?

The 0.8 package has now hit the mirrors, has anybody any objections
about announcing it? Stefan allready commented about two issues he
wished to be fixed in 0.8 but to me it looks that they can both be
addressed with configuration changes and documentation in the first
place and there's nothing stopping us from releasing 0.8.1 in very short
time addressing the issues discovered in 0.8.

--
 Sami Siren



Re: log when blocked by robots.txt

2006-07-21 Thread Piotr Kosiorowski

I think I would log in both situations, but with a different message.
+1
P.

On 7/21/06, Stefan Groschupf [EMAIL PROTECTED] wrote:

Hi Developers,
another thing in the discussion about being more polite.
I suggest that we log a message in case a requested URL is blocked
by a robots.txt.
Optimally, we would only log this message when the currently used
agent name is specifically blocked, and not when there is a general
blocking of all agents.

Should I create a patch?

Stefan
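A minimal sketch of the two-message logging suggested above; the class and method names are invented for illustration, and the real Nutch robot-rules handling is not shown:

```java
// Illustrative only: produce a different log message depending on whether
// robots.txt blocks all agents or just our configured agent name.
public class RobotsLogSketch {

  static String blockMessage(String url, boolean allAgentsBlocked, String agentName) {
    if (allAgentsBlocked) {
      // General "User-agent: *" style blocking: routine, less interesting.
      return "Blocked by robots.txt (all agents): " + url;
    }
    // Only our agent is singled out: worth a more prominent message.
    return "Blocked by robots.txt (agent '" + agentName + "'): " + url;
  }

  public static void main(String[] args) {
    System.out.println(blockMessage("http://example.com/private", false, "NutchCrawler"));
    System.out.println(blockMessage("http://example.com/private", true, "NutchCrawler"));
  }
}
```

The two branches correspond to the two cases Stefan distinguishes; a real patch would plug this into the fetcher's existing logger rather than returning strings.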




Re: Nutch web site

2006-07-04 Thread Piotr Kosiorowski
It was maintained in the branch because we agreed the public website should 
contain docs for the released version. I have nothing against moving it to trunk
and maintaining it there. Sorry for the late response, but I am just back from 
vacation and going through all these emails.

Regards,
Piotr

Sami Siren wrote:

Piotr,

is there a reason why this (among other) documentation (for all relevant 
versions)

could not be maintained in trunk?

--
Sami Siren

Piotr Kosiorowski wrote:



Andrzej Bialecki wrote:

+1, yes it would be really confusing. Since there are more and more 
people trying 0.8, could we perhaps include a short  note that 0.8 
and later is NOT compatible with this tutorial, and a reference to 
the tutorial for 0.8 (or the trunk/ branch in general)?




I can add both tutorials to Nutch web site named Tutorial for 0.7 
version and Tutorial for 0.8 version. It should make things clear.

Anyone against it?
Piotr








Re: 0.8 release

2006-07-04 Thread Piotr Kosiorowski

+1.
P.
Andrzej Bialecki wrote:

Sami Siren wrote:
How would folks feel about releasing 0.8 now? There have been quite a 
lot of improvements/new features
since the 0.7 series, and I strongly feel that we should push the first 0.8 
series release (alpha/beta)
out the door now. It would IMO lower the barrier for first-timers to try 
the 0.8 series, and that would

give us more feedback about the overall quality.


Definitely +1. Let's do some testing, however, after the upgrade to 
hadoop 0.3.2 - hadoop had many, many changes, so we just need to make 
sure it's stable when used with Nutch ...


We should also check JIRA and apply any trivial fixes before the release.



If there is a consensus about this I can volunteer to be the RM.


That would be great, thanks!





Re: 0.8 release?

2006-04-13 Thread Piotr Kosiorowski
I had problems with DOS/Unix newlines and some (still unsolved) 
environment settings on my Linux box - I will try to solve them. Anyway, I 
was able to apply the patch on Cygwin. Could you please have a look at 
it so we can be sure I have not applied it wrongly (I think it is 
correct, but I did it so many times that I want to cross-check)?

Regards
Piotr

Dawid Weiss wrote:


What kind of problems? If you need something, let me know.
D.

Piotr Kosiorowski wrote:

I got some problems while applying Dawid's clustering patch (my Linux
environment looks not to be set up correctly) - but I switched to Cygwin
and it looks OK. I will try to commit it today/tomorrow.
Regards
Piotr

On 4/12/06, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi Guys,

 Any progress on the 0.8 release? Was there any resolution about 
which JIRA

issues to complete before the 0.8 release? We had a bit of conversation
there and some ideas, but no definitive answer...

Thanks for your help, and sorry to pester ;)

Cheers,
 Chris

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.









Re: 0.8 release?

2006-04-12 Thread Piotr Kosiorowski
I got some problems while applying Dawid's clustering patch (my Linux
environment looks not to be set up correctly) - but I switched to Cygwin
and it looks OK. I will try to commit it today/tomorrow.
Regards
Piotr

On 4/12/06, Chris Mattmann [EMAIL PROTECTED] wrote:
 Hi Guys,

  Any progress on the 0.8 release? Was there any resolution about which JIRA
 issues to complete before the 0.8 release? We had a bit of conversation
 there and some ideas, but no definitive answer...

 Thanks for your help, and sorry to pester ;)

 Cheers,
  Chris

 __
 Chris A. Mattmann
 [EMAIL PROTECTED]
 Staff Member
 Modeling and Data Management Systems Section (387)
 Data Management Systems and Technologies Group

 _
 Jet Propulsion LaboratoryPasadena, CA
 Office: 171-266BMailstop:  171-246
 ___

 Disclaimer:  The opinions presented within are my own and do not reflect
 those of either NASA, JPL, or the California Institute of Technology.





Re: mapred branch

2006-04-10 Thread Piotr Kosiorowski

Anton Potehin wrote:

Where is the mapred branch of nutch now located?



it is developed in trunk now.
P.


Re: PMD integration

2006-04-09 Thread Piotr Kosiorowski

Jérôme Charron wrote:

2) We do have oro 2.0.7 in the dependencies (I think urlfilter and similar
things). PMD requires oro 2.0.8. Do you think we can upgrade (as far
as I know 2.0.7 and 2.0.8 should be compatible)? We would then have only
one oro jar.


Piotr, please keep oro-2.0.8 in pmd-ext
I think we can plan to replace oro regex by java ones (as in RegexUrlFilter)
in the whole nutch code (and then remove oro-2.0.7 from lib):
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
src/java/org/apache/nutch/parse/OutlinkExtractor.java
src/java/org/apache/nutch/net/RegexUrlNormalizer.java
src/java/org/apache/nutch/net/BasicUrlNormalizer.java



I do not agree here - we are going to make a new release next week, and 
releasing with two versions of oro does not look nice. oro is quite a 
stable product and the changes are in fact minimal:

http://svn.apache.org/repos/asf/jakarta/oro/trunk/CHANGES
I would like to upgrade to 2.0.8 (as no interface changes were made, it 
would be trivial) before the 0.8 release.

What do others think?
Regards
Piotr
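For context on the earlier suggestion to replace the oro regexes with java.util.regex ones: a sketch of the mapping, with an invented rule string. jakarta-oro's Perl5Matcher.contains corresponds roughly to java.util.regex Matcher.find (both do partial, contains-style matching); nothing below is actual Nutch code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of a RegexURLFilter-style rule check rewritten
// from jakarta-oro to java.util.regex.
public class RegexSwapSketch {

  // oro:  new Perl5Matcher().contains(url, compiledPattern)
  // java: Pattern.compile(rule).matcher(url).find()
  static boolean matchesRule(String rule, String url) {
    Pattern p = Pattern.compile(rule);
    Matcher m = p.matcher(url);
    return m.find(); // partial match anywhere in the URL
  }

  public static void main(String[] args) {
    // Made-up rule excluding image suffixes.
    String rule = "\\.(gif|jpg|png)$";
    System.out.println(matchesRule(rule, "http://example.com/logo.png"));   // prints "true"
    System.out.println(matchesRule(rule, "http://example.com/index.html")); // prints "false"
  }
}
```

The main caveat in such a migration is the handful of Perl5 syntax corners that differ from java.util.regex, so each rule file would need to be re-tested.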



Re: PMD integration

2006-04-07 Thread Piotr Kosiorowski
I do agree with Jérôme - plugins should be checked too.
I would like to integrate PMD for core and plugins over the weekend, based on
Dawid's work - I will make it a totally separate target (so tests do not
depend on it).
The goal is to allow other developers to play with pmd easily, but at the
same time I do not want the build to be affected.
I would also like to look at the possibility of generating cross-referenced HTML
code from the Nutch sources, as it looks like pmd can use it and violation
reports would be much easier to read.
P.


On 4/7/06, Jérôme Charron [EMAIL PROTECTED] wrote:

   that right now it is checking only main code (without plugins?).
  Yes, that's correct -- I forgot to mention that. PMD target is hooked up
  with tests and stops the build if something fails. I thought the core
  code should be this strict; for plugins we can have more relaxed rules

 -1
 Since plugins provides a lot of Nutch functionalities (without any plugin,
 Nutch provides no service), I think that plugins code should be as strict
 as
 the core code.

 Thanks

 Jérôme

 --
 http://motrech.free.fr/
 http://www.frutch.org/




Re: PMD integration

2006-04-07 Thread Piotr Kosiorowski


  I will make it totally separate target (so test do not
  depend on it).

 That was actually Doug's idea (and I agree with it) to stop the build
 file if PMD complains about something. It's similar to testing -- if
 your tests fail, the entire build file fails.

I totally agree with it - but I want to switch it on for others to
play with first, and once we agree on the rules we want to use, make
it obligatory.
Piotr


Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Piotr Kosiorowski

Doug Cutting wrote:


Piotr, would you like to make this release, or should I?

I would prefer that you do it this time - I am not sure I can find 
some time next week. I would like to do some things before the release, though:

1) Commit the clustering patch from Dawid (I took it over from Andrzej).
2) Commit the pmd stuff as optional for this release. We will make it 
required later.
3) Review the tutorial - I saw some posts on the user list claiming 
errors, so I would like to check it before the release.
4) It would be good to go through the JIRA issues beforehand - but I am not sure 
I will manage it.

Any comments?

Regards
Piotr


Re: Patch to remove Nutch formating from logs

2006-04-07 Thread Piotr Kosiorowski

Hello  Christopher,
I personally do not like combining logging with severe error handling, 
but it has been one of the features of Nutch for some time, and I do not think
it causes infinite loops in normal installations. Changing it while we are 
preparing to release a new version is not a good idea, in my opinion.

But I will be happy if we change the way it is handled in future.
So for now -1.
Piotr


Christopher Burkey wrote:
Did anyone get this email? Can a commiter acknowledge this has been 
received?


We have been having problems with infinite loops caused by Nutch. My 
theory is that the problem is related to using the log API to track 
severe errors. This patch is only a few lines of code and should be 
easy to apply. Please let me know if it has been received and what the 
feedback is.




Christopher Burkey wrote:

Hello,

   Here is a patch to change org.apache.nutch.util.LogFormatter to not 
insert itself as the default handler for the system.


   I have been using Nutch for a year and have been waiting for a 
version that I can embed into OpenEdit. The problem has been that 
Nutch inserts itself as the formatter for the Java log system and that 
interferes with OpenEdit logging.





diff -Naur ../java/org/apache/nutch/util/LogFormatter.java 
java/org/apache/nutch/util/LogFormatter.java
--- ../java/org/apache/nutch/util/LogFormatter.java 2006-03-31 13:40:50.0 -0500
+++ java/org/apache/nutch/util/LogFormatter.java 2006-04-05 16:27:59.0 -0400

@@ -16,13 +16,23 @@
 
 package org.apache.nutch.util;
 
-import java.util.logging.*;

-import java.io.*;
-import java.text.*;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.PrintStream;
+import java.io.PrintWriter;
+import java.io.StringWriter;
+import java.text.FieldPosition;
+import java.text.SimpleDateFormat;
 import java.util.Date;
-
-/** Prints just the date and the log message. */
-
+import java.util.logging.Formatter;
+import java.util.logging.Level;
+import java.util.logging.LogRecord;
+import java.util.logging.Logger;
+
+/** Prints just the date and the log message.
+ *  This was also used to stop processing as nutch crawls a web site
+ *  [EMAIL PROTECTED] changed this code to use a LogWrapper class
+ *  to catch severe errors
+ */
 public class LogFormatter extends Formatter {
  private static final String FORMAT = "yyMMdd HHmmss";
  private static final String NEWLINE = 
System.getProperty("line.separator");

@@ -35,20 +45,27 @@
   private static boolean showTime = true;
   private static boolean showThreadIDs = false;
 
+  protected static LogFormatter sharedformatter =  new LogFormatter();
+  protected static SevereLogHandler sharedhandler =  new 
SevereLogHandler(sharedformatter);

+
+  /*
   // install when this class is loaded
   static {
 Handler[] handlers = LogFormatter.getLogger().getHandlers();
 for (int i = 0; i < handlers.length; i++) {
-  handlers[i].setFormatter(new LogFormatter());
+  handlers[i].setFormatter(sharedformatter);
   handlers[i].setLevel(Level.FINEST);
 }
   }
-
+  */
   /** Gets a logger and, as a side effect, installs this as the default
* formatter. */
   public static Logger getLogger(String name) {
 // just referencing this class installs it
-return Logger.getLogger(name);
+Logger logr = Logger.getLogger(name);
+logr.addHandler(sharedhandler);
+   
+return logr;

   }
  /** When true, time is logged with each entry. */
@@ -60,7 +77,10 @@
   public static void setShowThreadIDs(boolean showThreadIDs) {
 LogFormatter.showThreadIDs = showThreadIDs;
   }
-
+  public void setLoggedSevere( boolean inSevere )
+  {
+  loggedSevere = inSevere;
+  }
   /**
* Format the given LogRecord.
* @param record the log record to be formatted.
diff -Naur ../java/org/apache/nutch/util/SevereLogHandler.java 
java/org/apache/nutch/util/SevereLogHandler.java
--- ../java/org/apache/nutch/util/SevereLogHandler.java 1969-12-31 19:00:00.0 -0500
+++ java/org/apache/nutch/util/SevereLogHandler.java 2006-04-05 16:29:20.0 -0400

@@ -0,0 +1,46 @@
+/*
+ * Created on Apr 5, 2006
+ */
+package org.apache.nutch.util;
+
+import java.util.logging.Handler;
+import java.util.logging.Level;
+import java.util.logging.LogRecord;
+
+public class SevereLogHandler extends Handler
+{
+protected LogFormatter fieldNutchFormatter;
+   
+public SevereLogHandler(LogFormatter inFormatter)

+{
+setNutchFormatter(inFormatter);
+}
+   
+protected LogFormatter getNutchFormatter()

+{
+return fieldNutchFormatter;
+}
+
+protected void setNutchFormatter(LogFormatter inNutchFormatter)
+{
+fieldNutchFormatter = inNutchFormatter;
+}
+
+public void publish(LogRecord inRecord)
+{
+if ( inRecord.getLevel().intValue() == Level.SEVERE.intValue())
+{
+
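The quoted patch is cut off at this point. As a self-contained sketch of the same idea — a java.util.logging Handler that records whether a SEVERE record was published — the following is illustrative only (the class and method names here are invented, not the original SevereLogHandler):

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Sketch: a handler that remembers whether a SEVERE record was
// published, so a caller can stop processing afterwards.
public class SevereFlagHandler extends Handler {
    private volatile boolean loggedSevere = false;

    @Override
    public void publish(LogRecord record) {
        if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
            loggedSevere = true;
        }
    }

    @Override public void flush() {}
    @Override public void close() {}

    public boolean hasLoggedSevere() { return loggedSevere; }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("demo");
        SevereFlagHandler handler = new SevereFlagHandler();
        log.addHandler(handler);
        log.severe("boom");
        System.out.println(handler.hasLoggedSevere());
    }
}
```

The handler does no I/O itself; it only flags the condition, which matches the intent of catching severe errors without relying on the default console output.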

Re: PMD integration

2006-04-07 Thread Piotr Kosiorowski

Committed.
One can run the PMD checks with 'ant pmd'. It produces an HTML report 
file in the build directory. It covers the core Nutch code and the plugins.
Currently it uses only the unusedcode ruleset, but one can uncomment 
other rulesets in build.xml (or add more according to the PMD 
documentation).
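For readers who want to see roughly what such a target looks like, here is a sketch of an Ant target wiring in the PMD task with only the unusedcode ruleset. The jar location, property names, and filesets are assumptions — consult the actual build.xml for the committed version:

```xml
<!-- Sketch only: PMD task wiring, not the committed Nutch target. -->
<path id="pmd.classpath">
  <fileset dir="pmd-ext" includes="*.jar"/>
</path>

<target name="pmd" depends="compile">
  <taskdef name="pmd" classname="net.sourceforge.pmd.ant.PMDTask"
           classpathref="pmd.classpath"/>
  <pmd rulesetfiles="rulesets/unusedcode.xml">
    <formatter type="html" toFile="${build.dir}/pmd-report.html"/>
    <fileset dir="src/java" includes="**/*.java"/>
    <fileset dir="src/plugin" includes="**/src/java/**/*.java"/>
  </pmd>
</target>
```

Additional rulesets (e.g. rulesets/basic.xml) can be appended to the comma-separated rulesetfiles attribute as the thread below discusses.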


I would like to add cross-referenced sources so the report is easier to read 
in the near future.

I have two additional questions for developers:
1) Should we check test sources with pmd?
2) We have oro 2.0.7 in the dependencies (I think urlfilter and similar 
things use it). PMD requires oro 2.0.8. Do you think we can upgrade (as far 
as I know 2.0.7 and 2.0.8 should be compatible)? We would then have only one 
oro jar.


So happy PMD-ing,
Piotr





Doug Cutting wrote:

Piotr Kosiorowski wrote:

I will make it totally separate target (so test do not
depend on it).


That was actually Doug's idea (and I agree with it) to stop the build
file if PMD complains about something. It's similar to testing -- if
your tests fail, the entire build file fails.


I totally agree with it - but I want to switch it on for others to
play with first, and when we agree on the
rules we want to use, make it obligatory.


So we start out committing it as an independent target, and then add it 
to the test target?  Is that the plan?  If so, +1.


Doug





Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-06 Thread Piotr Kosiorowski
+1 - I offer my help - we can coordinate it and I can do a part of the work. I
will also try to commit your patches quickly.
Piotr

On 4/6/06, Dawid Weiss [EMAIL PROTECTED] wrote:


  Other options (raised on the Hadoop list) are Checkstyle:

 PMD seems to be the best choice for an Apache project and they all seem
 to perform at a similar level.

  Anything that generates a lot of false positives is bad: it either
  causes us to skip analysis of lots of files, or ignore the warnings.
  Skipping the JavaCC-generated classes is reasonable, but I'm wary of
  skipping much else.

 I thought a bit about this. The warnings PMD reports may actually make sense to
 fix. Take a look at maxDoc here:

 class LuceneQueryOptimizer {

private static class LimitExceeded extends RuntimeException {
  private int maxDoc;
  public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
}
 ...

 maxDoc is accessed from LuceneQueryOptimizer which requires a synthetic
 accessor in LimitExceeded. It also may look confusing because you
 declare a field private to a class, but use it from the outside...
 changing declarations to something like this:

 class LuceneQueryOptimizer {

private static class LimitExceeded extends RuntimeException {
  final int maxDoc;
  public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
}
 ...

 removes the warning and also seems to make more sense (note that package
 scope of maxDoc doesn't really expose it much more than before because
 the entire class is private).

 So... if you agree to change existing warnings as shown above (there's
 not that many) then integrating PMD with a set of sensible rules may
 help detecting bad smells in the future (I couldn't resist -- it really
 is called like this in software engineering :). I only used dead code
 detection ruleset for now, other rulesets can be checked and we will see
 if they help or quite the contrary.

 If developers agree to the above I'll create a patch together with what
 needs to be fixed to cleanly compile. Otherwise I see little sense in
 integrating PMD.

 D.






PMD integration (was: Re: Add .settings to svn:ignore on root Nutch folder?)

2006-04-06 Thread Piotr Kosiorowski

Hi,
I have downloaded the patches and generally like them (I have only read 
them, not applied them yet). I have one question - am I reading it correctly 
that right now it checks only the main code (without plugins)?

P.


Dawid Weiss wrote:


All right, I thought I'd give it a go since I have a spare few minutes. 
Jira is off, so I made the patches available here --


http://ophelia.cs.put.poznan.pl/~dweiss/nutch/

pmd.patch is the build file patch and libraries (binaries are in a 
separate zip file pmd-ext.zip).


pmd-fixes.patch fixes the current core code to go through pmd smoothly. 
I removed obvious unused code, but left FIXME comments where I wasn't 
sure if the removal can cause side effects (in these places PMD warnings 
are suppressed with NOPMD comments).


I also discovered a bug in PMD... eh... nothing's perfect.

https://sourceforge.net/tracker/?func=detail&atid=479921&aid=1465574&group_id=56262 



D.


Piotr Kosiorowski wrote:
+1 - I offer my help - we can coordinate it and I can do a part of the 
work. I

will also try to commit your patches quickly.
Piotr

On 4/6/06, Dawid Weiss [EMAIL PROTECTED] wrote:



Other options (raised on the Hadoop list) are Checkstyle:

PMD seems to be the best choice for an Apache project and they all seem
to perform at a similar level.


Anything that generates a lot of false positives is bad: it either
causes us to skip analysis of lots of files, or ignore the warnings.
Skipping the JavaCC-generated classes is reasonable, but I'm wary of
skipping much else.

I thought a bit about this. The warnings PMD reports may actually make sense to
fix. Take a look at maxDoc here:

class LuceneQueryOptimizer {

   private static class LimitExceeded extends RuntimeException {
 private int maxDoc;
 public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
   }
...

maxDoc is accessed from LuceneQueryOptimizer which requires a synthetic
accessor in LimitExceeded. It also may look confusing because you
declare a field private to a class, but use it from the outside...
changing declarations to something like this:

class LuceneQueryOptimizer {

   private static class LimitExceeded extends RuntimeException {
 final int maxDoc;
 public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
   }
...

removes the warning and also seems to make more sense (note that package
scope of maxDoc doesn't really expose it much more than before because
the entire class is private).

So... if you agree to change existing warnings as shown above (there's
not that many) then integrating PMD with a set of sensible rules may
help detecting bad smells in the future (I couldn't resist -- it really
is called like this in software engineering :). I only used dead code
detection ruleset for now, other rulesets can be checked and we will see
if they help or quite the contrary.

If developers agree to the above I'll create a patch together with what
needs to be fixed to cleanly compile. Otherwise I see little sense in
integrating PMD.

D.












[jira] Closed: (NUTCH-239) I changed httpclient to use javax.net.ssl instead of com.sun.net.ssl

2006-03-25 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-239?page=all ]
 
Piotr Kosiorowski closed NUTCH-239:
---

Fix Version: 0.7.2-dev
 Resolution: Fixed
  Assign To: Piotr Kosiorowski

Applied with JavaDoc changes. Thanks.

 I changed httpclient to use javax.net.ssl instead of com.sun.net.ssl
 

  Key: NUTCH-239
  URL: http://issues.apache.org/jira/browse/NUTCH-239
  Project: Nutch
 Type: Improvement
   Components: fetcher
 Versions: 0.7.2-dev
  Environment: RedHat Enterprise Linux
 Reporter: Jake Vanderdray
 Assignee: Piotr Kosiorowski
 Priority: Trivial
  Fix For: 0.7.2-dev


 I made the following changes in order to get the dependency on com.sun.ssl 
 out of the 0.7 branch.  The same changes have already been applied to the 0.8 
 branch (Revision 379215) thanks to ab.  There is still a dependency on using 
 the Sun JRE.  In order to get it to work with the IBM JRE I had to change 
 SunX509 to IbmX509, but I didn't include that change in this patch.  
 Thanks,
 Jake.
 Index: DummySSLProtocolSocketFactory.java
 ===
 --- DummySSLProtocolSocketFactory.java  (revision 388638)
 +++ DummySSLProtocolSocketFactory.java  (working copy)
 @@ -22,8 +22,8 @@
  import org.apache.commons.logging.Log;
  import org.apache.commons.logging.LogFactory;
  
 -import com.sun.net.ssl.SSLContext;
 -import com.sun.net.ssl.TrustManager;
 +import javax.net.ssl.SSLContext;
 +import javax.net.ssl.TrustManager;
  
  public class DummySSLProtocolSocketFactory implements ProtocolSocketFactory {
  
 Index: DummyX509TrustManager.java
 ===
 --- DummyX509TrustManager.java  (revision 388638)
 +++ DummyX509TrustManager.java  (working copy)
 @@ -10,9 +10,9 @@
  import java.security.cert.CertificateException;
  import java.security.cert.X509Certificate;
  
 -import com.sun.net.ssl.TrustManagerFactory;
 -import com.sun.net.ssl.TrustManager;
 -import com.sun.net.ssl.X509TrustManager;
 +import javax.net.ssl.TrustManagerFactory;
 +import javax.net.ssl.TrustManager;
 +import javax.net.ssl.X509TrustManager;
  import org.apache.commons.logging.Log; 
  import org.apache.commons.logging.LogFactory;
  
 @@ -57,4 +57,12 @@
  public X509Certificate[] getAcceptedIssuers() {
  return this.standardTrustManager.getAcceptedIssuers();
  }
 +   
 +public void checkClientTrusted(X509Certificate[] arg0, String arg1) 
 throws CertificateException {
 +   // do nothing
 +}
 +
 +public void checkServerTrusted(X509Certificate[] arg0, String arg1) 
 throws CertificateException {
 +   // do nothing
 +}
  }
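As a standalone illustration of the standard javax.net.ssl API this patch migrates to (a sketch, not Nutch code — the permissive trust manager mirrors the patch's do-nothing check methods and should only ever be used for testing against self-signed certificates):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.security.cert.X509Certificate;

// Sketch: build an SSLContext with the standard javax.net.ssl API
// (the replacement for the proprietary com.sun.net.ssl package).
public class TrustAllDemo {
    public static void main(String[] args) throws Exception {
        // WARNING: accepts any certificate; for test setups only.
        TrustManager trustAll = new X509TrustManager() {
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
            public void checkClientTrusted(X509Certificate[] chain, String authType) {}
            public void checkServerTrusted(X509Certificate[] chain, String authType) {}
        };
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, new TrustManager[] { trustAll }, null);
        System.out.println(ctx.getProtocol());
    }
}
```

Because javax.net.ssl is part of the standard platform API, the same code runs on non-Sun JREs, which is exactly the portability problem the patch addresses.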

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-94) MapFile.Writer throwing 'File exists error'.

2006-03-25 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-94?page=all ]
 
Piotr Kosiorowski closed NUTCH-94:
--

Fix Version: 0.7.2-dev
 Resolution: Duplicate
  Assign To: Piotr Kosiorowski

Duplicate of NUTCH-117.

 MapFile.Writer throwing 'File exists error'.
 

  Key: NUTCH-94
  URL: http://issues.apache.org/jira/browse/NUTCH-94
  Project: Nutch
 Type: Bug
   Components: fetcher
 Versions: 0.6
  Environment: Server 2003, Resin, 1.4.2_05
 Reporter: Michael Couck
 Assignee: Piotr Kosiorowski
  Fix For: 0.7.2-dev


 Running Nutch inside a server JVM, or multiple times in the same JVM, 
 MapFile.Writer doesn't get collected or closed by the WebDBWriter, the 
 associated files and directories are not deleted, and consequently a "File 
 exists" error is thrown in the constructor of MapFile.Writer.
 Seems that this portion of code is very heavily integrated into Nutch and I 
 am hesitant to look for a solution personally as a retrofit will be necessary 
 with every release.
 Has anyone got any ideas, had the same issue, any solutions?
 Regards
 Michael

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-14) NullPointerException NutchBean.getSummary

2006-03-25 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-14?page=all ]
 
Piotr Kosiorowski closed NUTCH-14:
--

Resolution: Cannot Reproduce

Closed according to Stefan's suggestion.

 NullPointerException NutchBean.getSummary
 -

  Key: NUTCH-14
  URL: http://issues.apache.org/jira/browse/NUTCH-14
  Project: Nutch
 Type: Bug
   Components: searcher
 Reporter: Stefan Groschupf
 Priority: Minor


 In heavy load scenarios this may happen when a connection breaks.
 java.lang.NullPointerException
 at java.util.Hashtable.get(Hashtable.java:333)
 at net.nutch.ipc.Client.getConnection(Client.java:276)
 at net.nutch.ipc.Client.call(Client.java:251)
 at 
 net.nutch.searcher.DistributedSearch$Client.getSummary(DistributedSearch.java:418)
 at net.nutch.searcher.NutchBean.getSummary(NutchBean.java:236)
 at 
 org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:396)
 at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:99)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:325)
 at 
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
 at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
 at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
 at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:825)
 at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:738)
 at 
 org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:526)
 at 
 org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
 at 
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
 at java.lang.Thread.run(Thread.java:552)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-117) Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL

2006-03-25 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-117?page=all ]
 
Piotr Kosiorowski closed NUTCH-117:
---

Fix Version: 0.7.2-dev
 Resolution: Fixed
  Assign To: Piotr Kosiorowski

Applied fix by Mike. Also reported off-list by Michal Karwanski.

 Crawl crashes with java.io.IOException: already exists: 
 C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
 -

  Key: NUTCH-117
  URL: http://issues.apache.org/jira/browse/NUTCH-117
  Project: Nutch
 Type: Bug
 Versions: 0.7.1, 0.7, 0.6
  Environment: Window 2000  P4 1.70GHz 512MB RAM
 Java 1.5.0_05
 Reporter: Stephen Cross
 Assignee: Piotr Kosiorowski
 Priority: Critical
  Fix For: 0.7.2-dev


 I started a crawl using the command line using nutch 0.7.1.
 nutch-daemon.sh start crawl urls.txt -dir oct18 -threads 4 -depth 20
 After crawling for over 15 hours the crawl crached with the following 
 exception:
 051019 050543 status: segment 20051019050438, 30 pages, 0 errors, 1589818 
 bytes, 48020 ms
 051019 050543 status: 0.6247397 pages/s, 258.65167 kb/s, 52993.934 bytes/page
 051019 050544 Updating C:\nutch\crawl.intranet\oct18\db
 051019 050544 Updating for 
 C:\nutch\crawl.intranet\oct18\segments\20051019050438
 051019 050544 Processing document 0
 051019 050544 Finishing update
 051019 050544 Processing pagesByURL: Sorted 47 instructions in 0.02 seconds.
 051019 050544 Processing pagesByURL: Sorted 2350.0 instructions/second
 Exception in thread main java.io.IOException: already exists: 
 C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
 at org.apache.nutch.io.MapFile$Writer.init(MapFile.java:86)
 at 
 org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
 at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
 at 
 org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
 at 
 org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
 at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
 This was on the 14th segment of the requested depth of 20. Doing a quick 
 Google search on the exception brings up a few previous posts with the same error 
 but no definitive answer; it seems to have been occurring since nutch 0.6.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Site switched to branch-0.7.

2006-03-09 Thread Piotr Kosiorowski

Hi,
I have updated the site in the 0.7 branch with the latest trunk changes. I have 
added both tutorials to the site so people will be aware of the differences.

I have also committed DOAP file in 0.7 branch.
Nutch Website uses branch-0.7 now.
Piotr


Nutch 0.7.2

2006-03-09 Thread Piotr Kosiorowski

Hello,
I would like to release nutch 0.7.2 in a week or two. Some serious 
bugfixes are already covered and I have a plan to fix one or two more.


I found an email from Doug with the title "[Fwd: Crawler submits forms?]"
stating: "This has been fixed in the mapred branch, but that patch is 
not in 0.7.1.  This alone might be a reason to make a 0.7.2 release."

I just want to make sure it was fixed by svn commit r348533,
"Fix to not extract urls whose method=post." I think this was the fix 
but I wanted to make sure before committing.


Any objections against the plan?
Piotr


[jira] Closed: (NUTCH-225) Changed the links to the tutorial to point to the wiki

2006-03-09 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-225?page=all ]
 
Piotr Kosiorowski closed NUTCH-225:
---

Resolution: Won't Fix

I have just updated the Nutch Web site. It now contains both tutorials (for 0.7 
and 0.8).
I have also added a note to each tutorial stating that more detailed tutorials 
are available on the Nutch Wiki.

 Changed the links to the tutorial to point to the wiki
 --

  Key: NUTCH-225
  URL: http://issues.apache.org/jira/browse/NUTCH-225
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Jake Vanderdray


 This is a patch to repoint tutorial links on the nutch site to the wiki.
 Index: site.xml
 ===================================================================
 --- site.xml    (revision 384005)
 +++ site.xml    (working copy)
 @@ -26,7 +26,7 @@
    <docs label="Documentation">
      <faq        label="FAQ"      href="ext:faq" />
      <wiki       label="Wiki"     href="ext:wiki" />
 -    <tutorial   label="Tutorial" href="tutorial.html" />
 +    <tutorial   label="Tutorial" href="ext:tutorial" />
      <webmasters label="Robot"    href="bot.html" />
      <i18n       label="i18n"     href="i18n.html" />
      <apidocs    label="API Docs" href="apidocs/index.html" />
 @@ -48,6 +48,7 @@
      <wiki  href="http://wiki.apache.org/nutch/" />
      <faq   href="http://wiki.apache.org/nutch/FAQ" />
      <store href="http://www.cafepress.com/nutch/" />
 +    <tutorial href="http://wiki.apache.org/nutch/NutchTutorial" />
    </external-refs>
 </site>
 Index: i18n.xml
 ===================================================================
 --- i18n.xml    (revision 384005)
 +++ i18n.xml    (working copy)
 @@ -188,7 +188,7 @@
  href="http://jakarta.apache.org/tomcat/">Tomcat</a> installed.</p>
 
  <p>An index is also required.  You can collect your own by working
 -through the <a href="http://lucene.apache.org/nutch/tutorial.html">tutorial</a>.
 +through the <a href="http://wiki.apache.org/nutch/NutchTutorial">tutorial</a>.
  Once you have an index, follow the steps outlined at the end of the
  tutorial for searching.</p>
  

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: Tutorial

2006-03-09 Thread Piotr Kosiorowski
Oops, sorry for ignoring this discussion - I was looking for comments in 
JIRA and had already committed the change before reading your discussion.
My motivation is to have a usable version of the tutorial - as simple as 
possible - versioned with the sources, if only for historical purposes: 
if somebody wants to use Nutch 0.7 a year from now, he will be able to 
find a tutorial for it without problems. But for more advanced stuff I 
fully support the Wiki. I will wait for the other committers' opinions before 
doing anything.



Jeff Ritchie wrote:

+1

Site tutorial links pointing to wiki tutorials is the best option.

Jeff.

Richard Braman wrote:

+1.  No need for 2 tutorials.  The only discrepancy I saw was the
invertlinks command, which is not in 0.7.  I updated the wiki to note that that
command only applies to 0.8.

-Original Message-
From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED] Sent: Wednesday, 
March 08, 2006 9:30 AM

To: nutch-dev@lucene.apache.org
Subject: Tutorial


This is in response to Piotr's comment to my JIRA entry
(http://issues.apache.org/jira/browse/NUTCH-225).  I haven't been
subscribed to this list, so I'm afraid I missed the discussion about the
tutorial that went on here.

After getting Piotr's comment I went to the archive and read the
earlier thread about the tutorial.  Here's what I understand:

* The tutorial necessarily differs between the 0.7 and the 0.8 branches
and this needs to be reflected on the web site by having both tutorials
up there.

* Some users have requested that the tutorial be moved to the wiki so
that it can be more easily edited and updated.  In recognition of this I
went ahead and added it to the wiki and made some edits based on input
from people who were confused about the use of Intranet Crawl as a
label.  I now realize this needs to be edited some more to indicate that
it is the tutorial for the 0.7 branch.  I'll do that in a bit.

* Piotr wants the existing tutorials (both the one for 0.7 and the one
for 0.8) on the web site as simple versions while copies get put on the
wiki and become more advanced versions.

In an effort to clear things up and move ahead, can we just do a
quick vote on the last point?  I'd propose moving both tutorials to the
wiki and updating the links on the site to reflect that.  I don't think
keeping two copies of each tutorial up to date is going to be
manageable.  I suspect that one is going to go stale and having multiple
copies (even if one is shorter than the other) is just going to confuse
users.

Thanks,
Jake.


  







[jira] Closed: (NUTCH-91) empty encoding causes exception

2006-03-09 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-91?page=all ]
 
Piotr Kosiorowski closed NUTCH-91:
--

Fix Version: 0.7.2-dev
 0.8-dev
 Resolution: Fixed

Committed with a small extension. Thanks.

 empty encoding causes exception
 ---

  Key: NUTCH-91
  URL: http://issues.apache.org/jira/browse/NUTCH-91
  Project: Nutch
 Type: Bug
 Versions: 0.8-dev
 Reporter: Michael Nebel
  Fix For: 0.7.2-dev, 0.8-dev


 I found some sites where the header says "Content-Type: text/html; 
 charset=". This causes an exception in the HtmlParser. My suggestion:
 Index: 
 src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
 ===
 --- 
 src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  
 (revision 279397)
 +++ 
 src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  
 (working copy)
 @@ -120,7 +120,7 @@
byte[] contentInOctets = content.getContent();
InputSource input = new InputSource(new 
 ByteArrayInputStream(contentInOctets));
String encoding = StringUtil.parseCharacterEncoding(contentType);
 -  if (encoding!=null) {
 +  if (encoding!=null && !"".equals(encoding)) {
  metadata.put(OriginalCharEncoding, encoding);
  if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
metadata.put(CharEncodingForConversion, encoding);
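The guard's intent can be shown in isolation. The sketch below uses a simplified stand-in for StringUtil.parseCharacterEncoding (not Nutch's actual implementation) to show why the empty-charset case needs the extra check:

```java
// Sketch of the guard the patch adds: treat an empty charset the same
// as a missing one. parseCharacterEncoding here is a simplified
// stand-in, not Nutch's StringUtil.
public class CharsetGuard {
    static String parseCharacterEncoding(String contentType) {
        if (contentType == null) return null;
        int i = contentType.indexOf("charset=");
        // "charset=" with nothing after it yields the empty string.
        return (i < 0) ? null : contentType.substring(i + "charset=".length()).trim();
    }

    static String safeEncoding(String contentType) {
        String enc = parseCharacterEncoding(contentType);
        // The patched condition: reject null AND the empty string.
        return (enc != null && !"".equals(enc)) ? enc : null;
    }

    public static void main(String[] args) {
        System.out.println(safeEncoding("text/html; charset=UTF-8"));
        System.out.println(safeEncoding("text/html; charset="));
    }
}
```

Without the `!"".equals(enc)` half of the condition, the second header would pass an empty encoding name downstream, which is what triggered the reported exception.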

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-225) Changed the links to the tutorial to point to the wiki

2006-03-07 Thread Piotr Kosiorowski (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-225?page=comments#action_12369405 ] 

Piotr Kosiorowski commented on NUTCH-225:
-

As stated in another thread, I prefer to have a simple tutorial kept in version 
control with releases. 
We already have a link to the Wiki on the Nutch Web site, so users have the 
possibility to find detailed tutorials.
So -1 from me. If there are no objections I will close this issue.

 Changed the links to the tutorial to point to the wiki
 --

  Key: NUTCH-225
  URL: http://issues.apache.org/jira/browse/NUTCH-225
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Jake Vanderdray


 This is a patch to repoint tutorial links on the nutch site to the wiki.
 Index: site.xml
 ===================================================================
 --- site.xml    (revision 384005)
 +++ site.xml    (working copy)
 @@ -26,7 +26,7 @@
    <docs label="Documentation">
      <faq        label="FAQ"      href="ext:faq" />
      <wiki       label="Wiki"     href="ext:wiki" />
 -    <tutorial   label="Tutorial" href="tutorial.html" />
 +    <tutorial   label="Tutorial" href="ext:tutorial" />
      <webmasters label="Robot"    href="bot.html" />
      <i18n       label="i18n"     href="i18n.html" />
      <apidocs    label="API Docs" href="apidocs/index.html" />
 @@ -48,6 +48,7 @@
      <wiki  href="http://wiki.apache.org/nutch/" />
      <faq   href="http://wiki.apache.org/nutch/FAQ" />
      <store href="http://www.cafepress.com/nutch/" />
 +    <tutorial href="http://wiki.apache.org/nutch/NutchTutorial" />
    </external-refs>
 </site>
 Index: i18n.xml
 ===================================================================
 --- i18n.xml    (revision 384005)
 +++ i18n.xml    (working copy)
 @@ -188,7 +188,7 @@
  href="http://jakarta.apache.org/tomcat/">Tomcat</a> installed.</p>
 
  <p>An index is also required.  You can collect your own by working
 -through the <a href="http://lucene.apache.org/nutch/tutorial.html">tutorial</a>.
 +through the <a href="http://wiki.apache.org/nutch/NutchTutorial">tutorial</a>.
  Once you have an index, follow the steps outlined at the end of the
  tutorial for searching.</p>
  

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Nutch web site

2006-03-06 Thread Piotr Kosiorowski

Hi,
It looks like the Nutch web site was updated with a site built from the latest 
trunk - the only problem is that it contains the tutorial for the unreleased (yet) 
version 0.8. I think we talked about it and agreed to keep the tutorial for 
the latest release on the Web. I have just updated the site in svn (branch-0.7) 
with the latest changes (forrest 0.7 compatibility and mailing list 
archives) and rebuilt it using forrest 0.7. If there are no objections I can 
switch the web site to use the version from the branch instead of trunk.

Regards
Piotr


Re: Nutch web site

2006-03-06 Thread Piotr Kosiorowski


Andrzej Bialecki wrote:
+1, yes it would be really confusing. Since there are more and more 
people trying 0.8, could we perhaps include a short note that 0.8 and 
later is NOT compatible with this tutorial, and a reference to the 
tutorial for 0.8 (or the trunk branch in general)?




I can add both tutorials to the Nutch web site, named "Tutorial for 0.7 
version" and "Tutorial for 0.8 version". It should make things clear.

Anyone against it?
Piotr


[jira] Commented: (NUTCH-79) Fault tolerant searching.

2006-01-30 Thread Piotr Kosiorowski (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-79?page=comments#action_12364496 ] 

Piotr Kosiorowski commented on NUTCH-79:


I think it should work without the changes I suggested in my previous comment - 
they would simply be useful additions.
I have not used it for quite a while, so I will get back to it to make sure it 
works with the latest code (I hope sooner rather than later) - but no promises at 
the moment.

 Fault tolerant searching.
 -

  Key: NUTCH-79
  URL: http://issues.apache.org/jira/browse/NUTCH-79
  Project: Nutch
 Type: New Feature
   Components: searcher
 Reporter: Piotr Kosiorowski
  Attachments: patch

 I have finally managed to prepare the first version of the fault tolerant 
 searching I promised a long time ago. 
 It reads the server configuration from a search-groups.txt file (in the startup 
 directory or the directory specified by searcher.dir) if no search-servers.txt 
 file is present. If a search-servers.txt file is present it is read and 
 handled as previously.
 ---
 Format of search-groups.txt:
   search.group.count=[int]
   search.group.name.[i]=[string] (for i=0 to count-1)
 
   For each name:
   [name].part.count=[int] partitionCount
   [name].part.[i].host=[string] (for i=0 to partitionCount-1)
   [name].part.[i].port=[int] (for i=0 to partitionCount-1)
 
   Example:
   search.group.count=2
   search.group.name.0=master
   search.group.name.1=backup
 
   master.part.count=2
   master.part.0.host=host1
   master.part.0.port=
   master.part.1.host=host2
   master.part.1.port=
 
   backup.part.count=2
   backup.part.0.host=host3
   backup.part.0.port=
   backup.part.1.host=host4
   backup.part.1.port=
 
 If more than one search group is defined in the configuration file, requests are 
 distributed among the groups in round-robin fashion. If one of the servers in a 
 group fails to respond, the whole group is treated as inactive and removed 
 from the pool used to distribute requests. A separate recovery thread checks 
 every searcher.recovery.delay seconds (default 60) whether an inactive group 
 has become alive again and, if so, adds it back to the pool of active 
 groups.
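The group file described above uses plain java.util.Properties syntax, so it can be loaded with the standard Properties API. Below is a minimal sketch (the class name and layout are hypothetical, not the actual Nutch code) of turning such a configuration into per-group lists of host:port partitions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: reads a search-groups.txt-style Properties object and
// returns one list of "host:port" entries per search group.
public class SearchGroups {
    public static List<List<String>> parse(Properties props) {
        int groupCount = Integer.parseInt(props.getProperty("search.group.count"));
        List<List<String>> groups = new ArrayList<>();
        for (int g = 0; g < groupCount; g++) {
            String name = props.getProperty("search.group.name." + g);
            int parts = Integer.parseInt(props.getProperty(name + ".part.count"));
            List<String> servers = new ArrayList<>();
            for (int p = 0; p < parts; p++) {
                String host = props.getProperty(name + ".part." + p + ".host");
                String port = props.getProperty(name + ".part." + p + ".port");
                servers.add(host + ":" + port);
            }
            groups.add(servers);
        }
        return groups;
    }

    public static void main(String[] args) {
        // One group with two partitions; the port value 9000 is illustrative.
        Properties p = new Properties();
        p.setProperty("search.group.count", "1");
        p.setProperty("search.group.name.0", "master");
        p.setProperty("master.part.count", "2");
        p.setProperty("master.part.0.host", "host1");
        p.setProperty("master.part.0.port", "9000");
        p.setProperty("master.part.1.host", "host2");
        p.setProperty("master.part.1.port", "9000");
        System.out.println(parse(p));
    }
}
```

A caller could then rotate through the returned groups in round-robin fashion, dropping a group from the rotation when one of its partitions stops responding.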

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-45) Log corrupt segments in SegmentMergeTool

2006-01-20 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-45?page=all ]
 
Piotr Kosiorowski closed NUTCH-45:
--

Fix Version: 0.7.2-dev
 Resolution: Fixed

Applied. Thanks.

 Log corrupt segments in SegmentMergeTool
 

  Key: NUTCH-45
  URL: http://issues.apache.org/jira/browse/NUTCH-45
  Project: Nutch
 Type: Improvement
 Reporter: Otis Gospodnetic
 Priority: Trivial
  Fix For: 0.7.2-dev
  Attachments: SegmentMergeTool.patch

 Just added a LOG.warning line when corrupt segments are encountered, 
 otherwise they just get skipped silently.




[jira] Closed: (NUTCH-174) Problem encountered with ant during compilation

2006-01-14 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-174?page=all ]
 
Piotr Kosiorowski closed NUTCH-174:
---

Fix Version: 0.7.2-dev
 0.8-dev
 Resolution: Fixed

Fixed some time ago during preparation of the 0.7.2 release. Please use the version 
from SVN branch-0.7. 

 Problem encountered with ant during compilation
 ---

  Key: NUTCH-174
  URL: http://issues.apache.org/jira/browse/NUTCH-174
  Project: Nutch
 Type: Bug
 Versions: 0.7.1
  Environment: Suse LInux 9.3
 Reporter: Matthias Günter
 Priority: Trivial
  Fix For: 0.8-dev, 0.7.2-dev


 There is a directory missing which causes ant to fail.
 Error message:
 BUILD FAILED
 /home/guenter/workspace/lucene/nutch-0.7.1/build.xml:76: The following error 
 occurred while executing this line:
 /home/guenter/workspace/lucene/nutch-0.7.1/src/plugin/build.xml:9: The 
 following error occurred while executing this line:
 /home/guenter/workspace/lucene/nutch-0.7.1/src/plugin/build-plugin.xml:85: 
 srcdir 
 /home/guenter/workspace/lucene/nutch-0.7.1/src/plugin/nutch-extensionpoints/src/java
  does not exist!
 Compilation worked when I omitted line 9 in nutch-0.7.1/src/plugin/build.xml:
  <!-- <ant dir="nutch-extensionpoints" target="deploy"/> -->
 However, I guess that is not what was intended.




Re: test suite fails?

2006-01-09 Thread Piotr Kosiorowski
It fails on my machine on the parse-ext tests. I am not sure what is causing 
it yet, and I am afraid I do not have time to investigate it today - 
maybe in a few days. I made a small change to make it compile a few days 
ago, but all tests passed before I committed it.

Regards
Piotr
Stefan Groschupf wrote:

Hi,

is anyone able to run the test suite without any problems?

Stefan

---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net







Re: no static NutchConf

2006-01-04 Thread Piotr Kosiorowski

+1 in general
In fact I like the approach presented by Stefan of passing only the required 
parameters to objects that have a small number of configurable params, 
instead of the whole NutchConf - it makes it obvious which parameters are 
required for such basic objects to run, and as they are usually building 
blocks for something bigger, it makes it easier to reuse them with different 
params in different parts of the code. But I like the direction and will 
not oppose passing the whole NutchConf in this case.
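To make the trade-off concrete, here is a small illustrative sketch (all class names hypothetical, not actual Nutch code) contrasting the two styles: an object that pulls its parameters out of the whole configuration versus one that receives only what it needs:

```java
// Hypothetical illustration of the two configuration styles discussed.
public class ConfStyles {
    // Stand-in for the whole NutchConf object.
    static class NutchConf {
        int getInt(String key, int dflt) { return dflt; } // stub lookup
    }

    // Style 1: the object receives the whole conf and pulls what it needs.
    static class FetcherA {
        final int threads;
        FetcherA(NutchConf conf) { this.threads = conf.getInt("fetcher.threads", 10); }
    }

    // Style 2: only the required parameter is passed in, which makes the
    // dependency explicit and the class easy to reuse with different values.
    static class FetcherB {
        final int threads;
        FetcherB(int threads) { this.threads = threads; }
    }

    public static void main(String[] args) {
        NutchConf conf = new NutchConf();
        System.out.println(new FetcherA(conf).threads == new FetcherB(10).threads);
    }
}
```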

Regards
Piotr


Re: svn commit: r365850 - in /lucene/nutch/trunk/src/plugin/protocol-httpclient: ./ lib/ src/java/org/apache/nutch/protocol/httpclient/

2006-01-04 Thread Piotr Kosiorowski

Andrzej,
Do you think it would be a good idea to commit it to the 0.7 branch for the 
0.7.2 release? I personally prefer to use released libraries instead of 
RCs if possible. It does not require a lot of changes, and you have 
already tested it with the existing code...

Piotr

[EMAIL PROTECTED] wrote:

Author: ab
Date: Tue Jan  3 23:32:04 2006
New Revision: 365850

URL: http://svn.apache.org/viewcvs?rev=365850&view=rev
Log:
Update Commons HTTPClient to v. 3.0.

Add some default headers to prefer HTML content, and in English.





[jira] Closed: (NUTCH-142) NutchConf should use the thread context classloader

2006-01-04 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-142?page=all ]
 
Piotr Kosiorowski closed NUTCH-142:
---

Fix Version: 0.7.2-dev
 0.8-dev
 Resolution: Fixed

 NutchConf should use the thread context classloader
 ---

  Key: NUTCH-142
  URL: http://issues.apache.org/jira/browse/NUTCH-142
  Project: Nutch
 Type: Improvement
 Versions: 0.7
 Reporter: Mike Cannon-Brookes
  Fix For: 0.7.2-dev, 0.8-dev


 Right now NutchConf uses its own static classloader, which is _evil_ in a 
 J2EE scenario.
 This is simply fixed. Line 52:
private ClassLoader classLoader = NutchConf.class.getClassLoader();
 Should be:
private ClassLoader classLoader = 
 Thread.currentThread().getContextClassLoader();
 This means that no matter where Nutch classes are loaded from, it will use the 
 correct J2EE classloader to try to find configuration files (i.e. from 
 WEB-INF/classes).
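The proposed one-line change can be made slightly more defensive, since the context classloader may be null in some non-container environments. A minimal sketch (NutchConf is stubbed here, not the real class):

```java
// Sketch of the fix proposed in this issue, with a common defensive fallback:
// prefer the thread context classloader, but fall back to the defining
// class's loader when the context loader is null.
public class NutchConf {
    private final ClassLoader classLoader;

    public NutchConf() {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = NutchConf.class.getClassLoader(); // fallback outside containers
        }
        this.classLoader = cl;
    }

    public ClassLoader getClassLoader() { return classLoader; }

    public static void main(String[] args) {
        // In a J2EE container the context loader would be the webapp loader
        // (able to see WEB-INF/classes); here we just check it resolved.
        System.out.println(new NutchConf().getClassLoader() != null);
    }
}
```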




[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread Piotr Kosiorowski (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361520 ] 

Piotr Kosiorowski commented on NUTCH-138:
-

I am not sure, but I suspect it is a problem of bad Tomcat configuration. 
To handle special characters in query URLs one has to change the default Tomcat 
configuration - specifically, the URIEncoding attribute should be set to UTF-8. See:

http://tomcat.apache.org/faq/connectors.html#utf8

Please check whether it helps in your particular case so we can close the issue.
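For reference, the URIEncoding attribute mentioned above is set on the HTTP Connector element in Tomcat's conf/server.xml; a minimal illustrative fragment (the port and other attributes are examples, not values from this report):

```xml
<!-- conf/server.xml: decode GET query parameters as UTF-8 -->
<Connector port="8080" maxThreads="150"
           URIEncoding="UTF-8" />
```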


 non-Latin-1 characters cannot be submitted for search
 -

  Key: NUTCH-138
  URL: http://issues.apache.org/jira/browse/NUTCH-138
  Project: Nutch
 Type: Bug
   Components: web gui
 Versions: 0.7.1
  Environment: Windows XP, Tomcat 5.5.12
 Reporter: KuroSaka TeruHiko
 Priority: Minor


 The search.html currently specifies the GET method for query submission.
 Tomcat 5.x only allows the ISO-8859-1 (aka Latin-1) code set to be submitted over 
 GET because of some restrictions of the HTML or HTTP spec they discovered. (If my 
 memory is correct, non-ISO-8859-1 characters were working OK over GET with 
 older versions of Tomcat as long as setCharacterEncoding() is called properly.)
 To allow proper transmission of non-ISO-8859-1, the POST method should be used.  
 Here's a proposed patch:
 *** search.html   Tue Dec 13 15:02:15 2005
 --- search-org.html   Tue Dec 13 15:02:07 2005
 ***************
 *** 59,65 ****
   </span><span class="bodytext">
   <center>
 ! <form name="search" action="../search.jsp" method="post">
   <input name="query" size="44">&nbsp;<input type="submit" value="Search">
   <a href="help.html">help</a>
 --- 59,65 ----
   </span><span class="bodytext">
   <center>
 ! <form name="search" action="../search.jsp" method="get">
   <input name="query" size="44">&nbsp;<input type="submit" value="Search">
   <a href="help.html">help</a>
 BTW, I am aware that Nutch and Lucene won't handle non-Western languages well 
 as packaged.




[jira] Closed: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-138?page=all ]
 
Piotr Kosiorowski closed NUTCH-138:
---

Resolution: Invalid

Setting URIEncoding in tomcat config file fixes the problem.





[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread Piotr Kosiorowski (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361549 ] 

Piotr Kosiorowski commented on NUTCH-138:
-

BTW - just create a user for yourself in the Nutch wiki and you should be able to add 
a new page with the information without problems. Thanks for checking and 
documenting it.




Re: Mega-cleanup in trunk/

2006-01-01 Thread Piotr Kosiorowski

Andrzej Bialecki wrote:

Hi,

I just commited a large patch to cleanup the trunk/ of obsolete and 
broken classes remaining from the 0.7.x development line. Please test 
that things still work as they should ...



Hi,
I am not sure what is wrong, but a lot of the JUnit tests simply do not 
compile - I did an svn checkout to a new directory to be sure I did not 
have anything left over from my experiments.


I am looking at it right now, but I would suggest temporarily doing a 
quick cleanup to make trunk testable:


1) Remove permanently - as classes under tests are removed in trunk:
src/test/org/apache/nutch/pagedb/TestFetchListEntry.java
src/test/org/apache/nutch/pagedb/TestPage.java
src/test/org/apache/nutch/db/TestWebDB.java
src/test/org/apache/nutch/db/DBTester.java
src/test/org/apache/nutch/tools/TestSegmentMergeTool.java
2) Remove temporarily and create a JIRA issue to fix:
src/test/org/apache/nutch/fetcher/TestFetcher.java
src/test/org/apache/nutch/fetcher/TestFetcherOutput.java

3) Remove unused import in:
src/test/org/apache/nutch/parse/TestParseText.java
4) Fix (as it looks simple to fix - I will look at it in the meantime):

src/plugin/parse-msword/src/test/org/apache/nutch/parse/msword/TestMSWordParser.java
src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip/TestZipParser.java
src/plugin/parse-rss/src/test/org/apache/nutch/parse/rss/TestRSSParser.java
src/plugin/parse-pdf/src/test/org/apache/nutch/parse/pdf/TestPdfParser.java
src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/TestMSPowerPointParser.java
src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/AllTests.java

After removal of all these non-compiling classes, the trunk tests complete 
successfully on my machine (JDK 1.4.2).


If no objections are raised - especially from Andrzej - I can do the 
cleanup tomorrow.

P.




Re: how to add additional factor at search time to ranking score

2006-01-01 Thread Piotr Kosiorowski

AJ Chen wrote:

It would be great if I could add some new functions to the nutch code to 
accomplish this. But if it requires customizing the lucene code, that's 
fine. I have tried to use the most recent release (1.4.3) of the lucene 
source code, but it did not work. Are the lucene jar files included in 
the nutch release (0.7.1) very different from lucene 1.4.3? If yes, is 
it possible to get the source code for the lucene used in nutch?


Nutch uses Lucene 1.9 (not an existing release yet) - built from the Lucene 
trunk. Simply grab the sources from the Lucene trunk and Nutch should work fine 
with them.

P.



[jira] Commented: (NUTCH-142) NutchConf should use the thread context classloader

2006-01-01 Thread Piotr Kosiorowski (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-142?page=comments#action_12361492 ] 

Piotr Kosiorowski commented on NUTCH-142:
-

Thanks. Fixed in the 0.7 branch. Left open to fix it in trunk after cleaning up the 
trunk JUnit test problems (in the next few days).




[jira] Closed: (NUTCH-42) enhance search.jsp such that it can also returns XML

2005-12-31 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-42?page=all ]
 
Piotr Kosiorowski closed NUTCH-42:
--

Fix Version: 0.7.2-dev
 0.8-dev
 Resolution: Fixed

OpenSearch implemented.

 enhance search.jsp such that it can also returns XML
 

  Key: NUTCH-42
  URL: http://issues.apache.org/jira/browse/NUTCH-42
  Project: Nutch
 Type: Wish
   Components: web gui
 Reporter: Michael Wechner
 Priority: Trivial
  Fix For: 0.7.2-dev, 0.8-dev
  Attachments: NutchRssSearch.zip, NutchRssSearch.zip, search.jsp.diff, 
 search.jsp.diff

 Enhance search.jsp such that by specifying a parameter format=xml the JSP 
 will return XML, whereas if no format is specified then it will 
 return HTML.




[jira] Commented: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates

2005-12-23 Thread Piotr Kosiorowski (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-148?page=comments#action_12361206 ] 

Piotr Kosiorowski commented on NUTCH-148:
-

The 'df' command is required for NDFS operation, so if you were not using NDFS in 
0.7.1 and the nutch shell scripts, you were able to run it on Windows without 
cygwin. Now the majority of tools use NDFS, so cygwin is required on Windows. I 
would assume the other bug is also cygwin-related - please test it with cygwin 
and report whether it fixed the issue.
In the future, in case of doubt, it is better to ask on the nutch-user mailing list 
rather than create a JIRA issue first. I will close both your issues now, assuming 
they are cygwin-related. If you find that it still does not work with cygwin, 
please reopen.


 org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates
 --

  Key: NUTCH-148
  URL: http://issues.apache.org/jira/browse/NUTCH-148
  Project: Nutch
 Type: Bug
   Components: indexer
 Versions: 0.8-dev
  Environment: Windows XP Home
 Reporter: raghavendra prabhu


 I get the following error while running org.apache.nutch.tools.CrawlTool
 The error actually is in deleteduplicates 
 51223 001121 Reading url hashes...
 051223 001121 Sorting url hashes...
 051223 001121 Deleting url duplicates...
 051223 001121 Error moving bad file 
 G:\apache-tomcat-5.5.12\webapps\crux\WEB-INF
 \classes\ddup-workingdir\ddup-20051223001121: java.io.IOException: 
 CreateProcess
 : df -k  
 G:\apache-tomcat-5.5.12\webapps\crux\WEB-INF\classes\ddup-workingdir\ddup-20051223001121
  error=2
 It throws the error here in NFSDataInputStream.java
 The exception is org.apache.nutch.fs.ChecksumException: Checksum 
 error: G:\apach
 e-tomcat-5.5.12\webapps\crux\WEB-INF\classes\ddup-workingdir\ddup-20051223001121
  at 0

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates

2005-12-23 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-148?page=all ]
 
Piotr Kosiorowski closed NUTCH-148:
---

Resolution: Invalid




[jira] Closed: (NUTCH-147) nutch map reduce does not work in windows map reduce runs in a loop

2005-12-23 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-147?page=all ]
 
Piotr Kosiorowski closed NUTCH-147:
---

Resolution: Invalid

The cygwin requirement on Windows is listed in the nutch tutorial. Please reopen if 
the problem persists after using it from a cygwin environment.

 nutch map reduce does not work in windows map reduce runs in a loop
 ---

  Key: NUTCH-147
  URL: http://issues.apache.org/jira/browse/NUTCH-147
  Project: Nutch
 Type: Bug
   Components: indexer
 Versions: 0.8-dev
  Environment: Windows system Winxp Pro
 Reporter: raghavendra prabhu
 Priority: Blocker


 Description 
 Crawl Starts 
 and i am able to see the initial messages
 Then the map reduce process starts and it continues to run in a loop 
 I do not find the same problem in linux(linux it works perfectly)
 Below is loop into which i run into 
 clustering.OnlineClusterer)
 051222 182058   Nutch Indexing Filter 
 (org.apache.nutch.indexer.IndexingFilter)
 051222 182058   Nutch Content Parser (org.apache.nutch.parse.Parser)
 051222 182058   Ontology Model Loader (org.apache.nutch.ontology.Ontology)
 051222 182058   Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
 051222 182058   Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
 051222 182058 found resource crawl-urlfilter.txt at 
 file:/G:/trunklatest/conf/cr
 awl-urlfilter.txt
 051222 182058 crawl\url.txt:0+25
 051222 182059 crawl\url.txt:0+25
 051222 182059  map -521216%
 051222 182100 crawl\url.txt:0+25
 051222 182100  map -1107496%
 051222 182101 crawl\url.txt:0+25
 051222 182101  map -1678544%
 051222 182102 crawl\url.txt:0+25
 051222 182102  map -2265900%
 051222 182103 crawl\url.txt:0+25
 051222 182103  map -2849416%
 051222 182104 crawl\url.txt:0+25
 051222 182104  map -3422908%
 051222 182105 crawl\url.txt:0+25
 The same thing continues




[jira] Commented: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates

2005-12-22 Thread Piotr Kosiorowski (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-148?page=comments#action_12361128 ] 

Piotr Kosiorowski commented on NUTCH-148:
-

Do you have Cygwin installed? 
Is 'df' working in your cygwin installation?
Do you run crawl from cygwin shell?

Nutch requires cygwin on Windows.




Re: [VOTE] Commiter access for Stefan Groschupf

2005-12-19 Thread Piotr Kosiorowski

+1 - especially for the amount of support Stefan gives to nutch users.
P.
Andrzej Bialecki wrote:

Hi,

During the past year and more Stefan participated actively in the
development, and contributed many high-quality patches. He's been
spending considerable effort on addressing many issues in JIRA, and
proposing fixes and improvements.

Apparently he has too much free time on his hands, and it's best to
catch him now, before he realizes that there are other ways of spending
time than hacking Nutch code... ;-)

So, I'd like to call for a vote on adding Stefan as a commiter.






Re: svn commit: r357334 - in /lucene/nutch/trunk: conf/nutch-default.xml src/java/org/apache/nutch/protocol/Content.java src/java/org/apache/nutch/protocol/ContentProperties.java

2005-12-17 Thread Piotr Kosiorowski


Doug Cutting wrote:

[EMAIL PROTECTED] wrote:


+/*
+ * (non-Javadoc)
+ * + * @see 
org.apache.nutch.io.Writable#write(java.io.DataOutput)

+ */
+public final void write(DataOutput out) throws IOException {



We should either include javadoc or not.  In general, all public methods 
should have javadoc.  In this case, since this is implementing an 
interface method, if no Javadoc comment is added, then the interface's 
will be used.  That would be preferable.  Frequently in this case folks 
add a comment like:


// javadoc inherited

Doug


Doug,
It is not a JavaDoc comment, as it does not start with /** - it has 
exactly the effect you mentioned: the JavaDoc will be inherited. In 
fact, Eclipse generates such a comment automatically. In my opinion 
both versions (// javadoc inherited and the committed one) are OK and I 
have no preference for either of them.


Regards,
Piotr


JUnit test failures

2005-12-15 Thread Piotr Kosiorowski

Hi,
I have problems with the JUnit tests in the trunk and mapred branches. 
TestFetcher fails in both branches. The same test executes correctly in 
the 0.7 branch.

Is it only my problem (environment setup) or are others having it too?
I would suspect some changes in redirect handling.
Regards
Piotr


Re: [Fwd: Crawler submits forms?]

2005-12-15 Thread Piotr Kosiorowski

Doug Cutting wrote:

Andrzej Bialecki wrote:

Please also don't forget that the trunk/ will soon be invaded by the 
code from mapred, I guess some time around the middle of January (Doug?) 



Thinking about this more, perhaps we should do it sooner.  There's 
already a branch for 0.7.x releases, so what point is there in not 
merging mapred to trunk now?  We'd have fewer branches to maintain, and 
start getting nightly builds of mapred.  Folks who require 0.7.x 
compatibility can continue to use (and patch) the 0.7.x branch.  
Objections?


Doug

+1. Looking at the questions on the mailing lists, I do not think many people 
use trunk now.


Piotr


Re: Lucene performance bottlenecks

2005-12-08 Thread Piotr Kosiorowski
Hi,
I started thinking some time ago about implementing a special kind of Lucene
Query (if I remember correctly I would have to write my own Scorer and probably
a few other classes) optimized for Nutch. I assumed that with a specialized
query I would be able to avoid accessing some of the Lucene index structures
multiple times, as the same term appears many times in the query generated by
Nutch for multi-token queries. I am not a Lucene expert, but maybe it is worth
checking whether it might give some performance boost. Does anyone have any
ideas why it might or might not help?
Regards,
Piotr


Re: Urlfilter Patch

2005-12-01 Thread Piotr Kosiorowski

Jérôme Charron wrote:
[...]

build a list of file extensions to include (other ones will be excluded) in
the fetch process.

[...]
I would not like to exclude all others - for example, many extensions 
are valid for HTML, especially dynamically generated pages (jsp, asp, cgi, 
just to name the easy ones, and a lot of custom ones). But the idea of 
automatically allowing extensions for which plugins are enabled is good 
in my opinion.
Anyway, I will try to find my own list of forbidden extensions that I prepared 
based on 80M URLs - I just prepared a list of the most common ones 
and went through it manually. I will try to find it over the weekend so we 
can combine it with the list discussed in this thread.

P.




Re: Performance issues with ConjunctionScorer

2005-11-22 Thread Piotr Kosiorowski
On 11/22/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Hi,

 I've been profiling a Nutch installation, and to my surprise the largest
 amount of throwaway allocations and the most time spent was not in Nutch
 specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method.
 This method operates on a LinkedList, which seems to be a huge
 bottleneck. Perhaps it would be possible to replace LinkedList with a
 table?


I had exactly the same findings some time ago, and even replaced the LinkedList
with a table and started to prepare a patch and summarize my findings, but at
the same time this subject was raised on the Lucene mailing list with a patch
doing exactly the same thing. I cannot find the link to the thread right now -
but as far as I remember it is already committed in the SVN trunk.
Regards
Piotr


Re: Performance issues with ConjunctionScorer

2005-11-22 Thread Piotr Kosiorowski
You are right - it is still not committed, but the patch is here:
http://issues.apache.org/jira/browse/LUCENE-443
During tests of my patch - which was very, very similar to this one - I saw up
to a 5% performance increase. But probably it will mainly result in nicer GC
behaviour.
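To illustrate the idea behind the patch, here is a toy sketch (not Lucene's actual code) of the conjunction "advance to a common doc" loop over plain arrays, which avoids the per-iteration allocation and pointer chasing of a LinkedList:

```java
// Toy model of ConjunctionScorer.doNext(): repeatedly advance every posting
// list that is behind the current maximum doc until all lists agree on one
// doc id. Each "scorer" is simulated by a sorted int[] plus a cursor.
public class ConjunctionSketch {
    static int nextMatch(int[][] postings, int[] cursors) {
        while (true) {
            int max = -1;
            boolean allEqual = true;
            for (int i = 0; i < postings.length; i++) {
                if (cursors[i] >= postings[i].length) return -1; // a list is exhausted
                int doc = postings[i][cursors[i]];
                if (max != -1 && doc != max) allEqual = false;
                max = Math.max(max, doc);
            }
            if (allEqual) return max; // every list is positioned on the same doc
            // advance every list that is behind the current max doc
            for (int i = 0; i < postings.length; i++) {
                while (cursors[i] < postings[i].length && postings[i][cursors[i]] < max) {
                    cursors[i]++;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[][] postings = { {1, 3, 5, 9}, {3, 4, 9}, {2, 3, 8, 9} };
        int[] cursors = new int[postings.length];
        System.out.println(nextMatch(postings, cursors)); // first common doc
    }
}
```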

Piotr

On 11/22/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Piotr Kosiorowski wrote:

 On 11/22/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 
 
 Hi,
 
 I've been profiling a Nutch installation, and to my surprise the largest
 amount of throwaway allocations and the most time spent was not in Nutch
 specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method.
 This method operates on a LinkedList, which seems to be a huge
 bottleneck. Perhaps it would be possible to replace LinkedList with a
 table?
 
 
 
 
 I had exactly the same findings some time ago and even replaced
 LinkedList
 with a table and started to prepare the patch and summarize my finding as
 at
 the same time this subject was rised on lucene mailing list with patch -
 doing exactly the same thing. I cannot find the link to thread right now
 -
 but as far as I remember it is already commited in SVN trunk.
 
 

 Can't be - I'm working with the latest revision of Lucene from trunk/

 --
 Best regards,
 Andrzej Bialecki 
 ___. ___ ___ ___ _ _ __
 [__ || __|__/|__||\/| Information Retrieval, Semantic Web
 ___|||__|| \| || | Embedded Unix, System Integration
 http://www.sigram.com Contact: info at sigram dot com





[jira] Closed: (NUTCH-99) ports are hardcoded or random

2005-11-14 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-99?page=all ]
 
Piotr Kosiorowski closed NUTCH-99:
--

Resolution: Fixed

Patch committed. Thanks Stefan.


 ports are hardcoded or random
 -

  Key: NUTCH-99
  URL: http://issues.apache.org/jira/browse/NUTCH-99
  Project: Nutch
 Type: Bug
 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Critical
  Fix For: 0.8-dev
  Attachments:  port_patch_04.txt, port_patch.txt, port_patch_02.txt, 
 port_patch_03.txt

 Ports of the tasktracker are random and the port of the datanode is hardcoded to 
 7000 as the starting port.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: suspicious outlink count

2005-11-13 Thread Piotr Kosiorowski

EM wrote:

202443 Pages consumed: 13 (at index 13). Links fetched: 233386.
202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.

If there is maxoutlinks already specified in the xml config, why does 
nutch bother counting anything over that again?


During PageRank computation nutch retrieves all links for a given page
by MD5. If we have many pages with the same MD5 it can retrieve the 
outlinks from all of these pages - I have seen bot traps with big site 
structures in which every page had exactly the same MD5 (once I had over a 
million identical pages in my index with different urls from the same host). So 
in this case we get the union of all such outlinks. In some 
situations a big number of outlinks is not a problem (as in 
your case - all pages injected from dmoz are outlinks from dmoz) - but 
usually it indicates some problem in your index, or at least a reason to 
look at it. So I decided to print a warning in this case so one can
have a look at such a site.
Regards
Piotr
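The behaviour described above - taking the union of outlinks across all pages that share an MD5, and warning when the count looks suspicious - can be sketched roughly like this. The names and the threshold are illustrative, not Nutch's actual code:

```java
import java.util.*;

public class OutlinkUnion {
    static final int SUSPICIOUS = 30000; // illustrative threshold, not Nutch's

    // For each content hash (MD5 in Nutch), take the union of outlinks of
    // all pages that share it; identical pages from a bot trap therefore
    // contribute their links only once, but a large union is worth a warning.
    static Map<String, Set<String>> unionByHash(Map<String, List<String>> outlinksByUrl,
                                                Map<String, String> hashByUrl) {
        Map<String, Set<String>> union = new HashMap<>();
        for (Map.Entry<String, List<String>> e : outlinksByUrl.entrySet()) {
            String hash = hashByUrl.get(e.getKey());
            union.computeIfAbsent(hash, h -> new HashSet<>()).addAll(e.getValue());
        }
        for (Map.Entry<String, Set<String>> e : union.entrySet()) {
            if (e.getValue().size() > SUSPICIOUS) {
                System.err.println("Suspicious outlink count = "
                        + e.getValue().size() + " for hash " + e.getKey());
            }
        }
        return union;
    }

    public static void main(String[] args) {
        Map<String, List<String>> out = new HashMap<>();
        out.put("u1", Arrays.asList("a", "b"));
        out.put("u2", Arrays.asList("b", "c"));
        Map<String, String> hash = new HashMap<>();
        hash.put("u1", "md5-x");
        hash.put("u2", "md5-x");
        System.out.println(unionByHash(out, hash).get("md5-x").size()); // prints 3
    }
}
```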



Re: to many hdd reads

2005-10-11 Thread Piotr Kosiorowski
Committed in trunk and branch-0.7 (just in case we decide to make a
0.7.2 release sometime).
Thanks
Piotr

On 10/11/05, Stefan Groschupf [EMAIL PROTECTED] wrote:

 Hi,
 don't think I'm a fuddy-duddy, but is it really sensible to do the following
 in the NutchBean?

 File [] directories = fs.listFiles(indexesDir);
 for(int i = 0; i < fs.listFiles(indexesDir).length; i++) {

 wouldn't it be better to do it like this:
 File [] directories = fs.listFiles(indexesDir);
 for(int i = 0; i < directories.length; i++) {

 First of all, these are many unnecessary disk reads, and second, there is
 theoretically a chance that the number of files changes during the loop,
 which would throw an exception.

 Should I provide a patch or someone of the contributor just change this
 one word?

 Thanks!
 Stefan



Nutch 0.7.1 and Nutch web site

2005-10-01 Thread Piotr Kosiorowski

Hello,
I have prepared the Nutch 0.7.1 release today, but I ran into one problem. I was 
updating the site in the branch, but to deploy it one must use the version 
from trunk. For now I have simply committed the generated site in trunk, but 
this solution is far from perfect.

Should we have version independent site - always modified in trunk?
Or should we think about having a site (eg. JavaDocs, tutorial etc) 
versioned and available for all versions at the same time?

I am not sure, so I am asking whether somebody has some ideas about it.
Regards
Piotr


Re: Nutch Suggestion? (Google like did you mean)

2005-09-29 Thread Piotr Kosiorowski
Have a look at http://issues.apache.org/jira/browse/NUTCH-48. I think an ngram-based
approach is appropriate here. I was using one in our search engine.
Regards
Piotr
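An n-gram based suggester of the kind referenced in NUTCH-48 works by indexing character n-grams of dictionary words and ranking candidates by n-gram overlap with the misspelled query. A minimal sketch, with invented names and a deliberately naive linear scan instead of a real inverted index:

```java
import java.util.*;

public class NgramSuggester {
    // Character trigrams of a word, with boundary padding so that
    // prefixes and suffixes also contribute to the overlap score.
    static Set<String> trigrams(String word) {
        String padded = "$" + word + "$";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }

    // Return the dictionary word sharing the most trigrams with the query.
    static String suggest(String query, Collection<String> dictionary) {
        Set<String> q = trigrams(query);
        String best = null;
        int bestScore = -1;
        for (String w : dictionary) {
            Set<String> g = trigrams(w);
            g.retainAll(q);                 // overlap with the query's trigrams
            if (g.size() > bestScore) {
                bestScore = g.size();
                best = w;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(suggest("lucen", Arrays.asList("apache", "lucene", "nutch")));
    }
}
```

In a real implementation the n-grams would be indexed (e.g. as a Lucene field) so a query touches only candidate words sharing at least one n-gram, rather than scanning the whole dictionary.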

On 9/29/05, Jack Tang [EMAIL PROTECTED] wrote:

 Hi

 I very much like Google's Did you mean and I notice that nutch currently
 does not provide this function.

 In this article http://today.java.net/lpt/a/211, the author Tim White
 implemented suggestions using n-grams to generate a suggestion index. Do
 you think it is good for nutch? I mean, the index in nutch will be really
 huge. Or should we just provide some dictionaries like jazzy (LGPL) does?

 Thanks
 /Jack
 --
 Keep Discovering ... ...
 http://www.jroller.com/page/jmars



[jira] Closed: (NUTCH-89) parse-rss null pointer exception

2005-09-23 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-89?page=all ]
 
Piotr Kosiorowski closed NUTCH-89:
--

Fix Version: 0.8-dev
 0.7
 Resolution: Fixed

Applied in trunk and 0.7 branch. Thanks.

 parse-rss null pointer exception
 

  Key: NUTCH-89
  URL: http://issues.apache.org/jira/browse/NUTCH-89
  Project: Nutch
 Type: Bug
   Components: fetcher
 Versions: 0.7, 0.8-dev
 Reporter: Michael Nebel
  Fix For: 0.7, 0.8-dev
  Attachments: parse-rss.20050910.patch

 The rss-parser causes an exception. The reason is a syntax error in the page. 
 Hitting these pages, the parser tries to add an outlink with null as the anchor. 
 The anchor of an outlink must not be null. 
 java.lang.NullPointerException
 at org.apache.nutch.io.UTF8.writeString(UTF8.java:236)
 at org.apache.nutch.parse.Outlink.write(Outlink.java:51)
 at org.apache.nutch.parse.ParseData.write(ParseData.java:111)
 at 
 org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
 at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
 at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
 at 
 org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
 at 
 org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
 at 
 org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
 Exception in thread "main" java.lang.RuntimeException: SEVERE error logged.  
 Exiting fetcher.
 at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
 at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
 at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140)
 I suggest the following patch:
 Index: src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java
 ===
 --- src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java   
   (revision 279397)
 +++ src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java   
   (working copy)
 @@ -157,11 +157,13 @@
  if (r.getLink() != null) {
  try {
  // get the outlink
 -theOutlinks.add(new Outlink(r.getLink(), r
 -.getDescription()));
 +if (r.getDescription() != null) {
 +theOutlinks.add(new Outlink(r.getLink(), r.getDescription()));
 +} else {
 +theOutlinks.add(new Outlink(r.getLink(), ""));
 +}
  } catch (MalformedURLException e) {
 -LOG
 -.info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
 +LOG.info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
  + r.getLink()
  + ": Attempting to continue processing outlinks");
  e.printStackTrace();
 @@ -185,12 +187,13 @@
  
  if (whichLink != null) {
  try {
 -theOutlinks.add(new Outlink(whichLink, theRSSItem
 -.getDescription()));
 -
 +if (theRSSItem.getDescription() != null) {
 +theOutlinks.add(new Outlink(whichLink, theRSSItem.getDescription()));
 +} else {
 +theOutlinks.add(new Outlink(whichLink, ""));
 +}
  } catch (MalformedURLException e) {
 -LOG
 -.info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
 +LOG.info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
  + whichLink
  + ": Attempting to continue processing outlinks");
  e.printStackTrace();

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-95) DeleteDuplicates depends on the order of input segments

2005-09-21 Thread Piotr Kosiorowski (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-95?page=comments#action_12330113 ] 

Piotr Kosiorowski commented on NUTCH-95:


I was renaming segments quite often so I would vote for reading the date from 
the segment instead of using dir name. 

 DeleteDuplicates depends on the order of input segments
 ---

  Key: NUTCH-95
  URL: http://issues.apache.org/jira/browse/NUTCH-95
  Project: Nutch
 Type: Bug
   Components: indexer
 Versions: 0.8-dev, 0.6, 0.7
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 


 DeleteDuplicates depends on what order the input segments are processed, 
 which in turn depends on the order of segment dirs returned from 
 NutchFileSystem.listFiles(File). In most cases this is undesired and may lead 
 to deleting wrong records from indexes. The silent assumption that segments 
 at the end of the listing are more recent is not always true.
 Here's the explanation:
 * Dedup first deletes the URL duplicates by computing MD5 hashes for each 
 URL, and then sorting all records by (hash, segmentIdx, docIdx). SegmentIdx 
 is just an int index to the array of open IndexReaders - and if segment dirs 
 are moved/copied/renamed then entries in that array may change their  order. 
 And then for all equal triples Dedup keeps just the first entry. Naturally, 
 if segmentIdx is changed due to dir renaming, a different record will be kept 
 and different ones will be deleted...
 * then Dedup deletes content duplicates, again by computing hashes for each 
 content, and then sorting records by (hash, segmentIdx, docIdx). However, by 
 now we already have a different set of undeleted docs depending on the order 
 of input segments. On top of that, the same factor acts here, i.e. segmentIdx 
 changes when you re-shuffle the input segment dirs - so again, when identical 
 entries are compared the one with the lowest (segmentIdx, docIdx) is picked.
 Solution: use the fetched date from the first record in each segment to 
 determine the order of segments. Alternatively, modify DeleteDuplicates to 
 use the newer algorithm from SegmentMergeTool. This algorithm works by 
 sorting records using tuples of (urlHash, contentHash, fetchDate, score, 
 urlLength). Then:
 1. If urlHash is the same, keep the doc with the highest fetchDate  (the 
 latest version, as recorded by Fetcher).
 2. If contentHash is the same, keep the doc with the highest score, and then 
 if the scores are the same, keep the doc with the shortest url.
 Initial fix will be prepared for the trunk/ and then backported to the 
 release branch.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
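The SegmentMergeTool ordering described in the issue above can be expressed as a comparator: records sort on urlHash, then contentHash, with the latest fetchDate, highest score, and shortest URL coming first, so the record to keep always sorts to the front of its group. A hedged sketch of that tuple ordering, with illustrative field names:

```java
import java.util.Comparator;

public class DedupOrder {
    static class Rec {
        final String urlHash, contentHash, url;
        final long fetchDate;
        final float score;
        Rec(String urlHash, String contentHash, long fetchDate, float score, String url) {
            this.urlHash = urlHash; this.contentHash = contentHash;
            this.fetchDate = fetchDate; this.score = score; this.url = url;
        }
    }

    // Sort so the record to KEEP comes first in each group:
    // same urlHash    -> latest fetchDate first (negated for descending order);
    // same contentHash -> highest score first, then shortest URL first.
    static final Comparator<Rec> KEEP_FIRST =
            Comparator.comparing((Rec r) -> r.urlHash)
                      .thenComparing((Rec r) -> r.contentHash)
                      .thenComparingLong((Rec r) -> -r.fetchDate)
                      .thenComparingDouble((Rec r) -> -r.score)
                      .thenComparingInt((Rec r) -> r.url.length());

    public static void main(String[] args) {
        Rec older = new Rec("h", "c", 100L, 1.0f, "http://a/");
        Rec newer = new Rec("h", "c", 200L, 1.0f, "http://a/");
        System.out.println(KEEP_FIRST.compare(newer, older) < 0); // prints true
    }
}
```

After sorting, deduplication reduces to keeping the first record of each urlHash (and then contentHash) group and deleting the rest, which is order-independent with respect to how the segment directories were listed.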



0.7.1 release

2005-09-20 Thread Piotr Kosiorowski
Hello,
As it looks like everything that was planned has been committed to the 0.7 branch, I would 
like to prepare a 0.7.1 release in the next few days. I will change the branch name 
at the same time to comply with the agreed standard.
Any objections?
Regards
Piotr


Re: svn commit: r290163 - in /lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2: ./ lib/

2005-09-19 Thread Piotr Kosiorowski

Hi Andrzej,
Is anything left related to the clustering commits? Or should we proceed 
with the 0.7.1 release?

Piotr
[EMAIL PROTECTED] wrote:

Author: ab
Date: Mon Sep 19 07:11:07 2005
New Revision: 290163

URL: http://svn.apache.org/viewcvs?rev=290163&view=rev
Log:
Update of the clustering plugin, contributed by Dawid Weiss.

Carrot2 components updated to the newest stable versions. Improvements in
tokenizers (speedups) and stop words handling. Internal API changed slightly
(update needed if anyone wants to use other Carrot2 components and uses this
code as a glue). Support added for Danish, Finnish, Norwegian (bokmaal) and
Swedish.

Added:

lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/commons-collections-3.1-patched.jar
   (with props)

lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/log4j-1.2.11.jar
   (with props)
Removed:

lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/commons-collections-3.0.jar

lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/log4j-1.2.8.jar
Modified:

lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-filter-lingo.jar

lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-local-core.jar

lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-snowball-stemmers.jar

lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-util-common.jar

lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-util-tokenizer.jar

lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2.CONTRIBUTORS

lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2.LICENSE

lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/commons-pool.LICENSE
lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/plugin.xml

Modified: 
lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-filter-lingo.jar
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-filter-lingo.jar?rev=290163&r1=290162&r2=290163&view=diff
==
Binary files - no diff available.

Modified: 
lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-local-core.jar
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-local-core.jar?rev=290163&r1=290162&r2=290163&view=diff
==
Binary files - no diff available.

Modified: 
lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-snowball-stemmers.jar
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-snowball-stemmers.jar?rev=290163&r1=290162&r2=290163&view=diff
==
Binary files - no diff available.

Modified: 
lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-util-common.jar
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-util-common.jar?rev=290163&r1=290162&r2=290163&view=diff
==
Binary files - no diff available.

Modified: 
lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-util-tokenizer.jar
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-util-tokenizer.jar?rev=290163&r1=290162&r2=290163&view=diff
==
Binary files - no diff available.

Modified: 
lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2.CONTRIBUTORS
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2.CONTRIBUTORS?rev=290163&r1=290162&r2=290163&view=diff
==
--- 
lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2.CONTRIBUTORS
 (original)
+++ 
lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2.CONTRIBUTORS
 Mon Sep 19 07:11:07 2005
@@ -5,9 +5,10 @@
 #
 # First name, surname name; Duties; Active from; Institution
 
-Dawid Weiss; Project administrator, various components, core; 2002; Poznan University of Technology, Poland

-Stanisław, Osiński; Lingo clustering component, ODP Input; 2003; Poznan 
University of Technology, Poland
+Dawid Weiss; Project administrator, various components, core; 2002; Poland
+Stanisław, Osiński; Lingo clustering component, ODP Input; 2003; Poland
+
 Michał, Wróblewski [*]; AHC clustering components; 2003; Poznan University 
of Technology, Poland
 

Re: DistributedSearch$Client.updateSegments() blocking other threads

2005-09-16 Thread Piotr Kosiorowski

Hello Andrzej,
You can also try http://issues.apache.org/jira/browse/NUTCH-79
- I think it should also help here. It is a bit complicated, as it 
contains additional functionality, but if you have any problems I am 
willing to help. I am going to run some tests of it again and maybe 
commit it in some time if others think it is worth it.

Regards
Piotr

Andrzej Bialecki wrote:

Hi,

I was doing performance testing of a distributed search setup, with 
JMeter, using the code from trunk/.


Whenever one of the backend Servers goes down, there is a hiccup on the 
frontend, because all ParallelCalls started by the Client, which still 
use that dead address, need to timeout. This is expected, and acceptable.


New calls being made in the meantime (before updateSegments() discovers 
that the host is down) will also need to timeout - which is so so, I 
think it could be improved by removing the offending address at the 
first sign of trouble, i.e. not to wait for updateSegments() but 
immediately remove the dead host from liveAddresses. Anyway, read on...


What was curious was that the same hiccup would then occur every 10 
seconds, which is the hardcoded interval for calling 
Client.updateSegments(). It was as if the call to updateSegments() was 
synchronized on the whole class, so that all other calls are blocked 
until updateSegments() completes. I modified the code, so that instead 
of using DistributedSearch$Client itself as a Thread instance, a new 
independent Thread instance is created.


The hiccups are gone now - the list of liveAddresses is still being 
updated as it should whenever Servers go down/up, but now 
updateSegments() doesn't interfere with other calls. I attach the patch 
- but to be honest I'm still not quite sure what was happening...





Index: DistributedSearch.java
===
--- DistributedSearch.java  (revision 280515)
+++ DistributedSearch.java  (working copy)
@@ -112,8 +112,9 @@
 public Client(InetSocketAddress[] addresses) throws IOException {
   this.defaultAddresses = addresses;
   updateSegments();
-  setDaemon(true);
-  start();
+  Thread t = new Thread(this);
+  t.setDaemon(true);
+  t.start();
 }
 
 private static final Method GET_SEGMENTS;

@@ -168,8 +169,10 @@
 liveSegments+=segments.length;
   }
 
-  this.liveAddresses = (InetSocketAddress[]) // update liveAddresses

-liveAddresses.toArray(new InetSocketAddress[liveAddresses.size()]);
+  synchronized(this.liveAddresses) {
+this.liveAddresses = (InetSocketAddress[]) // update liveAddresses
+  liveAddresses.toArray(new InetSocketAddress[liveAddresses.size()]);
+  }
 
    LOG.info("STATS: "+liveServers+" servers, "+liveSegments+" segments.");

 }
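The change in the patch above can be shown in isolation: the client owns a dedicated daemon thread instead of being a Thread itself, so its (synchronized) methods no longer block callers while the periodic update runs. A minimal sketch of the pattern, with invented names standing in for the Nutch classes:

```java
import java.util.concurrent.CountDownLatch;

public class PollingClient implements Runnable {
    private final CountDownLatch firstUpdate = new CountDownLatch(1);
    private volatile int liveServers;      // written by the poller, read by callers

    public PollingClient() {
        Thread t = new Thread(this);       // dedicated thread, not 'this' as Thread
        t.setDaemon(true);                 // does not keep the JVM alive
        t.start();
    }

    public void run() {
        liveServers = 1;                   // stand-in for updateSegments()
        firstUpdate.countDown();
    }

    public int liveServers() {
        try {
            firstUpdate.await();           // callers see at least one update
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return liveServers;
    }

    public static void main(String[] args) {
        System.out.println(new PollingClient().liveServers()); // prints 1
    }
}
```

Because the poller is its own thread object, locking on the client instance inside its public methods cannot serialize callers behind the polling loop, which is exactly the hiccup the patch removes.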




Re: Problems on Crawling

2005-09-16 Thread Piotr Kosiorowski

The command bin/nutch updatedb db $s1
updates the WebDB with the links you fetched in segment $s1.
Regards
Piotr


Daniele Menozzi wrote:

Hi all, I have questions regarding org.apache.nutch.tools.CrawlTool: I have
not really understood the relationship between
depth, segments, and fetching.
Take for example the tutorial; I understand these 2 steps:

bin/nutch admin db -create
bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000

but, when I do this:

bin/nutch generate db segments

what happens? I think that a dir called 'segments' is created, and inside
it I can find the links I have previously injected. OK. Next steps:

bin/nutch fetch $s1 
	bin/nutch updatedb db $s1 

Ok, no problems here. 
But now I cannot understood what happens with this command:


bin/nutch generate db segments

it is the same command as above, but now I've not injected anything into the
DB; it only contains the pages I've previously fetched.
So, does it mean that when I generate a segment, it will automagically be
filled with links found in fetched pages? And where are these links saved?
And who saves these links?

Thank you so much, this work is really interesting!
Menoz





Re: Delete an entry in ArrayFile/MapFile

2005-09-06 Thread Piotr Kosiorowski
Hello,
You cannot do it. These structures were not designed for it. But you can 
copy all the data to another ArrayFile, skipping the entries you want to delete.
Regards
Piotr
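The copy-and-skip compaction described above can be sketched generically. Plain lists stand in for ArrayFile here, since the real class is an append-only on-disk structure:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class CompactCopy {
    // Rewrite an append-only sequence without the entries to delete:
    // read every record, append the survivors to a fresh file/list,
    // then swap the new file in for the old one.
    static <T> List<T> copyWithout(List<T> source, Predicate<T> delete) {
        List<T> target = new ArrayList<>();
        for (T record : source) {
            if (!delete.test(record)) {
                target.add(record);          // survivors keep their order
            }
        }
        return target;                       // replaces the old file
    }

    public static void main(String[] args) {
        System.out.println(copyWithout(Arrays.asList(1, 2, 3, 4), x -> x % 2 == 0));
    }
}
```

Note that positions shift after compaction, so anything holding indexes into the old file must be rebuilt as well.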

On 9/6/05, Ben [EMAIL PROTECTED] wrote:
 
 Hi
 
 How can I delete an entry in the ArrayFile/MapFile if I know the id/key?
 
 Thanks,
 Ben



Re: [Nutch Wiki] Update of Committer's Rules by AndrzejBialecki

2005-08-31 Thread Piotr Kosiorowski

Doug Cutting wrote:
Glancing at other Apache projects in subversion, I see that httpd uses 
branch names like 2.2.x and tag names like 2.2.4.  That's a little 
cryptic.  I propose that we use branch names like branch-2.4 and tag 
names like release-2.4.1.  What do folks think?



+1
In fact I wanted to do it this way when I started to create a branch, but 
as no one objected to the Release-X.Y branch name that was present 
in the Release-HOWTO I prepared earlier (and I had not thought it through), I 
decided to go with the Release-HOWTO version to avoid confusion.
I can try to change things in the next few days if others agree. I will also 
roll back the erroneous commit in the tags subdirectory.


Regards
Piotr




Re: merge mapred to trunk

2005-08-31 Thread Piotr Kosiorowski

Doug Cutting wrote:
Currently we have three versions of nutch: trunk, 0.7 and mapred.  This 
increases the chances for conflicts.  I would thus like to merge the 
mapred branch into trunk soon.  The soonest I could actually start this 
is next week.  Are there any objections?


Doug


+1
P.



Re: null lang bug? and patch?

2005-08-31 Thread Piotr Kosiorowski
Great - I just thought that it would be better if you looked at it 
instead of me digging into the code. I wanted to be on the safe side 
with the 0.7.1 release.

Regards
Piotr
Jérôme Charron wrote:

I am a bit lost but just a quick check - shouldn't it also be committed
in Release-0.7 branch?



No, the analyzer extension-point is commited only in trunk.
It's a new feature, so I follow Committer's Rules (
http://wiki.apache.org/nutch/Committer's_Rules)
;-)
Regards

Jérôme






Re: Analysis plugins and lucene-analyzers

2005-08-27 Thread Piotr Kosiorowski

Hello,
I do not object to putting lucene-analyzers-1.9-rc1-dev.jar in the 
nutch core, but I would like to offer another option. I think it is 
possible to create a plugin which contains and exports this library and 
make the other analysis plugins depend on it. I am not an expert in this, but I 
think such a solution is also possible. It is just a second idea for 
you to consider - I do not have a preference for either of these options.

Regards
Piotr
Andrzej Bialecki wrote:

Jérôme Charron wrote:


Hi,

I would like to add some language specific analysis plugins. In this 
first approach, each plugin would be simply a wrapper of the lucene's 
analyzers.
So each analysis-lang plugin need to import 
lucene-analyzers-1.9-rc1-dev.jar in its lib directory. In order to 
avoid adding this jar in many plugins, I would like to add the 
lucene-analyzers-1.9-rc1-dev.jar in the nutch core lib.

Any comments? Any objection?



I'm wondering if you could implement this plugin as a more or less 
automatic wrapper around any Lucene classes that implement Analyzer, 
i.e. so that it doesn't require recompiling to change/select the 
language, or add a non-standard analyzer from the classpath. I think 
it's possible to do this, but you would have to code a special-case for 
Snowball analyzers, where the default constructor requires an argument. 
 All of this could be read from the plugin.xml or nutch-default.xml files.








Re: crawl-urlfilter.txt mechanics

2005-08-22 Thread Piotr Kosiorowski
crawl-urlfilter.txt is specific to bin/nutch crawl. If you want to run
each step separately, you are in fact doing Whole Web crawling
from the tutorial - so you need to modify regex-urlfilter.txt instead.
Regards
Piotr
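For whole-web crawling the filtering rules therefore go into regex-urlfilter.txt. The format is one pattern per line, `+` to accept, `-` to reject, with the first matching pattern deciding. An illustrative fragment (the patterns here are examples, not a recommended configuration):

```
# accept everything under apache.org
+^http://([a-z0-9]*\.)*apache.org/

# skip binary file extensions
-\.(gif|jpg|png|zip|gz)$

# reject everything else
-.
```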

On 8/22/05, Michael Ji [EMAIL PROTECTED] wrote:
 
 Hi,
 
  When I use intranet crawling, such as calling
  bin/nutch crawl ..., crawl-urlfilter.txt works - it
  filters out the urls that do not match the domains I
  included;
 
  actually, when I take a look at CrawlTool.java, the
  config files are read into Java Properties by
  'NutchConf.get().addConfResource("crawl-tool.xml")'
 
 But:
 
 When I calling each steps explicitly by myself, such
 as,
 Loop
generate segment
fetch
updateDB
 
 The crawl-urlfilter.txt doesn't work;
 
 My question is:
 
 1) If I want to control the crawler's behavior in
 second case, should I call 'NutchConf.get()...' by
 myself?
 
 2) Where url-filter exactly works? In fetcher? So,
 after loaded from .xml and .txt, all the configuration
 data is kept in Properties for life time of nutch
 running?
 
 thanks,
 
 Michael Ji
 
 
 __
 Do You Yahoo!?
 Tired of spam?  Yahoo! Mail has the best spam protection around
 http://mail.yahoo.com



Re: Failing JUnit test

2005-08-21 Thread Piotr Kosiorowski

Hello Jérôme,
I found it and committed the fix. It was sometimes not using UTF-8 encoding.
But while looking at the code I am a little worried about
LanguageIdentifier.identify(InputStream is) - it reads bytes from the 
file in chunks and converts each chunk to a String separately. If a multibyte 
UTF-8 character is located at a chunk boundary it would be split 
in two parts.

Am I right?

Regards
Piotr
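The concern above is real for any stream decoded chunk-by-chunk: a multibyte UTF-8 sequence split across a read boundary decodes to replacement characters. A small demonstration, not the LanguageIdentifier code itself:

```java
import java.nio.charset.StandardCharsets;

public class ChunkDecode {
    // The bug pattern: decode a byte array in fixed-size chunks, converting
    // each chunk to a String separately. Multibyte sequences straddling a
    // chunk boundary are corrupted into U+FFFD replacement characters.
    static String decodeInChunks(byte[] bytes, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < bytes.length; off += chunkSize) {
            int len = Math.min(chunkSize, bytes.length - off);
            sb.append(new String(bytes, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "żółw".getBytes(StandardCharsets.UTF_8); // 7 bytes, 4 chars
        System.out.println(decodeInChunks(utf8, 3).equals("żółw"));   // false
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // żółw
    }
}
```

The fix is to decode through a single InputStreamReader (or an incremental CharsetDecoder), which buffers incomplete byte sequences across reads instead of flushing them per chunk.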


Jérôme Charron wrote:

It works on my Linux box - with both JDK 1.4 and 1.5.



OK, so it seems to be consistent with my configuration.



I will try to track it down.



I assume it is an encoding problem of the Ngram profile files, but I have no 
time this evening.

Regards

Jérôme






Re: Failing JUnit test

2005-08-20 Thread Piotr Kosiorowski

It works on my Linux box - with both JDK 1.4 and 1.5.
I will try to track it down.
Regards
Piotr
Jérôme Charron wrote:

I am using JDK 1.5 on
Windows - I can test it on 1.4,1.5 on linux tomorrow - maybe this is the
problem.



OK. Thanks
Jérôme






Failing JUnit test

2005-08-19 Thread Piotr Kosiorowski

Hello,
I have updated my local copy today and JUnit tests started to fail.

expected:<el> but was:<sv>
junit.framework.ComparisonFailure: expected:<el> but was:<sv>
	at 
org.apache.nutch.analysis.lang.TestLanguageIdentifier.testIdentify(Unknown 
Source)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)


As I suspect, it is a result of the latest updates to the LanguageIdentifier 
plugin or its tests. As I am not deep into it, I will not try to debug it 
myself at the moment - I just wanted you to know about the issue.

Regards
Piotr


Release 0.7

2005-08-16 Thread Piotr Kosiorowski

Hello Nutch Committers,
Is anyone working on preparing the release?
If not, I can spend some time on it in an hour or so.
Regards
Piotr



Release 0.7 problem

2005-08-16 Thread Piotr Kosiorowski

Hello,
I have a problem related to 0.7 release.
After making a tar I was trying to go through crawl tutorial.
 - tar xvfz nutch-0.7.tar.gz
 bin/nutch is not executable (and nutch-daemon.sh too).
I thought it was my mistake - I started out on Windows, so I moved 
to linux, but the problem persisted.
I downloaded the latest nightly build (nutch-2005-08-16.tar.gz) and it is 
still the same.


I am not using the standard nutch script (or build.xml) for my local 
installation at work, so I had a look and noticed that in my build.xml I 
have additional elements inside the tar element:

<tarfileset dir="${build.dir}" mode="755">
  <include name="${final.name}/bin/*"/>
</tarfileset>

It is strange nobody has reported it so far, so it may still be my fault.
But if not - should we make a release with the bin/* scripts not executable, 
or change the build process?

I would go for the change, but then I will do the release tomorrow, as I 
would like to test it.

Comments?

Regards
Piotr



Re: Release 0.7 problem

2005-08-16 Thread Piotr Kosiorowski

So I will move the release to tomorrow as I am a bit sleepy now.
Regards
Piotr
Doug Cutting wrote:

Piotr Kosiorowski wrote:


After making a tar I was trying to go through crawl tutorial.
 - tar xvfz nutch-0.7.tar.gz
 bin/nutch - is not executable (and nutch-daemon.sh too).




It is strange nobody reported it so far so it may still be my fault.



No, it looks like a problem with ant's tar task, which erases executable 
bits.  In prior releases I think Nutch used to directly exec 'tar czf' 
since ant's tar task didn't support compression.  Since it added 
compression we started using the ant task...


But if not - should we make a release with bin/* scripts not 
executable or change the build process?



I think we should fix this before we release.

Good job catching it.

Doug





Re: Release 0.7 problem

2005-08-16 Thread Piotr Kosiorowski

Hi,
Just for information
The only change I plan to make is to change the tar target to:

 <target name="tar" depends="package">
   <tar compression="gzip" longfile="gnu"
     destfile="${build.dir}/${final.name}.tar.gz">
     <tarfileset dir="${build.dir}" mode="664">
       <exclude name="${final.name}/bin/*" />
       <include name="${final.name}/**" />
     </tarfileset>
     <tarfileset dir="${build.dir}" mode="755">
       <include name="${final.name}/bin/*" />
     </tarfileset>
   </tar>
 </target>

I will commit and test it tomorrow.
Regards
Piotr

Doug Cutting wrote:

Piotr Kosiorowski wrote:


After making a tar I was trying to go through crawl tutorial.
 - tar xvfz nutch-0.7.tar.gz
 bin/nutch - is not executable (and nutch-daemon.sh too).




It is strange nobody reported it so far so it may still be my fault.



No, it looks like a problem with ant's tar task, which erases executable 
bits.  In prior releases I think Nutch used to directly exec 'tar czf' 
since ant's tar task didn't support compression.  Since it added 
compression we started using the ant task...


But if not - should we make a release with bin/* scripts not 
executable or change the build process?



I think we should fix this before we release.

Good job catching it.

Doug





Re: VOTE: clustering plugin update for Rel 0.7

2005-08-15 Thread Piotr Kosiorowski

Hi,
Maybe it would be a better idea to go for the 0.7 branch and schedule a new
0.7.1 release in a short time?
It is difficult for me to judge whether a patch I have not seen is good for 
the release. So I would say 0 from me (if you think it is good enough, I will 
not object).

Regards,
Piotr

Andrzej Bialecki wrote:

Hi,

This is yet another request for exception from the no-commit rule before 
release ... *sigh*


Dawid Weiss reported that he prepared an updated version of the Carrot2 
clustering plugin, which contains significant updates and improvements. 
He suggests that it would be a good idea to include it in the 0.7 release.


If committers agree, I can commit this updated version before the 
release (which should happen tomorrow?), however I'm not sure if I can 
test it sufficiently enough to be sure that nothing breaks ... If the 
decision is positive and you have a collection of test segments, which 
works with recent code, your help in testing would be appreciated.


Please vote +1 for and -1 against.





Re: FW: Fetcher, ParseText, ParseData - need to modify

2005-08-15 Thread Piotr Kosiorowski

Hello,
To change nutch's standard HTML parsing, the best place to start would 
probably be the parse-html plugin.

Regards
Piotr
Fuad Efendi wrote:

1. This is part of ParseText:
Any Accessories Backup Devices & Media Barebone Systems Camcorder
Accessories Camcorders Cases & External Enclosures CD / DVD Drives & 
Media Cooling Devices Digital Camera Accessories Digital Cameras

- it is the content of a dropdown (<OPTION> elements) in the HTML


2. I have some sub-text in ParseText which seems to be an anchor; I
compared it visually with the web page...


-Original Message-
From: Fuad Efendi [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 15, 2005 1:20 PM

To: nutch-dev@lucene.apache.org
Subject: Fetcher, ParseText, ParseData - need to modify


I just caught some output from Fetcher.FetcherThread.outputPage(.) and
noticed that some anchors are in the text, and some <OPTION> tags are within
the text too.
  LOG.info("ParseText = " + text);
  LOG.info("ParseData = " + parseData);

I'd like to modify the behaviour: ParseText should contain the subset of the text
which I need, and ParseData should contain all anchors.

Where to start? Would be nice to have plugins modifying Fetcher
behaviour...






Re: page ranking weights

2005-08-15 Thread Piotr Kosiorowski
The boost for a page may be calculated in a few different ways (and in a few 
different places in Nutch):

1) PageRank based score
- calculated by nutch analyze command based on WebDB
- during fetchlist generation scores from WebDB are stored in segment
- indexing phase uses score to set the boost for a page
2) based on number of incoming links
- during fetchlist generation inlinks are stored in the segment
- during indexing the number of inlinks is read from the segment and used in 
the boost calculation


There is a separate command (updatesegs) to update score and inlink 
information in existing segments.
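The two strategies above can be sketched as follows. This is an illustrative example only; the method names and the log-damping formula for inlinks are assumptions for this sketch, not Nutch's actual code:

```java
// Sketch of the two page-boost strategies described above.
// Method names and the inlink damping formula are illustrative,
// not taken from Nutch's source.
public class BoostSketch {

    // 1) Link-analysis score (read from the segment at indexing time)
    //    raised to a configurable power.
    static float boostFromScore(float pageScore, float scorePower) {
        return (float) Math.pow(pageScore, scorePower);
    }

    // 2) Boost derived from the number of incoming links, log-damped
    //    so that heavily linked pages do not dominate the index.
    static float boostFromInlinks(int inlinkCount) {
        return (float) Math.log(Math.E + inlinkCount);
    }

    public static void main(String[] args) {
        System.out.println(boostFromScore(4.0f, 0.5f)); // 2.0
        System.out.println(boostFromInlinks(10));
    }
}
```

Either boost value would then be attached to the Lucene document for the page at indexing time.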

Regards
Piotr

Jay Pound wrote:

Also, how does it keep track of incoming links globally for these pages? If
the weight is determined by the number of incoming links, then it would have to be
tracked somewhere, so that when you split your indexes the distributed search can
still have an accurate value.
-J
- Original Message - 
From: Jay Pound [EMAIL PROTECTED]

To: nutch-dev@lucene.apache.org
Sent: Thursday, August 11, 2005 4:49 PM
Subject: page ranking weights




At which step does nutch figure out the weight of each page - the updatedb
step, or the index step?
Thanks,
-Jay












Nutch versions - Was: [Nutch-cvs] svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-10 Thread Piotr Kosiorowski

Hello,
I think a lot of people will wait for some time before moving to the mapreduce 
implementation, so we will have a 0.7 version to support.
I was a heavy CVS branch user in my previous job, taking care of a 
common library, so I fully agree that such a branch would be needed for bug 
fixing. I would go (as Doug suggested) for lazy creation of this 
branch - it should be created on the first need to commit to it.


I also tracked down version numbers in nutch source code:
1) CHANGES.txt - that was easy, all releases are listed here
2) default.properties
version=0.7-dev
3) nutch-default.xml
<property>
  <name>http.agent.version</name>
  <value>0.07</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>

So we have a small inconsistency in naming versions (I committed 
0.07 in nutch-default.xml yesterday, but it used to be 0.06 so I did not 
change the format) - I will fix it soon.


I think we all refer to 0.7 as the next version (and 0.6 as the current one), so 
nutch-default.xml contains the wrong format. In fact it should still contain 
the -dev suffix.


To make an undocumented convention documented, I would also like to suggest 
naming releases with an X.Y format, and naming code developed to make 
release X.Y as X.Y-dev.


I will try to put a draft of a Release HOWTO doc on the Wiki tomorrow.
Regards,
Piotr







Andrzej Bialecki wrote:

Doug Cutting wrote:


We may want to start a branch for the release too, as described at:

http://svnbook.red-bean.com/en/1.1/ch04s04.html#svn-ch-4-sect-4.4.1

If we think we may someday need a 0.7.1 release, then we will need 
a 0.7 branch to make it from.  We can start this branch later by 
basing it on the 0.7 tag.  But we should never alter the 0.7 tagged 
code once it is created.


Thoughts?



I agree. I have similar experience from another OSS project (FreeBSD), 
where this scheme is used extensively. It makes it possible to provide some 
limited maintenance for past releases. Considering that this is the last 
release before merging map-reduce, doing a branch seems very 
appropriate.







Re: clucene-java bindings

2005-08-09 Thread Piotr Kosiorowski

Hello Ben,
I personally would be interested mainly in the search part of it, if the speed 
increase is significant. I am running my indices on Linux / AMD 
Opterons - I hope CLucene will work well in this environment. I assume 
CLucene is compatible with the Java Lucene index format, as we do have some 
tools in Java that manipulate Lucene indices. If you have anything to 
integrate with nutch, I am willing to help with integration and test it.

Regards,
Piotr

Ben van Klinken wrote:

Hi Nutch People,

I am a developer of CLucene, which is a full C++ port of Apache
Lucene. I would like to propose something to users of Nutch:

I have been working on some SWIG wrappers for CLucene in various
higher-level languages such as C#, Java and COM. I started working on
the Java wrapper for the purpose of 'stealing' Java test suites to
test CLucene.

I have already managed to run about half of the luceneDotNet tests
successfully using the CLucene-csharp bindings (the rest mostly
cannot be done because of the lack of director support in the Swig C#
module). This has been useful in tracking down bugs, etc.

Without too much effort, I have managed to get the Java bindings
working. I have so far been able to get the IndexFiles demo program to
run with very few changes to the Java code (I had to change the
imports code and put a System.loadLibrary call in - though these
differences would eventually be able to be removed completely).

I only spent a minute looking at speeds, but I indexed a directory
that took 2.5 seconds with Java Lucene and 1.5 seconds with
clucene-java. Of course this is not saying much, but it
means that clucene-java *might* be faster.

So what I wanted to propose to the users and developers of Nutch is this: with
a bit of effort, clucene-java could be made good enough to be 'dropped
into' the nutch project, thereby speeding up the nutch indexer. We
could write directors for clucene-java which would pass off some
things, like the analysers, into Java. This would be beneficial to nutch
because of the added speed. If the clucene-java wrapper were written
well, there would be no need for any code change in nutch, aside from
changing which lucene jar file is loaded.

These are just some preliminary thoughts; I'm sure there is still a lot
to think about. But I have shown that the concept could work using the
demo files, and I think that it could give nutch indexing/search a
reasonable speed boost.

What do people think? I am prepared to nut out this one with whoever
is interested.

cheers,
ben





Re: svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-09 Thread Piotr Kosiorowski

Hello Doug,
I read your email ten times and still I am not sure
what the problem is.
Regards,
Piotr
Doug Cutting wrote:

[EMAIL PROTECTED] wrote:


-  <value>http://www.nutch.org/docs/en/bot.html</value>
+  <value>http://lucene.apache.org/nutch/bot.html</value>



I think this should now be:

http://lucene.apache.org/nutch/bot.html

The docs/en pages have mostly been reduced to the about page, whose 
translations I hate to throw away, even though they don't really fit 
into the new Forrest-based website.


Doug





Re: svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-09 Thread Piotr Kosiorowski

No problem at all. I have a lot to learn yet and it is nice
people like you check my commits for stupid mistakes. Four eyes
are always better than two :).
Regards,
Piotr

Doug Cutting wrote:

Piotr Kosiorowski wrote:


I read your email ten times and still I am not sure
what the problem is.



The problem is with me.


Doug Cutting wrote:


[EMAIL PROTECTED] wrote:


-  <value>http://www.nutch.org/docs/en/bot.html</value>
+  <value>http://lucene.apache.org/nutch/bot.html</value>



I clicked on what I thought was the lower link, then looked at the 
browser and saw the wrong thing with the wrong link.  But I must have 
accidentally clicked on the upper link.  Sorry for the confusion!


Doug





Re: [Nutch-cvs] svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-09 Thread Piotr Kosiorowski
Will do it tomorrow - I wanted to put down a kind of release checklist 
on the Wiki, starting with where to change version numbers. I would also like 
to cover a release howto, but in fact I am not sure how to make a release 
yet. I will try to gather this information.

Regards
Piotr
Andrzej Bialecki wrote:

[EMAIL PROTECTED] wrote:


Author: pkosiorowski
Date: Mon Aug  8 13:44:23 2005
New Revision: 230887

URL: http://svn.apache.org/viewcvs?rev=230887&view=rev
Log:
User agent string related properties updated.



Piotr, could you also check other places where the version number is 
hardcoded? We should set them to 0.07 now, so that we have the right 
values in the release ...









Tutorial

2005-08-08 Thread Piotr Kosiorowski
Hello,
Some time ago someone mentioned on the list a problem with the nutch
tutorial (I cannot find this email now). I have checked it today and
he/she was right.  If you follow the nutch Intranet Crawling tutorial,
you will end up with a not very interesting index.
This is because it recommends that users set the urlfilter and urls file for
the nutch.org domain, but www.nutch.org redirects to
http://lucene.apache.org/nutch and all links are rejected by the
urlfilter.

So I suggest changing it as follows:
the urls file will contain: http://lucene.apache.org/nutch
crawl-urlfilter.txt will contain:
+^http://([a-z0-9]*\.)*apache.org/
I would also add pdf and png to the list of rejected extensions in the
crawl-urlfilter.txt file, so users would not be confused by errors in the
log file. The PDF parsing plugin is disabled in the default configuration.
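Putting those pieces together, the proposed tutorial configuration would look roughly like this. This is a sketch; the exact extension-rejection pattern is illustrative, and urlfilter rules are applied in order, so the reject rules come first:

```
# urls seed file:
http://lucene.apache.org/nutch

# crawl-urlfilter.txt:
# skip formats whose parse plugins are disabled by default
-\.(pdf|png)$
# accept everything under apache.org
+^http://([a-z0-9]*\.)*apache.org/
```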
I can commit such changes for the 0.7 release (which means today) if I get
positive feedback from the other committers.
Regards
Piotr


NUTCH 79 Fault tolerant searching.

2005-08-08 Thread Piotr Kosiorowski

Hello,
I just created an issue in JIRA, 
http://issues.apache.org/jira/browse/NUTCH-79, containing the code for 
fault-tolerant searching. I think it is too late to include it in the 0.7 
release, but I will wait for comments and test it in the meantime.
I would like to commit it once the release is done and the merging of the 
mapreduce branch is finished.

Waiting for comments,
Piotr


Re: JIRA access

2005-08-08 Thread Piotr Kosiorowski

Thanks. It works.
Piotr
Doug Cutting wrote:

Piotr Kosiorowski wrote:

Looking around in JIRA I found out I cannot resolve an issue. I am 
not sure how it works, but I suspect I lack some rights to do so. Am I 
right?



I have added you to the nutch-developers Jira group.  Now you should be 
able to resolve issues, etc.


Doug




