Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-07 Thread Jérôme Charron
> An end-to-end unit test would help coverage a lot: something that
> performs a simple crawl and runs queries against it.  Ideally this would
> start an in-process web server serving test content, crawl that content,
> then start a web server serving queries.  This could be run in both
> local and pseudo-distributed mode.

I agree this kind of functional tests are needed.
But having a lot of unit tests on basis classes has a lot of advantages:
1. If limit cases (null params, and so on) are tested it provides a kind of
specification.
So if a developper changes such limit cases then unit tests fails, and we
immediatly know that we are not backward compatible.
2. Finding class level bugs is really more easy with unit tests rather than
with functional tests (it is really painful to find a bug in a low level
class such as StringUtil with functional tests, whereas it can be quickly
checked by a good unitary test)
In other words the most code is covered by unit tests, the more it is easy
to find problems that occurs during functional tests.

But sure, some tests that requires a lot of contextual objects are not easy
to implement and functional tests are the solution.
(we can perhaps plan to integrate such tests for 1.0)

Jérôme


Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-07 Thread Doug Cutting

Jérôme Charron wrote:

I absolutely agree Dawid. But I don't think Nutch has enought human
resources
to have a Q&A person.
I will make a try to integrate a code coverage tool, and see if it gives us
some good
indices on unit tests needed efforts.


I think more unit tests could go a long way towards improving things. 
Ideally we'd also run nightly builds on Windows.


An end-to-end unit test would help coverage a lot: something that 
performs a simple crawl and runs queries against it.  Ideally this would 
start an in-process web server serving test content, crawl that content, 
then start a web server serving queries.  This could be run in both 
local and pseudo-distributed mode.


Doug


Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-07 Thread Jérôme Charron
> > My feeling was simply that the closest we are to Nutch-1.0, the more be
> need
> > some Q&A metrics (for us and for nutch users). No?
> I absolutely agree Jérôme, really. It's just that developers usually
> tend to hook up dozens of Q&A plugins and never look at what they output
> (that's the usual scenario with Maven-built projects that I observed).

Yes, that's right...;-)

What I think we need is a Q&A _person_ rather than just tools. But I'm
> always a bit skeptical, don't take it personally ;)

I absolutely agree Dawid. But I don't think Nutch has enought human
resources
to have a Q&A person.
I will make a try to integrate a code coverage tool, and see if it gives us
some good
indices on unit tests needed efforts.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-07 Thread Dawid Weiss



My feeling was simply that the closest we are to Nutch-1.0, the more be need
some Q&A metrics (for us and for nutch users). No?


I absolutely agree Jérôme, really. It's just that developers usually 
tend to hook up dozens of Q&A plugins and never look at what they output 
(that's the usual scenario with Maven-built projects that I observed). 
What I think we need is a Q&A _person_ rather than just tools. But I'm 
always a bit skeptical, don't take it personally ;)


D.


Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-06 Thread Jérôme Charron
> With code coverage... I don't know. It's
> up to you guys -- you spend much more time on Nutch code than I do and
> you know best what is needed and what isn't.

My feeling was simply that the closest we are to Nutch-1.0, the more be need
some Q&A metrics (for us and for nutch users). No?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


PMD integration (was: Re: Add ".settings" to svn:ignore on root Nutch folder?)

2006-04-06 Thread Piotr Kosiorowski

Hi,
I have downloaded the patches and generally like them (I had only read 
them not applied yet). I have one question - am I reading it correctly 
that right now it is checking only main code (without plugins?).

P.


Dawid Weiss wrote:


All right, I though I'd give it a go since I have a spare few minutes. 
Jura is off, so I made the patches available here --


http://ophelia.cs.put.poznan.pl/~dweiss/nutch/

pmd.patch is the build file patch and libraries (binaries are in a 
separate zip file pmd-ext.zip).


pmd-fixes.patch fixes the current core code to go through pmd smoothly. 
I removed obvious unused code, but left FIXME comments where I wasn't 
sure if the removal can cause side effects (in these places PMD warnings 
are suppressed with NOPMD comments).


I also discovered a bug in PMD... eh... nothing's perfect.

https://sourceforge.net/tracker/?func=detail&atid=479921&aid=1465574&group_id=56262 



D.


Piotr Kosiorowski wrote:
+1 - I offer my help - we can coordinate it and I can do a part of 
work. I

will also try to commit your patches quickly.
Piotr

On 4/6/06, Dawid Weiss <[EMAIL PROTECTED]> wrote:



Other options (raised on the Hadoop list) are Checkstyle:

PMD seems to be the best choice for an Apache project and they all seem
to perform at a similar level.


Anything that generates a lot of false positives is bad: it either
causes us to skip analysis of lots of files, or ignore the warnings.
Skipping the JavaCC-generated classes is reasonable, but I'm wary of
skipping much else.

I thought a bit about this. The warnings PMD may actually make sense to
fix. Take a look at maxDoc here:

class LuceneQueryOptimizer {

   private static class LimitExceeded extends RuntimeException {
 private int maxDoc;
 public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
   }
...

maxDoc is accessed from LuceneQueryOptimizer which requires a synthetic
accessor in LimitExceeded. It also may look confusing because you
declare a field private to a class, but use it from the outside...
changing declarations to something like this:

class LuceneQueryOptimizer {

   private static class LimitExceeded extends RuntimeException {
 final int maxDoc;
 public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
   }
...

removes the warning and also seems to make more sense (note that package
scope of maxDoc doesn't really expose it much more than before because
the entire class is private).

So... if you agree to change existing warnings as shown above (there's
not that many) then integrating PMD with a set of sensible rules may
help detecting bad smells in the future (I couldn't resist -- it really
is called like this in software engineering :). I only used dead code
detection ruleset for now, other rulesets can be checked and we will see
if they help or quite the contrary.

If developers agree to the above I'll create a patch together with what
needs to be fixed to cleanly compile. Otherwise I see little sense in
integrating PMD.

D.












Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-06 Thread Dawid Weiss


All right, I though I'd give it a go since I have a spare few minutes. 
Jura is off, so I made the patches available here --


http://ophelia.cs.put.poznan.pl/~dweiss/nutch/

pmd.patch is the build file patch and libraries (binaries are in a 
separate zip file pmd-ext.zip).


pmd-fixes.patch fixes the current core code to go through pmd smoothly. 
I removed obvious unused code, but left FIXME comments where I wasn't 
sure if the removal can cause side effects (in these places PMD warnings 
are suppressed with NOPMD comments).


I also discovered a bug in PMD... eh... nothing's perfect.

https://sourceforge.net/tracker/?func=detail&atid=479921&aid=1465574&group_id=56262

D.


Piotr Kosiorowski wrote:

+1 - I offer my help - we can coordinate it and I can do a part of work. I
will also try to commit your patches quickly.
Piotr

On 4/6/06, Dawid Weiss <[EMAIL PROTECTED]> wrote:



Other options (raised on the Hadoop list) are Checkstyle:

PMD seems to be the best choice for an Apache project and they all seem
to perform at a similar level.


Anything that generates a lot of false positives is bad: it either
causes us to skip analysis of lots of files, or ignore the warnings.
Skipping the JavaCC-generated classes is reasonable, but I'm wary of
skipping much else.

I thought a bit about this. The warnings PMD may actually make sense to
fix. Take a look at maxDoc here:

class LuceneQueryOptimizer {

   private static class LimitExceeded extends RuntimeException {
 private int maxDoc;
 public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
   }
...

maxDoc is accessed from LuceneQueryOptimizer which requires a synthetic
accessor in LimitExceeded. It also may look confusing because you
declare a field private to a class, but use it from the outside...
changing declarations to something like this:

class LuceneQueryOptimizer {

   private static class LimitExceeded extends RuntimeException {
 final int maxDoc;
 public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
   }
...

removes the warning and also seems to make more sense (note that package
scope of maxDoc doesn't really expose it much more than before because
the entire class is private).

So... if you agree to change existing warnings as shown above (there's
not that many) then integrating PMD with a set of sensible rules may
help detecting bad smells in the future (I couldn't resist -- it really
is called like this in software engineering :). I only used dead code
detection ruleset for now, other rulesets can be checked and we will see
if they help or quite the contrary.

If developers agree to the above I'll create a patch together with what
needs to be fixed to cleanly compile. Otherwise I see little sense in
integrating PMD.

D.








Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-06 Thread Piotr Kosiorowski
+1 - I offer my help - we can coordinate it and I can do a part of work. I
will also try to commit your patches quickly.
Piotr

On 4/6/06, Dawid Weiss <[EMAIL PROTECTED]> wrote:
>
>
> > Other options (raised on the Hadoop list) are Checkstyle:
>
> PMD seems to be the best choice for an Apache project and they all seem
> to perform at a similar level.
>
> > Anything that generates a lot of false positives is bad: it either
> > causes us to skip analysis of lots of files, or ignore the warnings.
> > Skipping the JavaCC-generated classes is reasonable, but I'm wary of
> > skipping much else.
>
> I thought a bit about this. The warnings PMD may actually make sense to
> fix. Take a look at maxDoc here:
>
> class LuceneQueryOptimizer {
>
>private static class LimitExceeded extends RuntimeException {
>  private int maxDoc;
>  public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
>}
> ...
>
> maxDoc is accessed from LuceneQueryOptimizer which requires a synthetic
> accessor in LimitExceeded. It also may look confusing because you
> declare a field private to a class, but use it from the outside...
> changing declarations to something like this:
>
> class LuceneQueryOptimizer {
>
>private static class LimitExceeded extends RuntimeException {
>  final int maxDoc;
>  public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
>}
> ...
>
> removes the warning and also seems to make more sense (note that package
> scope of maxDoc doesn't really expose it much more than before because
> the entire class is private).
>
> So... if you agree to change existing warnings as shown above (there's
> not that many) then integrating PMD with a set of sensible rules may
> help detecting bad smells in the future (I couldn't resist -- it really
> is called like this in software engineering :). I only used dead code
> detection ruleset for now, other rulesets can be checked and we will see
> if they help or quite the contrary.
>
> If developers agree to the above I'll create a patch together with what
> needs to be fixed to cleanly compile. Otherwise I see little sense in
> integrating PMD.
>
> D.
>
>
>
>


Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-06 Thread Dawid Weiss


> Other options (raised on the Hadoop list) are Checkstyle:

PMD seems to be the best choice for an Apache project and they all seem 
to perform at a similar level.


Anything that generates a lot of false positives is bad: it either 
causes us to skip analysis of lots of files, or ignore the warnings. 
Skipping the JavaCC-generated classes is reasonable, but I'm wary of 
skipping much else.


I thought a bit about this. The warnings PMD may actually make sense to 
fix. Take a look at maxDoc here:


class LuceneQueryOptimizer {

  private static class LimitExceeded extends RuntimeException {
private int maxDoc;
public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
  }
...

maxDoc is accessed from LuceneQueryOptimizer which requires a synthetic 
accessor in LimitExceeded. It also may look confusing because you 
declare a field private to a class, but use it from the outside... 
changing declarations to something like this:


class LuceneQueryOptimizer {

  private static class LimitExceeded extends RuntimeException {
final int maxDoc;
public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
  }
...

removes the warning and also seems to make more sense (note that package 
scope of maxDoc doesn't really expose it much more than before because 
the entire class is private).


So... if you agree to change existing warnings as shown above (there's 
not that many) then integrating PMD with a set of sensible rules may 
help detecting bad smells in the future (I couldn't resist -- it really 
is called like this in software engineering :). I only used dead code 
detection ruleset for now, other rulesets can be checked and we will see 
if they help or quite the contrary.


If developers agree to the above I'll create a patch together with what 
needs to be fixed to cleanly compile. Otherwise I see little sense in 
integrating PMD.


D.




Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-05 Thread Doug Cutting

Other options (raised on the Hadoop list) are Checkstyle:

http://checkstyle.sourceforge.net/

and FindBugs:

http://findbugs.sourceforge.net/

Although these are both under LGPL and thus harder to include in Apache 
projects.


Anything that generates a lot of false positives is bad: it either 
causes us to skip analysis of lots of files, or ignore the warnings. 
Skipping the JavaCC-generated classes is reasonable, but I'm wary of 
skipping much else.


Sigh.

Doug

Dawid Weiss wrote:


Ok, PMD seems like a good idea. I've added it to the build file. Unused 
code detection shows a few catches (javacc-generated classes need to be 
ignored because they contain a lot of junk), but unfortunately it also 
displays false positives such as in:


MapWritable.java   429   {Avoid unused private fields such as 
'fKeyClassId'}


This field is private but is used in an outside class (through a 
synthetic accessor I presume, so a simple syntax tree analysis PMD does 
is insufficient to catch it).


These things would need to be marked in the code as ignorable... Do you 
want me to create a JIRA issue for this, Doug? Or should we drop the 
subject? Oh, I forgot to say this: PMD's jars add a minimum of 1MB to 
the codebase (Xerces can be reused).


D.



Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-05 Thread Dawid Weiss


I'm a fan of automated testing and code analysis utilities, but I must 
say they only make sense if people actually use them and look at their 
results. So it's not really just about integration -- it's about looking 
at the results of these tools. PMD is neat because it can simply 
interrupt your build process so you'll have to either fix the warning or 
explicitly mark it as ignored. With code coverage... I don't know. It's 
up to you guys -- you spend much more time on Nutch code than I do and 
you know best what is needed and what isn't.


Let me know about PMD. I'll create the patch tomorrow if there's a 
consensus on if and how we should use it. For those impatient, the patch 
is in the attachment. Place the required PMD JARs in lib/pmd-ext/ and 
run 'ant pmd'.


D.

Jérôme Charron wrote:

I would not be opposed to integrating PMD or something similar into
Nutch's build.xml.  What do others think?  Any volunteers?

I'll do it. I meant to see PMD anyway so it'll be a good exercise.


Dawid, what about integrating a Code Coverage Tool like EMMA (
http://emma.sourceforge.net/)
while integrating PMD ?

Jérôme

Index: build.xml
===
--- build.xml   (revision 391739)
+++ build.xml   (working copy)
@@ -198,6 +198,34 @@
   
 
   
+   
+  
+  
+   
+   
+ 
+ 
+ 
+   
+   
+ unusedcode
+  
+ 
+ 
+
+   
+   
+   
+  
+
+   
+   
+   
+   FAILURE: PMD shows ${pmd.failures} rule 
violations. See ${pmd.report} for details.
+  
+
+  

   
   


Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-05 Thread Jérôme Charron
> > I would not be opposed to integrating PMD or something similar into
> > Nutch's build.xml.  What do others think?  Any volunteers?
> I'll do it. I meant to see PMD anyway so it'll be a good exercise.

Dawid, what about integrating a Code Coverage Tool like EMMA (
http://emma.sourceforge.net/)
while integrating PMD ?

Jérôme


Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-05 Thread Dawid Weiss


Ok, PMD seems like a good idea. I've added it to the build file. Unused 
code detection shows a few catches (javacc-generated classes need to be 
ignored because they contain a lot of junk), but unfortunately it also 
displays false positives such as in:


MapWritable.java   429   {Avoid unused private fields such as 
'fKeyClassId'}


This field is private but is used in an outside class (through a 
synthetic accessor I presume, so a simple syntax tree analysis PMD does 
is insufficient to catch it).


These things would need to be marked in the code as ignorable... Do you 
want me to create a JIRA issue for this, Doug? Or should we drop the 
subject? Oh, I forgot to say this: PMD's jars add a minimum of 1MB to 
the codebase (Xerces can be reused).


D.



Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-05 Thread Jérôme Charron
> PMD looks like a useful such tool:
> http://pmd.sourceforge.net/ant-task.html
> I would not be opposed to integrating PMD or something similar into
> Nutch's build.xml.  What do others think?  Any volunteers?

+1 (Very configurable, very good tool!)


Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-05 Thread Dawid Weiss


One can presumably disable such minor warnings in Eclipse.  Arguably the 
bug is that Eclipse warns about such things by default, rather than in a 
'pedantic' mode.


I agree -- some of them are really annoying. Plus, Eclipse has been 
having notorious problems showing warnings for unused parameters in 
overriden methods... But I still think some of the warnings can be 
valuable and your idea with PMD is a good one.


One caution: we have run into problems where includes were removed 
because a tool said they were unused, but they were required for the 
Javadoc.  So code-analysis tools are not infallible!


Eclipse deals with these properly -- I use it all the time. I believe it 
also shows warnings for classes referenced in JavaDocs and not imported.


I would not be opposed to integrating PMD or something similar into 
Nutch's build.xml.  What do others think?  Any volunteers?


I'll do it. I meant to see PMD anyway so it'll be a good exercise.

D.


Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-05 Thread Doug Cutting

Dawid Weiss wrote:
Eclipse shows a good few warnings 
in the present codebase. They are usually minor things like malformed 
JavaDocs, unused variables and such.


One can presumably disable such minor warnings in Eclipse.  Arguably the 
bug is that Eclipse warns about such things by default, rather than in a 
'pedantic' mode.


Does it make sense to fix them? I 
can correct at least some of them when I go through the code, but such 
patches are tedious to review and need to be applied promptly to avoid 
conflicts.


With folks using a variety of tools to edit their code, some such 
warnings are inevitable.  Fixing these can clutter patches whose intent 
is to fix something else.  One way to deal with these is to ocassionally 
make a big commit which fixes all of them across the whole project 
(according to a particular warning system, anyway) and makes no other 
changes.


One caution: we have run into problems where includes were removed 
because a tool said they were unused, but they were required for the 
Javadoc.  So code-analysis tools are not infallible!


Alternately, we could try to integrate some tool into the ant build 
process that checks for unused variables, etc.  Then we could make 
"warning-free" a requirement for commits.


PMD looks like a useful such tool:

http://pmd.sourceforge.net/ant-task.html

I would not be opposed to integrating PMD or something similar into 
Nutch's build.xml.  What do others think?  Any volunteers?


Doug


Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-04 Thread Dawid Weiss


It works fine Doug, thanks.


Please tell me if it is correct, since I don't use Eclipse.


I'm at the vi (or rather vim) level very often, but emacs is still ahead 
of me ;) And on a more serious note, Eclipse shows a good few warnings 
in the present codebase. They are usually minor things like malformed 
JavaDocs, unused variables and such. Does it make sense to fix them? I 
can correct at least some of them when I go through the code, but such 
patches are tedious to review and need to be applied promptly to avoid 
conflicts.


D.


Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-04 Thread Doug Cutting

Dawid Weiss wrote:
Would it be a problem to add Eclipse's ".settings" folder to ignored 
files (since Eclipse project files are already there anyway). This file 
is used when one wants to override default project configuration (code 
formatting, specific JVM etc).


I just made this change.

Please tell me if it is correct, since I don't use Eclipse.

Thanks,

Doug


Add ".settings" to svn:ignore on root Nutch folder?

2006-04-04 Thread Dawid Weiss


Would it be a problem to add Eclipse's ".settings" folder to ignored 
files (since Eclipse project files are already there anyway). This file 
is used when one wants to override default project configuration (code 
formatting, specific JVM etc).


Dawid