robots.txt redirect (NUTCH-124)

2009-03-21 Thread Mathijs Homminga

Hi everybody,

Can someone shed some light on NUTCH-124?
RobotRulesParser.java doesn't follow redirects when requesting the
robots.txt file. Doug patched this, but the patch didn't make it into
trunk.

What is the desired behavior here?


For example, when requesting the following url:
http://7is7.com/software/stateye/download/stateye097f.html

... RobotRulesParser requests the following robots.txt:
http://7is7.com/robots.txt

... however, that file doesn't exist; the request redirects to:
http://www.7is7.com/robots.txt

... that robots.txt tells us the initial url is disallowed.
But does it really? Or is that robots.txt file applicable only to
http://www.7is7.com and not to http://7is7.com?


So the question is: should we follow such redirects?
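
To make the choice concrete, here is a minimal sketch of one possible policy: follow a robots.txt redirect only when the target is again /robots.txt on a host that differs from the original by nothing more than a leading "www.". The class and method names are hypothetical and this is not the NUTCH-124 patch or existing RobotRulesParser code; it only illustrates the decision being asked about.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, NOT part of Nutch: illustrates one possible redirect policy.
final class RobotsRedirectSketch {

    /** Derives the robots.txt URL for the host that serves the given page. */
    static URI robotsUrlFor(URI pageUrl) {
        // Resolving an absolute path keeps scheme and authority, drops the rest.
        return pageUrl.resolve("/robots.txt");
    }

    /** Lower-cases a host name and removes a leading "www." if present. */
    static String stripWww(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        return h.startsWith("www.") ? h.substring(4) : h;
    }

    /**
     * Assumed policy: accept the redirect only if it still points at
     * /robots.txt and the two hosts are equal once "www." is ignored.
     */
    static boolean shouldFollow(URI robotsUrl, URI redirectTarget) {
        if (robotsUrl.getHost() == null || redirectTarget.getHost() == null) {
            return false; // relative or malformed target: be conservative
        }
        return "/robots.txt".equals(redirectTarget.getPath())
            && stripWww(robotsUrl.getHost()).equals(stripWww(redirectTarget.getHost()));
    }
}
```

With this policy the 7is7.com redirect above would be followed, and the rules found at www.7is7.com would be applied to the original URL. A stricter reading (rules apply only to the exact host that served them) would simply return false whenever the hosts differ.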

Thanks,
Mathijs

Re: [DISCUSS] contents of nutch release artifact

2009-03-21 Thread Andrzej Bialecki

Doğacan Güney wrote:

On Thu, Mar 19, 2009 at 23:46, Sami Siren ssi...@gmail.com wrote:

Sami Siren wrote:

Andrzej Bialecki wrote:

How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start a
local job, no optional filesystems etc), the *.job and *.war files and
scripts. Scripts would check for the presence of plugins/ dir, and offer an
option to create it from *.job. Assumption here is that this should be enough
to run full cycle in local mode, and that people who want to run a
distributed cluster will first install a plain Hadoop release, and then just
put the *.job and bin/nutch on the master.

* source: no build artifacts, no .svn (equivalent to svn export), simple
tgz.
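
The "create plugins/ from *.job" step could look roughly like the sketch below. It assumes the *.job file is an ordinary zip archive whose plugin files sit under a top-level plugins/ prefix; the class name, method name and that layout are assumptions made for illustration, and the real check would more likely live in the bin/nutch shell script.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, NOT part of Nutch.
final class PluginDirSketch {

    /**
     * Unpacks every entry under "plugins/" from jobFile into targetDir,
     * unless targetDir/plugins already exists.
     * Returns the number of files written (0 if plugins/ was already there).
     */
    static int ensurePlugins(Path jobFile, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        if (Files.isDirectory(root.resolve("plugins"))) {
            return 0; // already unpacked, nothing to do
        }
        int written = 0;
        try (InputStream in = Files.newInputStream(jobFile);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().startsWith("plugins/")) {
                    continue; // skip classes, lib/ and everything else
                }
                Path out = root.resolve(entry.getName()).normalize();
                if (!out.startsWith(root)) {
                    // Guard against entries such as "plugins/../../x"
                    throw new IOException("Entry escapes target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(zip, out); // reads up to the end of this entry
                    written++;
                }
            }
        }
        return written;
    }
}
```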


this sounds good to me. additionally some new documentation needs to be
written too.


I added a simple patch to NUTCH-728 to make a plain source release from svn.
What do people think: should we add the plain source package to the next rc? I
would not like to make changes to the binary package now, but propose that we
make those changes post 1.0.



+1 for including plain source release in next rc.

As for local/distributed separation, it is a good idea, but I think we
should hold it for 1.1 (or something else) if it requires architectural
changes (and thus needs review and testing).


Yes, sorry for not being more explicit - my proposal was for 1.1, I 
think 1.0 has to go out as it is (and I'd even hesitate to create a 
source-only release now - we would have to test that it's still 
buildable and fully functional.)



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [DISCUSS] contents of nutch release artifact

2009-03-21 Thread Jukka Zitting
Hi,

On Fri, Mar 20, 2009 at 1:10 PM, Andrzej Bialecki a...@sigram.com wrote:
 Yes, sorry for not being more explicit - my proposal was for 1.1, I think
 1.0 has to go out as it is (and I'd even hesitate to create a source-only
 release now - we would have to test that it's still buildable and fully
 functional.)

To be accurate, the source release *is* the collection of bits that
the release manager is using to produce binaries and other release
artifacts. It's just a packaged svn export of the release tag.

If the release manager can build and test the sources, then anyone
else should be able to do the same using the exact same set of bits.
Verifying that is one of the key parts of the release vote.

From that perspective I'm even a bit worried about the idea of having
an Ant target that exports and packages the tag, as it suggests that
the release manager is not necessarily using that set of bits to build
the release.

BR,

Jukka Zitting


Re: [DISCUSS] contents of nutch release artifact

2009-03-21 Thread Jukka Zitting
Hi,

On Sat, Mar 21, 2009 at 12:28 PM, Jukka Zitting jukka.zitt...@gmail.com wrote:
 To be accurate, the source release *is* the collection of bits that
 the release manager is using to produce binaries and other release
 artifacts. It's just a packaged svn export of the release tag.

Or, to express this in another way, the release manager can produce
the source release simply by packaging the entire source tree he's
using just before invoking any Ant targets to produce the binaries.

BR,

Jukka Zitting


Re: Problems compiling Nutch in Eclipse

2009-03-21 Thread Rodrigo Reyes C.
Ninad

Thanks for your answer. I have to say I am eager to read all you have
written in your blog about Nutch inner workings. I've already done
everything your blog post says to do (and a couple more things like
downloading a couple of extra jars that are not included in the SVN
version).

Nevertheless, I am still getting the error I described. I think I should also
mention I am not working on 0.9 code base but on the trunk code base. Maybe
that is why I am getting this error.

Rodrigo
PS: By the way, I did manage to get Nutch crawling late last night.
Still, I haven't been able to compile this specific plugin (the rtf
plugin).

2009/3/21 Ninad Raut ninad.evera...@gmail.com

 Check out my blog :
 http://j2eewebsearch.blogspot.com/

 Check out the third point...

 Let me know if you get it all right. Your comments will be appreciated.

 Regards,
 Ninad


 On Sat, Mar 21, 2009 at 6:32 AM, Rodrigo Reyes C.
 rre...@corbitecso.com wrote:

 Hi

 I have configured my eclipse project as stated here

 http://wiki.apache.org/nutch/RunNutchInEclipse0.9

 Still, I am getting the following errors:

- The return type is incompatible with Parser.getParse(Content)
RTFParseFactory.java
nutch/src/plugin/parse-rtf/src/java/org/apache/nutch/parse/rtf    line 52
Java Problem
- Type mismatch: cannot convert from ParseResult to Parse
TestRTFParser.java
nutch/src/plugin/parse-rtf/src/test/org/apache/nutch/parse/rtf    line 78
Java Problem

 Any ideas on what could be wrong? I already included both
 http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ and
 http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ jars.

 Thanks in advance

 --
 Rodrigo Reyes C.





Re: Problems compiling Nutch in Eclipse

2009-03-21 Thread Doğacan Güney
The RTF parser is not built by default because the jars it uses have some
licensing issues. It is also out of sync with current trunk, so it
does not even build anymore.

This issue may help:
https://issues.apache.org/jira/browse/NUTCH-644
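
Both errors quoted below point at the same mismatch: in trunk, Parser.getParse(Content) returns a ParseResult, while the plugin still returns a single Parse. The sketch below reproduces that shape with stand-in types and shows one way an old-style parser could be adapted. Everything nested in ParseResultSketch is a simplified stand-in written for this illustration, not the real org.apache.nutch classes, and it is not necessarily what the patch in NUTCH-644 does.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in types only; the real Nutch interfaces carry more than this.
final class ParseResultSketch {

    interface Parse { String getText(); }

    static final class Content {
        final String url;
        final String body;
        Content(String url, String body) { this.url = url; this.body = body; }
    }

    /** Stand-in for a result that maps URLs to their Parse objects. */
    static final class ParseResult {
        private final Map<String, Parse> parses = new HashMap<>();
        void put(String url, Parse parse) { parses.put(url, parse); }
        Parse get(String url) { return parses.get(url); }
    }

    /** New-style contract: one call may yield several parses. */
    interface Parser { ParseResult getParse(Content content); }

    /** Old-style plugin code: returns a bare Parse, so it cannot implement Parser. */
    static final class OldStyleRtfParser {
        Parse getParse(Content content) {
            return () -> content.body; // pretend the body is the extracted text
        }
    }

    /** Adapter: wraps the single Parse in a ParseResult keyed by the content URL. */
    static final class AdaptedRtfParser implements Parser {
        private final OldStyleRtfParser delegate = new OldStyleRtfParser();

        @Override
        public ParseResult getParse(Content content) {
            ParseResult result = new ParseResult();
            result.put(content.url, delegate.getParse(content));
            return result;
        }
    }
}
```

Callers that previously used the returned Parse directly (as the failing test apparently does) would then look the parse up by URL in the result instead.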

On Sat, Mar 21, 2009 at 03:02, Rodrigo Reyes C. rre...@corbitecso.com wrote:
 Hi

 I have configured my eclipse project as stated here

 http://wiki.apache.org/nutch/RunNutchInEclipse0.9

 Still, I am getting the following errors:

 The return type is incompatible with Parser.getParse(Content)
 RTFParseFactory.java
 nutch/src/plugin/parse-rtf/src/java/org/apache/nutch/parse/rtf    line 52
 Java Problem
 Type mismatch: cannot convert from ParseResult to Parse
 TestRTFParser.java
 nutch/src/plugin/parse-rtf/src/test/org/apache/nutch/parse/rtf    line 78
 Java Problem

 Any ideas on what could be wrong? I already included both
 http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ and
 http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ jars.

 Thanks in advance

 --
 Rodrigo Reyes C.





-- 
Doğacan Güney