Re: [Nutch-cvs] svn commit: r516885 - /lucene/nutch/trunk/build.xml

2007-03-11 Thread Andrzej Bialecki
[EMAIL PROTECTED] wrote: Author: siren Date: Sun Mar 11 04:02:27 2007 New Revision: 516885 URL: http://svn.apache.org/viewvc?view=rev&rev=516885 Log: reduce the size of .job from 19+M down to 14+M This is a welcome optimization, but I feel it's risky - this should have been discussed

Re: [Nutch-cvs] svn commit: r516888 - /lucene/nutch/trunk/bin/nutch

2007-03-11 Thread Andrzej Bialecki
[EMAIL PROTECTED] wrote: Author: siren Date: Sun Mar 11 04:12:23 2007 New Revision: 516888 URL: http://svn.apache.org/viewvc?view=rev&rev=516888 Log: fix bin/nutch: line 152: cygpath: command not found on linux (FC5), hope i am not breaking it for some other env Modified:

Re: Indexing the Interesting Part Only...

2007-03-11 Thread Björn Wilmsmann
I think text classification could be used for this purpose. You would have to extract text blocks from HTML code (for example enclosed in <td></td> or <div></div>), then compare each block against a previously trained model and discard those blocks
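The idea in this message (pull text blocks out of block-level elements, then keep or discard each block) can be sketched roughly as follows. This is only an illustration in Python, not Nutch code: `BlockExtractor`, `looks_like_story`, and `interesting_blocks` are hypothetical names, and the word-count heuristic is a stand-in for the "previously trained model" the message refers to, which is not described here.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"td", "div"}  # block elements named in the message


class BlockExtractor(HTMLParser):
    """Collect the text found directly inside each <td> or <div> element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one text buffer per currently open block element
        self.blocks = []  # finished, non-empty text blocks

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append([])

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            # Join the buffered pieces and collapse runs of whitespace.
            text = " ".join(" ".join(self.stack.pop()).split())
            if text:
                self.blocks.append(text)

    def handle_data(self, data):
        if self.stack:
            self.stack[-1].append(data)


def looks_like_story(text, min_words=8):
    """Hypothetical stand-in for a trained classifier: long, punctuated text."""
    return len(text.split()) >= min_words and any(ch in text for ch in ".!?")


def interesting_blocks(html, keep=looks_like_story):
    """Return the text blocks for which the `keep` predicate is true."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return [block for block in parser.blocks if keep(block)]
```

A real classifier would be passed in through the `keep` parameter in place of the heuristic, so the extraction step does not need to change when the model does.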

Re: [Nutch-cvs] svn commit: r516888 - /lucene/nutch/trunk/bin/nutch

2007-03-11 Thread Sami Siren
How did the code end up in this place on Linux? The $cygwin condition should have prevented that, because it evaluates to true only on Cygwin, where this utility is required to translate the paths. You also changed the if syntax - before it was using the /bin/test utility to evaluate the
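The guard discussed here, calling `cygpath` only when running under Cygwin, can be illustrated with a small sketch. It is written in Python rather than in the shell syntax of `bin/nutch`, and it is not the script's actual code: `native_classpath` and its `platform` and `run` parameters are hypothetical, added so the behaviour can be exercised without Cygwin installed.

```python
import subprocess
import sys


def native_classpath(classpath, platform=sys.platform, run=subprocess.check_output):
    """Translate a classpath with cygpath, but only when running under Cygwin."""
    if platform != "cygwin":
        # On Linux and other non-Cygwin platforms cygpath does not exist,
        # so the classpath is returned unchanged and nothing is executed.
        return classpath
    # -p treats the argument as a path list, -w asks for Windows-style paths.
    output = run(["cygpath", "-p", "-w", classpath])
    return output.decode().strip()
```

With the check in place, the "cygpath: command not found" error cannot occur on Linux, because the external command is never invoked there.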

Re: [Nutch-cvs] svn commit: r516885 - /lucene/nutch/trunk/build.xml

2007-03-11 Thread Sami Siren
Andrzej Bialecki wrote: [EMAIL PROTECTED] wrote: Author: siren Date: Sun Mar 11 04:02:27 2007 New Revision: 516885 URL: http://svn.apache.org/viewvc?view=rev&rev=516885 Log: reduce the size of .job from 19+M down to 14+M This is a welcome optimization, but I feel it's risky - this

Re: Indexing the Interesting Part Only...

2007-03-11 Thread d e
Bjorn - now THAT is a cool idea! I love it. *Very* clever. The indexed website could change layout and my program would not care even a little bit! My immediate questions are: - Is it possible that the web crawling might slow to a crawl if I do it in the middle of the Nutch process (or does

[jira] Reopened: (NUTCH-432) JAVA_PLATFORM with spaces (i.e. Mac OS X-ppc-32) breaks bin/nutch script

2007-03-11 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren reopened NUTCH-432: -- After this got applied there's this error printed on console when run on FC5: bin/nutch: line 152:

Re: [Nutch-cvs] svn commit: r516888 - /lucene/nutch/trunk/bin/nutch

2007-03-11 Thread Sami Siren
Andrzej Bialecki wrote: [EMAIL PROTECTED] wrote: Author: siren Date: Sun Mar 11 04:12:23 2007 New Revision: 516888 URL: http://svn.apache.org/viewvc?view=rev&rev=516888 Log: fix bin/nutch: line 152: cygpath: command not found on linux (FC5), hope i am not breaking it for some other env

Re: Indexing the Interesting Part Only...

2007-03-11 Thread Björn Wilmsmann
d e wrote: - Is it possible that the web crawling might slow to a crawl if I do it in the middle of the Nutch process (or does that not matter because Nutch is doing stuff in multiple threads anyway so I have little to be concerned

Re: Indexing the Interesting Part Only...

2007-03-11 Thread d e
Good thinking, Bjoern. Still, does the HTML Parser have a hook so it can break the text up into elements that will be indexed as discrete documents? This may be a dumb question but we are just getting our feet wet with spidering and really need some pointers! Exactly how would the parser plug in

Re: Indexing the Interesting Part Only...

2007-03-11 Thread Björn Wilmsmann
d e wrote: Good thinking, Bjoern. Still, does the HTML Parser have a hook so it can break the text up into elements that will be indexed as discrete documents? This may be a dumb question but we are just getting our feet wet with spidering and

Re: Indexing the Interesting Part Only...

2007-03-11 Thread Michael Wechner
d e wrote: I'm sorry! I guess I was REALLY not clear. I mean my problem is to drop the junk *on each page*. I am indexing news sites. I want to harvest news STORIES, not the advertisements and other junk text around the outside of each page. Got suggestions for THAT problem? I guess you are

Re: Indexing the Interesting Part Only...

2007-03-11 Thread Michael Wechner
Michael Wechner wrote: d e wrote: I'm sorry! I guess I was REALLY not clear. I mean my problem is to drop the junk *on each page*. I am indexing news sites. I want to harvest news STORIES, not the advertisements and other junk text around the outside of each page. Got suggestions for THAT

Hadoop 0.11.2 vs. 0.12.1

2007-03-11 Thread Andrzej Bialecki
Hi all, After our discussion about which Hadoop release to use for the upcoming Nutch release, I decided to ask around on the Hadoop mailing list. The message was clear that we should go with 0.12.1 - see below: Owen O'Malley wrote: On Mar 10, 2007, at 12:32 AM, Andrzej Bialecki wrote: I

HEADSUP: reverting my changes

2007-03-11 Thread Sami Siren
Hi, I'll revert the changes I committed today and yesterday shortly. This is because there seems to be some instability in performance and due to time constraints I might not be able to debug it through. I'll get back to those changes sometime after the release is out. Sorry for the trouble! --

Re: 0.9 release

2007-03-11 Thread Sean Dean
My Nutch cycle completed successfully over the weekend. Deployment and searching also work fine. The only major/minor functional difference I noticed was that during fetching Hadoop stored the fetched data in memory until it reached a certain amount (100 megabytes or so) then wrote it all to

Re: Hadoop 0.11.2 vs. 0.12.1

2007-03-11 Thread Sean Dean
It looks like we might want to at least give it a try then, with the worst possible case of Nutch users having to keep speculative execution disabled if it causes grief again. If other problems arise, then we can just revert back to 0.11.2 which seems to be stable in terms of all the Nutch

Re: Hadoop 0.11.2 vs. 0.12.1

2007-03-11 Thread Dennis Kubes
It looks like we might want to at least give it a try then, with the worst possible case of Nutch users having to keep speculative execution disabled if it causes grief again. If other problems arise, then we can just revert back to 0.11.2 which seems to be stable in terms of all the Nutch

Re: 0.9 release

2007-03-11 Thread Dennis Kubes
My Nutch cycle completed successfully over the weekend. Deployment and searching also work fine. The only major/minor functional difference I noticed was that during fetching Hadoop stored the fetched data in memory until it reached a certain amount (100 megabytes or so) then wrote it all