[EMAIL PROTECTED] wrote:
Author: siren
Date: Sun Mar 11 04:02:27 2007
New Revision: 516885
URL: http://svn.apache.org/viewvc?view=rev&rev=516885
Log:
reduce the size of .job from 19+M down to 14+M
This is a welcome optimization, but I feel it's risky - this should have
been discussed
[EMAIL PROTECTED] wrote:
Author: siren
Date: Sun Mar 11 04:12:23 2007
New Revision: 516888
URL: http://svn.apache.org/viewvc?view=rev&rev=516888
Log:
fix bin/nutch: line 152: cygpath: command not found on linux (FC5), hope i am
not breaking it for some other env
Modified:
I think text classification could be used for this purpose. You would
have to extract text blocks from the HTML code (for example those
enclosed in <td>...</td> or <div>...</div>), then compare each block
against a previously trained model and discard those blocks
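A minimal sketch of that idea in Java, assuming a hypothetical
looksLikeContent() method standing in for the previously trained model,
and using the JDK XML parser for brevity (real pages are rarely
well-formed, so in practice you would reuse an HTML-tolerant parser like
the one Nutch already ships with):

  import java.io.ByteArrayInputStream;
  import javax.xml.parsers.DocumentBuilderFactory;
  import org.w3c.dom.Document;
  import org.w3c.dom.NodeList;

  public class BlockClassifierSketch {

      // Hypothetical stand-in for the previously trained model; a real
      // classifier would score features of the block, not just length.
      static boolean looksLikeContent(String text) {
          return text.trim().length() > 100;
      }

      public static void main(String[] args) throws Exception {
          String page = "<html><body>"
              + "<div>Buy now! Click here!</div>"
              + "<div>Several sentences of actual story text would appear"
              + " here, long enough that a trained model (or this crude"
              + " length heuristic) keeps the block as content.</div>"
              + "</body></html>";
          Document doc = DocumentBuilderFactory.newInstance()
              .newDocumentBuilder()
              .parse(new ByteArrayInputStream(page.getBytes("UTF-8")));
          for (String tag : new String[] { "td", "div" }) {
              NodeList blocks = doc.getElementsByTagName(tag);
              for (int i = 0; i < blocks.getLength(); i++) {
                  String text = blocks.item(i).getTextContent();
                  if (looksLikeContent(text)) {
                      System.out.println("KEEP: " + text);
                  } // else: discard the block as boilerplate
              }
          }
      }
  }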
How did the code end up being reached on Linux? The $cygwin condition
should have prevented that, because it evaluates to true only on Cygwin,
where this utility is required to translate the paths.
You also changed the if syntax - before, it used the /bin/test
utility to evaluate the
Andrzej Bialecki wrote:
[EMAIL PROTECTED] wrote:
Author: siren
Date: Sun Mar 11 04:02:27 2007
New Revision: 516885
URL: http://svn.apache.org/viewvc?view=rev&rev=516885
Log:
reduce the size of .job from 19+M down to 14+M
This is a welcome optimization, but I feel it's risky - this
Bjorn - now THAT is a cool idea! I love it. *Very* clever. The indexed
website could change layout and my program would not care even a little bit!
My immediate questions are:
- Is it possible that the web crawling might slow to a crawl if I do
it in the middle of the Nutch process (or does
[
https://issues.apache.org/jira/browse/NUTCH-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren reopened NUTCH-432:
--
After this got applied there's this error printed on console when run on FC5:
bin/nutch: line 152:
Andrzej Bialecki wrote:
[EMAIL PROTECTED] wrote:
Author: siren
Date: Sun Mar 11 04:12:23 2007
New Revision: 516888
URL: http://svn.apache.org/viewvc?view=rev&rev=516888
Log:
fix bin/nutch: line 152: cygpath: command not found on linux (FC5),
hope i am not breaking it for some other env
d e wrote:
- Is it possible that the web crawling might slow to a crawl if I do
it in the middle of the Nutch process (or does that not matter
because Nutch
is doing stuff in multiple threads anyway so I have little to be
concerned
Good thinking, Bjoern. Still, does the HTML Parser have a hook so it can
break the text up into elements that will be indexed as discrete documents?
This may be a dumb question but we are just getting our feet wet with
spidering and really need some pointers!
Exactly how would the parser plug in
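Nutch's HTML parser does pass the parsed DOM through its parse-filter
extension point, so that is the natural place to hook in; whether each
block can then be indexed as a discrete document is a separate question,
since the stock pipeline assumes one document per URL. A hedged sketch
of just the splitting step (the plugin wiring is version-specific and
omitted here; collectBlocks and the choice of <div> as the block
boundary are assumptions, not the actual plugin API):

  import java.util.List;
  import org.w3c.dom.Node;
  import org.w3c.dom.NodeList;

  public class BlockSplitterSketch {

      // Collect the text of each <div> subtree as one candidate document.
      static void collectBlocks(Node node, List<String> out) {
          if (node.getNodeType() == Node.ELEMENT_NODE
                  && "div".equalsIgnoreCase(node.getNodeName())) {
              out.add(node.getTextContent().trim());
              return; // treat the whole subtree as a single block
          }
          NodeList children = node.getChildNodes();
          for (int i = 0; i < children.getLength(); i++) {
              collectBlocks(children.item(i), out);
          }
      }
  }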
d e wrote:
Good thinking, Bjoern. Still, does the HTML Parser have a hook so
it can
break the text up into elements that will be indexed as discrete
documents?
This may be a dumb question but we are just getting our feet wet with
spidering and
d e wrote:
I'm sorry! I guess I was REALLY not clear. I mean my problem is to
drop the
junk *on each page*. I am indexing news sites. I want to harvest news
STORIES, not the advertisements and other junk text around the outside of
each page. Got suggestions for THAT problem?
I guess you are
Michael Wechner wrote:
d e wrote:
I'm sorry! I guess I was REALLY not clear. I mean my problem is to
drop the
junk *on each page*. I am indexing news sites. I want to harvest news
STORIES, not the advertisements and other junk text around the
outside of
each page. Got suggestions for THAT
Hi all,
After our discussion about which Hadoop release to use for the upcoming
Nutch release, I decided to ask around on the Hadoop mailing list. The
message was clear that we should go with 0.12.1 - see below:
Owen O'Malley wrote:
On Mar 10, 2007, at 12:32 AM, Andrzej Bialecki wrote:
I
Hi,
I'll shortly revert the changes I committed today and yesterday. This is
because there seems to be some performance instability, and due to
time constraints I might not be able to debug it fully. I'll get back
to those changes sometime after the release is out.
Sorry for the trouble!
--
My Nutch cycle completed successfully over the weekend. Deployment and
searching also works fine.
The only major/minor functional difference I noticed was that during fetching
Hadoop stored the fetched data in memory until it reached a certain amount (100
megabytes or so) then wrote it all to
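That threshold is consistent with Hadoop's map-side sort buffer, which
collects map output in memory and spills to disk when it fills; in the
0.12 line it was controlled by the io.sort.mb property, which defaulted
to 100. A minimal sketch of adjusting it, assuming the old
org.apache.hadoop.mapred API:

  import org.apache.hadoop.mapred.JobConf;

  public class BufferTuningSketch {
      public static void main(String[] args) {
          JobConf conf = new JobConf();
          // Size (in MB) of the in-memory buffer for map output; the
          // task spills to disk whenever the buffer fills up.
          conf.setInt("io.sort.mb", 100);
      }
  }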
It looks like we might want to at least give it a try then, with the worst
possible case of Nutch users having to keep speculative execution disabled if
it causes grief again. If other problems arise, then we can just revert back to
0.11.2 which seems to be stable in terms of all the Nutch
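For anyone who does need to keep it off, speculative execution in that
Hadoop line was a single job-level switch; a minimal sketch of disabling
it, assuming the mapred.speculative.execution property name used at the
time:

  import org.apache.hadoop.mapred.JobConf;

  public class NoSpeculationSketch {
      public static void main(String[] args) {
          JobConf conf = new JobConf();
          // Disable speculative task execution for the whole job; later
          // Hadoop releases split this into separate map/reduce flags.
          conf.setBoolean("mapred.speculative.execution", false);
      }
  }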