Hi,
I just finished reading all the source code of the nutch gui. And
personally I don't like putting a lot of code snippets into jsp files
since it takes a lot of time when refactoring. So how about adopting
velocity/freemarker with servlets?
In general I agree it is the view layer and should
working with hadoop he/she should feel free to update the patch and post it in the hadoop jira.
Stefan
On 18.01.2007, at 15:39, Doug Cutting wrote:
Stefan Groschupf wrote:
We run the gui in several production environments with patched
hadoop code - since this is from our point of view
don't know if it is the right time to do this job.
On 1/19/07, Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi,
I just finished reading all the source code of the nutch gui. And
personally I don't like putting a lot of code snippets into jsp
files
since it takes a lot of time when refactoring. So how about
Hi,
great to hear people are still working on things. It shows once more
that getting something in early would save some effort. :)
Just some random comments.
We run the gui in several production environments with patched hadoop
code - since this is from our point of view the clean approach.
Hi Sami,
I guess you refer to these:
• LocalJobRunner:
• Run as a kind of singleton
• Have a kind of jobQueue
• Implement the JobSubmissionProtocol status-report methods
• Implement the killJob method
Right!
- how about writing a nutchrunner that just extends the functionality of
Did you ever browse this: http://wiki.media-style.com/display/nutchDocu/Home
Nothing big, but it will give you some ideas, also about plugins.
On 25.11.2006, at 06:32, Armel T. Nene wrote:
I agree with you that documentation is vital, not just for extending
the current version but also for
Hi,
Try removing the regular expression filter and check if this helps.
Let me know if this solves the problem.
You may want to do a thread dump and send the log to the list to
check where exactly the fetcher freezes.
Stefan
On 03.11.2006, at 15:53, Aisha wrote:
Hi,
I don't know why but
There is an Eclipse JavaCC plugin.
It compiles your grammar and you can easily write test code.
However it has its own issues, so you may just want to generate the
java files with the nutch ant script and then write unit tests against
these files.
HTH
Stefan
On 10.09.2006, at 00:49, heack
Another alternative would be to construct a new workflow that just
adds the Patch Available status and still permits issues to be re-opened.
+1
Hi Doug,
I'm pretty sure that your problem is related to the deduping of your
index.
In general the hash of the content of a page is used as the key for the
dedup tool (see the sketch below).
We ran into the forwarding problem in another case as well.
https://issues.apache.org/jira/browse/NUTCH-353
So maybe we
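A minimal sketch of that content-hash keying, assuming Hadoop's MD5Hash class (the page content below is a made-up example, not the actual dedup code):

import org.apache.hadoop.io.MD5Hash;

public class DedupKeySketch {
  public static void main(String[] args) {
    // Two pages with byte-identical content produce the same MD5 key,
    // so the dedup job can collapse them to a single index entry.
    byte[] content = "<html>same page body</html>".getBytes();
    MD5Hash key = MD5Hash.digest(content);
    System.out.println(key);
  }
}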
Hi,
+ You may have problems with some imports in the parse-mp3 and parse-rtf
plugins. Because of incompatibility with the apache licence they
were left out of the sources. You can find them here:
+
+ http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/
+
+
Hi Michi,
what is your motivation for that?
Stefan
On 25.08.2006, at 06:52, Michael Wechner wrote:
Hi
I think it would be very useful if the NutchBean would check if the
crawl dir exists and throw at least a warning
in case it doesn't:
Index:
Hi Renaud,
I think you meant editing http://wiki.apache.org/nutch/RunNutchInEclipse ,
not http://wiki.apache.org/nutch/RenaudRichardet , right?
Right! Sorry for the misunderstanding. I have no idea about your
personal page so it would be a bad move to edit it. :-)
Thanks again for
Hi Renaud,
I updated your page with some more details, I hope that is ok for you.
Thanks for creating it.
Stefan
On 23.08.2006, at 11:51, Apache Wiki wrote:
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki
for change notification.
The following page has
One must also remember that proper junit testing can be used to
verify functionality.
There's a lot of code currently that is not guarded by unit tests and
I hereby invite everybody to participate in this endless effort and
make Nutch unit tests better ;)
I completely agree!!!
Nutch has more
[
http://issues.apache.org/jira/browse/NUTCH-354?page=comments#action_12429496 ]
Stefan Groschupf commented on NUTCH-354:
Since this issue is already closed I can not attach the patch file, so I attach
it as text within this comment
Hi,
Maybe some people will find this posting interesting.
Webspam is one of the biggest issues of nutch for whole web crawls
from my POV.
Greetings,
Stefan
During AIRWeb'06 we announced the availability of the collection.
We are currently planning a Web Spam challenge based on the
[
http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429534 ]
Stefan Groschupf commented on NUTCH-356:
Hi Enrico,
there will be as many PluginRepositories as Configuration objects.
So in case you create many
crawling simulation
---
Key: NUTCH-357
URL: http://issues.apache.org/jira/browse/NUTCH-357
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
Fix
[ http://issues.apache.org/jira/browse/NUTCH-357?page=all ]
Stefan Groschupf updated NUTCH-357:
---
Attachment: protocol-simulation-pluginV1.patch
A very first preview of a plugin that helps to simulate crawls. This protocol
plugin can be used
Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8.1, 0.9.0
MapWritable recycles entries from its internal linked list for performance
reasons. The nextEntry of an entry is not reset in case a recyclable entry is
found. This can cause wrong
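An illustrative sketch of the recycling pattern and of the fix in the attached patch (class and field names are made up, not the actual Nutch MapWritable internals):

class RecyclingMapSketch {
  static class Entry {
    Object key, value;
    Entry nextEntry;            // link to the next entry in a chain
  }

  private Entry free;           // head of the list of recycled entries

  Entry newEntry(Object key, Object value) {
    Entry e;
    if (free != null) {         // reuse a recycled entry for performance
      e = free;
      free = free.nextEntry;
      e.nextEntry = null;       // the fix: clear the stale link of the recycled entry
    } else {
      e = new Entry();
    }
    e.key = key;
    e.value = value;
    return e;
  }

  public static void main(String[] args) {
    RecyclingMapSketch m = new RecyclingMapSketch();
    System.out.println(m.newEntry("k", "v").key);
  }
}

Without the reset, a reused entry still points at whatever followed it before it was recycled, which silently chains unrelated entries together.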
[ http://issues.apache.org/jira/browse/NUTCH-354?page=all ]
Stefan Groschupf updated NUTCH-354:
---
Attachment: resetNextEntryInMapWritableV1.patch
Resets the nextEntry of a recycled entry.
MapWritable, nextEntry is not reset when Entries
[
http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ]
Stefan Groschupf commented on NUTCH-343:
Thanks for the contribution, and also for including a test with your patch. :-)
Just a small comment from taking a first look
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ]
Stefan Groschupf updated NUTCH-341:
---
Attachment: doNotDeleteTmpIndexMergeDirV1.patch
+1.
I agree it makes no sense at all to be required to create a tmp folder
manually and nutch deletes
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]
Stefan Groschupf updated NUTCH-337:
---
Attachment: respectFetcherParsePropertyV1.patch
Hi Jeremy, thanks for catching this. Attached a fix. Should be easy for a
contributor to commit
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]
Stefan Groschupf updated NUTCH-337:
---
Priority: Major (was: Trivial)
Fetcher ignores the fetcher.parse value configured in config file
/NUTCH-350
Project: Nutch
Issue Type: Bug
Reporter: Stefan Groschupf
Priority: Critical
Intranet crawls or focused crawls will fetch many pages from the same host.
This causes a thread to be blocked since another thread already fetches
from
[ http://issues.apache.org/jira/browse/NUTCH-350?page=all ]
Stefan Groschupf updated NUTCH-350:
---
Attachment: protocolRetryV5.patch
This patch will dramatically increase the number of successfully fetched pages
of an intranet crawl over time
[
http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12428858 ]
Stefan Groschupf commented on NUTCH-322:
I think this is a serious problem. Page A does a server side redirect to Page B.
Page A is never written to the output
: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8.1
Attachments: doNotRefecthForwarderPagesV1.patch
Pages that do a serverside forward are not written with a status change back
into the crawlDb. Also the nextFetchTime
[ http://issues.apache.org/jira/browse/NUTCH-353?page=all ]
Stefan Groschupf updated NUTCH-353:
---
Attachment: doNotRefecthForwarderPagesV1.patch
Since we discussed that nutch needs to be more polite we should fix that asap.
Pages that serverside
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]
Stefan Groschupf resolved NUTCH-322.
Resolution: Duplicate
duplicate of NUTCH-353
Fetcher discards ProtocolStatus, doesn't store redirected pages
[
http://issues.apache.org/jira/browse/NUTCH-347?page=comments#action_12428915 ]
Stefan Groschupf commented on NUTCH-347:
Please submit this patch!
Thanks!
Build: plugins' Jars not found
[
http://issues.apache.org/jira/browse/NUTCH-346?page=comments#action_12428917 ]
Stefan Groschupf commented on NUTCH-346:
+1
I agree, can you please create a patch file and attach it to this bug.
Thanks
Improve readability of logs
[
http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428918 ]
Stefan Groschupf commented on NUTCH-345:
Shouldn't the DeflateUtils also be part of the protocol-http plugin?
Also since it is a larger contribution
[
http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428537 ]
Stefan Groschupf commented on NUTCH-349:
My vote goes to #2.
Having a tool that needs to be started manually would be better than complicating
the already
[
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ]
Stefan Groschupf commented on NUTCH-233:
Hi Otis,
yes, for a serious whole web crawl I need to change this regex first.
It only hangs with some random urls
[ http://issues.apache.org/jira/browse/NUTCH-348?page=all ]
Stefan Groschupf updated NUTCH-348:
---
Attachment: sortPatchV1.patch
What do people think about this kind of solution?
Generator is building fetch list using *lowest* scoring URLs
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8-dev
When a page has no outlinks but several links to itself, e.g. it has a set of
anchors, the score of the page is distributed to its outlinks. But all these
outlinks point back to the page. This causes
[
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423539 ]
Stefan Groschupf commented on NUTCH-318:
Yes, this happens only in a distributed environment. Please also see my last
mail on the hadoop dev list. I think
[
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423433 ]
Stefan Groschupf commented on NUTCH-318:
Shouldn't that be fixed in .8, since as of today this tool just produces no output?!
log4j not properly configured
[
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ]
Stefan Groschupf commented on NUTCH-233:
I think this should be fixed in .8 too, since everybody that does a real whole
web crawl with over 100 million pages
I like it!
On 24.07.2006, at 16:10, Andrzej Bialecki wrote:
Stefan Neufeind wrote:
Andrzej Bialecki wrote:
Stefan Groschupf wrote:
Hi developers,
we have commands like readdb and readlinkdb but segread. Wouldn't it
be more consistent to name the command readseg instead of segread?
... just
Hi,
I remember there was a search result comparison tool within nutch.
Is that still alive? How do I use it / find it? I was not able to find
it by browsing the trunk sources.
Is there any such tool people can suggest to compare search results
with yahoo or google results to play with
Hi developers,
in the nutch-default.xml property plugin.includes we say: In any case
you need to at least include the nutch-extensionpoints plugin.
But we do not include it by default.
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-
I may - but since you know the details of the plugin subsystem,
tell me what _should_ be there? I.e. should we really include it in
the plugin.includes list, or not?
This is a philosophical question.
I personally prefer restrictive definitions, since the application's behavior
is better traceable.
-325
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
Fix For: 0.8-dev
In the URLFilters constructor we use an array whose length equals the number of
filters defined in the urlfilter.order property
[ http://issues.apache.org/jira/browse/NUTCH-325?page=all ]
Stefan Groschupf updated NUTCH-325:
---
Attachment: UrlFiltersNPE.patch
A patch that uses an ArrayList instead of an array and only puts entries into the
list when the entry is not null. Means
Hi Developers,
another thing in the discussion to be more polite.
I suggest that we log a message in case a requested URL was blocked
by a robots.txt.
Optimal would be if we only log this message in case the currently used
agent name is blocked specifically and it is not a general blocking of all
[ http://issues.apache.org/jira/browse/NUTCH-323?page=all ]
Stefan Groschupf updated NUTCH-323:
---
Attachment: MapWritableCopyConstructor.patch
The attached patch adds a copy constructor to the MapWritable and uses it in the
CrawlDatum.set method. However
Components: fetcher
Reporter: Stefan Groschupf
Priority: Critical
Configuration properties db.score.link.external and db.score.link.internal are
ignored.
In case of e.g. message board webpages or pages that have larger navigation
menus on each page having a lower impact
[ http://issues.apache.org/jira/browse/NUTCH-324?page=all ]
Stefan Groschupf updated NUTCH-324:
---
Attachment: InternalAndExternalLinkScoreFactor.patch
Multiply the score of a page during distributeScoreToOutlink with
db.score.link.internal
[ http://issues.apache.org/jira/browse/NUTCH-319?page=all ]
Stefan Groschupf resolved NUTCH-319.
Resolution: Won't Fix
Sorry, that is bogus since it is written to the logging stream.
OPICScoringFilter should use logging API instead
Hi,
shouldn't db.max.inlinks be in the nutch-default.xml configuration?
Stefan
Hi,
OPICScoringFilter line 91:
content.getMetadata().set(Fetcher.SCORE_KEY, "" + datum.getScore());
and in lines 96, 102 we set and get the fetch score as Strings. :-o
Wouldn't it be better to have the Metadata support floats as well
instead of serializing and parsing strings? (See the sketch below.)
In general wouldn't it
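A minimal sketch of the round trip criticized above, using java.util.Properties as a stand-in for Nutch's Metadata (the key name is a hypothetical example, not the actual Fetcher.SCORE_KEY value):

import java.util.Properties;

public class ScoreAsStringSketch {
  public static void main(String[] args) {
    Properties metadata = new Properties();   // stand-in for the Metadata object
    float score = 1.25f;

    // Write side, as in OPICScoringFilter line 91: float -> String
    metadata.setProperty("crawl.score", "" + score);

    // Read side: the String has to be parsed back into a float again
    float parsed = Float.parseFloat(metadata.getProperty("crawl.score"));
    System.out.println(parsed);
  }
}

A Metadata API that supports floats directly would avoid both conversions.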
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Assigned To: Andrzej Bialecki
Priority: Trivial
Fix For: 0.8-dev
OPICScoringFilter line 107 should be a logging call, not
e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it?
--
This message
changes in
version 0.8. The problem is the log message does not say what file is not
found. So, it's hard to debug. Any idea?
Thanks,
AJ
On 7/9/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Try to put the conf folder on your classpath in eclipse and set the
environment variables
: Stefan Groschupf
Priority: Critical
Fix For: 0.8-dev
In the latest .8 sources the readdb command doesn't dump any information
anymore.
This is related to the misconfigured log4j.properties file.
changing:
log4j.rootLogger=INFO,DRFA
to:
log4j.rootLogger=INFO,DRFA,stdout
dumps
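For the stdout appender referenced above to take effect, log4j.properties also needs an appender definition along these lines (a typical log4j 1.x console appender; the exact pattern shipped in Nutch's conf may differ):

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n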
Try to put the conf folder on your classpath in eclipse and set the
environment variables that are set in bin/nutch.
Btw, please do not crosspost.
Thanks.
Stefan
On 09.07.2006, at 21:47, AJ Chen wrote:
I checked out the 0.8 code from trunk and tried to set it up in
eclipse.
When trying
Hi,
this question is difficult to answer and maybe there are more experts on
the nutch user list than on the developer list.
In nutch 0.8 you can use the new scoring api to change the scoring of
a page for being scheduled for crawling based on its scores.
Have a look at the opic score
Hi Jérôme,
I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.
We should fix that.
Stefan
On 06.07.2006, at 08:54, Jérôme Charron wrote:
Hi,
I encountered some problems with Nutch trunk version.
In fact it seems to be related to changes related to
We tried your suggested fix:
Injector
by mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir))
and this worked without any problem.
Thanks for catching that, this saved us a lot of time.
Stefan
On 07.07.2006, at 16:08, Jérôme Charron wrote:
I have the same problem on
+1, but I really would love to see NUTCH-293 as part of nutch .8
since this is all about being more polite.
Thanks.
Stefan
On 05.07.2006, at 03:46, Doug Cutting wrote:
+1
Piotr Kosiorowski wrote:
+1.
P.
Andrzej Bialecki wrote:
Sami Siren wrote:
How would folks feel about releasing 0.8 now,
Hi,
as far as I can see nutch's html parser only supports the meta tag
noindex (<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">) but there
is an unofficial html noindex tag.
http://www.webmasterworld.com/forum10003/2703.htm
Maybe this would be another thing to make nutch more polite.
Also
Hi Feng,
MapWritable is a kind of hashmap.
You can put in any key value pair, but the keys and values need to be
Writables:
http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/Writable.html
You can use UTF8 as string key and value, or ByteWritable as key and
UTF8 as values.
Etc.
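A minimal sketch of that usage, assuming the Hadoop io classes named above and Nutch's own MapWritable (treat the package names and method signatures as assumptions; they differ between Nutch/Hadoop versions):

import org.apache.hadoop.io.UTF8;
import org.apache.nutch.crawl.MapWritable;

public class MapWritableSketch {
  public static void main(String[] args) {
    MapWritable meta = new MapWritable();
    // Keys and values must both be Writable implementations.
    meta.put(new UTF8("content-type"), new UTF8("text/html"));
    System.out.println(meta.get(new UTF8("content-type")));
  }
}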
Hi Lourival,
this means all pages older than 30 days are potential candidates for
a fetch list that is created by the segment generation process.
Stefan
On 12.06.2006, at 16:33, Lourival Júnior wrote:
Hi all!
I have a question about nutch-default.xml configuration file. There
is a
Ok. So, do you have any solution to do this job automatically? I have a shell
script, but I don't know if this really works yet.
Shell scripts are the best solution.
Sorry if I'm being redundant. I'm learning about this tool and I have a lot of
questions :).
No problem, but the nutch user
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]
Stefan Groschupf updated NUTCH-289:
---
Attachment: ipInCrawlDatumDraftV5.patch
Release Candidate 1 of this patch.
This patch contains:
+ add IP Address to CrawlDatum Version 5 (as byte[4
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]
Stefan Groschupf updated NUTCH-289:
---
Attachment: ipInCrawlDatumDraftV4.patch
Attached a patch that always uses only 4 bytes for the ip. Means we
ignore ipv6. This saves us a 4 byte
java doc of CrawlDb is wrong
Key: NUTCH-302
URL: http://issues.apache.org/jira/browse/NUTCH-302
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Priority: Trivial
Fix For: 0.8-dev
CrawlDb has the same java doc
[ http://issues.apache.org/jira/browse/NUTCH-301?page=all ]
Stefan Groschupf updated NUTCH-301:
---
Attachment: CommonGramsCacheV1.patch
Cache the HashMap COMMON_TERMS in the configuration instance.
CommonGrams loads analysis.common.terms.file for each query
[
http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415171 ]
Stefan Groschupf commented on NUTCH-293:
Any comments? There was already a posting in the nutch agent mailing list,
where someone had banned nutch since nutch does
Hi,
after playing around to figure out the best place to resolve IPs of
freshly discovered urls I agree with Andrzej that the
ParseOutputFormat isn't the best place.
The problem here is that ParseOutputFormat is not multithreaded and we
definitely need many threads for ip lookup (see the sketch below).
I think in
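A minimal sketch of running the lookups with a small thread pool (plain JDK classes; the pool size and host names are made-up examples):

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class IpLookupSketch {
  public static void main(String[] args) throws InterruptedException {
    String[] hosts = { "lucene.apache.org", "www.example.com" };
    ExecutorService pool = Executors.newFixedThreadPool(10); // many parallel DNS lookups
    for (int i = 0; i < hosts.length; i++) {
      final String host = hosts[i];
      pool.execute(new Runnable() {
        public void run() {
          try {
            // Blocking DNS resolution; running many of these in parallel hides the latency.
            InetAddress addr = InetAddress.getByName(host);
            System.out.println(host + " -> " + addr.getHostAddress());
          } catch (UnknownHostException e) {
            System.out.println(host + " -> unresolved");
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(60, TimeUnit.SECONDS);
  }
}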
[
http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415236 ]
Stefan Groschupf commented on NUTCH-293:
Hi Andrzej,
I agree but writing a queue based fetcher is a big step. I already have some
basic code (nio based).
Also I don't
As far as I understand hadoop uses commons logging. Should we switch to
commons logging as well?
On 06.06.2006, at 11:02, Jérôme Charron wrote:
URL: http://svn.apache.org/viewvc?rev=411943&view=rev
Log:
Updating to Hadoop release 0.3.1. Hadoop now uses Jakarta Commons
Logging, configured
[
http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ]
Stefan Groschupf commented on NUTCH-258:
Scott,
I agree with you. However we need a clean patch to solve the problem, we can
not just comment things out of the code
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]
Stefan Groschupf updated NUTCH-289:
---
Attachment: ipInCrawlDatumDraftV1.patch
To keep the discussion alive, attached is a _first draft_ for storing the ip in the
crawlDatum for public discussion
hmm... didn't think about that, are there more opinions about this?
I don't believe this "don't be evil" thing at all. I think it is just a
question of time until google feels we attack the appliance server market,
and I believe nutch has a serious chance to do so (some time in the
far future).
I have a proposal for a simple solution: set a flag in the current
Configuration instance, and check for this flag. The Configuration
instance provides a task-specific context persisting throughout the
lifetime of a task - but limited only to that task. Voila - problem
solved. We get rid
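A minimal sketch of that pattern with Hadoop's Configuration API (the property name is a hypothetical example, not one named in the mail):

import org.apache.hadoop.conf.Configuration;

public class TaskFlagSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The flag lives in the task's own Configuration instance, so it persists
    // for the lifetime of that task but is invisible to other tasks.
    conf.set("task.skip.parsing", "true");

    // ... later, anywhere in the same task, check it:
    if (conf.getBoolean("task.skip.parsing", false)) {
      System.out.println("parsing skipped for this task");
    }
  }
}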
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]
Stefan Groschupf updated NUTCH-298:
---
Summary: if a 404 for a robots.txt is returned a NPE is thrown (was: if a
404 for a robots.txt is returned no page is fetched at all from the host
The idea to have
something like this as a nutch-module (dropping pages or ranking them
very low) might come up :-)
This will be a very long road.
I collected some thoughts and a list of web spam related papers in my
blog.
http://www.find23.net/Web-Site/blog/521BA1CD-14C4-4E84-A072-
: Stefan Groschupf
Fix For: 0.8-dev
What happens:
If no RobotRuleSet is in the cache for a host, we try to fetch the
robots.txt.
In case the http response code is not 200 or 403 but for example 404, we do
robotRules = EMPTY_RULES; (line: 402)
EMPTY_RULES is a RobotRuleSet created
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]
Stefan Groschupf updated NUTCH-298:
---
Attachment: fixNpeRobotRuleSet.patch
Fixes the NPE in RobotRuleSet that happens in case we use an empty RuleSet
if a 404 for a robots.txt is returned no page
Hi,
just posted a fix for an NPE in case an empty RobotRuleSet is used.
The patch only contains a two-line fix, since I learned that this is the
best way to get things committed sooner. :)
However I really don't like the RobotRuleSet implementation since
entries are copied between an arraylist and a
[
http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414435 ]
Stefan Groschupf commented on NUTCH-282:
Is that related to the host grouping we discussed? Can we close this bug
in that case?
Showing too few results on a page (Paging
[
http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ]
Stefan Groschupf commented on NUTCH-286:
This is difficult to realize since the http error code is read from the response
in the fetcher and set into the protocol
[
http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ]
Stefan Groschupf commented on NUTCH-292:
+1, Can someone create a clean patch file?
OpenSearchServlet: OutOfMemoryError: Java heap space
[
http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414445 ]
Stefan Groschupf commented on NUTCH-291:
lastModified will only be indexed if you switch on the index-more plugin.
If you think you should change the way lastmodified
[
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ]
Stefan Groschupf commented on NUTCH-290:
If a parser throws an exception:
Fetcher, 261:
try {
parse = this.parseUtil.parse(content);
parseStatus
[ http://issues.apache.org/jira/browse/NUTCH-287?page=all ]
Stefan Groschupf closed NUTCH-287:
--
Resolution: Won't Fix
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04696.html
Exception when searching with sort
[ http://issues.apache.org/jira/browse/NUTCH-284?page=all ]
Stefan Groschupf closed NUTCH-284:
--
Resolution: Won't Fix
Yes, I was missing index-basic.
NullPointerException during index
-
Key: NUTCH-284
[
http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12414453 ]
Stefan Groschupf commented on NUTCH-284:
Please try to discuss such things first on the user mailing list before opening
an issue.
Maintaining the issue tracker is very time
[
http://issues.apache.org/jira/browse/NUTCH-281?page=comments#action_12414454 ]
Stefan Groschupf commented on NUTCH-281:
Can you submit a patch file?
cached.jsp: base-href needs to be outside comments
[
http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ]
Stefan Groschupf commented on NUTCH-274:
Should we fix this in TextInputFormat of Hadoop to ignore empty lines or in the
Injector?
Empty row in/at end of URL-list
[
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ]
Stefan Groschupf commented on NUTCH-290:
As far as I understand the code, the next parser is only used if the previous
parser returns with an unsuccessful parsing status
[ http://issues.apache.org/jira/browse/NUTCH-286?page=all ]
Stefan Groschupf closed NUTCH-286:
--
Resolution: Won't Fix
I hope everybody agrees with the statement: We cannot detect http response
codes based on the returned html content.
Prune
support for Crawl-delay in Robots.txt
-
Key: NUTCH-293
URL: http://issues.apache.org/jira/browse/NUTCH-293
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.8-dev
Reporter: Stefan Groschupf
[ http://issues.apache.org/jira/browse/NUTCH-293?page=all ]
Stefan Groschupf updated NUTCH-293:
---
Attachment: crawlDelayv1.patch
A first draft of crawl delay support for nutch. The problem I see is that in
case ip based delay is configured it can
Hi,
I heard there is a bug in JVM 1.5_06 beta, can you try an older or
maybe a 1.4 jvm and report if this happens with another jvm as well.
Thanks,
Stefan
On 30.05.2006, at 14:14, Uygar Yüzsüren wrote:
Hi everyone,
I am using Hadoop-0.2.0 and Nutch-0.8, and at the moment trying to
Think about using the google API.
However the way to go could be:
+ fetch your pages
+ do not parse the pages
+ write a map reduce job that extracts your data
++ make an xhtml dom from the html e.g. using neko
++ use xpath queries to extract your data (see the sketch below)
++ also check out gate as a named entity
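A minimal sketch of the neko + xpath step, assuming the NekoHTML DOMParser and the JDK 5 XPath API (the HTML string and the //TITLE expression are made-up examples; NekoHTML upper-cases element names by default):

import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NekoXPathSketch {
  public static void main(String[] args) throws Exception {
    String html = "<html><head><title>Example</title></head><body>...</body></html>";

    // Build a DOM from (possibly broken) HTML with NekoHTML.
    DOMParser parser = new DOMParser();
    parser.parse(new InputSource(new StringReader(html)));
    Document doc = parser.getDocument();

    // Extract the wanted data with an XPath query.
    XPath xpath = XPathFactory.newInstance().newXPath();
    String title = xpath.evaluate("//TITLE/text()", doc);
    System.out.println(title);
  }
}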