[ http://issues.apache.org/jira/browse/NUTCH-315?page=all ]
Sami Siren resolved NUTCH-315.
--
Resolution: Duplicate
duplicate of NUTCH-318
> CrawlDbReader usage text - implementation mismatch
> --
>
>
[
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423546 ]
Sami Siren commented on NUTCH-318:
--
I agree :) So the next thing to do is to change readdb -stats to print to stdout;
I'll go ahead and do that. Are there any other c
[
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423542 ]
Andrzej Bialecki commented on NUTCH-318:
-
I also think that producing no output on the console is confusing to new users,
especially in "local" mode. I
[
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423539 ]
Stefan Groschupf commented on NUTCH-318:
Yes, this happens only in a distributed environment. Please also see my last
mail on the hadoop dev list. I think th
[
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423531 ]
Sami Siren commented on NUTCH-318:
--
Perhaps this is happening in a distributed setup? In a one-machine setup the
output goes to the log file; see NUTCH-315
> log4j not proper
Is there any nutch API that can do this?
-Original Message-
From: Lourival Júnior [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 26, 2006 1:41 AM
To: nutch-dev@lucene.apache.org
Subject: Re: How can i get a page content or parse data by the page's url
If I'm not wrong, you can't do this. The
[ http://issues.apache.org/jira/browse/NUTCH-258?page=all ]
Chris A. Mattmann updated NUTCH-258:
Fix Version/s: 0.8-dev
> Once Nutch logs a SEVERE log item, Nutch fails forevermore
> --
>
>
[ http://issues.apache.org/jira/browse/NUTCH-330?page=all ]
Renaud Richardet updated NUTCH-330:
---
Attachment: clSearch.diff
forgot the "echo" in sh...
> command line tool to search a Lucene index
> --
>
>
[
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ]
Stefan Groschupf commented on NUTCH-233:
I think this should be fixed in .8 too, since everybody that does a real whole-web
crawl with over 100 million pages wi
[ http://issues.apache.org/jira/browse/NUTCH-330?page=all ]
Renaud Richardet updated NUTCH-330:
---
Attachment: clSearch.diff
unified diff against head
> command line tool to search a Lucene index
> --
>
>
command line tool to search a Lucene index
--
Key: NUTCH-330
URL: http://issues.apache.org/jira/browse/NUTCH-330
Project: Nutch
Issue Type: Improvement
Components: searcher
Affects Versio
[
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423433 ]
Stefan Groschupf commented on NUTCH-318:
Shouldn't that be fixed in .8, since as of today this tool just produces no output?!
> log4j not proper configured, rea
[ http://issues.apache.org/jira/browse/NUTCH-325?page=all ]
Sami Siren updated NUTCH-325:
-
Fix Version/s: 0.9-dev
(was: 0.8-dev)
> UrlFilters.java throws NPE in case urlfilter.order contains Filters that are
> not in plugin.includes
>
[ http://issues.apache.org/jira/browse/NUTCH-247?page=all ]
Sami Siren updated NUTCH-247:
-
Fix Version/s: 0.9-dev
(was: 0.8-dev)
> robot parser to restrict.
> -
>
> Key: NUTCH-247
>
[ http://issues.apache.org/jira/browse/NUTCH-233?page=all ]
Sami Siren updated NUTCH-233:
-
Fix Version/s: 0.9-dev
(was: 0.8-dev)
> wrong regular expression hang reduce process for ever
>
[ http://issues.apache.org/jira/browse/NUTCH-310?page=all ]
Sami Siren updated NUTCH-310:
-
Fix Version/s: 0.9-dev
(was: 0.8-dev)
> Review Log Levels
> -
>
> Key: NUTCH-310
> URL: http://i
[ http://issues.apache.org/jira/browse/NUTCH-262?page=all ]
Sami Siren updated NUTCH-262:
-
Fix Version/s: 0.9-dev
(was: 0.8-dev)
> Summary excerpts and highlights problems
>
>
>
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]
Sami Siren updated NUTCH-322:
-
Fix Version/s: 0.9-dev
(was: 0.8-dev)
> Fetcher discards ProtocolStatus, doesn't store redirected pages
> --
[ http://issues.apache.org/jira/browse/NUTCH-318?page=all ]
Sami Siren updated NUTCH-318:
-
Fix Version/s: 0.9-dev
(was: 0.8-dev)
> log4j not proper configured, readdb doesnt give any information
> --
[ http://issues.apache.org/jira/browse/NUTCH-251?page=all ]
Sami Siren updated NUTCH-251:
-
Fix Version/s: 0.9-dev
(was: 0.8-dev)
> Administration GUI
> --
>
> Key: NUTCH-251
> URL: http:/
[ http://issues.apache.org/jira/browse/NUTCH-246?page=all ]
Sami Siren updated NUTCH-246:
-
Fix Version/s: 0.9-dev
(was: 0.8-dev)
> segment size is never as big as topN or crawlDB size in a distributed
> deployement
> -
[ http://issues.apache.org/jira/browse/NUTCH-74?page=all ]
Sami Siren updated NUTCH-74:
Fix Version/s: 0.9-dev
(was: 0.8-dev)
> French Analyzer Plugin
> --
>
> Key: NUTCH-74
> URL: ht
[ http://issues.apache.org/jira/browse/NUTCH-86?page=all ]
Sami Siren updated NUTCH-86:
Fix Version/s: 0.9-dev
(was: 0.8-dev)
> LanguageIdentifier API enhancements
> ---
>
> Key: NUTCH-86
[ http://issues.apache.org/jira/browse/NUTCH-249?page=all ]
Sami Siren updated NUTCH-249:
-
Fix Version/s: 0.9-dev
(was: 0.8-dev)
> black- white list url filtering
> ---
>
> Key: NUTCH-249
>
I'm interested in a plugin to filter results so that they are limited to
a collection of domains that are specified by the user at the time of
the search.
If such a filter does not currently exist I'm willing to work on one if
someone is willing to point me in the right direction.
rjsjr
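The filter described above doesn't need much machinery to prototype. A minimal, Nutch-independent sketch in plain Java (the class and method names here are hypothetical, not an existing Nutch plugin) could restrict hits to a user-supplied set of domains like this:

```java
import java.net.URI;
import java.util.*;

// Hypothetical sketch: keep only result URLs whose host falls under
// one of the domains the user supplied at search time.
public class DomainFilter {
    private final Set<String> allowed;

    public DomainFilter(Collection<String> domains) {
        this.allowed = new HashSet<>(domains);
    }

    /** True if the URL's host equals an allowed domain or is a subdomain of one. */
    public boolean accepts(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) return false;
            for (String d : allowed) {
                if (host.equals(d) || host.endsWith("." + d)) return true;
            }
            return false;
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: drop it
        }
    }

    public static void main(String[] args) {
        DomainFilter f = new DomainFilter(Arrays.asList("apache.org"));
        System.out.println(f.accepts("http://lucene.apache.org/nutch/")); // subdomain of apache.org
        System.out.println(f.accepts("http://example.com/page"));         // not in the allowed set
    }
}
```

The same predicate could then be applied to the hit list after the query runs, which is the "point me in the right direction" part: the filtering logic itself is small; wiring it into the search path as a plugin is the real work.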
If I'm not wrong, you can't do this. The segread command only accepts these
arguments:
SegmentReader [-fix] [-dump] [-dumpsort] [-list] [-nocontent] [-noparsedata]
[-noparsetext] (-dir segments | seg1 seg2 ...)
NOTE: at least one segment dir name is required, or '-dir' option.
-fix
Hi all,
How can I get a page's content or parse data by the page's URL?
Just like the command:
$ bin/nutch segread crawl/segments/20060725213636/ -dump
will dump pages in the segment.
I'm using nutch 0.7.2 on cygwin under winxp.
Thanks!
Aaron
Robert Sanford wrote:
> Running Nutch 0.7.2 but I'm willing to move up to 0.8 if need be.
>
> I have created an "Intranet" crawl using the file containing a list of
> URIs and the list of regex to allow in conf/crawl-urlfilter.txt. Using
search.jsp I get lots and lots of good results so I'm quite
I am currently running Nutch 0.7.2 under Jboss 4.0.1 using Java 1.5.0_01
for Win32. The index runs were created under Cygwin.
What I have found so far is that Nutch will not index keywords within
the href attribute of an anchor tag and I want Nutch to do so.
I provide a co-branding service for cu
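Extracting the keyword text out of href attributes is, at its core, a small parsing step. A minimal sketch in plain Java (a regex illustration of the idea, not the actual Nutch HTML parser, which walks a DOM) might look like:

```java
import java.util.*;
import java.util.regex.*;

// Illustration only: pull href attribute values out of anchor tags --
// the extra text one would feed to the indexer alongside the page body.
public class HrefExtractor {
    private static final Pattern HREF =
        Pattern.compile("<a\\s[^>]*href\\s*=\\s*[\"']([^\"']+)[\"']",
                        Pattern.CASE_INSENSITIVE);

    public static List<String> hrefs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            out.add(m.group(1)); // captured href value
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/red-widgets\">widgets</a>"
                    + " <A HREF='http://example.com/blue-gadgets'>gadgets</A></p>";
        for (String h : hrefs(html)) {
            System.out.println(h);
        }
    }
}
```

In a real deployment this would live in a parse filter so the extracted values end up in an indexed field, but the sketch shows the raw material is easy to get at.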
Running Nutch 0.7.2 but I'm willing to move up to 0.8 if need be.
I have created an "Intranet" crawl using the file containing a list of
URIs and the list of regex to allow in conf/crawl-urlfilter.txt. Using
search.jsp I get lots and lots of good results so I'm quite happy so
far.
But, I want to
Sami Siren wrote:
There is a package available for testing in
http://people.apache.org/~siren/nutch-0.8/
please give it some testing and post your opinion - is it good
enough to be a public release?
I have some doubts because of NUTCH-266, but so far only 3 people have
reported this to b
There is a package available for testing in
http://people.apache.org/~siren/nutch-0.8/
please give it some testing and post your opinion - is it good enough
to be a public release?
I have some doubts because of NUTCH-266, but so far only 3 people have
reported this to be a problem
(me incl