Hi,
I'm indexing blog permalinks taken from a Roller Weblogger aggregator
- like how Technorati does it. I noticed that 'inject' omits URLs with
'?' in them - blog URLs like ?p=100 (WordPress) and ?m=100 (FeedBurner).
How can I include these?
change:
NUTCH/conf/regex-urlfilter.txt
from:
-[?*!@=]
to:
-[*!@]
That's it.
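(With '?' and '=' dropped from that character class, query-style permalinks such as http://www.example.com/?p=100 - an example URL - pass the filter and can be injected.)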
Stefan
On 19.12.2005 at 11:56, Miguel A Paraz wrote:
Hi Jérôme,
Many thanks for this email. I had found I needed 'nutch-extensionpoints',
but with your explanation below I have a better understanding of the reason it
is needed.
Thanks once again.
Stephen
On 12/19/05, Jérôme Charron [EMAIL PROTECTED] wrote:
nutch-extensionpoints is the plugin
Hi,
I am crawling some sites using Nutch. My requirement is that when I run a Nutch
crawl, it should somehow be able to reuse the data in the webdb populated
in the previous crawl.
In other words, my question is: suppose my crawl is running and I cancel it
somewhere in the middle - is there some way I can
It is difficult to answer your question, since the vocabulary used may
be wrong.
You can refetch pages, no problem. But you cannot continue a crashed
fetch process.
Nutch provides a tool that runs a set of steps: segment
generation, fetching, db updating, etc.
So maybe first try to run
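In case it's useful, a minimal sketch of those steps as individual commands (Nutch 0.7 tool names; 'db', 'segments', and the segment name are example paths, not from this thread):

bin/nutch generate db segments   # FetchListTool: write a fetchlist into a new segment
s=segments/20051219142800        # the segment directory just created (example name)
bin/nutch fetch $s               # fetch the pages on that list
bin/nutch updatedb db $s         # fold the fetch results back into the webdb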
Hi Stefan,
Thanks for the lightning-fast reply. I was amazed to see such a quick response -
I really appreciate it.
Actually, what I am really looking for is this: suppose I run a crawl for some
sites, say 5, and to some depth, say 2. Then what I want is that the next time I
run a crawl, it should reuse the webdb
Still do not clearly understand your plans, sorry. However, pages from
the webdb are recrawled every 30 days (this is configurable in
nutch-default.xml).
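For reference, the property behind that interval in 0.7 is db.default.fetch.interval; a sketch of the stock entry, best overridden in conf/nutch-site.xml rather than edited in place:

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>The default number of days between re-fetches of a page.</description>
</property>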
The new folders are the so-called segments, and you can put them in the trash
after 30 days.
So what you can do is first never update your webdb
Actually, I wanted to reuse the processing I do in a particular crawl for
future crawls, so as to avoid downloading pages which are not of interest to me.
Here is an example:
1. Suppose I am crawling the http://www.abc.com website.
2. Then this gets injected into the webdb, and the FetchListTool populates
For this kind of blocking you can try to use the url filters - change the
filter between each fetch/generate:
+^http://www.abc.com
-^http://www.bbc.co.uk
Pushpesh Kr. Rajwanshi wrote:
Oh, this is pretty good and quite helpful material - just what I wanted. Thanks,
Havard, for this. Seems like this will help me
Hmmm... actually my requirement is a bit more complex than it seems, so url
filters alone probably won't do. I am not filtering urls based only
on some domain name; within a domain I want to discard some urls, and since
they don't actually follow a pattern, I can't use url filters
Where can I find the build instructions for Nutch?
Just typing ant ended with an error complaining that
there is no such directory as
...\src\plugin\nutch-extensionpoints\src\java
This is the Nutch 0.7.1 download, and I'm trying to build
on Windows XP Professional with Cygwin and JDK 1.5.
(I tried
Hello, I ran into the same problem (which I think is fixed in future
releases). For Nutch 0.7.1, just create the missing directories and run
the ant script again.
HTH,
DaveG
-Original Message-
From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED]
Sent: Monday, December 19, 2005 2:38 PM
This is a known bug. Just create an empty folder
...\src\plugin\nutch-extensionpoints\src\java
and it will work. This is fixed in the latest trunk, which you can check out
from Apache's Subversion server.
Stefan
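A minimal sketch of that workaround, run from the top of the unpacked 0.7.1 tree (Cygwin or any Unix-like shell):

mkdir -p src/plugin/nutch-extensionpoints/src/java   # recreate the missing, empty source folder
ant                                                  # then build again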
On 19.12.2005 at 20:38, Teruhiko Kurosaka wrote:
Teruhiko Kurosaka wrote:
Where can I find the build instructions for Nutch?
Just typing ant ended with an error complaining that
there is no such directory as
...\src\plugin\nutch-extensionpoints\src\java
mkdir -p that directory and try again.
If you're tracking your build in a local CVS,
It is a known bug in the 0.7.1 distribution. You can get the sources
directly from svn, and those build fine. It is also fixed in preparation for
the 0.7.2 release and in trunk. Or you can fix it locally by creating the empty
src/java folder. I am not sure if it is the only empty folder missing
in
Thank you, everybody. I can build now!
-Original Message-
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED]
Sent: December 19, 2005 11:42
To: nutch-user@lucene.apache.org
Subject: RE: build instructions?
Pushpesh,
We extended nutch with a whitelist filter and you might find it useful.
Check the comments from Matt Kangas here:
http://issues.apache.org/jira/browse/NUTCH-87?page=all
--Flo
Pushpesh Kr. Rajwanshi wrote:
I have followed the tutorial at media-style.com and actually have a
mapred installation of nutch working. Thanks Stefan :)
My question now is: what are the correct steps to continuously fetch and index? I
have read some people talking about mergesegs and updatedb; however,
Stefan's tutorial doesn't list these
Stefan's tutorial doesn't list these as steps.
I will add these steps, hopefully within this year.
If you want to
continually fetch more and more levels from your crawldb and
appropriately update your index, what is the correct method for
doing so?
Currently I am doing this:
generate
fetch
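For what it's worth, a sketch of one full cycle on the mapred code (command names from the trunk that became 0.8; the crawl/ layout and segment name are examples, not from this thread):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s=crawl/segments/20051219150000      # the segment just generated (example name)
bin/nutch fetch $s
bin/nutch updatedb crawl/crawldb $s  # so the next generate sees newly discovered links
bin/nutch invertlinks crawl/linkdb $s
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $s
bin/nutch dedup crawl/indexes
bin/nutch merge crawl/index crawl/indexes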
Hi all,
I've been grubbing around with Nutch for a while now, although I'm
still working with 0.7 code. I notice that when anchors are collected
for a document, they're made unique by domain and by anchor text.
I'm using Nutch for an intranet style search engine, on a single
site, so I don't
Hi,
did you try...

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored. This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>
Thank you, Stefan, for your speedy response.
I have indeed changed that setting to false. However, that doesn't
deal with my problem. The offending method is getAnchors in
org.apache.nutch.db.WebDBAnchors, which is called from
org.apache.nutch.tools.FetchListTool. This method makes the array
Hi Nguyen,
Thank you for your information, but I would like to confirm it. I do see a
variable that defines the next fetch interval, but I am not sure of it. If
anyone has more information in this regard, please let me know.
Thank you in advance,
On 12/19/05, Nguyen Ngoc Giang [EMAIL
The scheme of intranet crawling is like this: Firstly, you create a webdb
using WebDBAdminTool. After that, you fetch a seed URL using WebDBInjector.
The seed URL is inserted into your webdb, marked by current date and time.
Then, you create a fetch list using FetchListTool. The FetchListTool
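A sketch of those tool invocations (Nutch 0.7 command names; 'db', 'seed.txt', and 'segments' are example paths):

bin/nutch admin db -create              # WebDBAdminTool: create an empty webdb
bin/nutch inject db -urlfile seed.txt   # WebDBInjector: insert the seed URLs, stamped with the current time
bin/nutch generate db segments          # FetchListTool: write a fetchlist of everything now due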
Hi,
Does Nutch's search result show similar pages, like Google does? I went to
Modzex.com, which is using Nutch, but I don't see similar pages in its search
results.
Many thanks.