Hi!
I have a Java application that I would like to upgrade with Nutch. What jars
should I add to my application's lib to make it possible to use Nutch
features from some of my app pages and business logic classes?
I've tried with the nutch-1.0.jar generated by the war target, without success.
I wonder what is
2009/10/1 Jaime Martín james...@gmail.com
Hi!
I have a Java application that I would like to upgrade with Nutch. What
jars should I add to my application's lib to make it possible to use Nutch
features from some of my app pages and business logic classes?
I've tried with nutch-1.0.jar
This is strange. I manage the webservers for a large university library. On
our site we have a staff directory where each user has a location for
information. The URLs take the form of:
http://mydomain.edu/staff/userid
I've added the staff URL to the urls seed file. But even with a crawl set to
Jaime Martín wrote:
Hi!
I have a Java application that I would like to upgrade with Nutch. What jars
should I add to my application's lib to make it possible to use Nutch
features from some of my app pages and business logic classes?
I've tried with the nutch-1.0.jar generated by the war target without
tsmori wrote:
This is strange. I manage the webservers for a large university library. On
our site we have a staff directory where each user has a location for
information. The URLs take the form of:
http://mydomain.edu/staff/userid
I've added the staff URL to the urls seed file. But even with
Thank you for the info. That's really a problem. I have a Java project and
for some of its new features I would like to use Nutch. As I need to
customise Nutch, my idea was this:
- 1st: change what's needed for my requirements in my downloaded Nutch and
generate a Nutch library
- 2nd: add that
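Along the lines of the plan above, one way to drive a customised Nutch 1.0 build from a Java application is to call the crawl tool's entry point directly. This is only a sketch under assumptions: the Nutch job jar, its lib/ dependencies, the plugins/ directory and a conf/ directory (nutch-default.xml, nutch-site.xml) all need to be reachable from the application classpath, and the seed directory and crawl parameters below are made-up examples.

// Sketch: programmatic equivalent of "bin/nutch crawl urls -dir crawl -depth 3 -topN 50".
// Assumes the Nutch 1.0 jars, conf/ and plugins/ are on the classpath.
import org.apache.nutch.crawl.Crawl;

public class EmbeddedCrawl {
    public static void main(String[] args) throws Exception {
        Crawl.main(new String[] {
            "urls",            // directory containing the seed list
            "-dir", "crawl",   // where crawldb, linkdb and segments are written
            "-depth", "3",
            "-topN", "50"
        });
    }
}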
Yes, also check whether some userids contain characters like ?, @, *, !, or =;
URLs containing them are filtered out by default: -[?*!@=]
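To illustrate what that rule does (assuming the stock -[?*!@=] line in conf/crawl-urlfilter.txt / conf/regex-urlfilter.txt of Nutch 1.0), here is a small standalone check; the staff URLs are made-up examples:

// Illustration only: the default rule drops any URL that matches [?*!@=] anywhere.
import java.util.regex.Pattern;

public class UrlFilterCheck {
    private static final Pattern SKIP = Pattern.compile("[?*!@=]");

    public static void main(String[] args) {
        String[] urls = {
            "http://mydomain.edu/staff/jsmith",      // no special characters: allowed
            "http://mydomain.edu/staff/j.smith@dept" // contains '@': filtered out by -[?*!@=]
        };
        for (String url : urls) {
            boolean skipped = SKIP.matcher(url).find();
            System.out.println(url + " -> " + (skipped ? "filtered" : "allowed"));
        }
    }
}

If the staff userids legitimately contain such characters, removing or relaxing that line in the filter configuration is the usual workaround.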
Date: Thu, 1 Oct 2009 18:15:38 +0200
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: Nutch randomly skipping locations during crawl
tsmori wrote:
Hi Jaime,
Depending on what exactly you're trying to do, there are some other
projects that offer crawler functionality which could be easier to
embed.
The two I know about are:
- Droids (http://incubator.apache.org/droids/), though I haven't
really used it.
- Bixo
Hi Jaime,
You don't have to embed; try (simplified) Nutch + SOLR (Nutch has a plugin for
SOLR), and use the SolrJ client for SOLR from your application. This is very
easy.
-Fuad
http://www.linkedin.com/in/liferay
-Original Message-
From: Jaime Martín [mailto:james...@gmail.com]
Sent:
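To make Fuad's suggestion concrete, here is a minimal SolrJ sketch, assuming a Solr 1.3/1.4-era client, a core at http://localhost:8983/solr that Nutch's solrindex step has populated, and the url/title/content field names from the Nutch example schema; adjust all of these to the real setup:

// Sketch: querying the Solr index from the Java application instead of embedding Nutch.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class StaffSearch {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("content:library");  // example query
        query.setRows(10);
        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("url") + " : "
                + doc.getFieldValue("title"));
        }
    }
}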
BELLINI ADAM wrote:
Hi,
but how do I dump the content? I tried this command:
./bin/nutch readseg -dump crawl/segments/20090903121951/content/ toto
and it said:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
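The likely cause of that error is the path: readseg -dump expects the segment directory itself (it looks for content, parse_data, etc. inside it), not the content/ subdirectory. As a sketch, the same dump can also be run from Java by calling the class behind bin/nutch readseg in Nutch 1.0; the paths below are the ones from the command above:

// Sketch: programmatic equivalent of "bin/nutch readseg -dump <segment> <output>".
import org.apache.nutch.segment.SegmentReader;

public class DumpSegment {
    public static void main(String[] args) throws Exception {
        SegmentReader.main(new String[] {
            "-dump",
            "crawl/segments/20090903121951",  // the segment directory, not .../content/
            "toto"                            // output directory for the dump
        });
    }
}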
Both good ideas. Unfortunately, the content for each user is the same. It's a
static php file that simply calls information out of our LDAP.
It's very strange because I cannot see any difference between the user
files/directories that are fetched and those that aren't. In checking both
the crawl
tsmori wrote:
Both good ideas. Unfortunately, the content for each user is the same. It's a
static php file that simply calls information out of our LDAP.
It's very strange because I cannot see any difference between the user
files/directories that are fetched and those that aren't. In checking
2009/9/29 Ольга Пескова opesk...@mail.ru:
Hello!
Please check the url:
http://wiki.apache.org/nutch/
I can't find any content there.
Just as a point of reference, I got the FrontPage to pull up just
prior to sending this e-mail. I'm not sure what is wrong with your
connection to it, but I
2009/10/1 Kirby Bohling kirby.bohl...@gmail.com:
2009/9/29 Ольга Пескова opesk...@mail.ru:
Hello!
Please check the url:
http://wiki.apache.org/nutch/
I can't find any content there.
Just as a point of reference, I got the FrontPage to pull up just
prior to sending this e-mail. I'm not
Hi all,
I am trying to use Nutch to crawl and index a list of about 50K URLs
with depth=1. I am running the crawl with the command:
nutch-1.0/bin/nutch crawl urls/ -depth 1 -topN 10
with appropriate changes to the configuration files.
I find that the fetching always terminates
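Worth noting in passing: -topN caps how many URLs the generator selects per fetch round, so with -depth 1 and -topN 10 at most 10 of the 50K seeds are fetched in that single round. One way to see how many URLs were actually fetched (not something suggested in this thread, just an illustration) is the crawldb stats tool; a sketch assuming Nutch 1.0 and a crawl output directory named crawl/ (the command above lets the crawl tool pick its own default directory, so adjust the path):

// Sketch: programmatic equivalent of "bin/nutch readdb crawl/crawldb -stats".
import org.apache.nutch.crawl.CrawlDbReader;

public class CrawlDbStats {
    public static void main(String[] args) throws Exception {
        CrawlDbReader.main(new String[] {
            "crawl/crawldb",  // path to the crawldb (assumption: output dir is crawl/)
            "-stats"          // print counts of fetched/unfetched URLs per status
        });
    }
}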
FWIW, I often have problems getting to wiki.apache.org. I could not get there
this morning, and had to read what I needed out of the google cache.
|-Original Message-
|From: ptomb...@gmail.com [mailto:ptomb...@gmail.com] On Behalf Of Paul
|Tomblin
|Sent: Thursday, October 01, 2009 4:32