how to upgrade a java application with nutch?

2009-10-01 Thread Jaime Martín
Hi!
I have a Java application that I would like to upgrade with Nutch. Which jars
should I add to my application's lib directory so that I can use Nutch
features from some of my app pages and business logic classes?
I tried the nutch-1.0.jar generated by the war target, without success.
Which Nutch build.xml target should I run for this, and which of the
generated jars need to be included in my app? Apart from nutch-1.0.jar,
are all the nutch-1.0\lib jars required, or just a few of them?
Thanks in advance!


Re: how to upgrade a java application with nutch?

2009-10-01 Thread Paul Tomblin

Maybe I'm doing it wrong, but I used the nutch-1.0.job file instead of the
jar.

-- 
http://www.linkedin.com/in/paultomblin


Nutch randomly skipping locations during crawl

2009-10-01 Thread tsmori

This is strange. I manage the webservers for a large university library. On
our site we have a staff directory where each user has a location for
information. The URLs take the form of:

http://mydomain.edu/staff/userid

I've added the staff URL to the urls seed file. But even with a crawl set to
depth of 8 and unlimited files, i.e. no topN setting, the crawl still seems
to only fetch about 50% of the locations in this area of the site. 

What should I look for to find out why this is happening?


-- 
View this message in context: 
http://www.nabble.com/Nutch-randomly-skipping-locations-during-crawl-tp25696893p25696893.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: how to upgrade a java application with nutch?

2009-10-01 Thread Andrzej Bialecki


Nutch is not designed for embedding in other applications, so you may 
face numerous problems. I did such an integration once, and it was far 
from obvious. A lot also depends on whether you want to run it on a 
distributed cluster or in a single JVM (local mode).

Take a look at build/nutch*.job; it's a jar file that contains all 
dependencies needed to run Nutch except for the Hadoop libraries (which 
are also required).
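
As a rough sketch of how one might see which dependency jars the job file bundles: list the archive and keep only the jar entries. The path build/nutch-1.0.job and the assumption that the bundled jars sit under lib/ inside the archive should be checked against your own build; `bundled_jars` is a hypothetical helper name, not part of Nutch.

```shell
# Hypothetical helper: reads an archive listing on stdin and keeps only
# entries that look like bundled dependency jars (assumed to be under lib/).
bundled_jars() {
  grep -E '^lib/[^/]+\.jar$' | sort
}

# Usage (needs the JDK's jar tool; the job file path is an assumption):
#   jar tf build/nutch-1.0.job | bundled_jars
```

Whatever this prints still has to be complemented with the Hadoop libraries, which the job file does not include.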


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch randomly skipping locations during crawl

2009-10-01 Thread Andrzej Bialecki


* Check that the pages there are not forbidden by robot rules (which may 
be embedded inside HTML meta tags of index.html, or the top-level 
robots.txt).

* Check that your crawldb actually contains entries for these pages; 
perhaps they are being filtered out.

* Check your segments to see whether these URLs were scheduled for 
fetching and, if so, what the fetch status was.
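
The crawldb and segment checks above could be run roughly like this. It is only a sketch: the helper names are hypothetical, and the paths (./bin/nutch, a crawl directory named crawl, the segment name) are assumptions to adapt to your installation.

```shell
# Assumed locations; override NUTCH and CRAWL to match your setup.
NUTCH="${NUTCH:-./bin/nutch}"
CRAWL="${CRAWL:-crawl}"

# Show the crawldb entry (status, fetch time, ...) for a single URL.
check_crawldb() {
  "$NUTCH" readdb "$CRAWL/crawldb" -url "$1"
}

# Show overall status counts for the crawldb.
crawldb_stats() {
  "$NUTCH" readdb "$CRAWL/crawldb" -stats
}

# Dump only the crawl_generate and crawl_fetch parts of a segment as text.
# $1 = segment directory, $2 = output directory for the dump.
dump_segment_status() {
  "$NUTCH" readseg -dump "$1" "$2" -nocontent -noparse -noparsedata -noparsetext
}

# Usage:
#   check_crawldb http://mydomain.edu/staff/userid
#   crawldb_stats
#   dump_segment_status crawl/segments/20091001120102 segdump
#   then search the text written under segdump/ for the missing URLs
```

If a URL never shows up in the crawldb, it was probably dropped by a filter or never discovered; if it is there but absent from crawl_generate, it was never scheduled for fetching.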






Re: how to upgrade a java application with nutch?

2009-10-01 Thread Jaime Martín
Thank you for the info. That's really a problem. I have a Java project and
for some of its new features I would like to use Nutch. As I need to
customise Nutch, my idea was as follows:
- 1st: change whatever my requirements need in my downloaded Nutch and
generate a Nutch library
- 2nd: add that library to the other project and invoke the library's features
when needed

Is that not advisable? What is the best way, then, to generate a Nutch library
to be used in other Java projects? Or is that not possible without going
crazy over configuration issues?




RE: Nutch randomly skipping locations during crawl

2009-10-01 Thread BELLINI ADAM

Yes, also check whether some user IDs contain characters such as ?, @, *, !, or =.

URLs containing them are filtered out by default by this rule: -[?*!@=]
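
To get a feel for which staff URLs such a rule would drop, here is a small sketch that tests URLs against a character class built from the characters listed above. It assumes the rule is applied as a regular expression searched anywhere in the URL; `would_be_filtered` is a hypothetical helper and the example user IDs are made up.

```shell
# Character class built from the characters listed above.
RULE='[?*!@=]'

# Hypothetical helper: succeeds if the URL contains one of those characters.
would_be_filtered() {
  printf '%s\n' "$1" | grep -q -E "$RULE"
}

# Made-up example URLs, for illustration only.
for url in \
  'http://mydomain.edu/staff/jsmith' \
  'http://mydomain.edu/staff/j.smith@lib' \
  'http://mydomain.edu/staff/index.php?id=42'
do
  if would_be_filtered "$url"; then
    echo "filtered: $url"
  else
    echo "kept:     $url"
  fi
done
```

Running the real list of staff URLs through a check like this shows quickly whether the skipped ones share such a character.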







Re: how to upgrade a java application with nutch?

2009-10-01 Thread Ken Krugler

Hi Jaime,

Depending on what exactly you're trying to do, there are some other  
projects that offer crawler functionality which could be easier to  
embed.


The two I know about are:

 - Droids (http://incubator.apache.org/droids/), though I haven't  
really used it.
 - Bixo (http://bixo.101tec.com/), which is a project I'm actively  
working on.


-- Ken


--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-210-6378



RE: how to upgrade a java application with nutch?

2009-10-01 Thread Fuad Efendi
Hi Jaime,

You don't have to embed; try (simplified) Nutch + SOLR (Nutch has a plugin for
SOLR), and use the SolrJ client for SOLR from your application. This is very
easy.
-Fuad


http://www.linkedin.com/in/liferay





Re: R: Using Nutch for only retriving HTML

2009-10-01 Thread Andrzej Bialecki

BELLINI ADAM wrote:

> Hi,
> but how do I dump the content? I tried this command:
>
> ./bin/nutch readseg -dump crawl/segments/20090903121951/content/ toto
>
> and it said:
>
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/usr/local/nutch-1.0/crawl/segments/20091001120102/content/crawl_generate
>
> But crawl_generate is in this path:
>
> /usr/local/nutch-1.0/crawl/segments/20091001120102
>
> and not in this one:
>
> /usr/local/nutch-1.0/crawl/segments/20091001120102/content
>
> Can you please just give me the correct command?


This command will dump just the content part:

./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch -nogenerate -noparse -noparsedata -noparsetext





RE: Nutch randomly skipping locations during crawl

2009-10-01 Thread tsmori

Both good ideas. Unfortunately, the content for each user is the same. It's a
static php file that simply calls information out of our LDAP.

It's very strange because I cannot see any difference between the user
files/directories that are fetched and those that aren't. In checking both
the crawl log and the hadoop log, the missing users are not even fetched. 

If it's a permissions issue, it's a very odd one. All the directories here
have the same group membership and all files and directories under it are
owner, group, and world readable/executable.

The issue seems to be that they're not fetched and there's no indication in
the logs why they aren't.



Re: Nutch randomly skipping locations during crawl

2009-10-01 Thread Andrzej Bialecki

tsmori wrote:

> Both good ideas. Unfortunately, the content for each user is the same. It's a
> static php file that simply calls information out of our LDAP.
>
> It's very strange because I cannot see any difference between the user
> files/directories that are fetched and those that aren't. In checking both
> the crawl log and the hadoop log, the missing users are not even fetched.

Check the segment's crawl_generate and crawl_fetch, and also check your 
crawldb for status. Logs don't always contain this information.

> The issue seems to be that they're not fetched and there's no indication in
> the logs why they aren't.

See above.





Re: Something wrong with nutch.wiki

2009-10-01 Thread Kirby Bohling
2009/9/29 Ольга Пескова opesk...@mail.ru:
> Hello!
>
> Please check the url:
> http://wiki.apache.org/nutch/
> I can't find any content there.

Just as a point of reference, I got the FrontPage to pull up just
prior to sending this e-mail.  I'm not sure what is wrong with your
connection to it, but I don't believe it is the server.

Kirby


Re: Something wrong with nutch.wiki

2009-10-01 Thread Paul Tomblin
It was down for a number of hours today, but evidently it's back up now.





Fetcher problems with stable version of nutch-1.0 ?

2009-10-01 Thread Vijay
Hi all,

I am trying to use nutch to crawl and index a list of about 50K URLs
with depth=1. I am running indexing with the command:

nutch-1.0/bin/nutch crawl urls/ -depth 1 -topN 10

with appropriate changes to the configuration files.

I find that the fetching always terminates prematurely and the logs show
an error that looks like:


activeThreads=200, spinWaiting=200, fetchQueues.totalSize=1
Aborting with 200 hung threads.
Fetcher: done


I have not seen this particular error message when using nutch-0.9. Is it
advisable to revert to using nutch-0.9? Or do we have some kind of patch to
fix this error?



Thanks,
Vijay


RE: Something wrong with nutch.wiki

2009-10-01 Thread Brian Tingle
FWIW, I often have problems getting to wiki.apache.org.  I could not get there 
this morning, and had to read what I needed out of the google cache.
