Re: Update on ignoring menu divs

2010-02-28 Thread Sami Siren

Andrzej Bialecki wrote:

On 2010-02-28 18:42, Ian M. Evans wrote:

Using Nutch as a crawler for solr.

I've been digging around the nutch-user archives a bit and have seen
some people discussing how to ignore menu items or other unnecessary div
areas like common footers, etc. I still haven't come across a full
answer yet.

There is no such functionality out of the box. One direction that is 
worth pursuing would be to create an HtmlParseFilter plugin that wraps 
the Boilerpipe library http://code.google.com/p/boilerpipe/ .


Andrzej, have you tested that lib? If the results are of decent quality it 
would be nice to have it wrapped as a plugin in Nutch.
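
For reference, here is a rough, untested sketch of what such a wrapper could 
look like. It assumes the Nutch 1.0 HtmlParseFilter interface and Boilerpipe's 
ArticleExtractor; the class name and package layout are made up for 
illustration, and error handling is minimal:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

// Hypothetical HtmlParseFilter that replaces the extracted text of a page
// with Boilerpipe's main-content text (menus, footers etc. stripped).
public class BoilerpipeParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    try {
      String html = new String(content.getContent(), "UTF-8");
      String mainText = ArticleExtractor.INSTANCE.getText(html);
      // Keep the original ParseData (outlinks, metadata), swap only the text.
      Parse parse = parseResult.get(content.getUrl());
      parseResult.put(content.getUrl(), new ParseText(mainText),
          parse.getData());
    } catch (Exception e) {
      // On any failure just fall back to the unmodified parse result.
    }
    return parseResult;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}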


--
 Sami Siren


Re: Nutch 1.0 with tomcat6 and Firefox does not find all files on Fedora 12

2010-02-24 Thread Sami Siren

Hannu,

Do you use the same set of QueryFilters both in the webapp and when running 
from the shell?


Perhaps your filter is not executed when running from the CLI? You can 
verify how your query is transformed by running bin/nutch 
org.apache.nutch.searcher.Query and entering some queries.


--
 Sami Siren

Hannu Väisänen wrote:

I am using Nutch 1.0 to index files written in Finnish.

I have written a filter MorphologyHVSuggestionFilter that converts
Finnish words to a base form (that you find in dictionaries) and
I index just the base forms so that I find all inflected forms
when searching just for the base form.



When I search for the word 'kuka' like this

bin/nutch org.apache.nutch.searcher.NutchBean kuka
Total hits: 245

Tomcat6 also finds 245 hits.


But when I search for word 'kuusi'

bin/nutch org.apache.nutch.searcher.NutchBean kuusi
Total hits: 212

Tomcat6 finds only 14 hits.



Tomcat6 log shows this for word 'kuka':


2010-02-16 21:25:40,909 INFO  NutchBean - query request from 0:0:0:0:0:0:0:1
2010-02-16 21:25:40,909 INFO  NutchBean - query request from 0:0:0:0:0:0:0:1
2010-02-16 21:25:40,910 DEBUG MorphologyHVSuggestionFilter - Token1 (kuka,0,4)
2010-02-16 21:25:40,910 DEBUG MorphologyHVSuggestionFilter - Token2 (kuka,0,4)
2010-02-16 21:25:40,910 INFO  NutchBean - query: kuka
2010-02-16 21:25:40,910 INFO  NutchBean - query: kuka
2010-02-16 21:25:40,910 INFO  NutchBean - lang: fi
2010-02-16 21:25:40,910 INFO  NutchBean - lang: fi
2010-02-16 21:25:40,911 INFO  NutchBean - searching for 20 raw hits
2010-02-16 21:25:40,911 INFO  NutchBean - searching for 20 raw hits
2010-02-16 21:25:40,939 INFO  NutchBean - re-searching for 40 raw hits, query: kuka 
-site:
2010-02-16 21:25:40,939 INFO  NutchBean - re-searching for 40 raw hits, query: kuka 
-site:
2010-02-16 21:25:40,941 INFO  NutchBean - found 0 raw hits
2010-02-16 21:25:40,941 INFO  NutchBean - found 0 raw hits
2010-02-16 21:25:40,969 INFO  NutchBean - total hits: 245
2010-02-16 21:25:40,969 INFO  NutchBean - total hits: 245


Tomcat6 log shows this for word 'kuusi':

2010-02-16 21:23:12,777 INFO  NutchBean - query request from 0:0:0:0:0:0:0:1
2010-02-16 21:23:12,777 INFO  NutchBean - query request from 0:0:0:0:0:0:0:1
2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token1 (kuusi,0,5)
2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token2 (kuu,0,5)
2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token2 
(kuusi,0,0,posIncr=0)
2010-02-16 21:23:12,778 INFO  NutchBean - query: kuusi
2010-02-16 21:23:12,778 INFO  NutchBean - query: kuusi
2010-02-16 21:23:12,778 INFO  NutchBean - lang: fi
2010-02-16 21:23:12,778 INFO  NutchBean - lang: fi
2010-02-16 21:23:12,780 INFO  NutchBean - searching for 20 raw hits
2010-02-16 21:23:12,780 INFO  NutchBean - searching for 20 raw hits
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for url
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for anchor
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for content
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for title
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for host
2010-02-16 21:23:12,813 INFO  NutchBean - total hits: 14
2010-02-16 21:23:12,813 INFO  NutchBean - total hits: 14


The difference between words 'kuka' and 'kuusi' is that the word 'kuka'
has only one base form (which happens to be 'kuka') but the word
'kuusi' has two base forms 'kuusi' and 'kuu' ('moon'; 'si' is a
possessive suffix).

So is it possible that when I search through tomcat6, Nutch returns
only those files that have both of the words 'kuusi' and 'kuu'? If so, how
can I change this so that it finds files that have either 'kuusi' or 'kuu'
(or, of course, any other base form of the word I search for :-)?




Re: Content storage, results highlighting

2010-02-24 Thread Sami Siren


The schema.xml file there is usable only when using Solr as the search 
server. Are you using Solr?


--
 Sami Siren

Pedro Bezunartea López wrote:
 Hi,


I've developed a web application in Lucene that searches web pages using a
Nutch-generated index. I'd like to highlight the query searched for when
showing the results, and I understand that the content of the pages needs to
be stored, as well as indexed.

This is what I've tried so far:
1.- In the file conf/nutch-site.xml, I changed the value of
file.content.ignored to false.
2.- In the file conf/schema.xml I modified the line:
 <field name="content" type="text" stored="false" indexed="true"/>
to
 <field name="content" type="text" stored="true" indexed="true"/>
3.- In the source file
src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java,
I changed line 116 to:
 LuceneWriter.addFieldOptions("content", LuceneWriter.STORE.YES,
LuceneWriter.INDEX.TOKENIZED, conf)

I tried running the command bin/nutch crawl urls -dir crawl -depth 10 -topN
5000 after the first two steps, but the crawl didn't store the contents. I
then tried the third step, recompiled Nutch, and ran the crawl command again,
to no avail.

What am I missing? Any hints, please?

TIA,

Pedro.





Re: Nutch near future - strategic directions

2009-11-26 Thread Sami Siren

Andrzej Bialecki wrote:

Sami Siren wrote:

Lots of good thoughts and ideas, easy to agree with.

Something for the ease of use category:
-allow running on top of plain vanilla hadoop


What does "plain vanilla" mean here? Do you mean the current DB 
implementation? That's the idea, we should aim for an abstract layer 
that can accommodate both HBase and plain MapFile-s.


I was simply trying to say that we should not bundle Hadoop anymore with 
Nutch and instead just mention the specific version it should run on top 
of as a requirement. I am not totally sure anymore if this is a good idea...


I do not know the details of the HBase branch. Would using HBase allow us 
easy migration from one data model to another (without the complex code we 
now have in our datums)? How easy is HBase to manage/setup/configure?


I think Avro looks promising as a data storage technology: it has some 
support for data model evolution, can be accessed natively from many 
programming languages, and performs relatively well... The downside at 
the moment is that it is not yet fully supported by hadoop mapred (I think).



-split into reusable components with nice and clean public api
-publish mvn artifacts so developers can directly use mvn, ivy etc to 
pull required dependencies for their specific crawler


+1, with slight preference towards ivy.


I was not clear here; I think I was referring to users of Nutch instead 
of developers. In that case the choice of a tool would be up to the 
user, once the artifacts are in the repo.


Also, I think what I wanted to say is more about the model of how people 
who want to do some customization would operate, rather than about a 
technology choice.


Creating new plugin:
-create your own build configuration (or use a template we provide)
-implement plugin code
-publish to m2 repository

Creating your custom crawler:
-create your own build configuration (or use a template we might 
provide), specify the dependencies you need (plugins basically, from 
apache or from anybody else as long as they are available through some 
repository)

-potentially write some custom code

We could also still provide a default Nutch crawler as a build 
configuration (basically just an xml file + some config) if we wanted.


The new Hadoop maven artifacts also help with this vision, since we could 
access the hadoop apis (and dependencies) through a similar mechanism.



My biggest concern is the execution of this (or any other) plan.
Some of the changes or improvements that have been proposed are quite 
heavy in nature and would require large changes. I am just wondering 
whether it would be better to take a fresh start instead of trying 
to do this incrementally on top of the existing code base.


Well ... that's (almost) what Dogacan did with the HBase port. I agree 
that we should not feel too constrained by the existing code base, but 
it would be silly to throw everything away and start from scratch - we 
need to find a middle ground. The crawler-commons and Tika projects 
should help us to get rid of the ballast and significantly reduce the 
size of our code.


I am not aiming to throw everything away, just trying to relax the back 
compatibility burden and give innovation a chance.


In the history of Nutch this approach is not something new (remember 
map reduce?) and in my opinion it worked nicely then. Perhaps it is 
different this time since the changes we are discussing now have many 
abstract things hanging in the air, even fundamental ones.


Nutch 0.7 to 0.8 reused a lot of the existing code.


I am hoping that this time it will not be different.



Of course the rewrite approach means that it will take some time 
before we actually get into the point where we can start adding real 
substance (meaning new features etc).


So to summarize, I would go ahead and put together a branch nutch 
N.0 that would consist of (a.k.a my wish list, hope I am not being 
too aggressive here):


-runs on top of plain hadoop


See above - what do you mean by that?

-use osgi (or some other more optimal extension mechanism that fits 
and is easy to use)
-basic http/https crawling functionality (with db abstraction or 
hbase directly and smart data structures that allow flexible and 
efficient usage of the data)

-basic solr integration for indexing/search
-basic parsing with tika

After the basics are ok we would start adding and promoting any of the 
hidden gems we might have, or some solutions for the interesting 
challenges.


I believe that's more or less where Dogacan's port is right now, except 
it's not merged with the OSGI port.


Are you sure OSGI is the way to go? I know it has all these nice 
features and all, but for some reason I feel that we could live with 
something simpler. From a functional point of view: just drop your jars into 
the classpath and you're all set. So two changes here: 1. plugins are jars, 2. 
no individual classloaders for plugins.


--
 Sami Siren


Re: Nutch near future - strategic directions

2009-11-18 Thread Sami Siren

Lots of good thoughts and ideas, easy to agree with.

Something for the ease of use category:
-allow running on top of plain vanilla hadoop
-split into reusable components with nice and clean public api
-publish mvn artifacts so developers can directly use mvn, ivy etc to 
pull required dependencies for their specific crawler


My biggest concern is the execution of this (or any other) plan.
Some of the changes or improvements that have been proposed are quite 
heavy in nature and would require large changes. I am just wondering 
whether it would be better to take a fresh start instead of trying to 
do this incrementally on top of the existing code base.


In the history of Nutch this approach is not something new (remember map 
reduce?) and in my opinion it worked nicely then. Perhaps it is 
different this time since the changes we are discussing now have many 
abstract things hanging in the air, even fundamental ones.


Of course the rewrite approach means that it will take some time before 
we actually get into the point where we can start adding real substance 
(meaning new features etc).


So to summarize, I would go ahead and put together a branch nutch N.0 
that would consist of (a.k.a my wish list, hope I am not being too 
aggressive here):


-runs on top of plain hadoop
-use osgi (or some other more optimal extension mechanism that fits and 
is easy to use)
-basic http/https crawling functionality (with db abstraction or hbase 
directly and smart data structures that allow flexible and efficient 
usage of the data)

-basic solr integration for indexing/search
-basic parsing with tika

After the basics are ok we would start adding and promoting any of the 
hidden gems we might have, or some solutions for the interesting challenges.


PS. Many of the interesting challenges in your proposal seem to fall into 
the category of data analysis and manipulation that is mostly used 
after the data has been crawled, or between fetch cycles, so many of 
those could be implemented on top of the current code base too. Somehow I just 
feel that things could be made more efficient and understandable if the 
foundation (e.g. the data structures and extensibility) was in 
better shape. Also, if written nicely, other projects could use them too!


--
 Sami Siren


Andrzej Bialecki wrote:

Hi all,

The ApacheCon is over, our release 1.0 has already been out for some 
time, so I think it's a good moment to discuss what the next steps 
in Nutch development are.


Let me share with you the topics I identified and presented in the 
ApacheCon slides, and some topics that are worth discussing based on 
various conversations I had there, and the discussions we had on our 
mailing list:


1. Avoid duplication of effort
--
Currently we spend significant effort on implementing functionality that 
other projects are dedicated to. Instead of doing the same work, and 
sometimes poorly, we should concentrate on delegating and reusing:


* Use Tika for content parsing: this will require some effort and 
collaboration with the Tika project, to improve Tika's ability to handle 
more complex formats well (e.g. hierarchical compound documents such as 
archives, mailboxes, RSS), and to contribute any missing parsers (e.g. 
parse-swf).


* Use Solr for indexing & search: it is hard to justify the effort of 
developing and maintaining our own search server - Solr offers much more 
functionality, configurability, performance and ease of integration than 
our relatively primitive search server. Our integration with Solr needs 
to be improved so that it's easier to set up and operate.


* Use database-like storage abstraction: this may seem like a serious 
departure from the current architecture, but I don't mean that we should 
switch to an SQL DB ... what this means is that we should provide an 
option to use HBase, as well as the current plain MapFile-s (and perhaps 
other types of DBs, such as Berkeley DB or SQL, if it makes sense) as 
our storage. There is a very promising initial port of Nutch to HBase, 
which is currently closely integrated with HBase API (which is both good 
and bad) - it provides several improvements over our current storage, so 
I think it's worth using as the new default, but let's see if we can 
make it more abstract.


* Plugins: the initial OSGI port looks good, but I'm not sure yet at 
this moment if the benefits of OSGI outweigh the cost of this change ...


* Shard management: this is currently an Achilles' heel of Nutch, where 
users are left on their own ... If we switch to using HBase then at 
least on the crawling side the shard management will become much easier. 
This still leaves the problem of deploying new content to search 
server(s). The candidate framework for this side of the shard management 
is Katta + patches provided by Ted Dunning (see ???). If we switch to 
using Solr we would have to also use the Katta / Solr integration, and 
perhaps Solr/Hadoop integration as well.

Re: Fetcher2 Slow

2009-03-30 Thread Sami Siren

Roger Dunk wrote:
Andrzej stated in NUTCH-669 that some people reported performance 
issues with Fetcher2, i.e. that it doesn't use the available bandwidth. 
These reports are unconfirmed, and they may have been caused by 
suboptimal URL / host distribution in a fetchlist - but it would be good 
to review the synchronization and threading aspects of Fetcher2.


To address this, I've tried just now generating a fetchlist using 
generate.max.per.host = 1 (which gave me 35,000 unique hosts) to 
guarantee unique hosts, but the problem still remains.


Therefore, I believe it's clearly not an issue of suboptimal URL / host 
distribution. If you require any further information to confirm my 
report, you need only ask!



I have so far seen two sources of slowness; I don't know if they are 
related to your case:


1. You are running Nutch from behind a NAT box. I experienced this problem 
when I did some test crawling from a machine sitting behind an ADSL router 
that did NAT. Soon after starting a crawl, the maximum number of NAT 
connections was reached in the router, and further connections could only 
be made after old ones timed out of the NAT table. These connections were 
mostly DNS connections.


2. Your machine has IPv6 enabled. I noticed this more recently when I was 
wondering about the relatively slow fetching speed on a box. After disabling 
IPv6 completely I was able to fetch 2-4 times faster without any other config 
changes.
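
(Illustration only, not something verified in this thread: instead of disabling 
IPv6 at the OS level, it may be enough to make the JVM that runs the fetcher 
prefer the IPv4 stack, e.g. by passing

  -Djava.net.preferIPv4Stack=true

through whatever mechanism your bin/nutch or Hadoop setup uses for extra JVM 
options.)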


--
 Sami Siren



[ANNOUNCE] Apache Nutch 1.0

2009-03-28 Thread Sami Siren

I am pleased to announce the availability of  Apache Nutch 1.0.

Apache Nutch, a subproject of Apache Lucene, is open source web-search 
software. It builds on Lucene Java, adding web-specifics, such as a 
crawler, a link-graph database, parsers for HTML and other document formats.


Apache Nutch 1.0 contains a number of bug fixes and improvements such as 
Solr Integration, new indexing framework and new scoring framework just 
to mention a few. Details can be found in the changes file:


http://svn.apache.org/repos/asf/lucene/nutch/tags/release-1.0/CHANGES.txt

Apache Nutch is available for download from the following download page:
http://www.apache.org/dyn/closer.cgi/lucene/nutch/nutch-1.0.tar.gz

When downloading from a mirror site, please remember to verify the 
downloads using signatures found on the Apache site:

http://www.apache.org/dist/lucene/nutch/KEYS

For more information on Apache Nutch, visit the project home page:
http://lucene.apache.org/nutch

-- Sami Siren (on behalf of the Apache Nutch community)


Re: Fwd: fetch but not index

2009-03-11 Thread Sami Siren

?? wrote:


Hi all,
in the crawl log I can see 'fetching 
http://www.na.gov.la/docs/eng/currentnews/Vietnamese%20Ambassador.html',

 but at the end of the indexing I cannot find that it was indexed. Why?
 Please help me.


Hi,

that url seems to be blocked by the robots.txt of that site. That is why 
it does not end up in the index.


--
 Sami Siren




Re: Running multiple processes on a single machine

2009-03-11 Thread Sami Siren

dayz...@gmail.com wrote:

Hi,

If I want to run several parsers on a single quad-core machine 
simultaneously, would I still need to have Hadoop setup as a 
single-node cluster? 
I think that the fetcher is currently the only component that can take 
advantage of multiple cores when running in local mode. We should 
perhaps address that at some point since it is not that hard to 
parallelize at least some of the processing inside individual tools so 
single machine users could benefit from multiple cores.


I am not sure, but I think that the only way to do it properly is to run a 
jobtracker and tasktracker on that machine and configure proper block 
sizes & the number of map and reduce tasks.
Can several updatedbs be run simultaneously? I believe not, since the 
db seems to be locked when it's being updated.
Locking prevents multiple applications from accessing the crawl db 
simultaneously (the same applies to the linkdb).


--
Sami Siren



Re: Working with Solr. Doubts

2009-03-10 Thread Sami Siren

Javier Puerto wrote:

Hi to all,

We are working with Nutch 0.8 to crawl about 18 web sites in an
intranet; each site has an average of 40.000~50.000 documents. At the moment
we have the content split into four parts, and run a
DistributedSearchServer for each, with the client configured for the 4
servers.

Now we have Apache Droids crawling the filesystem to transform and
index a lot of documents into Solr. We need to unify the front-end client
to be able to search both Solr and Nutch.


I thought about upgrading to a newer release of Nutch for the Solr support, but I
have some doubts:

Does Nutch have a front-end for Solr, or do I have to develop it all myself?
Can I search on multiple Solr servers in the same way as Nutch does with the
DistributedSearchServer?
Can I search Nutch and Solr simultaneously and merge the results?

If anyone has had any similar problem or any suggestions to clarify my
doubts, thank you very much!
  


I think you could simplify your setup by using Solr; scale should not be a 
problem. Last week I tested the Solr-Nutch integration with a collection 
size of over 6 M docs on an old PC (growing roughly 1 M per day). 
Response times were still pretty good for simple queries, even when I let 
Solr create snippets.


--
Sami Siren



Re: Exception when crawling

2009-03-04 Thread Sami Siren

dealmaker wrote:
I have similar problem with nightly build #741 (Mar 3, 2009 4:01:53 AM). 
What's wrong?
  
There was a change in hadoop that caused this problem to appear. It has 
now been fixed in build #743.


--
Sami Siren



Re: How do you setup your svn for your nutch code?

2009-03-02 Thread Sami Siren

just a FYI,

there is also (unofficial) git repos for many apache projects - 
including nutch here:


http://jukka.zitting.name/git/

--
Sami Siren


Dingding Ye wrote:

similar.

1. git-svn clone nutch-trunk

Then create a git project which is my working project.  After that, clone
the nutch-git repo as a remote repo of this git project

2.  git remote add

Now when you want to update the nutch code, update nutch-git first. Then
update the branch of your working repo. Finally, merge into your working branch.
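
(A rough sketch of that workflow as commands; the repository names are made up,
the trunk URL assumes the Apache svn layout of the time, and my-project is
assumed to already be a git repository with your own work branch:

  # mirror the Apache svn trunk into a local git repository
  git svn clone http://svn.apache.org/repos/asf/lucene/nutch/trunk/ nutch-git
  # in your own working repository, track that mirror as a remote
  cd my-project
  git remote add nutch ../nutch-git
  git fetch nutch
  # from your work branch, merge the updated upstream
  git merge nutch/master
)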



On Mon, Mar 2, 2009 at 12:10 PM, dealmaker vin...@gmail.com wrote:

  

I need more detail.  Do you clone the main trunk to your local main trunk, and then
create a local branch for your personal project, then merge periodically from
your local main trunk which you cloned?


Dingding Ye wrote:


Just personal choice, and I think the branch/merge feature of git is more
powerful than svn's.  It helps with smooth merges.

What I did before was to clone the main trunk.  It should work for 0.9 also.

However, if you make rapid changes to the sources, I think none of them are
helpful and you have to solve the conflicts yourself.

On Mon, Mar 2, 2009 at 11:55 AM, dealmaker vin...@gmail.com wrote:

  

And also, do you clone the main trunk or just, for example, 0.9?


Dingding Ye wrote:


I have used git-svn to clone the nutch project.
And then use a git repo to manage my personal version and do periodical
merges with the git version of nutch.

On Mon, Mar 2, 2009 at 9:27 AM, dealmaker vin...@gmail.com wrote:

  

no, it's not the official 1.0.  Even so, there may be 1.1 in future.  I just
want to know how to setup svn for future versions that needs minimum
maintenance.
Thanks.


Tony Wang-3 wrote:


from my understanding, Nutch 1.0 is already in the latest nightly build.

On Sun, Mar 1, 2009 at 5:22 PM, dealmaker vin...@gmail.com wrote:


Hi,
 I am modifying Nutch 0.9 code for my project.  Currently, I put all my 0.9
code in my local main trunk.  But I know that 1.0 will be out soon, and want
to use 1.0 code instead in near future. What is the best way to setup svn to
do that?  Should I just sync the main trunk from apache server to my local
trunk and setup branch for 1.0 in local?
Thanks.
--
View this message in context:



http://www.nabble.com/How-do-you-setup-your-svn-for-your-nutch-code--tp22280092p22280092.html


Sent from the Nutch - User mailing list archive at Nabble.com.




--
Are you RCholic? www.RCholic.com


  

--
View this message in context:



http://www.nabble.com/How-do-you-setup-your-svn-for-your-nutch-code--tp22280092p22280605.html


Sent from the Nutch - User mailing list archive at Nabble.com.



  

--
View this message in context:



http://www.nabble.com/How-do-you-setup-your-svn-for-your-nutch-code--tp22280092p22281721.html


Sent from the Nutch - User mailing list archive at Nabble.com.



  

--
View this message in context:
http://www.nabble.com/How-do-you-setup-your-svn-for-your-nutch-code--tp22280092p22281816.html
Sent from the Nutch - User mailing list archive at Nabble.com.





  




Re: Problem with crawling using the latest 1.0 trunk

2009-03-02 Thread Sami Siren

Hi,

and thanks for being persistent. Can you specify which version of 
Nutch you are running: is it a nightly build (if yes, which one?) 
or did you check out the svn trunk? And just to be sure: are you running 
with the default configuration?


--
Sami Siren

ahammad wrote:

I checked hadoop.log and this is what it has:

java.lang.IllegalArgumentException: it doesn't make sense to have a field
that is neither indexed nor stored
at org.apache.lucene.document.Field.<init>(Field.java:279)
at
org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:133)
at
org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:239)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:40)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
at
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158)
at
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)


I don't understand what that refers to specifically. I'm running it with its
default configuration, without any of the advanced indexing that I have in
my 0.9 install.

Cheers.



Andrzej Bialecki wrote:
  

ahammad wrote:


I am aware that this is still a development version, but I need to test a
few
things with Nutch/Solr so I installed the latest dev version of Nutch
1.0.

I tried running a crawl like I did with the working 0.9 version. From the
log, it seems to fetch all the pages properly, but it fails at the
indexing:

CrawlDb update: starting
CrawlDb update: db: kb/crawldb
CrawlDb update: segments: [kb/segments/20090302135858]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: kb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment:
file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135757
LinkDb: adding segment:
file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135807
LinkDb: adding segment:
file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135858
LinkDb: done
Indexer: starting
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:146)


I took a look at all the configuration and as far as I can tell, I did
the
same thing with my 0.9 install. Could it be that I didn't install it
properly? I unzipped it and ran ant and ant war in the root directory.
  
Please check the logs in the logs/ directory - the above message is not 
informative, the real reason of the failure can be found in the logs.


--
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






  




Re: Problem with crawling using the latest 1.0 trunk

2009-03-02 Thread Sami Siren

I can see this error also; not sure yet what's going wrong...

--
Sami Siren

Justin Yao wrote:

log4j configure:

log4j.logger.org.apache.nutch.indexer.Indexer=TRACE,cmdstdout

log4j.logger.org.apache.nutch=TRACE
log4j.logger.org.apache.hadoop=TRACE

Output:

2009-03-02 17:53:21,987 DEBUG indexer.Indexer - IFD [Thread-11]: 
setInfoStream 
deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@118d189 

2009-03-02 17:53:21,988 DEBUG indexer.Indexer - IW 0 [Thread-11]: 
setInfoStream: 
dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-justin/mapred/local/index/_1068960877 
autoCommit=true 
mergepolicy=org.apache.lucene.index.logbytesizemergepol...@648016 
mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@1551b0 
ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 
maxFieldLength=1 index=
2009-03-02 17:53:21,993 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-03-02 17:53:21,994 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.anchor.AnchorIndexingFilter

2009-03-02 17:53:22,009 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.IllegalArgumentException: it doesn't make sense to have a 
field that is neither indexed nor stored

at org.apache.lucene.document.Field.<init>(Field.java:279)
at 
org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:133) 

at 
org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:239)
at 
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50) 

at 
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:40) 


at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
at 
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158) 

at 
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50) 


at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-02 17:53:22,567 FATAL indexer.Indexer - Indexer: 
java.io.IOException: Job failed!

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
at org.apache.nutch.indexer.Indexer.run(Indexer.java:92)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:101)



Andrzej Bialecki wrote:

Justin Yao wrote:

Same problem here if using build #740 (Mar 2, 2009 4:01:53 AM)
I switched to build #736 (Feb 26, 2009 4:01:15 AM) and it worked then.


Could you please send the error message from the logs/, which you got 
with build #740? Thanks!








Re: Problem with crawling using the latest 1.0 trunk

2009-03-02 Thread Sami Siren

Sami Siren wrote:

I can see this error also; not sure yet what's going wrong...


It's NUTCH-703 (the hadoop upgrade) that broke the indexing. Any ideas what 
changed in hadoop that might have caused this?


--
 Sami Siren





--
Sami Siren

Justin Yao wrote:

log4j configure:

log4j.logger.org.apache.nutch.indexer.Indexer=TRACE,cmdstdout

log4j.logger.org.apache.nutch=TRACE
log4j.logger.org.apache.hadoop=TRACE

Output:

2009-03-02 17:53:21,987 DEBUG indexer.Indexer - IFD [Thread-11]: 
setInfoStream 
deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@118d189 

2009-03-02 17:53:21,988 DEBUG indexer.Indexer - IW 0 [Thread-11]: 
setInfoStream: 
dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-justin/mapred/local/index/_1068960877 
autoCommit=true 
mergepolicy=org.apache.lucene.index.logbytesizemergepol...@648016 
mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@1551b0 
ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 
maxFieldLength=1 index=
2009-03-02 17:53:21,993 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-03-02 17:53:21,994 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.anchor.AnchorIndexingFilter

2009-03-02 17:53:22,009 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.IllegalArgumentException: it doesn't make sense to have a 
field that is neither indexed nor stored

at org.apache.lucene.document.Field.<init>(Field.java:279)
at 
org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:133) 

at 
org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:239)
at 
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50) 

at 
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:40) 


at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
at 
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158) 

at 
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50) 


at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-02 17:53:22,567 FATAL indexer.Indexer - Indexer: 
java.io.IOException: Job failed!

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
at org.apache.nutch.indexer.Indexer.run(Indexer.java:92)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:101)



Andrzej Bialecki wrote:

Justin Yao wrote:

Same problem here if using build #740 (Mar 2, 2009 4:01:53 AM)
I switched to build #736 (Feb 26, 2009 4:01:15 AM) and it worked then.


Could you please send the error message from the logs/, which you got 
with build #740? Thanks!










Re: log org.apache.solr.common.SolrException: Bad Request when indexing feeds with solrindexer.

2009-02-23 Thread Sami Siren

Felix Zimmermann wrote:

Hi,

 


I get this log error when indexing feeds with solrindexer:

 

  


2009-02-23 23:04:11,438 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter

2009-02-23 23:04:11,439 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter

2009-02-23 23:04:11,441 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.feed.FeedIndexingFilter

2009-02-23 23:04:11,584 WARN  mapred.LocalJobRunner - job_local_0001

org.apache.solr.common.SolrException: Bad Request

 


Bad Request

 


request: http://127.0.0.1:8080/solr3/update?wt=javabin&version=2.2

at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpS
olrServer.java:343)

at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpS
olrServer.java:183)

at
org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.jav
a:217)

at
  
Hi, I would check the Solr log to see why it is failing; probably Nutch 
is providing content for a field that is not present in the Solr schema.


--
Sami Siren



Re: Nutch 1.0 - Setting up and running Nutch for crawling and Solr for indexing and querying.

2009-02-22 Thread Sami Siren

Tony Wang wrote:

I don't see that Nutch 1.0 has been released. Where did you download it?
  
Nutch 1.0 has not been released yet; the community is working to get it 
out as we speak. There are still some issues that need to be fixed 
before the release can take place. Everybody's involvement in testing 
the current nightly builds and providing documentation patches or wiki 
updates is appreciated.


--
Sami Siren

nightly build? thanks

On Fri, Feb 20, 2009 at 6:31 PM, Kham Vo k...@mac.com wrote:

  

Hello Nutch 1.0 designers,

I successfully installed and set up Nutch 1.0 (build # 722).  Ran bin/nutch
crawl urls -dir crawl -depth 3 -topN 50 and it seemed to work, fetching data
from specified sites.  No error.  My question is do I need to do anything
special in order to get Nutch to post the data to another instance of
apache-solr running at http://localhost:8983 for indexing.  I googled for
any documentation on how to correctly set up Nutch 1.0 such that nutch is
for crawling and solr is for indexing and display.  Nothing so far.

Your help is greatly appreciated.

Kham






  




Re: HTTP Status 500 - No Context configured to process this request

2009-02-22 Thread Sami Siren

samuel.gre...@mesaaz.gov wrote:
I have tried tomcat 6.0 and after escaping some quotes in a string in 
search.jsp, it works without error.  However, it returns no results.  I 
suspect it is not finding the correct crawl files. 
  

That is the common case, the other being that there is no data available.

I have started tomcat in the nutch directory.

I have also added a preference to nutch:

<property>
  <name>searcher.dir</name>
  <value>crawl</value>
  <description>
  Path to root of crawl.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>


Any other steps to take?
  


No, that should do it. A couple of things you can try:

- double check that your configuration is indeed in use; the file to 
check is ${webapps}/ROOT/WEB-INF/classes/nutch-site.xml
- use an absolute directory in searcher.dir; that way it does not matter 
where or how you start tomcat.
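
For example, with an absolute path (the path below is only an illustration, 
point it at wherever your crawl directory actually lives):

  <value>/full/path/to/crawl</value>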


You can also check that you can actually get results back from the nutch 
command line:

- double check $nutch-home/conf/nutch-site.xml (searcher.dir)
- execute (from the command line) bin/nutch 
org.apache.nutch.searcher.NutchBean <query>


--
Sami Siren

Thanks
Sam



Hi,

I just dropped Nutch web app into tomcat version 6.0.18 and it worked 
fine, perhaps you should upgrade your Tomcat?


--
 Sami Siren

samuel.gre...@mesaaz.gov wrote:
  

Hi,

I am following the tutorial here: 


http://nutch.sourceforge.net/docs/en/tutorial.html

Crawling works fine, as does the test search from the command line.  When 
I try to fire up tomcat after moving ROOT.war into place, I get some 
errors in the tomcat logs and a page with

HTTP Status 500 - No Context configured to process this request

2009-02-19 15:55:46 WebappLoader[]: Deploy JAR 
/WEB-INF/lib/xerces-2_6_2.jar to C:\Program Files\Apache Software 
Foundation\Tomcat 4.1\webapps\ROOT\WEB-INF\lib\xerces-2_6_2.jar

2009-02-19 15:55:47 ContextConfig[] Parse error in default web.xml
org.apache.commons.logging.LogConfigurationException: User-specified log 
class 'org.apache.commons.logging.impl.Log4JLogger' cannot be found or is 
not useable.
at 
org.apache.commons.digester.Digester.createSAXException(Digester.java:3181)
at 
org.apache.commons.digester.Digester.createSAXException(Digester.java:3207)
at 
org.apache.commons.digester.Digester.endElement(Digester.java:1225) 
 etc.


So it looks like the root of the error is default web.xml, not in the 
Log4JLogger - although I know very little about Java.  I haven't played 
with it for a few years.


Anyone know what is going on here? 


versions/info:

nutch 0.9
Tomcat 4.1
jre1.5.0_08
jdk1.6.0_12
NUTCH_JAVA_HOME=C:\Program Files\Java\jdk1.6.0_12
JAVA_HOME=C:\Program Files\Java\jdk1.6.0_12

Thanks!
Sam






  




Re: Feed indexing with solrindex not working.

2009-02-22 Thread Sami Siren

Felix Zimmermann wrote:

Hi,


Indexing RSS feeds with solrindex does not work. I suspect missing special
field definitions in Solr's schema.xml. Could somebody tell me the correct
field definitions, please? In the future, it would be best to put a default
schema.xml into the conf dir(?)
  
There is an open issue for this 
https://issues.apache.org/jira/browse/NUTCH-699. Please contribute your 
findings there.


--
Sami Siren


Re: HTTP Status 500 - No Context configured to process this request

2009-02-20 Thread Sami Siren

Hi,

I just dropped Nutch web app into tomcat version 6.0.18 and it worked 
fine, perhaps you should upgrade your Tomcat?


--
Sami Siren

samuel.gre...@mesaaz.gov wrote:

Hi,

I am following the tutorial here: 


http://nutch.sourceforge.net/docs/en/tutorial.html

Crawling works fine, as does the test search from the command line.  When 
I try to fire up tomcat after moving ROOT.war into place, I get some 
errors in the tomcat logs and a page with


HTTP Status 500 - No Context configured to process this request

2009-02-19 15:55:46 WebappLoader[]: Deploy JAR 
/WEB-INF/lib/xerces-2_6_2.jar to C:\Program Files\Apache Software 
Foundation\Tomcat 4.1\webapps\ROOT\WEB-INF\lib\xerces-2_6_2.jar

2009-02-19 15:55:47 ContextConfig[] Parse error in default web.xml
org.apache.commons.logging.LogConfigurationException: User-specified log 
class 'org.apache.commons.logging.impl.Log4JLogger' cannot be found or is 
not useable.
at 
org.apache.commons.digester.Digester.createSAXException(Digester.java:3181)
at 
org.apache.commons.digester.Digester.createSAXException(Digester.java:3207)
at 
org.apache.commons.digester.Digester.endElement(Digester.java:1225) 
 etc.


So it looks like the root of the error is default web.xml, not in the 
Log4JLogger - although I know very little about Java.  I haven't played 
with it for a few years.


Anyone know what is going on here? 


versions/info:

nutch 0.9
Tomcat 4.1
jre1.5.0_08
jdk1.6.0_12
NUTCH_JAVA_HOME=C:\Program Files\Java\jdk1.6.0_12
JAVA_HOME=C:\Program Files\Java\jdk1.6.0_12

Thanks!
Sam
  




Re: Distributed Search Server fails with Trunk

2009-02-19 Thread Sami Siren

Höchstötter Nadine wrote:

Hi,
I run Nutch on a single server and I have two crawl directories; that's why I use 
Nutch in distributed search server mode as described in the hadoop manual.
But since I got a new trunk version (04.02.2009) it fails. Local search on one 
index works fine, but distributed search throws the following exception:

  

...

We do not run Nutch in PseudoDistributedMode. We only use the distributed 
search mode. With Nutch-0.9 this was working properly.
Did anyone have the same problem?
  
Yes, I just verified that this is happening. Can you please file a Jira 
issue with fix version = 1.0 and priority = blocker?


thanks.

--
Sami Siren



Re: nutch restart after recrawl

2009-02-19 Thread Sami Siren

Alexander Aristov wrote:

Hi People

Is there a way to tell nutch to re-initialize the index after a re-crawl without
an application restart?
  
Not really. I added a Jira issue, NUTCH-376, to track this enhancement, but no 
work has been done on that front.


One potential solution to this problem is to use Solr as the indexing back 
end; the integration is in the nightly version of Nutch. I am not sure if 
the procedure is documented anywhere.
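
(For illustration only, not from this thread: with the Solr integration the 
indexing step is a command roughly along the lines of

  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

where the URL and the paths are placeholders for your own setup, and the exact 
arguments may differ between nightly builds.)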


--
Sami Siren

All scripts suggest restarting nutch, but that means searching is
unavailable for a few minutes.

Can I call an API or something?

  




Re: Fetcher2 doesn't print status information on console

2009-02-19 Thread Sami Siren

Koch Martina wrote:

Hi,

I'm testing Fetcher2 from the current trunk and wondered why Fetcher2 doesn't 
report any status on the console.
Other tools like Injector or Fetcher report not only to the hadoop.log, but also to STDOUT to some 
extent, e.g. "Generator: starting", "Fetcher: done" and so on.

Did I configure something wrong or is this the intended behaviour in Fetcher2?
I can't see any difference in the logging logic of Fetcher2.
  


The logging configuration is in the file conf/log4j.properties; there you 
have an entry for Fetcher:


log4j.logger.org.apache.nutch.fetcher.Fetcher=INFO,cmdstdout

but not for Fetcher2. If you add such a line for Fetcher2 it should start 
outputting logging to stdout.
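
For example, assuming Fetcher2's fully qualified class name is 
org.apache.nutch.fetcher.Fetcher2, the added line would look like:

log4j.logger.org.apache.nutch.fetcher.Fetcher2=INFO,cmdstdout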


--
Sami Siren


Thanks in advance.

Kind regards,
Martina


  




Re: Fetcher2 crashes with current trunk

2009-02-19 Thread Sami Siren

Doğacan Güney wrote:

I think I have found the bug here, but I am in a hurry now, I will
create a JIRA issue
and post (what is hopefully) the fix later today.
  


Great! thanks.

--
Sami Siren

On Tue, Feb 17, 2009 at 21:39, Doğacan Güney doga...@gmail.com wrote:
  

2009/2/17 Sami Siren ssi...@gmail.com:


Do we have a Jira issue for this? It seems like a blocker for 1.0 to me if it is 
reproducible.

  

No we don't. But you are right that we should. I am very busy and I
forgot about it. I will
examine this problem in more detail tomorrow and will open an issue if
I can reproduce
the bug.



--
Sami Siren


Doğacan Güney wrote:
  

Thanks for detailed analysis. I will take a look and get back to you.

On Mon, Feb 16, 2009 at 13:41, Koch Martina k...@huberverlag.de wrote:



Hi,

sorry for the late reply. We did some further digging and found that the error 
has nothing to do with Fetcher or Fetcher2. When using Fetcher, the error just 
happens much later (after about 20 fetch cycles).
We did many test runs, eliminated as much plugins as possible and identified 
URLs which are most likely to fail.
With the following configuration we get a corrupt crawldb after two fetch2 
cycles:
- activated plugins: protocol-http, parse-html, feed
- generate.max.per.host - 100
- URLs to fetch:
http://www.prosieben.de/service/newsflash/
http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
http://www.prosieben.de/kino_dvd/news/60897/
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
http://www.prosieben.de/spielfilm_serie/topstories/61051/
http://www.prosieben.de/kino_dvd/news/60897/

When starting from a higher URL like http://www.prosieben.de these URLs get 
the following warn message after some fetch cycles:
WARN  parse.ParseOutputFormat - Can't read fetch time for: 
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
But the crawldb does not get corrupted immediately after the first occurrence of 
such messages; it gets corrupted some cycles later.

Any suggestions are highly appreciated.
Something seems to go wrong with the feed plugin, but I can't diagnose exactly 
when and why...

Thanks in advance.

Kind regards,
Martina



-Original Message-
From: Doğacan Güney [mailto:doga...@gmail.com]
Sent: Friday, 13 February 2009 09:37
To: nutch-user@lucene.apache.org
Subject: Re: Fetcher2 crashes with current trunk

On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina k...@huberverlag.de wrote:

  

Hi all,

we use the current trunk of 04.02.09 with the patch for CrawlDbMerger 
(Nutch-683) manually applied.
We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle 
at depth 1.
When we use Fetcher2, we can do this cycle four times in a row without any 
problems. If we start the fifth cycle the Injector crashes with the following 
error log:

2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected urls 
into crawl db.
2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics 
with processName=JobTracker, sessionId= - already initialized
2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to 
process : 2
2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to 
process : 2
2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
2009-02-12 00:00:05,554

Re: Restarting Nutch

2009-02-18 Thread Sami Siren

[moving this to nutch-user]

Hrishikesh Agashe wrote:

Hi,

 


I am planning to do a huge crawl using Nutch (billions of URLs) and so need
to understand whether Nutch can handle restarts after a crash.

 


For a single system, if I do Ctrl+C while Nutch is running and then restart
it, will it be possible for Nutch to detect where it had reached in the last run
and start from that point onwards? Or will it be considered a new, fresh
crawl?
  

Nutch does not try to resume the action that was interrupted.

Also, if I have 5 nodes running Nutch and doing the crawling, and one of the
nodes fails, should it be considered a total failure of Nutch itself? Or
should I allow the other nodes to proceed further? Will I lose the data gathered by
the failed node?
  
Hadoop will execute the remaining tasks on the nodes that are available. 
Usually the data will be stored on a shared/distributed filesystem (like 
HDFS). If your setup is similar and you ensure that the filesystem can 
survive single-node failures, your data should be safe.


--
Sami Siren


Re: How many kb is a page's index?

2009-02-18 Thread Sami Siren

buddha1021 wrote:

Hi:
How many kB does a page's index take, on average?
  

Hi,

There's a quite recent estimate on
http://www.lucidimagination.com/search/document/c6c099bf31b0de55/index_ratio#de145fe338543d5b


And when building distributed search clusters, should the nodes be 1U servers, or the
common PCs that people use daily with Windows? Which maximizes performance?
  
Well it can be anything, the important thing is to set up a small system 
with similar hardware and see how it performs. That way you can get 
quite accurate estimates on larger scale systems running on similar 
hardware.


--
Sami Siren


Re: Fetcher2 crashes with current trunk

2009-02-17 Thread Sami Siren
Do we have a Jira issue for this? It seems like a blocker for 1.0 to me if 
it is reproducible.


--
Sami Siren


Doğacan Güney wrote:

Thanks for detailed analysis. I will take a look and get back to you.

On Mon, Feb 16, 2009 at 13:41, Koch Martina k...@huberverlag.de wrote:
  

Hi,

sorry for the late reply. We did some further digging and found that the error 
has nothing to do with Fetcher or Fetcher2. When using Fetcher, the error just 
happens much later (after about 20 fetch cycles).
We did many test runs, eliminated as much plugins as possible and identified 
URLs which are most likely to fail.
With the following configuration we get a corrupt crawldb after two fetch2 
cycles:
- activated plugins: protocol-http, parse-html, feed
- generate.max.per.host - 100
- URLs to fetch:
http://www.prosieben.de/service/newsflash/
http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
http://www.prosieben.de/kino_dvd/news/60897/
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
http://www.prosieben.de/spielfilm_serie/topstories/61051/
http://www.prosieben.de/kino_dvd/news/60897/

When starting from a higher URL like http://www.prosieben.de these URLs get 
the following warn message after some fetch cycles:
WARN  parse.ParseOutputFormat - Can't read fetch time for: 
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
But the crawldb does not get corrupted immediately after the first occurrence of 
such messages; it gets corrupted some cycles later.

Any suggestions are highly appreciated.
Something seems to go wrong with the feed plugin, but I can't diagnose exactly 
when and why...

Thanks in advance.

Kind regards,
Martina



-Original Message-
From: Doğacan Güney [mailto:doga...@gmail.com]
Sent: Friday, 13 February 2009 09:37
To: nutch-user@lucene.apache.org
Subject: Re: Fetcher2 crashes with current trunk

On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina k...@huberverlag.de wrote:


Hi all,

we use the current trunk of 04.02.09 with the patch for CrawlDbMerger 
(Nutch-683) manually applied.
We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle 
at depth 1.
When we use Fetcher2, we can do this cycle four times in a row without any 
problems. If we start the fifth cycle the Injector crashes with the following 
error log:

2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected urls 
into crawl db.
2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics 
with processName=JobTracker, sessionId= - already initialized
2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to 
process : 2
2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to 
process : 2
2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
2009-02-12 00:00:05,554 INFO  mapred.MapTask - io.sort.mb = 100
2009-02-12 00:00:05,828 INFO  mapred.MapTask - data buffer = 79691776/99614720
2009-02-12 00:00:05,828 INFO  mapred.MapTask - record buffer = 262144/327680
2009-02-12 00:00:06,538 INFO  mapred.JobClient -  map 0% reduce 0%
2009-02-12 00:00:07,262 WARN  mapred.LocalJobRunner - job_local_0002
java.lang.RuntimeException: java.lang.NullPointerException
  at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
  at org.apache.hadoop.io.MapWritable.readFields

Re: Trying to understand how webapp works

2009-02-17 Thread Sami Siren

Bartek wrote:

Hello,

I am trying to figure out how webapp part is working.

I've installed nutch and crawled some site. Then deployed .war file 
and in file {tomcat.dir}/nutch/WEB-INF/classes/nutch-site.xml

I've put correct searcher.dir, in my case /usr/local/nutch/crawls/site1

Everything is working fine but...

When I removed the whole crawls dir (/usr/local/nutch/crawls) the web 
application is still working fine. Searching works (but not the cache 
- I assume it can't find the segments).


So could someone explain to me why it is still working?
You didn't restart tomcat after killing the directory, did you? It might 
be working because the webapp still has references to all the files it 
needs. Restart tomcat and it should stop working.


--
Sami Siren



Re: indexing after fetching

2009-02-17 Thread Sami Siren

Nicolas MARTIN wrote:

I need to know if Nutch necessarily indexes the data that has been fetched when
running the bin/crawl command?
  

Hi,

The bin/nutch crawl command will index the data at the end of the cycle. If 
you do not wish to index, just use the individual commands: 
inject, generate, fetch, updatedb, generate...  
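
(For example, a minimal cycle without the indexing step could look roughly like 
this; the directory names and the segment path are placeholders:

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/<segment>
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>

and then generate again for the next round.)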


--
Sami Siren



Re: nutch jdk?

2009-02-09 Thread Sami Siren

Dennis Kubes wrote:
jdk1.5 or better, I am currently on jdk1.6 sun.  For the webapp we use 
tomcat but should run on any jsp/servlet container, websphere included.

I think you need 1.6 now (for trunk) since we use Hadoop 0.19.

--
Sami Siren



Re: nutch jdk?

2009-02-09 Thread Sami Siren

buddha1021 wrote:


Sami Siren-2 wrote:
  

Dennis Kubes wrote:

jdk1.5 or better, I am currently on jdk1.6 sun.  For the webapp we use 
tomcat but should run on any jsp/servlet container, websphere included.
  

I think you need 1.6 now (for trunk) since we use Hadoop 0.19.

--
 Sami Siren





which sdk will be used? the j2se sdk or the j2ee sdk?

You don't need the ee version of java to run nutch.

if I use Ubuntu as the OS to build a distributed search system which
contains several nodes, which version is the best OS for every node? the
desktop edition or the server edition? thank you!

I think either of those should be good enough, as would many other Linux 
distributions.


--
Sami Siren


Re: how to create a new ngp file for Telugu in nutch

2008-08-21 Thread Sami Siren

nalgonda wrote:

Hi,

how to create a new ngp file for tamil
i tried using java org.apache.nutch.analysis.lang.NGramProfile -create te
sample_te.txt UTF8
but I get an error: 
No java lang class in org/apache/nutch/lang/analysis/Ngramprofile


what's that how to solve?
  

Hi,

I think the easiest way is to enable language-identifier plugin and 
execute class through the plugin command:


bin/nutch plugin language-identifier 
org.apache.nutch.analysis.lang.NGramProfile -create te sample_te.txt utf-8



--
Sami Siren


directions for web ui? [was Re: web2 plugins compilation error]

2008-08-21 Thread Sami Siren

Hi,

The web2 ui was originally an effort to make the web ui more modular and 
easier to customize. The architecture below the surface is, mildly put, 
outdated, and it relies on a dirty trick that allows JSPs to be executed 
from inside .jar files.


If I would write it again today I would probably use webmacro or 
velocity instead to get rid of the hack that breaks on servlet 
containers with different versions of jsp api. My recommendation is: do 
not use it ;)


It has long been on my todo list to start a discussion about the future of 
an end-user interface in Nutch and possible future directions (where did 
all that time go?). I think that we need a simple-to-maintain UI that is 
easy to customize (both of the current UIs fail to satisfy those 
requirements IMO).


What kind of thoughts do others have?

--
Sami Siren



michos101 wrote:

Hi,
I am trying to enable the web2 plugins but I get an issue when I try
to compile the plugins.

I get the following errors:

init:

compile-plugins:

deploy:

init:

init-plugin:

compile:

jar:
  [jar] Warning: skipping jar archive
/usr/local/mputa01/build/webui-extensionpoints/webui-extensionpoints.jar
because no files were included.

deps-test:

deploy:

init:

init-plugin:
 [echo] Copying UI configuration
 [echo] Copying UI templates

deps-jar:

prepare-web:
   [delete] Deleting directory
/usr/local/mputa01/build/web-caching-oscache/tmp/_web
 [copy] Copying 5 files to
/usr/local/mputa01/build/web-caching-oscache/tmp/_web

compile-jsp:

compile:
 [echo] Compiling plugin: web-caching-oscache
[javac] Compiling 4 source files to
/usr/local/mputa01/build/web-caching-oscache/classes
[javac]
/usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/CacheManager.java:32: 
package org.apache.nutch.webapp.common does not exist

[javac] import org.apache.nutch.webapp.common.Search;
[javac]  ^
[javac]
/usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/CacheManager.java:33: 
package org.apache.nutch.webapp.common does not exist

[javac] import org.apache.nutch.webapp.common.ServiceLocator;
[javac]  ^
[javac]
/usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/CacheManager.java:127: 
cannot find symbol

[javac] symbol  : class ServiceLocator
[javac] location: class org.apache.nutch.webapp.CacheManager
[javac]   public Search getSearch(String id, ServiceLocator locator) 
throws NeedsRefreshException  {

[javac]  ^
[javac]
/usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/CacheManager.java:127: 
cannot find symbol

[javac] symbol  : class Search
[javac] location: class org.apache.nutch.webapp.CacheManager
[javac]   public Search getSearch(String id, ServiceLocator locator) 
throws NeedsRefreshException  {

[javac]  ^
[javac]
/usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/CacheManager.java:162: 
cannot find symbol

[javac] symbol  : class Search
[javac] location: class org.apache.nutch.webapp.CacheManager
[javac]   public void putSearch(String id, Search search){
[javac]^
[javac]
/usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/controller/CachingSearchController.java:27: 
package org.apache.nutch.webapp.common does not exist

[javac] import org.apache.nutch.webapp.common.Search;
[javac]  ^
[javac]
/usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/controller/CachingSearchController.java:28: 
package org.apache.nutch.webapp.common does not exist

[javac] import org.apache.nutch.webapp.common.ServiceLocator;
[javac]  ^
[javac]
/usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/controller/CachingSearchController.java:29: 
package org.apache.nutch.webapp.common does not exist

[javac] import org.apache.nutch.webapp.common.Startable;
[javac]  ^
[javac]
/usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/controller/CachingSearchController.java:30: 
cannot find symbol

[javac] symbol  : class SearchController
[javac] location: package org.apache.nutch.webapp.controller
[javac] import org.apache.nutch.webapp.controller.SearchController;
[javac]  ^
[javac]
/usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/controller/CachingSearchController.java:39: 
cannot find symbol

[javac] symbol: class SearchController

Re: Next Generation Nutch

2008-04-12 Thread Sami Siren
 of 
different tools in any of these areas.  What this means is the ability 
to have different components such as web crawlers (as long as the end 
data structure is the same), for example Fetcher, Fetcher2, Grub, 
Heretrix, or even specialized crawlers.  And different components for 
different analysis types.  I don't see a lot of cross-cutting concerns 
here.  And where there is, url normalization for example, I think it 
can be handled better through dependency injection.


Which brings me to three.  I think it is time to get rid of the plugin 
framework.

+1
I want to keep the functionality of the various plugins but I think a 
dependency injection framework, such as spring, creating the 
components needed for logic inside of tools is a much cleaner way to 
proceed.  This would allow much better unit and mock testing of tool 
and logic functionality.  
The lack of junit tests in nutch has been a big burden for it (in 
general the amount of junit tests seems to somewhat correlate with how 
easy/hard they are to write :) so if we architect the system to be 
easily testable (small isolated units) we could simultaneously raise the 
bar for junit testing it and also make it easier to refactor later.


It would allow Nutch to run on a non nutchified Hadoop cluster, 
meaning just a plain old hadoop cluster.  We could have core jars and 
contrib jars and a contrib directory which is pulled from by shell 
scripts when submitting jobs to Hadoop.  With the multiple-resources 
functionality in Hadoop it would be a simple matter of creating the 
correct command lines for the job to run.


And that brings me to separation of data and presentation.  Currently 
the Nutch website is one monolithic jsp application with plugins.  I 
think the next generation should segment that out into xml / json 
feeds and a separate front end that uses those feeds.  Again this 
would make it much easier to create web applications using nutch.


And of course I think that shard management, a la Hadoop master and 
slave style, is a big requirement as well.  I also think a full test 
suite with mock objects and local and MiniMR and MiniDFS cluster 
testing is important as is better documentation and tutorials (maybe 
even a book :)).  So up to this point I have created MapReduce jobs 
that use spring for dependency injection and it is simple and works 
well.  The above is the direction I would like to head down but I 
would also like to see what everyone else is thinking.


Dennis



--
Sami Siren



Re: Nutch training at ApacheCon EU 2008

2008-03-25 Thread Sami Siren

Frisa, Raquel, VF-ES (rfrisar) wrote:

Hello,

I was right now thinking about attending your training session but it's not 
there!  What's happened?  Do you know if there's something planned related to 
Nutch?
  

Hi,

The training was canceled due to low demand.

There are still plenty of interesting lucene/solr/hadoop related stuff 
there to attend to.


--
Sami Siren



Re: can't find hadoop classes necessary to use Nutch API

2007-11-29 Thread Sami Siren
Ana Rodighiero wrote:
 I have Nutch running on my server and it crawls and searches just fine. I am
 writing a java program to use the search api, but cannot compile because I
 am missing some classes from hadoop. Are these classes included somewhere in
 the nutch or tomcat downloads? If not, how is the compiled distribution of
 nutch running without them? Where can I get the hadoop jar files?
 Specifically, I am trying to make a NutchBean, which requires hadoop's
 Configuration and Path classes. I'm not doing anything with multiple
 servers, so  those may be the only ones I need. Is there any way to use
 Nutch without them? Thank you for answers to any or all of these questions.

The hadoop jar (hadoop-version-core.jar) should be available under
lib/. Nutch cannot be compiled/run without it.

--
 Sami Siren


Re: java.lang.NoClassDefFoundError Nutch 0.9

2007-11-08 Thread Sami Siren
karthik085 wrote:
 Hi, 
 
 I got nutch from svn tags - release0.9 - but can't get rid of this problem.
 I did
 ant compile
 ant jar
 ant war
 All of them build successfully with different versions of ant - 1.6.5 and
 1.7.0

do ant job
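
That is, in addition to compile/jar/war you also need the job target, which 
packages the classes, plugins and configuration into the .job file that 
bin/nutch expects to find (a sketch; the exact file name and location depend 
on your version):

ant job
ls build/*.job    # e.g. build/nutch-0.9.job should now exist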

-- 
 Sami Siren


Re: PDF problems, inc. documents returned with XLS extension

2007-10-22 Thread Sami Siren
George Weller wrote:
 Hi all,
 
 First I note in the logs that a large number of PDF documents have been
 fetched, and yet only two have been indexed, and indeed only these two
 appear in search results. The content limit is set high enough to allow
 these documents to be indexed, so I can't think why this should be.

Are there any related errors on log?

 Secondly for those documents that ARE indexed, rather bizarrely, the
 document titles in the search results have a '.xls' extension. I can even
 search for all PDF document just by using the query 'xls'. Note that this
 suffix is most definitely NOT in the actual title of those files. I also
 chanced upon a site that seems to use Nutch (no affiliation- I just googled)
 and found the same problem...

In the examples from your site the title is extracted from the pdf
metadata so it just uses the title stored within the pdf doc.

-- 
 Sami Siren


Re: Indexer does not update the Lucene TITLE field

2007-10-19 Thread Sami Siren
Sergio Morales wrote:
 Hi Sami,
 
 Thanks for the info.
 
 Is there any other way to share this?

Create a JIRA issue and attach it there?

--
 Sami Siren


Re: Indexer does not update the Lucene TITLE field

2007-10-19 Thread Sami Siren
Sergio Morales wrote:
 Hi,
  
 I have upgraded from NUTCH 0.9 to nutch-2007-09-30_04-01-28.tar.gz.
  
 It seems the indexer is unable to update the field TITLE of the Lucene 
 index when processing specific html documents.
  
  
 Please find below a brief summary of this issue:
  
 1.- Extracted this new version in a separate directory and copy across the 
 following configuration files:
 - {nutch_home_9.0}/bin/url folder, containing the urls
 - {nutch_home_9.0}/conf/nutch-site.xml
 - {nutch_home_9.0}/conf/crawl-urlfilter.txt
  
 2.- To reproduce the issue, you would need to copy the attached html document 
 to your webserver/filesytem.

There was not any html document attached. This is because mailing list
software removes them.

-- 
 Sami Siren


Re: Problems running multiple nutch nodes

2007-10-04 Thread Sami Siren
Uygar BAYAR wrote:
 hi 
   thanks for the solution..it's solved my log problem but not my
 http://www.nabble.com/java.lang.OutOfMemoryError%3A-Requested-array-size-exceeds-VM-limit-tf4562352.html
 and gives this error message
 
 Exception in thread main java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
 at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:131)
 at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:149)
 

if it works on local jobrunner you possibly forgot to increase memory
for spawned vm processes with hadoop conf like:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1000m</value>
</property>

--
 Sami Siren



Re: IOException using feed plugin - NUTCH-444

2007-07-03 Thread Sami Siren
Kai_testing Middleton wrote:
 I hope someone can suggest a method to proceed with this RuntimeException I'm 
 getting.

Recheck that you have a scoring plugin enabled properly (scoring-opic) in
the Nutch configuration (in the snippet you gave below it did not exist;
also, the PluginRepository log you showed did not have it registered).
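
As a hedged example (the exact plugin list depends on your setup), the 
plugin.includes property in nutch-site.xml should contain scoring-opic, 
e.g. something along these lines:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|feed|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>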

--
 Sami Siren


 
 java.lang.RuntimeException: No scoring plugins - at least one scoring plugin 
 is required!
 at org.apache.nutch.scoring.ScoringFilters.init(ScoringFilters.java:87)
 at 
 org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:61)
 at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
 at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
 at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
 at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
 at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:170)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
 
 As far as I can tell I'm using NUTCH-444 out-of-the-box since I have a 
 nightly build.
 
 --Kai M.
 
 
 - Original Message 
 From: Kai_testing Middleton [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Sent: Friday, June 29, 2007 5:24:57 PM
 Subject: Re: IOException using feed plugin - NUTCH-444
 
 The exception is:
java.lang.RuntimeException: No scoring plugins - at least one scoring 
 plugin is required!
 
 I note that my nutch-site.xml does contain a reference to scoring-opic so I 
 wonder why it would give that exception.
 
 --Kai M.
 
 - Original Message 
 From: Kai_testing Middleton [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Sent: Friday, June 29, 2007 11:36:11 AM
 Subject: Re: IOException using feed plugin - NUTCH-444
 
 Here is the more detailed stack trace:
 java.lang.RuntimeException: No scoring plugins - at least one scoring plugin 
 is required!
 at org.apache.nutch.scoring.ScoringFilters.init(ScoringFilters.java:87)
 at 
 org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:61)
 at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
 at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
 at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
 at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
 at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:170)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
 
 In fact, here is a complete hadoop.log for the command I attempt:
 nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2 2>&1 | 
 tee crawl.log
 
 2007-06-29 11:28:58,785 INFO  crawl.Crawl - crawl started in: 
 /usr/tmp/lee_apollo
 2007-06-29 11:28:58,788 INFO  crawl.Crawl - rootUrlDir = /usr/tmp/lee_urls.txt
 2007-06-29 11:28:58,789 INFO  crawl.Crawl - threads = 10
 2007-06-29 11:28:58,790 INFO  crawl.Crawl - depth = 2
 2007-06-29 11:28:58,925 INFO  crawl.Injector - Injector: starting
 2007-06-29 11:28:58,925 INFO  crawl.Injector - Injector: crawlDb: 
 /usr/tmp/lee_apollo/crawldb
 2007-06-29 11:28:58,925 INFO  crawl.Injector - Injector: urlDir: 
 /usr/tmp/lee_urls.txt
 2007-06-29 11:28:58,926 INFO  crawl.Injector - Injector: Converting injected 
 urls to crawl db entries.
 2007-06-29 11:28:59,936 INFO  plugin.PluginRepository - Plugins: looking in: 
 /usr/local/nutch-2007-06-27_06-52-44/plugins
 2007-06-29 11:29:00,253 INFO  plugin.PluginRepository - Plugin 
 Auto-activation mode: [true]
 2007-06-29 11:29:00,253 INFO  plugin.PluginRepository - Registered Plugins:
 2007-06-29 11:29:00,253 INFO  plugin.PluginRepository - CyberNeko HTML 
 Parser (lib-nekohtml)
 2007-06-29 11:29:00,253 INFO  plugin.PluginRepository - Site Query Filter 
 (query-site)
 2007-06-29 11:29:00,253 INFO  plugin.PluginRepository - Basic URL 
 Normalizer (urlnormalizer-basic)
 2007-06-29 11:29:00,253 INFO  plugin.PluginRepository - Html Parse 
 Plug-in (parse-html)
 2007-06-29 11:29:00,253 INFO  plugin.PluginRepository - Pass-through URL 
 Normalizer (urlnormalizer-pass)
 2007-06-29 11:29:00,260 INFO  plugin.PluginRepository - Regex URL Filter 
 Framework (lib-regex-filter)
 2007-06-29 11:29:00,260 INFO  plugin.PluginRepository - Feed 
 Parse/Index/Query Plug-in (feed)
 2007-06-29 11:29:00,260 INFO  plugin.PluginRepository - Basic Indexing 
 Filter (index-basic)
 2007-06-29 11:29:00,261 INFO  plugin.PluginRepository - Basic Summarizer 
 Plug-in (summary-basic)
 2007-06-29 11:29:00,261 INFO  plugin.PluginRepository - Text Parse 
 Plug-in (parse-text)
 2007-06-29 11:29:00,261 INFO  plugin.PluginRepository - JavaScript Parser 
 (parse-js)
 2007-06-29 11:29:00,261

Re: [Nutch-general] Integrate nutch crawler with Solr index server

2007-06-26 Thread Sami Siren
Doğacan Güney wrote:
 Hi,
 
 On 6/26/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 Is this actually planned (addition of SolrIndexer to Nutch)?
 A search for SolrIndexer in JIRA got no hits.
 
 There is NUTCH-442 (one of the most popular issues). But, after Sami's
 work, there have been no further developments.
 
 I think Sami Siren's original patch no longer works with Solr, I am
 not sure if it still applies to nutch. So, if anyone wants to tackle
 this, here are a couple of items off the top of my mind:

It still applies to nutch (actually there were just two additional
classes) and works with the original client (don't know if it's still
available).

I am currently working on something around solr-nutch integration and
hoping that I can give out something within the next few weeks.

 
 1) Bring Sami's patch up-to-date (both with solr and with nutch). I
 think a seperate Indexer job is unnecessary, we should just change
 Indexer.OutputFormat to check for a parameter, and if its true,
 OutputFormat should also send documents to Solr (besides writing it to
 lucene index in DFS).

I actually think that the endless adding of configuration options does
not do any good to anyone; we should instead start to write reusable
pieces of code and/or bring the number of different options down.
(IMO the massive number of already available configuration/runtime
options, and the fact that most of Nutch is not designed to be extended
by coding, is harmful for advanced users. On the other hand I think that
things are already too complicated for novice users.)

 2) Make it work in distributed setups (i.e. with more than 1 index
 server)  . Sami Siren also makes a note of this, but I don't believe
 that a simple hash-the-url approach is appropriate for nutch. It would
 be nice to guarantee that a url always goes to the same indexing
 server, even if we add or remove index servers (if we just take the
 hash of url, then adding a new machine would cause pretty much all
 urls to be distributed to different servers).

I think that the distributed online Index part should be done outside of
Nutch (or if done here do it with extreme caution:) so it does not get
tied to Nutch.

-- 
 Sami Siren


Re: [Nutch-general] Integrate nutch crawler with Solr index server

2007-06-26 Thread Sami Siren

 I think that the distributed online Index part should be done outside of
 Nutch (or if done here do it with extreme caution:) so it does not get
 tied to Nutch.
 
 I am not sure I understand you here. If I have 10 machines I am using
  for serving indexes (I am assuming I have a Solr instance running on
 each one), IndexerSolr should be able to partition my index to 10
 machines.

There are more dimensions to distribution (or scaling), and the case you
describe is a very basic one.  Of course we could support such special
setups inside Nutch too; just remember that once it starts to look
like a thing that can manage large online indexes, it would perhaps
be most useful if it was not tied to Nutch.

-- 
 Sami Siren


Re: [Nutch-general] Integrate nutch crawler with Solr index server

2007-06-26 Thread Sami Siren
Doğacan Güney wrote:

 I actually think that the endless adding of configuration options does
 not do any good to anyone; we should instead start to write reusable
 pieces of code and/or bring the number of different options down.
 (IMO the massive number of already available configuration/runtime
 options, and the fact that most of Nutch is not designed to be extended
 by coding, is harmful for advanced users. On the other hand I think that
 things are already too complicated for novice users.)
 
 OK, adding new configuration options all the time is probably not a
 great idea. But I strongly believe that indexing to different targets
 should be done in Indexer.OutputFormat (OutputFormat outputs to
 different targets, makes sense to me :). For example, I would love the
 ability to index to solr but I would also need to store the original
 lucene index in DFS (so that if solr machine dies, I don't lose my
 index). I shouldn't have to run Indexer twice to achieve this.

In one application I added an extension point for different indexing
backends; that way, by implementing a composite index backend, you could
achieve that same thing.

The code shown in the blog post was done mainly with simplicity in mind;
the other motivation was doing it without touching the Nutch source code.

--
 Sami Siren


Re: Enabling Spell-Check plugin in contrib

2007-06-15 Thread Sami Siren
Scam wrote:
 Hello Sami,
 
 Wednesday, June 13, 2007, 23:03, you wrote:
 
 Can anyone tell me how to use the spell-check query plugin available in the
 contrib \ web2 dir (and even the rest of the plugins too)? Is it similar to
 enabling the nutch-plugins?
 
 SS Following these steps should get you there:
 
 SS 1. compile nutch (in top level dir do ant)
 
 SS 2. crawl your data (see tutorial)
 
 SS 3. edit your conf/nutch-site.xml so it contains plugin
 SS web-query-propose-spellcheck and webui-extensionpoints
 
 SS 4. edit conf/nutch-site.xml so it contains proper dir for plugins as the
 SS plugins are not packaged inside .war (something like
SS <property>
SS   <name>plugin.folders</name>
SS   <value> path to plugins dir </value>
SS </property>
 SS )
 
 SS 5. compile web2 plugins (in contrib/web2 do ant compile-plugins)
 
 I get error on this step:
 
 compile:
  [echo] Compiling plugin: web-caching-oscache
 [javac] Compiling 4 source files to 
 /home/nutch/distr/nutch.src/nutch/trunk/build/web-caching-oscache/classes
 [javac] 
 /home/nutch/distr/nutch.src/nutch/trunk/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/CacheManager.java:32:
  package org.apache.nutch.webapp.common does not exist
 
 Could you help me to know where is a problem?

it seems you can just ignore step #5, because they get compiled in #7


-- 
 Sami Siren


Re: Enabling Spell-Check plugin in contrib

2007-06-13 Thread Sami Siren
chris sleeman wrote:
 Hi,
 
 Can anyone tell me how to use the spell-check query plugin available in the
 contrib \ web2 dir (and even the rest of the plugins too)? Is it similar to
 enabling the nutch-plugins?

Following these steps should get you there:

1. compile nutch (in top level dir do ant)

2. crawl your data (see tutorial)

3. edit your conf/nutch-site.xml so it contains plugin
web-query-propose-spellcheck and webui-extensionpoints

4. edit conf/nutch-site.xml so it contains proper dir for plugins as the
plugins are not packaged inside .war (something like
<property>
  <name>plugin.folders</name>
  <value> path to plugins dir </value>
</property>
)

5. compile web2 plugins (in contrib/web2 do ant compile-plugins)

6. edit search.jsp so it contains the line <tiles:insert definition="propose"
ignore="true"/> just before the second <c:choose>.

7. create web2 app (in contrib/web2 do ant war)

8. build your spell check index (bin/nutch plugin
web-query-propose-spellcheck org.apache.nutch.spell.NGramSpeller -i
indexdir -f content -o spelling)

9. deploy webapp to tomcat

10. start tomcat (from the dir you have your crawl data and ngram index
generated in #7)

11. search for something that is spelled incorrectly

 Also how do we build the spelling index ? Are these plugins still WIP ? I

see #8 above, the whole web is MWSN (More Work Still Needed:)

 haven't been able to find any docs on these.

That's because there currently is not any other documentation but the
readme in
http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/README.txt?view=markup

I should probably put some documentation on the wiki to gain more traction

fyi - I just committed a small fix to bug that might prevent spell
checking proposer from working. So if you have problems check out the
trunk or a nightly build tomorrow.

-- 
 Sami Siren


Re: Regex-urlfilter

2007-05-16 Thread Sami Siren
Naess, Ronny wrote:
 Can anyone pleas tell me what am I doing wrong?

 It struck me that I might be using the wrong file and that all regex
 exceptions should be in crawl-urlfilter.txt, but I do not think that is
 correct.

   
Yes, when using the crawl command you should use crawl-urlfilter.txt, or
configure crawl to use regex-urlfilter.txt via crawl-tool.xml.

-- 
 Sami Siren



Re: fetch single host

2007-05-11 Thread Sami Siren
derevo wrote:
 hi, 
 (2 servers hadoop nutch)
 
 I am trying to fetch my host with txt files ( http://site.net/file_1.txt ).
 More than 15 txt files. 
 When I start the fetch and look at the access.log file on the target host, I see only
 one slave host doing the fetch (SLAVE_1). 
 I try to restart fetching and the slave host is now (SLAVE_2). 
 
 In the Task Tracker Status I see the same result

The fetchlist is by default partitioned in a way that all urls for the same host
 will end up being fetched by a single node; see PartitionUrlByHost.

To override this you would need to change the partitioner or stop using
it (both would require source code changes).

-- 
 Sami Siren


Re: urlfilter-suffix bug ?

2007-05-06 Thread Sami Siren
Andrzej Bialecki wrote:
 Sami Siren wrote:
 Emmanuel JOKE wrote:
 ...
 those files. I tried to look at the code and I think the plugin doesn't
 manage correctly the dynamic URL  with ?  and parameters after the
 extension of the file.

 Yes, your observation is correct, the filter compares only on the string
 level. It isn't too hard to extend the functionality so it meets your
 requirement.

 
 The question is however what is the intended behavior - should we match
 at the whole URL (including URL parameters), or should we match only the
  URL up to (and including) path, but excluding any parameters? Currently
 we implement the former.

Yes, the current behavior (especially in url space) is not what you'd
probably expect, but it matches the name.

-- 
 Sami Siren


Re: nutch freezing issue

2007-05-05 Thread Sami Siren
Siddharth Jonathan wrote:
 Hi,
  After a couple of days of being up, my nutch app begins to
 freeze/hang and basically
 indexing and searching can no longer happen.

During this time (couple of days) is it just sitting idle or serving
requests?

--
 Sami Siren



Re: urlfilter-suffix bug ?

2007-05-05 Thread Sami Siren
Emmanuel JOKE wrote:
...
 those files. I tried to look at the code and I think the plugin doesn't
 manage correctly the dynamic URL  with ?  and parameters after the
 extension of the file.

Yes, your observation is correct, the filter compares only on the string
level. It isn't too hard to extend the functionality so it meets your
requirement.

-- 
 Sami Siren



Re: Nutch and running crawls within a container.

2007-04-30 Thread Sami Siren
Briggs wrote:
 Version:  Nutch 0.9 (but this applies to just about all versions)
 
 I'm really in a bind.
 
 Is anyone crawling from within a web application, or is everyone
 running Nutch using the shell scripts provided?  I am trying to write
 a web application around the Nutch crawling facilities, but it seems
 that there are huge memory issues when trying to do this.   The
 container (tomcat 5.5.17 with 1.5 gigs of memory allocated, and 128K
 on the stack) runs out of memory in less than an hour.  When profiling
 version 0.7.2 we can see that there is a constant pool of objects that
 grow, but never get garbage collected.  So, even when the crawl is
 finished, these objects tend to just hang around forever, until we get
 the wonderful: java.lang.OutOfMemoryError: PermGen space.  I updated
 the application to use Nutch 0.9 and the problem got about 80x worse

Have you analyzed at any level of detail what is causing this memory
waste?  Have you tried tweaking the JVM's -XX:MaxPermSize?

I believe that all the classes required by plugins need to be loaded
multiple times (every time you execute a command where a Configuration
object is created) because of the design of the plugin system, where every
plugin has its own class loader (per configuration).

 So, the current design is/was to have an event happen within the
 system, which would fire off a crawler (currently just calls
 org.apache.nutch.crawl.Crawl.main()).  But, this has caused nothing
 but grief.  We need to have several crawlers running concurrently. We

You should perhaps call the classes directly and take control of
managing the Configuration object; this way PermGen space is not wasted
by loading the same classes over and over again.

-- 
 Sami Siren


Re: Can anybody tell me how the Nutch-0.9 is different than nutch-0.8.1

2007-04-20 Thread Sami Siren
Ratnesh,V2Solutions India wrote:
 Hi,
 can anybody explain to me what's new in nutch-0.9 compared to nutch-0.8.1? Since I
 have used nutch-0.8.1,
 I am keen to know how nutch-0.9 is different from the older version.

I think the best place to study the changes since 0.8.1 is jira:

http://issues.apache.org/jira/secure/BrowseProject.jspa?id=10680&subset=3

where most of the changes are listed.

--
 Sami Siren


Re: Classpath and plugins question

2007-04-19 Thread Sami Siren
Antony Bowesman wrote:
 I'm looking to use the Nutch parsing framework in a separate Lucene
 project. I'd like to be able to use the existing plugins directory
 structure as-is, so I wondered how Nutch sets up the class loading environment
 to find all the jar files in the plugins directories.

There are dedicated class loaders for each plugin. The classpath is
constructed (recursively) based on plugin metadata (plugin.xml).

 Any pointers to the Nutch class(es) that do the work?

Check the package o.a.n.plugin which contains most of the general
plug-in code.

There's also a recently established project called Apache Tika [1], which
has the goal of putting together a generally usable parsing/extraction
framework. It hasn't yet gotten off the ground, so there is a good chance
to get your voice heard.

[1] http://incubator.apache.org/tika/

-- 
 Sami Siren


Re: How to recude the tmp disk space usage during linkdb process?

2007-04-11 Thread Sami Siren
Sean Dean wrote:
 I think the general rule is you will require about 2.5 to 3 times the size of 
 the final product. This is due to Hadoop creating the reduce files after the 
 maps are produced, before the maps can be removed.
  
 I'm not aware of any way to change this, I think its just normal 
 functionality.

The space consumption is at its worst on a single machine configuration,
where you have to process all the data on 1 machine. If you have more
machines to spare, then the space required per machine can (obviously) be
divided roughly by the number of machines.

I think the only way to cut down your temp size requirements (after
compression, I think it's possible to compress the temp data?) is to do
your work in smaller slices.
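
As a hedged pointer (property names have changed across Hadoop versions, so 
double-check against the version you run), the intermediate map output can 
be compressed via the Hadoop configuration, e.g.:

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>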

--
 Sami Siren
 
  
 - Original Message 
 From: qi wu [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Sent: Wednesday, April 11, 2007 10:41:35 AM
 Subject: Re: How to recude the tmp disk space usage during linkdb process?
 
 
 One more general question related to this issue is: how to estimate the 
 tmp space required by the overall process, which includes fetching, updating 
 crawldb, building linkdb and indexing?
 In my case, 20G of space for crawldb and all segments requires more than 36G 
 of space for linkdb tmp space, which sounds unreasonable!
 
 - Original Message - 
 From: qi wu [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Sent: Wednesday, April 11, 2007 10:15 PM
 Subject: Re: How to recude the tmp disk space usage during linkdb process?
 
 
 it's impossible for me to change to 0.9 now. Anyway, thank you!

 - Original Message - 
 From: Sean Dean [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Sent: Wednesday, April 11, 2007 9:33 PM
 Subject: Re: How to recude the tmp disk space usage during linkdb process?


 Nutch 0.9 can apply zlib or lzo2 compression on your linkdb (and crawldb) 
 to reduce overall space. The average compression ratio using zlib is about 
 6:1 on those two databases and doesn't slow additions or segment creation 
 down.

 Keep in mind, this currently only works officially on Linux and 
 unofficially on FreeBSD.


 - Original Message 
 From: qi wu [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Sent: Wednesday, April 11, 2007 9:01:30 AM
 Subject: How to recude the tmp disk space usage during linkdb process?


 Hi,
  I have crawled nearly 3 million pages which are kept in 13 segments, and 
 there are 10 million entries in the crawldb. I use Nutch 0.8.1 on a single Linux 
 box; currently the disk occupied by crawldb and segments is about 20G, and 
 the machine still has 36G of space left. I always fail in building the linkdb, 
 and the error is caused by no space being left for the reduce process; the 
 exception is listed below:
 job_f506pk
 org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
at 
 org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSystem.java:150)
at 
 org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:83)
at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:112)
at 
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at 
 org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:208)
at 
 org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:913)
at 
 org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:800)
at 
 org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:738)
at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:112)
at 
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at 
 org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:208)
at 
 org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:913)
at 
 org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:800)
at 
 org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:738)
at 
 org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:542)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:218)
at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)

 I wonder why so much space is required by the linkdb reduce job; can I configure 
 some nutch or hadoop setting to reduce the disk space usage for linkdb? Any 
 hints for me to overcome the problem? //bow

 Thanks
 -Qi



Re: Fetcher2 too many spinWaiting, How to tune?

2007-04-02 Thread Sami Siren
hi,


qi wu wrote:
 Hi, I am using Fetcher2 with 200 threads started. I get a satisfying
 speed (about 20 pages/s) at the beginning stage, but after no more
 than one hour there are many spinWaiting threads. Where might the
 bottleneck be? Network, memory or anyplace else? Could you also give me
 some hints on how to get more detailed debug info?

Not specific to fetcher2, but how are the pages distributed among
different hosts in the fetchlist? Have you configured a reasonable setting for
generate.max.per.host in the nutch conf?

If you generate too many pages for too few hosts, there's no way
fetcher|fetcher2 can fetch them fast unless you make it non-polite.
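
For example, something like the following in nutch-site.xml (the value here 
is just an illustration, tune it to your crawl):

<property>
  <name>generate.max.per.host</name>
  <value>100</value>
</property>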

--
 Sami Siren



Re: Crawling + Indexing staging vs. production and URL conflict

2007-04-01 Thread Sami Siren
Tomi N/A wrote:
 2007/3/31, Sami Siren [EMAIL PROTECTED]:
 
 You could also let your reverse proxy do the rewriting using something
 like http://apache.webthing.com/mod_proxy_html/. I have been using
 something like that for rewriting a massive amount of html in real time for
 AA purposes, to hammer web applications into a different url space.
 
 Does it put the server under noticeable additional load?

We ran the reverse proxy (with AA) on separate machines and the load on the
machines was minimal; network latency was more of an overhead (thinking of
page download times) than rewriting a couple of absolute urls. I should
note, however, that we did not use that particular rewriter but a very
similar homebrew solution.

--
 Sami Siren


Re: Crawling + Indexing staging vs. production and URL conflict

2007-03-31 Thread Sami Siren
Andrzej Bialecki wrote:
 [EMAIL PROTECTED] wrote:
 What is the best way to accomplish this?

 One thing I was thinking was to index the staging site, then open up
 CrawlDb and LinkDb (any others?), loop through them and write out a
 new version of those files, changing the keys (URLs) along the way,
 for instance from http://STAGING.example.com/foo/bar.html to
 http://WWW.example.com/foo/bar.html

 Has anyone done this?  Does this sound realistic/doable?
 Is there a better/faster/easier way?
   e.g. changing URLs immediately at fetch/parse/index time?
   e.g. changing URLs on the fly at search time when displaying results?
 
 There is another option - when fetching configure nutch to use a URL
 rewriting proxy, which will rewrite on the fly your requests of
 www.example.com to staging.example.com, get the response, and return the
 content - the only thing to do then would be to rewrite absolute
 outlinks contained in the content, from staging to www - but this can be
 done in URLNormalizers.
 

You could also let your reverse proxy do the rewriting using something
like http://apache.webthing.com/mod_proxy_html/. I have been using
something like that for rewriting a massive amount of html in real time for
AA purposes, to hammer web applications into a different url space.

--
 Sami Siren



Re: Merging WebDBs

2007-03-23 Thread Sami Siren

2007/3/23, prashant_nutch [EMAIL PROTECTED]:


I created a new webdb under which I created two folders, crawldb and segments
(which is a combination of two webdbs),
but now I want to create the linkdb and the index.
How can this be created? I use a command like this in the Eclipse program
arguments (Windows):

invertlinks linkdb segments/*
I got an error like:

INFO  crawl.LinkDb - LinkDb: starting
INFO  crawl.LinkDb - LinkDb: linkdb: invertlinks
INFO  crawl.LinkDb - LinkDb: adding segment: linkdb
INFO  crawl.LinkDb - LinkDb: adding segment: segments/*
ERROR mapred.JobClient - Input directory
E:/Data/prashant/Projects/DummyNutch/Nutch/linkdb/parse_data in local is
invalid.
thanks in advance for help




LinkDb treats the parameter 'invertlinks' as the path to the linkdb (the 1st
parameter); remove it and the command should succeed.

--
Sami Siren


Re: Nutch and GET

2007-03-23 Thread Sami Siren
Damian Florczyk wrote:
 Hi there,
 
 Can nutch index dynamic pages with multiple GET parameters in the request?
 

Have you allowed them in the URL filter configuration? By default the regex
urlfilter filters those away:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
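
If you do want such URLs crawled, one option (a sketch; adjust to your own 
filter file) is to comment out that rule in conf/regex-urlfilter.txt, or in 
conf/crawl-urlfilter.txt when using the crawl command:

# allow URLs with query strings by disabling the default rule:
# -[?*!@=]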



--
 Sami Siren


Re: How to limit nutch to fetch, refetch and index just the injected URLs?

2007-02-02 Thread Sami Siren
Nicolás Lichtmaier wrote:

 I've backported revision 450799 to the 0.8.x branch for supporting
 -noAdditions. Perhaps you could consider committing it there... (I
 haven't tested it yet though).

Can you please create a JIRA issue for this and attach the patch there.

--
 Sami Siren


Re: Indexing only some filetypes with Nutch

2007-01-24 Thread Sami Siren
Tobias Zahn wrote:
 Hello again,
 I think I'm going to have a problem here: what if I'd like to index only
 files like .gif? I think I won't get anything in my index that way :-(
 Is there a way to get all URLs to such files anyway (maybe on a txt-list)?

You would have to allow html to be fetched to find the images. You would
also need to change indexer to index just the content you are interested
in (images) and skip the rest.

--
 Sami Siren


Re: Compiling PruneIndexTool trouble

2007-01-22 Thread Sami Siren
Jonathan Hunter wrote:
 Dear nutch-users,
 
 I am trying to make some changes to the Nutch's PruneIndexTool, but
 before I start making those changes I wanted to make sure that I am able
 to compile the current PruneIndexTool from the command line.
 
 I checked to make sure that the java compiler works in general by using
 it to compile a simple hello world program.
 
 I did this by calling the following command from my nutch directory:
 
 $ javac helloworld.java
 //compiles with no errors
 $ java helloworld
 hello world
 $
 


You should use the ant command to compile nutch (including the PruneIndexTool).

$ ant

--
 Sami Siren


Re: How to stop a slow fetch?

2007-01-18 Thread Sami Siren
 I'm thinking your fetch list is at a point where it might only have
 a few hosts left in it, but enough pages from those hosts to stall
 everything up. Recently there was a patch applied to trunk to help
 solve that problem, the generator was actually not working to its
 fullest capacity for some time up until that point.

There's some more about that issue and how it affected a random
segment here: http://blog.foofactory.fi/2007/01/sorted-out.html

--
 Sami Siren




Re: Nutch .81: the process to add a new analyzer ?

2007-01-07 Thread Sami Siren
Chee Wu wrote:
 Hi,
 I am trying to add a new analyzer for Chinese, and I found the
 code below in org.apache.nutch.indexer.Indexer.
 
 My question is:
 for doc.get("lang"), where and how can I set the lang property for

The lang field is put there by the language identifier plugin if it is active.

http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/lang/LanguageIndexingFilter.html

--
 Sami Siren


Re: List owner?

2007-01-07 Thread Sami Siren
Owner can be reached at [EMAIL PROTECTED]

What kind of error are you experiencing (if any)?

--
 Sami Siren

James Phillips wrote:
 Can somebody tell me how to contact the owner of this list? I have tried
 on COUNTLESS occasions to remove myself using
 [EMAIL PROTECTED] but still keep on receiving
 e-mails.
 
 Regards,
 
 James Phillips
 
 



Re: Nutch .81: the process to add a new analyzer ?

2007-01-07 Thread Sami Siren
chee wu wrote:
 Thanks Sami. I tried the LanguageIndexingFilter, and it seems the 
 LanguageIdentifier can't recognize Chinese now?

No it doesn't. The list of languages can be checked here (*.ngp):
http://svn.apache.org/viewvc/lucene/nutch/branches/branch-0.8/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/

You can build an ngp profile for Chinese, but I think that in the language
identifier's current form it might not work that well.

You could also build a specialized identifier and add it as an indexing
filter - the most basic form could just blindly set lang to Chinese if
that suits your use case.

--
 Sami Siren

 
 - Original Message - 
 From: Sami Siren [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Sent: Sunday, January 07, 2007 5:47 PM
 Subject: Re: Nutch .81: the process to add a new analyzer ?
 
 
 Chee Wu wrote:
 Hi,
 I am trying to add a new analyzer for Chinese,and I found the
 code below in the org.apache.nutch.indexer.Indexer

 The question of mine is:
 For doc.get(lang). Where and how can I  set the  lang property for
 lang field is put there by language identifier plugin if it is active.

 http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/lang/LanguageIndexingFilter.html

 --
 Sami Siren




Re: Nutch .81: the process to add a new analyzer ?

2007-01-07 Thread Sami Siren
e w wrote:
 If someone could explain the reasoning/motivation behind the original

The current n-gram identifier in nutch works pretty much ok for most
western languages. It is also a very simple and quite fast way of
identifying a document's language. However, if the charset of the document is
not detected right, the results are not that good.

 identification method that would be helpful. Otherwise, I'd be happy to
 contribute my pseudo-NB hack and maybe even implement the correct version.

Go ahead and attach it to JIRA. I am sure there are plenty of people
interested in such a thing.

--
 Sami Siren



Re: How best to add sponsored link support..??

2006-12-19 Thread Sami Siren

Are you looking for something like the Google KeyMatch as described in [1],
which was then more or less mimicked in the nutch web2 module [2],
and since also released, at least as a lookalike, on Google Code [3]?

--
Sami Siren

[1] http://www.google.com/enterprise/mini/end_user_features.html
[2]
http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/plugins/web-keymatch/
[3] http://custom-keymatch-onebox.googlecode.com/svn/trunk/Keymatch.java

2006/12/19, RP [EMAIL PROTECTED]:


Let me qualify this - ad banner rotation is dealt with - I'm looking for
something that will use our Nutch engine to serve up relevant links from
people who pay for that privilege.  We do not want to serve up ads from
someone else's system i.e. the big G or Y, but use our own Nutch search
results to serve up relevant paying links that we have sold and
maintain.   In a simple relational SQL world we would add a flag and
another table with the links and scores and look that up and pass back
when needed.  Problem with that is that we lose the whole multi word
scoring capability in Nutch i.e. pizza beer Chicago, should serve up a
Chicago pizza ad first and beer ads further down, just like our search
results have relevancy (not a great example but you get the idea).
Re-writing a scoring engine to do that in SQL seems like a waste when
Nutch already does it just fine.

So in a nutshell - we need to do what the big G and Y and other do when
serving up key word based sponsor links.  My thought - automate the
build of a dummy page with the key words bought that would be indexed
and served up just like regular crawled and indexed pages, using the
scoring to rank them in terms of relevancy and placement - I have not
seen any snippets of code to do simple insert/update/delete operations
on a Nutch segment or index however

This is the idea gathering phase - think like a school/college search
engine with local paying advertisers - we want to serve those links up
to the searchers to help offset the cost of the service and serve up or
flag links that rank first because of payment followed by normal search
link results

rp

Sean Dean wrote:
 I might be totally off base with what your asking to do, but take a look
at this open source project: http://phpadsnew.com/two/.

 Its basically an advertising engine, built on PHP. Integration within
any application is a breeze, and it supports external advertising such as
Google Ads.

 Sean

 - Original Message 
 From: RP [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Sent: Tuesday, December 19, 2006 10:52:56 AM
 Subject: How best to add sponsored link support..??


 Hi all,

 I've been tasked with looking into this and am not a coder - that said,
 Nutch  is doing great and the bean counters have asked me to look into
 adding sponsored link results and I'm wondering how best to add this.

 It would be nice to utilize the Nutch engine to come up with the pages
 versus just doing a lookup on words and results in a flat file but the
 key word data could change daily (hourly) and would need to be able to
 be hand entered (or automated) as people sign up (re-index is not really
 an option).  I'm not sure this would fly within the main Nutch segments
 and index, but I could see maybe a separate index or possibly adding a
 flag to the existing data but I've not seen any easy to use tools to
 change/update/insert records into what is already there (yes Luke on the
 index but that does not touch the segment data, right?).  I don't want
 to change existing searched data and I don't see an issue with having
 duplicate results (sponsored up top and existing entry down below
 somewhere) but it would be more elegant to not have that occur.  I also
 see issues in a simple flat file look up as a multiple word search is
 best handled inside Nutch to score the results versus having to do
 something similar in the sponsored results.  I can see the need to
 control the summary text displayed and also pass thru any codes in the
 URL which are currently being stripped during the main crawl/index
 cycle.  I also see issues with seriously customizing the internals as
 they would have to be maintained as Nutch itself is updated

 If anyone has looked at this and has at least some ideas on how best to
 do this let me know.  I need to come up with a preliminary estimate
 before I can engage and pay the coders to make this happen so if there
 are any easy or best practices ways on doing this any help/pointers
 would be appreciated






Re: subcollections

2006-12-16 Thread Sami Siren
liv wrote:
 - I reindex the db: delete folder indexes, run the command:
 
 bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
 
 - then I inspect the resulting db with luke again
 
 Unfortunately nothing has changed. Maybe I am missing something... Please
 tell me if you see anything wrong.

If you did exactly those steps, then what happens is that the
subcollections.xml is read from inside the .job file. You need to
rebuild the .job to put the new file inside of it.

Simply run ant and rerun indexing, and it should work as expected.
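
A minimal sketch of that sequence, assuming the crawl layout from your 
earlier message:

ant                  # rebuilds the .job so it picks up the new subcollections.xml
rm -r crawl/indexes  # remove the old index
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*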

--
 Sami Siren



Re: error with trunk: linkdb copied to wrong dir

2006-12-14 Thread Sami Siren
Andrzej Bialecki wrote:
 Espen Amble Kolstad wrote:
 Hi,

 There's a bug in LinkDb.install(). It tries to rename an old linkdb from
 linkdb/current to linkdb/old, and linkdb/current doesn't exist.
 Just replace:
 fs.rename(current, old);
 with:
 if (fs.exists(current)) {
fs.rename(current, old);
 }

 and it will work again :)
   
 
 
 Indeed, this is related to some changes of delete()'s behavior in HDFS -
 it seems that previously it would just return false on non-existent
 directories, now it throws an Exception.

the needle is here?

http://issues.apache.org/jira/browse/NUTCH-392

--
 Sami Siren


Re: subcollections

2006-12-14 Thread Sami Siren
liv wrote:
 I intend to use nutch with a fairly complex structure of subcollections. I
 did some tests and the storage/search performs as expected; however there is
 an aspect I may have neglected and cannot find an answer. 
 
 How/at which stage are subcollections added to the index structure?

If you are talking about the subcollections generated by the
subcollection plugin then the subcollection data is stored at indexing
phase.

 I plan on crawling frequently, adding new sites to existent repository,
 merging/reindexing as needed. However if I need to change the subcollection
 structure (ie. add a site to a newly created subcollection) I don't want to
 recrawl it again. I hope it can be done by simply using the existent/crawled
 data.

no need to recrawl, unfortunately you still need to reindex.

--
 Sami Siren


Re: Fetcher hung on final hurdle - continue?

2006-12-08 Thread Sami Siren
 Prefix filter to cut off anything without "http://". And then a
 (non-existent) domain-suffix filter, which considers only domain
 suffixes - this is easy to implement based on the suffix filter that
 ships with Nutch.

We should probably change the default filter to be something other than
regex.
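
For the prefix filter idea, a hedged sketch (plugin and file names as I 
recall them for this era; double-check against your version): enable 
urlfilter-prefix in plugin.includes and list the allowed prefixes in 
conf/prefix-urlfilter.txt, e.g.:

# prefix-urlfilter.txt: only URLs starting with these prefixes pass
http://
https://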

--
 Sami Siren


Re: indexing from local file system -- indexing from HDFS

2006-11-22 Thread Sami Siren

Christian Herta wrote:
I tried to Index my  local file system according to the FAQ: 
http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6


But if I add the plugin into the nutch-site.xml file like this:

  <property>
    <name>plugin.includes</name>
    <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>



try with:

<value>protocol-(file|http)|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>

If it does not work, consult your log file logs/hadoop.log for more 
specific info about your problem.





Additionally I have another question:
 * Is there a possibility to use a directory of the HDFS Filesystem as a
spool directory to index from?


Not directly, but if you can expose [1] hdfs via some available protocol, 
then it is possible to index the contents of hdfs as well.


One could also write a protocol-hdfs plugin to do the job.

--
 Sami Siren


[1]http://issues.apache.org/jira/browse/HADOOP-4


Re: Fetch fails

2006-11-22 Thread Sami Siren

frgrfg gfsdgffsd wrote:

Hi all,

I have a problem with the crawl/fetch of 1 website (www.lequipe.fr), although 
it works fine for another (www.lemonde.fr).

Here are the errors:
ERROR [MAT] 2006-11-22 00:36:20,860 - Http.invoke0(?) | 
java.lang.IllegalArgumentException: null metadata
ERROR [MAT] 2006-11-22 00:36:20,870 - Http.invoke0(?) | at 
org.apache.nutch.protocol.Content.init(Content.java:60)
ERROR [MAT] 2006-11-22 00:36:20,870 - Http.invoke0(?) | at 
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:196)
ERROR [MAT] 2006-11-22 00:36:20,870 - Http.invoke0(?) | at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:162)

Don't understand why metadata is null when there are some metadata on the pages... 



what version of nutch are you running?



I also have this messsage just before:
INFO [MAT] 2006-11-22 00:36:32,477 - HttpBase.getProtocolOutput(194) | 
Skipping: http://www.lequipe.fr/ exceeds fetcher.max.crawl.delay, max=30, 
Crawl-Delay=120

and i can't find this property in nutch-site.xml


You need to add it there.

<property>
  <name>fetcher.max.crawl.delay</name>
  <value> your value here </value>
</property>

--
 Sami Siren


Re: Nutch sessions cookies on https protocol

2006-11-22 Thread Sami Siren

Gavino Marras wrote:

Does Nutch work with sessions and cookies on the https protocol?


No, Nutch does not support cookies nor sessions.

--
 Sami Siren


Re: Nutch sessions cookies on https protocol

2006-11-22 Thread Sami Siren

Andrzej Bialecki wrote:

Sami Siren wrote:

Gavino Marras wrote:

Does Nutch work with sessions and cookies on the https protocol?


No, Nutch does not support cookies nor sessions.


This is not strictly speaking true ... if you use protocol-httpclient 
then https, cookies and sessions are supported internally by the 
httpclient library, but Nutch doesn't process this information in any way.


So, https works just fine, cookies are accepted and then presented if 
other urls are fetched during the same execution, but they are not 
stored anywhere.
Server-set cookies are just http headers, so they _are_ stored with the rest 
of the headers.


Https works even without protocol-httpclient if a proxy that supports 
https is used.


Anyway, the way I understood the question, I would still answer no to 
sessions and cookies.


--
 Sami Siren





Re: Strategic Direction of Nutch

2006-11-13 Thread Sami Siren

carmmello wrote:
So, I think, one of the possibilities for the user of a single machine 
is that the Nutch developers could use some of their time to improve the 
previous 0.7.2, adding to it some new features, with further releases of 
this series.  I don't believe that there are many Nutch users, in the 
real world of searching, with a farm of computers.  I, for myself, have 
already built an index of more than one million pages on a single 
machine, with a somewhat old Athlon 2.4+ and 1 gig of memory, using the 
0.7.2 version, with very good results, including the actual searching, 
and gave up the same task, using the 0.8 version, because of the large 
amount of time required, time that I did not have, to complete all the 
tasks after the fetching of the pages.


How fast do you need to go?

I did a 1 million page crawl today with the trunk version of Nutch patched 
with NUTCH-395 [1]. Total time for fetching was a little over 7 hrs.


But of course there are still various ways to optimize the fetching process 
- for example optimizing the scheduling of urls to fetch, improving the 
Nutch agent to use the Accept header [2] to fail fast on content it 
cannot handle, etc.


[1]http://issues.apache.org/jira/browse/NUTCH-395
[2]http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg04344.html

--
 Sami Siren


Re: Strategic Direction of Nutch

2006-11-13 Thread Sami Siren

Uroš Gruber wrote:

How fast do you need to go?

I did a 1 million page crawl today with the trunk version of Nutch patched 
with NUTCH-395 [1]. Total time for fetching was a little over 7 hrs.



How is that even possible?

I have a 3.2GHz Pentium with 2G of RAM. I had the same speed problem; because of 
that I set up Nutch on a single node. About an hour ago the fetcher finished 
crawling 1.2 million pages. But this took


I am running on an AMD Athlon 64 3600+ with 1 GB of memory, so it's not even 
high end.
While running the map job I get about 24 pages/s. I didn't test it with this patch. 
But then the reduce job was slow as hell. I really don't understand what took 
so long. It is almost twice as slow as the map job.


Please try the trunk version for comparison and check back for results. 
(the patch is now applied to trunk)


There are also other things that count (even more?), please see [1]


If I use local mode numbers are even worse.


My numbers are with the local job runner.


I can't imagine how long it would take to crawl, let's say, 10 million pages.

I'll let you know when mine is finished; I just started a 3rd segment of 
size 1 million to test the trunk version (running with the local job runner).


--
 Sami Siren


[1]http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06533.html


Re: Nutch as static exporter?

2006-10-31 Thread Sami Siren

Thorsten Scherler wrote:

Hi all,

I wonder if I could use nutch as a static exporter. 


I mean, e.g., Apache Forrest is using the Cocoon crawler, but in the next
version of Cocoon the crawler will probably not be included anymore.

Could I use nutch for that?


Could you please explain a bit more what "static exporter" means?

--
 Sami Siren


Re: large number of urls from Generator are not fetched?

2006-10-31 Thread Sami Siren
Are you saying that the generator generates 200k urls but the fetcher fetches 
around 100k, or are you saying that you generate (-topN 20) 200k urls 
and the fetcher fetches only around 100k?


If the latter, and you are running with LocalJobRunner, you need to generate 
with -numFetchers 1.


--
 Sami Siren

AJ Chen wrote:

Any idea why nutch (0.9-dev) does not try to fetch every url generated? For
example, if Generator generates 200,000 urls, maybe 100,000 urls will be
fetched, succeeded or failed. This is a big difference, which is obvious by
checking the number of urls in the log or running readseg -list. What causes a
large number of urls to get thrown out by the Fetcher?

Thanks,




Re: Speeding things up!

2006-10-29 Thread Sami Siren

Forgot one important one:

set generate.max.per.host to something reasonable so you won't end up 
fetching urls from only a small number of hosts, which by default can make fetching very slow.
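
For example (the value is only illustrative; pick what fits your crawl):

<property>
  <name>generate.max.per.host</name>
  <value>100</value>
</property>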


--
 Sami Siren

Sami Siren wrote:

Some simple rules for generally speeding things up

1. Crawl only the content you are going to handle (do not fetch, 
for example, pdf files if you don't need them; also disable all unneeded 
parsers).


2. If using regex-urlfilter: if you don't need the rule
-.*(/.+?)/.*?\1/.*?\1/, remove it (also keep the number of rules as 
small as possible, still remembering #1 and #3); see the trimmed-down sketch after this list.


3. Check your parser configuration (see NUTCH-362) so your CPU won't end 
up parsing all kinds of binary content with the text parser.
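
As a rough sketch, a trimmed-down conf/regex-urlfilter.txt along the lines of 
the stock one might look like this (keep only the rules you actually need):

# skip file:, ftp: and mailto: urls
-^(file|ftp|mailto):
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# the loop-breaking rule is expensive; leave it out if you don't need it
# -.*(/.+?)/.*?\1/.*?\1/
# accept everything else
+.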


You might also check the variables like fetcher.server.delay and 
fetcher.threads.per.host. (and remember to keep your fetcher polite!)


I am using something like 300 for fetcher.threads for fetching with 
0.8.1 on a single Athlon 64 with 1 GB of memory.
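
For reference, those knobs go into conf/nutch-site.xml; the values below are 
only illustrative (property names as in the 0.8-era nutch-default.xml, where 
the thread count is fetcher.threads.fetch), and should be tuned with 
politeness in mind:

<property>
  <name>fetcher.threads.fetch</name>
  <value>300</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
</property>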


I am also in the process of fixing some IO-related bottlenecks and will get 
back to that hopefully sooner rather than later.


--
 Sami Siren




Marco Vanossi wrote:

Hi,

Do you have some hints that would improve speed for the following nutch
commands?

./nutch generate db segments -topN 1000
s=`ls -d segments/2* | tail -1`
./nutch fetch $s
./nutch updatedb db $s
./nutch index $s
./nutch dedup segments tmpfile

I mean, do you have some hints for the numbers set in
nutch-default.xml, for example:
fetcher.threads (I'm using 10,000), etc.
Let's say it is running on a machine with 12 GB RAM and a 2,000 GB HD.

Thank you very much for any help.

Marco








Re: Nutch slow how to speed up?

2006-10-24 Thread Sami Siren
Are you using DistributedSearch? And the local filesystem to store the index and 
related data?


--
 Sami Siren


Håvard W. Kongsgård wrote:
I have nutch 0.8.1 running on 3 servers (AMD X2 3800 with 4 000 memory), 
searching with queries like 'China Nuclear Forces' takes 20 – 25 s.


My config:
http.content.limit = 6165536
dfs.replication = 1
mapred.submit.replication = 2
mapred.child.java.opts = -Xmx800m

My data:
TOTAL urls: 3748140
retry 0: 3614731
retry 1: 85999
retry 2: 20772
retry 3: 26638
min score: 0.0
avg score: 0.64956105
max score: 3922.723
status 1 (DB_unfetched): 1316016
status 2 (DB_fetched): 2168397
status 3 (DB_gone): 263727

Status: HEALTHY
Total size: 254534723272 B
Total blocks: 5140 (avg. block size 49520374 B)
Total dirs: 260
Total files: 1466
Over-replicated blocks: 8 (0.15564202 %)
Under-replicated blocks: 0 (0.0 %)
Target replication factor: 1
Real replication factor: 1.0015564

The filesystem under path '/' is HEALTHY





Re: Nutch slow how to speed up?

2006-10-24 Thread Sami Siren
If the data to be searched lies in DFS, it is slow. You need to first 
copy it out to the local file system. Split your data into smaller slices, 
which you then distribute evenly across your search nodes.
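
A minimal sketch of that copy step (paths are illustrative):

# pull one slice of the crawl data out of DFS onto a search node's local disk
bin/hadoop dfs -copyToLocal /user/nutch/crawl /data/local/crawl
# then point the search server (searcher.dir) at the local copy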


This part of the process is not that well covered, and I am looking for much 
improvement in this area from this proposal:


http://mail-archives.apache.org/mod_mbox/lucene-general/200610.mbox/[EMAIL PROTECTED]

--
 Sami Siren



Håvard W. Kongsgård wrote:

DistributedSearch
2x datanodes, 2x Task Trackers

Sami Siren wrote:
You are using DistributedSearch? and local filesystem to store index 
and related data?


--
 Sami Siren


Håvard W. Kongsgård wrote:
I have nutch 0.8.1 running on 3 servers (AMD X2 3800 with 4 000 
memory), searching with queries like 'China Nuclear Forces' takes 20 
– 25 s.


My config:
http.content.limit = 6165536
dfs.replication = 1
mapred.submit.replication = 2
mapred.child.java.opts = -Xmx800m

My data:
TOTAL urls: 3748140
retry 0: 3614731
retry 1: 85999
retry 2: 20772
retry 3: 26638
min score: 0.0
avg score: 0.64956105
max score: 3922.723
status 1 (DB_unfetched): 1316016
status 2 (DB_fetched): 2168397
status 3 (DB_gone): 263727

Status: HEALTHY
Total size: 254534723272 B
Total blocks: 5140 (avg. block size 49520374 B)
Total dirs: 260
Total files: 1466
Over-replicated blocks: 8 (0.15564202 %)
Under-replicated blocks: 0 (0.0 %)
Target replication factor: 1
Real replication factor: 1.0015564

The filesystem under path '/' is HEALTHY











Re: Modifying Nutch core

2006-10-24 Thread Sami Siren

Right now it seems like I have to run `ant package` and then copy the
nutch-0.8.jar file out of the build dir and into the nutch dir.  But that
takes a really long time!  I'd like to just be able to run `ant
compile-core` and then run bin/nutch...

How should I be doing this?


first:
ant (to compile and to create nutch-x.x.x.job)

then:
bin/nutch ...
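
In other words, assuming the stock bin/nutch script (which picks up the 
build/ output when present), something like:

ant                                    # compiles and builds nutch-x.x.x.job
bin/nutch readdb crawl/crawldb -stats  # any command now runs against the fresh build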

--
 Sami Siren


Re: Indexing the file system / best approach

2006-10-18 Thread Sami Siren

Bruno Thiel wrote:

All,

I want to get nutch to index the file system. My first approach was to
nfs-mount the file system and let nutch crawl through the hierarchy over
http/Apache. This turned out to be fairly slow, ~3,000 fetches per hour.
The next approach was to go via file:/// and to generate a file list
to be crawled. This file list is fairly big, ~200,000 entries, and with the
current 0.8.1 release of nutch the fetcher just freezes right at the end of
a crawl.


What exactly happens when your fetcher freezes? 200,000 entries is not a
big list to be fetched.

--
Sami Siren



Re: Lucene query support in Nutch

2006-10-07 Thread Sami Siren


Nevertheless, I agree that there should be an option to choose the 
Lucene query engine instead of the Nutch flavour one because Nutch has 
been proven to be equally suitable for areas which do not require as 
efficient queries (like intranet crawling for instance) as an all-out 
web indexing application.


I agree also. Different query parsers could perhaps be made pluggable or 
at least configurable. The current(-alike) implementation could be the 
default one offered and by configuration one could switch it to 
intranet mode.


Contributions anyone?

--
 Sami Siren


Re: stop an index server

2006-09-29 Thread Sami Siren
It seems that this was not reaching nutch-user, so here it is again in 
case someone else is also interested.


---

hello,

Here's an ad hoc addition to the search server to support a shutdown command.

The client calls the server like this:

bin/nutch 'org.apache.nutch.searcher.DistributedSearch$Client'
-shutdown 127.0.0.1 

--
  Sami Siren

Alvaro Cabrerizo wrote:



2006/9/27, Sami Siren [EMAIL PROTECTED] mailto:[EMAIL PROTECTED]:

Alvaro Cabrerizo wrote:
  How could I stop an index server (started with bin/nutch server <port>
  <index>) knowing the port?
 
  Thanks in advance.
 

It does not support such a feature. Can you describe a little bit more
what you are trying to accomplish? Something similar to Tomcat's SHUTDOWN?


Sure,
That's right. If this feature doesn't exist, I'm looking for a clue to 
develop a SHUTDOWN and a RESTART command using the Nutch/Hadoop API. The 
idea is to have a group of Java classes that lets people execute a 
command like SERVER_RESTART port or, more advanced, SERVER_RESTART 
port ip_address.


Anyway, I can execute ps aux | grep 4 in a shell and find out the 
process number in order to kill it, or I can press ^C to stop it, but 
this is not the solution I'm looking for.



Thanks, in advance.
 


--
  Sami Siren





Index: src/java/org/apache/nutch/searcher/NutchBean.java
===
--- src/java/org/apache/nutch/searcher/NutchBean.java	(revision 447940)
+++ src/java/org/apache/nutch/searcher/NutchBean.java	(working copy)
@@ -25,10 +25,12 @@
 
 import org.apache.hadoop.fs.*;
 import org.apache.hadoop.io.Closeable;
+import org.apache.hadoop.ipc.RPC.Server;
 import org.apache.hadoop.conf.*;
 import org.apache.nutch.parse.*;
 import org.apache.nutch.indexer.*;
 import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.searcher.DistributedSearch.Protocol;
 import org.apache.nutch.util.NutchConfiguration;
 
 /** 
@@ -36,8 +38,8 @@
  * @version $Id: NutchBean.java,v 1.19 2005/02/07 19:10:08 cutting Exp $
  */   
 public class NutchBean
-  implements Searcher, HitDetailer, HitSummarizer, HitContent, HitInlinks,
- DistributedSearch.Protocol, Closeable {
+  implements Protocol, Searcher, HitDetailer, HitSummarizer, HitContent, HitInlinks,
+ Closeable {
 
   public static final Log LOG = LogFactory.getLog(NutchBean.class);
 
@@ -400,12 +402,29 @@
 
   public long getProtocolVersion(String className, long arg1) throws IOException {
 if(DistributedSearch.Protocol.class.getName().equals(className)){
-  return 1;
+  return DistributedSearch.Client.versionID;
 } else {
    throw new IOException("Unknown Protocol classname: " + className);
 }
   }
 
-
-
+  public void shutdown() {
+try {
+  LOG.info("Closing NutchBean instance " + this);
+  this.close();
+} catch (IOException e) {
+  // TODO Auto-generated catch block
+  e.printStackTrace();
+}
+final Server server=(Server)conf.getObject(DistributedSearch.DISTRIBITED_SERVER_INSTANCE);
+
+new Thread(){
+public void run(){
+
  LOG.info("Shutting down server instance: " + server);
+  server.stop();
+}
+}.start();
+
+  }
 }
Index: src/java/org/apache/nutch/searcher/DistributedSearch.java
===
--- src/java/org/apache/nutch/searcher/DistributedSearch.java	(revision 447940)
+++ src/java/org/apache/nutch/searcher/DistributedSearch.java	(working copy)
@@ -38,6 +38,8 @@
 
 /** Implements the search API over IPC connnections. */
 public class DistributedSearch {
+  
+  public static final String DISTRIBITED_SERVER_INSTANCE = "DistribitedServerInstance";
   public static final Log LOG = LogFactory.getLog(DistributedSearch.class);
 
   private DistributedSearch() {}  // no public ctor
@@ -48,11 +50,17 @@
 
 /** The name of the segments searched by this node. */
 String[] getSegmentNames();
+
+
+/** Ask server to shutdown itself
+ * @throws IOException */
+void shutdown();
   }
 
   /** The search server. */
   public static class Server  {
 
+
 private Server() {}
 
 /** Runs a search server. */
@@ -70,6 +78,7 @@
   Configuration conf = NutchConfiguration.create();
 
   org.apache.hadoop.ipc.Server server = getServer(conf, directory, port);
+  conf.setObject(DISTRIBITED_SERVER_INSTANCE, server);
   server.start();
   server.join();
 }
@@ -83,7 +92,7 @@
 
   /** The search client. */
   public static class Client extends Thread
-implements Searcher, HitDetailer, HitSummarizer, HitContent, HitInlinks,
+implements Protocol, Searcher, HitDetailer, HitSummarizer, HitContent, HitInlinks,
Runnable {
 
 private InetSocketAddress[] defaultAddresses;
@@ -143,6 +152,8 @@
 private static final Method SEARCH;
 private static final Method DETAILS;
 private static final Method SUMMARY

Re: Problem Searching

2006-09-29 Thread Sami Siren

WebDev Freak wrote:

Hi, I'm using the subcollection.xml file to create a collection but I can't
find any code samples to search for a term in a specific collection.  I'm
looking for Java code samples.


Look in contrib/web2; there's a piece of Java code that does this (it reads 
the collection name from a request parameter, put there by the view part of 
that plugin, and modifies the query object accordingly):


http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/plugins/web-subcollection/src/java/org/apache/nutch/webapp/subcollection/SubcollectionPreSearchExtension.java?view=markup

So basically what you need to do is modify the Query.
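
A minimal sketch of that in Java (method names as in the 0.8-era searcher 
API - double-check against your version; the "subcollection" field name is 
an assumption based on the subcollection plugin):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class CollectionSearch {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);
    // Parse the user's query string as usual...
    Query query = Query.parse("some term", conf);
    // ...then restrict it to one collection by requiring a term in the
    // field the subcollection indexing filter wrote.
    query.addRequiredTerm("mycollection", "subcollection");
    Hits hits = bean.search(query, 10);
    System.out.println("total hits: " + hits.getTotal());
  }
}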

--
 Sami Siren



Thanks,





Re: stop an index server

2006-09-27 Thread Sami Siren

Alvaro Cabrerizo wrote:

How could I stop an index server (started with bin/nutch server <port>
<index>) knowing the port?

Thanks in advance.



It does not support such a feature. Can you describe a little bit more 
what you are trying to accomplish? Something similar to Tomcat's SHUTDOWN?


--
 Sami Siren



[ANNOUNCE] Nutch 0.8.1 available

2006-09-26 Thread Sami Siren
The Nutch Project is pleased to announce the availability of the 0.8.1 release 
of Nutch - the open source web-search software based on Lucene and Hadoop.


The release is immediately available for download from:

http://lucene.apache.org/nutch/release/

Nutch 0.8.1 is a maintenance release for the 0.8 branch and fixes many 
serious bugs discovered in the previous release. For a list of changes, see


http://www.apache.org/dist/lucene/nutch/CHANGES-0.8.1.txt

A big thanks to everybody who participated and made this release possible.

--
  Sami Siren



Re: Cannot generate all injected URLS

2006-09-22 Thread Sami Siren
If you are running in non-clustered mode, run with the parameter 
-numFetchers 1 and you should get all the urls.


Perhaps we should fix this by adding a check in the generator:

if the task is run with the local job runner, that param should be forced to 1 
(now it defaults to job.getNumMapTasks(), which defaults to 2).
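
For example (crawldb and segments paths are illustrative):

bin/nutch generate crawl/crawldb crawl/segments -topN 5000 -numFetchers 1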


--
 Sami Siren

Frank Kempf wrote:

Hello,

I got stuck with generating.
Injecting 3200 urls into the database and generating afterwards always 
leads to the same result of having 1632 urls in crawl_generate.

(I checked the db and it actually has 3200 entries.)
No matter if I try -topN 5000 / 5 or nothing.
How could I generate the whole set of first-level urls?


  Kind regards

Frank






Re: Is that true?

2006-09-18 Thread Sami Siren

Your observations are correct; 0.8 has some serious problems and we'll be
putting 0.8.1 out pretty soon to also fix the performance problem you
describe.

--
Sami Siren

2006/9/18, carmmello [EMAIL PROTECTED]:


I have been trying Nutch since its version 0.3, sometimes with some
problems.  Now I am using the 0.7.2 release and I'm really happy with it,
to the point where I have about 1,100,000 pages indexed in a site that deals
with quality and environment.
But a new version means, at least in principle, a better product.  So I
went to try Nutch 0.8, on the same single computer (Athlon 2400+, 1
gig RAM, about 4Mbit connection, 53 threads), same seed sites (but in a
folder, as per the tutorial).  I used a depth of 2, just to try the new
version (instead of 4 or 5, as I usually do), but when I went to the log,
I was really terrified: the fetching was horribly slow!  With Nutch
0.7.2 I got about 9 pages per second, and in Nutch 0.8 sometimes it took
about 3 seconds to fetch a single page!  Roughly
speaking, the fetching speed was reduced by a factor of 20!
So, that is my question:
Is that true, or did I make some big mistake?
Thanks



Re: log records

2006-09-01 Thread sami siren

Is your environment Windows or Linux?

You are saying that most are not logged - can you please give an example
of what is logged (and where) and also what is not.

Logging in general can be configured by editing conf/log4j.properties
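
For example, to get more fetcher detail into logs/hadoop.log you could raise 
the relevant logger level in conf/log4j.properties (a sketch; logger names 
follow the Java package names, and the appender wiring stays as shipped):

log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG
log4j.logger.org.apache.nutch=INFO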

--
Sami Siren


2006/9/1, AJ Chen [EMAIL PROTECTED]:


When running the fetcher (0.9-dev) in Eclipse, lots of log messages are printed
as expected, including the status - pages *, errors *, *kb/s. But when
using the nutch script to do fetching, most of the log messages, including the
status, are not in stdout, nor in logs/hadoop.log.  Did I miss a setting?
How do I make the nutch script print out all the info and error messages to
stdout or the log file?
thanks,
AJ




Re: Is there a way to get Nutch to parse/index by file access directly (not over HTTP)?

2006-08-28 Thread sami siren

The fetcher can also fetch with the file protocol. This is not as efficient as it
could be because you still need to
go through the full crawling cycle. It would be more efficient to use (write) a
special crawler that would start from a submitted path and follow all
subdirectories and files.

Such a crawler could also be successfully used for efficient crawling of smb,
ftp and webdav resources.

--
Sami Siren

2006/8/27, Sandy Polanski [EMAIL PROTECTED]:


This may be more of a straight Lucene task, but I thought I'd ask
anyway.  Rather than using Nutch as a crawler, I'd rather just send the
Nutch parser and indexer over to a directory on my server and have it detect
content-type by the file extension.

I'd prefer to skip the whole crawling part since all of my data is local,
and increase the reliability of getting all of my proper data indexed.  Is
this possible?





Re: Nutch doesn't dive deeper

2006-08-27 Thread sami siren

This is yet another side effect of applying the TextParser to non-plain-text
documents, and in this particular case it falls short with namespace
declarations. I propose that we remove the plain text parser from at least
the following mime types:

* (default)
application/rss+xml
application/vnd.wap.wbxml
application/vnd.wap.wmlc
application/vnd.wap.wmlscriptc
application/xhtml+xml
application/x-latex
application/x-netcdf
application/x-tex
application/x-texinfo
application/x-troff
application/x-troff-man
application/x-troff-me
application/x-troff-ms
message/news
message/rfc822
text/css
text/sgml
text/vnd.wap.wml
text/xml
text/x-setext

I would guess that handling of the application/xhtml+xml mime type should be done with
the html parser anyway.
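
For instance, mapping it to the html parser in conf/parse-plugins.xml would 
look roughly like this (following the existing mimeType/plugin entries in 
that file):

<mimeType name="application/xhtml+xml">
  <plugin id="parse-html" />
</mimeType>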

--
Sami Siren

2006/8/25, Michael Wechner [EMAIL PROTECTED]:


I think the problem is as follows with XHTML files:

2006-08-25 16:06:11,925 WARN  parse.ParserFactory -
ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
contentType application/xhtml+xml via parse-plugins.xml, but its
plugin.xml file does not claim to support contentType:
application/xhtml+xml
2006-08-25 16:06:11,965 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: xmlns
at java.net.URL.<init>(URL.java:544)
at java.net.URL.<init>(URL.java:434)
at java.net.URL.<init>(URL.java:383)


whereas maybe this could be resolved with

http://issues.apache.org/jira/browse/NUTCH-359

I am kind of surprised that nobody else is having this problem with
proper XHTML ;-)

Thanks

Michi

Ken Gregoire wrote:

 look here, it is blocking robots: http://ulysses.wyona.org/robots.txt

 User-agent: *
 Disallow: /foo/bar.html

 User-agent: lenya
 Disallow: /foo/bar.html





 Michael Wechner wrote:

 Hi

 I am trying to index http://ulysses.wyona.org/ but somehow it just
 indexes the homepage but doesn't seem to follow
 any links. I have set depth 3 and other sites are being crawled
 deeper without a problem but not the Ulysses page.

 Has anyone had similar experiences?

 Is it possible that Nutch has problem with well-formed XHTML
 (application/xhtml+xml)?

 Thanks

 Michi




--
Michael Wechner
Wyona  -   Open Source Content Management   -Apache Lenya
http://www.wyona.com  http://lenya.apache.org
[EMAIL PROTECTED][EMAIL PROTECTED]
+41 44 272 91 61




Re: Making crawler stop after all pages are found.

2006-08-27 Thread Sami Siren
The job should terminate on its own, but not as soon as all pages are 
found - only after -depth iterations.


Are you saying it won't honor the -depth parameter?

--
 Sami Siren

Sandy Polanski wrote:

Sami, in 0.7.2 my intranet crawling job did terminate on its own.  The issue 
that I described only started since I began to use 0.8.  Maybe you understand 
the changes in the code/methods better than I do between the versions so that 
you could point me in the right direction (for opening an issue or writing a 
patch).

sami siren [EMAIL PROTECTED] wrote: There's no such feature present in Nutch 
currently. Feel free to open an issue
(of type new feature) in the Nutch Jira and provide a patch, or wait until
someone else gets to it.

--
 Sami Siren


2006/8/27, Sandy Polanski :

On my intranet, I have 8100 documents.  The nutch crawler finds all of
them fine, but the process does not end.  It just keeps on creating empty
segments timestamp directories.  What conf setting will make it stop on its
own when there are no more links in the fetch list?

Thanks,
Sandy








Re: Nutch doesn't dive deeper

2006-08-27 Thread sami siren

2006/8/27, Chris Mattmann [EMAIL PROTECTED]:


Hi Sami,

  I'm not sure that I agree that the entire set of mime types that you
list
below should be removed from the parse-plugins.xml default mapping. For
instance, if you look at the current mapping file, many of the types below
would have no other option for parsing them besides the TextParser. I
think
it makes a lot of sense to parse some of the below documents with the
TextParser because, in fact, they are text documents.




A LaTeX document is a plain text document.



Yes, it can contain textual content among other things. However, without
proper parsing the outcome (at least parts of it) is not something I would
like to see in search results.

Text/css is essentially a plain text document.




Yes, contents are most often ASCII, but is it really something one wants to
index by default?


An rfc822 message is indeed (stripped of headers) a plain text document.



Yes, contents are most often ASCII, but I guess just as often encoded (for
example MIME) so as to be more or less useless in unparsed form.

  There's a careful tradeoff that must be made in terms of having a default

config file that allows the greatest coverage of mime types that are
available, and the handling of them with at least * one * parser, in
contrast to not including any parser at all for a particular mime type. I
struggled with this very issue when I initially created that file and what
you see in there now represents a best guess of mime types mapped to the
available parsers that exist in Nutch. The other option on that file is
that
people can modify it on their own. For instance, in a domain-specific
deployment, a user can add and remove whatever mime type to plugin
mappings
she wants from the parse-plugins.xml file: it was never meant to be
something that was set in stone per se. It would be good to see some
experiments to see what the best config set for parse-plugins.xml is.




My opinion is that we should not try to pretend to be able to parse
something when we really cannot. We should ship a default config that
covers the greatest set of mime types Nutch really can handle. Then again,
those two text types of documents you picked are quite rare and not
mainstream, and probably enabling/disabling them doesn't really make any
difference in search results.

--
Sami Siren


Re: Making crawler stop after all pages are found.

2006-08-26 Thread sami siren

There's no such feature present in Nutch currently. Feel free to open an issue
(of type new feature) in the Nutch Jira and provide a patch, or wait until
someone else gets to it.

--
Sami Siren


2006/8/27, Sandy Polanski [EMAIL PROTECTED]:


On my intranet, I have 8100 documents.  The nutch crawler finds all of
them fine, but the process does not end.  It just keeps on creating empty
segments timestamp directories.  What conf setting will make it stop on its
own when there are no more links in the fetch list?

Thanks,
Sandy




