Re: Update on ignoring menu divs
Andrzej Bialecki wrote: On 2010-02-28 18:42, Ian M. Evans wrote: Using Nutch as a crawler for solr. I've been digging around the nutch-user archives a bit and have seen some people discussing how to ignore menu items or other unnecessary div areas like common footers, etc. I still haven't come across a full answer yet. There is no such functionality out of the box. One direction that is worth pursuing would be to create an HtmlParseFilter plugin that wraps the Boilerpipe library http://code.google.com/p/boilerpipe/ . Andrzej, have you tested that lib? If the result is of decent quality it would be nice to have that wrapped as a plugin in Nutch. -- Sami Siren
Re: Nutch 1.0 with tomcat6 and Firefox does not find all files on Fedora 12
Hannu, Do you use same set of QueryFilters both in the webapp and when running from shell? Perhaps your filter is not executed when running from cli? You can verify how your query is transformed by running bin/nutch org.apache.nutch.searcher.Query and entering some queries. -- Sami Siren Hannu Väisänen wrote: I am using Nutch 1.0 to index files written in Finnish. I have written a filter MorphologyHVSuggestionFilter that converts Finnish words to a base form (that you find in dictionaries) and I index just the base forms so that I find all inflected forms when searching just for the base form. When I search for the word 'kuka' like this bin/nutch org.apache.nutch.searcher.NutchBean kuka Total hits: 245 Tomcat6 finds also 245 hits. But when I search for word 'kuusi' bin/nutch org.apache.nutch.searcher.NutchBean kuusi Total hits: 212 Tomcat6 finds only 14 hits. Tomcat6 log shows this for word 'kuka': 2010-02-16 21:25:40,909 INFO NutchBean - query request from 0:0:0:0:0:0:0:1 2010-02-16 21:25:40,909 INFO NutchBean - query request from 0:0:0:0:0:0:0:1 2010-02-16 21:25:40,910 DEBUG MorphologyHVSuggestionFilter - Token1 (kuka,0,4) 2010-02-16 21:25:40,910 DEBUG MorphologyHVSuggestionFilter - Token2 (kuka,0,4) 2010-02-16 21:25:40,910 INFO NutchBean - query: kuka 2010-02-16 21:25:40,910 INFO NutchBean - query: kuka 2010-02-16 21:25:40,910 INFO NutchBean - lang: fi 2010-02-16 21:25:40,910 INFO NutchBean - lang: fi 2010-02-16 21:25:40,911 INFO NutchBean - searching for 20 raw hits 2010-02-16 21:25:40,911 INFO NutchBean - searching for 20 raw hits 2010-02-16 21:25:40,939 INFO NutchBean - re-searching for 40 raw hits, query: kuka -site: 2010-02-16 21:25:40,939 INFO NutchBean - re-searching for 40 raw hits, query: kuka -site: 2010-02-16 21:25:40,941 INFO NutchBean - found 0 raw hits 2010-02-16 21:25:40,941 INFO NutchBean - found 0 raw hits 2010-02-16 21:25:40,969 INFO NutchBean - total hits: 245 2010-02-16 21:25:40,969 INFO NutchBean - total hits: 245 Tomcat6 log shows this for word 'kuusi': 2010-02-16 21:23:12,777 INFO NutchBean - query request from 0:0:0:0:0:0:0:1 2010-02-16 21:23:12,777 INFO NutchBean - query request from 0:0:0:0:0:0:0:1 2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token1 (kuusi,0,5) 2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token2 (kuu,0,5) 2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token2 (kuusi,0,0,posIncr=0) 2010-02-16 21:23:12,778 INFO NutchBean - query: kuusi 2010-02-16 21:23:12,778 INFO NutchBean - query: kuusi 2010-02-16 21:23:12,778 INFO NutchBean - lang: fi 2010-02-16 21:23:12,778 INFO NutchBean - lang: fi 2010-02-16 21:23:12,780 INFO NutchBean - searching for 20 raw hits 2010-02-16 21:23:12,780 INFO NutchBean - searching for 20 raw hits 2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for url 2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for anchor 2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for content 2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for title 2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for host 2010-02-16 21:23:12,813 INFO NutchBean - total hits: 14 2010-02-16 21:23:12,813 INFO NutchBean - total hits: 14 The difference between words 'kuka' and 'kuusi' is that the word 'kuka' has only one base form (which happens to be 'kuka') but the word 'kuusi' has two base forms 'kuusi' and 'kuu' ('moon'; 'si' is a possessive suffix). 
So is it possible that when I search through Tomcat6, Nutch returns only those files that have both words 'kuusi' and 'kuu'? If so, how can I change this so that it finds files that have either 'kuusi' or 'kuu' (or, of course, any other base form of the word I search for)? :-)
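As a rough sketch of the check suggested above (assuming the shell and the webapp both read the same conf/ directory), you can run the query translator interactively and compare what the two problem words turn into:

  bin/nutch org.apache.nutch.searcher.Query
  # type 'kuka' and then 'kuusi' and compare the printed Lucene queries;
  # if both base forms 'kuu' and 'kuusi' show up as required clauses,
  # that would explain why only pages containing both of them match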
Re: Content storage, results highlighting
The schema.xml file there is usable only when using Solr as the search server. Are you using Solr? -- Sami Siren Pedro Bezunartea López wrote: Hi, I've developed a web application in lucene that searches web pages using a nutch-generated index. I'd like to highlight the query searched for when showing the results, and I understand that the content of the pages needs to be stored, as well as indexed. This is what I've tried so far: 1.- In the file conf/nutch-site.xml, I changed the value of file.content.ignored to false. 2.- In the file conf/schema.xml I modified the line: <field name="content" type="text" stored="false" indexed="true"/> to <field name="content" type="text" stored="true" indexed="true"/> 3.- In the source file src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java, I changed line 116 to: LuceneWriter.addFieldOptions("content", LuceneWriter.STORE.YES, LuceneWriter.INDEX.TOKENIZED, conf) I tried running the command bin/nutch crawl urls -dir crawl -depth 10 -topN 5000 after the first two steps, but the crawl didn't store the contents. I then tried the third step, recompiled nutch, and ran the crawl command again to no avail. What am I missing? Any hints, please? TIA, Pedro.
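If the index does end up in Solr, the highlighting itself is done on the Solr side once the content field is stored; a minimal sketch of such a request (core URL and field name are only examples):

  http://localhost:8983/solr/select?q=nutch&hl=true&hl.fl=content&hl.snippets=2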
Re: Nutch near future - strategic directions
Andrzej Bialecki wrote: Sami Siren wrote: Lots of good thoughts and ideas, easy to agree with. Something for the ease of use category: -allow running on top of plain vanilla hadoop What does plain vanilla mean here? Do you mean the current DB implementation? That's the idea, we should aim for an abstract layer that can accommodate both HBase and plain MapFile-s. I was simply trying to say that we should not bundle Hadoop with Nutch anymore and instead just mention the specific version it should run on top of as a requirement. I am not totally sure anymore if this is a good idea... I do not know the details of the HBase branch. Would using HBase allow us easy migration from one data model to another (without the complex code we now have in our datums)? How easy is HBase to manage/setup/configure? I think Avro looks promising as a data storage technology: it has some support for data model evolution, can be accessed natively from many programming languages, and is relatively well performing... The downside at the moment is that it is not yet fully supported by hadoop mapred (I think). -split into reusable components with nice and clean public api -publish mvn artifacts so developers can directly use mvn, ivy etc to pull required dependencies for their specific crawler +1, with a slight preference towards ivy. I was not clear here, I think I was referring to users of Nutch rather than developers. And in that case the choice of a tool would be up to the user after the artifacts are in the repo. Also, I think what I wanted to say is more about the model of how people who want to do some customization would operate, rather than a technology choice. Creating a new plugin: -create your own build configuration (or use a template we provide) -implement plugin code -publish to an m2 repository Creating your custom crawler: -create your own build configuration (or use a template we might provide), specify the dependencies you need (plugins basically, from apache or from anybody else as long as they are available through some repository) -potentially write some custom code We could still also provide a default Nutch crawler, as a build configuration (basically just an xml file + some config), if we wanted. The new Hadoop maven artifacts also help with this vision since we could access hadoop apis (and dependencies) through a similar mechanism. My biggest concern is in the execution of this (or any other) plan. Some of the changes or improvements that have been proposed are quite heavy in nature and would require large changes. I am just wondering whether it would be better to take a fresh start instead of trying to do this incrementally on top of the existing code base. Well ... that's (almost) what Doğacan did with the HBase port. I agree that we should not feel too constrained by the existing code base, but it would be silly to throw everything away and start from scratch - we need to find a middle ground. The crawler-commons and Tika projects should help us to get rid of the ballast and significantly reduce the size of our code. I am not aiming to throw everything away, just trying to relax the back-compatibility burden and give innovation a chance. In the history of Nutch this approach is not something new (remember map reduce?) and in my opinion it worked nicely then. Perhaps it is different this time since the changes we are discussing now have many abstract things hanging in the air, even fundamental ones. Nutch 0.7 to 0.8 reused a lot of the existing code. I am hoping that this time it will not be different.
Of course the rewrite approach means that it will take some time before we actually get to the point where we can start adding real substance (meaning new features etc). So to summarize, I would go ahead and put together a branch nutch N.0 that would consist of (a.k.a. my wish list, hope I am not being too aggressive here): -runs on top of plain hadoop See above - what do you mean by that? -use osgi (or some other more optimal extension mechanism that fits and is easy to use) -basic http/https crawling functionality (with db abstraction or hbase directly and smart data structures that allow flexible and efficient usage of the data) -basic solr integration for indexing/search -basic parsing with tika After the basics are ok we would start adding and promoting any of the hidden gems we might have, or some solutions for the interesting challenges. I believe that's more or less where Doğacan's port is right now, except it's not merged with the OSGI port. Are you sure OSGI is the way to go? I know it has all these nice features and all, but for some reason I feel that we could live with something simpler. From a functional point of view: just drop your jars into the classpath and you're all set. So 2 changes here: 1. plugins are jars 2. no individual classloaders for plugins. -- Sami Siren
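To make the customization model sketched above concrete, a user's own crawler build would simply declare the pieces it needs and let the build tool pull them from a repository; the coordinates and version below are purely hypothetical, since no such artifacts were published at the time:

  <!-- hypothetical pom.xml fragment for a custom crawler build -->
  <dependencies>
    <dependency>
      <groupId>org.apache.nutch</groupId>
      <artifactId>nutch-core</artifactId>
      <version>N.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.nutch</groupId>
      <artifactId>nutch-protocol-httpclient</artifactId>
      <version>N.0</version>
    </dependency>
  </dependencies>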
Re: Nutch near future - strategic directions
Lots of good thoughts and ideas, easy to agree with. Something for the ease of use category: -allow running on top of plain vanilla hadoop -split into reusable components with nice and clean public api -publish mvn artifacts so developers can directly use mvn, ivy etc to pull required dependencies for their specific crawler My biggest concern is in execution of this (or any other) plan. Some of the changes or improvements that have been proposed are quite heavy in nature and would require large changes. I am just thinking that would it still be better to take a fresh start instead of trying to do this incrementally on top of existing code base. In the history of Nutch this approach is not something new (remember map reduce?) and in my opinion it worked nicely then. Perhaps it is different this time since the changes we are discussing now have many abstract things hanging in the air, even fundamental ones. Of course the rewrite approach means that it will take some time before we actually get into the point where we can start adding real substance (meaning new features etc). So to summarize, I would go ahead and put together a branch nutch N.0 that would consist of (a.k.a my wish list, hope I am not being too aggressive here): -runs on top of plain hadoop -use osgi (or some other more optimal extension mechanism that fits and is easy to use) -basic http/https crawling functionality (with db abstraction or hbase directly and smart data structures that allow flexible and efficient usage of the data) -basic solr integration for indexing/search -basic parsing with tika After the basics are ok we would start adding and promoting any of the hidden gems we might have, or some solutions for the interesting challenges. ps. many of the interesting challenges in your proposal seem to fall in the category of data analysis and manipulation that are mostly, used after the data has been crawled or between the fetch cycles so many of those could be implemented into current code base also, somehow I just feel that things could be made more efficient and understandable if the foundation (eg. data structures, extendability for example) was in better shape. Also if written nicely other projects could use them too! -- Sami Siren Andrzej Bialecki wrote: Hi all, The ApacheCon is over, our release 1.0 has been out already for some time, so I think it's a good moment to discuss what are the next steps in Nutch development. Let me share with you the topics I identified and presented in the ApacheCon slides, and some topics that are worth discussing based on various conversations I had there, and the discussions we had on our mailing list: 1. Avoid duplication of effort -- Currently we spend significant effort on implementing functionality that other projects are dedicated to. Instead of doing the same work, and sometimes poorly, we should concentrate on delegating and reusing: * Use Tika for content parsing: this will require some effort and collaboration with the Tika project, to improve Tika's ability to handle more complex formats well (e.g. hierarchical compound documents such as archives, mailboxes, RSS), and to contribute any missing parsers (e.g. parse-swf). * Use Solr for indexing search: it is hard to justify the effort of developing and maintaining our own search server - Solr offers much more functionality, configurability, performance and ease of integration than our relatively primitive search server. Our integration with Solr needs to be improved so that it's easier to setup and operate. 
* Use database-like storage abstraction: this may seem like a serious departure from the current architecture, but I don't mean that we should switch to an SQL DB ... what this means is that we should provide an option to use HBase, as well as the current plain MapFile-s (and perhaps other types of DBs, such as Berkeley DB or SQL, if it makes sense) as our storage. There is a very promising initial port of Nutch to HBase, which is currently closely integrated with HBase API (which is both good and bad) - it provides several improvements over our current storage, so I think it's worth using as the new default, but let's see if we can make it more abstract. * Plugins: the initial OSGI port looks good, but I'm not sure yet at this moment if the benefits of OSGI outweigh the cost of this change ... * Shard management: this is currently an Achilles' heel of Nutch, where users are left on their own ... If we switch to using HBase then at least on the crawling side the shard management will become much easier. This still leaves the problem of deploying new content to search server(s). The candidate framework for this side of the shard management is Katta + patches provided by Ted Dunning (see ???). If we switch to using Solr we would have to also use the Katta / Solr integration, and perhaps Solr/Hadoop integration as well
Re: Fetcher2 Slow
Roger Dunk wrote: Andrzej stated in NUTCH-669 that some people reported performance issues with Fetcher2, i.e. that it doesn't use the available bandwidth. These reports are unconfirmed, and they may have been caused by suboptimal URL / host distribution in a fetchlist - but it would be good to review the synchronization and threading aspects of Fetcher2. To address this, I've tried just now generating a fetchlist using generate.max.per.host = 1 (which gave me 35,000 unique hosts) to guarantee unique hosts, but the problem still remains. Therefore, I believe it's clearly not an issue of suboptimal URL / host distribution. If you require any further information to confirm my report, you need only ask! I have so far seen two sources of slowness; I don't know if they are related to your case: 1. You are using nutch from behind a NAT box. I experienced this problem when I did some test crawling from a machine sitting behind an ADSL router that did NAT. Soon after starting a crawl the maximum number of NAT connections was reached in the router and further connections could only be made after old ones timed out of the NAT table. These connections were mostly DNS connections. 2. Your machine has IPv6 enabled. This I noticed more recently when I was wondering about the relatively slow fetching speed on a box. After disabling IPv6 entirely I was able to fetch 2-4 times faster without any other config changes. -- Sami Siren
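If you suspect the IPv6 case, one low-risk way to test it is to ask the JVM to prefer the IPv4 stack before touching the OS configuration; where exactly you put the flag depends on how you launch Nutch (for a Hadoop-based launch, conf/hadoop-env.sh is one place):

  # e.g. in conf/hadoop-env.sh, or in the environment used by bin/nutch
  export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"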
[ANNOUNCE] Apache Nutch 1.0
I am pleased to announce the availability of Apache Nutch 1.0. Apache Nutch, a subproject of Apache Lucene, is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats. Apache Nutch 1.0 contains a number of bug fixes and improvements such as Solr Integration, new indexing framework and new scoring framework just to mention a few. Details can be found in the changes file: http://svn.apache.org/repos/asf/lucene/nutch/tags/release-1.0/CHANGES.txt Apache Nutch is available for download from the following download page: http://www.apache.org/dyn/closer.cgi/lucene/nutch/nutch-1.0.tar.gz When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site: http://www.apache.org/dist/lucene/nutch/KEYS For more information on Apache Nutch, visit the project home page: http://lucene.apache.org/nutch -- Sami Siren (on behalf of the Apache Nutch community)
Re: Fwd: fetch but not index
?? wrote: Hi all, in the crawl log I can see 'fetching http://www.na.gov.la/docs/eng/currentnews/Vietnamese%20Ambassador.html', but at the end of the indexing I cannot find it in the index. Why? Please help me. Hi, that url seems to be blocked by the robots.txt of that site. That is why it does not end up in the index. -- Sami Siren
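You can verify this yourself by fetching the site's robots.txt and checking whether the path is disallowed for your agent; the rule shown in the comment is only an illustration of what such an entry might look like, not the actual contents of that file:

  wget -q -O - http://www.na.gov.la/robots.txt
  # a blocking entry would look roughly like:
  #   User-agent: *
  #   Disallow: /docs/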
Re: Running multiple processes on a single machine
dayz...@gmail.com wrote: Hi, If I want to run several parsers on a single quad-core machine simultaneously, would I still need to have Hadoop set up as a single-node cluster? I think that the fetcher is currently the only component that can take advantage of multiple cores when running in local mode. We should perhaps address that at some point, since it is not that hard to parallelize at least some of the processing inside the individual tools, so that single-machine users could benefit from multiple cores. I am not sure, but I think that the only way to do it properly is to run a jobtracker and a tasktracker on that machine and configure appropriate block sizes and numbers of map and reduce tasks. Can several updatedbs be run simultaneously? I believe not, since the db seems to be locked when it's being updated. Locking prevents multiple applications from accessing the crawl db simultaneously (the same applies to the linkdb). -- Sami Siren
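A rough sketch of that single-machine setup, assuming a Hadoop 0.19-style conf/hadoop-site.xml (the values are examples to be tuned to your hardware):

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>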
Re: Working with Solr. Doubts
Javier Puerto wrote: Hi to all, We are working with Nutch 0.8 to crawl about 18 web sites in an intranet; each site has an average of 40.000~50.000 documents. Currently we have the contents split into four parts, and run a DistributedSearchServer for each, with the client configured for the 4 servers. Now we also have Apache Droids crawling the filesystem to transform and index a lot of documents into Solr. We need to unify the front-end client to be able to search both Solr and Nutch. I thought of upgrading the release of Nutch for the Solr support but I have some doubts: Does Nutch have a front-end for Solr, or do I have to develop it all? Can I search on multiple Solr servers in the same way that Nutch does with DistributedSearchServer? Can I search on Nutch and Solr simultaneously and merge the results? If anyone has had a similar problem or any suggestions to clarify my doubts, thank you very much! I think you could simplify your setup by using Solr; scale should not be a problem - last week I tested the Solr/Nutch integration with a collection size of over 6 M docs on an old PC (growing roughly 1 M per day). Response times were still pretty good for simple queries even when I let Solr create snippets. -- Sami Siren
Re: Exception when crawling
dealmaker wrote: I have similar problem with nightly build #741 (Mar 3, 2009 4:01:53 AM). What's wrong? There was a change in hadoop that caused this problem to appear. It has now been fixed on build #743 -- Sami Siren
Re: How do you setup your svn for your nutch code?
Just an FYI, there are also (unofficial) git repos for many Apache projects - including Nutch - here: http://jukka.zitting.name/git/ -- Sami Siren Dingding Ye wrote: Similar. 1. git-svn clone nutch-trunk Then create a git project which is my working project. After that, add the nutch-git repo as a remote repo of this git project: 2. git remote add Now when you want to update nutch, update nutch-git first. Then update the branch of your working repo. Finally merge into your working branch. On Mon, Mar 2, 2009 at 12:10 PM, dealmaker vin...@gmail.com wrote: Need more detail. Do you clone the main trunk to your local main trunk, and then create a local branch for the personal project, then merge periodically from the local main trunk which you cloned? Dingding Ye wrote: Just personal choice, and I think the branch/merge feature of git is more powerful than svn's. It helps with smooth merges. What I did before was to clone the main trunk. It should work for 0.9 also. However, if you make rapid changes to the sources, I think none of this helps and you have to solve the conflicts yourself. On Mon, Mar 2, 2009 at 11:55 AM, dealmaker vin...@gmail.com wrote: And also, do you clone the main trunk or just, for example, 0.9? Dingding Ye wrote: I have used git-svn to clone the nutch project. And then I use a git repo to manage my personal version and do periodical merges with the git version of nutch. On Mon, Mar 2, 2009 at 9:27 AM, dealmaker vin...@gmail.com wrote: No, it's not the official 1.0. Even so, there may be a 1.1 in the future. I just want to know how to set up svn for future versions with minimum maintenance. Thanks. Tony Wang-3 wrote: From my understanding, Nutch 1.0 is already in the latest nightly build. On Sun, Mar 1, 2009 at 5:22 PM, dealmaker vin...@gmail.com wrote: Hi, I am modifying Nutch 0.9 code for my project. Currently, I put all my 0.9 code in my local main trunk. But I know that 1.0 will be out soon, and I want to use the 1.0 code instead in the near future. What is the best way to set up svn to do that? Should I just sync the main trunk from the apache server to my local trunk and set up a branch for 1.0 locally? Thanks.
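For reference, a minimal sketch of the git-svn workflow described in this thread (the trunk URL follows the repository layout in use at the time; adjust it if the code has moved):

  # one-time clone of the Nutch trunk into a local git repository
  git svn clone https://svn.apache.org/repos/asf/lucene/nutch/trunk nutch-trunk
  cd nutch-trunk
  git checkout -b myproject        # local branch for your own changes
  # later: pull new upstream revisions onto the tracking branch, then merge
  git checkout master && git svn rebase
  git checkout myproject && git merge master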
Re: Problem with crawling using the latest 1.0 trunk
Hi, and thanks for being persistent. Can you specify which version of nutch you are running - is it a nightly build (if yes, which one?) or did you check out the svn trunk? And just to be sure: you are running with the default configuration? -- Sami Siren ahammad wrote: I checked hadoop.log and this is what it has: java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored at org.apache.lucene.document.Field.<init>(Field.java:279) at org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:133) at org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:239) at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50) at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:40) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410) at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158) at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) I don't understand what that refers to specifically. I'm running it at its default configuration, without any of the advanced indexing that I have in my 0.9 install. Cheers. Andrzej Bialecki wrote: ahammad wrote: I am aware that this is still a development version, but I need to test a few things with Nutch/Solr so I installed the latest dev version of Nutch 1.0. I tried running a crawl like I did with the working 0.9 version. From the log, it seems to fetch all the pages properly, but it fails at the indexing: CrawlDb update: starting CrawlDb update: db: kb/crawldb CrawlDb update: segments: [kb/segments/20090302135858] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: true CrawlDb update: URL filtering: true CrawlDb update: Merging segment data into db. CrawlDb update: done LinkDb: starting LinkDb: linkdb: kb/linkdb LinkDb: URL normalize: true LinkDb: URL filter: true LinkDb: adding segment: file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135757 LinkDb: adding segment: file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135807 LinkDb: adding segment: file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135858 LinkDb: done Indexer: starting Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) at org.apache.nutch.indexer.Indexer.index(Indexer.java:72) at org.apache.nutch.crawl.Crawl.main(Crawl.java:146) I took a look at all the configuration and as far as I can tell, I did the same thing with my 0.9 install. Could it be that I didn't install it properly? I unzipped it and ran ant and ant war in the root directory. Please check the logs in the logs/ directory - the above message is not informative, the real reason for the failure can be found in the logs. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
Re: Problem with crawling using the latest 1.0 trunk
I can see this error also. not sure yet what's going wrong... -- Sami Siren Justin Yao wrote: log4j configure: log4j.logger.org.apache.nutch.indexer.Indexer=TRACE,cmdstdout log4j.logger.org.apache.nutch=TRACE log4j.logger.org.apache.hadoop=TRACE Output: 2009-03-02 17:53:21,987 DEBUG indexer.Indexer - IFD [Thread-11]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@118d189 2009-03-02 17:53:21,988 DEBUG indexer.Indexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-justin/mapred/local/index/_1068960877 autoCommit=true mergepolicy=org.apache.lucene.index.logbytesizemergepol...@648016 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@1551b0 ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=1 index= 2009-03-02 17:53:21,993 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter 2009-03-02 17:53:21,994 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2009-03-02 17:53:22,009 WARN mapred.LocalJobRunner - job_local_0001 java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored at org.apache.lucene.document.Field.init(Field.java:279) at org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:133) at org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:239) at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50) at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:40) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410) at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158) at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) 2009-03-02 17:53:22,567 FATAL indexer.Indexer - Indexer: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) at org.apache.nutch.indexer.Indexer.index(Indexer.java:72) at org.apache.nutch.indexer.Indexer.run(Indexer.java:92) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.indexer.Indexer.main(Indexer.java:101) Andrzej Bialecki wrote: Justin Yao wrote: Same problem here if using build #740 (Mar 2, 2009 4:01:53 AM) I switched to build #736 (Feb 26, 2009 4:01:15 AM) and it worked then. Could you please send the error message from the logs/, which you got with build #740? Thanks!
Re: Problem with crawling using the latest 1.0 trunk
Sami Siren wrote: I can see this error also. not sure yet what's going wrong... it's NUTCH-703 (hadoop upgrade) that broke the indexing. any ideas what changed in hadoop that might have caused this? -- Sami Siren -- Sami Siren Justin Yao wrote: log4j configure: log4j.logger.org.apache.nutch.indexer.Indexer=TRACE,cmdstdout log4j.logger.org.apache.nutch=TRACE log4j.logger.org.apache.hadoop=TRACE Output: 2009-03-02 17:53:21,987 DEBUG indexer.Indexer - IFD [Thread-11]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@118d189 2009-03-02 17:53:21,988 DEBUG indexer.Indexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-justin/mapred/local/index/_1068960877 autoCommit=true mergepolicy=org.apache.lucene.index.logbytesizemergepol...@648016 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@1551b0 ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=1 index= 2009-03-02 17:53:21,993 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter 2009-03-02 17:53:21,994 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2009-03-02 17:53:22,009 WARN mapred.LocalJobRunner - job_local_0001 java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored at org.apache.lucene.document.Field.init(Field.java:279) at org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:133) at org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:239) at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50) at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:40) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410) at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158) at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) 2009-03-02 17:53:22,567 FATAL indexer.Indexer - Indexer: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) at org.apache.nutch.indexer.Indexer.index(Indexer.java:72) at org.apache.nutch.indexer.Indexer.run(Indexer.java:92) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.indexer.Indexer.main(Indexer.java:101) Andrzej Bialecki wrote: Justin Yao wrote: Same problem here if using build #740 (Mar 2, 2009 4:01:53 AM) I switched to build #736 (Feb 26, 2009 4:01:15 AM) and it worked then. Could you please send the error message from the logs/, which you got with build #740? Thanks!
Re: log org.apache.solr.common.SolrException: Bad Request when indexing feeds with solrindexer.
Felix Zimmermann wrote: Hi, I get this log error when indexing feeds with solrindexer: 2009-02-23 23:04:11,438 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter 2009-02-23 23:04:11,439 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2009-02-23 23:04:11,441 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.feed.FeedIndexingFilter 2009-02-23 23:04:11,584 WARN mapred.LocalJobRunner - job_local_0001 org.apache.solr.common.SolrException: Bad Request Bad Request request: http://127.0.0.1:8080/solr3/update?wt=javabin&version=2.2 at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:343) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183) at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217) at ... Hi, I would check the Solr log to see why it is failing; probably Nutch is providing content for a field not present in the solr schema. -- Sami Siren
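Typically the Solr log names the offending field (e.g. an unknown-field error); adding a matching definition to Solr's schema.xml and reloading Solr usually resolves it. The field name below is only a placeholder for whatever the log reports:

  <field name="some_missing_field" type="text" stored="true" indexed="true"/>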
Re: Nutch 1.0 - Setting up and running Nutch for crawling and Solr for indexing and querying.
Tony Wang wrote: I don't see that Nutch 1.0 has been released. Where did you download it? Nutch 1.0 has not been released yet, the community is working to get it out as we speak. There are still some issues that need to be fixed before the release can take place. Everybody's involvement in testing the current nightly builds and providing documentation patches or wiki updates is appreciated. -- Sami Siren nightly build? thanks On Fri, Feb 20, 2009 at 6:31 PM, Kham Vo k...@mac.com wrote: Hello Nutch 1.0 designers, I successfully installed and set up Nutch 1.0 (build # 722). Ran bin/nutch crawl urls -dir crawl -depth 3 -topN 50 and it seemed to work, fetching data from the specified sites. No errors. My question is: do I need to do anything special in order to get Nutch to post the data to another instance of apache-solr running at http://localhost:8983 for indexing? I googled for any documentation on how to correctly set up Nutch 1.0 such that nutch is for crawling and solr is for indexing and display. Nothing so far. Your help is greatly appreciated. Kham
Re: HTTP Status 500 - No Context configured to process this request
samuel.gre...@mesaaz.gov wrote: I have tried tomcat 6.0 and after escaping some quotes in a string in search.jsp, it works without error. However, it returns no results. I suspect it is not finding the correct crawl files. That is the common case, the other being that there is no data available. I have started tomcat in the nutch directory. I have also added a preference to nutch: <property> <name>searcher.dir</name> <value>crawl</value> <description>Path to root of crawl. This directory is searched (in order) for either the file search-servers.txt, containing a list of distributed search servers, or the directory index containing merged indexes, or the directory segments containing segment indexes.</description> </property> Any other steps to take? No, that should do it. A couple of things you can try: - double check that your configuration is indeed in use; the file to check is ${webapps}/ROOT/WEB-INF/classes/nutch-site.xml - use an absolute directory in searcher.dir; that way it does not matter where or how you start tomcat. You can also check that you can actually get results back from the nutch command line: - double check $nutch-home/conf/nutch-site.xml (searcher.dir) - execute (from the command line) bin/nutch org.apache.nutch.searcher.NutchBean query -- Sami Siren Thanks Sam Hi, I just dropped the Nutch web app into tomcat version 6.0.18 and it worked fine, perhaps you should upgrade your Tomcat? -- Sami Siren samuel.gre...@mesaaz.gov wrote: Hi, I am following the tutorial here: http://nutch.sourceforge.net/docs/en/tutorial.html Crawling works fine, as does the test search from the command line. When I try to fire up tomcat after moving ROOT.war into place, I get some errors in the tomcat logs and a page with HTTP Status 500 - No Context configured to process this request 2009-02-19 15:55:46 WebappLoader[]: Deploy JAR /WEB-INF/lib/xerces-2_6_2.jar to C:\Program Files\Apache Software Foundation\Tomcat 4.1\webapps\ROOT\WEB-INF\lib\xerces-2_6_2.jar 2009-02-19 15:55:47 ContextConfig[] Parse error in default web.xml org.apache.commons.logging.LogConfigurationException: User-specified log class 'org.apache.commons.logging.impl.Log4JLogger' cannot be found or is not useable. at org.apache.commons.digester.Digester.createSAXException(Digester.java:3181) at org.apache.commons.digester.Digester.createSAXException(Digester.java:3207) at org.apache.commons.digester.Digester.endElement(Digester.java:1225) etc. So it looks like the root of the error is the default web.xml, not the Log4JLogger - although I know very little about Java. I haven't played with it for a few years. Anyone know what is going on here? versions/info: nutch 0.9 Tomcat 4.1 jre1.5.0_08 jdk1.6.0_12 NUTCH_JAVA_HOME=C:\Program Files\Java\jdk1.6.0_12 JAVA_HOME=C:\Program Files\Java\jdk1.6.0_12 Thanks! Sam
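For example, a nutch-site.xml entry using an absolute path as suggested above (the path itself is only an illustration; use wherever your crawl actually lives):

  <property>
    <name>searcher.dir</name>
    <value>/home/sam/nutch/crawl</value>
  </property>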
Re: Feed indexing with solrindex not working.
Felix Zimmermann wrote: Hi, Hi, indexing RSS-Feeds with solrindex does not work. I expect missing special field-definitions in schema.xml of solr. Could somebody tell me the correct field-definitions please? In future, it would be the best to put a default schema.xml into the conf-dir(?) There is an open issue for this https://issues.apache.org/jira/browse/NUTCH-699. Please contribute your findings there. -- Sami Siren
Re: HTTP Status 500 - No Context configured to process this request
Hi, I just dropped Nutch web app into tomcat version 6.0.18 and it worked fine, perhaps you should upgrade your Tomcat? -- Sami Siren samuel.gre...@mesaaz.gov wrote: Hi, I am following the tutorial here: http://nutch.sourceforge.net/docs/en/tutorial.html Crawling works fine, as does the test search from the command line. When I try to fire up tomcat after moving ROOT.war into place, I get some errors in the tomcat logs and a page with HTTP Status 500 - No Context configured to process this request 2009-02-19 15:55:46 WebappLoader[]: Deploy JAR /WEB-INF/lib/xerces-2_6_2.jar to C:\Program Files\Apache Software Foundation\Tomcat 4.1\webapps\ROOT\WEB-INF\lib\xerces-2_6_2.jar 2009-02-19 15:55:47 ContextConfig[] Parse error in default web.xml org.apache.commons.logging.LogConfigurationException: User-specified log class 'org.apache.commons.logging.impl.Log4JLogger' cannot be found or is not useable. at org.apache.commons.digester.Digester.createSAXException(Digester.java:3181) at org.apache.commons.digester.Digester.createSAXException(Digester.java:3207) at org.apache.commons.digester.Digester.endElement(Digester.java:1225) etc. So it looks like the root of the error is default web.xml, not in the Log4JLogger - although I know very little about Java. I haven't played with it for a few years. Anyone know what is going on here? versions/info: nutch 0.9 Tomcat 4.1 jre1.5.0_08 jdk1.6.0_12 NUTCH_JAVA_HOME=C:\Program Files\Java\jdk1.6.0_12 JAVA_HOME=C:\Program Files\Java\jdk1.6.0_12 Thanks! Sam
Re: Distributed Search Server fails with Trunk
Höchstötter Nadine wrote: Hi, I run Nutch on a single server, I have two crawl directories, that's why I use Nutch in distributed search server mode as described in the hadoop manual. But since I have a new Trunk Version (04.02.2009) it fails. Local search on one index works fine. But distributed search throws following exception: ... We do not run Nutch in PseudoDistributedMode. We only use the distributed search mode. With Nutch-0.9 this was working properly. Did anyone have the same problem? Yes, I just verified that this is happening, can you please file a Jira issue, fix for version = 1.0, priority = blocker. thanks. -- Sami Siren
Re: nutch restart after recrawl
Alexander Aristov wrote: Hi People, Is there a way to tell nutch to re-initialize the index after a re-crawl without an application restart? Not really. I added a jira issue, NUTCH-376, to track this enhancement, but no work has been done on that front. One potential solution to this problem is to use solr as the indexing back end; the integration is in the nightly version of nutch. I am not sure if the procedure is documented anywhere. -- Sami Siren All scripts suggest restarting nutch, but this means that searching is unavailable for a few minutes. May I call an API or something?
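As a rough sketch of the Solr route (assuming a Solr instance at the usual example address; argument order can vary between nightly builds), indexing becomes a separate step run after the crawl, and newly committed documents become searchable without restarting the search front end:

  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*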
Re: Fetcher2 doesn't print status information on console
Koch Martina wrote: Hi, I'm testing Fetcher2 from the current trunk and wondered why Fetcher2 doesn't report any status on the console. Other tools like Injector or Fetcher report not only to hadoop.log, but also to STDOUT to some extent, e.g. Generator: starting, Fetcher: done and so on. Did I configure something wrong or is this intended behaviour in Fetcher2? I can't see any difference in the logging logic of Fetcher2. The logging configuration is in the file conf/log4j.properties; there you have an entry for Fetcher: log4j.logger.org.apache.nutch.fetcher.Fetcher=INFO,cmdstdout but not for Fetcher2. If you add such a line for Fetcher2 it should start outputting log messages to stdout. -- Sami Siren Thanks in advance. Kind regards, Martina
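In other words, adding a line like the following to conf/log4j.properties should make Fetcher2 print its status to the console like the other tools do (this assumes the class lives at org.apache.nutch.fetcher.Fetcher2, as in the trunk at the time):

  log4j.logger.org.apache.nutch.fetcher.Fetcher2=INFO,cmdstdout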
Re: Fetcher2 crashes with current trunk
Dog(acan Güney wrote: I think I have found the bug here, but I am in a hurry now, I will create a JIRA issue and post (what is hopefully) the fix later today. Great! thanks. -- Sami Siren On Tue, Feb 17, 2009 at 21:39, Dog(acan Güney doga...@gmail.com wrote: 2009/2/17 Sami Siren ssi...@gmail.com: Do we have a Jira issue for this, seems like a blocker for 1.0 to me if it is reproducible. No we don't. But you are right that we should. I am very busy and I forgot about it. I will examine this problem in more detail tomorrow and will open an issue if I can reproduce the bug. -- Sami Siren Dog(acan Güney wrote: Thanks for detailed analysis. I will take a look and get back to you. On Mon, Feb 16, 2009 at 13:41, Koch Martina k...@huberverlag.de wrote: Hi, sorry for the late reply. We did some further digging and found that the error has nothing to do with Fetcher or Fetcher2. When using Fetcher, the error just happens much later (after about 20 fetch cycles). We did many test runs, eliminated as much plugins as possible and identified URLs which are most likely to fail. With the following configuration we get a corrupt crawldb after two fetch2 cycles: - activated plugins: protocol-http, parse-html, feed - generate.max.per.host - 100 - URLs to fetch: http://www.prosieben.de/service/newsflash/ http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249 http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239 http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238 http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241 http://www.prosieben.de/kino_dvd/news/60897/ http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278 http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268 http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279 http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267 http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259 http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/ http://www.prosieben.de/spielfilm_serie/topstories/61051/ http://www.prosieben.de/kino_dvd/news/60897/ When starting from an higher URL like http://www.prosieben.de these URLs get the following warn message after some fetch cycles: WARN parse.ParseOutputFormat - Can't read fetch time for: http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/ But the crawldb does not get corrupt immediately after the first occurence of such messages, it gets corrupted some cyles later. Any suggestions are highly appreciated. 
Something seems to go wrong with the feed plugin, but I can't diagnose exactly when and why... Thanks in advance. Kind regards, Martina -Ursprüngliche Nachricht- Von: Dog(acan Güney [mailto:doga...@gmail.com] Gesendet: Freitag, 13. Februar 2009 09:37 An: nutch-user@lucene.apache.org Betreff: Re: Fetcher2 crashes with current trunk On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina k...@huberverlag.de wrote: Hi all, we use the current trunk of 04.02.09 with the patch for CrawlDbMerger (Nutch-683) manually applied. We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle at depth 1. When we use Fetcher2, we can do this cycle four times in a row without any problems. If we start the fifth cycle the Injector crashes with the following error log: 2009-02-12 00:00:05,015 INFO crawl.Injector - Injector: Merging injected urls into crawl db. 2009-02-12 00:00:05,023 INFO jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized 2009-02-12 00:00:05,358 INFO mapred.FileInputFormat - Total input paths to process : 2 2009-02-12 00:00:05,524 INFO mapred.JobClient - Running job: job_local_0002 2009-02-12 00:00:05,528 INFO mapred.FileInputFormat - Total input paths to process : 2 2009-02-12 00:00:05,553 INFO mapred.MapTask - numReduceTasks: 1 2009-02-12 00:00:05,554
Re: Restarting Nutch
[moving this to nutch-user] Hrishikesh Agashe wrote: Hi, I am planning to do a huge crawl using Nutch (billions of URLs) and so need to understand whether Nutch can handle restarts after a crash. For a single system, if I do Ctrl+C while Nutch is running and then restart it, will it be possible for Nutch to detect where it got to in the last run and start from that point onwards? Or will it be considered a fresh new crawl? Nutch does not try to resume the action that was interrupted. Also, if I have 5 nodes running Nutch and doing the crawling, and one of the nodes fails, should it be considered a total failure of Nutch itself? Or should I allow the other nodes to proceed further? Will I lose the data gathered by the failed node? Hadoop will execute the remaining tasks on the nodes that are available. Usually data will be stored on a shared/distributed filesystem (like HDFS). If your setup is similar and you ensure that the filesystem can survive single node failures, your data should be safe. -- Sami Siren
Re: How many kb is a page's index?
buddha1021 wrote: Hi: How many kB is a page's index, on average? Hi, There's a quite recent estimate at http://www.lucidimagination.com/search/document/c6c099bf31b0de55/index_ratio#de145fe338543d5b And when building distributed search clusters, should the nodes be 1U servers, or the common PCs that people use daily with Windows? Which maximizes performance? Well, it can be anything; the important thing is to set up a small system with similar hardware and see how it performs. That way you can get quite accurate estimates for larger scale systems running on similar hardware. -- Sami Siren
Re: Fetcher2 crashes with current trunk
Do we have a Jira issue for this, seems like a blocker for 1.0 to me if it is reproducible. -- Sami Siren Dog(acan Güney wrote: Thanks for detailed analysis. I will take a look and get back to you. On Mon, Feb 16, 2009 at 13:41, Koch Martina k...@huberverlag.de wrote: Hi, sorry for the late reply. We did some further digging and found that the error has nothing to do with Fetcher or Fetcher2. When using Fetcher, the error just happens much later (after about 20 fetch cycles). We did many test runs, eliminated as much plugins as possible and identified URLs which are most likely to fail. With the following configuration we get a corrupt crawldb after two fetch2 cycles: - activated plugins: protocol-http, parse-html, feed - generate.max.per.host - 100 - URLs to fetch: http://www.prosieben.de/service/newsflash/ http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249 http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239 http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238 http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241 http://www.prosieben.de/kino_dvd/news/60897/ http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278 http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268 http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279 http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267 http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259 http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/ http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/ http://www.prosieben.de/spielfilm_serie/topstories/61051/ http://www.prosieben.de/kino_dvd/news/60897/ When starting from an higher URL like http://www.prosieben.de these URLs get the following warn message after some fetch cycles: WARN parse.ParseOutputFormat - Can't read fetch time for: http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/ But the crawldb does not get corrupt immediately after the first occurence of such messages, it gets corrupted some cyles later. Any suggestions are highly appreciated. Something seems to go wrong with the feed plugin, but I can't diagnose exactly when and why... Thanks in advance. Kind regards, Martina -Ursprüngliche Nachricht- Von: Dog(acan Güney [mailto:doga...@gmail.com] Gesendet: Freitag, 13. 
Februar 2009 09:37 An: nutch-user@lucene.apache.org Betreff: Re: Fetcher2 crashes with current trunk On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina k...@huberverlag.de wrote: Hi all, we use the current trunk of 04.02.09 with the patch for CrawlDbMerger (Nutch-683) manually applied. We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle at depth 1. When we use Fetcher2, we can do this cycle four times in a row without any problems. If we start the fifth cycle the Injector crashes with the following error log: 2009-02-12 00:00:05,015 INFO crawl.Injector - Injector: Merging injected urls into crawl db. 2009-02-12 00:00:05,023 INFO jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized 2009-02-12 00:00:05,358 INFO mapred.FileInputFormat - Total input paths to process : 2 2009-02-12 00:00:05,524 INFO mapred.JobClient - Running job: job_local_0002 2009-02-12 00:00:05,528 INFO mapred.FileInputFormat - Total input paths to process : 2 2009-02-12 00:00:05,553 INFO mapred.MapTask - numReduceTasks: 1 2009-02-12 00:00:05,554 INFO mapred.MapTask - io.sort.mb = 100 2009-02-12 00:00:05,828 INFO mapred.MapTask - data buffer = 79691776/99614720 2009-02-12 00:00:05,828 INFO mapred.MapTask - record buffer = 262144/327680 2009-02-12 00:00:06,538 INFO mapred.JobClient - map 0% reduce 0% 2009-02-12 00:00:07,262 WARN mapred.LocalJobRunner - job_local_0002 java.lang.RuntimeException: java.lang.NullPointerException at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81) at org.apache.hadoop.io.MapWritable.readFields
Re: Trying to understand how webapp works
Bartek wrote: Hello, I am trying to figure out how webapp part is working. I've installed nutch and crawled some site. Then deployed .war file and in file {tomcat.dir}/nutch/WEB-INF/classes/nutch-site.xml I've put correct searcher.dir, in my case /usr/local/nutch/crawls/site1 Everything is working fine but... When I removed whole crawls dir (/usr/local/nutch/crawls) web application is still working fine. Searching is working (but not cache - I assume that it can't find segments) So could someone explain to me why it is still working? You didn't restart tomcat after killing the directory did you? It might be working because the webapp still has references to all files it needs. Restart tomcat and it should work no more. -- Sami Siren
Re: indexing after fetching
Nicolas MARTIN wrote: I need to know whether Nutch necessarily indexes the data that has been fetched when running the bin/nutch crawl command? Hi, the bin/nutch crawl command will index the data at the end of the cycle. If you do not wish to index, just use the individual commands: inject, generate, fetch, updatedb, generate... -- Sami Siren
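A minimal sketch of one such cycle without the indexing step (directory names are just examples and exact arguments can differ slightly between versions):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=crawl/segments/`ls crawl/segments | tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT            # only needed if the fetcher did not parse
  bin/nutch updatedb crawl/crawldb $SEGMENT
  bin/nutch invertlinks crawl/linkdb $SEGMENT
  # then loop back to generate for the next round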
Re: nutch jdk?
Dennis Kubes wrote: jdk1.5 or better, I am currently on jdk1.6 sun. For the webapp we use tomcat but should run on any jsp/servlet container, websphere included. I think you need 1.6 now (for trunk) since we use Hadoop 0.19. -- Sami Siren
Re: nutch jdk?
buddha1021 wrote: Sami Siren-2 wrote: Dennis Kubes wrote: jdk1.5 or better, I am currently on jdk1.6 sun. For the webapp we use tomcat but should run on any jsp/servlet container, websphere included. I think you need 1.6 now (for trunk) since we use Hadoop 0.19. -- Sami Siren Which SDK should be used, the J2SE SDK or the J2EE SDK? You don't need the EE version of Java to run Nutch. If I use Ubuntu as the OS to build a distributed search system containing several nodes, which edition is best for each node, the desktop edition or the server edition? Thank you! I think either of those should be good enough, as would many other Linux distributions. -- Sami Siren
Re: how to create a new ngp file for Telugu in nutch
nalgonda wrote: Hi, how do I create a new ngp file for tamil? I tried using java org.apache.nutch.analysis.lang.NGramProfile -create te sample_te.txt UTF8 but I get the error No java lang class in org/apache/nutch/lang/analysis/Ngramprofile What does that mean and how do I solve it? Hi, I think the easiest way is to enable the language-identifier plugin and execute the class through the plugin command: bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.NGramProfile -create te sample_te.txt utf-8 -- Sami Siren
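If the plugin is not already enabled, it has to be listed in the plugin.includes property of conf/nutch-site.xml; the value below is only a sketch based on the usual defaults - keep whatever plugins your installation already uses and just append language-identifier:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|language-identifier</value>
  </property>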
directions for web ui? [was Re: web2 plugins compilation error]
Hi, The web2 ui was originally an effort to make the web ui more modular and easier to customize. The architecture below the surface is mildly put outdated and it relies on a dirty trick that allows jsp's to be executed from inside .jar files. If I would write it again today I would probably use webmacro or velocity instead to get rid of the hack that breaks on servlet containers with different versions of jsp api. My recommendation is: do not use it ;) It has long been on my todo list to start discussion about the future of a end user interface in Nutch and possible future directions (where did all that time go?). I think that we need a simple to maintain ui that is easy to customize (both of the current ui fail to satisfy those requirements IMO). What kind of thought do others have? -- Sami Siren michos101 wrote: Hi, i am trying to enable the web2 plugins but i am get an issue on when i try to compile the plugins i get the following errors init: compile-plugins: deploy: init: init-plugin: compile: jar: [jar] Warning: skipping jar archive /usr/local/mputa01/build/webui-extensionpoints/webui-extensionpoints.jar because no files were included. deps-test: deploy: init: init-plugin: [echo] Copying UI configuration [echo] Copying UI templates deps-jar: prepare-web: [delete] Deleting directory /usr/local/mputa01/build/web-caching-oscache/tmp/_web [copy] Copying 5 files to /usr/local/mputa01/build/web-caching-oscache/tmp/_web compile-jsp: compile: [echo] Compiling plugin: web-caching-oscache [javac] Compiling 4 source files to /usr/local/mputa01/build/web-caching-oscache/classes [javac] /usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/CacheManager.java:32: package org.apache.nutch.webapp.common does not exist [javac] import org.apache.nutch.webapp.common.Search; [javac] ^ [javac] /usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/CacheManager.java:33: package org.apache.nutch.webapp.common does not exist [javac] import org.apache.nutch.webapp.common.ServiceLocator; [javac] ^ [javac] /usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/CacheManager.java:127: cannot find symbol [javac] symbol : class ServiceLocator [javac] location: class org.apache.nutch.webapp.CacheManager [javac] public Search getSearch(String id, ServiceLocator locator) throws NeedsRefreshException { [javac] ^ [javac] /usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/CacheManager.java:127: cannot find symbol [javac] symbol : class Search [javac] location: class org.apache.nutch.webapp.CacheManager [javac] public Search getSearch(String id, ServiceLocator locator) throws NeedsRefreshException { [javac] ^ [javac] /usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/CacheManager.java:162: cannot find symbol [javac] symbol : class Search [javac] location: class org.apache.nutch.webapp.CacheManager [javac] public void putSearch(String id, Search search){ [javac]^ [javac] /usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/controller/CachingSearchController.java:27: package org.apache.nutch.webapp.common does not exist [javac] import org.apache.nutch.webapp.common.Search; [javac] ^ [javac] /usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/controller/CachingSearchController.java:28: package org.apache.nutch.webapp.common does not exist [javac] 
import org.apache.nutch.webapp.common.ServiceLocator; [javac] ^ [javac] /usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/controller/CachingSearchController.java:29: package org.apache.nutch.webapp.common does not exist [javac] import org.apache.nutch.webapp.common.Startable; [javac] ^ [javac] /usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/controller/CachingSearchController.java:30: cannot find symbol [javac] symbol : class SearchController [javac] location: package org.apache.nutch.webapp.controller [javac] import org.apache.nutch.webapp.controller.SearchController; [javac] ^ [javac] /usr/local/mputa01/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/controller/CachingSearchController.java:39: cannot find symbol [javac] symbol: class SearchController
Re: Next Generation Nutch
of different tools in any of these areas. What this means is the ability to have different components such as web crawlers (as long as the end data structure is the same), for example Fetcher, Fetcher2, Grub, Heritrix, or even specialized crawlers. And different components for different analysis types. I don't see a lot of cross-cutting concerns here. And where there are, url normalization for example, I think they can be handled better through dependency injection. Which brings me to point three. I think it is time to get rid of the plugin framework. +1 I want to keep the functionality of the various plugins, but I think a dependency injection framework, such as spring, creating the components needed for logic inside of tools is a much cleaner way to proceed. This would allow much better unit and mock testing of tool and logic functionality. The lack of junit tests in nutch has been a big burden for it (in general the amount of junit tests seems to correlate somewhat with how easy/hard they are to write :) so if we architect the system to be easily testable (small isolated units) we could simultaneously raise the bar for junit testing it and also make it easier to refactor later. It would allow Nutch to run on a non-nutchified Hadoop cluster, meaning just a plain old hadoop cluster. We could have core jars and contrib jars and a contrib directory which is pulled from by shell scripts when submitting jobs to Hadoop. With the multiple-resources functionality in Hadoop it would be a simple matter of creating the correct command lines for the job to run. And that brings me to separation of data and presentation. Currently the Nutch website is one monolithic jsp application with plugins. I think the next generation should segment that out into xml / json feeds and a separate front end that uses those feeds. Again this would make it much easier to create web applications using nutch. And of course I think that shard management, a la Hadoop master and slave style, is a big requirement as well. I also think a full test suite with mock objects and local and MiniMR and MiniDFS cluster testing is important, as is better documentation and tutorials (maybe even a book :)). So up to this point I have created MapReduce jobs that use spring for dependency injection and it is simple and works well. The above is the direction I would like to head down, but I would also like to see what everyone else is thinking. Dennis -- Sami Siren
Re: Nutch training at ApacheCon EU 2008
Frisa, Raquel, VF-ES (rfrisar) wrote: Hello, I was just now thinking about attending your training session but it's not there! What's happened? Do you know if there's something planned related to Nutch? Hi, The training was canceled due to low demand. There is still plenty of interesting lucene/solr/hadoop related stuff there to attend. -- Sami Siren
Re: can't find hadoop classes necessary to use Nutch API
Ana Rodighiero wrote: I have Nutch running on my server and it crawls and searches just fine. I am writing a java program to use the search api, but cannot compile because I am missing some classes from hadoop. Are these classes included somewhere in the nutch or tomcat downloads? If not, how is the compiled distribution of nutch running without them? Where can I get the hadoop jar files? Specifically, I am trying to make a NutchBean, which requires hadoop's Configuration and Path classes. I'm not doing anything with multiple servers, so those may be the only ones I need. Is there any way to use Nutch without them? Thank you for answers to any or all of these questions. The hadoop jar (hadoop-version-core.jar) should be available under lib/. Nutch cannot be compiled/run without it. -- Sami Siren
Re: java.lang.NoClassDefFoundError Nutch 0.9
karthik085 wrote: Hi, I got nutch from svn tags - release0.9 - but can't get rid of this problem. I did: ant compile, ant jar, ant war. All of them build successfully with different versions of ant - 1.6.5 and 1.7.0. Do 'ant job'. -- Sami Siren
Re: PDF problems, inc. documents returned with XLS extension
George Weller wrote: Hi all, First I note in the logs that a large number of PDF documents have been fetched, and yet only two have been indexed, and indeed only these two appear in search results. The content limit is set high enough to allow these documents to be indexed, so I can't think why this should be. Are there any related errors on log? Secondly for those documents that ARE indexed, rather bizarrely, the document titles in the search results have a '.xls' extension. I can even search for all PDF document just by using the query 'xls'. Note that this suffix is most definitely NOT in the actual title of those files. I also chanced upon a site that seems to use Nutch (no affiliation- I just googled) and found the same problem... In the examples from your site the title is extracted from the pdf metadata so it just uses the title stored within the pdf doc. -- Sami Siren
Re: Indexer does not update the Lucene TITLE field
Sergio Morales wrote: Hi Sami, Thanks for the info. Is there any other way to share this? Create a jira issue and attach it there? -- Sami Siren
Re: Indexer does not update the Lucene TITLE field
Sergio Morales wrote: Hi, I have upgraded from Nutch 0.9 to nutch-2007-09-30_04-01-28.tar.gz. It seems the indexer is unable to update the field TITLE of the Lucene index when processing specific html documents. Please find below a brief summary of this issue: 1.- Extracted this new version in a separate directory and copied across the following configuration files: - {nutch_home_9.0}/bin/url folder, containing the urls - {nutch_home_9.0}/conf/nutch-site.xml - {nutch_home_9.0}/conf/crawl-urlfilter.txt 2.- To reproduce the issue, you would need to copy the attached html document to your webserver/filesystem. There was no html document attached. This is because the mailing list software removes them. -- Sami Siren
Re: Problems running multiple nutch nodes
Uygar BAYAR wrote: hi, thanks for the solution. It solved my log problem but not my http://www.nabble.com/java.lang.OutOfMemoryError%3A-Requested-array-size-exceeds-VM-limit-tf4562352.html and it gives this error message: Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:131) at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:149) If it works on the local jobrunner you possibly forgot to increase the memory for the spawned vm processes with a hadoop conf setting like: <property> <name>mapred.child.java.opts</name> <value>-Xmx1000m</value> </property> -- Sami Siren
Re: IOException using feed plugin - NUTCH-444
Kai_testing Middleton wrote: I hope someone can suggest a method to proceed with this RuntimeExceptionI'm getting. recheck that you have scoring plugin enabled properly (scoring-opic) in nutch configuration (in the snippet you gave below it did not exist, also the pluginRepository log you showed did not have it registered) -- Sami Siren java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required! at org.apache.nutch.scoring.ScoringFilters.init(ScoringFilters.java:87) at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:61) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:170) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126) As far as I can tell I'm using NUTCH-444 out-of-the-box since I have a nightly build. --Kai M. - Original Message From: Kai_testing Middleton [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Friday, June 29, 2007 5:24:57 PM Subject: Re: IOException using feed plugin - NUTCH-444 The exception is: java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required! I note that my nutch-site.xml does contain a reference to scoring-opic so I wonder why it would give that exception. --Kai M. - Original Message From: Kai_testing Middleton [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Friday, June 29, 2007 11:36:11 AM Subject: Re: IOException using feed plugin - NUTCH-444 Here is the more detailed stack trace: java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required! at org.apache.nutch.scoring.ScoringFilters.init(ScoringFilters.java:87) at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:61) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:170) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126) In fact, here is a complete hadoop.log for the command I attempt: nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2 21 | tee crawl.log 2007-06-29 11:28:58,785 INFO crawl.Crawl - crawl started in: /usr/tmp/lee_apollo 2007-06-29 11:28:58,788 INFO crawl.Crawl - rootUrlDir = /usr/tmp/lee_urls.txt 2007-06-29 11:28:58,789 INFO crawl.Crawl - threads = 10 2007-06-29 11:28:58,790 INFO crawl.Crawl - depth = 2 2007-06-29 11:28:58,925 INFO crawl.Injector - Injector: starting 2007-06-29 11:28:58,925 INFO crawl.Injector - Injector: crawlDb: /usr/tmp/lee_apollo/crawldb 2007-06-29 11:28:58,925 INFO crawl.Injector - Injector: urlDir: /usr/tmp/lee_urls.txt 2007-06-29 11:28:58,926 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries. 
2007-06-29 11:28:59,936 INFO plugin.PluginRepository - Plugins: looking in: /usr/local/nutch-2007-06-27_06-52-44/plugins 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true] 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - Registered Plugins: 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml) 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - Site Query Filter (query-site) 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic) 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html) 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass) 2007-06-29 11:29:00,260 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter) 2007-06-29 11:29:00,260 INFO plugin.PluginRepository - Feed Parse/Index/Query Plug-in (feed) 2007-06-29 11:29:00,260 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic) 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic) 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text) 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - JavaScript Parser (parse-js) 2007-06-29 11:29:00,261
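On the scoring-opic advice above: the check amounts to making sure the scoring plugin appears in the plugin.includes value of conf/nutch-site.xml. The snippet below is only an illustration of such a value, not the poster's actual configuration, so the other plugin names listed are assumptions:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|feed</value>
    </property>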
Re: [Nutch-general] Integrate nutch crawler with Solr index server
Doğacan Güney wrote: Hi, On 6/26/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Is this actually planned (addition of SolrIndexer to Nutch)? A search for SolrIndexer in JIRA got no hits. There is NUTCH-442 (one of the most popular issues). But, after Sami's work, there have been no further developments. I think Sami Siren's original patch no longer works with Solr, I am not sure if it still applies to nutch. So, if anyone wants to tackle this, here are a couple of items off the top of my head: It still applies to nutch (actually there were just two additional classes) and works with the original client (don't know if it's still available). I am currently working on something around solr-nutch integration and hoping that I can give out something within the next few weeks. 1) Bring Sami's patch up-to-date (both with solr and with nutch). I think a separate Indexer job is unnecessary, we should just change Indexer.OutputFormat to check for a parameter, and if it's true, OutputFormat should also send documents to Solr (besides writing it to the lucene index in DFS). I actually think that the endless adding of configuration options does not do any good to anyone, we should instead start to write reusable pieces of code and/or bring the number of different options down (IMO the massive number of already available configuration/runtime options, and the fact that most of nutch is not designed to be extended by coding, is harmful for advanced users. On the other hand I think that things are already too complicated for novice users.) 2) Make it work in distributed setups (i.e. with more than 1 index server). Sami Siren also makes a note of this, but I don't believe that a simple hash-the-url approach is appropriate for nutch. It would be nice to guarantee that a url always goes to the same indexing server, even if we add or remove index servers (if we just take the hash of the url, then adding a new machine would cause pretty much all urls to be distributed to different servers). I think that the distributed online Index part should be done outside of Nutch (or if done here do it with extreme caution:) so it does not get tied to Nutch. -- Sami Siren
Re: [Nutch-general] Integrate nutch crawler with Solr index server
I think that the distributed online Index part should be done outside of Nutch (or if done here do it with extreme caution:) so it does not get tied to Nutch. I am not sure I understand you here. If I have 10 machines I am using for serving indexes (I am assuming I have a Solr instance running on each one), IndexerSolr should be able to partition my index across 10 machines. There are more dimensions to distribution (or scaling) and the case you describe is a very basic one. Of course we could support such special setups inside nutch too; we should just remember that once it starts to look like a thing that can manage large online indexes, it would perhaps serve everyone best if it was not tied to nutch. -- Sami Siren
Re: [Nutch-general] Integrate nutch crawler with Solr index server
Doğacan Güney wrote: I actually think that the endless adding of configuration options does not do any good to anyone, we should instead start to write reusable pieces of code and/or bring the number of different options down (IMO the massive number of already available configuration/runtime options, and the fact that most of nutch is not designed to be extended by coding, is harmful for advanced users. On the other hand I think that things are already too complicated for novice users.) OK, adding new configuration options all the time is probably not a great idea. But I strongly believe that indexing to different targets should be done in Indexer.OutputFormat (OutputFormat outputs to different targets, makes sense to me :). For example, I would love the ability to index to solr but I would also need to store the original lucene index in DFS (so that if the solr machine dies, I don't lose my index). I shouldn't have to run Indexer twice to achieve this. In one application I added an extension point for different indexing backends; that way, by implementing a composite index backend, you could achieve that same thing. The code shown in the blog post was mainly done with simplicity in mind; the other motivation was doing it without touching the Nutch source code. -- Sami Siren
Re: Enabling Spell-Check plugin in contrib
Scam wrote: Hello Sami, Wednesday, June 13, 2007, 23:03, you wrote: Can anyone tell me how to use the spell-check query plugin available in the contrib \ web2 dir (and even the rest of the plugins too)? Is it similar to enabling the nutch-plugins? SS Following these steps should get you there: SS 1. compile nutch (in top level dir do ant) SS 2. crawl your data (see tutorial) SS 3. edit your conf/nutch-site.xml so it contains the plugins SS web-query-propose-spellcheck and webui-extensionpoints SS 4. edit conf/nutch-site.xml so it contains the proper dir for plugins as the SS plugins are not packaged inside the .war (something like SS <property> SS <name>plugin.folders</name> SS <value> path to plugins dir </value> SS </property> SS ) SS 5. compile web2 plugins (in contrib/web2 do ant compile-plugins) I get an error on this step: compile: [echo] Compiling plugin: web-caching-oscache [javac] Compiling 4 source files to /home/nutch/distr/nutch.src/nutch/trunk/build/web-caching-oscache/classes [javac] /home/nutch/distr/nutch.src/nutch/trunk/contrib/web2/plugins/web-caching-oscache/src/java/org/apache/nutch/webapp/CacheManager.java:32: package org.apache.nutch.webapp.common does not exist Could you help me figure out where the problem is? It seems you can just ignore step #5, because the plugins get compiled in step #7. -- Sami Siren
Re: Enabling Spell-Check plugin in contrib
chris sleeman wrote: Hi, Can anyone tell me how to use the spell-check query plugin available in the contrib \ web2 dir (and even the rest of the plugins too)? Is it similar to enabling the nutch-plugins? Following these steps should get you there: 1. compile nutch (in top level dir do ant) 2. crawl your data (see tutorial) 3. edit your conf/nutch-site.xml so it contains the plugins web-query-propose-spellcheck and webui-extensionpoints 4. edit conf/nutch-site.xml so it contains the proper dir for plugins, as the plugins are not packaged inside the .war (something like <property> <name>plugin.folders</name> <value> path to plugins dir </value> </property> ) 5. compile web2 plugins (in contrib/web2 do ant compile-plugins) 6. edit search.jsp so it contains the line <tiles:insert definition="propose" ignore="true"/> just before the second <c:choose>. 7. create the web2 app (in contrib/web2 do ant war) 8. build your spell check index ( bin/nutch plugin web-query-propose-spellcheck org.apache.nutch.spell.NGramSpeller -i indexdir -f content -o spelling ) 9. deploy the webapp to tomcat 10. start tomcat (from the dir where you have your crawl data and the ngram index generated in #8) 11. search for something that is spelled incorrectly Also how do we build the spelling index? Are these plugins still WIP? See #8 above, the whole web2 is MWSN (More Work Still Needed :) haven't been able to find any docs on these. That's because there currently is not any other documentation but the readme in http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/README.txt?view=markup I should probably put some documentation on the wiki to gain more traction. fyi - I just committed a small fix to a bug that might prevent the spell checking proposer from working. So if you have problems check out the trunk or a nightly build tomorrow. -- Sami Siren
Re: Regex-urlfilter
Naess, Ronny wrote: Can anyone please tell me what I am doing wrong? It struck me that I might be using the wrong file and that all regex exceptions should be in crawl-urlfilter.txt, but I do not think that is correct. Yes, when using the crawl command you should use crawl-urlfilter.txt, or configure crawl to use regex-urlfilter.txt via crawl-tool.xml. -- Sami Siren
Re: fetch single host
derevo wrote: hi, (2 servers hadoop nutch) I am trying to fetch my host with txt files ( http://site.net/file_1.txt ). More than 15 txt files. When I start the fetch and look in the access.log file on the target host, I see only one slave host doing the fetch (SLAVE_1). I tried to restart fetching and the slave host is now (SLAVE_2). In Task Tracker Status I see the same result. The fetchlist is by default partitioned in a way that all urls for the same host will end up being fetched by a single node (see PartitionUrlByHost). To override this you would need to change the partitioner or stop using it (both would require source code changes). -- Sami Siren
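For illustration only, a hypothetical replacement partitioner (not part of Nutch) that spreads a single host's urls across fetch nodes. The signatures follow the old non-generic org.apache.hadoop.mapred API of that era and should be checked against the Hadoop version in use; note that this defeats per-host politeness, since several nodes may then hit the same host at once:

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical alternative to PartitionUrlByHost: partition by the full
    // url string instead of by host, so one host's urls spread over all nodes.
    public class PartitionUrlByUrl implements Partitioner {
      public void configure(JobConf job) {}
      public int getPartition(WritableComparable key, Writable value,
                              int numReduceTasks) {
        // key is the url; hash the whole string, not just the host part
        return (key.toString().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }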
Re: urlfilter-suffix bug ?
Andrzej Bialecki wrote: Sami Siren wrote: Emmanuel JOKE wrote: ... those files. I tried to look at the code and I think the plugin doesn't correctly handle dynamic URLs with ? and parameters after the extension of the file. Yes, your observation is correct, the filter compares only on the string level. It isn't too hard to extend the functionality so it meets your requirement. The question is however what the intended behavior is - should we match the whole URL (including URL parameters), or should we match only the URL up to (and including) the path, but excluding any parameters? Currently we implement the former. Yes, the current behavior (especially in url space) is not what you'd probably expect, but it matches the name. -- Sami Siren
Re: nutch freezing issue
Siddharth Jonathan wrote: Hi, After a couple of days of being up, my nutch app begins to freeze/hang and basically indexing and searching can no longer happen. During this time (couple of days) is it just sitting idle or serving requests? -- Sami Siren
Re: urlfilter-suffix bug ?
Emmanuel JOKE wrote: ... those files. I tried to look at the code and I think the plugin doesn't correctly handle dynamic URLs with ? and parameters after the extension of the file. Yes, your observation is correct, the filter compares only on the string level. It isn't too hard to extend the functionality so it meets your requirement. -- Sami Siren
Re: Nutch and running crawls within a container.
Briggs wrote: Version: Nutch 0.9 (but this applies to just about all versions) I'm really in a bind. Is anyone crawling from within a web application, or is everyone running Nutch using the shell scripts provided? I am trying to write a web application around the Nutch crawling facilities, but it seems that there are huge memory issues when trying to do this. The container (tomcat 5.5.17 with 1.5 gigs of memory allocated, and 128K on the stack) runs out of memory in less than an hour. When profiling version 0.7.2 we can see that there is a constant pool of objects that grows, but never gets garbage collected. So, even when the crawl is finished, these objects tend to just hang around forever, until we get the wonderful: java.lang.OutOfMemoryError: PermGen space. I updated the application to use Nutch 0.9 and the problem got about 80x worse. Have you analyzed in any level of detail what is causing this memory waste? Have you tried tweaking the jvm's -XX:MaxPermSize? I believe that all the classes required by plugins need to be loaded multiple times (every time you execute a command where a Configuration object is created) because of the design of the plugin system, where every plugin has its own class loader (per configuration). So, the current design is/was to have an event happen within the system, which would fire off a crawler (currently just calls org.apache.nutch.crawl.Crawl.main()). But, this has caused nothing but grief. We need to have several crawlers running concurrently. We You should perhaps call the classes directly and take control of managing the Configuration object; this way PermGen space is not wasted by loading the same classes over and over again. -- Sami Siren
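A minimal sketch of the 'manage the Configuration yourself' idea. It assumes the 0.8/0.9 tool classes and their constructors that take a Configuration; the method names and paths are approximate illustrations, not the poster's code, and should be checked against the version actually in use:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.nutch.crawl.Injector;
    import org.apache.nutch.util.NutchConfiguration;

    public class CrawlRunner {
      // One Configuration (and therefore one set of plugin class loaders)
      // for the whole webapp, instead of a new one per Crawl.main() call.
      private final Configuration conf = NutchConfiguration.create();

      public void inject() throws Exception {
        // Call the tools directly with the shared conf; generate, fetch,
        // updatedb, invertlinks and index would follow the same pattern.
        new Injector(conf).inject(new Path("crawl/crawldb"), new Path("urls"));
      }
    }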
Re: Can anybody tell me how the Nutch-0.9 is different than nutch-0.8.1
Ratnesh,V2Solutions India wrote: Hi, can anybody explain to me what's new in nutch-0.9 compared to nutch-0.8.1? Since I have used nutch-0.8.1, I am keen to know how nutch-0.9 is different from the older version. I think the best place to study the changes since 0.8.1 is jira: http://issues.apache.org/jira/secure/BrowseProject.jspa?id=10680subset=3 where most of the changes are listed. -- Sami Siren
Re: Classpath and plugins question
Antony Bowesman wrote: I'm looking to use the Nutch parsing framework in a separate Lucene project. I'd like to be able to use the existing plugins directory structure as-is, so I wondered how Nutch sets up the class loading environment to find all the jar files in the plugins directories. There are dedicated class loaders for each plugin. The classpath is constructed (recursively) based on plugin metadata (plugin.xml). Any pointers to the Nutch class(es) that do the work? Check the package o.a.n.plugin which contains most of the general plug-in code. There's also a recently established project called Apache Tika [1] which has the goal of putting together a generally usable parsing/extraction framework. It hasn't gotten off the ground yet, so there is a good chance to get your voice heard. [1] http://incubator.apache.org/tika/ -- Sami Siren
Re: How to reduce the tmp disk space usage during linkdb process?
Sean Dean wrote: I think the general rule is you will require about 2.5 to 3 times the size of the final product. This is due to Hadoop creating the reduce files after the maps are produced, before the maps can be removed. I'm not aware of any way to change this, I think it's just normal functionality. The space consumption is at its worst in a single machine configuration where you have to process all the data on 1 machine. If you have more machines to spare then the space required per machine can (obviously) be divided roughly by the number of machines. I think the only way to cut down your temp size requirements (after compression, I think it's possible to compress the temp data?) is to do your work in smaller slices (see the config note after this thread). -- Sami Siren - Original Message From: qi wu [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Wednesday, April 11, 2007 10:41:35 AM Subject: Re: How to reduce the tmp disk space usage during linkdb process? One more general question related to this issue is: how to estimate the tmp space required by the overall process, which includes fetching, updating crawldb, building linkdb and indexing? For my case, 20G of space for crawldb and all segments requires more than 36G of space for linkdb tmp space, which sounds unreasonable! - Original Message - From: qi wu [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Wednesday, April 11, 2007 10:15 PM Subject: Re: How to reduce the tmp disk space usage during linkdb process? It's impossible for me to change to 0.9 now. Anyway, thank you! - Original Message - From: Sean Dean [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Wednesday, April 11, 2007 9:33 PM Subject: Re: How to reduce the tmp disk space usage during linkdb process? Nutch 0.9 can apply zlib or lzo2 compression on your linkdb (and crawldb) to reduce overall space. The average compression ratio using zlib is about 6:1 on those two databases and doesn't slow additions or segment creation down. Keep in mind, this currently only works officially on Linux and unofficially on FreeBSD. - Original Message From: qi wu [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Wednesday, April 11, 2007 9:01:30 AM Subject: How to reduce the tmp disk space usage during linkdb process? Hi, I have crawled nearly 3 million pages which are kept in 13 segments and there are 10 million entries in the crawldb. I use Nutch 0.8.1 on a single Linux box; currently the disk occupied by crawldb and segments is about 20G, and the machine still has 36G of space left.
I always failed in building linkdb, and the error was caused by no space left for reducing process, the exception is listed below: job_f506pk org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSystem.java:150) at org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:83) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:112) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) at java.io.DataOutputStream.write(DataOutputStream.java:90) at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:208) at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:913) at org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:800) at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:738) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:112) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) at java.io.DataOutputStream.write(DataOutputStream.java:90) at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:208) at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:913) at org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:800) at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:738) at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:542) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:218) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112) I wonder why so much space are required by linkdb reduce job, can I config some nutch or hadoop setting to reduce the disk space usage for linkdb? Any hints for me to overcome the problem? //bow Thanks -Qi
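On the 'compress the temp data' aside in the first reply above: map output compression may help, depending on the Hadoop version. The property name below is the one used in later 0.x Hadoop releases; whether the Hadoop bundled with Nutch 0.8.1 honors it is an assumption to verify, not a confirmed fix:

    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>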
Re: Fetcher2 too many spinWaiting, How to tune?
hi, qi wu wrote: Hi, I am using Fetcher2 with 200 threads started. I get a satisfactory speed (about 20 pages/s) at the beginning, but after no more than one hour there are many spinWaiting threads. Where might the bottleneck be? Network, memory or anywhere else? Could you also give me some hints on how to get more detailed debug info? Not specific to fetcher2, but how are the pages distributed among different hosts in the fetchlist? Have you configured a reasonable setting for generate.max.per.host in the nutch conf? If you generate too many pages for too few hosts there's no way fetcher|fetcher2 can fetch them fast unless you make it non-polite. -- Sami Siren
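For reference, the setting mentioned above goes in conf/nutch-site.xml; the value below is only an illustrative example, not a recommendation:

    <property>
      <name>generate.max.per.host</name>
      <value>100</value>
    </property>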
Re: Crawling + Indexing staging vs. production and URL conflict
Tomi N/A wrote: 2007/3/31, Sami Siren [EMAIL PROTECTED]: You could also let your reverse proxy do the rewriting using something like http://apache.webthing.com/mod_proxy_html/. I have been using something like that for rewriting massive amount of html in realtime for AA purposes to hammer web applications to different url space. Does it put the server under noticeable additional load? We ran reverse proxy (with AA) on separate machines and the load on the machines was minimal, network latency was more overhead (thinking of page download times) than rewriting couple of absolute urls. I should note however that we did not use that particular rewriter but a very similar home brew solution. -- Sami Siren
Re: Crawling + Indexing staging vs. production and URL conflict
Andrzej Bialecki wrote: [EMAIL PROTECTED] wrote: What is the best way to accomplish this? One thing I was thinking was to index the staging site, then open up CrawlDb and LinkDb (any others?), loop through them and write out a new version of those files, changing the keys (URLs) along the way, for instance from http://STAGING.example.com/foo/bar.html to http://WWW.example.com/foo/bar.html Has anyone done this? Does this sound realistic/doable? Is there a better/faster/easier way? e.g. changing URLs immediately at fetch/parse/index time? e.g. changing URLs on the fly at search time when displaying results? There is another option - when fetching configure nutch to use a URL rewriting proxy, which will rewrite on the fly your requests of www.example.com to staging.example.com, get the response, and return the content - the only thing to do then would be to rewrite absolute outlinks contained in the content, from staging to www - but this can be done in URLNormalizers. You could also let your reverse proxy do the rewriting using something like http://apache.webthing.com/mod_proxy_html/. I have been using something like that for rewriting massive amount of html in realtime for AA purposes to hammer web applications to different url space. -- Sami Siren
Re: Merging WebDBs
2007/3/23, prashant_nutch [EMAIL PROTECTED]: I created a new webdb under which I created two folders, crawldb and segments (which is a combination of two webdbs), but now I want to create a linkdb and an index - how can these be created? I use a command like this in the eclipse program arguments (Windows): invertlinks linkdb segments/* I got an error like: INFO crawl.LinkDb - LinkDb: starting INFO crawl.LinkDb - LinkDb: linkdb: invertlinks INFO crawl.LinkDb - LinkDb: adding segment: linkdb INFO crawl.LinkDb - LinkDb: adding segment: segments/* ERROR mapred.JobClient - Input directory E:/Data/prashant/Projects/DummyNutch/Nutch/linkdb/parse_data in local is invalid. Thanks in advance for help. LinkDb treats the parameter invertlinks as the path to the linkdb (the 1st parameter); remove it and the command should succeed. -- Sami Siren
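In other words, when running the LinkDb class directly from eclipse, the program arguments should not repeat the command name; with the poster's own paths the arguments would be just:

    linkdb segments/*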
Re: Nutch and GET
Damian Florczyk wrote: Hi there, Can nutch index dynamic pages with multiple GET parameters in the request? Have you allowed them in the URL filter configuration? By default the regex urlfilter filters those away: # skip URLs containing certain characters as probable queries, etc. [EMAIL PROTECTED] -- Sami Siren
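The rule that the list archiver mangled above reads, in a stock regex-urlfilter.txt of that era, roughly as shown below; commenting it out (or narrowing it) lets query-string URLs through. Treat the exact pattern as an assumption and check your own regex-urlfilter.txt:

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]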
Re: How to limit nutch to fetch, refetch and index just the injected URLs?
Nicolás Lichtmaier wrote: I've backported revision 450799 to the 0.8.x branch for supporting -noAdditions. Perhaps you could consider committing it there... (I haven't tested it yet though). Can you please create a JIRA issue for this and attach the patch there? -- Sami Siren
Re: Indexing only some filetypes with Nutch
Tobias Zahn wrote: Hello again, I think I'm going to have a problem here: what if I'd like to index only files like .gif? I think I won't get anything in my index that way :-( Is there a way to get all URLs to such files anyway (maybe as a txt list)? You would have to allow html to be fetched to find the images. You would also need to change the indexer to index just the content you are interested in (images) and skip the rest. -- Sami Siren
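A hedged sketch of the 'change the indexer' part: a hypothetical indexing filter plugin that drops every document whose content type is not an image, so only image URLs remain in the index. The signatures are approximate for the 0.8/0.9 indexing extension point (in 0.8 the url argument is UTF8 rather than Text), and the content-type lookup assumes that header made it into the parse metadata:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.lucene.document.Document;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.parse.Parse;

    // Hypothetical plugin: keep only image documents in the index.
    public class ImageOnlyIndexingFilter implements IndexingFilter {
      private Configuration conf;

      public Document filter(Document doc, Parse parse, Text url,
                             CrawlDatum datum, Inlinks inlinks) {
        String type = parse.getData().getMeta("Content-Type");
        // Returning null tells the indexer to skip this document entirely.
        if (type == null || !type.toLowerCase().startsWith("image/")) {
          return null;
        }
        return doc;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }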
Re: Compiling PruneIndexTool trouble
Jonathan Hunter wrote: Dear nutch-users, I am trying to make some changes to the Nutch's PruneIndexTool, but before I start making those changes I wanted to make sure that I am able to compile the current PruneIndexTool from the command line. I checked to make sure that the java compiler works in general by using it to compile a simple hello world program. I did this by calling the following command from my nutch directory: $ javac helloworld.java //compiles with no errors $ java helloworld hello world $ You should use ant command to compile nutch (including PruneIndexTool). $ ant -- Sami Siren
Re: How to stop a slow fetch?
I'm thinking your fetch list is at a point where it might only have a few hosts left in it, but enough pages from those hosts to stall everything. Recently there was a patch applied to trunk to help solve that problem; the generator was actually not working to its fullest capacity for some time up until that point. There's some more about that issue and how it affected a random segment here: http://blog.foofactory.fi/2007/01/sorted-out.html -- Sami Siren
Re: Nutch .81: the process to add a new analyzer ?
Chee Wu wrote: Hi, I am trying to add a new analyzer for Chinese,and I found the code below in the org.apache.nutch.indexer.Indexer The question of mine is: For doc.get(lang). Where and how can I set the lang property for lang field is put there by language identifier plugin if it is active. http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/lang/LanguageIndexingFilter.html -- Sami Siren
Re: List owner?
Owner can be reached at [EMAIL PROTECTED] What kind of error are you experiencing (if any)? -- Sami Siren James Phillips wrote: Can somebody tell me how to contact the owner of this list? I have tried on COUNTLESS occasions to remove myself using [EMAIL PROTECTED] but still keep on receiving e-mails. Regards, James Phillips
Re: Nutch .81: the process to add a new analyzer ?
chee wu wrote: Thanks Sami. I tried LanguageIndexingFilter, and it seems the LanguageIdentifier can't recognize Chinese now? No it doesn't. The list of languages can be checked here (*.ngp): http://svn.apache.org/viewvc/lucene/nutch/branches/branch-0.8/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/ You can build an ngp profile for chinese, but I think that in the language identifier's current form it might not work that well. You could also build a specialized identifier and add it as an indexing filter - the most basic form could just blindly set lang to Chinese if that suits your use case. -- Sami Siren - Original Message - From: Sami Siren [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Sunday, January 07, 2007 5:47 PM Subject: Re: Nutch .81: the process to add a new analyzer ? Chee Wu wrote: Hi, I am trying to add a new analyzer for Chinese,and I found the code below in the org.apache.nutch.indexer.Indexer The question of mine is: For doc.get(lang). Where and how can I set the lang property for lang field is put there by language identifier plugin if it is active. http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/lang/LanguageIndexingFilter.html -- Sami Siren
Re: Nutch .81: the process to add a new analyzer ?
e w wrote: If someone could explain the reasoning/motivation behind the original The current n-gram identifier in nutch works pretty much ok for most western languages. It is also a very simple and quite fast way of identifying a document's language. However, if the charset of the document is not detected right, the results are not that good. identification method that would be helpful. Otherwise, I'd be happy to contribute my pseudo-NB hack and maybe even implement the correct version. Go ahead and attach it to JIRA. I am sure there's plenty of people interested in such a thing. -- Sami Siren
Re: How best to add sponsored link support..??
Are you looking for something like the google keymatch as described in [1] which was then more or less mimiced in nutch web2 module[1], and since also atleast as a lookalike released in google code [3] -- Sami Siren [1] http://www.google.com/enterprise/mini/end_user_features.html [2] http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/plugins/web-keymatch/ [3] http://custom-keymatch-onebox.googlecode.com/svn/trunk/Keymatch.java 2006/12/19, RP [EMAIL PROTECTED]: Let me qualify this - ad banner rotation is dealt with - I'm looking for something that will use our Nutch engine to serve up relevant links from people who pay for that privilege. We do not want to serve up ad's from someone else's system i.e. the big G or Y, but use our own Nutch search results to serve up relevant paying links that we have sold and maintain. In a simple relational SQL world we would add a flag and another table with the links and scores and look that up and pass back when needed. Problem with that is that we lose the whole multi word scoring capability in Nutch i.e. pizza beer Chicago, should serve up a Chicago pizza ad first and beer ads further down, just like our search results have relevancy (not a great example but you get the idea). Re-writing a scoring engine to do that in SQL seems like a waste when Nutch already does it just fine. So in a nutshell - we need to do what the big G and Y and other do when serving up key word based sponsor links. My thought - automate the build of a dummy page with the key words bought that would be indexed and served up just like regular crawled and indexed pages, using the scoring to rank them in terms of relevancy and placement - I have not seen any snippets of code to do simple insert/update/delete operations on a Nutch segment or index however This is the idea gathering phase - think like a school/college search engine with local paying advertisers - we want to serve those links up to the searchers to help offset the cost of the service and serve up or flag links that rank first because of payment followed by normal search link results rp Sean Dean wrote: I might be totally off base with what your asking to do, but take a look at this open source project: http://phpadsnew.com/two/. Its basically an advertising engine, built on PHP. Integration within any application is a breeze, and it supports external advertising such as Google Ads. Sean - Original Message From: RP [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Tuesday, December 19, 2006 10:52:56 AM Subject: How best to add sponsored link support..?? Hi all, I've been tasked with looking into this and am not a coder - that said, Nutch is doing great and the bean counters have asked me to look into adding sponsored link results and I'm wondering how best to add this. It would be nice to utilize the Nutch engine to come up with the pages versus just doing a lookup on words and results in a flat file but the key word data could change daily (hourly) and would need to be able to be hand entered (or automated) as people sign up (re-index is not really an option). I'm not sure this would fly within the main Nutch segments and index, but I could see maybe a separate index or possibly adding a flag to the existing data but I've not seen any easy to use tools to change/update/insert records into what is already there (yes Luke on the index but that does not touch the segment data, right?). 
I don't want to change existing searched data and I don't see an issue with having duplicate results (sponsored up top and existing entry down below somewhere) but it would be more elegant to not have that occur. I also see issues in a simple flat file look up as a multiple word search is best handled inside Nutch to score the results versus having to do something similar in the sponsored results. I can see the need to control the summary text displayed and also pass thru any codes in the URL which are currently being stripped during the main crawl/index cycle. I also see issues with seriously customizing the internals as they would have to be maintained as Nutch itself is updated If anyone has looked at this and has at least some ideas on how best to do this let me know. I need to come up with a preliminary estimate before I can engage and pay the coders to make this happen so if there are any easy or best practices ways on doing this any help/pointers would be appreciated
Re: subcollections
liv wrote: - I reindex the db: delete the indexes folder, run the command: bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/* - then I inspect the resulting db with luke again Unfortunately nothing has changed. Maybe I am missing something... Please tell me if you see anything wrong. If you did exactly those steps then what happens is that the subcollections.xml is read from inside the .job file. You need to rebuild the .job to put the new file inside it. Simply do ant and rerun indexing and it should work as expected. -- Sami Siren
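Spelled out as commands (paths taken from the message above, run from the nutch top level dir), the fix would look roughly like this:

    ant                   # rebuilds the .job with the updated conf/subcollections.xml
    rm -r crawl/indexes   # or move the old indexes aside
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*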
Re: error with trunk: linkdb copied to wrong dir
Andrzej Bialecki wrote: Espen Amble Kolstad wrote: Hi, There's a bug in LinkDb.install(). It tries to rename an old linkdb from linkdb/current to linkdb/old, and linkdb/current doesn't exist. Just replace: fs.rename(current, old); with: if (fs.exists(current)) { fs.rename(current, old); } and it will work again :) Indeed, this is related to some changes of delete()'s behavior in HDFS - it seems that previously it would just return false on non-existent directories, now it throws an Exception. the needle is here? http://issues.apache.org/jira/browse/NUTCH-392 -- Sami Siren
Re: subcollections
liv wrote: I intend to use nutch with a fairly complex structure of subcollections. I did some tests and the storage/search performs as expected; however there is an aspect I may have neglected and cannot find an answer. How/at which stage are subcollections added to the index structure? If you are talking about the subcollections generated by the subcollection plugin then the subcollection data is stored at indexing phase. I plan on crawling frequently, adding new sites to existent repository, merging/reindexing as needed. However if I need to change the subcollection structure (ie. add a site to a newly created subcollection) I don't want to recrawl it again. I hope it can be done by simply using the existent/crawled data. no need to recrawl, unfortunately you still need to reindex. -- Sami Siren
Re: Fetcher hung on final hurdle - continue?
Prefix filter to cut off anything without "http://". And then a (non-existent) domain-suffix filter, which considers only domain suffixes - this is easy to implement based on the suffix filter that ships with Nutch. We should probably change the default filter to be something other than regex. -- Sami Siren
Re: indexing from local file system -- indexing from HDFS
Christian Herta wrote: I tried to index my local file system according to the FAQ: http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6 But if I add the plugin into the nutch-site.xml file like this: <property> <name>plugin.includes</name> <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value> </property> Try with: <value>protocol-(file|http)|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value> If it does not work, consult your log file logs/hadoop.log for more specific info about your problem. Additionally I have another question: * Is there a possibility to use a directory of the HDFS filesystem as a spool directory to index from? Not directly, but if you can expose [1] hdfs via some available protocol then it is possible to index the contents of hdfs also. One could also write a protocol-hdfs plugin to do the job. -- Sami Siren [1] http://issues.apache.org/jira/browse/HADOOP-4
Re: Fetch fails
frgrfg gfsdgffsd wrote: Hi all, I have a problem with the crawl/fetch of 1 website (www.lequipe.fr), although it works fine for another (www.lemonde.fr). Here are the errors: ERROR [MAT] 2006-11-22 00:36:20,860 - Http.invoke0(?) | java.lang.IllegalArgumentException: null metadata ERROR [MAT] 2006-11-22 00:36:20,870 - Http.invoke0(?) | at org.apache.nutch.protocol.Content.init(Content.java:60) ERROR [MAT] 2006-11-22 00:36:20,870 - Http.invoke0(?) | at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:196) ERROR [MAT] 2006-11-22 00:36:20,870 - Http.invoke0(?) | at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:162) Don't understand why metadata is null when there is some metadata on the pages... What version of nutch are you running? I also have this message just before: INFO [MAT] 2006-11-22 00:36:32,477 - HttpBase.getProtocolOutput(194) | Skipping: http://www.lequipe.fr/ exceeds fetcher.max.crawl.delay, max=30, Crawl-Delay=120 and I can't find this property in nutch-site.xml You need to add it there: <property> <name>fetcher.max.crawl.delay</name> <value> your value here </value> </property> -- Sami Siren
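With the angle brackets restored and a concrete value picked, the property would look like the snippet below. The value 150 is only an example: anything above the site's Crawl-Delay of 120 seconds stops the page from being skipped, at the price of a very slow, polite crawl of that host:

    <property>
      <name>fetcher.max.crawl.delay</name>
      <value>150</value>
    </property>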
Re: Nutch sessions cookies on https protocol
Gavino Marras wrote: Nutch does work with sessions and cookies on https protocol ? No, Nutch does not support cookies nor sessions. -- Sami Siren
Re: Nutch sessions cookies on https protocol
Andrzej Bialecki wrote: Sami Siren wrote: Gavino Marras wrote: Nutch does work with sessions and cookies on https protocol ? No, Nutch does not support cookies nor sessions. This is not strictly speaking true ... if you use protocol-httpclient then https, cookies and sessions are supported internally by the httpclient library, but Nutch doesn't process this information in any way. So, https works just fine, cookies are accepted and then presented if other urls are fetched during the same execution, but they are not stored anywhere. Server set cookies are just http headers so they _are_ stored with rest of the headers. Https works even without protocol-httpclient if a proxy that supports https is used. Anyway, the way I understood the question I would still answer no to sessions and cookies. -- Sami Siren
Re: Strategic Direction of Nutch
carmmello wrote: So, I think, one of the possibilities for the user of a single machine is that the Nutch developers could use some of their time to improve the previous 0.7.2, adding to it some new features, with further releases of this series. I don't believe that there are many Nutch users, in the real world of searching, with a farm of computers. I, for myself, have already built an index of more than one million pages on a single machine, with a somewhat old Athlon 2.4+ and 1 gig of memory, using the 0.7.2 version, with very good results, including the actual searching, and gave up the same task, using the 0.8 version, because of the large amount of time required, time that I did not have, to complete all the tasks, after the fetching of the pages. How fast do you need to go? I did a 1 million page crawl today with the trunk version of nutch patched with NUTCH-395 [1]. The total time for fetching was a little over 7 hrs. But of course there are still various ways to optimize the fetching process - for example optimizing the scheduling of urls to fetch, improving the nutch agent to use the Accept header [2] for failing fast on content it cannot handle, etc. [1] http://issues.apache.org/jira/browse/NUTCH-395 [2] http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg04344.html -- Sami Siren
Re: Strategic Direction of Nutch
Uroš Gruber wrote: How fast do you need to go? I did a 1 million page crawl today with the trunk version of nutch patched with NUTCH-395 [1]. The total time for fetching was a little over 7 hrs. How is that even possible? I have a 3.2GHz pentium with 2G ram. I had the same speed problem; because of that I set up nutch with a single node. About an hour ago the fetcher finished crawling 1.2 million pages. But this took I am running on an amd athlon 64 3600+ with 1 G of memory so it's not even high end. During the map job I get about 24 pages/s. I didn't test it with this patch. But then the reduce job was slow as hell. I really don't understand what took so long. It is almost twice as slow as the map job. Please try the trunk version for comparison and check back for results. (the patch is now applied to trunk) There are also other things that count (even more?), please see [1] If I use local mode the numbers are even worse. My numbers are with the local job runner. I can't imagine how long it would take to crawl, let's say, 10 million pages. I'll let you know when mine is finished; I just started a 3rd segment of size 1 million to test the trunk version (running with the local job runner) -- Sami Siren [1] http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06533.html
Re: Nutch as static exporter?
Thorsten Scherler wrote: Hi all, I wonder if I could use nutch as static exporter. I mean e.g. Apache Forrest is using the cocoon crawler but in the next version of cocoon the crawler will be probably not included anymore. Could I use nutch for that? could you please explain a bit more what as static exporter means? -- Sami Siren
Re: large number of urls from Generator are not fetched?
Are you saying that the generator generates 200k urls but the fetcher fetches around 100k, or are you saying that you generate (-topN 20) 200k urls and the fetcher fetches only around 100k? If the latter, and you are running with LocalJobRunner, you need to generate with -numFetchers 1. -- Sami Siren AJ Chen wrote: Any idea why nutch (0.9-dev) does not try to fetch every url generated? For example, if Generator generates 200,000 urls, maybe 100,000 urls will be fetched, succeeded or failed. This is a big difference, which is obvious by checking the number of urls in the log or by running readseg -list. What causes a large number of urls to get thrown out by the Fetcher? Thanks,
Re: Speeding things up!
Forgot one important one: set generate.max.per.host to something reasonable so you won't end up fetching urls from only a low number of hosts, which by default is very slow. -- Sami Siren Sami Siren wrote: Some simple rules for generally speeding things up 1. Crawl only the content you are going to handle (do not fetch for example pdf-files if you don't need them, also disable all unneeded parsers) 2. If using regex-urlfilter: If you don't need the rule -.*(/.+?)/.*?\1/.*?\1/ remove it (also keep the number of rules as small as possible still remembering #1 and #3) 3. Check your parser configuration (SEE NUTCH-362) so your CPU won't end up parsing all kinds of binary content with the text parser. You might also check the variables like fetcher.server.delay and fetcher.threads.per.host. (and remember to keep your fetcher polite!) I am using something like 300 for fetcher.threads for fetching with 0.8.1 on a single athlon 64 with 1 GB of memory. I am also in the process of fixing some IO related bottlenecks and will get back to that hopefully sooner rather than later. -- Sami Siren Marco Vanossi wrote: Hi, Do you have some hints that would improve speed for the following nutch commands? ./nutch generate db segments -topN 1000 s=`ls -d segments/2* | tail -1` ./nutch fetch $s ./nutch updatedb db $s ./nutch index $s ./nutch dedup segments tmpfile I mean, do you have some hints for the numbers set in nutch-default.xml for, for example: fetcher.threads (I'm using 10.000), etc Let's say it is running on a machine with 12GB RAM, and 2.000GB HD. Thank you very much for any help. Marco
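A hedged nutch-site.xml fragment pulling the fetcher knobs mentioned above into one place. The property names are the 0.8.x ones (fetcher.threads.fetch is the actual name behind 'fetcher.threads'); every value is only an illustration, not a recommendation, and lowering fetcher.server.delay trades politeness for speed:

    <property>
      <name>fetcher.threads.fetch</name>
      <value>300</value>
    </property>
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <value>1.0</value>
    </property>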
Re: Nutch slow how to speed up?
You are using DistributedSearch? and local filesystem to store index and related data? -- Sami Siren Håvard W. Kongsgård wrote: I have nutch 0.8.1 running on 3 servers (AMD X2 3800 with 4 000 memory), searching with queries like 'China Nuclear Forces' takes 20 – 25 s. My config: http.content.limit = 6165536 dfs.replication = 1 mapred.submit.replication = 2 mapred.child.java.opts = -Xmx800m My data: TOTAL urls: 3748140 retry 0: 3614731 retry 1: 85999 retry 2: 20772 retry 3: 26638 min score: 0.0 avg score: 0.64956105 max score: 3922.723 status 1 (DB_unfetched): 1316016 status 2 (DB_fetched): 2168397 status 3 (DB_gone): 263727 Status: HEALTHY Total size: 254534723272 B Total blocks: 5140 (avg. block size 49520374 B) Total dirs: 260 Total files: 1466 Over-replicated blocks: 8 (0.15564202 %) Under-replicated blocks: 0 (0.0 %) Target replication factor: 1 Real replication factor: 1.0015564 The filesystem under path '/' is HEALTHY
Re: Nutch slow how to speed up?
If your data to be searched lies in dfs it is slow. You need to first copy it out to local file system. Split your data into smaller slices which you then distribute evenly on your search nodes. This part of process is not that well covered and I am looking for much improvement in this area from this proposal: http://mail-archives.apache.org/mod_mbox/lucene-general/200610.mbox/[EMAIL PROTECTED] -- Sami Siren Håvard W. Kongsgård wrote: DistributedSearch 2x datanodes, 2x Task Trackers Sami Siren wrote: You are using DistributedSearch? and local filesystem to store index and related data? -- Sami Siren Håvard W. Kongsgård wrote: I have nutch 0.8.1 running on 3 servers (AMD X2 3800 with 4 000 memory), searching with queries like 'China Nuclear Forces' takes 20 – 25 s. My config: http.content.limit = 6165536 dfs.replication = 1 mapred.submit.replication = 2 mapred.child.java.opts = -Xmx800m My data: TOTAL urls: 3748140 retry 0: 3614731 retry 1: 85999 retry 2: 20772 retry 3: 26638 min score: 0.0 avg score: 0.64956105 max score: 3922.723 status 1 (DB_unfetched): 1316016 status 2 (DB_fetched): 2168397 status 3 (DB_gone): 263727 Status: HEALTHY Total size: 254534723272 B Total blocks: 5140 (avg. block size 49520374 B) Total dirs: 260 Total files: 1466 Over-replicated blocks: 8 (0.15564202 %) Under-replicated blocks: 0 (0.0 %) Target replication factor: 1 Real replication factor: 1.0015564 The filesystem under path '/' is HEALTHY
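A minimal example of the 'copy it out to the local file system' step, assuming the crawl data lives under /user/nutch/crawl in DFS (the paths are illustrative; use -get if your hadoop version lacks -copyToLocal):

    bin/hadoop dfs -copyToLocal /user/nutch/crawl /data/local/crawl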
Re: Modifying Nutch core
Right now it seems like I have to run `ant package` and then copy the nutch-0.8.jar file out of the build dir and into the nutch dir. But that takes a really long time! I'd like to just be able to run `ant compile-core` and then run bin/nutch... How should I be doing this? first: ant (to compile and to create nutch-x.x.x.job) then: bin/nutch ... -- Sami Siren
Re: Indexing the file system / best approach
Bruno Thiel wrote: All, I want to get nutch to index the file system. My first approach was to nfs-mount the file system and let nutch crawl through the hierarchy over http/Apache. This turned out to be fairly slow, ~3,000 fetches per hour. The next approach was to go via file:/// and to generate a file list to be crawled. This file list is fairly big, ~200,000 entries, and with the current 0.8.1 release of nutch the fetcher just freezes right at the end of a crawl. What exactly happens when your fetcher freezes? 200 000 entries is not a big list to be fetched. -- Sami Siren
Re: Lucene query support in Nutch
Nevertheless, I agree that there should be an option to choose the Lucene query engine instead of the Nutch-flavoured one, because Nutch has proven to be just as suitable for areas that do not require such efficient queries (intranet crawling, for instance) as it is for all-out web indexing. I agree as well. Different query parsers could perhaps be made pluggable, or at least configurable. The current implementation (or something close to it) could be the default one offered, and by configuration one could switch it to an intranet mode. Contributions, anyone? -- Sami Siren
Re: stop an index server
It seems that this was not reaching nutch-user so here it is again in case someone else is also interested. --- hello, here's an ad hoc addition to the search server to support a shutdown command. The client calls the server like this: bin/nutch 'org.apache.nutch.searcher.DistributedSearch$Client' -shutdown 127.0.0.1 -- Sami Siren Alvaro Cabrerizo wrote: 2006/9/27, Sami Siren [EMAIL PROTECTED] mailto:[EMAIL PROTECTED]: Alvaro Cabrerizo wrote: How could I stop an index server (started with bin/nutch server port index) knowing the port? Thanks in advance. It does not support such a feature. Can you describe a little bit more what you are trying to accomplish - something similar to Tomcat's SHUTDOWN? Sure, that's right. If this feature doesn't exist, I'm looking for a clue to develop a SHUTDOWN and a RESTART command using the Nutch/Hadoop API. The idea is to have a group of Java classes that lets people execute a command like SERVER_RESTART port, or, more advanced, SERVER_RESTART port ip_address. Anyway, I can execute ps aux | grep 4 in a shell and find out the process number in order to kill it, or press ^C to stop it, but this is not the solution I'm looking for. Thanks in advance. -- Sami Siren Index: src/java/org/apache/nutch/searcher/NutchBean.java === --- src/java/org/apache/nutch/searcher/NutchBean.java (revision 447940) +++ src/java/org/apache/nutch/searcher/NutchBean.java (working copy) @@ -25,10 +25,12 @@ import org.apache.hadoop.fs.*; import org.apache.hadoop.io.Closeable; +import org.apache.hadoop.ipc.RPC.Server; import org.apache.hadoop.conf.*; import org.apache.nutch.parse.*; import org.apache.nutch.indexer.*; import org.apache.nutch.crawl.Inlinks; +import org.apache.nutch.searcher.DistributedSearch.Protocol; import org.apache.nutch.util.NutchConfiguration; /** @@ -36,8 +38,8 @@ * @version $Id: NutchBean.java,v 1.19 2005/02/07 19:10:08 cutting Exp $ */ public class NutchBean - implements Searcher, HitDetailer, HitSummarizer, HitContent, HitInlinks, - DistributedSearch.Protocol, Closeable { + implements Protocol, Searcher, HitDetailer, HitSummarizer, HitContent, HitInlinks, + Closeable { public static final Log LOG = LogFactory.getLog(NutchBean.class); @@ -400,12 +402,29 @@ public long getProtocolVersion(String className, long arg1) throws IOException { if(DistributedSearch.Protocol.class.getName().equals(className)){ - return 1; + return DistributedSearch.Client.versionID; } else { throw new IOException("Unknown Protocol classname: " + className); } } - - + public void shutdown() { +try { + LOG.info("Closing NutchBean instance " + this); + this.close(); +} catch (IOException e) { + // TODO Auto-generated catch block + e.printStackTrace(); +} +final Server server=(Server)conf.getObject(DistributedSearch.DISTRIBITED_SERVER_INSTANCE); + +new Thread(){ +public void run(){ + + LOG.info("Shutting down server instance: " + server); + server.stop(); +} +}.start(); + + } } Index: src/java/org/apache/nutch/searcher/DistributedSearch.java === --- src/java/org/apache/nutch/searcher/DistributedSearch.java (revision 447940) +++ src/java/org/apache/nutch/searcher/DistributedSearch.java (working copy) @@ -38,6 +38,8 @@ /** Implements the search API over IPC connnections. */ public class DistributedSearch { + + public static final String DISTRIBITED_SERVER_INSTANCE = "DistribitedServerInstance"; public static final Log LOG = LogFactory.getLog(DistributedSearch.class); private DistributedSearch() {} // no public ctor @@ -48,11 +50,17 @@ /** The name of the segments searched by this node.
*/ String[] getSegmentNames(); + + +/** Ask server to shutdown itself + * @throws IOException */ +void shutdown(); } /** The search server. */ public static class Server { + private Server() {} /** Runs a search server. */ @@ -70,6 +78,7 @@ Configuration conf = NutchConfiguration.create(); org.apache.hadoop.ipc.Server server = getServer(conf, directory, port); + conf.setObject(DISTRIBITED_SERVER_INSTANCE, server); server.start(); server.join(); } @@ -83,7 +92,7 @@ /** The search client. */ public static class Client extends Thread -implements Searcher, HitDetailer, HitSummarizer, HitContent, HitInlinks, +implements Protocol, Searcher, HitDetailer, HitSummarizer, HitContent, HitInlinks, Runnable { private InetSocketAddress[] defaultAddresses; @@ -143,6 +152,8 @@ private static final Method SEARCH; private static final Method DETAILS; private static final Method SUMMARY
Re: Problem Searching
WebDev Freak wrote: Hi, I'm using the subcollection.xml file to create collections, but I can't find any code samples for searching for a term in a specific collection. I'm looking for Java code samples. Look in contrib/web2; there's a piece of Java code that does this (it reads the collection name from a request parameter, put there by the view part of that plugin, and modifies the query object accordingly): http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/plugins/web-subcollection/src/java/org/apache/nutch/webapp/subcollection/SubcollectionPreSearchExtension.java?view=markup So basically what you need to do is modify the Query. -- Sami Siren Thanks,
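The core of that extension boils down to something like the following sketch, assuming the 0.8-era org.apache.nutch.searcher.Query API and that the subcollection plugin indexes the field under the name "subcollection":
import org.apache.nutch.searcher.Query;

public class SubcollectionQueryHelper {
  /** Restrict an existing Nutch query to one named collection. */
  public static void restrictToCollection(Query query, String collectionName) {
    // adds a required clause on the subcollection field, e.g. +subcollection:intranet
    query.addRequiredTerm(collectionName, "subcollection");
  }
}
As with any Nutch query field, a query filter registered for that field has to be active so the clause actually ends up in the Lucene query.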
Re: stop an index server
Alvaro Cabrerizo wrote: How could I stop an index server (started with bin/nutch server port index) knowing the port? Thanks in advance. It does not support such a feature. Can you describe a little bit more what you are trying to accomplish - something similar to Tomcat's SHUTDOWN? -- Sami Siren
[ANNOUNCE] Nutch 0.8.1 available
The Nutch Project is pleased to announce the availability of the 0.8.1 release of Nutch - the open source web-search software based on Lucene and Hadoop. The release is immediately available for download from: http://lucene.apache.org/nutch/release/ Nutch 0.8.1 is a maintenance release for the 0.8 branch and fixes many serious bugs discovered in the previous release. For a list of changes see http://www.apache.org/dist/lucene/nutch/CHANGES-0.8.1.txt A big thanks to everybody who participated and made this release possible. -- Sami Siren
Re: Cannot generate all injected URLS
Are you running in non-clustered mode? Then run with the parameter -numFetchers 1 and you should get all the URLs. Perhaps we should fix this by adding a check in the generator: if the task is run with the local job runner, that param should be forced to 1 (now it defaults to job.getNumMapTasks(), which defaults to 2). -- Sami Siren Frank Kempf wrote: Hello, I got stuck with generating. Injecting 3200 URLs into the database and generating afterwards always leads to the same result of having 1632 URLs in crawl_generate. (I checked the db and it actually has 3200 entries.) No matter if I try -topN 5000 / 5 or nothing. How could I generate the whole set of first-level URLs? Kind regards Frank
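For a local (non-clustered) run, that would look roughly like this (paths are illustrative):
# force a single fetch list so all generated URLs land in one segment
bin/nutch generate crawl/crawldb crawl/segments -topN 5000 -numFetchers 1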
Re: Is that true?
Your observations are correct; 0.8 has some serious problems and we'll be putting 0.8.1 out pretty soon to fix, among other things, the performance problem you describe. -- Sami Siren 2006/9/18, carmmello [EMAIL PROTECTED]: I have been trying Nutch since its version 0.3, sometimes with some problems. Now I am using the 0.7.2 release and I'm really happy with it, to the point where I have about 1,100,000 pages indexed in a site that deals with quality and environment. But a new version means, at least in principle, a better product. So I went to try Nutch 0.8 on the same single computer (Athlon 2400+, 1 GB RAM, about a 4 Mbit connection, 53 threads), same seed sites (but in a folder, as per the tutorial). I used a depth of 2, just to try the new version (instead of 4 or 5, as I usually do), but when I went to the log, I was really terrified: the fetching was horribly slow! With Nutch 0.7.2 I got about 9 pages per second, and in Nutch 0.8 it sometimes took about 3 seconds to fetch a single page! Roughly speaking, the fetching speed was reduced by a factor of 20! So, that is my question: is that true, or have I made some big mistake? Thanks
Re: log records
Is your environment Windows or Linux? You are saying that most are not logged - can you please give an example of what is logged (and where) and also what is not? Logging in general can be configured by editing conf/log4j.properties. -- Sami Siren 2006/9/1, AJ Chen [EMAIL PROTECTED]: When running the fetcher (0.9-dev) in Eclipse, a lot of log messages are printed as expected, including the status - pages *, errors *, *kb/s. But when using the nutch script to do fetching, most of the log messages, including the status, are not in stdout, nor in logs/hadoop.log. Did I miss a setting? How do I make the nutch script print all the info and error messages to stdout or a log file? thanks, AJ
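As one example, echoing everything to the console usually just means adding a console appender in conf/log4j.properties along these lines (a sketch; the appender name "stdout" is illustrative, and DRFA is assumed to be the rolling-file appender already defined in the stock file - adjust to whatever your copy uses):
log4j.rootLogger=INFO,DRFA,stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n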
Re: Is there a way to get Nutch to parse/index by file access directly (not over HTTP)?
The fetcher can also fetch with the file protocol. This is not as efficient as it could be because you still need to go through the full crawling cycle. It would be more efficient to use (or write) a special crawler that starts from a submitted path and follows all subdirectories and files. Such a crawler could also be successfully used for efficient crawling of SMB, FTP and WebDAV resources. -- Sami Siren 2006/8/27, Sandy Polanski [EMAIL PROTECTED]: This may be more of a straight Lucene task, but I thought I'd ask anyway. Rather than using Nutch as a crawler, I'd rather just send the Nutch parser and indexer over to a directory on my server and have it detect content type by the file extension. I'd prefer to skip the whole crawling part since all of my data is local, and increase the reliability of getting all of my proper data indexed. Is this possible?
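To try the file protocol in practice, the usual changes are along these lines (a sketch; the plugin.includes value shown mirrors the 0.8-era default with protocol-http swapped for protocol-file):
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
The stock URL filter files also skip file: URLs with a rule like -^(file|ftp|mailto):, which needs to be removed or adjusted so file: URLs get through.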
Re: Nutch doesn't dive deeper
This is yet another side effect of applying TextParser to non-plain-text documents, and in this particular case it trips over namespace declarations. I propose that we remove the PlainText parser from at least the following mime types: * (default) application/rss+xml application/vnd.wap.wbxml application/vnd.wap.wmlc application/vnd.wap.wmlscriptc application/xhtml+xml application/x-latex application/x-netcdf application/x-tex application/x-texinfo application/x-troff application/x-troff-man application/x-troff-me application/x-troff-ms message/news message/rfc822 text/css text/sgml text/vnd.wap.wml text/xml text/x-setext I would guess that handling of the application/xhtml+xml mime type should be done with the html parser anyway. -- Sami Siren 2006/8/25, Michael Wechner [EMAIL PROTECTED]: I think the problem is as follows with XHTML files: 2006-08-25 16:06:11,925 WARN parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml 2006-08-25 16:06:11,965 ERROR parse.OutlinkExtractor - getOutlinks java.net.MalformedURLException: unknown protocol: xmlns at java.net.URL.<init>(URL.java:544) at java.net.URL.<init>(URL.java:434) at java.net.URL.<init>(URL.java:383) whereas maybe this could be resolved with http://issues.apache.org/jira/browse/NUTCH-359 I am kind of surprised that nobody else is having this problem with proper XHTML ;-) Thanks Michi Ken Gregoire wrote: look here, it is blocking robots: http://ulysses.wyona.org/robots.txt User-agent: * Disallow: /foo/bar.html User-agent: lenya Disallow: /foo/bar.html Michael Wechner wrote: Hi I am trying to index http://ulysses.wyona.org/ but somehow it just indexes the homepage and doesn't seem to follow any links. I have set depth 3 and other sites are being crawled deeper without a problem, but not the Ulysses page. Has anyone had similar experiences? Is it possible that Nutch has a problem with well-formed XHTML (application/xhtml+xml)? Thanks Michi -- Michael Wechner Wyona - Open Source Content Management - Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED][EMAIL PROTECTED] +41 44 272 91 61
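In parse-plugins.xml, routing XHTML to the HTML parser instead of the text parser would look roughly like this (a sketch of the mapping format, not a tested change):
<mimeType name="application/xhtml+xml">
  <plugin id="parse-html" />
</mimeType>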
Re: Making crawler stop after all pages are found.
The job should terminate on its own, but not as soon as all pages are found - only after -depth iterations. Are you saying it won't honor the -depth parameter? -- Sami Siren Sandy Polanski wrote: Sami, in 0.7.2 my intranet crawling job did terminate on its own. The issue that I described only started since I began to use 0.8. Maybe you understand the changes in the code/methods between the versions better than I do, so that you could point me in the right direction (for opening an issue or writing a patch). sami siren [EMAIL PROTECTED] wrote: There's no such feature present in Nutch currently. Feel free to open an issue (of type new feature) in the Nutch Jira and provide a patch, or wait until someone else gets to it. -- Sami Siren 2006/8/27, Sandy Polanski : On my intranet, I have 8100 documents. The nutch crawler finds all of them fine, but the process does not end. It just keeps on creating empty segment timestamp directories. What conf setting will make it stop on its own when there are no more links in the fetch list? Thanks, Sandy
Re: Nutch doesn't dive deeper
2006/8/27, Chris Mattmann [EMAIL PROTECTED]: Hi Sami, I'm not sure that I agree that the entire set of mime types that you list below should be removed from the parse-plugins.xml default mapping. For instance, if you look at the current mapping file, many of the types below would have no other option for parsing them besides the TextParser. I think it makes a lot of sense to parse some of the below documents with the TextParser because, in fact, they are text documents. A LaTeX document is a plain text document. Yes, it can contain textual content among other things. However, without proper parsing the outcome (or at least parts of it) is not something I would like to see in search results. Text/css is essentially a plain text document. Yes, the contents are most often ASCII, but is it really something one wants to index by default? An rfc822 message is indeed (stripped of headers) a plain text document. Yes, the contents are most often ASCII, but I guess just as often encoded (for example MIME) so as to be more or less useless in unparsed form. There's a careful tradeoff that must be made in terms of having a default config file that allows the greatest coverage of the mime types that are available, and the handling of them with at least *one* parser, in contrast to not including any parser at all for a particular mime type. I struggled with this very issue when I initially created that file, and what you see in there now represents a best guess of mime types mapped to the available parsers that exist in Nutch. The other option on that file is that people can modify it on their own. For instance, in a domain-specific deployment, a user can add and remove whatever mime-type-to-plugin mappings she wants from the parse-plugins.xml file: it was never meant to be something set in stone per se. It would be good to see some experiments to see what the best config set for parse-plugins.xml is. My opinion is that we should not try to pretend to be able to parse something when we really can't. We should provide a default config that allows the greatest set of mime types Nutch really can handle. Then again, those two text types of documents you picked are quite rare and not mainstream, and enabling/disabling them probably doesn't make any real difference in search results. -- Sami Siren
Re: Making crawler stop after all pages are found.
There's no such feature present in Nutch currently. Feel free to open an issue (of type new feature) in the Nutch Jira and provide a patch, or wait until someone else gets to it. -- Sami Siren 2006/8/27, Sandy Polanski [EMAIL PROTECTED]: On my intranet, I have 8100 documents. The nutch crawler finds all of them fine, but the process does not end. It just keeps on creating empty segment timestamp directories. What conf setting will make it stop on its own when there are no more links in the fetch list? Thanks, Sandy