Re: embed nutch crawl in an application
This is an interesting question. If you know how to run the Crawl process from another Java program, please let me know. Thanks in advance.

n_developer wrote:
> Generally a Nutch crawl is done through Cygwin. If I don't want to run Cygwin,
> and I want to crawl from an application of my own, what can I do?
>
> Also, I want Nutch to perform wildcard query searches (as in, if the search
> query is book*, then it should return all search results which contain "isbn"
> followed by any text). This is possible in Luke/Lucene. But how can I
> implement it in Nutch search?

--
View this message in context: http://www.nabble.com/embed-nutch-crawl-in-an-application-tp22572933p22573211.html
Sent from the Nutch - User mailing list archive at Nabble.com.
embed nutch crawl in an application
Generally a Nutch crawl is done through Cygwin. If I don't want to run Cygwin, and I want to crawl from an application of my own, what can I do?

Also, I want Nutch to perform wildcard query searches (as in, if the search query is book*, then it should return all search results which contain "isbn" followed by any text). This is possible in Luke/Lucene. But how can I implement it in Nutch search?
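For later readers: a minimal sketch of one way to embed the crawl, assuming Nutch 0.9/1.0, where org.apache.nutch.crawl.Crawl is the class that "bin/nutch crawl" invokes. The argument values and the reflective lookup are illustrative assumptions; the Nutch jars and conf/ directory must be on the classpath for the call to actually run a crawl.

```java
// Sketch (unverified): invoke Nutch's Crawl entry point from your own Java code.
// Reflection keeps this compilable without the Nutch jars; in a real
// application you would import org.apache.nutch.crawl.Crawl directly.
import java.lang.reflect.Method;

public class EmbeddedCrawl {

    // Returns true if the Nutch entry point is visible on the classpath.
    public static boolean entryPointAvailable() {
        try {
            Class.forName("org.apache.nutch.crawl.Crawl");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Same arguments you would give "bin/nutch crawl":
        // seed URL dir, output dir, depth, topN.
        String[] crawlArgs = {"urls", "-dir", "crawl", "-depth", "3", "-topN", "50"};
        if (entryPointAvailable()) {
            Method main = Class.forName("org.apache.nutch.crawl.Crawl")
                               .getMethod("main", String[].class);
            main.invoke(null, (Object) crawlArgs); // runs the whole crawl in-process
        } else {
            System.out.println("Nutch is not on the classpath; add the Nutch jars and conf/ first.");
        }
    }
}
```

This simply drives the same entry point the shell script does, so no Cygwin is needed; all configuration still comes from nutch-default.xml / nutch-site.xml on the classpath.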
Nutch-based Application for Windows
Hi All,

For fun, I created a Windows-based installer for Nutch and added an administrative GUI to it. If interested, you can grab it from FreewareFiles: http://www.freewarefiles.com/WhelanLabs-Search-Engine-Manager_program_47202.html

Regards,
John
Re: The Future of Nutch
Dennis,

That adds another dimension to the issue, which I had not considered. One avenue, as you suggest, would be to add another committer to the Lucene PMC. If that does not work, then maybe going the route of TLP is the best option.

Marc

> Part of this is about releases. Currently releases are voted on by Lucene
> PMC members and it takes 3 members to confirm a vote. There are only 2
> Nutch committers on the Lucene PMC. So for releases, not that we have had
> many recently, other Lucene PMC members who may not be actively associated
> with Nutch would need to vote to release. If Nutch was a TLP there would be
> a Nutch PMC which would most likely include all current Nutch committers.
> The other option may be to add another Nutch committer to the Lucene PMC.
>
>> My thoughts. And hopefully in the near future my small team will be
>> able to contribute to Nutch in a meaningful way.
>
> Any and every contribution is welcome.
>
> Dennis
Re: The Future of Nutch
Marc,

Glad you responded. Always good to hear people's thoughts.

Marc Boucher wrote:

Dennis, Otis et al,

My very small team has kept silent for a long time. We've been playing with Nutch, Hadoop, and to a lesser extent Solr for about 2 years now. Before I get into my thoughts on what direction things should take, I would like to offer a thought on why Nutch is not as active as other groups. I think in part it's because of what Nutch represents, and that's the ability to create large-scale search. Some developers would rather use Nutch and associated tools and keep quiet about it because of their goals, which in some cases might mean competing against the likes of Google, Yahoo, Ask, MSN Live, etc. For my part I'm not going to compete with those companies on large-scale search, but I can see competition in the vertical markets. And while Solr is hot these days, it's intended primarily for the enterprise market, which is very different from the large-scale and vertical markets.

I completely agree. The group of people/companies that are creating large-scale search solutions, whether whole-web or vertical, is much smaller than, say, enterprise search or even the potential uses for Hadoop.

Now on to the future. I agree with many of the thoughts Otis put forward. While Nutch has its problems, other than Heritrix there is no other open source system available, and Nutch's ability to perform web-wide crawls must be preserved. However, I'm thinking we should have a modular approach to Nutch. For instance, why just one fetcher? Why not keep the current one but also allow for the possibility of using Droids? Parsing can and should include Tika. I'm not sure about outsourcing indexing and searching to Solr, but that could be a modular option as well.

Yup. It should IMO also be easy to install and configure.
I was having a discussion today where the main topic was: could we make Nutch have a nice graphical web interface for configuration, where you could drop it in, change some options, and create a customized vertical search over x domains?

I'm not sure if Nutch should become a top-level project and move out from under Lucene. Lucene has great visibility, and for many reasons. If Nutch was moved, would it still attract enough attention? It's been noted that developer interest in Nutch is different from Lucene, Solr, etc. On the other hand, it might do Nutch good to go TLP, as maybe then it would attract more developers, especially if it was packaged differently.

Part of this is about releases. Currently releases are voted on by Lucene PMC members and it takes 3 members to confirm a vote. There are only 2 Nutch committers on the Lucene PMC. So for releases, not that we have had many recently, other Lucene PMC members who may not be actively associated with Nutch would need to vote to release. If Nutch was a TLP there would be a Nutch PMC which would most likely include all current Nutch committers. The other option may be to add another Nutch committer to the Lucene PMC.

My thoughts. And hopefully in the near future my small team will be able to contribute to Nutch in a meaningful way.

Any and every contribution is welcome.

Dennis

Marc Boucher
http://hyperix.com

On Mon, Mar 16, 2009 at 5:50 PM, Otis Gospodnetic wrote:

Hello,

Comments inlined.

- Original Message
From: Dennis Kubes
To: nutch-user@lucene.apache.org
Sent: Friday, March 13, 2009 8:19:37 PM

With the release of Nutch 1.0 I think it is a good time to begin a discussion about the future of Nutch. Here are some things to consider, and I would love to hear everyone's views on this.

Nutch's original intention was as a large-scale www search engine. That is a very specific goal. Only a few people and organizations actually use it on that level.
(I just happen to be one of them, as most of my work focuses on large-scale web search as opposed to vertical search.)

Yes, there are fewer parties doing large-scale web crawling. Still, as there is no alternative fetcher+parser+indexer+searcher capable of handling large-scale deployments like Nutch (or maybe Heritrix has the same scaling capabilities?), I think Nutch's ability to perform web-wide crawls, etc. should be preserved.

Many, perhaps most, people using Nutch these days are either using parts of Nutch, such as the crawler, or are targeting vertical or intranet type search engines. This can be seen in how many people have already started using the Solr integration features. So while Nutch was originally intended as a www search, IMO most people aren't using it for that purpose.

That's my experience, too. I think we can have both under the same Nutch roof.

Since there are different purposes for different users, would it be good to consider moving Nutch to a top-level Apache project, out from under the Lucene umbrella? This would then allow the creation of Nutch sub-projects, such as nutch-solr, nutch-hbase. Thoughts?

I disagree, at least in the near term.
Re: Professional Nutch Support and Distribution
This sounds interesting. I might be interested in this. Marc Boucher http://hyperix.com On Tue, Mar 17, 2009 at 12:31 PM, Dennis Kubes wrote: > Wanted to gauge community interest in having a certified Nutch distribution > with support? Similar to what Lucid Imagination is doing for Solr and > Lucene and what Cloudera is providing for Hadoop. Anybody interested? > > Dennis >
Re: The Future of Nutch
Dennis, Otis et al,

My very small team has kept silent for a long time. We've been playing with Nutch, Hadoop, and to a lesser extent Solr for about 2 years now. Before I get into my thoughts on what direction things should take, I would like to offer a thought on why Nutch is not as active as other groups. I think in part it's because of what Nutch represents, and that's the ability to create large-scale search. Some developers would rather use Nutch and associated tools and keep quiet about it because of their goals, which in some cases might mean competing against the likes of Google, Yahoo, Ask, MSN Live, etc. For my part I'm not going to compete with those companies on large-scale search, but I can see competition in the vertical markets. And while Solr is hot these days, it's intended primarily for the enterprise market, which is very different from the large-scale and vertical markets.

Now on to the future. I agree with many of the thoughts Otis put forward. While Nutch has its problems, other than Heritrix there is no other open source system available, and Nutch's ability to perform web-wide crawls must be preserved. However, I'm thinking we should have a modular approach to Nutch. For instance, why just one fetcher? Why not keep the current one but also allow for the possibility of using Droids? Parsing can and should include Tika. I'm not sure about outsourcing indexing and searching to Solr, but that could be a modular option as well.

I'm not sure if Nutch should become a top-level project and move out from under Lucene. Lucene has great visibility, and for many reasons. If Nutch was moved, would it still attract enough attention? It's been noted that developer interest in Nutch is different from Lucene, Solr, etc. On the other hand, it might do Nutch good to go TLP, as maybe then it would attract more developers, especially if it was packaged differently.

My thoughts.
And hopefully in the near future my small team will be able to contribute to Nutch in a meaningful way.

Marc Boucher
http://hyperix.com

On Mon, Mar 16, 2009 at 5:50 PM, Otis Gospodnetic wrote:
>
> Hello,
>
> Comments inlined.
>
> - Original Message
>> From: Dennis Kubes
>> To: nutch-user@lucene.apache.org
>> Sent: Friday, March 13, 2009 8:19:37 PM
>>
>> With the release of Nutch 1.0 I think it is a good time to begin a discussion
>> about the future of Nutch. Here are some things to consider, and I would love
>> to hear everyone's views on this.
>>
>> Nutch's original intention was as a large-scale www search engine. That is a
>> very specific goal. Only a few people and organizations actually use it on
>> that level. (I just happen to be one of them, as most of my work focuses on
>> large-scale web search as opposed to vertical search.)
>
> Yes, there are fewer parties doing large-scale web crawling. Still, as there
> is no alternative fetcher+parser+indexer+searcher capable of handling large
> scale deployments like Nutch (or maybe Heritrix has the same scaling
> capabilities?), I think Nutch's ability to perform web-wide crawls, etc.
> should be preserved.
>
>> Many, perhaps most, people
>> using Nutch these days are either using parts of Nutch, such as the crawler,
>> or are targeting vertical or intranet type search engines. This can be
>> seen in how many people have already started using the Solr integration
>> features. So while Nutch was originally intended as a www search, IMO most
>> people aren't using it for that purpose.
>
> That's my experience, too. I think we can have both under the same Nutch
> roof.
>
>> Since there are different purposes for different users, would it be good to
>> consider moving Nutch to a top-level Apache project, out from under the
>> Lucene umbrella? This would then allow the creation of Nutch sub-projects,
>> such as nutch-solr, nutch-hbase. Thoughts?
>
> I disagree, at least in the near term.
> There is nothing preventing those
> sub-projects existing under Nutch today. Both Solr and Lucene have the
> contrib area where similar sub-projects live. I think it's not a matter of
> being a TLP, but rather attracting enough developer interest, then user
> interest, and then contributor interest, so that these sub-projects can be
> created, maintained, advanced. Right now, Solr gets a TON of attention, as
> does Lucene. Nutch gets the least developer attention, and for some reason
> the nutch-user subscribers "feel" a bit different from solr-user or java-user
> subscribers.
>
>> Many parts of Nutch have also been implemented in other projects. For
>> example, Tika for the parsers, Droids for the crawler. It begs the question
>> what Nutch's core features are going forward. When I think about search
>> (again, my perspective is large scale), I think crawling or acquisition of
>> data, parsing, analysis, indexing, deployment, and searching. I personally
>> think that there is much room for improvement in crawling and especially
>> analysis
Re: nutch - solr integration advantages
Hello Bartosz,

I can only really describe my own experiences, and what I have done with Nutch/Solr is pretty simple. My reasons for using Nutch/Solr were that the query interface to Solr is more powerful (Nutch is optimised for speed instead) and that I felt it would be easier for me to integrate Solr into my Python/Django front end than it would be to integrate Nutch using OpenSearch.

Thanks,
Andy

2009/3/17 Bartosz Gadzimski
> Hello,
>
> It's hard for me to "get the big picture" of why to use Solr for indexing and
> searching.
>
> Could someone try to describe this a little bit?
>
> I understand that Nutch does the crawling and Solr just the indexing and
> searching?
>
> Any help would be great.
>
> Thanks,
> Bartosz
nutch - solr integration advantages
Hello, It's hard for me to "get big picture" of why to use solr as indexing and searching. Could someone try to describe this a little bit? I understand that nutch is doing crawling and solr just indexing and searching? Any help would be great. Thanks, Bartosz
Re: Task failed to report status when merging segments
I raised the heap size to 2GB for each child in "mapred.child.java.opts" and the segment merging succeeded.

Justin Yao wrote:

Hi,

I encountered an error when I tried to merge segments using the latest nightly build of Nutch. I have 3 Hadoop nodes and all servers have CentOS 5.2 installed. Every time I tried to merge segments using the command "nutch mergesegs crawl/MERGEDsegments -dir crawl/segments", it would fail with the error message:

Task attempt: /default-rack/10.9.17.206
Cleanup Attempt: /default-rack/10.9.17.206
"Task attempt_200903161037_0001_r_03_0 failed to report status for 1200 seconds. Killing!"

Then another child task would be launched, and later I got this error message:

org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/justin/crawl/MERGEDsegments/20090316143643/crawl_generate/part-3 for DFSClient_attempt_200903161037_0001_r_03_1 on client 10.6.180.2 because current leaseholder is trying to recreate file.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1055)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:998)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:301)
        at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

        at org.apache.hadoop.ipc.Client.call(Client.java:697)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at $Proxy1.create(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy1.create(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.(DFSClient.java:2585)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:454)
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:190)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
        at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.(SequenceFile.java:1074)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:306)
        at org.apache.nutch.segment.SegmentMerger$SegmentOutputFormat$1.ensureSequenceFile(SegmentMerger.java:252)
        at org.apache.nutch.segment.SegmentMerger$SegmentOutputFormat$1.write(SegmentMerger.java:211)
        at org.apache.nutch.segment.SegmentMerger$SegmentOutputFormat$1.write(SegmentMerger.java:194)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
        at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:479)
        at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:113)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)

Here is the log of the namenode:

2009-03-16 17:03:20,794 WARN hdfs.StateChange - DIR* NameSystem.startFile: failed to create file /user/justin/crawl/MERGEDsegments/20090316143643/crawl_generate/part-3 for DFSClient_attempt_200903161037_0001_r_03_1 on client 10.6.180.2 because current leaseholder is trying to recreate file.
2009-03-16 17:04:20,798 WARN hdfs.StateChange - DIR* NameSystem.startFile: failed to create file /user/justin/crawl/MERGEDsegments/20090316143643/crawl_generate/part-3 for DFSClient_attempt_200903161037_0001_r_03_1 on client 10.6.180.2 because current leaseholder is trying to recreate file.
2009-03-16 17:05:20,803 WARN hdfs.StateChange - DIR* NameSystem.startFile: failed to create file /user/justin/crawl/MERGEDsegments/20090316143643/crawl_generate/part-3 for DFSClient_attempt_200903161037_0001_r_03_1 on client 10.6.180.2 because current leaseholder is trying to recreate file.

I checked the processes on the Hadoop nodes; the failed reduce process was never killed and kept running. I've tried to merge segments several times and it always failed with the same error. Has anyone encountered this problem before? Is there any solution to avoid it? Any suggestion would be appreciated.

Thanks,
--
Justin Yao
Snooth
o: 646.723.4328
c: 718.662.6362
jus...@snooth.com
Snooth -- Over 2 million ratings and counting...
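For reference, the heap increase described above corresponds to a hadoop-site.xml entry along these lines (a sketch for Hadoop of that era; the exact -Xmx value is whatever fits your nodes, with 2000m matching the 2GB mentioned):

```xml
<!-- hadoop-site.xml: give each map/reduce child JVM a larger heap -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2000m</value>
</property>
```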
Professional Nutch Support and Distribution
Wanted to gauge community interest in having a certified Nutch distribution with support? Similar to what Lucid Imagination is doing for Solr and Lucene and what Cloudera is providing for Hadoop. Anybody interested? Dennis
Re: Index Disaster Recovery
On Mar 16, 2009, at 7:55 PM, Otis Gospodnetic wrote:

Eric,

There are a couple of ways you can back up a Lucene index built by Solr:

1) have a look at the Solr replication scripts, specifically snapshooter. This script creates a snapshot of an index. It's typically triggered by Solr after its "commit" or "optimize" calls, when the index is "stable" and not being modified. If you use snapshooter to create index snapshots, you could simply grab a snapshot and there is your backup.

2) have a look at Solr's new replication mechanism (info on the Solr Wiki), which does something similar to the above, but without relying on replication (shell) scripts. It does everything via HTTP.

In my 10 years of using Lucene and N years of using Solr and Nutch I've never had index corruption. Nowadays Lucene even has transactions, so it's much harder (theoretically impossible) to corrupt the index.

Thank you for the information. I happened to read about snapshooter about 10 minutes after I sent that message, but didn't know about replication. It inspires confidence that you haven't experienced index corruption in your years of using this technology.

Eric
--
Eric J. Christeson
Enterprise Computing and Infrastructure (701) 231-8693 (Voice)
North Dakota State University
wild card query in nutch
Hello people,

I have used Nutch 0.9 to crawl my application. While searching, it's not giving results for a query which is part of a string. For example, the word "Message" is indexed, and the search query is "essa"; it's not matching "message", and hence it gives no results. So how do I make Nutch search in these cases? How can I enable wildcard queries in Nutch? Should I write a plugin for that? Please reply ASAP.
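Lucene itself supports this kind of query (WildcardQuery matches patterns like *essa* against the term dictionary), but the stock Nutch query filters don't expose it, so a query plugin would be needed. As a self-contained illustration of just the matching semantics (no Lucene dependency; the class and method names here are made up for the example):

```java
// Illustration only: how a wildcard pattern like "*essa*" or "book*" maps onto
// term matching. In Lucene this is what WildcardQuery implements; Nutch would
// need a query plugin to pass such queries through to the index.
import java.util.regex.Pattern;

public class WildcardDemo {

    // Translate a simple wildcard pattern (* = any run of characters,
    // ? = any single character) into an anchored regex and test a term.
    public static boolean matches(String term, String wildcardPattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcardPattern.toCharArray()) {
            if (c == '*') regex.append(".*");
            else if (c == '?') regex.append('.');
            else regex.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.matches(regex.toString(), term);
    }

    public static void main(String[] args) {
        System.out.println(matches("message", "*essa*")); // substring match -> true
        System.out.println(matches("book123", "book*"));  // prefix match -> true
        System.out.println(matches("message", "essa"));   // no wildcards: exact only -> false
    }
}
```

Note that the original question's query "essa" has no wildcards at all, which is why it finds nothing; the user would need to submit *essa* (and leading-wildcard queries are expensive, since they can't use the sorted term dictionary).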
Re: Original tags, attribute defs, multiword tokens, how is this done.
On Mar 17, 2009, at 9:04 AM, Lukas, Ray wrote:

Question Four (I will start hunting for this): Last one, promise. The indexes themselves. Is there an explanation written up for each of the fields in the index?

http://wiki.apache.org/nutch/IndexStructure is the closest thing I've found apart from reading the code.

--
Eric J. Christeson
Enterprise Computing and Infrastructure (701) 231-8693 (Voice)
North Dakota State University
Re: Original tags, attribute defs, multiword tokens, how is this done.
Please see the inline comments!

On Tue, Mar 17, 2009 at 7:34 PM, Lukas, Ray wrote:
>
> I have some basic questions about Nutch. Can someone point me in the
> right direction, or if you have time, maybe just blast out an answer.
>
> Question One:
> I can see the terms that come from the web page. Can I set up a way to
> also add these things to the index. In other words, if "ice cream" came
> from a tag I want to know.

Modify the indexing plugin to include such changes; you can add more fields in the plugin. Of course, you also need to modify the HTML parser so that it keeps a record of the headings in a document being parsed. For example, you can include a "Heading" field in the index which contains the terms of a document that appear in headings. While searching, you can give more boost to a document if the query terms are found in the "Heading" field. For that you need to modify the query formulation; for more, see the documentation about Lucene query formulation.

> Question Two:
> "Ice Cream" is really two words. But in the index it will be stored as
> two entries. How can I tell Nutch (Lucene) that this and other things
> are to be treated as one Token.. I know that somehow I will need to
> supply a dictionary of these terms, but is it possible.. and if so how?

If you have a multi-word extractor (MWE) or a dictionary, then before indexing a document you can invoke the MWE or look it up in the dictionary, create an "MWE" field in the index, and give more boost if the query terms are found in that field. In some sense the Lucene/Nutch ranking already handles this; for more details see the "coord" factor in Lucene ranking. However, if you still want to give more boost to the multi-word terms, you can do it by setting the boost high in the Lucene query; again, see Lucene query formulation.

> Question Three (I will start hunting for this):
> I have to hunt around for this so.. I have not yet.. but since I am
> asking questions.. How can I add more stop words into the stop word
> list?
You can look at the SMART system's stop word list. Or, if you are looking for domain-specific stop words, you can generate one using frequency analysis on some document collection.

> Question Four (I will start hunting for this):
> Last one, promise.. The indexes themselves. Is there an explanation
> written up for each of the fields in the index.

I'm not sure, but look at the Nutch wiki; you might find something.

> Thanks for the help
> Ray
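To make the answer to Question Two concrete, here is a self-contained sketch of the dictionary-based multi-word idea (no Lucene dependency; in a real setup this logic would live in a custom analyzer or indexing plugin, and the dictionary contents below are just example data):

```java
// Sketch: merge known multi-word phrases ("ice cream") into single tokens
// using a phrase dictionary, before the terms are handed to the indexer.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PhraseMerger {

    // Merge adjacent token pairs that appear in the phrase dictionary.
    public static List<String> merge(List<String> tokens, Set<String> phrases) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (i + 1 < tokens.size()) {
                String bigram = tokens.get(i) + " " + tokens.get(i + 1);
                if (phrases.contains(bigram)) {
                    out.add(bigram); // one token for the whole phrase
                    i++;             // skip the second word
                    continue;
                }
            }
            out.add(tokens.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("ice cream", "new york"));
        List<String> tokens = Arrays.asList("best", "ice", "cream", "in", "new", "york");
        System.out.println(merge(tokens, dict));
        // prints: [best, ice cream, in, new york]
    }
}
```

The same pass could instead write the merged phrases into a separate "MWE" field, as suggested above, so the single-word tokens stay searchable too.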
Original tags, attribute defs, multiword tokens, how is this done.
I have some basic questions about Nutch. Can someone point me in the right direction, or if you have time, maybe just blast out an answer.

Question One:
I can see the terms that come from the web page. Can I set up a way to also add these things to the index? In other words, if "ice cream" came from a tag, I want to know.

Question Two:
"Ice Cream" is really two words. But in the index it will be stored as two entries. How can I tell Nutch (Lucene) that this and other things are to be treated as one token? I know that somehow I will need to supply a dictionary of these terms, but is it possible, and if so how?

Question Three (I will start hunting for this):
I have to hunt around for this, so I have not yet, but since I am asking questions: how can I add more stop words into the stop word list?

Question Four (I will start hunting for this):
Last one, promise. The indexes themselves. Is there an explanation written up for each of the fields in the index?

Thanks for the help
Ray
Re: Fetcher2 Slow
Roger Dunk wrote:

Andrzej stated in NUTCH-669 that "some people reported performance issues with Fetcher2, i.e. that it doesn't use the available bandwidth. These reports are unconfirmed, and they may have been caused by suboptimal URL / host distribution in a fetchlist - but it would be good to review the synchronization and threading aspects of Fetcher2." To address this, I've tried just now generating a fetchlist using generate.max.per.host = 1 (which gave me 35,000 unique hosts) to guarantee unique hosts, but the problem still remains. Therefore, I believe it's clearly not an issue of suboptimal URL / host distribution. If you require any further information to confirm my report, you need only ask!

Thanks for reporting this. Yes, we need more information - it's best if you create a JIRA issue, because then it will be easier to send attachments. What we need at this moment is:

* the fetchlist - just zip the crawl_generate and attach it.
* nutch-site.xml and hadoop-site.xml (if you run in a distributed mode).
* cmd-line parameters, specifically the number of threads and -noParsing
* information about your environment (OS, cpu/mem, heapsize, JVM version).

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
Re: nutch 0.7
Just check out the code from the SVN branch and build it yourself; I think it's easy enough.

On Tue, Mar 17, 2009 at 5:21 PM, Mayank Kamthan wrote:
> Hello people,
> Please provide a pointer to the 0.7 release. I need it urgently.
>
> Thanks and regards,
> Mayank.
>
> On Mon, Mar 16, 2009 at 2:23 PM, Mayank Kamthan wrote:
>
>> Hi!
>>
>> I need Nutch 0.7. Can someone please provide me a pointer to download it?
>> When I try via the Apache site it leads me to Nutch 0.9.
>> Please give a pointer for the 0.7 release.
>>
>> Regards,
>> Mayank.

--
---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana
Re: nutch 0.7
Hello people,
Please provide a pointer to the 0.7 release. I need it urgently.

Thanks and regards,
Mayank.

On Mon, Mar 16, 2009 at 2:23 PM, Mayank Kamthan wrote:
> Hi!
>
> I need Nutch 0.7. Can someone please provide me a pointer to download it?
> When I try via the Apache site it leads me to Nutch 0.9.
> Please give a pointer for the 0.7 release.
>
> Regards,
> Mayank.
Re: Fetcher2 Slow
Andrzej stated in NUTCH-669 that "some people reported performance issues with Fetcher2, i.e. that it doesn't use the available bandwidth. These reports are unconfirmed, and they may have been caused by suboptimal URL / host distribution in a fetchlist - but it would be good to review the synchronization and threading aspects of Fetcher2."

To address this, I've tried just now generating a fetchlist using generate.max.per.host = 1 (which gave me 35,000 unique hosts) to guarantee unique hosts, but the problem still remains. Therefore, I believe it's clearly not an issue of suboptimal URL / host distribution. If you require any further information to confirm my report, you need only ask!

Cheers...
Roger

--
From: "Roger Dunk"
Sent: Tuesday, March 17, 2009 7:10 PM
Subject: Re: Fetcher2 Slow

Now that the soon to be released v1 uses Fetcher2 as the default (or as the only fetcher available?), I would think that this slowness problem that a number of users are facing might be addressed?

In short, the case for me is like this:

Nutch trunk revision 755143
JDK 1.6_12 on Linux
Crawl list consists of ~40,000 URLs from dmoz, so they are naturally well distributed among hosts (i.e. mostly unique hosts).

Config options:
fetcher.threads.fetch = 80
fetcher.threads.per.host = 80
fetcher.server.delay = 0

The result? Most of the time, something like this:

activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0

If I'm lucky, it might fetch around 1 page per second (or less). What I have noticed is that if I let it run for a while, cancel the fetch, and start it again from the beginning, it runs very quickly for a while before it slows right down to a trickle again. My guess is that the hosts that have been cached by my caching NS are fetched quickly, but new lookups are taking an age and slowing things down. However, I don't believe my NS is slow by any means. And furthermore, the old Fetcher1 never had this problem. Any ideas where to look to track this down?
Thanks,
Roger

--
From: "Roger Dunk"
Sent: Thursday, February 05, 2009 2:16 PM
Subject: Re: Fetcher2 Slow

It makes no difference if I set fetcher.threads.per.host to 1 or 100, which I assume is what you were suggesting? I also stated that the majority of pages to fetch were from unique hosts, so I believe the value of this parameter should not really come into play.

Cheers...
Roger

--
From: "Laurent Laborde"
Sent: Tuesday, February 03, 2009 5:51 PM
Subject: Re: Fetcher2 Slow

On Tue, Feb 3, 2009 at 4:10 AM, Roger Dunk wrote:

Hi all,

I'm having no luck whatsoever using Fetcher2, as even with 50 threads enabled and parsing disabled, I have 48 or 49 threads SpinWaiting, and 0 hosts in the queue. I do however have some 50,000 pages to fetch, the majority of which are from unique hosts. The regular fetcher works as expected, fetching concurrently from 50 hosts.

There is a configuration parameter limiting the concurrent fetches per unique host.

--
F4FQM
Kerunix Flan
Laurent Laborde
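For readers tuning the same settings, the options discussed in this thread live in nutch-site.xml, for example (values copied from the report above; whether they are appropriate depends entirely on your crawl and on politeness requirements):

```xml
<!-- nutch-site.xml: fetcher settings quoted in this thread -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>80</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>80</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>0</value>
</property>
<!-- used to force one URL per host per fetchlist when testing -->
<property>
  <name>generate.max.per.host</name>
  <value>1</value>
</property>
```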
Re: Fetcher2 Slow
Now that the soon to be released v1 uses Fetcher2 as the default (or as the only fetcher available?), I would think that this slowness problem that a number of users are facing might be addressed?

In short, the case for me is like this:

Nutch trunk revision 755143
JDK 1.6_12 on Linux
Crawl list consists of ~40,000 URLs from dmoz, so they are naturally well distributed among hosts (i.e. mostly unique hosts).

Config options:
fetcher.threads.fetch = 80
fetcher.threads.per.host = 80
fetcher.server.delay = 0

The result? Most of the time, something like this:

activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0

If I'm lucky, it might fetch around 1 page per second (or less). What I have noticed is that if I let it run for a while, cancel the fetch, and start it again from the beginning, it runs very quickly for a while before it slows right down to a trickle again. My guess is that the hosts that have been cached by my caching NS are fetched quickly, but new lookups are taking an age and slowing things down. However, I don't believe my NS is slow by any means. And furthermore, the old Fetcher1 never had this problem. Any ideas where to look to track this down?

Thanks,
Roger

--
From: "Roger Dunk"
Sent: Thursday, February 05, 2009 2:16 PM
Subject: Re: Fetcher2 Slow

It makes no difference if I set fetcher.threads.per.host to 1 or 100, which I assume is what you were suggesting? I also stated that the majority of pages to fetch were from unique hosts, so I believe the value of this parameter should not really come into play.

Cheers...
Roger

--
From: "Laurent Laborde"
Sent: Tuesday, February 03, 2009 5:51 PM
Subject: Re: Fetcher2 Slow

On Tue, Feb 3, 2009 at 4:10 AM, Roger Dunk wrote:

Hi all,

I'm having no luck whatsoever using Fetcher2, as even with 50 threads enabled and parsing disabled, I have 48 or 49 threads SpinWaiting, and 0 hosts in the queue. I do however have some 50,000 pages to fetch, the majority of which are from unique hosts.
The regular fetcher works as expected, fetching concurrently from 50 hosts.

There is a configuration parameter limiting the concurrent fetches per unique host.

--
F4FQM
Kerunix Flan
Laurent Laborde