remove fields
hi all, there are 4 document fields in my index that I am no longer indexing; I also have 4 new fields I need to add, so I created a new indexing filter. How can I add these new fields while preserving the removed fields in the existing docs? At the moment, when I run bin/index, all non-indexed fields get removed from the index.
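For context: new fields are normally added through an indexing filter, and bin/index rebuilds documents from the segments via those filters, so any field no filter emits anymore gets dropped — to keep the old fields, some filter has to keep producing them. A minimal, hypothetical sketch against the Nutch 1.x IndexingFilter extension point (the "source" field name and the metadata lookup are placeholders; the exact interface varies slightly between Nutch versions):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.parse.Parse;

  /** Hypothetical filter adding one custom field to every document. */
  public class CustomFieldsFilter implements IndexingFilter {

    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) throws IndexingException {
      // "source" is a placeholder; pull real values from the parse
      // metadata or wherever your data actually lives.
      String source = parse.getData().getMeta("source");
      if (source != null) {
        doc.add("source", source);
      }
      return doc;
    }

    // Some 1.x versions declare extra methods on the interface; add
    // no-op implementations if yours does.
    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }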
Re: 100 fetches per second?
I think in the end what Ken Krugler did with Bixo (limiting crawl time) and what Julien added in https://issues.apache.org/jira/browse/NUTCH-770 (plus https://issues.apache.org/jira/browse/NUTCH-769) are solutions to this problem, in addition to what Andrzej described below. Can you try https://issues.apache.org/jira/browse/NUTCH-770 and https://issues.apache.org/jira/browse/NUTCH-769?

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

----- Original Message -----
From: Andrzej Bialecki a...@getopt.org
To: nutch-user@lucene.apache.org
Sent: Wed, November 25, 2009 6:13:07 PM
Subject: Re: 100 fetches per second?

MilleBii wrote: I have to say that I'm still puzzled. Here is the latest. I just restarted a run and then, guess what: I got ultra-high speed, 8 Mbit/s sustained for 1 hour, where I could only get 3 Mbit/s max before (note: bits, not bytes, as I said before). A few samples show that I was running at 50 fetches/sec ... not bad. But why I got this high speed on this run, I haven't the faintest idea. Then it drops and I get this kind of log:

  2009-11-25 23:28:28,584 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516
  2009-11-25 23:28:29,227 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120
  2009-11-25 23:28:29,584 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516
  2009-11-25 23:28:30,227 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120
  2009-11-25 23:28:30,585 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516

I don't fully understand why it is oscillating between two queue sizes, but never mind; it is likely the end of the run, since Hadoop shows 99.99% complete for the 2 maps it generated. Would that be explained by a better URL mix?

I suspect that you have a bunch of hosts that slowly trickle the content, i.e. requests don't time out, crawl-delay is low, but the download speed is very, very low due to limits at their end (either physical or artificial). The solution in that case would be to track a minimum avg. speed per FetchQueue, and lock out the queue if this number crosses the threshold (similarly to what we do when we discover a crawl-delay that is too high). In the meantime, you could add the number of FetchQueues to that diagnostic output, to see how many unique hosts are in the current working set.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
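The lock-out Andrzej describes was not implemented at this point; a rough illustrative sketch of the bookkeeping it would need, with invented class and field names layered on top of the Fetcher's per-host queues, might look like this:

  // Hypothetical sketch of the suggestion: track average download speed
  // per host queue and lock the queue out when it drops below a
  // configurable threshold. Names are illustrative only.
  public class QueueSpeedTracker {

    private long bytesFetched = 0;     // total bytes fetched from this host
    private long fetchTimeMillis = 0;  // total time spent fetching
    private final long minBytesPerSec; // lock-out threshold

    public QueueSpeedTracker(long minBytesPerSec) {
      this.minBytesPerSec = minBytesPerSec;
    }

    /** Record one completed fetch from this queue. */
    public void record(long bytes, long elapsedMillis) {
      bytesFetched += bytes;
      fetchTimeMillis += elapsedMillis;
    }

    /** True if the observed average speed has fallen below the threshold. */
    public boolean shouldLockOut() {
      if (fetchTimeMillis < 10000) return false; // need enough samples first
      long avgBytesPerSec = bytesFetched * 1000 / fetchTimeMillis;
      return avgBytesPerSec < minBytesPerSec;
    }
  }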
Encoding the content got from Fetcher
Hej, I am a newbie with Nutch and I need some help with a problem, because I cannot find clear documentation. During the crawling process, when each FetcherThread gets the content, it is formatted in a way which deletes the newline characters (\n) and transforms useful Spanish characters such as á, é, í, ó, ú, ñ, ü into the default encoding, like: Ã?¡, Ã?³, Ã?Â, Ã?³, Ã?º, Ã?±, Ã?¼. I would like to know if it is possible to change this default encoding (is it UTF-8?) to the one that I need (ASCII, I guess). Thanks in advance ;)
Re: 100 fetches per second?
Yep, I will try right after this run ends... which is likely tomorrow, by the sound of it. Still, how come there is a factor 6+ difference from one run to the next... Hosts timing out and blocking the queue, maybe, but the probability of getting one in the queue cannot be so different from one run to the next.

--
-MilleBii-
Broken segments ?
Hello All, I was wondering if there is any way to check the integrity of a segment? As it stands, I can't create the index I want because a number of my segments are failing like below. Is there any way to check if my segments are OK? I guess I could always re-fetch them if need be. Regards, and thanks in advance :) Mischa

  java.io.IOException: Could not obtain block: blk_8431627671702898365_95075 file=/user/nutch/crawl/segments/20091012145602/crawl_generate/part-0
   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1707)
   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1535)
   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1662)
   at java.io.DataInputStream.readFully(DataInputStream.java:178)
   at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
   at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
   at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:166)
   at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:161)
   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
   at org.apache.hadoop.mapred.Child.main(Child.java:158)
  ...
  java.io.IOException: Could not obtain block: blk_7970643458650610887_21674 file=/user/nutch/crawl/segments/20090618111426/content/part-3/data
   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1707)
   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1535)
   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1662)
   at java.io.DataInputStream.readFully(DataInputStream.java:178)
   at java.io.DataInputStream.readFully(DataInputStream.java:152)
   at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
   at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428)
   at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417)
   at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412)
   at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat.getRecordReader(SegmentMerger.java:150)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
   at org.apache.hadoop.mapred.Child.main(Child.java:158)
Mischa Tuffield
Email: mischa.tuffi...@garlik.com
Homepage: http://mmt.me.uk/
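The "Could not obtain block" errors in those traces come from HDFS itself rather than from Nutch, so one way to check which segments are damaged is to ask HDFS about the underlying files. Assuming a standard Hadoop installation, something like the following (using the segments path from the trace) should flag files with missing or corrupt blocks:

  hadoop fsck /user/nutch/crawl/segments -files -blocks -locations

fsck only diagnoses, so segments that turn up damaged would indeed need to be re-fetched or restored from a copy.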
Re: Broken segments ?
Mischa Tuffield wrote: Hello All,

http://people.apache.org/~hossman/#threadhijack

When starting a new discussion on a mailing list, please do not reply to an existing message; instead, start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, so your question is hidden in that thread and gets less attention. It also makes following discussions in the mailing list archives particularly difficult.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Encoding the content got from Fetcher
Hi, have you tried changing this property: parser.character.encoding.default?
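For reference, that property is defined in conf/nutch-default.xml and can be overridden in conf/nutch-site.xml. A minimal override might look like this (the value shown is an example; the fallback only applies when a page does not declare an encoding of its own):

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
    <description>The character encoding to fall back to when no other
    information is available.</description>
  </property>

Note that ASCII cannot represent á, é, ñ, etc. at all, so UTF-8 or ISO-8859-1 is probably the value you actually want.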
add parse-wml plugin to Nutch!
hi, I have to add the parse-wml plugin to Nutch. If anyone has already done this, please give me some advice. Thanks!
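For what it's worth, a parser plugin only takes effect once it is listed in the plugin.includes property in conf/nutch-site.xml. Assuming a parse-wml plugin is built and sits under the plugins directory, enabling it would look roughly like this (the surrounding list is the stock default and may differ in your version):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|wml)|index-basic|query-(basic|site|url)</value>
  </property>

You would likely also want an entry in conf/parse-plugins.xml mapping the WML MIME type (text/vnd.wap.wml) to the plugin.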
Re: 100 fetches per second?
Interesting updates on the current run of 450K urls:
+ 30 minutes @ 3 Mbit/s
+ drop to 1 Mbit/s (1/x shape)
+ gradual improvement to 1.5 Mbit/s, then steady for 7 hours
+ sudden drop to 0.9 Mbit/s, steady for 4 hours
+ up to 1.7 Mbit/s for 1 hour
+ staircasing down to 0.5 Mbit/s in steps of 1 hour

I don't know what conclusion to draw, but it is quite strange to have those sudden variations in bandwidth, and overall it is very slow. I can post the graph if people are interested.

--
-MilleBii-
Re: Nutch near future - strategic directions
Andrzej Bialecki wrote:

Sami Siren wrote: Lots of good thoughts and ideas, easy to agree with. Something for the ease-of-use category: allow running on top of plain vanilla Hadoop.

What does "plain vanilla" mean here? Do you mean the current DB implementation? That's the idea; we should aim for an abstract layer that can accommodate both HBase and plain MapFiles.

I was simply trying to say that we should not bundle Hadoop with Nutch anymore, and instead just state the specific version it should run on top of as a requirement. I am not totally sure anymore if this is a good idea... I do not know the details of the HBase branch. Would using HBase allow us easy migration from one data model to another (without the complex code we now have in our datums)? How easy is HBase to manage/set up/configure? I think Avro looks promising as a data storage technology: it has some support for data model evolution, can be accessed natively from many programming languages, and performs relatively well... The downside at the moment is that it is not yet fully supported by Hadoop mapred (I think).

-split into reusable components with a nice and clean public API
-publish mvn artifacts so developers can directly use mvn, ivy etc. to pull the required dependencies for their specific crawler

+1, with a slight preference towards ivy.

I was not clear here; I think I was referring to users of Nutch rather than developers, and in that case the choice of tool would be up to the user once the artifacts are in the repo. Also, what I wanted to say is more about the model of how people who want to do some customization would operate, rather than a technology choice.

Creating a new plugin:
-create your own build configuration (or use a template we provide)
-implement the plugin code
-publish to an m2 repository

Creating your own custom crawler:
-create your own build configuration (or use a template we might provide), specifying the dependencies you need (plugins, basically, from Apache or from anybody else, as long as they are available through some repository)
-potentially write some custom code

We could also still provide a default Nutch crawler as a build configuration (basically just an XML file plus some config) if we wanted. The new Hadoop maven artifacts also help with this vision, since we could access the Hadoop APIs (and dependencies) through a similar mechanism.

My biggest concern is the execution of this (or any other) plan. Some of the changes or improvements that have been proposed are quite heavy in nature and would require large changes. I am just wondering whether it would be better to take a fresh start instead of trying to do this incrementally on top of the existing code base.

Well ... that's (almost) what Dogacan did with the HBase port. I agree that we should not feel too constrained by the existing code base, but it would be silly to throw everything away and start from scratch - we need to find a middle ground. The crawler-commons and Tika projects should help us get rid of the ballast and significantly reduce the size of our code.

I am not aiming to throw everything away, just trying to relax the back-compatibility burden and give innovation a chance. In the history of Nutch this approach is not new (remember MapReduce?), and in my opinion it worked nicely then. Perhaps it is different this time, since the changes we are discussing now have many abstract things hanging in the air, even fundamental ones. Nutch 0.7 to 0.8 reused a lot of the existing code; I am hoping that this time will not be different.

Of course the rewrite approach means that it will take some time before we get to the point where we can start adding real substance (meaning new features etc.).

So to summarize, I would go ahead and put together a branch Nutch N.0 that would consist of (a.k.a. my wish list; I hope I am not being too aggressive here):
-runs on top of plain Hadoop

See above - what do you mean by that?

-use OSGi (or some other, more optimal extension mechanism that fits and is easy to use)
-basic http/https crawling functionality (with db abstraction, or HBase directly, and smart data structures that allow flexible and efficient usage of the data)
-basic Solr integration for indexing/search
-basic parsing with Tika

After the basics are OK, we would start adding and promoting any of the hidden gems we might have, or some solutions for the interesting challenges.

I believe that's more or less where Dogacan's port is right now, except it's not merged with the OSGi port.

Are you sure OSGi is the way to go? I know it has all these nice features, but for some reason I feel that we could live with something simpler. From a functional point of view: just drop your jars into the classpath and you're all set. So two changes here: 1. plugins are jars; 2. no individual classloaders for plugins.

--
Sami Siren
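As a sanity check on the "plugins are jars, no individual classloaders" idea: plain Java already offers a lightweight mechanism along these lines in java.util.ServiceLoader. A rough sketch, purely illustrative (the ParserPlugin interface is invented here; Nutch's real extension points differ):

  import java.util.ServiceLoader;

  // Hypothetical plugin contract, not a real Nutch interface.
  interface ParserPlugin {
    String name();
    String parse(byte[] content);
  }

  public class PluginLoaderDemo {
    public static void main(String[] args) {
      // ServiceLoader discovers implementations listed in
      // META-INF/services/ParserPlugin inside any jar on the classpath,
      // so "drop your jars into the classpath" is all a user would do.
      ServiceLoader<ParserPlugin> plugins = ServiceLoader.load(ParserPlugin.class);
      for (ParserPlugin p : plugins) {
        System.out.println("found plugin: " + p.name());
      }
    }
  }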