remove fields

2009-11-26 Thread Fadzi Ushewokunze
hi all,

There are 4 document fields in my index that I am not indexing anymore.

Then I have 4 new fields I need to add to my index, so I created a new
indexing filter.

How can I add these new fields while preserving the removed fields in
the existing docs?

At the moment, when I run bin/index, all non-indexed fields get removed
from the index.
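
For reference, a minimal sketch of an indexing filter that adds new fields,
assuming the Nutch 1.x IndexingFilter interface; the field and metadata names
below are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class NewFieldsIndexingFilter implements IndexingFilter {
  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Copy a parse metadata value into a new index field (hypothetical names).
    String value = parse.getData().getMeta("my-meta-key");
    if (value != null) {
      doc.add("myNewField", value);
    }
    return doc;
  }

  // No-op here; this hook is present in some 1.x versions of the interface.
  public void addIndexBackendOptions(Configuration conf) { }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

Note that bin/index rebuilds the index from the segments, so a field that no
filter emits anymore simply will not appear in the new index; as far as I
know, keeping the old fields means either still emitting them from a filter
or merging the new index with the old one.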






Re: 100 fetches per second?

2009-11-26 Thread Otis Gospodnetic
I think in the end what Ken Krugler did with Bixo (limiting crawl time) and 
what Julien added in https://issues.apache.org/jira/browse/NUTCH-770 (plus 
https://issues.apache.org/jira/browse/NUTCH-769) are solutions to this problem, 
in addition to what Andrzej described below.

Can you try https://issues.apache.org/jira/browse/NUTCH-770 and 
https://issues.apache.org/jira/browse/NUTCH-769 ?

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Andrzej Bialecki a...@getopt.org
 To: nutch-user@lucene.apache.org
 Sent: Wed, November 25, 2009 6:13:07 PM
 Subject: Re: 100 fetches per second?
 
 MilleBii wrote:
 I have to say that I'm still puzzled. Here is the latest: I just restarted a
 run and then, guess what:
 
 I got ultra-high speed: 8 Mbit/s sustained for 1 hour, where I could only get
 3 Mbit/s max before (note: bits, not bytes, as I said before).
 A few samples show that I was running at 50 fetches/sec ... not bad. But why
 I got this high speed on this run, I haven't the faintest idea.
  
  
  Then it drops and I get this kind of log:
  
  2009-11-25 23:28:28,584 INFO  fetcher.Fetcher - -activeThreads=100,
  spinWaiting=100, fetchQueues.totalSize=516
  2009-11-25 23:28:29,227 INFO  fetcher.Fetcher - -activeThreads=100,
  spinWaiting=100, fetchQueues.totalSize=120
  2009-11-25 23:28:29,584 INFO  fetcher.Fetcher - -activeThreads=100,
  spinWaiting=100, fetchQueues.totalSize=516
  2009-11-25 23:28:30,227 INFO  fetcher.Fetcher - -activeThreads=100,
  spinWaiting=100, fetchQueues.totalSize=120
  2009-11-25 23:28:30,585 INFO  fetcher.Fetcher - -activeThreads=100,
  spinWaiting=100, fetchQueues.totalSize=516
  
  I don't fully understand why it is oscillating between two queue sizes, but
  never mind; it is likely the end of the run, since Hadoop shows 99.99%
  complete for the 2 maps it generated.
  
  Would that be explained by a better URL mix?
 
 I suspect that you have a bunch of hosts that slowly trickle the content, i.e.
 requests don't time out, crawl-delay is low, but the download speed is very,
 very low due to the limits at their end (either physical or artificial).
 
 The solution in that case would be to track a minimum avg. speed per
 FetchQueue, and lock out the queue if this number falls below the threshold
 (similar to what we do when we discover a crawl-delay that is too high).
 
 In the meantime, you could add the number of FetchQueue-s to that diagnostic 
 output, to see how many unique hosts are in the current working set.
 
 --
 Best regards,
 Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
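
A rough sketch of the per-queue speed throttle described above, for
illustration only; the class and member names are assumptions, not the actual
Fetcher internals:

class QueueSpeedTracker {
  // Running totals for one host's FetchQueue (hypothetical names).
  private long bytesFetched = 0;
  private long fetchTimeMs = 0;
  private boolean lockedOut = false;

  // Record one completed fetch from this queue.
  void record(long bytes, long elapsedMs) {
    bytesFetched += bytes;
    fetchTimeMs += elapsedMs;
  }

  // Lock the queue out once its average speed falls below minBytesPerSec,
  // analogous to the existing lockout for an excessive crawl-delay.
  boolean checkLockout(long minBytesPerSec, long minSampleMs) {
    if (fetchTimeMs < minSampleMs) {
      return false; // not enough data yet to judge this host
    }
    long avgBytesPerSec = bytesFetched * 1000 / fetchTimeMs;
    if (avgBytesPerSec < minBytesPerSec) {
      lockedOut = true;
    }
    return lockedOut;
  }
}

Logging the number of such queues next to activeThreads/spinWaiting would
also give the unique-host count mentioned above.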



Encoding the content got from Fetcher

2009-11-26 Thread Santiago Pérez

Hej,

I am a newbie with Nutch and I need some help with a problem, because I cannot
find clear documentation.

During the crawling process, when each FetcherThread gets the content, it is
formatted in a way that deletes the newline characters (\n) and transforms
useful Spanish characters such as á, é, í, ó, ú, ñ, ü into the default
encoding, like: �¡, �³, �­, �³, �º, �±, �¼.

I would like to know if it is possible to change this default encoding (is it
UTF-8?) to the one that I need (ASCII, I guess).

Thanks in advance ;)
-- 
View this message in context: 
http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26528468.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: 100 fetches per second?

2009-11-26 Thread MilleBii
Yep, I will try right after this run ends... which is likely tomorrow,
by the sound of it.

Still, how come there is a factor 6+ difference from one run to the
next... Hosts timing out and blocking the queue, maybe, but the probability
of getting one in the queue cannot be so different from one run to the next.


-- 
-MilleBii-


Broken segments ?

2009-11-26 Thread Mischa Tuffield
Hello All, 

I was wondering if there is any way to check the integrity of a segment? As it
stands, I can't create the index I want, because a number of my segments are
freaking out like below.

Is there any way to check if my segments are OK? I guess I could always
re-fetch them if need be.

Regards, and thanks in advance :)

Mischa


java.io.IOException: Could not obtain block: blk_8431627671702898365_95075 
file=/user/nutch/crawl/segments/20091012145602/crawl_generate/part-0
at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1707)
at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1535)
at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1662)
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at 
org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at 
org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
at 
org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:166)
at 
org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:161)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.Child.main(Child.java:158)

...

java.io.IOException: Could not obtain block: blk_7970643458650610887_21674 
file=/user/nutch/crawl/segments/20090618111426/content/part-3/data
at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1707)
at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1535)
at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1662)
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at java.io.DataInputStream.readFully(DataInputStream.java:152)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412)
at 
org.apache.nutch.segment.SegmentMerger$ObjectInputFormat.getRecordReader(SegmentMerger.java:150)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
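
The "Could not obtain block" errors above point to missing or corrupt HDFS
blocks rather than anything Nutch-specific. One way to check, sketched under
the assumption of a standard Hadoop/Nutch install (the paths come from the
traces above):

# Check HDFS block health under the segments directory (standard Hadoop tool).
bin/hadoop fsck /user/nutch/crawl/segments -files -blocks

# Try listing a suspect segment; a broken one should fail here as well.
bin/nutch readseg -list crawl/segments/20091012145602

If fsck reports missing blocks, re-fetching those segments is probably the
only fix, as Mischa suggests above.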


On 26 Nov 2009, at 12:03, Santiago Pérez wrote:

 

___
Mischa Tuffield
Email: mischa.tuffi...@garlik.com
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD



Re: Broken segments ?

2009-11-26 Thread Andrzej Bialecki

Mischa Tuffield wrote:

Hello All,


http://people.apache.org/~hossman/#threadhijack

When starting a new discussion on a mailing list, please do not reply 
to an existing message; instead, start a fresh email. Even if you change 
the subject line of your email, other mail headers still track which 
thread you replied to, so your question is hidden in that thread and 
gets less attention. It also makes following discussions in the mailing 
list archives particularly difficult.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Encoding the content got from Fetcher

2009-11-26 Thread fadzi
hi,

Have you tried changing this property:

parser.character.encoding.default
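
For what it's worth, that property can be overridden in conf/nutch-site.xml; a
sketch (the value shown is an assumption; set it to whatever charset your
content is really in):

<!-- Fallback charset used when a page does not declare its encoding.
     The utf-8 value below is only a guess. -->
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
</property>

Note that plain ASCII cannot represent á, é, ñ, etc., so UTF-8 (or ISO-8859-1)
is probably what you actually want.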










add parse-wml plugin to Nutch!

2009-11-26 Thread yangfeng
hi,
  I have to add a parse-wml plugin to Nutch. If this has already been done,
please give me some advice.

   Thanks!


Re: 100 fetches per second?

2009-11-26 Thread MilleBii
Interesting updates on the current run of 450K URLs:
+ 30 minutes @ 3 Mbit/s
+ drop to 1 Mbit/s (1/x shape)
+ gradual improvement to 1.5 Mbit/s, then steady for 7 hours
+ sudden drop to 0.9 Mbit/s, then steady for 4 hours
+ up to 1.7 Mbit/s for 1 hour
+ staircasing down to 0.5 Mbit/s in steps of 1 hour

I don't know what conclusion to draw, but it is quite strange to have
those sudden variations in bandwidth, and overall it is very slow.
I can post the graph if people are interested.






-- 
-MilleBii-


Re: Nutch near future - strategic directions

2009-11-26 Thread Sami Siren

Andrzej Bialecki wrote:

Sami Siren wrote:

Lots of good thoughts and ideas, easy to agree with.

Something for the ease of use category:
-allow running on top of plain vanilla hadoop


What does "plain vanilla" mean here? Do you mean the current DB 
implementation? That's the idea; we should aim for an abstract layer 
that can accommodate both HBase and plain MapFile-s.


I was simply trying to say that we should not bundle Hadoop anymore with 
Nutch and instead just mention the specific version it should run on top 
of as a requirement. I am not totally sure anymore if this is a good idea...


I do not know the details of the HBase branch. Would using HBase allow us 
easy migration from one data model to another (without the complex code we 
now have in our datums)? How easy is HBase to manage/set up/configure?


I think Avro looks promising as a data storage technology: it has some 
support for data model evolution, can be accessed natively from many 
programming languages, and performs relatively well... The downside at 
the moment is that it is not yet fully supported by Hadoop mapred (I think).



-split into reusable components with nice and clean public api
-publish mvn artifacts so developers can directly use mvn, ivy etc to 
pull required dependencies for their specific crawler


+1, with slight preference towards ivy.


I was not clear here; I think I was referring to users of Nutch rather 
than developers. In that case the choice of tool would be up to the 
user, once the artifacts are in the repo.


Also, I think what I wanted to say is more about the model of how people 
who want to do some customization would operate, rather than about a 
technology choice.


Creating new plugin:
-create your own build configuration (or use a template we provide)
-implement plugin code
-publish to m2 repository

Creating your custom crawler:
-create your own build configuration (or use a template we might 
provide), specifying the dependencies you need (plugins basically, from 
Apache or from anybody else, as long as they are available through some 
repository)

-potentially write some custom code

We could still provide a default Nutch crawler as a build 
configuration (basically just an XML file + some config) if we wanted.
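
For context, a rough sketch of the kind of plugin descriptor such a template
might generate; the plugin.xml layout follows the current Nutch convention,
but every id, name, and class below is made up:

<plugin id="index-myfields" name="My Fields Indexing Filter"
        version="1.0.0" provider-name="example.org">
  <runtime>
    <!-- The plugin's jar, produced by the user's own build configuration. -->
    <library name="index-myfields.jar">
      <export name="*"/>
    </library>
  </runtime>
  <!-- Hook into the indexing-filter extension point. -->
  <extension id="org.example.index.myfields"
             name="My Fields Indexing Filter"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="MyFieldsIndexingFilter"
                    class="org.example.index.MyFieldsIndexingFilter"/>
  </extension>
</plugin>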


The new Hadoop maven artifacts also help with this vision since we could 
also access hadoop apis (and dependencies) through similar mechanism.



My biggest concern is in execution of this (or any other) plan.
Some of the changes or improvements that have been proposed are quite 
heavy in nature and would require large changes. I am just wondering 
whether it would be better to take a fresh start instead of trying 
to do this incrementally on top of the existing code base.


Well ... that's (almost) what Dogacan did with the HBase port. I agree 
that we should not feel too constrained by the existing code base, but 
it would be silly to throw everything away and start from scratch - we 
need to find a middle ground. The crawler-commons and Tika projects 
should help us to get rid of the ballast and significantly reduce the 
size of our code.


I am not aiming to throw everything away, just trying to relax the 
backward-compatibility burden and give innovation a chance.


In the history of Nutch this approach is not something new (remember 
map reduce?) and in my opinion it worked nicely then. Perhaps it is 
different this time since the changes we are discussing now have many 
abstract things hanging in the air, even fundamental ones.


Nutch 0.7 to 0.8 reused a lot of the existing code.


I am hoping that this time it will not be different.



Of course, the rewrite approach means that it will take some time 
before we actually get to the point where we can start adding real 
substance (meaning new features etc.).


So to summarize, I would go ahead and put together a branch "nutch 
N.0" that would consist of (a.k.a. my wish list; hope I am not being 
too aggressive here):


-runs on top of plain hadoop


See above - what do you mean by that?

-use osgi (or some other more optimal extension mechanism that fits 
and is easy to use)
-basic http/https crawling functionality (with db abstraction or 
hbase directly and smart data structures that allow flexible and 
efficient usage of the data)

-basic solr integration for indexing/search
-basic parsing with tika

After the basics are OK, we would start adding and promoting any of the 
hidden gems we might have, or some solutions for the interesting 
challenges.


I believe that's more or less where Dogacan's port is right now, except 
it's not merged with the OSGI port.


Are you sure OSGi is the way to go? I know it has all these nice 
features, but for some reason I feel that we could live with 
something simpler. From a functional point of view: just drop your jars into 
the classpath and you're all set. So two changes here: 1. plugins are jars; 
2. no individual classloaders for plugins.


--
 Sami Siren