Re: crawling site without www
(JobClient.java:772)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
ParseSegment: starting at 2012-08-07 16:01:35
ParseSegment: segment: crawl/crawldb/segments/20120807160035
Exception in thread "main" java.io.IOException: Segment already parsed!
    at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
CrawlDb update: starting at 2012-08-07 16:01:36
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
LinkDb: starting at 2012-08-07 16:01:37
LinkDb: linkdb: crawl/crawldb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/data/nutch/crawl/crawldb/segments/20120807160035
LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02

But when seed.txt has www.test.com instead of test.com, the second launch of the crawler script finds the next segment for fetching.

On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga mathijs.hommi...@kalooga.com wrote:
What do you mean exactly with "it fails in the fetch phase"? Do you get an error? Does test.com exist? Does it perhaps redirect to www.test.com? ... Mathijs

On Aug 4, 2012, at 17:11 , Alexei Korolev alexei.koro...@gmail.com wrote:
yes

On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:
http:// ? hth

On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev alexei.koro...@gmail.com wrote:
Hello,
I have a small script:
$NUTCH_PATH inject crawl/crawldb seed.txt
$NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
s1=`ls -d crawl/crawldb/segments/* | tail -1`
$NUTCH_PATH fetch $s1
$NUTCH_PATH parse $s1
$NUTCH_PATH updatedb crawl/crawldb $s1
In seed.txt I have just one site, for example test.com. When I start the script it fails in the fetch phase. If I change test.com to www.test.com it works fine. The reason seems to be that the outgoing links on test.com all have the www. prefix. What do I need to change in the Nutch config to make it work with test.com? Thank you in advance. I hope my explanation is clear :)
-- Alexei A. Korolev
-- Lewis
-- Alexei A. Korolev
-- Alexei A. Korolev
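One way to check whether the www.test.com outlinks ever made it into the crawldb is the CrawlDbReader tool. A minimal sketch, assuming a Nutch 1.x layout and the crawl/crawldb path used above (the grep pattern and dump directory are only illustrative):

  # per-status counts (db_unfetched, db_fetched, ...)
  bin/nutch readdb crawl/crawldb -stats

  # dump the crawldb as text and look for the www host
  bin/nutch readdb crawl/crawldb -dump crawldb-dump
  grep 'www.test.com' crawldb-dump/part-*

If the www URLs are present but stay db_unfetched, the problem is in the generate/fetch steps; if they are missing entirely, the outlinks are being filtered or dropped before the update step.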
Re: crawling site without www
What do you mean exactly with "it fails in the fetch phase"? Do you get an error? Does test.com exist? Does it perhaps redirect to www.test.com? ... Mathijs

On Aug 4, 2012, at 17:11 , Alexei Korolev alexei.koro...@gmail.com wrote:
yes

On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:
http:// ? hth

On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev alexei.koro...@gmail.com wrote:
Hello,
I have a small script:
$NUTCH_PATH inject crawl/crawldb seed.txt
$NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
s1=`ls -d crawl/crawldb/segments/* | tail -1`
$NUTCH_PATH fetch $s1
$NUTCH_PATH parse $s1
$NUTCH_PATH updatedb crawl/crawldb $s1
In seed.txt I have just one site, for example test.com. When I start the script it fails in the fetch phase. If I change test.com to www.test.com it works fine. The reason seems to be that the outgoing links on test.com all have the www. prefix. What do I need to change in the Nutch config to make it work with test.com? Thank you in advance. I hope my explanation is clear :)
-- Alexei A. Korolev
-- Lewis
-- Alexei A. Korolev
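One setting worth checking in this situation (an assumption, not something confirmed in the thread): if db.ignore.external.links is set to true, outlinks from test.com to www.test.com are treated as links to a different host and dropped at update time, so a seed without www never yields the www pages. A sketch of the relevant nutch-site.xml entry:

  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
    <description>If true, outlinks that point to a different host are ignored;
    with a seed of test.com this would also drop links to www.test.com.</description>
  </property>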
Re: Is it possible to know how long it takes to download an amount of data with nutch.
What version of Nutch are you using?

On Aug 4, 2012, at 5:36 , isidro isidr...@gmail.com wrote:
Hi,
Where can I get the content size and the fetch times for each fetched file?
Isidro

On Thu, Aug 2, 2012 at 11:49 PM, Mathijs Homminga wrote:
Hi,
Unless you monitor the counters of a job while it's running: no. However, you could, in theory, replay the fetch/download by looking at the fetch times, sum the content sizes and see when the total hits the 5 GB. But you have to write your own tool for that.
Mathijs

On 3 aug. 2012, at 03:13, isidro wrote:
Hi,
Is it possible to know how long it takes to download an amount of data with Nutch? I want to know how long it took to download the first 5 GB.
Isidro
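The "replay" Mathijs describes can be roughed out from a segment dump. A sketch, assuming a Nutch 1.x SegmentReader and that the protocol metadata in the dump carries a Content-Length value (the segment path, header names and dump layout vary, so treat this as an approximation only):

  SEGMENT=crawl/segments/20120801000000   # illustrative path
  # dump one segment as text, skipping the parse output
  bin/nutch readseg -dump $SEGMENT segdump -noparse -noparsedata -noparsetext

  # sum the Content-Length values seen in the dump
  awk 'match($0, /Content-Length[=: ]+[0-9]+/) {
         n = substr($0, RSTART, RLENGTH); gsub(/[^0-9]/, "", n); total += n }
       END { printf "bytes seen in this segment: %d\n", total }' segdump/dump

Repeating this per segment in fetch order and noting where the running total passes 5 GB gives the elapsed time between the first and last fetch of that range.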
Re: Nutch 2.0 Solr 4.0 Alpha
Hi, Looking at the code, it looks like your batchId is null. Not sure how that can happen (since the SolrIndexerJob does check arguments). Have you tried to call the SolrIndexerJob alone (outside the Crawler tool)? Please do so and post commandline / nutch config / logs. Cheers, Mathijs On Jul 29, 2012, at 20:33 , X3C TECH t...@x3chaos.com wrote: Hi Lewis, Thanks for below, I just ran it on the new schema. Funny thing is in Solr's example/logs directory there are no files at all, so I'm wondering if Nutch is even hitting Solr. I ran a new crawl, and I'm now getting a Null Pointer at indexing point I assume (right after parse). This is the hadoop log dump (end of it, as the whole dump is large) 2012-07-29 07:25:05,118 INFO parse.ParserJob - Skipping http://wiki.apache.org/nutch/WordIndex; different batch id 2012-07-29 07:25:06,692 WARN mapred.FileOutputCommitter - Output path is null in cleanup 2012-07-29 07:25:11,689 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 1 2012-07-29 07:25:16,747 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 1 2012-07-29 07:25:16,754 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule 2012-07-29 07:25:16,755 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000 2012-07-29 07:25:16,755 INFO crawl.AbstractFetchSchedule - maxInterval=7776000 2012-07-29 07:25:19,243 WARN mapred.FileOutputCommitter - Output path is null in cleanup This is the exception Exception in thread main java.lang.NullPointerException at java.util.Hashtable.put(Hashtable.java:432) at java.util.Properties.setProperty(Properties.java:161) at org.apache.hadoop.conf.Configuration.set(Configuration.java:438) at org.apache.nutch.indexer.IndexerJob.createIndexJob(IndexerJob.java:128) at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:53) at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68) at org.apache.nutch.crawl.Crawler.run(Crawler.java:192) at org.apache.nutch.crawl.Crawler.run(Crawler.java:250) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.Crawler.main(Crawler.java:257) Hbase seems to have recorded the parsed content Could it be the SolrJ version? I'm using the default one that I got with the 2.0 release On Sun, Jul 29, 2012 at 11:04 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Iggy, We usually start with asking what the log from your solr server is saying? Additionally, did you know there is another schema for Nutch which 'should' work with Solr 4. Please see below http://svn.apache.org/repos/asf/nutch/branches/2.x/conf/schema-solr4.xml Lewis On Sun, Jul 29, 2012 at 6:53 PM, X3C TECH t...@x3chaos.com wrote: Hello, Has anyone been successful in hooking up Nutch 2 with Solr4? I seem to have my config screwed up somehow. I've added the Nutch fields to Solr's example schema and changed the field type from text' to text_general However when I index, I get the message SolrIndexerJob:starting SolrIndexerJob:Done but nothing has been indexed. Hadoop log shows no errors, neither does Solr terminal window. I even tried installing Solr 3.6.1 and copying the schema file as is, with no luck, same issue. Does something need to be adjusted in Nutch config? I made no adjustment when I built it, so it's stock beyond adjustments to hook up Hbase listed in tutorial. Your help is highly appreciated, as I'm really boggled by this!! Iggy -- Lewis
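To try the indexer outside the Crawler tool, as suggested above, the stand-alone job can be invoked directly. A sketch for Nutch 2.0 (the Solr URL is illustrative; depending on the exact 2.x revision the job takes a concrete batch id or the -all / -reindex flags):

  # index a specific batch
  bin/nutch solrindex http://localhost:8983/solr/ <batchId>

  # or, on revisions that support it, index everything that has been parsed
  bin/nutch solrindex http://localhost:8983/solr/ -all

If this works while the Crawler tool does not, the null batchId is being introduced inside the Crawler's argument handling rather than in the indexer itself.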
Re: Nutch output to Solr
Hi Jim,
I believe indexing is not part of the default crawl loop/process. You have to call the indexing job separately.
Mathijs Homminga

On Jul 12, 2012, at 17:36, Jim Chandler jamescchand...@gmail.com wrote:
Would anyone know why, when I'm doing my crawling, I don't get any output from Nutch to Solr? I am not getting any errors. I have used the individual Nutch commands to crawl. But when all is said and done I only see segment and digest fields.
Thanks,
Jim
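For a 1.x command-line crawl, the separate indexing call Mathijs refers to looks roughly like this (paths assume the usual crawl/ layout and are only illustrative; newer 1.x releases replace the positional linkdb argument with -linkdb and -dir flags):

  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Nothing reaches Solr until this job has run.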
Re: Anyone using the 2.X REST API to retrieve crawl results as JSON
Hi Julian, Just to share our experiences with using Nutch 2.0: Indexing in Nutch actually has nothing to do with indexing itself. It just selects some fields from a WebPage, does some very minimal processing (both typically in the indexing filter plugins) and sends the result to a writer. What I notice is that we tend to develop IndexingFilter/IndexingWriter combinations for exporting/pushing data to different external systems (Solr, elasticsearch,...) because not only do these systems use different format/interface (handled by IndexingWriter) but also may support different use cases, and thus may require different fields (done in IndexingFilter). Since indexing is the obvious use case here, I can understand the naming of this process, but again, the data can be pushed anywhere. Currently, we need to call a different IndexingJob (which uses a different Writer) and change the NutchConfiguration (to include the right IndexingFilters) to push data to another sink. I would be great if Nutch could support different target systems with one configuration. Mathijs On Jul 11, 2012, at 13:34 , Julien Nioche wrote: Hi Lewis I realise I was thinking about NUTCH-880, not NUTCH-932 which is indeed about retrieving crawl results as JSON From my own pov it appears that Nutch 2.X is 'closer' to the model required for a multiple backends implementation although there is still quite a bit of work to do here. backend for crawl storage != target of the exporter/indexer What I am slightly confused about, which hasn't been mentioned on this particular issue is whether individual Gora modules would make up part of the stack or whether the abstraction would somehow be written @Nutch side... of course this then gets a bit more tricky when we begin thinking about current 1.X and how to progress with a suitable long term vision. this is definitely on the Nutch side and applies in the same way for 1.x and 2.x. Think about it as a pluggable indexer : regardless of what backend is used for storing the crawl table you might want to send some of the content (with possible transformations) to e.g. SOLR, ElasticSearch, a text file, a database etc... At the moment we are limited to SOLR - which is OK as most people use Nutch for indexing / searching but the point is that we should have more flexibility. I have used the terms 'pluggable indexer' before as well as 'pluggable exporter' I suppose the difference is whether we take care of finding which URLs should be deleted (indexer) or just dump a snapshot of the content (exporter). See comments on https://issues.apache.org/jira/browse/NUTCH-1047 On Wed, Jul 11, 2012 at 8:54 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: I'd think that this would be more a case for the universal exporter (a.k.a multiple indexing backends) that we mentioned several times. The REST API is more a way of piloting a crawl remotely. It could certainly be twisted into doing all sorts of things but I am not sure it would be very practical when dealing with very large data. Instead having a pluggable exporter would allow you to define what backend you want to send the data to and what transformations to do on the way (e.g. convert to JSON). Alternatively a good old custom map reduce job based is the way to go. HTH Jul On 10 July 2012 22:42, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, I am looking to create a dataset for use in an example scenario where I want to create all the products you would typically find in the online Amazon store e.g. 
loads of products with different categories, different prices, titles, availability, condition etc etc etc. One way I was thinking of doing this was using the above API written into Nutch 2.X to get the results as JSON these could then hopefully be loaded into my product table in my datastore and we could begin to build up the database of products. Having never used the REST API directly I wonder if anyone has any information on this and whether I can obtain some direction relating to producing my crawl results as JSON. I'm also going to look into Andrzej's patch in NUTCH-932 also so I'll try to update this thread once I make some progress with it. Thanks in advance for any sharing of experiences with this one. Best Lewis -- Lewis -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble -- Lewis -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
Re: Problema with NullPointerException on custom Parser
Hi Jorge, I can indeed reproduce your problem using your code. After some debugging... You have to add a contentType to your implementation in plugin.xml: implementation id=ImageThumbnailParser class=...ImageThumbnailParserparameter name=contentType value=image/png//implementation Good luck! Send from my iphone, Mathijs Homminga On Jun 28, 2012, at 0:12, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Of course Mathijs, thank you for the time and the replies, here goes my parse-plugins.xml (as an attachment). Greetings! - Mensaje original - De: Mathijs Homminga mathijs.hommi...@kalooga.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 17:44:43 Asunto: Re: Problema with NullPointerException on custom Parser Hmmm looking at the ParserFactory code, there can actually be several causes for a NullPointerException... Can you also send the parse-plugins.xml? Mathijs Homminga On Jun 27, 2012, at 23:23, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: This is the content of my plugin.xml plugin id=image-thumbnail name=Image thumbnailer for Orion version=1.0.0 provider-name=nutch.org runtime library name=image-thumbnail.jar export name=*/ /library /runtime requires import plugin=nutch-extensionpoints/ /requires extension id=org.apache.nutch.parse.thumbnail.ImageThumbnailParser name=Image thumbnailer parser point=org.apache.nutch.parse.Parser implementation id=ImageThumbnailParser class=org.apache.nutch.parse.thumbnail.ImageThumbnailParser/ /extension extension id=org.apache.nutch.parse.thumbnail.ImageThumbnailIndexingFilter name=Image thumbnail indexing filter point=org.apache.nutch.indexer.IndexingFilter implementation id=ImageThumbnailIndexingFilter class=org.apache.nutch.parse.thumbnail.ImageThumbnailIndexingFilter/ /extension /plugin - Mensaje original - De: Mathijs Homminga mathijs.hommi...@kalooga.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 17:17:12 Asunto: Re: Problema with NullPointerException on custom Parser No need for Tika. Can you send your plugin.xml? Mathijs Homminga On Jun 27, 2012, at 23:07, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi, I agree with you, and is a genius idea rely on Tika to parse the files, but in this particular case when all I want to do is encode the content into base64 should I wrote a custom parser to tika and rely on the parser-tika plugin to do its magic? Jorge - Mensaje original - De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 16:55:12 Asunto: Re: Problema with NullPointerException on custom Parser Hi, I think you are partly correct. The core Nutch code itself doesn't do any parsing as such. All parsing is relied upon by external parsing libraries. Basically we need to define a parser to do the parsing, using Tika as a wrapper for mimeType detection and subsequent parsing saves us a bit of overhead. Lewis On Wed, Jun 27, 2012 at 9:44 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi Lewis, thank you for the reply. Is mandatory wrote a wrap around Tika? I thought this was optional since I really don't parse the content searching for nothing, I only get the content, transform it into an Image object, resize it, and then I encode with base64 to store on the solr backend. So I thought that all this processing could be done getParse method. Is my assumption correct or is mandatory to write my desired logic using Tika? 
- Mensaje original - De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 16:33:01 Asunto: Re: Problema with NullPointerException on custom Parser Hi Jorge, It doesn't look like your actually using Tika as a wrapper for your custom parser at all... You would be need to specify the correct Tika config by calling tikaConfig.getParser hth On Wed, Jun 27, 2012 at 7:46 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi all: I'm working on a custom parser plugin to generate thumbnails from images fetched with nutch 1.4. I'm doing this because the humbnails will be converted into a base64 encoded string and stored on a Solr backend. So I basically wrote a custom parser (to which I send all png images, for example). I enable the plugin (image-thumbnail) in the nutch-site.xml, set some custom properties to load the width and height of the thumbnail. Also set the alias on the parse-plugins.xml and set the plugin to handle the image/png files, also in this file. the plugin is being loaded, but every time I get a png image to parse I get this: Error parsing: http
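The markup in the plugin.xml fragment quoted at the top of this message was stripped by the archive; restored, the implementation element with the contentType parameter would look roughly like this (class name and id taken from the thread):

  <implementation id="ImageThumbnailParser"
                  class="org.apache.nutch.parse.thumbnail.ImageThumbnailParser">
    <parameter name="contentType" value="image/png"/>
  </implementation>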
Re: Problema with NullPointerException on custom Parser
You can use: image/(bmp|gif|jpeg|png|tiff) in your plugin.xml, this will cover all/most images. On Jun 28, 2012, at 19:40 , Jorge Luis Betancourt Gonzalez wrote: Hi Julien! Thank you for your explanation I realize that Tika indeed does a mimetype detection. I just was searching a way to ensure that in the plugin I'm developing only do the processing with images, this is just as a fail safe in case that some wrong configuration its done in conf/parse-plugins.xml. I'm thinking now that perhaps I can read the image types allowed to fetch in the nutch configuration and use this as a filter, you think this would be possible? - Mensaje original - De: Julien Nioche lists.digitalpeb...@gmail.com Para: user@nutch.apache.org Enviados: Jueves, 28 de Junio 2012 12:37:29 Asunto: Re: Problema with NullPointerException on custom Parser Guys Just to make sure there is no misunderstanding : the detection of the MimeType is done BEFORE the parsing step and it is what allows the parsing step to determine which parser to use. The mimetype detection uses Tika *and * there is a universal parser which is parse-tika (a.k.a the Tika Wrapper). These are two different things and you don't need to use parse-tika and can rely on other plugins. Now what you can do is to write a Parser and associate it with the mime-types of your choice : see conf/parse-plugins.xml and how to override parse-tika for a given mimetype. Another approach is do implement a HtmlParseFilter which will be called by parse-tika (assuming it is activated) from where you can access the Content and store the base64 in the parse-metadata (which you can index with the plugin index-metadata) HTH Julien On 27 June 2012 22:07, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cuwrote: Hi, I agree with you, and is a genius idea rely on Tika to parse the files, but in this particular case when all I want to do is encode the content into base64 should I wrote a custom parser to tika and rely on the parser-tika plugin to do its magic? Jorge - Mensaje original - De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 16:55:12 Asunto: Re: Problema with NullPointerException on custom Parser Hi, I think you are partly correct. The core Nutch code itself doesn't do any parsing as such. All parsing is relied upon by external parsing libraries. Basically we need to define a parser to do the parsing, using Tika as a wrapper for mimeType detection and subsequent parsing saves us a bit of overhead. Lewis On Wed, Jun 27, 2012 at 9:44 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi Lewis, thank you for the reply. Is mandatory wrote a wrap around Tika? I thought this was optional since I really don't parse the content searching for nothing, I only get the content, transform it into an Image object, resize it, and then I encode with base64 to store on the solr backend. So I thought that all this processing could be done getParse method. Is my assumption correct or is mandatory to write my desired logic using Tika? - Mensaje original - De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 16:33:01 Asunto: Re: Problema with NullPointerException on custom Parser Hi Jorge, It doesn't look like your actually using Tika as a wrapper for your custom parser at all... 
You would be need to specify the correct Tika config by calling tikaConfig.getParser hth On Wed, Jun 27, 2012 at 7:46 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi all: I'm working on a custom parser plugin to generate thumbnails from images fetched with nutch 1.4. I'm doing this because the humbnails will be converted into a base64 encoded string and stored on a Solr backend. So I basically wrote a custom parser (to which I send all png images, for example). I enable the plugin (image-thumbnail) in the nutch-site.xml, set some custom properties to load the width and height of the thumbnail. Also set the alias on the parse-plugins.xml and set the plugin to handle the image/png files, also in this file. the plugin is being loaded, but every time I get a png image to parse I get this: Error parsing: http://localhost/sites/all/themes/octavitos/images/iconos/audiointernet.png: java.lang.NullPointerException at org.apache.nutch.parse.ParserFactory.match(ParserFactory.java:388) at org.apache.nutch.parse.ParserFactory.getExtension(ParserFactory.java:397) at org.apache.nutch.parse.ParserFactory.matchExtensions(ParserFactory.java:296) at org.apache.nutch.parse.ParserFactory.findExtensions(ParserFactory.java:262) at org.apache.nutch.parse.ParserFactory.getExtensions(ParserFactory.java:234) at
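For completeness, the mime-type mapping itself lives in conf/parse-plugins.xml. A sketch using the plugin id and extension id from this thread (the alias name must match the plugin id referenced in the mimeType entry):

  <mimeType name="image/png">
    <plugin id="image-thumbnail" />
  </mimeType>

  <aliases>
    <alias name="image-thumbnail"
           extension-id="org.apache.nutch.parse.thumbnail.ImageThumbnailParser" />
  </aliases>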
Re: Problema with NullPointerException on custom Parser
Hmmm looking at the ParserFactory code, there can actually be several causes for a NullPointerException... Can you also send the parse-plugins.xml? Mathijs Homminga On Jun 27, 2012, at 23:23, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: This is the content of my plugin.xml plugin id=image-thumbnail name=Image thumbnailer for Orion version=1.0.0 provider-name=nutch.org runtime library name=image-thumbnail.jar export name=*/ /library /runtime requires import plugin=nutch-extensionpoints/ /requires extension id=org.apache.nutch.parse.thumbnail.ImageThumbnailParser name=Image thumbnailer parser point=org.apache.nutch.parse.Parser implementation id=ImageThumbnailParser class=org.apache.nutch.parse.thumbnail.ImageThumbnailParser/ /extension extension id=org.apache.nutch.parse.thumbnail.ImageThumbnailIndexingFilter name=Image thumbnail indexing filter point=org.apache.nutch.indexer.IndexingFilter implementation id=ImageThumbnailIndexingFilter class=org.apache.nutch.parse.thumbnail.ImageThumbnailIndexingFilter/ /extension /plugin - Mensaje original - De: Mathijs Homminga mathijs.hommi...@kalooga.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 17:17:12 Asunto: Re: Problema with NullPointerException on custom Parser No need for Tika. Can you send your plugin.xml? Mathijs Homminga On Jun 27, 2012, at 23:07, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi, I agree with you, and is a genius idea rely on Tika to parse the files, but in this particular case when all I want to do is encode the content into base64 should I wrote a custom parser to tika and rely on the parser-tika plugin to do its magic? Jorge - Mensaje original - De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 16:55:12 Asunto: Re: Problema with NullPointerException on custom Parser Hi, I think you are partly correct. The core Nutch code itself doesn't do any parsing as such. All parsing is relied upon by external parsing libraries. Basically we need to define a parser to do the parsing, using Tika as a wrapper for mimeType detection and subsequent parsing saves us a bit of overhead. Lewis On Wed, Jun 27, 2012 at 9:44 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi Lewis, thank you for the reply. Is mandatory wrote a wrap around Tika? I thought this was optional since I really don't parse the content searching for nothing, I only get the content, transform it into an Image object, resize it, and then I encode with base64 to store on the solr backend. So I thought that all this processing could be done getParse method. Is my assumption correct or is mandatory to write my desired logic using Tika? - Mensaje original - De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 16:33:01 Asunto: Re: Problema with NullPointerException on custom Parser Hi Jorge, It doesn't look like your actually using Tika as a wrapper for your custom parser at all... You would be need to specify the correct Tika config by calling tikaConfig.getParser hth On Wed, Jun 27, 2012 at 7:46 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi all: I'm working on a custom parser plugin to generate thumbnails from images fetched with nutch 1.4. I'm doing this because the humbnails will be converted into a base64 encoded string and stored on a Solr backend. So I basically wrote a custom parser (to which I send all png images, for example). 
I enable the plugin (image-thumbnail) in the nutch-site.xml, set some custom properties to load the width and height of the thumbnail. Also set the alias on the parse-plugins.xml and set the plugin to handle the image/png files, also in this file. the plugin is being loaded, but every time I get a png image to parse I get this: Error parsing: http://localhost/sites/all/themes/octavitos/images/iconos/audiointernet.png: java.lang.NullPointerException at org.apache.nutch.parse.ParserFactory.match(ParserFactory.java:388) at org.apache.nutch.parse.ParserFactory.getExtension(ParserFactory.java:397) at org.apache.nutch.parse.ParserFactory.matchExtensions(ParserFactory.java:296) at org.apache.nutch.parse.ParserFactory.findExtensions(ParserFactory.java:262) at org.apache.nutch.parse.ParserFactory.getExtensions(ParserFactory.java:234) at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:119) at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71) at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:86
Re: Linking documents with Nutch+solr
Hi Stany, Do you have access to your forum's database? If so, there might be no need to scrape the posts/articles using a crawler like Nutch. You could use Solr as a stand alone indexing server which imports data from your database. Solr supports MoreLikeThis queries. Mathijs Homminga On Apr 3, 2012, at 20:02, Stany Fargose stannyfarg...@gmail.com wrote: Hi All, We want to implement related articles feature in our forum site. I was wondering how would we implement Nutch+Solr combination? Basically is there a way to link articles (or documents for solr) together? I would love to hear your thoughts. Thanks!
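A MoreLikeThis request against a stand-alone Solr can be as simple as the following sketch (the /mlt handler must be enabled in solrconfig.xml, and the document id and field names here are purely illustrative):

  curl 'http://localhost:8983/solr/mlt?q=id:12345&mlt.fl=title,body&mlt.mintf=1&mlt.mindf=1&rows=5'

The response lists the documents most similar to the one matched by q, which maps directly onto a "related articles" block.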
Re: Fetching/Indexing process is taking a lot of time
Hi George, Just to be sure: Your crawl cycle includes a 'generate', 'fetch' and 'update' step. Is it indeed within the 'fetch' step that this issue occurs? So, _after_ the Fetcher logs the message Fetcher: starting and _before_ the Fetcher logs the message Fetcher: done? If so, it indeed looks like Hadoop is moving your temporary data at very low rates. Mathijs On 19 mrt. 2012, at 03:20, George wrote: You are rght I'm using Nutch 0.9 Thank you for sugestion but i need help with this version. Yes, as i say i have hardware (with BBU+256 Mb cache) Raid1 from 6 sata 7200 disks. Copy speed on same disk is pretty hight and it's about 130-140 Mb/ps. There is no hardware problem. May be I have not configured something or my fetching script doing that (I have not found such function in it) don't know. I just need to know why fetched data is going to temporary directory and then is moved to the segment at wery low speed. My hadoop-site.xml ?xml version=1.0? ?xml-stylesheet type=text/xsl href=configuration.xsl? configuration property namehadoop.tmp.dir/name value/home/crawl/hadoop-${user.name}/value descriptionHadoop temp directory/description /property /configuration Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Fetching-Indexing-process-is-taking-a-lot-of-time-tp3834059p3837989.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: NutchHadoopTutorial Updated
This is great work!! Thanks Lewis!
I must say that when I read the tutorial it struck me how much of the effort goes into getting Hadoop up and running. It would be great if we could start with: "First, make sure you have a healthy Hadoop cluster running, see here for the Hadoop tutorial" ;-)
About the section "Deploy Nutch to Multiple Machines": this is not necessary, right? The job jar should be self-contained and ship with all the configuration files necessary. Nutch should be able to run on any vanilla Hadoop cluster.
Anyway, looking at the questions that arrive at nutch-user, this is really, really helpful.
Cheers,
Mathijs

On Mar 19, 2012, at 16:19 , Lewis John Mcgibbney wrote:
Hi Guys,
The NutchHadoopTutorial [0] on our wiki has finally been updated after quite some time. It's a rather long beast, but covers (hopefully) everything you require to get cracking with leveraging the latest versions of Nutch and Hadoop on a distributed platform and making best use of the great technologies. We would really appreciate feedback as there will undoubtedly be some errors or data missing.
Thanks
Lewis
[0] http://wiki.apache.org/nutch/NutchHadoopTutorial
-- *Lewis*
Re: NutchHadoopTutorial Updated
About the section "Deploy Nutch to Multiple Machines": this is not necessary, right? The job jar should be self-contained and ship with all the configuration files necessary. Nutch should be able to run on any vanilla Hadoop cluster.

It does. All you need is a healthy cluster and a Hadoop environment (cluster or local) that points to the jobtracker.

Exactly ;) Lewis, any reason to keep this section in there?
Mathijs
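As an illustration of the job jar being self-contained, a crawl can be submitted to a vanilla cluster with nothing Nutch-specific installed on the worker nodes. A sketch for a 1.x build (the jar name and main class vary slightly by version):

  # from the runtime/deploy directory of the build, with the client's Hadoop
  # configuration pointing at the cluster's jobtracker
  hadoop jar apache-nutch-1.4.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 1000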
Re: Fetching/Indexing process is taking a lot of time
Which version of Hadoop are you using? In your script, I see that you have a section called Generate, Fetch, Parse, Update (Step 2 of $steps) - At which of these sub steps do you see your problem? For example: (from the top of my head) - The Fetch job has a mapper which does the fetching, and has a reducer which copies the fetched data to de segment dir. Is it this step where you see the problem? - The Update job creates a new crawldb and then moves it to the final destination. Mathijs On Mar 19, 2012, at 3:20 , George wrote: You are rght I'm using Nutch 0.9 Thank you for sugestion but i need help with this version. Yes, as i say i have hardware (with BBU+256 Mb cache) Raid1 from 6 sata 7200 disks. Copy speed on same disk is pretty hight and it's about 130-140 Mb/ps. There is no hardware problem. May be I have not configured something or my fetching script doing that (I have not found such function in it) don't know. I just need to know why fetched data is going to temporary directory and then is moved to the segment at wery low speed. My hadoop-site.xml ?xml version=1.0? ?xml-stylesheet type=text/xsl href=configuration.xsl? configuration property namehadoop.tmp.dir/name value/home/crawl/hadoop-${user.name}/value descriptionHadoop temp directory/description /property /configuration Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Fetching-Indexing-process-is-taking-a-lot-of-time-tp3834059p3837989.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Fetching/Indexing process is taking a lot of time
Hmmm... First, you say that you use Nutch 9.0, you probably mean Nutch 0.9. That version is almost 5 years old. I really suggest that you update to 1.4. What if you manually move such amounts of data on your disks? Same low speed? (btw, do you really have raid 1 (mirroring) on 6 disks?) Cheers, Mathijs On 17 mrt. 2012, at 20:59, George wrote: no for example if i run dept 3 it fetching data to hadoop temporary directory then moving data to new segment and do this cycles 3 times all data is fetched to dadoop-root (temporary hadoop directory) and then nutch is moving this data to the segment dir in segment folder. and for example moving data is taking: first fetch is in about 3 gb moving in 0.30-2 hours second becomes 10-15 Gb and moving takes 10-12 hours third cycle become 20-25 Gb and moving takes 5-7 days may be more on more depts. -- View this message in context: http://lucene.472066.n3.nabble.com/Fetching-Indexing-process-is-taking-a-lot-of-time-tp3834059p3835186.html Sent from the Nutch - User mailing list archive at Nabble.com.
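Mathijs's suggestion to test the disks by hand can be done with a quick timed write and copy; a rough sketch (paths are illustrative and match the hadoop.tmp.dir quoted below):

  # sequential write speed on the raid volume
  dd if=/dev/zero of=/home/crawl/ddtest bs=1M count=4096 conv=fsync

  # time a plain copy between the temp and segment locations Nutch uses
  time cp -r /home/crawl/hadoop-root/somedir /home/crawl/segments/copytest
  rm -rf /home/crawl/ddtest /home/crawl/segments/copytest

If the raw copy is fast but the Hadoop job is slow, the bottleneck is in the job itself rather than the hardware.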
Re: Fetching/Indexing process is taking a lot of time
Hi, Your hardware looks okay. Moving data from 30,000 urls takes a week at 500kb/s? That would mean ~10Mb per url. Could that be right? Anyway, can you tell us at what stage your crawl script is when this kicks in? Mathijs On 17 mrt. 2012, at 07:40, George wrote: Hello I.m using nutch 9.0 default installation single machine: 2x2.5 quad core 16 GB ram 6 x 1TB sata raid 1 Network 1 gbps. Not using any distributed file system. Of cource have it configured All headers Threads : 100 Trying to crawl 3 url-s with generate per site -1 fetching with : --Script--- #!/bin/bash # runbot script to run the Nutch bot for crawling and re-crawling. # Usage: bin/runbot [safe] #If executed in 'safe' mode, it doesn't delete the temporary #directories generated during crawl. This might be helpful for #analysis and recovery in case a crawl fails. # # Author: Susam Pal # LOCAL VARIABLES cd /usr/local/nutch export JAVA_HOME=/usr/local/java export NUTCH_JAVA_HOME=/usr/local/java export NUTCH_HEAPSIZE=2048 NUTCH_HOME=/usr/local/nutch #if [ -e $NUTCH_HOME/nutch.tmp ] #then #echo Index process found... #else #date $NUTCH_HOME/nutch.tmp depth=1 threads=100 adddays=30 topN=100 #Comment this statement if you don't want to set topN value # Arguments for rm and mv RMARGS=-rf MVARGS=-v # Parse arguments if [ $1 == safe ] then safe=yes fi if [ -z $NUTCH_HOME ] then NUTCH_HOME=. echo runbot: $0 could not find environment variable NUTCH_HOME echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script else echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME fi if [ -n $topN ] then topN=-topN $topN else topN= fi steps=8 echo - Inject (Step 1 of $steps) - /bin/bash $NUTCH_HOME/bin/nutch inject /home/crawl/crawldb urls echo - Generate, Fetch, Parse, Update (Step 2 of $steps) - for ((i=0; i = depth ; i++)) do echo --- Beginning crawl at depth `expr $i + 1` of $depth --- /bin/bash $NUTCH_HOME/bin/nutch generate /home/crawl/crawldb /home/crawl/segments $topN \ -adddays $adddays if [ $? -ne 0 ] then echo runbot: Stopping at depth $depth. No more URLs to fetch. break fi segment=`ls -d /home/crawl/segments/* | tail -1` /bin/bash $NUTCH_HOME/bin/nutch fetch $segment -threads $threads if [ $? -ne 0 ] then echo runbot: fetch $segment at depth `expr $i + 1` failed. echo runbot: Deleting segment $segment. 
rm $RMARGS $segment continue fi /bin/bash $NUTCH_HOME/bin/nutch updatedb /home/crawl/crawldb $segment done echo - Merge Segments (Step 3 of $steps) - #/bin/bash $NUTCH_HOME/bin/nutch mergesegs /home/crawl/MERGEDsegments /home/crawl/segments/* #if [ $safe != yes ] #then # rm $RMARGS /home/crawl/segments #else # rm $RMARGS /home/crawl/BACKUPsegments # mv $MVARGS /home/crawl/segments /home/crawl/BACKUPsegments #fi #mv $MVARGS /home/crawl/MERGEDsegments /home/crawl/segments echo - Invert Links (Step 4 of $steps) - /bin/bash $NUTCH_HOME/bin/nutch invertlinks /home/crawl/linkdb /home/crawl/segments/* echo - Index (Step 5 of $steps) - /bin/bash $NUTCH_HOME/bin/nutch index /home/crawl/NEWindexes /home/crawl/crawldb /home/crawl/linkdb \ /home/crawl/segments/* echo - Dedup (Step 6 of $steps) - /bin/bash $NUTCH_HOME/bin/nutch dedup /home/crawl/NEWindexes echo - Merge Indexes (Step 7 of $steps) - /bin/bash $NUTCH_HOME/bin/nutch merge /home/crawl/NEWindex /home/crawl/NEWindexes echo - Loading New Index (Step 8 of $steps) - if [ $safe != yes ] then rm $RMARGS /home/crawl/NEWindexes rm $RMARGS /home/crawl/index else rm $RMARGS /home/crawl/BACKUPindexes rm $RMARGS /home/crawl/BACKUPindex mv $MVARGS /home/crawl/NEWindexes /home/crawl/BACKUPindexes mv $MVARGS /home/crawl/index /home/crawl/BACKUPindex fi mv $MVARGS /home/crawl/NEWindex /home/crawl/index #rm -f ${NUTCH_HOME}/nutch.tmp /bin/bash $NUTCH_HOME/bin/nutch readdb /home/crawl/crawldb -stats 1 /bin/bash $NUTCH_HOME/bin/search.server stop /bin/bash $NUTCH_HOME/bin/search.server start echo runbot: FINISHED: Crawl completed! echo -Script- all data is fetched to hadoop temporary directory hadoop-root to the /home/crawl/hadoop-root and after this step data is moving from /home/ctawl/hadoop-root to /home/ctawl/segments/xxx and this step taking a lot of time and depend on size it can take a week or more on this step data is moving with wery low speed 500 kb/ps (sorry i dont know what it is doing on this step, I'm just user and have no java programing or hadoop experiance) Is there any way to make this step faster? Thanks
Re: Blacklisted Tasktracker / AlreadyBeingCreatedException
Hi Rafael, This sounds like a Hadoop DFS issue. Perhaps it's better to post your question to: hdfs-u...@hadoop.apache.org Mathijs On 16 mrt. 2012, at 14:46, Rafael Pappert wrote: Hello, I'm running nutch 1.4 on an 3 Node Hadoop Cluster and from time to time i got an alert that 1 TaskTracker have been blacklisted. And the log of the reducer contains 3-6 Exceptions like this: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/test/crawl/segments/20120316065507/parse_text/part-1/data for DFSClient_attempt_201203151054_0028_r_01_1 on client xx.x.xx.xx.10, because this file is already being created by DFSClient_attempt_201203151054_0028_r_01_0 on xx.xx.xx.9 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1404) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1244) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1186) at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) at org.apache.hadoop.ipc.Client.call(Client.java:1066) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at $Proxy2.create(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy2.create(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.init(DFSClient.java:3245) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555) at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.init(SequenceFile.java:1132) at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397) at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354) at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476) at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:157) at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:134) at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:92) at org.apache.nutch.parse.ParseOutputFormat.getRecordWriter(ParseOutputFormat.java:110) at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.init(ReduceTask.java:448) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420) at 
org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) at org.apache.hadoop.mapred.Child.main(Child.java:249) I have no special Plugins, it's a default system. Any ideas? Thanks in advance, Rafael.
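The two reduce attempts in the trace (..._r_01_0 and ..._r_01_1) are both trying to create the same parse_text file, which is the pattern you get when a speculative or retried attempt races the original one. If speculative execution turns out to be the culprit (an assumption, not something confirmed in this thread), it can be disabled for reducers on that Hadoop generation with:

  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>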
Re: Handling duplicate sub domains
Hi Markus, What is your definition of duplicate (sub) domains? By reading your examples, I think you are looking for domains (or host IP's) that are interchangeable. That is, domains that give identical response when combined with the same protocol, port, path and query (a url). You could indeed use heuristics (like normalizing . to www.). I guess that most of the time this happens when the domain has set a wildcard dns record (catch-all). No guarantee however that wildcard domains act 'identical' of course. Although (sub) domains may point to the same canonical name or IP address, they still may give different responses because of domain/url based dispatching on that host (think virtual hosts in Apache) or application level logic. I guess this is why you never can be 100% sure that the domains are duplicates... Clues I can think of (none of them are hard guarantees): - Your heuristics using common patterns. - Do a DNS lookup of the domains... does it point to another domain or an IP address which is shared among other domains? - Did we find duplicate URLs on different hosts? - Quick: if there are a lot of identical urls (paths+query of substantial length) on different subdomains, then the domains might be identical. - You might want to include a content check in the above. - Actively check a fingerprint of the main page of each subdomain (e.g. title + some headers) and group domains based on this. I'm currently working on the Host table (in nutchgora) and like to include some of this in there too. Mathijs On Nov 27, 2011, at 15:46 , Markus Jelsma wrote: Hi, How do you handle the issue of duplicate (sub) domains? We measure a significant amount of duplicate pages across sub domains. Bad websites for example do not force a single sub domain and accept anything. With regex normalizers we can easily tackle a portion of the problem by normalizing www derivatives such as ww. . or www.w.ww.www. to www. This still leaves a huge amount of incorrect sub domains, leading to duplicates of _entire_ websites. We've built analysis jobs to detect and list duplicate pages within sub domains (but also works across domains) which we can then reduce with another job to bad sub domains. Yet, one of each sub domain for a given domain must be kept but i've still to figure out which sub domain will prevail. Here's an example of one such site: 113425:188example.org 114314:186startpagina.example.org 114334:186mobile.example.org 114339:186massages.example.org 114340:186massage.example.org 114362:186http.www.example.org 114446:185www.example.org 115280:184m.example.org 115316:184forum.example.org In this case it may be simple to select www as the sub domain we want to keep but it is not always so trivial. Anyone to share some inspiring insights for edge cases that make up the bulk of duplicates? Thanks, markus
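The www-derivative normalization mentioned above is typically done in conf/regex-normalize.xml. A sketch of one such rule (the pattern only covers repeated-w and leading-dot prefixes and would need tuning for real traffic):

  <regex>
    <!-- collapse prefixes like ww., w.ww., www.www. or a bare leading dot to www. -->
    <pattern>^(https?://)(?:w{1,4}\.|\.)+</pattern>
    <substitution>$1www.</substitution>
  </regex>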
Re: solr and nutch confusion...
Hi,
First of all, it may depend on the number of urls you are injecting (the number of urls in ../data/jf). If this is less than 1000, the first segment will be smaller, and depending on the number of outlinks found, the second segment might be as well.
It can also depend on the maximum number of urls per domain you're fetching (although I believe there is no restriction by default: generate.max.count). If this is set to 100 and you have only one domain in your list, then you might end up with just 200 fetched urls.
It can also depend on the fetch result: you may select 1000 urls (topN = 1000), but only 77 of them may have been fetched successfully.
It may also depend on removing duplicate urls.
Please take a look at your crawldb to check for more details using the CrawlDbReader tool. And you might also want to look at the logs for clues.
Cheers,
Mathijs

On Nov 15, 2011, at 3:57 , codegigabyte wrote:
I just started learning about Nutch and Solr and I am starting to get confused over some issues. I am using Cygwin on Windows XP. Basically I crawl with this command:
sh nutch crawl urls -dir ../data/jf -topN 1000
So basically this means that each segment will contain 1000 urls, right? I went to the jf folder and saw there are 2 folders under segments with timestamps as names. So theoretically I should have 2000 documents, right? Or wrong? Then I indexed to Solr with solrindex. Using the catch-all query *:*, numFound is 77. Some of the urls I expected to be crawled were not found in the results. Can anyone point me in the right direction?
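To see where the 77 comes from, the crawldb and segments can be inspected directly. A sketch using the paths from the question, assuming a 1.x build:

  # per-status counts: db_fetched, db_unfetched, db_gone, ...
  bin/nutch readdb ../data/jf/crawldb -stats

  # generated / fetched / parsed counts per segment
  bin/nutch readseg -list -dir ../data/jf/segments

Comparing the fetched count with numFound in Solr usually shows whether urls were lost at fetch time or at indexing time.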
Re: crawling a subdomain
You could write your own simple parse plugin that generates abc.xyz.com/stuff as outlink of www.xyz.com/stuff. Which is then crawled in (one of the) subsequent crawl cycles. Mathijs Homminga On Nov 7, 2011, at 7:15, Peyman Mohajerian mohaj...@gmail.com wrote: Thanks Sergey, I don't think I was clear on the issue, the subdomain I'm speaking of won't be found by the crawler, I have to somehow add it, so in my original input url of: http://www.xyz.com/stuff there is absolutely no way the crawler would know about http://abc.xyz.com/stuff I have to somehow dynamically add the subdomain. I also don't have the option of actually adding 'http://abc.xyz.com/stuff' in my input file (a bit of an extra convolution I don't want to bore you with!!). Thanks, Peyman On Sun, Nov 6, 2011 at 1:21 PM, Sergey A Volkov sergey.volko...@gmail.com wrote: Hi! I think you should use urlfilter-regex like http://\w\.xyz\.com/stuff.*; instead of urlfilter-domain and set db.ignore.external.links to false, this will work, but this is quite slow if you have many regex. You may also try to add xyz.com to domain-suffixes.xml, this may cause some side effects, i had never tested this, just looked in DomainURLFilter source, so it's probably not really good idea. Sergey Volkov On Mon 07 Nov 2011 12:35:30 AM MSK, Peyman Mohajerian wrote: Hi Guys, Let's say my input file is: http://www.xyz.com/stuff and I have thousands of these URLs in my input. How do I configure Nutch to also crawl this subdomain for each input: http://abc.xyz.com/stuff I don't want to just replace 'www' with 'abc' i want to crawl both. Thanks Peyman
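Sergey's regex suggestion, spelled out as configuration rather than a new plugin, would look roughly like the following conf/regex-urlfilter.txt entries, together with db.ignore.external.links set to false in nutch-site.xml (xyz.com is the placeholder domain from this thread):

  # accept any subdomain of xyz.com under /stuff, reject everything else
  +^http://([a-z0-9-]+\.)*xyz\.com/stuff
  -.

This still only finds abc.xyz.com pages that are actually linked from somewhere; if no page links to the subdomain, it has to be injected as an extra seed or generated by a custom parse plugin as described above.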
Re: Funky duplicate url's, getting much worse!
Hi Marcus, I remember Nutch had some troubles with honoring the page's BASE tag when resolving relative outlinks. However, I don't see this BASE tag being used in the HTML pages you provide so that's might not be it. Mathijs On Sep 28, 2010, at 18:51 , Markus Jelsma wrote: Anyone? Where is a proper solution for this issue? As expected, the regex won't catch all imaginable kinds of funky URL's that somehow ended up in the CrawlDB. Before the weekend, i added another news site to the tests i conduct and let it run continuously. Unfortunately, the generator now comes up with all kinds of completely useless URL's, although they do exist but that's just the web application ignoring most parts of the URL's. This is the URL that should be considered as proper URL: http://www.blikopnieuws.nl/nieuwsblok Here are two URL's that are completely useless: http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/ It is very hard to use deduplication on these simply because the content is actually changes too much as time progresses - the latest news block for example. It is therefore a necessity to keep these URL's from ending up in the CrawlDB and so not to waste disk space and update time of the CrawlDB and and huge load of bandwidth - i'm in my current fetch probably going to waste at least a few GB's. Looking at the HTML source, it looks like the parser cannot properly handle relative URL's. It is, of course, quite ugly for a site to do this but the parser must not fool itself and come up with URL's that really aren't there. Combined with the issue i began the thread with i believe the following two problems are present - the parser returns imaginary (false) URL's because of: 1. relative href's; 2. URL's in anchors (that is the XML element's body) next to the rhef attribute. Please help in finding the source of the problem (Tika? Nutch?) and how to proceed in having it fixed so other users won't waste bandwidth, disk space and CPU cycles =) Oh, here's a snippet of the fetch job that's currently running, also, notice the news item with the 119039 ID, it's the same as above although that copy/paste was 15 minutes ago. Most item ID's you see below continue to return in the current log output. 
fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488 fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/ fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/ fetching
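A quick way to see exactly which outlinks the parser extracts from one of these pages (and whether relative hrefs are being resolved against the wrong base) is the parser checker tool. A sketch for a 1.x build (on builds without the parsechecker alias, the class can be invoked directly):

  bin/nutch parsechecker http://www.blikopnieuws.nl/nieuwsblok
  # or: bin/nutch org.apache.nutch.parse.ParserChecker http://www.blikopnieuws.nl/nieuwsblok

The Outlinks section of the output shows the URLs before any crawldb state is involved, which makes it easier to tell whether the bogus paths come from the parser (Tika/Nutch) or from elsewhere in the pipeline.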