Re: crawling site without www
(JobClient.java:772)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
ParseSegment: starting at 2012-08-07 16:01:35
ParseSegment: segment: crawl/crawldb/segments/20120807160035
Exception in thread "main" java.io.IOException: Segment already parsed!
    at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
CrawlDb update: starting at 2012-08-07 16:01:36
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
LinkDb: starting at 2012-08-07 16:01:37
LinkDb: linkdb: crawl/crawldb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/data/nutch/crawl/crawldb/segments/20120807160035
LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02

But when seed.txt has www.test.com instead of test.com, the second launch of the crawler script finds the next segment for fetching.

On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga mathijs.hommi...@kalooga.com wrote:
What do you mean exactly with "it fails in the fetch phase"? Do you get an error? Does test.com exist? Does it perhaps redirect to www.test.com? ... Mathijs

On Aug 4, 2012, at 17:11 , Alexei Korolev alexei.koro...@gmail.com wrote:
yes

On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:
http:// ? hth

On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev alexei.koro...@gmail.com wrote:
Hello,
I have a small script:
$NUTCH_PATH inject crawl/crawldb seed.txt
$NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
s1=`ls -d crawl/crawldb/segments/* | tail -1`
$NUTCH_PATH fetch $s1
$NUTCH_PATH parse $s1
$NUTCH_PATH updatedb crawl/crawldb $s1
In seed.txt I have just one site, for example test.com. When I start the script it fails in the fetch phase. If I change test.com to www.test.com it works fine. The reason seems to be that the outgoing links on test.com all have the www. prefix. What do I need to change in the Nutch config to make it work with test.com? Thank you in advance. I hope my explanation is clear :)
-- Alexei A. Korolev
-- Lewis
-- Alexei A. Korolev
-- Alexei A. Korolev
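One way to check whether the www.test.com outlinks ever made it into the crawldb is the CrawlDbReader tool. A minimal sketch, assuming a Nutch 1.x layout and the crawl/crawldb path used above (the grep pattern and dump directory are only illustrative):

  # per-status counts (db_unfetched, db_fetched, ...)
  bin/nutch readdb crawl/crawldb -stats

  # dump the crawldb as text and look for the www host
  bin/nutch readdb crawl/crawldb -dump crawldb-dump
  grep 'www.test.com' crawldb-dump/part-*

If the www URLs are present but stay db_unfetched, the problem is in the generate/fetch steps; if they are missing entirely, the outlinks are being filtered or dropped before the update step.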
Re: crawling site without www
What do you mean exactly with "it fails in the fetch phase"? Do you get an error? Does test.com exist? Does it perhaps redirect to www.test.com? ... Mathijs

On Aug 4, 2012, at 17:11 , Alexei Korolev alexei.koro...@gmail.com wrote:
yes

On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:
http:// ? hth

On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev alexei.koro...@gmail.com wrote:
Hello,
I have a small script:
$NUTCH_PATH inject crawl/crawldb seed.txt
$NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
s1=`ls -d crawl/crawldb/segments/* | tail -1`
$NUTCH_PATH fetch $s1
$NUTCH_PATH parse $s1
$NUTCH_PATH updatedb crawl/crawldb $s1
In seed.txt I have just one site, for example test.com. When I start the script it fails in the fetch phase. If I change test.com to www.test.com it works fine. The reason seems to be that the outgoing links on test.com all have the www. prefix. What do I need to change in the Nutch config to make it work with test.com? Thank you in advance. I hope my explanation is clear :)
-- Alexei A. Korolev
-- Lewis
-- Alexei A. Korolev
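One setting worth checking in this situation (an assumption, not something confirmed in the thread): if db.ignore.external.links is set to true, outlinks from test.com to www.test.com are treated as links to a different host and dropped at update time, so a seed without www never yields the www pages. A sketch of the relevant nutch-site.xml entry:

  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
    <description>If true, outlinks that point to a different host are ignored;
    with a seed of test.com this would also drop links to www.test.com.</description>
  </property>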
Re: Is it possible to know how long it takes to download an amount of data with nutch.
What version of Nutch are you using?

On Aug 4, 2012, at 5:36 , isidro isidr...@gmail.com wrote:
Hi,
Where can I get the content size and the fetch times for each fetched file?
Isidro

On Thu, Aug 2, 2012 at 11:49 PM, Mathijs Homminga wrote:
Hi,
Unless you monitor the counters of a job while it's running: no. However, you could, in theory, replay the fetch/download by looking at the fetch times, sum the content sizes and see when the total hits the 5 GB. But you have to write your own tool for that.
Mathijs

On 3 aug. 2012, at 03:13, isidro wrote:
Hi,
Is it possible to know how long it takes to download an amount of data with Nutch? I want to know how long it took to download the first 5 GB.
Isidro
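The "replay" Mathijs describes can be roughed out from a segment dump. A sketch, assuming a Nutch 1.x SegmentReader and that the protocol metadata in the dump carries a Content-Length value (the segment path, header names and dump layout vary, so treat this as an approximation only):

  SEGMENT=crawl/segments/20120801000000   # illustrative path
  # dump one segment as text, skipping the parse output
  bin/nutch readseg -dump $SEGMENT segdump -noparse -noparsedata -noparsetext

  # sum the Content-Length values seen in the dump
  awk 'match($0, /Content-Length[=: ]+[0-9]+/) {
         n = substr($0, RSTART, RLENGTH); gsub(/[^0-9]/, "", n); total += n }
       END { printf "bytes seen in this segment: %d\n", total }' segdump/dump

Repeating this per segment in fetch order and noting where the running total passes 5 GB gives the elapsed time between the first and last fetch of that range.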
Re: Nutch 2.0 Solr 4.0 Alpha
Hi, Looking at the code, it looks like your batchId is null. Not sure how that can happen (since the SolrIndexerJob does check arguments). Have you tried to call the SolrIndexerJob alone (outside the Crawler tool)? Please do so and post commandline / nutch config / logs. Cheers, Mathijs On Jul 29, 2012, at 20:33 , X3C TECH t...@x3chaos.com wrote: Hi Lewis, Thanks for below, I just ran it on the new schema. Funny thing is in Solr's example/logs directory there are no files at all, so I'm wondering if Nutch is even hitting Solr. I ran a new crawl, and I'm now getting a Null Pointer at indexing point I assume (right after parse). This is the hadoop log dump (end of it, as the whole dump is large) 2012-07-29 07:25:05,118 INFO parse.ParserJob - Skipping http://wiki.apache.org/nutch/WordIndex; different batch id 2012-07-29 07:25:06,692 WARN mapred.FileOutputCommitter - Output path is null in cleanup 2012-07-29 07:25:11,689 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 1 2012-07-29 07:25:16,747 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 1 2012-07-29 07:25:16,754 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule 2012-07-29 07:25:16,755 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000 2012-07-29 07:25:16,755 INFO crawl.AbstractFetchSchedule - maxInterval=7776000 2012-07-29 07:25:19,243 WARN mapred.FileOutputCommitter - Output path is null in cleanup This is the exception Exception in thread main java.lang.NullPointerException at java.util.Hashtable.put(Hashtable.java:432) at java.util.Properties.setProperty(Properties.java:161) at org.apache.hadoop.conf.Configuration.set(Configuration.java:438) at org.apache.nutch.indexer.IndexerJob.createIndexJob(IndexerJob.java:128) at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:53) at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68) at org.apache.nutch.crawl.Crawler.run(Crawler.java:192) at org.apache.nutch.crawl.Crawler.run(Crawler.java:250) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.Crawler.main(Crawler.java:257) Hbase seems to have recorded the parsed content Could it be the SolrJ version? I'm using the default one that I got with the 2.0 release On Sun, Jul 29, 2012 at 11:04 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Iggy, We usually start with asking what the log from your solr server is saying? Additionally, did you know there is another schema for Nutch which 'should' work with Solr 4. Please see below http://svn.apache.org/repos/asf/nutch/branches/2.x/conf/schema-solr4.xml Lewis On Sun, Jul 29, 2012 at 6:53 PM, X3C TECH t...@x3chaos.com wrote: Hello, Has anyone been successful in hooking up Nutch 2 with Solr4? I seem to have my config screwed up somehow. I've added the Nutch fields to Solr's example schema and changed the field type from text' to text_general However when I index, I get the message SolrIndexerJob:starting SolrIndexerJob:Done but nothing has been indexed. Hadoop log shows no errors, neither does Solr terminal window. I even tried installing Solr 3.6.1 and copying the schema file as is, with no luck, same issue. Does something need to be adjusted in Nutch config? I made no adjustment when I built it, so it's stock beyond adjustments to hook up Hbase listed in tutorial. Your help is highly appreciated, as I'm really boggled by this!! Iggy -- Lewis
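To try the indexer outside the Crawler tool, as suggested above, the stand-alone job can be invoked directly. A sketch for Nutch 2.0 (the Solr URL is illustrative; depending on the exact 2.x revision the job takes a concrete batch id or the -all / -reindex flags):

  # index a specific batch
  bin/nutch solrindex http://localhost:8983/solr/ <batchId>

  # or, on revisions that support it, index everything that has been parsed
  bin/nutch solrindex http://localhost:8983/solr/ -all

If this works while the Crawler tool does not, the null batchId is being introduced inside the Crawler's argument handling rather than in the indexer itself.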
Re: Nutch output to Solr
Hi Jim,
I believe indexing is not part of the default crawl loop/process. You have to call the indexing job separately.
Mathijs Homminga

On Jul 12, 2012, at 17:36, Jim Chandler jamescchand...@gmail.com wrote:
Would anyone know why, when I'm doing my crawling, I don't get any output from Nutch to Solr? I am not getting any errors. I have used the individual Nutch commands to crawl. But when all is said and done I only see segment and digest fields.
Thanks,
Jim
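For a 1.x command-line crawl, the separate indexing call Mathijs refers to looks roughly like this (paths assume the usual crawl/ layout and are only illustrative; newer 1.x releases replace the positional linkdb argument with -linkdb and -dir flags):

  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Nothing reaches Solr until this job has run.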
Re: Anyone using the 2.X REST API to retrieve crawl results as JSON
Hi Julian, Just to share our experiences with using Nutch 2.0: Indexing in Nutch actually has nothing to do with indexing itself. It just selects some fields from a WebPage, does some very minimal processing (both typically in the indexing filter plugins) and sends the result to a writer. What I notice is that we tend to develop IndexingFilter/IndexingWriter combinations for exporting/pushing data to different external systems (Solr, elasticsearch,...) because not only do these systems use different format/interface (handled by IndexingWriter) but also may support different use cases, and thus may require different fields (done in IndexingFilter). Since indexing is the obvious use case here, I can understand the naming of this process, but again, the data can be pushed anywhere. Currently, we need to call a different IndexingJob (which uses a different Writer) and change the NutchConfiguration (to include the right IndexingFilters) to push data to another sink. I would be great if Nutch could support different target systems with one configuration. Mathijs On Jul 11, 2012, at 13:34 , Julien Nioche wrote: Hi Lewis I realise I was thinking about NUTCH-880, not NUTCH-932 which is indeed about retrieving crawl results as JSON From my own pov it appears that Nutch 2.X is 'closer' to the model required for a multiple backends implementation although there is still quite a bit of work to do here. backend for crawl storage != target of the exporter/indexer What I am slightly confused about, which hasn't been mentioned on this particular issue is whether individual Gora modules would make up part of the stack or whether the abstraction would somehow be written @Nutch side... of course this then gets a bit more tricky when we begin thinking about current 1.X and how to progress with a suitable long term vision. this is definitely on the Nutch side and applies in the same way for 1.x and 2.x. Think about it as a pluggable indexer : regardless of what backend is used for storing the crawl table you might want to send some of the content (with possible transformations) to e.g. SOLR, ElasticSearch, a text file, a database etc... At the moment we are limited to SOLR - which is OK as most people use Nutch for indexing / searching but the point is that we should have more flexibility. I have used the terms 'pluggable indexer' before as well as 'pluggable exporter' I suppose the difference is whether we take care of finding which URLs should be deleted (indexer) or just dump a snapshot of the content (exporter). See comments on https://issues.apache.org/jira/browse/NUTCH-1047 On Wed, Jul 11, 2012 at 8:54 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: I'd think that this would be more a case for the universal exporter (a.k.a multiple indexing backends) that we mentioned several times. The REST API is more a way of piloting a crawl remotely. It could certainly be twisted into doing all sorts of things but I am not sure it would be very practical when dealing with very large data. Instead having a pluggable exporter would allow you to define what backend you want to send the data to and what transformations to do on the way (e.g. convert to JSON). Alternatively a good old custom map reduce job based is the way to go. HTH Jul On 10 July 2012 22:42, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, I am looking to create a dataset for use in an example scenario where I want to create all the products you would typically find in the online Amazon store e.g. 
loads of products with different categories, different prices, titles, availability, condition etc etc etc. One way I was thinking of doing this was using the above API written into Nutch 2.X to get the results as JSON these could then hopefully be loaded into my product table in my datastore and we could begin to build up the database of products. Having never used the REST API directly I wonder if anyone has any information on this and whether I can obtain some direction relating to producing my crawl results as JSON. I'm also going to look into Andrzej's patch in NUTCH-932 also so I'll try to update this thread once I make some progress with it. Thanks in advance for any sharing of experiences with this one. Best Lewis -- Lewis -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble -- Lewis -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
Re: Problema with NullPointerException on custom Parser
Hi Jorge, I can indeed reproduce your problem using your code. After some debugging... You have to add a contentType to your implementation in plugin.xml: implementation id=ImageThumbnailParser class=...ImageThumbnailParserparameter name=contentType value=image/png//implementation Good luck! Send from my iphone, Mathijs Homminga On Jun 28, 2012, at 0:12, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Of course Mathijs, thank you for the time and the replies, here goes my parse-plugins.xml (as an attachment). Greetings! - Mensaje original - De: Mathijs Homminga mathijs.hommi...@kalooga.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 17:44:43 Asunto: Re: Problema with NullPointerException on custom Parser Hmmm looking at the ParserFactory code, there can actually be several causes for a NullPointerException... Can you also send the parse-plugins.xml? Mathijs Homminga On Jun 27, 2012, at 23:23, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: This is the content of my plugin.xml plugin id=image-thumbnail name=Image thumbnailer for Orion version=1.0.0 provider-name=nutch.org runtime library name=image-thumbnail.jar export name=*/ /library /runtime requires import plugin=nutch-extensionpoints/ /requires extension id=org.apache.nutch.parse.thumbnail.ImageThumbnailParser name=Image thumbnailer parser point=org.apache.nutch.parse.Parser implementation id=ImageThumbnailParser class=org.apache.nutch.parse.thumbnail.ImageThumbnailParser/ /extension extension id=org.apache.nutch.parse.thumbnail.ImageThumbnailIndexingFilter name=Image thumbnail indexing filter point=org.apache.nutch.indexer.IndexingFilter implementation id=ImageThumbnailIndexingFilter class=org.apache.nutch.parse.thumbnail.ImageThumbnailIndexingFilter/ /extension /plugin - Mensaje original - De: Mathijs Homminga mathijs.hommi...@kalooga.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 17:17:12 Asunto: Re: Problema with NullPointerException on custom Parser No need for Tika. Can you send your plugin.xml? Mathijs Homminga On Jun 27, 2012, at 23:07, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi, I agree with you, and is a genius idea rely on Tika to parse the files, but in this particular case when all I want to do is encode the content into base64 should I wrote a custom parser to tika and rely on the parser-tika plugin to do its magic? Jorge - Mensaje original - De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 16:55:12 Asunto: Re: Problema with NullPointerException on custom Parser Hi, I think you are partly correct. The core Nutch code itself doesn't do any parsing as such. All parsing is relied upon by external parsing libraries. Basically we need to define a parser to do the parsing, using Tika as a wrapper for mimeType detection and subsequent parsing saves us a bit of overhead. Lewis On Wed, Jun 27, 2012 at 9:44 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi Lewis, thank you for the reply. Is mandatory wrote a wrap around Tika? I thought this was optional since I really don't parse the content searching for nothing, I only get the content, transform it into an Image object, resize it, and then I encode with base64 to store on the solr backend. So I thought that all this processing could be done getParse method. Is my assumption correct or is mandatory to write my desired logic using Tika? 
- Mensaje original - De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 16:33:01 Asunto: Re: Problema with NullPointerException on custom Parser Hi Jorge, It doesn't look like your actually using Tika as a wrapper for your custom parser at all... You would be need to specify the correct Tika config by calling tikaConfig.getParser hth On Wed, Jun 27, 2012 at 7:46 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi all: I'm working on a custom parser plugin to generate thumbnails from images fetched with nutch 1.4. I'm doing this because the humbnails will be converted into a base64 encoded string and stored on a Solr backend. So I basically wrote a custom parser (to which I send all png images, for example). I enable the plugin (image-thumbnail) in the nutch-site.xml, set some custom properties to load the width and height of the thumbnail. Also set the alias on the parse-plugins.xml and set the plugin to handle the image/png files, also in this file. the plugin is being loaded, but every time I get a png image to parse I get this: Error parsing: http
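The markup in the plugin.xml fragment quoted at the top of this message was stripped by the archive; restored, the implementation element with the contentType parameter would look roughly like this (class name and id taken from the thread):

  <implementation id="ImageThumbnailParser"
                  class="org.apache.nutch.parse.thumbnail.ImageThumbnailParser">
    <parameter name="contentType" value="image/png"/>
  </implementation>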
Re: Problema with NullPointerException on custom Parser
You can use: image/(bmp|gif|jpeg|png|tiff) in your plugin.xml, this will cover all/most images. On Jun 28, 2012, at 19:40 , Jorge Luis Betancourt Gonzalez wrote: Hi Julien! Thank you for your explanation I realize that Tika indeed does a mimetype detection. I just was searching a way to ensure that in the plugin I'm developing only do the processing with images, this is just as a fail safe in case that some wrong configuration its done in conf/parse-plugins.xml. I'm thinking now that perhaps I can read the image types allowed to fetch in the nutch configuration and use this as a filter, you think this would be possible? - Mensaje original - De: Julien Nioche lists.digitalpeb...@gmail.com Para: user@nutch.apache.org Enviados: Jueves, 28 de Junio 2012 12:37:29 Asunto: Re: Problema with NullPointerException on custom Parser Guys Just to make sure there is no misunderstanding : the detection of the MimeType is done BEFORE the parsing step and it is what allows the parsing step to determine which parser to use. The mimetype detection uses Tika *and * there is a universal parser which is parse-tika (a.k.a the Tika Wrapper). These are two different things and you don't need to use parse-tika and can rely on other plugins. Now what you can do is to write a Parser and associate it with the mime-types of your choice : see conf/parse-plugins.xml and how to override parse-tika for a given mimetype. Another approach is do implement a HtmlParseFilter which will be called by parse-tika (assuming it is activated) from where you can access the Content and store the base64 in the parse-metadata (which you can index with the plugin index-metadata) HTH Julien On 27 June 2012 22:07, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cuwrote: Hi, I agree with you, and is a genius idea rely on Tika to parse the files, but in this particular case when all I want to do is encode the content into base64 should I wrote a custom parser to tika and rely on the parser-tika plugin to do its magic? Jorge - Mensaje original - De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 16:55:12 Asunto: Re: Problema with NullPointerException on custom Parser Hi, I think you are partly correct. The core Nutch code itself doesn't do any parsing as such. All parsing is relied upon by external parsing libraries. Basically we need to define a parser to do the parsing, using Tika as a wrapper for mimeType detection and subsequent parsing saves us a bit of overhead. Lewis On Wed, Jun 27, 2012 at 9:44 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi Lewis, thank you for the reply. Is mandatory wrote a wrap around Tika? I thought this was optional since I really don't parse the content searching for nothing, I only get the content, transform it into an Image object, resize it, and then I encode with base64 to store on the solr backend. So I thought that all this processing could be done getParse method. Is my assumption correct or is mandatory to write my desired logic using Tika? - Mensaje original - De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 16:33:01 Asunto: Re: Problema with NullPointerException on custom Parser Hi Jorge, It doesn't look like your actually using Tika as a wrapper for your custom parser at all... 
You would be need to specify the correct Tika config by calling tikaConfig.getParser hth On Wed, Jun 27, 2012 at 7:46 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi all: I'm working on a custom parser plugin to generate thumbnails from images fetched with nutch 1.4. I'm doing this because the humbnails will be converted into a base64 encoded string and stored on a Solr backend. So I basically wrote a custom parser (to which I send all png images, for example). I enable the plugin (image-thumbnail) in the nutch-site.xml, set some custom properties to load the width and height of the thumbnail. Also set the alias on the parse-plugins.xml and set the plugin to handle the image/png files, also in this file. the plugin is being loaded, but every time I get a png image to parse I get this: Error parsing: http://localhost/sites/all/themes/octavitos/images/iconos/audiointernet.png: java.lang.NullPointerException at org.apache.nutch.parse.ParserFactory.match(ParserFactory.java:388) at org.apache.nutch.parse.ParserFactory.getExtension(ParserFactory.java:397) at org.apache.nutch.parse.ParserFactory.matchExtensions(ParserFactory.java:296) at org.apache.nutch.parse.ParserFactory.findExtensions(ParserFactory.java:262) at org.apache.nutch.parse.ParserFactory.getExtensions(ParserFactory.java:234) at
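For completeness, the mime-type mapping itself lives in conf/parse-plugins.xml. A sketch using the plugin id and extension id from this thread (the alias name must match the plugin id referenced in the mimeType entry):

  <mimeType name="image/png">
    <plugin id="image-thumbnail" />
  </mimeType>

  <aliases>
    <alias name="image-thumbnail"
           extension-id="org.apache.nutch.parse.thumbnail.ImageThumbnailParser" />
  </aliases>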
Re: Problema with NullPointerException on custom Parser
Hmmm looking at the ParserFactory code, there can actually be several causes for a NullPointerException... Can you also send the parse-plugins.xml? Mathijs Homminga On Jun 27, 2012, at 23:23, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: This is the content of my plugin.xml plugin id=image-thumbnail name=Image thumbnailer for Orion version=1.0.0 provider-name=nutch.org runtime library name=image-thumbnail.jar export name=*/ /library /runtime requires import plugin=nutch-extensionpoints/ /requires extension id=org.apache.nutch.parse.thumbnail.ImageThumbnailParser name=Image thumbnailer parser point=org.apache.nutch.parse.Parser implementation id=ImageThumbnailParser class=org.apache.nutch.parse.thumbnail.ImageThumbnailParser/ /extension extension id=org.apache.nutch.parse.thumbnail.ImageThumbnailIndexingFilter name=Image thumbnail indexing filter point=org.apache.nutch.indexer.IndexingFilter implementation id=ImageThumbnailIndexingFilter class=org.apache.nutch.parse.thumbnail.ImageThumbnailIndexingFilter/ /extension /plugin - Mensaje original - De: Mathijs Homminga mathijs.hommi...@kalooga.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 17:17:12 Asunto: Re: Problema with NullPointerException on custom Parser No need for Tika. Can you send your plugin.xml? Mathijs Homminga On Jun 27, 2012, at 23:07, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi, I agree with you, and is a genius idea rely on Tika to parse the files, but in this particular case when all I want to do is encode the content into base64 should I wrote a custom parser to tika and rely on the parser-tika plugin to do its magic? Jorge - Mensaje original - De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 16:55:12 Asunto: Re: Problema with NullPointerException on custom Parser Hi, I think you are partly correct. The core Nutch code itself doesn't do any parsing as such. All parsing is relied upon by external parsing libraries. Basically we need to define a parser to do the parsing, using Tika as a wrapper for mimeType detection and subsequent parsing saves us a bit of overhead. Lewis On Wed, Jun 27, 2012 at 9:44 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi Lewis, thank you for the reply. Is mandatory wrote a wrap around Tika? I thought this was optional since I really don't parse the content searching for nothing, I only get the content, transform it into an Image object, resize it, and then I encode with base64 to store on the solr backend. So I thought that all this processing could be done getParse method. Is my assumption correct or is mandatory to write my desired logic using Tika? - Mensaje original - De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Para: user@nutch.apache.org Enviados: Miércoles, 27 de Junio 2012 16:33:01 Asunto: Re: Problema with NullPointerException on custom Parser Hi Jorge, It doesn't look like your actually using Tika as a wrapper for your custom parser at all... You would be need to specify the correct Tika config by calling tikaConfig.getParser hth On Wed, Jun 27, 2012 at 7:46 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi all: I'm working on a custom parser plugin to generate thumbnails from images fetched with nutch 1.4. I'm doing this because the humbnails will be converted into a base64 encoded string and stored on a Solr backend. So I basically wrote a custom parser (to which I send all png images, for example). 
I enable the plugin (image-thumbnail) in the nutch-site.xml, set some custom properties to load the width and height of the thumbnail. Also set the alias on the parse-plugins.xml and set the plugin to handle the image/png files, also in this file. the plugin is being loaded, but every time I get a png image to parse I get this: Error parsing: http://localhost/sites/all/themes/octavitos/images/iconos/audiointernet.png: java.lang.NullPointerException at org.apache.nutch.parse.ParserFactory.match(ParserFactory.java:388) at org.apache.nutch.parse.ParserFactory.getExtension(ParserFactory.java:397) at org.apache.nutch.parse.ParserFactory.matchExtensions(ParserFactory.java:296) at org.apache.nutch.parse.ParserFactory.findExtensions(ParserFactory.java:262) at org.apache.nutch.parse.ParserFactory.getExtensions(ParserFactory.java:234) at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:119) at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71) at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:86
Re: Linking documents with Nutch+solr
Hi Stany, Do you have access to your forum's database? If so, there might be no need to scrape the posts/articles using a crawler like Nutch. You could use Solr as a stand alone indexing server which imports data from your database. Solr supports MoreLikeThis queries. Mathijs Homminga On Apr 3, 2012, at 20:02, Stany Fargose stannyfarg...@gmail.com wrote: Hi All, We want to implement related articles feature in our forum site. I was wondering how would we implement Nutch+Solr combination? Basically is there a way to link articles (or documents for solr) together? I would love to hear your thoughts. Thanks!
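A MoreLikeThis request against a stand-alone Solr can be as simple as the following sketch (the /mlt handler must be enabled in solrconfig.xml, and the document id and field names here are purely illustrative):

  curl 'http://localhost:8983/solr/mlt?q=id:12345&mlt.fl=title,body&mlt.mintf=1&mlt.mindf=1&rows=5'

The response lists the documents most similar to the one matched by q, which maps directly onto a "related articles" block.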
Re: Fetching/Indexing process is taking a lot of time
Hi George, Just to be sure: Your crawl cycle includes a 'generate', 'fetch' and 'update' step. Is it indeed within the 'fetch' step that this issue occurs? So, _after_ the Fetcher logs the message Fetcher: starting and _before_ the Fetcher logs the message Fetcher: done? If so, it indeed looks like Hadoop is moving your temporary data at very low rates. Mathijs On 19 mrt. 2012, at 03:20, George wrote: You are rght I'm using Nutch 0.9 Thank you for sugestion but i need help with this version. Yes, as i say i have hardware (with BBU+256 Mb cache) Raid1 from 6 sata 7200 disks. Copy speed on same disk is pretty hight and it's about 130-140 Mb/ps. There is no hardware problem. May be I have not configured something or my fetching script doing that (I have not found such function in it) don't know. I just need to know why fetched data is going to temporary directory and then is moved to the segment at wery low speed. My hadoop-site.xml ?xml version=1.0? ?xml-stylesheet type=text/xsl href=configuration.xsl? configuration property namehadoop.tmp.dir/name value/home/crawl/hadoop-${user.name}/value descriptionHadoop temp directory/description /property /configuration Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Fetching-Indexing-process-is-taking-a-lot-of-time-tp3834059p3837989.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: NutchHadoopTutorial Updated
This is great work!! Thanks Lewis!
I must say that when I read the tutorial it struck me how much of the effort goes into getting Hadoop up and running. It would be great if we could start with: "First, make sure you have a healthy Hadoop cluster running, see here for the Hadoop tutorial" ;-)
About the section "Deploy Nutch to Multiple Machines": this is not necessary, right? The job jar should be self-contained and ship with all the configuration files necessary. Nutch should be able to run on any vanilla Hadoop cluster.
Anyway, looking at the questions that arrive at nutch-user, this is really, really helpful.
Cheers,
Mathijs

On Mar 19, 2012, at 16:19 , Lewis John Mcgibbney wrote:
Hi Guys,
The NutchHadoopTutorial [0] on our wiki has finally been updated after quite some time. It's a rather long beast, but covers (hopefully) everything you require to get cracking with leveraging the latest versions of Nutch and Hadoop on a distributed platform and making best use of the great technologies. We would really appreciate feedback as there will undoubtedly be some errors or data missing.
Thanks
Lewis
[0] http://wiki.apache.org/nutch/NutchHadoopTutorial
-- *Lewis*
Re: NutchHadoopTutorial Updated
About the section "Deploy Nutch to Multiple Machines": this is not necessary, right? The job jar should be self-contained and ship with all the configuration files necessary. Nutch should be able to run on any vanilla Hadoop cluster.

It does. All you need is a healthy cluster and a Hadoop environment (cluster or local) that points to the jobtracker.

Exactly ;) Lewis, any reason to keep this section in there?
Mathijs
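As an illustration of the job jar being self-contained, a crawl can be submitted to a vanilla cluster with nothing Nutch-specific installed on the worker nodes. A sketch for a 1.x build (the jar name and main class vary slightly by version):

  # from the runtime/deploy directory of the build, with the client's Hadoop
  # configuration pointing at the cluster's jobtracker
  hadoop jar apache-nutch-1.4.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 1000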
Re: Fetching/Indexing process is taking a lot of time
Which version of Hadoop are you using? In your script, I see that you have a section called Generate, Fetch, Parse, Update (Step 2 of $steps) - At which of these sub steps do you see your problem? For example: (from the top of my head) - The Fetch job has a mapper which does the fetching, and has a reducer which copies the fetched data to de segment dir. Is it this step where you see the problem? - The Update job creates a new crawldb and then moves it to the final destination. Mathijs On Mar 19, 2012, at 3:20 , George wrote: You are rght I'm using Nutch 0.9 Thank you for sugestion but i need help with this version. Yes, as i say i have hardware (with BBU+256 Mb cache) Raid1 from 6 sata 7200 disks. Copy speed on same disk is pretty hight and it's about 130-140 Mb/ps. There is no hardware problem. May be I have not configured something or my fetching script doing that (I have not found such function in it) don't know. I just need to know why fetched data is going to temporary directory and then is moved to the segment at wery low speed. My hadoop-site.xml ?xml version=1.0? ?xml-stylesheet type=text/xsl href=configuration.xsl? configuration property namehadoop.tmp.dir/name value/home/crawl/hadoop-${user.name}/value descriptionHadoop temp directory/description /property /configuration Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Fetching-Indexing-process-is-taking-a-lot-of-time-tp3834059p3837989.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Fetching/Indexing process is taking a lot of time
Hmmm... First, you say that you use Nutch 9.0, you probably mean Nutch 0.9. That version is almost 5 years old. I really suggest that you update to 1.4. What if you manually move such amounts of data on your disks? Same low speed? (btw, do you really have raid 1 (mirroring) on 6 disks?) Cheers, Mathijs On 17 mrt. 2012, at 20:59, George wrote: no for example if i run dept 3 it fetching data to hadoop temporary directory then moving data to new segment and do this cycles 3 times all data is fetched to dadoop-root (temporary hadoop directory) and then nutch is moving this data to the segment dir in segment folder. and for example moving data is taking: first fetch is in about 3 gb moving in 0.30-2 hours second becomes 10-15 Gb and moving takes 10-12 hours third cycle become 20-25 Gb and moving takes 5-7 days may be more on more depts. -- View this message in context: http://lucene.472066.n3.nabble.com/Fetching-Indexing-process-is-taking-a-lot-of-time-tp3834059p3835186.html Sent from the Nutch - User mailing list archive at Nabble.com.
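Mathijs's suggestion to test the disks by hand can be done with a quick timed write and copy; a rough sketch (paths are illustrative and match the hadoop.tmp.dir quoted below):

  # sequential write speed on the raid volume
  dd if=/dev/zero of=/home/crawl/ddtest bs=1M count=4096 conv=fsync

  # time a plain copy between the temp and segment locations Nutch uses
  time cp -r /home/crawl/hadoop-root/somedir /home/crawl/segments/copytest
  rm -rf /home/crawl/ddtest /home/crawl/segments/copytest

If the raw copy is fast but the Hadoop job is slow, the bottleneck is in the job itself rather than the hardware.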
Re: Fetching/Indexing process is taking a lot of time
Hi, Your hardware looks okay. Moving data from 30,000 urls takes a week at 500kb/s? That would mean ~10Mb per url. Could that be right? Anyway, can you tell us at what stage your crawl script is when this kicks in? Mathijs On 17 mrt. 2012, at 07:40, George wrote: Hello I.m using nutch 9.0 default installation single machine: 2x2.5 quad core 16 GB ram 6 x 1TB sata raid 1 Network 1 gbps. Not using any distributed file system. Of cource have it configured All headers Threads : 100 Trying to crawl 3 url-s with generate per site -1 fetching with : --Script--- #!/bin/bash # runbot script to run the Nutch bot for crawling and re-crawling. # Usage: bin/runbot [safe] #If executed in 'safe' mode, it doesn't delete the temporary #directories generated during crawl. This might be helpful for #analysis and recovery in case a crawl fails. # # Author: Susam Pal # LOCAL VARIABLES cd /usr/local/nutch export JAVA_HOME=/usr/local/java export NUTCH_JAVA_HOME=/usr/local/java export NUTCH_HEAPSIZE=2048 NUTCH_HOME=/usr/local/nutch #if [ -e $NUTCH_HOME/nutch.tmp ] #then #echo Index process found... #else #date $NUTCH_HOME/nutch.tmp depth=1 threads=100 adddays=30 topN=100 #Comment this statement if you don't want to set topN value # Arguments for rm and mv RMARGS=-rf MVARGS=-v # Parse arguments if [ $1 == safe ] then safe=yes fi if [ -z $NUTCH_HOME ] then NUTCH_HOME=. echo runbot: $0 could not find environment variable NUTCH_HOME echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script else echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME fi if [ -n $topN ] then topN=-topN $topN else topN= fi steps=8 echo - Inject (Step 1 of $steps) - /bin/bash $NUTCH_HOME/bin/nutch inject /home/crawl/crawldb urls echo - Generate, Fetch, Parse, Update (Step 2 of $steps) - for ((i=0; i = depth ; i++)) do echo --- Beginning crawl at depth `expr $i + 1` of $depth --- /bin/bash $NUTCH_HOME/bin/nutch generate /home/crawl/crawldb /home/crawl/segments $topN \ -adddays $adddays if [ $? -ne 0 ] then echo runbot: Stopping at depth $depth. No more URLs to fetch. break fi segment=`ls -d /home/crawl/segments/* | tail -1` /bin/bash $NUTCH_HOME/bin/nutch fetch $segment -threads $threads if [ $? -ne 0 ] then echo runbot: fetch $segment at depth `expr $i + 1` failed. echo runbot: Deleting segment $segment. 
rm $RMARGS $segment continue fi /bin/bash $NUTCH_HOME/bin/nutch updatedb /home/crawl/crawldb $segment done echo - Merge Segments (Step 3 of $steps) - #/bin/bash $NUTCH_HOME/bin/nutch mergesegs /home/crawl/MERGEDsegments /home/crawl/segments/* #if [ $safe != yes ] #then # rm $RMARGS /home/crawl/segments #else # rm $RMARGS /home/crawl/BACKUPsegments # mv $MVARGS /home/crawl/segments /home/crawl/BACKUPsegments #fi #mv $MVARGS /home/crawl/MERGEDsegments /home/crawl/segments echo - Invert Links (Step 4 of $steps) - /bin/bash $NUTCH_HOME/bin/nutch invertlinks /home/crawl/linkdb /home/crawl/segments/* echo - Index (Step 5 of $steps) - /bin/bash $NUTCH_HOME/bin/nutch index /home/crawl/NEWindexes /home/crawl/crawldb /home/crawl/linkdb \ /home/crawl/segments/* echo - Dedup (Step 6 of $steps) - /bin/bash $NUTCH_HOME/bin/nutch dedup /home/crawl/NEWindexes echo - Merge Indexes (Step 7 of $steps) - /bin/bash $NUTCH_HOME/bin/nutch merge /home/crawl/NEWindex /home/crawl/NEWindexes echo - Loading New Index (Step 8 of $steps) - if [ $safe != yes ] then rm $RMARGS /home/crawl/NEWindexes rm $RMARGS /home/crawl/index else rm $RMARGS /home/crawl/BACKUPindexes rm $RMARGS /home/crawl/BACKUPindex mv $MVARGS /home/crawl/NEWindexes /home/crawl/BACKUPindexes mv $MVARGS /home/crawl/index /home/crawl/BACKUPindex fi mv $MVARGS /home/crawl/NEWindex /home/crawl/index #rm -f ${NUTCH_HOME}/nutch.tmp /bin/bash $NUTCH_HOME/bin/nutch readdb /home/crawl/crawldb -stats 1 /bin/bash $NUTCH_HOME/bin/search.server stop /bin/bash $NUTCH_HOME/bin/search.server start echo runbot: FINISHED: Crawl completed! echo -Script- all data is fetched to hadoop temporary directory hadoop-root to the /home/crawl/hadoop-root and after this step data is moving from /home/ctawl/hadoop-root to /home/ctawl/segments/xxx and this step taking a lot of time and depend on size it can take a week or more on this step data is moving with wery low speed 500 kb/ps (sorry i dont know what it is doing on this step, I'm just user and have no java programing or hadoop experiance) Is there any way to make this step faster? Thanks
Re: Blacklisted Tasktracker / AlreadyBeingCreatedException
Hi Rafael, This sounds like a Hadoop DFS issue. Perhaps it's better to post your question to: hdfs-u...@hadoop.apache.org Mathijs On 16 mrt. 2012, at 14:46, Rafael Pappert wrote: Hello, I'm running nutch 1.4 on an 3 Node Hadoop Cluster and from time to time i got an alert that 1 TaskTracker have been blacklisted. And the log of the reducer contains 3-6 Exceptions like this: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/test/crawl/segments/20120316065507/parse_text/part-1/data for DFSClient_attempt_201203151054_0028_r_01_1 on client xx.x.xx.xx.10, because this file is already being created by DFSClient_attempt_201203151054_0028_r_01_0 on xx.xx.xx.9 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1404) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1244) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1186) at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) at org.apache.hadoop.ipc.Client.call(Client.java:1066) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at $Proxy2.create(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy2.create(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.init(DFSClient.java:3245) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555) at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.init(SequenceFile.java:1132) at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397) at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354) at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476) at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:157) at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:134) at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:92) at org.apache.nutch.parse.ParseOutputFormat.getRecordWriter(ParseOutputFormat.java:110) at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.init(ReduceTask.java:448) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420) at 
org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) at org.apache.hadoop.mapred.Child.main(Child.java:249) I have no special Plugins, it's a default system. Any ideas? Thanks in advance, Rafael.
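The two reduce attempts in the trace (..._r_01_0 and ..._r_01_1) are both trying to create the same parse_text file, which is the pattern you get when a speculative or retried attempt races the original one. If speculative execution turns out to be the culprit (an assumption, not something confirmed in this thread), it can be disabled for reducers on that Hadoop generation with:

  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>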
Re: Handling duplicate sub domains
Hi Markus, What is your definition of duplicate (sub) domains? By reading your examples, I think you are looking for domains (or host IP's) that are interchangeable. That is, domains that give identical response when combined with the same protocol, port, path and query (a url). You could indeed use heuristics (like normalizing . to www.). I guess that most of the time this happens when the domain has set a wildcard dns record (catch-all). No guarantee however that wildcard domains act 'identical' of course. Although (sub) domains may point to the same canonical name or IP address, they still may give different responses because of domain/url based dispatching on that host (think virtual hosts in Apache) or application level logic. I guess this is why you never can be 100% sure that the domains are duplicates... Clues I can think of (none of them are hard guarantees): - Your heuristics using common patterns. - Do a DNS lookup of the domains... does it point to another domain or an IP address which is shared among other domains? - Did we find duplicate URLs on different hosts? - Quick: if there are a lot of identical urls (paths+query of substantial length) on different subdomains, then the domains might be identical. - You might want to include a content check in the above. - Actively check a fingerprint of the main page of each subdomain (e.g. title + some headers) and group domains based on this. I'm currently working on the Host table (in nutchgora) and like to include some of this in there too. Mathijs On Nov 27, 2011, at 15:46 , Markus Jelsma wrote: Hi, How do you handle the issue of duplicate (sub) domains? We measure a significant amount of duplicate pages across sub domains. Bad websites for example do not force a single sub domain and accept anything. With regex normalizers we can easily tackle a portion of the problem by normalizing www derivatives such as ww. . or www.w.ww.www. to www. This still leaves a huge amount of incorrect sub domains, leading to duplicates of _entire_ websites. We've built analysis jobs to detect and list duplicate pages within sub domains (but also works across domains) which we can then reduce with another job to bad sub domains. Yet, one of each sub domain for a given domain must be kept but i've still to figure out which sub domain will prevail. Here's an example of one such site: 113425:188example.org 114314:186startpagina.example.org 114334:186mobile.example.org 114339:186massages.example.org 114340:186massage.example.org 114362:186http.www.example.org 114446:185www.example.org 115280:184m.example.org 115316:184forum.example.org In this case it may be simple to select www as the sub domain we want to keep but it is not always so trivial. Anyone to share some inspiring insights for edge cases that make up the bulk of duplicates? Thanks, markus
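The www-derivative normalization mentioned above is typically done in conf/regex-normalize.xml. A sketch of one such rule (the pattern only covers repeated-w and leading-dot prefixes and would need tuning for real traffic):

  <regex>
    <!-- collapse prefixes like ww., w.ww., www.www. or a bare leading dot to www. -->
    <pattern>^(https?://)(?:w{1,4}\.|\.)+</pattern>
    <substitution>$1www.</substitution>
  </regex>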
Re: solr and nutch confusion...
Hi,
First of all, it may depend on the number of urls you are injecting (the number of urls in ../data/jf). If this is less than 1000, the first segment will be smaller, and depending on the number of outlinks found, the second segment might be as well.
It can also depend on the maximum number of urls per domain you're fetching (although I believe there is no restriction by default: generate.max.count). If this is set to 100 and you have only one domain in your list, then you might end up with just 200 fetched urls.
It can also depend on the fetch result: you may select 1000 urls (topN = 1000), but only 77 of them may have been fetched successfully.
It may also depend on removing duplicate urls.
Please take a look at your crawldb to check for more details using the CrawlDbReader tool. And you might also want to look at the logs for clues.
Cheers,
Mathijs

On Nov 15, 2011, at 3:57 , codegigabyte wrote:
I just started learning about Nutch and Solr and I am starting to get confused over some issues. I am using Cygwin on Windows XP. Basically I crawl with this command:
sh nutch crawl urls -dir ../data/jf -topN 1000
So basically this means that each segment will contain 1000 urls, right? I went to the jf folder and saw there are 2 folders under segments with timestamps as names. So theoretically I should have 2000 documents, right? Or wrong? Then I indexed to Solr with solrindex. Using the catch-all query *:*, numFound is 77. Some of the urls I expected to be crawled were not found in the results. Can anyone point me in the right direction?
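To see where the 77 comes from, the crawldb and segments can be inspected directly. A sketch using the paths from the question, assuming a 1.x build:

  # per-status counts: db_fetched, db_unfetched, db_gone, ...
  bin/nutch readdb ../data/jf/crawldb -stats

  # generated / fetched / parsed counts per segment
  bin/nutch readseg -list -dir ../data/jf/segments

Comparing the fetched count with numFound in Solr usually shows whether urls were lost at fetch time or at indexing time.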
Re: crawling a subdomain
You could write your own simple parse plugin that generates abc.xyz.com/stuff as outlink of www.xyz.com/stuff. Which is then crawled in (one of the) subsequent crawl cycles. Mathijs Homminga On Nov 7, 2011, at 7:15, Peyman Mohajerian mohaj...@gmail.com wrote: Thanks Sergey, I don't think I was clear on the issue, the subdomain I'm speaking of won't be found by the crawler, I have to somehow add it, so in my original input url of: http://www.xyz.com/stuff there is absolutely no way the crawler would know about http://abc.xyz.com/stuff I have to somehow dynamically add the subdomain. I also don't have the option of actually adding 'http://abc.xyz.com/stuff' in my input file (a bit of an extra convolution I don't want to bore you with!!). Thanks, Peyman On Sun, Nov 6, 2011 at 1:21 PM, Sergey A Volkov sergey.volko...@gmail.com wrote: Hi! I think you should use urlfilter-regex like http://\w\.xyz\.com/stuff.*; instead of urlfilter-domain and set db.ignore.external.links to false, this will work, but this is quite slow if you have many regex. You may also try to add xyz.com to domain-suffixes.xml, this may cause some side effects, i had never tested this, just looked in DomainURLFilter source, so it's probably not really good idea. Sergey Volkov On Mon 07 Nov 2011 12:35:30 AM MSK, Peyman Mohajerian wrote: Hi Guys, Let's say my input file is: http://www.xyz.com/stuff and I have thousands of these URLs in my input. How do I configure Nutch to also crawl this subdomain for each input: http://abc.xyz.com/stuff I don't want to just replace 'www' with 'abc' i want to crawl both. Thanks Peyman
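Sergey's regex suggestion, spelled out as configuration rather than a new plugin, would look roughly like the following conf/regex-urlfilter.txt entries, together with db.ignore.external.links set to false in nutch-site.xml (xyz.com is the placeholder domain from this thread):

  # accept any subdomain of xyz.com under /stuff, reject everything else
  +^http://([a-z0-9-]+\.)*xyz\.com/stuff
  -.

This still only finds abc.xyz.com pages that are actually linked from somewhere; if no page links to the subdomain, it has to be injected as an extra seed or generated by a custom parse plugin as described above.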
Re: Funky duplicate url's, getting much worse!
Hi Marcus, I remember Nutch had some troubles with honoring the page's BASE tag when resolving relative outlinks. However, I don't see this BASE tag being used in the HTML pages you provide so that's might not be it. Mathijs On Sep 28, 2010, at 18:51 , Markus Jelsma wrote: Anyone? Where is a proper solution for this issue? As expected, the regex won't catch all imaginable kinds of funky URL's that somehow ended up in the CrawlDB. Before the weekend, i added another news site to the tests i conduct and let it run continuously. Unfortunately, the generator now comes up with all kinds of completely useless URL's, although they do exist but that's just the web application ignoring most parts of the URL's. This is the URL that should be considered as proper URL: http://www.blikopnieuws.nl/nieuwsblok Here are two URL's that are completely useless: http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/ It is very hard to use deduplication on these simply because the content is actually changes too much as time progresses - the latest news block for example. It is therefore a necessity to keep these URL's from ending up in the CrawlDB and so not to waste disk space and update time of the CrawlDB and and huge load of bandwidth - i'm in my current fetch probably going to waste at least a few GB's. Looking at the HTML source, it looks like the parser cannot properly handle relative URL's. It is, of course, quite ugly for a site to do this but the parser must not fool itself and come up with URL's that really aren't there. Combined with the issue i began the thread with i believe the following two problems are present - the parser returns imaginary (false) URL's because of: 1. relative href's; 2. URL's in anchors (that is the XML element's body) next to the rhef attribute. Please help in finding the source of the problem (Tika? Nutch?) and how to proceed in having it fixed so other users won't waste bandwidth, disk space and CPU cycles =) Oh, here's a snippet of the fetch job that's currently running, also, notice the news item with the 119039 ID, it's the same as above although that copy/paste was 15 minutes ago. Most item ID's you see below continue to return in the current log output. 
fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488 fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/ fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/ fetching
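A quick way to see exactly which outlinks the parser extracts from one of these pages (and whether relative hrefs are being resolved against the wrong base) is the parser checker tool. A sketch for a 1.x build (on builds without the parsechecker alias, the class can be invoked directly):

  bin/nutch parsechecker http://www.blikopnieuws.nl/nieuwsblok
  # or: bin/nutch org.apache.nutch.parse.ParserChecker http://www.blikopnieuws.nl/nieuwsblok

The Outlinks section of the output shows the URLs before any crawldb state is involved, which makes it easier to tell whether the bogus paths come from the parser (Tika/Nutch) or from elsewhere in the pipeline.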