Re: crawling site without www

2012-08-07 Thread Mathijs Homminga
: finished at 2012-08-07 16:01:40, elapsed: 00:00:02 But when seed.txt has www.test.com instead of test.com, the second launch of the crawler script finds the next segment for fetching. On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga mathijs.hommi...@kalooga.com wrote: What do you mean exactly

Re: crawling site without www

2012-08-04 Thread Mathijs Homminga
What do you mean exactly with "it falls on fetch phase"? Do you get an error? Does test.com exist? Does it perhaps redirect to www.test.com? ... Mathijs On Aug 4, 2012, at 17:11 , Alexei Korolev alexei.koro...@gmail.com wrote: yes On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney

Re: Is it possible to know how long it takes to download an amount of data with nutch.

2012-08-03 Thread Mathijs Homminga
What version of Nutch are you using? On Aug 4, 2012, at 5:36 , isidro isidr...@gmail.com wrote: Hi, Where can I get the content size and the fetch times for each fetched file? Isidro On Thu, Aug 2, 2012 at 11:49 PM, Mathijs Homminga-3 [via Lucene] ml-node+s472066n3998947...@n3

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread Mathijs Homminga
Hi, Looking at the code, it looks like your batchId is null. Not sure how that can happen (since the SolrIndexerJob does check arguments). Have you tried to call the SolrIndexerJob alone (outside the Crawler tool)? Please do so and post command line / nutch config / logs. Cheers, Mathijs On

Re: Nutch output to Solr

2012-07-12 Thread Mathijs Homminga
Hi Jim, I believe indexing is not part of the default crawl loop/process. You have to call the indexing job separately. Mathijs Homminga On Jul 12, 2012, at 17:36, Jim Chandler jamescchand...@gmail.com wrote: Would anyone know why when I'm doing my crawling I don't get any output from nutch

Re: Anyone using the 2.X REST API to retrieve crawl results as JSON

2012-07-11 Thread Mathijs Homminga
Hi Julian, Just to share our experiences with using Nutch 2.0: Indexing in Nutch actually has nothing to do with indexing itself. It just selects some fields from a WebPage, does some very minimal processing (both typically in the indexing filter plugins) and sends the result to a writer.

Re: Problem with NullPointerException on custom Parser

2012-06-28 Thread Mathijs Homminga
! Sent from my iPhone, Mathijs Homminga On Jun 28, 2012, at 0:12, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Of course Mathijs, thank you for the time and the replies, here goes my parse-plugins.xml (as an attachment). Greetings! - Original message - From: Mathijs

Re: Problem with NullPointerException on custom Parser

2012-06-28 Thread Mathijs Homminga
You can use image/(bmp|gif|jpeg|png|tiff) in your plugin.xml; this will cover all/most images. On Jun 28, 2012, at 19:40 , Jorge Luis Betancourt Gonzalez wrote: Hi Julien! Thank you for your explanation. I realize that Tika indeed does a mimetype detection. I was just searching for a way to
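
To illustrate which mimetypes the pattern image/(bmp|gif|jpeg|png|tiff) covers, here is a minimal Python sketch. It only exercises the regular expression itself; whether Nutch applies the pattern as a full match or a partial match is not stated in the thread, so the full match used here is an assumption, and the helper name is made up for this example.

```python
import re

# The pattern quoted in the reply above.
IMAGE_MIME = re.compile(r"image/(bmp|gif|jpeg|png|tiff)")


def is_covered(mime_type: str) -> bool:
    """Return True if the whole mimetype string matches the pattern.

    Assumption: full-string matching. The thread does not say how
    Nutch itself applies the expression from plugin.xml.
    """
    return IMAGE_MIME.fullmatch(mime_type) is not None


for candidate in ("image/png", "image/jpeg", "image/svg+xml", "text/html"):
    print(candidate, is_covered(candidate))
```

Formats outside the five alternatives, such as image/svg+xml, are not matched, which is consistent with the "all/most images" wording.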

Re: Problem with NullPointerException on custom Parser

2012-06-27 Thread Mathijs Homminga
Hmmm looking at the ParserFactory code, there can actually be several causes for a NullPointerException... Can you also send the parse-plugins.xml? Mathijs Homminga On Jun 27, 2012, at 23:23, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: This is the content of my plugin.xml

Re: Linking documents with Nutch+solr

2012-04-03 Thread Mathijs Homminga
Hi Stany, Do you have access to your forum's database? If so, there might be no need to scrape the posts/articles using a crawler like Nutch. You could use Solr as a standalone indexing server which imports data from your database. Solr supports MoreLikeThis queries. Mathijs Homminga On Apr
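
As a rough idea of what a MoreLikeThis request might look like, the Python sketch below builds a query URL using Solr's mlt, mlt.fl and mlt.count parameters. The base URL, the unique key field id, and the field names title and body are assumptions made for this example; they are not taken from the thread and depend on the actual Solr schema.

```python
from urllib.parse import urlencode


def more_like_this_url(base_url, doc_id, fields, count=5):
    """Build a Solr select URL that asks for documents similar to doc_id.

    Assumptions: the unique key field is called "id", and the fields
    passed in exist in the schema and are suitable for similarity.
    """
    params = {
        "q": f"id:{doc_id}",          # the post to find related posts for
        "mlt": "true",                # enable the MoreLikeThis component
        "mlt.fl": ",".join(fields),   # fields used to compute similarity
        "mlt.count": str(count),      # number of similar documents to return
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical local Solr instance and field names.
print(more_like_this_url("http://localhost:8983/solr/select", 42, ["title", "body"]))
```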

Re: Fetching/Indexing process is taking a lot of time

2012-03-27 Thread Mathijs Homminga
Hi George, Just to be sure: Your crawl cycle includes a 'generate', 'fetch' and 'update' step. Is it indeed within the 'fetch' step that this issue occurs? So, _after_ the Fetcher logs the message Fetcher: starting and _before_ the Fetcher logs the message Fetcher: done? If so, it indeed
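
One way to check whether a problem falls inside the fetch step is to keep only the log lines between the two messages mentioned above. This is a minimal Python sketch under that assumption; only the marker strings "Fetcher: starting" and "Fetcher: done" come from the message, and the sample lines are placeholders rather than real Nutch output.

```python
def lines_within_fetch(log_lines,
                       start_marker="Fetcher: starting",
                       end_marker="Fetcher: done"):
    """Return the lines logged after the start marker and before the end marker."""
    inside = False
    selected = []
    for line in log_lines:
        if start_marker in line:
            inside = True       # fetch step begins; skip the marker line itself
            continue
        if end_marker in line:
            inside = False      # fetch step is over
            continue
        if inside:
            selected.append(line)
    return selected


# Placeholder lines, not real Nutch log output.
sample = [
    "placeholder: generate step output",
    "Fetcher: starting",
    "placeholder: line logged during fetch",
    "Fetcher: done",
    "placeholder: update step output",
]
print(lines_within_fetch(sample))
```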

Re: NutchHadoopTutorial Updated

2012-03-20 Thread Mathijs Homminga
This is great work!! Thanks Lewis! I must say that when I read the tutorial it struck me how much of the effort goes into getting Hadoop up and running. It would be great if we could start with: "First, make sure you have a healthy Hadoop cluster running, see here for the Hadoop tutorial" ;-)

Re: NutchHadoopTutorial Updated

2012-03-20 Thread Mathijs Homminga
About the section "Deploy Nutch to Multiple Machines": this is not necessary, right? The job jar should be self-contained and ship with all the configuration files necessary. Nutch should be able to run on any vanilla Hadoop cluster. It does. All you need is a healthy cluster and a Hadoop

Re: Fetching/Indexing process is taking a lot of time

2012-03-19 Thread Mathijs Homminga
Which version of Hadoop are you using? In your script, I see that you have a section called Generate, Fetch, Parse, Update (Step 2 of $steps) - At which of these substeps do you see your problem? For example: (off the top of my head) - The Fetch job has a mapper which does the

Re: Fetching/Indexing process is taking a lot of time

2012-03-18 Thread Mathijs Homminga
Hmmm... First, you say that you use Nutch 9.0, you probably mean Nutch 0.9. That version is almost 5 years old. I really suggest that you update to 1.4. What if you manually move such amounts of data on your disks? Same low speed? (btw, do you really have RAID 1 (mirroring) on 6 disks?) Cheers,

Re: Fetching/Indexing process is taking a lot of time

2012-03-17 Thread Mathijs Homminga
Hi, Your hardware looks okay. Moving data from 30,000 urls takes a week at 500kb/s? That would mean ~10Mb per url. Could that be right? Anyway, can you tell us at what stage your crawl script is when this kicks in? Mathijs On 17 mrt. 2012, at 07:40, George wrote: Hello I'm using nutch
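
The estimate of roughly 10 MB per url can be reproduced with a few lines of Python. The sketch assumes that "500kb/s" means 500 kilobytes per second, that the transfer runs continuously for seven days, and that 1 MB is 1000 kB; none of these is stated explicitly in the message.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800 seconds


def average_size_per_url_kb(rate_kb_per_s, duration_s, url_count):
    """Average amount of data per url, in kilobytes."""
    return rate_kb_per_s * duration_s / url_count


# Figures from the message: 500 kB/s for one week, 30,000 urls.
size_kb = average_size_per_url_kb(500, SECONDS_PER_WEEK, 30_000)
print(f"{size_kb:.0f} kB per url, about {size_kb / 1000:.1f} MB")
```

Under these assumptions the total is about 302 GB, which works out to a little over 10 MB per url.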

Re: Blacklisted Tasktracker / AlreadyBeingCreatedException

2012-03-16 Thread Mathijs Homminga
Hi Rafael, This sounds like a Hadoop DFS issue. Perhaps it's better to post your question to: hdfs-u...@hadoop.apache.org Mathijs On 16 mrt. 2012, at 14:46, Rafael Pappert wrote: Hello, I'm running nutch 1.4 on a 3 Node Hadoop Cluster and from time to time I get an alert that 1

Re: Handling duplicate sub domains

2011-11-27 Thread Mathijs Homminga
Hi Markus, What is your definition of duplicate (sub) domains? By reading your examples, I think you are looking for domains (or host IPs) that are interchangeable. That is, domains that give an identical response when combined with the same protocol, port, path and query (a url). You could
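
The definition of interchangeable domains given above can be sketched as a comparison of response fingerprints. The Python below is only an illustration of that definition, not the approach proposed in the rest of the (truncated) message: the function names, the probe list, the example hosts and the stubbed fetch function are all assumptions made for this example.

```python
import hashlib


def response_fingerprint(body: bytes) -> str:
    """Hash a response body so that two responses can be compared cheaply."""
    return hashlib.sha256(body).hexdigest()


def interchangeable(host_a, host_b, probes, fetch):
    """Return True if both hosts answer every probe with an identical body.

    probes: list of (protocol, port, path_and_query) tuples.
    fetch:  callable taking a url and returning the response body as bytes
            (injected so that the sketch does not depend on a network).
    """
    for protocol, port, path_and_query in probes:
        url_a = f"{protocol}://{host_a}:{port}{path_and_query}"
        url_b = f"{protocol}://{host_b}:{port}{path_and_query}"
        if response_fingerprint(fetch(url_a)) != response_fingerprint(fetch(url_b)):
            return False
    return bool(probes)  # no probes means no evidence, so answer False


# Stubbed responses for two hypothetical hosts.
fake_pages = {
    "http://a.example.com:80/index.html?x=1": b"same content",
    "http://b.example.com:80/index.html?x=1": b"same content",
}
probes = [("http", 80, "/index.html?x=1")]
print(interchangeable("a.example.com", "b.example.com", probes, fake_pages.__getitem__))
```

In practice pages often contain small dynamic differences, so an exact hash comparison is a strict test; a looser similarity measure could be substituted.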

Re: solr and nutch confusion...

2011-11-14 Thread Mathijs Homminga
Hi, First of all, it may depend on the number of urls you are injecting (number of urls in ../data/jf). If this is less than 1000, the first segment will be smaller and depending on the number of outlinks found, the second segment might also be. It can also depend on the maximum number of urls

Re: crawling a subdomain

2011-11-07 Thread Mathijs Homminga
You could write your own simple parse plugin that generates abc.xyz.com/stuff as an outlink of www.xyz.com/stuff, which is then crawled in one of the subsequent crawl cycles. Mathijs Homminga On Nov 7, 2011, at 7:15, Peyman Mohajerian mohaj...@gmail.com wrote: Thanks Sergey, I don't think I
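
The url rewrite that such a plugin would perform can be sketched independently of Nutch. The Python below only shows how abc.xyz.com/stuff could be derived from www.xyz.com/stuff; it is not a Nutch plugin (which would be written in Java against Nutch's extension points), and the function name and the http scheme in the example are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def mirrored_outlink(url, source_host, target_host):
    """Return the same url on target_host, or None if url is not on source_host.

    Any user info in the url is dropped; an explicit port is preserved.
    """
    parts = urlsplit(url)
    if parts.hostname != source_host:
        return None
    netloc = target_host if parts.port is None else f"{target_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(mirrored_outlink("http://www.xyz.com/stuff", "www.xyz.com", "abc.xyz.com"))
# prints http://abc.xyz.com/stuff
```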

Re: Funky duplicate url's, getting much worse!

2010-09-28 Thread Mathijs Homminga
Hi Markus, I remember Nutch had some trouble with honoring the page's BASE tag when resolving relative outlinks. However, I don't see this BASE tag being used in the HTML pages you provide, so that might not be it. Mathijs On Sep 28, 2010, at 18:51 , Markus Jelsma wrote: Anyone? Where is