: finished at 2012-08-07 16:01:40, elapsed: 00:00:02
But when seed.txt has www.test.com instead of test.com, the second launch of the
crawler script finds the next segment for fetching.
On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga
mathijs.hommi...@kalooga.com wrote:
What do you mean exactly with it falls on fetch phase?
Do you get an error?
Does test.com exist?
Does it perhaps redirect to www.test.com?
...
Mathijs
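The redirect question above can be made concrete with a small sketch: if the server 301-redirects test.com to www.test.com, the crawler effectively fetches a different host than the one injected, which would explain why the two seeds behave differently. The redirect map below is simulated data standing in for a real DNS/HTTP lookup; the host names are just the placeholders from the thread.

```python
def resolve_redirects(url, redirects, max_hops=5):
    """Follow a chain of redirects (given as a dict) and return the final URL."""
    seen = set()
    for _ in range(max_hops):
        if url not in redirects or url in seen:
            return url
        seen.add(url)
        url = redirects[url]
    return url

# Hypothetical redirect, as asked about above:
redirects = {"http://test.com/": "http://www.test.com/"}
resolve_redirects("http://test.com/", redirects)  # → "http://www.test.com/"
```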
On Aug 4, 2012, at 17:11 , Alexei Korolev alexei.koro...@gmail.com wrote:
yes
On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney
What version of Nutch are you using?
On Aug 4, 2012, at 5:36 , isidro isidr...@gmail.com wrote:
Hi,
Where can I get the content size and the fetch times for each fetched file ?
Isidro
On Thu, Aug 2, 2012 at 11:49 PM, Mathijs Homminga-3 [via Lucene]
ml-node+s472066n3998947...@n3
Hi,
Looking at the code, it looks like your batchId is null. Not sure how that can
happen (since the SolrIndexerJob does check arguments).
Have you tried to call the SolrIndexerJob alone (outside the Crawler tool)?
Please do so and post commandline / nutch config / logs.
Cheers,
Mathijs
On
Hi Jim,
I believe indexing is not part of the default crawl loop/process. You have to
call the indexing job separately.
Mathijs Homminga
On Jul 12, 2012, at 17:36, Jim Chandler jamescchand...@gmail.com wrote:
Would anyone know why when I'm doing my crawling I don't get any output
from nutch
Hi Julian,
Just to share our experiences with using Nutch 2.0:
Indexing in Nutch actually has nothing to do with indexing itself. It just
selects some fields from a WebPage, does some very minimal processing (both
typically in the indexing filter plugins) and sends the result to a writer.
Sent from my iPhone,
Mathijs Homminga
On Jun 28, 2012, at 0:12, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
wrote:
Of course Mathijs, thank you for the time and the replies, here goes my
parse-plugins.xml (as an attachment).
Greetings!
- Original Message -
From: Mathijs
You can use:
image/(bmp|gif|jpeg|png|tiff)
in your plugin.xml, this will cover all/most images.
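As a stand-alone illustration of which MIME type strings that pattern covers (how Nutch applies the pattern in plugin.xml may differ in detail):

```python
import re

# The pattern suggested above, matched against full MIME type strings.
IMAGE_TYPES = re.compile(r"image/(bmp|gif|jpeg|png|tiff)")

assert IMAGE_TYPES.fullmatch("image/png")
assert IMAGE_TYPES.fullmatch("image/jpeg")
assert not IMAGE_TYPES.fullmatch("image/svg+xml")  # not in the alternation
```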
On Jun 28, 2012, at 19:40 , Jorge Luis Betancourt Gonzalez wrote:
Hi Julien!
Thank you for your explanation. I realize that Tika indeed does mimetype
detection. I was just searching for a way to
Hmmm looking at the ParserFactory code, there can actually be several causes
for a NullPointerException...
Can you also send the parse-plugins.xml?
Mathijs Homminga
On Jun 27, 2012, at 23:23, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
wrote:
This is the content of my plugin.xml
Hi Stany,
Do you have access to your forum's database? If so, there might be no need to
scrape the posts/articles using a crawler like Nutch. You could use Solr as a
stand alone indexing server which imports data from your database.
Solr supports MoreLikeThis queries.
Mathijs Homminga
On Apr
Hi George,
Just to be sure:
Your crawl cycle includes a 'generate', 'fetch' and 'update' step. Is it indeed
within the 'fetch' step that this issue occurs?
So, _after_ the Fetcher logs the message "Fetcher: starting" and _before_ the
Fetcher logs the message "Fetcher: done"?
If so, it indeed
This is great work!! Thanks Lewis!
I must say that when I read the tutorial it struck me how much of the effort
goes into getting Hadoop up and running.
It would be great if we could start with:
First, make sure you have a healthy Hadoop cluster running, see here for the
Hadoop tutorial ;-)
About the section Deploy Nutch to Multiple Machines: this is not
necessary, right? The job jar should be self-contained and ship with all
the configuration files necessary. Nutch should be able to run on any
vanilla Hadoop cluster.
It does. All you need is a healthy cluster and a Hadoop
Which version of Hadoop are you using?
In your script, I see that you have a section called Generate, Fetch,
Parse, Update (Step 2 of $steps) -
At which of these sub steps do you see your problem?
For example: (from the top of my head)
- The Fetch job has a mapper which does the
Hmmm...
First, you say that you use Nutch 9.0, you probably mean Nutch 0.9. That
version is almost 5 years old. I really suggest that you update to 1.4.
What if you manually move such amounts of data on your disks? Same low speed?
(btw, do you really have RAID 1 (mirroring) on 6 disks?)
Cheers,
Hi,
Your hardware looks okay.
Moving data from 30,000 URLs takes a week at 500 KB/s?
That would mean ~10 MB per URL. Could that be right?
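A quick back-of-the-envelope check of those numbers:

```python
# 30,000 URLs fetched in one week at 500 KB/s works out to roughly 10 MB each.
urls = 30_000
rate_bytes_per_s = 500 * 1024           # 500 KB/s
seconds_per_week = 7 * 24 * 3600
total_bytes = rate_bytes_per_s * seconds_per_week
per_url_mb = total_bytes / urls / (1024 * 1024)
print(round(per_url_mb, 1))  # → 9.8
```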
Anyway, can you tell us at what stage your crawl script is when this kicks in?
Mathijs
On 17 mrt. 2012, at 07:40, George wrote:
Hello
I'm using Nutch
Hi Rafael,
This sounds like a Hadoop DFS issue. Perhaps it's better to post your question
to:
hdfs-u...@hadoop.apache.org
Mathijs
On 16 mrt. 2012, at 14:46, Rafael Pappert wrote:
Hello,
I'm running Nutch 1.4 on a 3-node Hadoop cluster and from time to
time I get an alert that 1
Hi Markus,
What is your definition of duplicate (sub) domains?
Reading your examples, I think you are looking for domains (or host IPs)
that are interchangeable.
That is, domains that give identical response when combined with the same
protocol, port, path and query (a url).
You could
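One possible approach, sketched here over toy data rather than real crawl output: treat two hosts as interchangeable when every path+query crawled on both returned identical content. The `fetched` dict below is an assumed stand-in for fetch results.

```python
from collections import defaultdict

# Toy crawl results: (host, path+query) -> page content.
fetched = {
    ("example.com",     "/a?x=1"): "page-A",
    ("www.example.com", "/a?x=1"): "page-A",
    ("example.com",     "/b"):     "page-B",
    ("www.example.com", "/b"):     "page-B",
    ("other.com",       "/a?x=1"): "different",
}

# Index content by path, then by host.
by_path = defaultdict(dict)
for (host, path), content in fetched.items():
    by_path[path][host] = content

def interchangeable(h1, h2):
    """True if every path crawled on both hosts returned identical content."""
    shared = [p for p in by_path if h1 in by_path[p] and h2 in by_path[p]]
    return bool(shared) and all(by_path[p][h1] == by_path[p][h2] for p in shared)

interchangeable("example.com", "www.example.com")  # → True
interchangeable("example.com", "other.com")        # → False
```

In practice you would compare content hashes rather than raw pages, and tolerate small differences (timestamps, ads).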
Hi,
First of all, it may depend on the number of urls you are injecting (number of
urls in ../data/jf).
If this is less than 1000, the first segment will be smaller and depending on
the number of outlinks found, the second segment might also be.
It can also depend on the maximum number of urls
You could write your own simple parse plugin that generates abc.xyz.com/stuff
as an outlink of www.xyz.com/stuff, which is then crawled in (one of the)
subsequent crawl cycles.
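The URL rewrite such a plugin would perform can be sketched like this (real Nutch parse plugins are written in Java; this only illustrates the transformation, with `abc` as a placeholder subdomain):

```python
from urllib.parse import urlsplit, urlunsplit

def subdomain_outlink(url, subdomain="abc"):
    """For www.xyz.com/stuff, emit abc.xyz.com/stuff as an extra outlink."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = subdomain + "." + host[len("www."):]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

subdomain_outlink("http://www.xyz.com/stuff")  # → "http://abc.xyz.com/stuff"
```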
Mathijs Homminga
On Nov 7, 2011, at 7:15, Peyman Mohajerian mohaj...@gmail.com wrote:
Thanks Sergey,
I don't think I
Hi Marcus,
I remember Nutch had some troubles with honoring the page's BASE tag when
resolving relative outlinks.
However, I don't see this BASE tag being used in the HTML pages you provided,
so that might not be it.
Mathijs
On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:
Anyone? Where is