-sujit
On Feb 22, 2012, at 10:24 AM, Markus Jelsma wrote:
Hi,
We're in the process of testing Solr trunk's cloud features, which recently
include initial work for distributed indexing. With it, there is no longer
any need to do the partitioning client-side because Solr
Unfetched, unparsed, or just a bad or corrupt segment. Remove that segment and try
again.
Many thanks Remi.
Finally, after a reboot of the computer (I sent my question just before
leaving my desk), Nutch started to crawl (amazing :))) )
But now, during the crawl process, I got this:
In that case I suggest using the CrawlDbScanner tool or the new regex feature of
the CrawlDb reader tool in trunk.
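For example, a dump restricted by regex might look like this (a sketch; the exact flag name in trunk is an assumption on my part):

  bin/nutch readdb crawl/crawldb -dump crawldb-dump -regex 'https?://www\.example\.com/.*'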
On Tuesday 28 February 2012 13:04:47 remi tassing wrote:
I think he meant to remove some specific URLs, not everything
On Tue, Feb 28, 2012 at 1:51 PM, Markus Jelsma
markus.jel
Impressive!
On Tue, 28 Feb 2012 20:41:58 -0500, Jason Trost jason.tr...@gmail.com
wrote:
Blog post for anyone who's interested. I cover a basic howto for
getting Nutch to use Apache Gora to store web crawl data in Accumulo.
Let me know if you have any questions.
Accumulo, Nutch, and GORA
Short answer: continue crawling!
When going to crawl a large amount of records I wouldn't encourage you
to use the crawl command. It's better to build a small shell script that
repeats the crawl cycle over and over.
Remember, the depth parameter is nothing more than a crawl cycle
executed
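Something like this (a minimal sketch; paths, the loop count and -topN are example values):

  #!/bin/bash
  # repeat the basic crawl cycle a fixed number of times
  for i in 1 2 3 4 5; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    # pick the newest segment, i.e. the one generate just created
    SEGMENT=crawl/segments/`ls crawl/segments | sort | tail -1`
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
  done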
but I like what it's offering. I got the basic setup working.
I was wondering how we would implement 'Featured link' using Nutch-Solr. I
would like to hear your thoughts.
Thanks in advance.
-Stan
on one machine. I notice that they are conflicting because they all access
/tmp/hadoop-username/mapred.
How do I change the location of this folder?
Do I have to use Hadoop to run multiple crawlers, each specific to a site?
thanks
Jeremy
to:
$NUTCH_DIR/runtime/local/conf/nutch-site.xml
Jeremy
On Thu, Mar 1, 2012 at 7:01 PM, Markus Jelsma wrote:
You can either:
1. run on hadoop
2. not run multiple concurrent jobs on a local machine
3. set a hadoop.tmp.dir per job
4. merge all crawls to a single crawl
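For option 3, a sketch (assuming the job accepts -D properties on the command line; paths are made up):

  bin/nutch generate crawl-siteA/crawldb crawl-siteA/segments \
    -Dhadoop.tmp.dir=/tmp/hadoop-siteA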
wrote:
Hello,
I need to have different fetch intervals for the initial seed URLs and the
URLs extracted from them at depth 1. How can this be achieved? I tried
the -adddays option of the generate command but it seems it cannot be used to
solve this issue.
Thanks in advance.
Alex.
to get results by using a 'starts with' or prefix query,
e.g. return all results where the url starts with http://auto.yahoo.com
[1]
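A prefix-query sketch (assuming url is indexed as a non-tokenized string field):

  q=url:http\://auto.yahoo.com*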
Thanks again!
On Thu, Mar 1, 2012 at 3:59 PM, Markus Jelsma wrote:
Hi
What is a featured link? Maybe Solr's elevator component is what
you are looking for?
cheers
records restricted by status:
generate -Dgenerate.restrict.status=status
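For example (db_unfetched is an assumed status value; paths are examples):

  bin/nutch generate crawl/crawldb crawl/segments -Dgenerate.restrict.status=db_unfetched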
Thanks.
Alex.
-Original Message-
From: Markus Jelsma
To: user
Cc: nutch-user
Sent: Thu, Mar 1, 2012 10:30 pm
Subject: Re: different fetch interval for each depth urls
Well, you could set a new default fetch
this in a larger setup.
thanks!
pvremort
to alter these settings to point to the non-default Hadoop?
Regards,
Dean.
dean.pul...@semantico.com wrote:
Thanks for your reply.
I understand what you've said, but how does Nutch know where the
Hadoop jobtracker is running?
Regards,
Dean.
On 20/03/2012 11:03, Markus Jelsma wrote:
This is not a Nutch thing. A Nutch job, any job, is submitted to the
Hadoop Jobtracker
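The jobtracker address comes from Hadoop's own configuration, typically something like this in mapred-site.xml (host and port are examples):

  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>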
of the great technologies.
We would really appreciate feedback as there will undoubtedly be some
errors or data missing.
Thanks
Lewis
[0] http://wiki.apache.org/nutch/NutchHadoopTutorial
mapred.JobClient: Map output records=5
Regards
Andy
, but nothing happened.
a database that could potentially be locked at any point
in time?
Thanks!
scripting and locking horror and it's an I/O
consumer.
This is not supported by Nutch and there's no issue ticket yet. Feel
free to open one.
On Thu, 22 Mar 2012 14:32:26 -0500, thomas.j.lut...@wellsfargo.com
wrote:
Ran across a posting for the Nutch roadmap mentioning support for the
canonical tag.
be the command to do that?
manually set the recrawl interval or the crawl date, or is there any other
explicit way to make Nutch invalidate a page?
We have 70k+ pages in the index and a full recrawl would take too long.
Thanks
Jan
like a result. How can I skip this row during my crawling? Is it possible to
exclude this row?
thank you in advance
alessio
,
Anastasia
hi,
On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), nutch.bu...@gmail.com
nutch.bu...@gmail.com wrote:
Hi
There are some scenarios of failure in Nutch which I'm not sure how to
handle.
1. I run Nutch on a huge amount of URLs and some kind of OOM exception is
thrown, or one of those cannot
input file.
Any other insights on these issues will be appreciated
Hi,
Recently a reducer got killed because of this. Increasing heap did work
but the next job some days later also failed. I looked at the code and I
cannot seem to find why it would take more than 400MB of RAM to process
the outlinks of a single record. We do limit outlinks so the HashSets pages
this
functionality?
Best regards,
--Anders Rask
www.findwise.com
in order to recrawl sites, then the total number of URLs that are crawled
for one site will not be limited by the generate.max.count parameter.
Am I right?
Best regards,
--Anders Rask
www.findwise.com
On 11 April 2012 17:14, Markus Jelsma
markus.jel...@openindex.io wrote:
Check these properties
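presumably generate.max.count and generate.count.mode; a nutch-site.xml sketch with example values:

  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>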
somewhere? We have had this URL for a long time and it happily passed all
jobs many times before.
On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma
markus.jel...@openindex.io wrote:
Debugging this with a stand-alone Tika would certainly make things
easier. There may be an issue in Tika or even in the parser
implementation itself.
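For example, running the file through the Tika CLI directly (jar version is an example):

  java -jar tika-app-1.1.jar --text large-file.xlsx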
On Wed, 11 Apr 2012 09:37:04 -0700 (PDT), nutch.bu...@gmail.com
nutch.bu...@gmail.com wrote:
I'm running Nutch on a large xlsx file (100-150 MB),
The CrawlDB is not a suitable data source but the WebGraph's NodeDB is.
You could probably write a new MR tool reading the NodeDB and outputting
data in a format such a visualization tool understands.
I think the only real problem would be the size of the data.
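As a starting point, the existing WebGraph tools can already dump node data; a sketch with example paths (flags may differ per version):

  bin/nutch webgraph -segment crawl/segments/20120415123456 -webgraphdb crawl/webgraphdb
  bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper -webgraphdb crawl/webgraphdb \
    -inlinks -topn 1000 -output nodedump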
On Sun, 15 Apr 2012 12:43:57
, but I
don't see a nodedb folder.
Thanks in advance.
Safdar
On Sun, Apr 15, 2012 at 4:17 PM, Markus Jelsma wrote:
This error?
[javac] warning: [path] bad path element
/home/markus/projects/apache/nutch/trunk/build/lib/activation.jar: no
such file or directory
On Sun, 15 Apr 2012 20:42:42 +0100, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi,
Whilst doing some testing on Nutchgora within
an OutlinkDB can make a mess out of itself? Should we enforce uniqueness
in the meantime?
On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma
markus.jel...@openindex.io wrote:
Will provide a patch tomorrow.
https://issues.apache.org/jira/browse/NUTCH-1335
On Mon, 16 Apr 2012 20:19:46 +0200, Markus Jelsma
markus.jel...@openindex.io wrote:
It seems a single URL has about half a million outlinks connected to
it in the OutlinkDB! A pattern of 50 URLs repeats 100,000
On Sat, 21 Apr 2012 17:44:49 -0700 (PDT), benmccann
benjamin.j.mcc...@gmail.com wrote:
Hi,
I have a few questions about getting started. Is there a good tutorial
anywhere?
Questions I have:
* How do I restrict the crawling or saving of pages to only those matching
certain regexes?
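For the regex question, a conf/regex-urlfilter.txt sketch (example.com is a placeholder):

  # allow only pages on example.com
  +^https?://(www\.)?example\.com/
  # skip everything else
  -.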
With the status in the Hadoop web GUI.
I'm doing a local crawl. Does this mean the Hadoop web GUI is unavailable?
Is there any way to check the status of a local crawl? What's the URL for
the Hadoop web GUI?
Thanks!
-Ben
On Sun, Apr 22, 2012 at 7:33 AM, Markus Jelsma wrote:
Hi,
We sometimes see the generator running OOM. This happens because we
either have too high a topN value or too many segments to generate. In
any case, a very large number of records is generated with the same
(lowest) score and they end up in a single reducer. We limit the
generator by
of Nutch info on the web...
http://wiki.apache.org/nutch/
http://wiki.apache.org/nutch/PluginCentral
hth
Lewis
With kind regards,
Roberto Gardenier
Do you have running task trackers and data nodes? Which Nutch job did
you start? Any custom code?
Check the logs of the four Hadoop daemons, there may be something
there.
On Tue, 01 May 2012 16:26:31 +0100, Dean Pullen
dean.pul...@semantico.com wrote:
Hi all,
If this is definitely a
of reducers, or slightly increase the host or domain limit value.
On Thu, 26 Apr 2012 21:02:58 +0200, Markus Jelsma
markus.jel...@openindex.io wrote:
of that command I don't see any keywords or description fields :( just the
usual ones (site, title, content, etc.).
Am I missing something here?
Also let me know if you need more details or my nutch-site.xml
config file...
Regards
to
an indexed document.
From: Markus Jelsma markus.jel...@openindex.io
To: ML mail mlnos...@yahoo.com
Cc: Lewis John Mcgibbney lewis.mcgibb...@gmail.com; user@nutch.apache.org
Sent: Thursday, May 3, 2012 9:32 AM
Subject: Re: Indexing meta tags in Nutch 1.4
Hi,
This is a tough problem indeed. We partially mitigate this problem by
using several regular expressions, LinkRank scores with a domain-limiting
generator for regular crawls, and a second shallow crawl that only follows
links from the home page.
A custom URLFilter as Ferdy explains is a good
html snippet
as a link?
<tr onclick="clickOnLink('http://www.example.com/link');">...</tr>
Thanks,
Mohammad
many segments of ~N records are generated.
Markus Jelsma wrote:
On Mon, 7 May 2012 22:31:43 -0700 (PDT), nutch.buddy@ wrote:
In a previous discussion about handling of failures in Nutch, it was
mentioned that a broken segment cannot be fixed and its URLs should be
re
Hi
Nutch should parse an HTML file with a .txt extension just as a normal
HTML file; at least, here it does. What does your parserchecker say? In
any case you must strip potential left-over HTML in your Solr analyzer;
if left like this it's a bad XSS vulnerability.
Cheers
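For example (URL is a placeholder):

  bin/nutch parsechecker http://www.example.com/page.txt

and for the stripping, a Solr analyzer sketch (one line inside your field type's analyzer):

  <charFilter class="solr.HTMLStripCharFilterFactory"/>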
On Tue, 8 May
/?page=2633pid=1043ELEsite=191;1;db_unfetched;Tue
May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT
1970;0;2592000.0;30.0;500.0;null
Notice the URL starts with an L? (Thus not matching http/https in
another config). Is this some problem with the regex above?
Regards,
Dean Pullen
a custom URL Normalizer to get this to work.
But why? It doesn't seem alright.
On Tue, 08 May 2012 14:46:14 +0200, Markus Jelsma
markus.jel...@openindex.io wrote:
I'm not sure this is going to work as a lowercase flag is used on the
regular expressions.
On Tue, 08 May 2012 13:37:47 +0100, Dean Pullen
that CrawlDB would not allow duplicate links to get inside it?
What link deduplication do you mean? CrawlDB records have a unique key
on the URL.
Regards | Vikas
www.knoldus.com
is mentioned. Tried to upgrade to hadoop-core-0.20.203.0.jar but then this
is thrown:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/commons/configuration/Configuration
Can someone, please, shed some light on this?
Thanks.
Igor
and there is plenty of free space.
All the best,
Igor
On Thu, May 10, 2012 at 10:35 AM, Markus Jelsma wrote:
Plenty of disk space does not mean you have enough room in your
hadoop.tmp.dir which is /tmp by default.
On Thu, 10 May 2012 10:26:00 +0200, Igor Salma wrote:
Hi, Adriana, Sebastian,
We
hi
On Thursday 10 May 2012 15:19:09 Vikas Hazrati wrote:
Hi Markus,
Thanks for your response. My responses inline
On Thu, May 10, 2012 at 12:34 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
hi
On Thu, 10 May 2012 00:26:40 +0530, Vikas Hazrati vi...@knoldus.com
wrote
should upgrade accordingly in trunk.
Thanks
Lewis
On Thu, May 10, 2012 at 1:56 PM, Michael Erickson
erickson.mich...@gmail.com wrote:
On May 10, 2012, at 1:42 AM, Markus Jelsma wrote:
Hi,
On Thu, 10 May 2012 09:10:04 +0300, Tolga to...@ozses.net wrote:
Hi
to
work?
Thanks
Matthias
, it works similarly and uses the same signature algorithm as Nutch.
Please consult the Solr wiki page on deduplication.
Good luck
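A solrconfig.xml sketch of such a chain, per the Solr wiki (field names are examples):

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">content</str>
      <str name="signatureClass">solr.processor.TextProfileSignature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>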
On Thu, 10 May 2012 22:54:37 +0300, Tolga to...@ozses.net wrote:
Hi Markus,
On 05/10/2012 09:42 AM, Markus Jelsma wrote:
Hi,
On Thu, 10 May 2012 09:10:04 +0300
, Markus Jelsma wrote:
thanks
This is a known issue:
https://issues.apache.org/jira/browse/NUTCH-1100
I have not been able to find the bug nor do I know how to reproduce it
from scratch. If you have a public site with which we can reproduce it,
please comment on the Jira ticket. Make sure you use
mode.
Also, I want some URLs filtered by my urlfilter to be stored in an
external flat file. How can I achieve this?
Thanks & Regards,
Vijith V
in fact it uses much less memory than it can.
Any idea?
Yes.
On Tuesday 15 May 2012 12:45:28 Taeseong Kim wrote:
Is whole-web content download possible, including Flash, images, CSS and
JavaScript?
you for your help.
/11/12 9:40 AM, Markus Jelsma wrote:
Ah, that means: don't use the crawl command and do a little shell
scripting to execute the separate crawl cycle commands; see the Nutch
wiki for examples. And don't do solrdedup. Search the Solr wiki for
deduplication.
cheers
On Fri, 11 May 2012 07
?
Matthias
On Thu, May 10, 2012 at 8:39 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
By default each crawl is iterative. The crawl command is nothing more
than a wrapper around the individual crawl cycle commands. The depth
parameter is nothing
-Original message-
From:Matthias Paul magethle.nu...@gmail.com
Sent: Fri 18-May-2012 14:57
To: user@nutch.apache.org
Subject: Exclude certain mime-types
How can I exclude certain mime-types from crawling, for example Word documents?
If I have parse-tika in plugin.includes it
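One way is to skip such URLs before they are fetched at all, e.g. in conf/regex-urlfilter.txt (extensions are examples; note this matches on the URL, not the detected mime-type):

  -(?i)\.(doc|docx|ppt|xls)$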
] Apache Nutch 1.5 release rc #1
When will Nutch 1.5 be released?
Matthias
On Wed, Apr 18, 2012 at 1:46 PM, Bharat Goyal bharat.go...@shiksha.com
wrote:
+1
On Monday 16 April 2012 12:34 PM, Markus Jelsma wrote:
+1
On Mon, 16 Apr 2012 05:43:22 +, Mattmann, Chris
Yes, you can pass ParseMeta keys to the FetchSchedule as part of the
CrawlDatum's metadata, as I did with:
https://issues.apache.org/jira/browse/NUTCH-1024
-Original message-
From:Vikas Hazrati vi...@knoldus.com
Sent: Mon 21-May-2012 13:44
To: user@nutch.apache.org
Subject:
Hi
Which version do you use? It should list the troubling URL. What's the stack
trace?
Cheers
-Original message-
From:Ing. Eyeris Rodriguez Rueda eru...@uci.cu
Sent: Mon 21-May-2012 17:07
To: user@nutch.apache.org
Subject: error parsing some xml
Hi all.
When I try to crawl
)
- Original Message -
From: Markus Jelsma markus.jel...@openindex.io
To: user@nutch.apache.org
Sent: Monday, 21 May 2012 11:41:40
Subject: RE: error parsing some xml
Hi
Which version do you use? It should list the troubling URL
Please read the description.
-Original message-
From:Tolga to...@ozses.net
Sent: Tue 22-May-2012 11:37
To: user@nutch.apache.org
Subject: Re: PDF not crawled/indexed
What is that value's unit? Kilobytes? My PDF file is 4.7 MB.
On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
-Original message-
From:Bai Shen baishen.li...@gmail.com
Sent: Tue 22-May-2012 19:40
To: user@nutch.apache.org
Subject: URL filtering and normalization
Somehow my crawler started fetching youtube. I'm not really sure why as I
have db.ignore.external.links set to true.
Weird!
Great!
My +1 for a new release based on the state of the codebase.
-Original message-
From:Julien Nioche lists.digitalpeb...@gmail.com
Sent: Tue 22-May-2012 22:19
To: d...@nutch.apache.org
Cc: user@nutch.apache.org
Subject: Re: Apache Nutch release 1.5 RC2
Read
Hi,
Yes, this is no problem.
Cheers
-Original message-
From:Dustine Rene Bernasor dust...@thecyberguardian.com
Sent: Thu 24-May-2012 12:58
To: user@nutch.apache.org
Subject: Multiple nutch jobs on a Hadoop cluster simultaneously
Hello
I was wondering, would it be possible to
Hi,
That's a patch for the fetcher. The error you are seeing is quite simple
actually. Because you set those two link.ignore parameters to true, no links
between the same domain and host are aggregated, only links from/to external
hosts and domains. This is a good setting for wide web crawls.
and
link.ignore.limit.domain to false and the link.ignore.internal.xxx can be
set to true? Or should I just set all of the link.ignore.xxx.xxx values
to false?
On 5/29/2012 4:43 PM, Markus Jelsma wrote:
Hi,
That's a patch for the fetcher. The error you are seeing is quite simple
actually. Because you set
<value>com.custom.CustomEventFetchScheduler</value>
</property>
How do I include my custom logic so that it gets picked up as part of the
crawl cycle?
Regards | Vikas
On Mon, May 21, 2012 at 6:14 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
Yes, you can pass ParseMeta keys to the FetchSchedule as part
Hi,
The generator can only do it the other way around, via the addDays parameter.
To make it work your way you can modify the generator to restrict it to
documents younger than 48 hours.
Cheers
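For reference, the other direction looks like this (2 days is an example value):

  bin/nutch generate crawl/crawldb crawl/segments -adddays 2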
-Original message-
From:Shameema Umer shem...@gmail.com
Sent: Mon 04-Jun-2012 08:33
To: