Hi,
> I am thinking of writing a custom indexFilter plugin that returns an empty
> document if the parsed content meets the condition above.
If null is returned, the document is skipped from indexing.
> However, I do not know how to get the depth of a URL. So, I looked into the
> scoring-depth plugin
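A minimal sketch of such an indexing filter (the class name and the skip
condition are made up for illustration):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.parse.Parse;

  public class SkipIndexingFilter implements IndexingFilter {
    private Configuration conf;

    // Returning null removes the document from indexing.
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) throws IndexingException {
      if (parse.getText().trim().isEmpty()) { // example condition
        return null;
      }
      return doc;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }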
Hi,
> except that some fields in the schema.xml are not indexed to Solr.
> The fields in " " and " " are indexed
> to Solr, but other fields, such as the fields in "", are not.
> What is the problem? Or does any other work need to be done for that?
Of course, these plugins must also be activated in pro
+1
* src package: compiles, tests pass
* bin package: successfully run small test crawl and indexed to Solr
On 08/13/2014 07:31 AM, Lewis John Mcgibbney wrote:
> Hi user@ & dev@,
>
> This thread is a VOTE for releasing Apache Nutch 1.9. The release candidate
> comprises the following components
Hi,
in general, it should be possible to adapt Nutch to this task:
1 inject 100k URLs
* fixed fetch interval for each can be defined in seed list (see example below):
url \t nutchFetchIntervalMDName=
2 generate fetch list(s)
* select pages which need to be checked now
* partition by host (and/or parser)
3
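For illustration, a seed list line with a fixed fetch interval (step 1) could
look like this (tab-separated; the metadata key is the value of
nutchFetchIntervalMDName in Injector, assumed here to be nutch.fetchInterval,
interval in seconds):
  http://www.example.com/ \t nutch.fetchInterval=86400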
Hi Steve,
does the job file contain the original parse-html from Nutch 1.5.1?
I cannot sync the stack with
http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=markup
(nor with the current trunk / 1.9), e.g. pars
Hi Paul,
documents in a directory are first just links.
There is a limit on the max. number of links per page.
You may guess: the default is 100 :)
Increase it, or even set it to -1, see below.
Cheers,
Sebastian
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  </description>
</property>
Hi Paul,
> Not sure why nutch is not adding new URL's. Is it because
> http://localhost/doccontrol is not the "root" and will only be scanned
> again in 30 days time?
Every document, even seeds (including "root"), is re-crawled after 30 days
per default.
> I thought the db.update.additions.allow
> Looks like this should have been removed , is the regex in
> regex-normalize.xml correct ?
>
Yes. It removes various session ids, see
src/plugin/urlnormalizer-regex/sample/regex-normalize-default.test
Can you give a concrete example of a session id not removed?
Which Nutch version is used?
Tha
/www.xyz.com/site/hosa-technology-3-5mm-trs-to-1-4-trs-adapter/8561415.p;jsessionid=7936CA95263E9C78B735E5EBE827BDDA.bbolsp-app04-163?id=1208561582654&skuId=8561415&st=categoryid$abcat0207000&cp=1&lp=8
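For a URL like the one above, if the default rules don't catch it, a custom
rule in regex-normalize.xml could look like this (sketch, pattern not tested):
  <regex>
    <pattern>(?i);jsessionid=[a-zA-Z0-9.\-]+</pattern>
    <substitution></substitution>
  </regex>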
>
>
>
> On Mon, Sep 22, 2014 at 4:12 PM, Sebastian Nagel <
>
Hi,
that's caused by a "robots:noindex" in the info dict of the PDF.
Tika puts this into the metadata
and Nutch then empties title and content.
I haven't been aware of this way of excluding non-HTML documents,
so we have to check whether this is a bug or not.
The intention of authors/creators o
, A Laxmi wrote:
> Hi Sebastian,
>
> How do we know it has "robots:noindex"? The link I am referring to is -
> http://www.fs.fed.us/global/iitf/pubs/ja_iitf_2012_holm001.pdf
>
> Thanks for your help!
>
>
>
> On Mon, Sep 29, 2014 at 5:38 AM, Sebastian Na
Hi,
> Having looked at the wiki, NUTCH-655, and NUTCH-855, it seems like using
> the urlmeta plugin out of the box would not achieve this, because the
> metadata would be propagated to all outlinks (which presumably would
> include its parent, et al.).
>
> Is this correct? If so, is there any buil
Hi,
> If I do parsechecker on http://www.cubadebate.cu/ the output language
> is gl, but this is not right because the language is Spanish.
Confirmed also for current trunk with default settings: detected language is
"Galician" (gl).
Confusion between similar/related languages (e.g., Spanish and
Hi,
as mentioned on the wiki page:
This page is extremely out of date. It is not useful for modern versions of
Nutch.
Of course, you have first to crawl and index some content.
But you should use a recent version of Nutch in combination
with Solr or ElasticSearch.
Best,
Sebastian
On 10/16/20
Hi Vijay,
> When I use segment reader and dump data, I am not able to link the original
> url with the redirect
> page that is actually fetched.
That's a non-trivial but interesting problem. Just a few thoughts,
I have no ready solution at hand. Maybe there is one, but I can't
come up with it right now.
Hi Amit,
in Nutch 2.x there are no segments and there is no LinkDB.
All data is held in one single "WebTable".
Usually, you want to keep the most recent version
of each document (one row in the table).
Depending on the storage back-end and its configuration
there may be multiple versions stored
Hi,
exclusion of DOM elements is not (yet) part of the Nutch
package (1.9). You need to patch Nutch, see
https://issues.apache.org/jira/browse/NUTCH-585
Sebastian
2014-11-12 9:31 GMT+01:00 Jigal van Hemert | alterNET internet BV <
ji...@alternet.nl>:
> On 11 November 2014 09:12, Moumita Dhar0
Hi,
protocol-http also supports https with Nutch 1.9
(with some limitations, see NUTCH-1676).
Can you try it without httpclient?
Thanks,
Sebastian
2014-11-11 20:42 GMT+01:00 Eyeris Rodríguez Rueda :
> Hello all.
>
> A few days ago I started using nutch 1.9 but i have a problem tryng to use
>
> A few days ago I started using Nutch 1.9 but I have a problem trying to use
> p
Hi,
if it's about a recent Nutch version: there is no such property.
(sorry, if it's taken from http://wiki.apache.org/nutch/FetchOptions:
this information is really outdated)
With Nutch 1.9 the following properties are available
which will cause threads to be started and stopped
to come close t
that page...
>
> On Tue, Nov 25, 2014 at 12:08 PM, Sebastian Nagel <
> wastl.na...@googlemail.com> wrote:
>
> > Hi,
> >
> > if it's about a recent Nutch version: there is no such property.
> > (sorry, if it's taken from http://wiki.apache.org/nutch
Hi Issam, hi Markus,
the warning that there are hung threads is shown also in 1.8.
With NUTCH-1182 the hung threads are logged (if they are alive):
- URL in process / being fetched
- with DEBUG logging: stack where thread is hanging
If the problem persists, would it be possible to see more
contex
Hi Murali,
> We have set the number of redirection property to 5.
By http.redirect.max = 5, right?
Just edit $NUTCH_HOME/conf/log4j.properties :
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
Redirects are then logged by Fetcher.
Btw., even with http.redirect.max == 0 redirects
Hi Eyeris,
> 1- How can I do a crawl process with a Solr parameter like in Nutch 1.5.1, so
> that the crawler skips this step if I don't set the Solr parameter?
Yes, that's possible in recent trunk of 1.x, see NUTCH-1832
(in doubt, it should be possible to update/replace only bin/crawl):
Just pass an empty
Hi,
a late response: we finally got the same problem on some of our build machines.
Please, follow the thread on dev@nutch:
http://mail-archives.apache.org/mod_mbox/nutch-dev/201412.mbox/%3C548CA860.5040808%40googlemail.com%3E
Thanks,
Sebastian
On 11/28/2014 04:45 PM, Little Wing wrote:
> Hi,
>
Hi,
what about the -crawlId option available with all bin/nutch
tools (inject, fetch, parse, etc.) and also for bin/crawl?
This should start a new table (keyspace, schema, or however it's called)
named <crawlId>_webpage.
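For example (the id "testcrawl" is made up):
  % bin/nutch inject urls/ -crawlId testcrawl
This should create/use the table "testcrawl_webpage".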
Best,
Sebastian
On 12/16/2014 09:17 PM, Tamer Yousef wrote:
> Hi All:
> I do have nutc
Hi,
this issue (NUTCH-1566) with spaces in paths
is already fixed in 1.9, but not in 2.2.1.
It will be fixed in 2.3.
Possibly you can replace bin/nutch in 2.2.1 with the
version taken from 2.x trunk
(http://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/nutch).
Alternatively, apply the fix accor
Hi,
the log messages do not indicate any error:
the sonar antlib is only required to run
% ant sonar
(see https://issues.apache.org/jira/browse/NUTCH-1109)
If the build really does not succeed, you'll find the
reason closer to the message
BUILD FAILED
Can you provide more context to localize the problem?
"name\":\"inlinks\",\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}");
>
> [javac]
> ^
> [javac]
> /opt/nutch/apache-nutch-2.2.1/src/java/org/apache/nutch/storage/Host.java:51:
> error: cannot fin
Hi Hesham,
in conversations/threads, please always reply to the list:
you'll get help from other list members, and the discussion
may help other users with the same or similar problem (now
or later in the list archive).
> Can I run Nutch 2.2.1 with Cygwin on windows 8.1 or Windows Server 2012 R2
Hi Hesham,
if working with Shell scripts on Windows,
take care that Unix line breaks are used
exclusively. The Bash shell is "sensitive"
in this respect.
Sebastian
On 12/23/2014 02:38 AM, Hesham Hussein wrote:
> When I used
>
> http://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/nutch
>
Hi Steve,
> https://issues.apache.org/jira/browse/NUTCH-1076
> Is this the reason indexing isn't working for me when I crawl a file system?
Possibly, but at first glance I would try the current trunk.
A lot of issues have been fixed regarding protocol-file,
in addition to the redirect issues: N
+1
- successful small test crawl with HBase 0.94.26
- verified signatures
On 01/09/2015 09:58 AM, Lewis John Mcgibbney wrote:
> Hi user@ & dev@,
>
> This thread is a VOTE for releasing Apache Nutch 2.3.
> Quite incredibly we addressed 143 issues as per the release report
> http://s.apache.org/nu
Hi,
the regular expression looks good.
Which conf/regex-urlfilter.txt has been changed?
runtime/local/conf/regex-urlfilter.txt ?
If
conf/regex-urlfilter.txt is changed
you need to run "ant runtime" again
to install the configuration changes
into runtime/local/conf.
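I.e., after editing $NUTCH_HOME/conf/regex-urlfilter.txt:
  % cd $NUTCH_HOME
  % ant runtime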
For distributed mode you need
Hi Kartik,
I've tried the same URL and parsing worked well with Nutch 1.x (trunk).
Which Nutch version is used?
The error indicates that the fetch didn't succeed with HTTP status 200
which may happen (it could be a temporary failure).
If no failure is indicated in the logs, it's possible
to get
Hi Talat,
> - AdaptiveFetchSchedule does not work. The default settings are floats, but it
> needs integers.
Confirmed, in nutch-default.xml these two properties are defined as floats
but read as integers. Configuration.getInt(name) then returns the default value.
db.fetch.schedule.adaptive.min_interval
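As a work-around you can set integer values in nutch-site.xml, e.g.
(the value is just an example):
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>60</value>
  </property>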
Hi,
> I am trying to crawl the webpages using Nutch-2.1, but I am not getting
> relative urls as outlinks when parsing the HTML content of webpage.
> Web page is having relative URLs as below :
>
Is the page HTML? Then outlinks are extracted via markup, e.g. from <a href="..."> elements.
Relative links are always m
Hi,
> that can be done via a URL filter in Nutch,
Should be "URL normalizer", right?
I did this once by adding rules to regex-normalize.xml.
If the URLs are in a certain language with a limited set
of non-ASCII letters (that's the case for Turkish),
this will result in a dozen extra rules.
B
Hi Iain,
Is the link inversion done with URL normalization/filtering?
That could potentially take long if there are many links,
probably in combination with complex filters or long URLs
(which make the regex filter slow).
Filtering/normalization is on per default.
You have to disable it explicitly
Dear Reza Nazarpour,
Nutch is an open, community-driven project.
That's why I forward this communication
to the Nutch mailing list (user@nutch.apache.org).
> which is a brilliant piece of work.
On behalf of all contributors and volunteers: thank you very much!
> without a thorough documen
ry 2, 2015 11:36 AM
> To: user@nutch.apache.org
> Subject: RE: InvertLinks Performance Nutch 1.6
>
> Thanks Sebastian -- I had not turned off filtering/normalization and did not
> appreciate they could be a significant contribution. I will give that a try.
>
> -Original Me
Hi Tizy,
you mean https://issues.apache.org/jira/browse/NUTCH-827 ?
1. download the latest patch
2. checkout/download the Nutch sources
- better use trunk (upcoming 1.10):
the patch may not apply cleanly to 1.9
3. apply the patch, see
http://wiki.apache.org/nutch/HowToContribute#Applyi
Hi,
> So I add the following rule in regex-urlfilter.txt
> +^https://thinkarchitect.wordpress.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/*/$
This regex allows
https://thinkarchitect.wordpress.com/2015/02/06/
but does not allow
https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-
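The /*/ at the end only matches a sequence of slashes. A rule which also
accepts the final path segment could look like this (sketch, not tested):
+^https://thinkarchitect.wordpress.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/.+$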
Hi Lex,
> Fundamentally these are the same? Both used to limit generated URLs.
All kinds of URL filters are the same in this point: explicitly include
or exclude URLs from being crawled/followed.
> If I include both in plugins.include will both be used?
Yes, both will be used.
> And if so in wha
your help,
>
> I tried to run *bin/nutch org.apache.nutch.net.URLFilterChecker
> -allCombined* to test my regex-urlfilter.txt, it takes a long time with no
> results.
>
> What should I do? Is there any methods to test my regex in Nutch?
>
>
> On Wed, Feb 11, 2015 at 3:55
Dear all,
on behalf of the Nutch PMC it is my pleasure to announce that
Jorge Luis Betancourt Gonzalez has been voted in as committer
and member of the Nutch PMC. Jorge, would you mind telling us
about yourself, what you've done so far with Nutch, which areas
you think you'd like to get involved,
Alternatively, have a look at this description
how to manually add the certificates:
http://stackoverflow.com/questions/6659360/how-to-solve-javax-net-ssl-sslhandshakeexception-error
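For reference, importing a certificate into the JVM trust store looks roughly
like this (alias and file name are examples, "changeit" is the default store
password):
  % keytool -import -trustcacerts -alias myhost -file myhost.crt \
      -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit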
On 02/23/2015 05:02 PM, Eyeris RodrIguez Rueda wrote:
> Hello Martin.
> I think that the problem is with httpclient
Hi Dzmitry,
have a look at
https://issues.apache.org/jira/browse/NUTCH-1870
Work is ongoing (I'm about to push an improved patch).
Help in testing and improving the patches is always welcome! :)
It's currently only for 1.x, but plugins are relatively easy
to port.
Best,
Sebastian
On 02/
Hi Slavik,
assuming that
/user/ubuntu/urls/
contains the seed URLs, it should not also contain the CrawlDb.
The path in the error message
/user/ubuntu/urls/crawldb
suggests that Injector tries to read URLs from crawldb
which is (a) a directory and (b) contains binary data.
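A layout that should work (paths are examples):
  /user/ubuntu/urls/seeds.txt   (plain-text seed list only)
  /user/ubuntu/crawldb          (CrawlDb outside the seed directory)
  % bin/nutch inject /user/ubuntu/crawldb /user/ubuntu/urls/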
Sebastian
On 03/10/2015
Hi Marko,
even with
http.redirect.max == 0
Nutch follows redirects, but they are recorded like ordinary links
and fetched in the next round(s).
> The first fetch seems to download something, but the second generate job
> doesn't appear to produce a new segment,
Are the redirect targets accepted by
Hi,
that's a bug which will be fixed in Nutch 1.10, see
https://issues.apache.org/jira/browse/NUTCH-1939
As a work-around it's possible to set
http.redirect.max = 0
and to follow redirects in the next cycle.
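I.e., in nutch-site.xml:
  <property>
    <name>http.redirect.max</name>
    <value>0</value>
  </property>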
Cheers,
Sebastian
On 03/20/2015 08:28 PM, Roannel Fernandez Hernandez wrote:
> Hello,
See also https://issues.apache.org/jira/browse/NUTCH-1939
(it's a bug in Nutch 1.9)
On 03/19/2015 10:10 PM, Sebastian Nagel wrote:
> Hi Marko,
>
> even with
> http.redirect.max == 0
> Nutch follows redirects but they are like ordinary links
> recorded for fetch in the n
Dear all,
it is my pleasure to announce that Mo Omer has been voted in
as committer and member of the Nutch PMC. Mo, would you mind
telling us about yourself, what you've done so far with Nutch,
which areas you think you'd like to get involved, etc...?
Congratulations and welcome on board!
Regar
Hi Jackie,
as a work-around you could set
http.redirect.max = 0
Nutch will follow redirects then in the next cycle.
Best,
Sebastian
2015-03-23 12:44 GMT+01:00 Richardson, Jacquelyn F. :
> Hi,
>
> I am having trouble getting Nutch 1.9 to handle redirects. I found a
> patch (https://issues.apa
Hi,
assuming that the external URLs are not known beforehand,
I don't see a simple solution - you need to add a custom
scoring filter plugin. If the URLs are known it's easy:
check the property db.ignore.external.links.
In Nutch 1.x there is the plugin scoring-depth which
allows you to specify a
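For illustration, the maximum depth is configured via a property like this
(the value is just an example):
  <property>
    <name>scoring.depth.max</name>
    <value>3</value>
  </property>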
Hi,
to ensure politeness by guaranteed intervals between accesses
to the same host, all URLs of one single host (or optionally IP address)
are placed in one queue which is processed by a single task.
The longest queue determines the time required to execute one
fetch cycle.
If the URLs crawled sp
Hi Scott,
cycles/rounds/depth is roughly equivalent to the number of hops/links to reach
a document starting from one of the seeds. It has nothing in common with the
depth in the server's file system hierarchy. If there is a link from
http://www.bizjournals.com/triangle/
to e.g.
http://www.bizjo
Hi Iain,
> I have copied tika-mimetypes.xml from the tika jar file and installed a copy
> in my configuration directory. I have updated nutch-site.xml to point to
> this file and the log entries indicate that this is being found.
... and the property mime.type.magic is true (default)?
>
>
s Sebastian.
>
> mime.type.magic is true.
>
> I don’t have control over the web server, so cannot test with
> application/javascript
>
> Time for some deeper debugging it seems. Will update the list with findings.
>
> -Original Message-
> From: Sebastian Nage
e.
>>>
>>> Can anyone familiar with the Tika implementation tell me if there is a way
>>> to update Nutch's MimeUtil.java to instantiate Tika to use the
>>> configuration file from Nutch? Or would it be better just to update the
>>> configurat
Hi Arkadi,
agreed that's a bug.
> if ( parseResult != null ) parseResult.filter() ;
parseResult.isSuccess()
would do the check without modifying the ParseResult
In case the fall-back parsers also fail, it could be useful to
return one (the first? the last?) failed ParseResult. Luckily the parse
Hi Yulio,
in this case Nutch behaves correctly ("politely"):
When I run parsechecker I get:
Parse Metadata: robots=noindex,nofollow ...
because the page contains a robots meta tag ("noindex,nofollow").
Because of this robots directive Nutch empties content, title
and outlinks of this page.
Best,
Sebastian
On 04/23/2015 07:40 PM
Dear all,
it is my pleasure to announce that Giuseppe Totaro has joined us
as committer and member of the Nutch PMC. Congratulations on your
new role within the Apache Nutch community!
Giuseppe, would you mind telling us about yourself, and what you
are doing with Nutch, what you plan to do, etc
+1
- download bin package
- verified signature
- run small test crawl (local mode) and index to Solr
On 04/29/2015 11:54 PM, Lewis John Mcgibbney wrote:
> Hi user@ & dev@,This thread is a VOTE for releasing Apache Nutch 1.10.
> The release candidate comprises the following components.* A staging
Hi,
> I have read that if indexingfilter.order property is empty so the order is
> defined by
> plugin.includes property but for some reason this is NOT happening(maybe a
> bug?).
The property plugin.includes is just a regular expression to filter all
installed
plugins against. It cannot def
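If you need a fixed order, you can list the filter classes explicitly, e.g.
(class names taken from the example in nutch-default.xml):
  <property>
    <name>indexingfilter.order</name>
    <value>org.apache.nutch.indexer.basic.BasicIndexingFilter
           org.apache.nutch.indexer.more.MoreIndexingFilter</value>
  </property>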
Hi Arkadi,
thanks for reporting that. Can you open a Jira ticket [1] to address this
bug?
It's rather a bug of the plugin parse-tika and should be solved there,
cf. https://issues.apache.org/jira/browse/TIKA-1240
A plugin should be able to load all required classes.
Thanks,
Sebastian
[1] https:
Hi Steven,
> is the ordering of dedup and index wrong
No, that's correct: it would not be really efficient to first index duplicates
and then remove them afterwards.
If I understand right the db_gone pages have previously been indexed
(and were successfully fetched), right?
> but "bin/nutch dedu
>
> fetcher.server.delay
> 0.1
> The number of seconds the fetcher will delay between
>successive requests to the same server. Note that this might get
>overriden by a Crawl-Delay from a robots.txt and is used ONLY if
>fetcher.threads.per.queue is set to 1.
>
>
Hi Arthur,
principally your approach should work. But, like all config files, the
indexing URL filter file is loaded from the classpath. An absolute path
does not work:
...
-Durlfilter.regex.file=/opt/nutch/bin/../conf/regex-urlfilter-index.txt
If the file is properly deployed to $NUTCH_HOME/conf/ in l
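E.g. (file name is an example):
  % cp regex-urlfilter-index.txt $NUTCH_HOME/conf/
  % ant runtime
and then pass only the file name, which is resolved on the classpath:
  -Durlfilter.regex.file=regex-urlfilter-index.txt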
> The Queen's Anniversary Prizes 1994, 2002 & 2013
> THE Awards Winners 2007-2013
>
> Elite without being elitist
>
> Follow us on Twitter http://twitter.com/uniofleicester or
> visit our Facebook page https://facebook.com/UniofLeicester
>
>
> On Mon, 6 Jul 2015, Sebast
Hi Sarah,
> I got through sections 8.1 and 8.2 and suddenly the tutorial jumps to
> “Whole-Web crawling”
> and information about very large crawls.
you're right, this could be misleading.
In fact, there is little difference between crawling a single site or "the
whole web",
it's merely the seed
Hi Arthur,
> Any tips on debugging regular expressions against url's would still be handy
> though.
> Any nice way to take all links and run them through the regex-urlfilter.txt
> file
> in isolation to see which come out?
The easiest way would be to pipe the list of URLs to be checked to the
URLFilterChecker
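E.g.:
  % cat urls.txt | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
Accepted URLs should be printed with a '+' prefix, rejected ones with '-'.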
e run either of
> these tests with this option?
>
> Thanks,
> Arthur.
>
> On 2015-07-20 20:36, Sebastian Nagel wrote:
>> Hi Arthur,
>>
>>> Any tips on debugging regular expressions against url's would still be
>>> handy though.
>>
.ac.uk
>
> The Queen's Anniversary Prizes 1994, 2002 & 2013
> THE Awards Winners 2007-2013
>
> Elite without being elitist
>
> Follow us on Twitter http://twitter.com/uniofleicester or
> visit our Facebook page https://facebook.com/UniofLeicester
>
>
> On T
Hi Arkadi,
does the problem persist?
Which version of Nutch are you using?
Can you point to one file or URL to reproduce it?
Thanks,
Sebastian
On 06/26/2015 03:26 PM, Sebastian Nagel wrote:
> Hi Arkadi,
>
> thanks for reporting that. Can you open a Jira ticket [1] to address
Hi Markus,
+1 / why not? It will be rarely used, I guess.
And it was surely OK to remove test classes
and dependencies from the "normal" package
to make the job file smaller (NUTCH-1803).
Maybe the main question is whether to
provide a test artifact "officially", or just
add a target to publish
Hi Alex,
> Some of the pages on the site require login. I have enabled
> HttpFormAuthentication in the protocol-httpclient plugin. However, it looks
> like the login page title gets indexed into Solr instead of the actual
> page's title.
Does this mean that one segment contains multiple records und
Dear all,
on behalf of the Nutch PMC it is my pleasure to announce
that Asitang Mishra has joined the Nutch team as committer
and PMC member. Asitang, please feel free to introduce
yourself and to tell the Nutch community about your
interests and your relation to Nutch.
Congratulations and welcom
ti-thread fetcher, I meant fetcher.threads.per.queue > 1. (In my
> case, I set it to 5). I left fetcher.parse to the default value (false).
> Parsing is done as a separate step after fetching.
>
> Thanks again for your time. Any further guidance would be greatly
> appreciated!
>
Hi,
Nutch 1.10 is supposed to run with Hadoop 1.2.0.
1.11 (to be released soon) will run with 2.4.0,
and probably also with newer Hadoop versions.
If you need Nutch with a recent Hadoop version
right now, you could build it by yourself from trunk.
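E.g.:
  % svn checkout http://svn.apache.org/repos/asf/nutch/trunk/ nutch-trunk
  % cd nutch-trunk
  % ant runtime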
Cheers,
Sebastian
2015-09-11 16:14 GMT+02:00 Im
Dear all,
on behalf of the Nutch PMC it is my pleasure to announce
that Sujen Shah has been voted in as committer and member
of the Nutch PMC. Sujen, would you mind introducing
yourself to the Nutch community and telling us in just a few
words about your interests and your plans regarding Nutch?
Cong
Great! Reads well, straightforward, and I didn't find any missing detail!
Thanks, Julien!
2015-09-23 11:26 GMT+02:00 Julien Nioche :
> Hi everyone,
>
> Just to let you know that we've just published a new tutorial on how to use
> Nutch (and StormCrawler) to crawl and index documents into AWS Cl
Hi Girish,
> in the hadoop.log I see "robots.txt whitelist not configured"
This means that the property is somehow not set properly.
Shouldn't it be "http.robot.rules.whitelist" (see below)?
Also make sure that the modified nutch-site.xml is deployed.
If you modify it in conf/ you have to run "ant runtime".
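For illustration (hostnames are examples, the value is a comma-separated
list):
  <property>
    <name>http.robot.rules.whitelist</name>
    <value>host1.example.com,host2.example.com</value>
  </property>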
+1
- tests pass
- verified signatures
- run a test crawl using HBase 0.98.14
The documentation [1] needs to be updated for Gora 0.6.1, right?
I also had to copy hbase-common to $NUTCH_HOME/runtime/local/lib/
but that's probably because it's not exactly the same HBase version used by Gora.
Sebastian
[1
Hi Sherban,
> Right now it finds 0 URLs with no errors.
Can you specify what's going wrong? It could
be anything, even a configuration problem.
What did you crawl? Using which storage back-end?
Thanks,
Sebastian
On 10/02/2015 03:02 AM, Drulea, Sherban wrote:
> Hi Lewis,
>
> -1 until I verif
ent.Http - http.content.limit = 65536
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.agent = nutch Mongo
> Solr Crawler/Nutch-2.4-SNAPSHOT
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2015-10-01 18:27:30,292 INFO ht
Hi,
there has been a similar question on the Tika mailing list recently:
http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3cdm2pr09mb071346d01729fc9367308e94c7...@dm2pr09mb0713.namprd09.prod.outlook.com%3E
If you get Tika to OCR the embedded images, the parse-tika
plugin will proba
Hi,
sorry for the late reply.
I've once prepared an overview and also a flow diagram as part of
http://www.slideshare.net/sebastian_nagel/aceu2014-snagelwebcrawlingnutch
crawl_parse: all crawling-related data from the parsing step used to update
CrawlDb:
outlinks, scores, signatures, meta data.
Hi,
sorry, but I didn't try this myself, I just
remembered that there has been a thread on the Tika
mailing list.
> What is difference between ./plugins/parse-tika/parse-tika.jar and
> ./plugins/parse-tika/tika-parsers-1.8.jar ?
parse-tika.jar contains the classes of Nutch's parse-tika plugin
jar
Needs some debugging to find out what is wrong.
Please, feel free to file a bug report on
https://issues.apache.org/jira/browse/NUTCH
Thanks,
Sebastian
On 10/09/2015 06:21 PM, Sebastian Nagel wrote:
> Hi,
>
> sorry, but I didn't try this by myself, just had
> in mind that the
tp.proxy.port = 8080
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.timeout = 1
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.content.limit = 65536
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.agent = nutch Mongo
> Solr Crawler/Nutch-2.4-SNAPSHOT
> 2015-1
Hi Arkadi,
> In my experience, Nutch follows redirects OK (after NUTCH-2124 applied),
Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0
> fetches target content, parses and saves it, but loses on the indexing stage.
Can you give a concrete example?
While testing NUTCH-2
www.atnf.csiro.au/observers/index.html as seed, it will be
> fetched, parsed and indexed successfully even if you set depth to 1.
>
> Regards,
> Arkadi
>
>> -Original Message-
>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>> Sent: Thursday, 29 Octo
and crawl
a sufficient number of rounds.
Cheers,
Sebastian
On 11/06/2015 05:09 AM, arkadi.kosmy...@csiro.au wrote:
> Hi Sebastian,
>
> I meant #1 and used http.redirect.max == 3.
>
> Thanks,
> Arkadi
>
>> -Original Message-
>> From: Sebastian Nage
Hi,
you're right. This will be fixed in Nutch 1.11.
Thanks,
Sebastian
On 11/09/2015 10:07 PM, Frumpus wrote:
> Ok, it seems as though I have run into a version of this problem:
>
>
> [NUTCH-2041] indexer fails if linkdb is missing - ASF JIRA
>
Dear all,
it is my pleasure to announce that Michael Joyce has joined us
as a committer and member of the Nutch PMC. Congratulations on your
new role within the Apache Nutch community! And thanks for
your contributions and efforts so far, hope to see more!
Michael, would you mind telling us about
Hi,
Nutch will probably follow the link and fetch test.html
prefixed by the base URL.
The default is to ignore the '#' and everything after it:
it's normally a page anchor which must be removed
to avoid duplicate content.
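E.g. (URL made up): a link to
  http://www.example.com/doc.html#chapter2
is normalized to
  http://www.example.com/doc.html
before fetching.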
That's the default. Have a look at
https://wiki.apache.org/nutch/AdvancedAj
Hi,
Nutch should convert the &amp; in the href attribute
to a bare ampersand and keep it for all succeeding
operations.
What version of Nutch is used?
Are there changes to the default configuration?
Trial with a dummy test document on a local Apache httpd:
% cat /var/www/test_amp.html
test
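A minimal test document with an encoded ampersand could look like this
(reconstruction, the original markup is not shown above):
  <html><body><a href="http://localhost/page.html?a=1&amp;b=2">test</a></body></html>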
Hi Andrés, hi Roannel,
that's correct but the question was why the effective
delay is "bigger" than the configured 2.5 sec.
Nutch implements the delay as sleeping time after
one document has been fetched / before the next
document is fetched. The observed 4-5 sec. include
the time spent for fetch
Hi,
> only crawls 2 URLs at a time
Sounds like the site has pages from two different hosts
(by URL). There are a couple of properties to adjust
the load on a single host. Have a look at conf/nutch-default.xml,
the property "fetcher.threads.per.queue" and the properties
nearby.
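E.g., to fetch two URLs from the same host in parallel (values are examples,
mind politeness; fetcher.server.min.delay applies when more than one thread
per queue is used):
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>
  <property>
    <name>fetcher.server.min.delay</name>
    <value>1.0</value>
  </property>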
Cheers,
Sebastian
Hi,
there is no need for Nutch to detect redirect loops:
(A) per default (with http.redirect.max == 0) Nutch just records the
redirect targets
and fetches them in the next round. The backward redirect found in
the next round is not fetched again because it has already been fetched.
(B)