Re: how to get the depth of url in nutch

2014-08-10 Thread Sebastian Nagel
Hi, > I am thinking of writing a custom indexFilter plugin that returns an empty > document if the parsed content meets the condition above. If null is returned, a document is skipped from indexing. > However, I do not know how to get the depth of a Url. So, I looked into the > scoring-depth plu

Re: How to index the plugin field in nutch with solr?

2014-08-12 Thread Sebastian Nagel
Hi, > except that some fileds in the schma.xml are not indexed to solr. > The fields in " " and " " are indexed > to solr, but other fields, such as the fields in "" , are not. > what is the problem? Or any other work should be do for that? Of course, these plugins must be also activated in pro

Re: [VOTE] Apache Nutch 1.9 Release Candidate #1

2014-08-16 Thread Sebastian Nagel
+1 * src package: compiles, tests pass * bin package: successfully run small test crawl and indexed to Solr On 08/13/2014 07:31 AM, Lewis John Mcgibbney wrote: > Hi user@ & dev@, > > This thread is a VOTE for releasing Apache Nutch 1.9. The release candidate > comprises the following components

Re: Use nutch as a distributed monitoring solution, any idea?

2014-08-16 Thread Sebastian Nagel
Hi, in general, it should be possible to adapt Nutch to this task: 1 inject 100k URLs * fixed fetch interval for each can be defined in seed list: url \t nutchFetchIntervalMDName= 2 generate fetch list(s) * select pages which need to be checked now * partition by host (and/or parser) 3
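
A minimal sketch of such a seed list with a per-URL fetch interval ("\t" stands for a literal TAB; the metadata key nutch.fetchInterval is assumed here to be the Injector default behind nutchFetchIntervalMDName; URLs and intervals are illustrative):

  # seeds.txt - URLs to monitor, each with its own re-fetch interval in seconds
  http://www.example.com/status \t nutch.fetchInterval=3600
  http://www.example.org/health \t nutch.fetchInterval=86400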

Re: java.lang.NullPointerException at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)

2014-08-16 Thread Sebastian Nagel
Hi Steve, does the job file contain the original parse-html from Nutch 1.5.1? I cannot sync the stack with http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=markup (nor with the current trunk / 1.9), e.g. pars

Re: Nutch not crawling all documents in a directory

2014-08-19 Thread Sebastian Nagel
Hi Paul, documents in a directory are first just links. There is a limit on the max. number of links per page. You may guess: the default is 100 :) Increase it, or even set it to -1, see below. Cheers, Sebastian db.max.outlinks.per.page 100 The maximum number of outlinks that we'll proce
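
A minimal override sketch for conf/nutch-site.xml (property name as quoted above; -1 removes the limit; changes made under conf/ need "ant runtime" to reach runtime/local/conf):

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>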

Re: New documents not being added by nutch

2014-08-22 Thread Sebastian Nagel
Hi Paul, > Not sure why nutch is not adding new URL's. Is it because > http://localhost/doccontrol is not the "root" and will only be scanned > again in 30 days time? Every document, even seeds (including "root"), is re-crawled after 30 days by default. > I thought the db.update.additions.allow

Re: jsessionid not being removed from the url

2014-09-22 Thread Sebastian Nagel
> Looks like this should have been removed , is the regex in > regex-normalize.xml correct ? > Yes. It removes various session ids, see src/plugin/urlnormalizer-regex/sample/regex-normalize-default.test Can you give a concrete example of a session id not removed? Which Nutch version is used? Tha

Re: jsessionid not being removed from the url

2014-09-23 Thread Sebastian Nagel
/www.xyz.com/site/hosa-technology-3-5mm-trs-to-1-4-trs-adapter/8561415.p;jsessionid=7936CA95263E9C78B735E5EBE827BDDA.bbolsp-app04-163?id=1208561582654&skuId=8561415&st=categoryid$abcat0207000&cp=1&lp=8 > > > > On Mon, Sep 22, 2014 at 4:12 PM, Sebastian Nagel < >

Re: nutch 1.8 pdf crawl issue

2014-09-29 Thread Sebastian Nagel
Hi, that's caused by a "robots:noindex" in the info dict of the PDF. Tika puts this into the metadata and Nutch then empties title and content. I haven't been aware of this way of excluding non-HTML documents, so we have to check whether this is a bug or not. The intention of authors/creators o

Re: nutch 1.8 pdf crawl issue

2014-09-30 Thread Sebastian Nagel
, A Laxmi wrote: > Hi Sebastian, > > How do we know it has "robots:noindex"? The link I am referring to is - > http://www.fs.fed.us/global/iitf/pubs/ja_iitf_2012_holm001.pdf > > Thanks for your help! > > > > On Mon, Sep 29, 2014 at 5:38 AM, Sebastian Na

Re: propagating injected metadata only to child URLs?

2014-10-07 Thread Sebastian Nagel
Hi, > Having looked at the wiki, NUTCH-655, and NUTCH-855, it seems like using > the urlmeta plugin out of the box would not achieve this, because the > metadata would be propagated to all outlinks (which presumably would > include its parent, et al.). > > Is this correct? If so, is there any buil

Re: problem with language identification in nutch 1.5.1

2014-10-20 Thread Sebastian Nagel
Hi, > If i do parsechecker to http://www.cubadebate.cu/ the output language > is gl but this is not well because the language is spanish. Confirmed also for current trunk with default settings: detected language is "Galician" (gl). Confusion between similar/related languages (e.g., Spanish and

Re: Integrating Nutch search functionality into a Java application

2014-10-24 Thread Sebastian Nagel
Hi, as mentioned on the wiki page: This page is extremely out of date. It is not useful for modern versions of Nutch. Of course, you have first to crawl and index some content. But you should use a recent version of Nutch in combination with Solr or ElasticSearch. Best, Sebastian On 10/16/20

Re: Link original url with the final redirected url

2014-10-27 Thread Sebastian Nagel
Hi Vijay, > When I use segment reader and dump data, I am not able to link the original > url with the redirect > page that is actually fetched. That's a non-trivial but interesting problem. Just a few thoughts, I have no ready solution at hand. Maybe there is one, but I'm unable to get on it.

Re: Nutch 2.X question

2014-11-06 Thread Sebastian Nagel
Hi Amit, in Nutch 2.x there are no segments and there is no LinkDB. All data is held in one single "WebTable". Usually, you want to keep the most recent version of each document (one row in the table). Depending on the storage back-end and its configuration there may be multiple versions stored

Re: Removing Common Web Page Header and Footer from content

2014-11-13 Thread Sebastian Nagel
Hi, exclusion of DOM elements is not (yet) part of the Nutch package (1.9). You need to patch Nutch, see https://issues.apache.org/jira/browse/NUTCH-585 Sebastian 2014-11-12 9:31 GMT+01:00 Jigal van Hemert | alterNET internet BV < ji...@alternet.nl>: > On 11 November 2014 09:12, Moumita Dhar0

Re: problem using nutch 1.9- PKIX validator

2014-11-13 Thread Sebastian Nagel
Hi, protocol-http also supports https with Nutch 1.9 (with some limitations, see NUTCH-1676). Can you try it without httpclient? Thanks, Sebastian 2014-11-11 20:42 GMT+01:00 Eyeris RodrIguez Rueda : > Hello all. > > A few days ago I started using nutch 1.9 but i have a problem tryng to use > p

Re: fetcher.throttle.bandwidth

2014-11-25 Thread Sebastian Nagel
Hi, if it's about a recent Nutch version: there is no such property. (sorry, if it's taken from http://wiki.apache.org/nutch/FetchOptions: this information is really outdated) With Nutch 1.9 the following properties are available which will cause threads to be started and stopped to come close t

Re: fetcher.throttle.bandwidth

2014-11-26 Thread Sebastian Nagel
that page... > > On Tue, Nov 25, 2014 at 12:08 PM, Sebastian Nagel < > wastl.na...@googlemail.com> wrote: > > > Hi, > > > > if it's about a recent Nutch version: there is no such property. > > (sorry, if it's taken from http://wiki.apache.org/nutch

Re: Nutch 1.9 Fetchers Hung

2014-11-28 Thread Sebastian Nagel
Hi Issam, hi Markus, the warning that there are hung threads is shown also in 1.8. With NUTCH-1182 the hung threads are logged (if they are alive): - URL in process / being fetched - with DEBUG logging: stack where thread is hanging If the problem persists, would it be possible to see more contex

Re: 302

2014-11-30 Thread Sebastian Nagel
Hi Murali, > We have set the number of redirection property to 5. By http.redirect.max = 5, right? Just edit $NUTCH_HOME/conf/log4j.properties : log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout Redirects are then logged by Fetcher. Btw., even with http.redirect.max == 0 redirects

Re: questions about nutch 1.9

2014-12-11 Thread Sebastian Nagel
Hi Eyeris, > 1- How i can do a crawl process with solr parameter like in nutch 1.5.1 that > the spider jump this step if i don´t set solr parameter ? Yes, that's possible in recent trunk of 1.x, see NUTCH-1832 (in doubt, it should be possible to update/replace only bin/crawl): Just pass an empty

Re: Nutch 2.3-latest + HBase 0.94.14 build fails

2014-12-14 Thread Sebastian Nagel
Hi, a late response: we finally got the same problem on some of our build machines. Please, follow the thread on dev@nutch: http://mail-archives.apache.org/mod_mbox/nutch-dev/201412.mbox/%3C548CA860.5040808%40googlemail.com%3E Thanks, Sebastian On 11/28/2014 04:45 PM, Little Wing wrote: > Hi, >

Re: Identifying results from two distinct crawls in Nutch 2.2.1

2014-12-17 Thread Sebastian Nagel
Hi, what about the -crawlId option available with all bin/nutch tools (inject, fetch, parse, etc.) and also for bin/crawl? This should start a new table (keyspace, schema, or however it's called) _webpage. Best, Sebastian On 12/16/2014 09:17 PM, Tamer Yousef wrote: > Hi All: > I do have nutc
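
A hedged sketch of the 2.x usage (command syntax as I recall it; verify against the usage output of "bin/nutch inject" and bin/crawl; the ids and paths are illustrative):

  # each crawl id gets its own <crawlId>_webpage table
  bin/nutch inject urls/siteA -crawlId siteA
  bin/nutch inject urls/siteB -crawlId siteB
  # or via the crawl script: <seedDir> <crawlId> <solrUrl> <numberOfRounds>
  bin/crawl urls/siteA siteA http://localhost:8983/solr 2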

Re: nutch 2.2.1 inject error on Windows

2014-12-17 Thread Sebastian Nagel
Hi, this issue (NUTCH-1566) with spaces in paths is already fixed in 1.9, but not in 2.2.1. It will be fixed in 2.3. You can replace bin/nutch in 2.2.1 with the version taken from 2.x trunk (http://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/nutch). Alternatively, apply the fix accor

Re: Build error Nutch 2.2.1

2014-12-22 Thread Sebastian Nagel
Hi, the log messages do not indicate any error: the sonar antlib is only required to run % ant sonar (see https://issues.apache.org/jira/browse/NUTCH-1109) If the build really does not succeed, you'll find the reason more close to the message BUILD FAILED Can you provide more context to locali

Re: Build error Nutch 2.2.1

2014-12-22 Thread Sebastian Nagel
"name\":\"inlinks\",\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}"); > > [javac] > ^ > [javac] > /opt/nutch/apache-nutch-2.2.1/src/java/org/apache/nutch/storage/Host.java:51: > error: cannot fin

Re: nutch 2.2.1 inject error on Windows

2014-12-23 Thread Sebastian Nagel
Hi Hesham, in conversations/threads, please, reply always to the list: you'll get help from other list members, and the discussion may help other users with the same or similar problem (now or later in the list archive). > Can I run Nutch 2.2.1 with Cygwin on windows 8.1 or Windows Server 2012 R2

Re: nutch 2.2.1 inject error on Windows

2014-12-23 Thread Sebastian Nagel
Hi Hesham, if working with Shell scripts on Windows, take care that Unix line breaks are used exclusively. The Bash shell is "sensitive" in this respect. Sebastian On 12/23/2014 02:38 AM, Hesham Hussein wrote: > When I used > > http://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/nutch >

Re: Questions about parse checker and indexing solr with nutch 1.9

2014-12-23 Thread Sebastian Nagel
Hi Steve, > https://issues.apache.org/jira/browse/NUTCH-1076 > Is this the reason indexing isn't working for me when I crawl a file system? Possibly, but at a first glance I would try the current trunk. A lot of issues have been fixed regarding protocol-file, in addition to the redirect issues: N

Re: [VOTE] Release Apache Nutch 2.3

2015-01-11 Thread Sebastian Nagel
+1 - successful small test crawl with HBase 0.94.26 - verified signatures On 01/09/2015 09:58 AM, Lewis John Mcgibbney wrote: > Hi user@ & dev@, > > This thread is a VOTE for releasing Apache Nutch 2.3. > Quite incredibly we addressed 143 issues as per the release report > http://s.apache.org/nu

Re: Proper regex-urlfilter syntax to filter out certain numbers in urls

2015-01-13 Thread Sebastian Nagel
Hi, the regular expression looks good. Which conf/regex-urlfilter.txt has been changed? runtime/local/conf/regex-urlfilter.txt ? If conf/regex-urlfilter.txt is changed you need to run "ant runtime" again to install the configuration changes into runtime/local/conf. For distributed mode you need
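
The edit/rebuild cycle in local mode, as a short sketch (paths as in a standard source checkout):

  vi conf/regex-urlfilter.txt      # edit the source copy
  ant runtime                      # reinstall the configuration into runtime/local/conf
  ls runtime/local/conf/regex-urlfilter.txt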

Re: Parser not returning any results

2015-01-14 Thread Sebastian Nagel
Hi Kartik, I've tried the same URL and parsing worked well with Nutch 1.x (trunk). Which Nutch version is used? The error indicates that the fetch didn't succeed with HTTP status 200 which may happen (it could be a temporary failure). If no failure is indicated in the logs, it's possible to get

Re: [VOTE] Release Apache Nutch 2.3

2015-01-20 Thread Sebastian Nagel
Hi Talat, > - AdaptiveFetchSchedular do not work. In default settings float, it needs > integer. Confirmed, in nutch-default.xml these two properties are defined as floats but read as integers. Configuration.getInt(name) then returns the default value. db.fetch.schedule.adaptive.min_interval

Re: OutlinkExtractor is not considering the relative URLs as outlinks.

2015-01-27 Thread Sebastian Nagel
Hi, > I am trying to crawl the webpages using Nutch-2.1, but I am not getting > relative urls as outlinks when parsing the HTML content of webpage. > Web page is having relative URLs as below : > Is the page HTML? Then outlinks are extracted via markup, e.g. Relative links are always m

Re: Nutch IRI URIs

2015-01-30 Thread Sebastian Nagel
Hi, > that can be done via a URL filter in Nutch, Should be "URL normalizer", right? I did this once by adding rules to regex-normalize.xml. If the URLs are in a certain language with a limited set on non-ASCII letters (that's the case for Turkish), this will result in a dozen of extra rules. B

Re: InvertLinks Performance Nutch 1.6

2015-02-02 Thread Sebastian Nagel
Hi Iain, is the link inversion done with URL normalization/filtering? That could potentially take long if there are many links, probably in combination with complex filters or long URLs (which make the regex filter slow). Filtering/normalization is on by default. You have to disable it explicitly
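
A sketch of disabling both steps during link inversion (the -noNormalize/-noFilter flags are taken from the LinkDb usage as I recall it; verify by running "bin/nutch invertlinks" without arguments; paths are illustrative):

  bin/nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter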

Re: Nutch project

2015-02-08 Thread Sebastian Nagel
Dear Reza Nazarpour, Nutch is an open, community-driven project. That's why I forward this communication to the Nutch mailing list (user@nutch.apache.org). > which is a brilliant piece of work. On behalf of all contributors and volunteers: thank you very much! > without a thorough documen

Re: InvertLinks Performance Nutch 1.6

2015-02-08 Thread Sebastian Nagel
ry 2, 2015 11:36 AM > To: user@nutch.apache.org > Subject: RE: InvertLinks Performance Nutch 1.6 > > Thanks Sebastian -- I had not turned off filtering/normalization and did not > appreciate they could be a significant contribution. I will give that a try. > > -Original Me

Re: How to apply patch for HTTPPostAuthentication

2015-02-10 Thread Sebastian Nagel
Hi Tizy, you mean https://issues.apache.org/jira/browse/NUTCH-827 ? 1. download the latest patch 2. checkout/download the Nutch sources - better use trunk (upcoming 1.10): the patch may not apply cleanly to 1.9 3. apply the patch, see http://wiki.apache.org/nutch/HowToContribute#Applyi

Re: How to crawl specific pages of a website

2015-02-10 Thread Sebastian Nagel
Hi, > So I add the following rule in regex-urlfilter.txt > +^https://thinkarchitect.wordpress.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/*/$ This regex allows https://thinkarchitect.wordpress.com/2015/02/06/ but does not allow https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-
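
One possible corrected rule, as a sketch: drop the trailing "/*/$" so the post slug after the date is also accepted (the default rules of regex-urlfilter.txt still apply after it):

  +^https://thinkarchitect\.wordpress\.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/.*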

Re: domain vs regexurl filter

2015-02-14 Thread Sebastian Nagel
Hi Lex, > Fundamentally these are the same? Both used to limit generated URLs. All kinds of URL filters are the same in this respect: explicitly include or exclude URLs from being crawled/followed. > If I include both in plugins.include will both be used? Yes, both will be used. > And if so in wha

Re: How to crawl specific pages of a website

2015-02-16 Thread Sebastian Nagel
your help, > > I tried to run *bin/nutch org.apache.nutch.net.URLFilterChecker > -allCombined* to test my regex-urlfilter.txt, it take long-time without no > results. > > What should I do? Is there any methods to test my regex in Nutch? > > > On Wed, Feb 11, 2015 at 3:55

[ANNOUNCE] New Nutch committer and PMC - Jorge Luis Betancourt Gonzalez

2015-02-19 Thread Sebastian Nagel
Dear all, on behalf of the Nutch PMC it is my pleasure to announce that Jorge Luis Betancourt Gonzalez has been voted in as committer and member of the Nutch PMC. Jorge, would you mind telling us about yourself, what you've done so far with Nutch, which areas you think you'd like to get involved,

Re: Error SSLHandshakeException Crawling sites with https

2015-02-23 Thread Sebastian Nagel
Alternatively, have a look at this description how to manually add the certificates: http://stackoverflow.com/questions/6659360/how-to-solve-javax-net-ssl-sslhandshakeexception-error On 02/23/2015 05:02 PM, Eyeris RodrIguez Rueda wrote: > Hello Martin. > I think that the problem is with httpclient

Re: custom parser (xpath)

2015-02-25 Thread Sebastian Nagel
Hi Dzmitry, have a look at https://issues.apache.org/jira/browse/NUTCH-1870 Work is ongoing (I'm about to push an improved patch). Help in testing and improving the patches is always welcome! :) It's currently only for 1.x, but plugins are relatively easy to port. Best, Sebastian On 02/

Re: "Not a File" Error on Re-Crawling

2015-03-10 Thread Sebastian Nagel
Hi Slavik, assuming that /user/ubuntu/urls/ contains seed URLs, it should not also contain the CrawlDb. The path in the error message /user/ubuntu/urls/crawldb suggests that Injector tries to read URLs from crawldb which is (a) a directory and (b) contains binary data. Sebastian On 03/10/2015

Re: Problems with redirect handling: redirect count exceeded

2015-03-19 Thread Sebastian Nagel
Hi Marko, even with http.redirect.max == 0 Nutch follows redirects, but they are recorded like ordinary links and fetched in the next round(s). > The first fetch seems to download something, but the second generate job > doesn't appear to produce a new segment, Are the redirect targets accepted by

Re: Redirect exceeded

2015-03-20 Thread Sebastian Nagel
Hi, that's a bug which will be fixed in Nutch 1.10, see https://issues.apache.org/jira/browse/NUTCH-1939 As a work-around it's possible to set http.redirect.max = 0 and to follow redirects in the next cycle. Cheers, Sebastian On 03/20/2015 08:28 PM, Roannel Fernandez Hernandez wrote: > Hello,
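
The work-around as a nutch-site.xml sketch (with this setting, redirect targets are recorded like ordinary links and fetched in the following cycle):

  <property>
    <name>http.redirect.max</name>
    <value>0</value>
  </property>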

Re: Problems with redirect handling: redirect count exceeded

2015-03-20 Thread Sebastian Nagel
See also https://issues.apache.org/jira/browse/NUTCH-1939 (it's a bug in Nutch 1.9) On 03/19/2015 10:10 PM, Sebastian Nagel wrote: > Hi Marko, > > even with > http.redirect.max == 0 > Nutch follows redirect but they are like ordinary links > recorded for fetch in the n

[ANNOUNCE] New Nutch committer and PMC - Mo Omer

2015-03-22 Thread Sebastian Nagel
Dear all, it is my pleasure to announce that Mo Omer has been voted in as committer and member of the Nutch PMC. Mo, would you mind telling us about yourself, what you've done so far with Nutch, which areas you think you'd like to get involved, etc...? Congratulations and welcome on board! Regar

Re: Problem with redirect

2015-03-23 Thread Sebastian Nagel
Hi Jackie, as a work-around you could set http.redirect.max = 0 Nutch will follow redirects then in the next cycle. Best, Sebastian 2015-03-23 12:44 GMT+01:00 Richardson, Jacquelyn F. : > Hi, > > I am having trouble getting Nutch 1.9 to handle redirects. I found a > patch (https://issues.apa

Re: Crawl External Sites to Depth of 1

2015-03-31 Thread Sebastian Nagel
Hi, assuming that the external URLs are not known beforehand, I don't see a simple solution - you need to add a custom scoring filter plugin. If the URLs are known, it's easy: check the property db.ignore.external.links. In Nutch 1.x there is the plugin scoring-depth which allows you to specify a
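
The property mentioned above, as a nutch-site.xml sketch (when set to true, outlinks pointing to other hosts are dropped during parsing/updating):

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>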

Re: Optimize nutch performance.

2015-04-11 Thread Sebastian Nagel
Hi, to ensure politeness by guaranteed intervals between accesses to the same host, all URLs of one single host (or optionally IP address) are placed in one queue which is processed by a single task. The longest queue determines the time required to execute one fetch cycle. If the URLs crawled sp

Re: URL Structure & Rounds/Crawl Depth

2015-04-11 Thread Sebastian Nagel
Hi Scott, cycles/rounds/depth is roughly equivalent to the number of hops/links to reach a document starting from one of the seeds. It has nothing in common with the depth in the server's file system hierarchy. If there is a link from http://www.bizjournals.com/triangle/ to e.g. http://www.bizjo

Re: Mimetype detection for JSON

2015-04-14 Thread Sebastian Nagel
Hi Iain, > I have copied tika-mimetypes.xml from the tika jar file and installed a copy > in my configuration directory. I have updated nutch-site.xml to point to > this file and the log entries indicate that this is being found. ... and the property mime.type.magic is true (default)? > >

Re: Mimetype detection for JSON

2015-04-15 Thread Sebastian Nagel
s Sebastian. > > mime.type.magic is true. > > I don’t have control over the web server, so cannot test with > application/javascript > > Time for some deeper debugging it seems. Will update the list with findings. > > -Original Message- > From: Sebastian Nage

Re: Mimetype detection for JSON

2015-04-16 Thread Sebastian Nagel
e. >>> >>> Can anyone familiar with the Tika implementation tell me if there is a way >>> to update Nutch's MimeUtil.java to instantiate Tika to use the >>> configuration file from Nutch? Or would it be better just to update the >>> configurat

Re: A bug in org.apache.nutch.parse.ParseUtil?

2015-04-17 Thread Sebastian Nagel
Hi Arkadi, agreed, that's a bug. > if ( parseResult != null ) parseResult.filter() ; parseResult.isSuccess() would do the check without modifying the ParseResult. In case the fall-back parsers also fail, it could be useful to return one (the first? the last?) failed ParseResult. Luckily the parse

Re: Help about parsing the title of resources with Nutch 1.9

2015-04-23 Thread Sebastian Nagel
Hi Yulio, in this case Nutch behaves correctly ("politely"): When I run parsechecker I get: Parse Metadata: robots=noindex,nofollow ... because of the meta tags: Because of this robots directive Nutch empties content, title and outlinks of this page. Best, Sebastian On 04/23/2015 07:40 PM
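
To see this for a given page, parsechecker can be run directly (a sketch; the URL is a placeholder). A page carrying <meta name="robots" content="noindex,nofollow"> will show robots=noindex,nofollow in the parse metadata and an empty title, text and outlink list:

  bin/nutch parsechecker -dumpText http://www.example.com/page.html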

[ANNOUNCE] New Nutch committer and PMC - Guiseppe Totaro

2015-04-24 Thread Sebastian Nagel
Dear all, it is my pleasure to announce that Guiseppe Totaro has joined us as committer and member of the Nutch PMC. Congratulations on your new role within the Apache Nutch community! Guiseppe, would you mind telling us about yourself, and what you are doing with Nutch, what you plan to do, etc

Re: [VOTE] Release Apache Nutch 1.10

2015-04-29 Thread Sebastian Nagel
+1 - download bin package - verified signature - run small test crawl (local mode) and index to Solr On 04/29/2015 11:54 PM, Lewis John Mcgibbney wrote: > Hi user@ & dev@,This thread is a VOTE for releasing Apache Nutch 1.10. > The release candidate comprises the following components.* A staging

Re: problem with plugin.includes and indexingfilter.order properties

2015-06-10 Thread Sebastian Nagel
Hi, > I have read that if indexingfilter.order property is empty so the order is > defined by > plugin.includes property but for some reason this is NOT happening(maybe a > bug?). The property plugin.includes is just a regular expression to filter all installed plugins against. It cannot def

Re: A parser failure on a single document may fail crawling job

2015-06-26 Thread Sebastian Nagel
Hi Arkadi, thanks for reporting that. Can you open a Jira ticket [1] to address this bug? It's rather a bug of the plugin parse-tika and should be solved there, cf. https://issues.apache.org/jira/browse/TIKA-1240 A plugin should be able to load all required classes. Thanks, Sebastian [1] https:

Re: Gone content not reported to Solr

2015-07-03 Thread Sebastian Nagel
Hi Steven, > is the ordering of dedup and index wrong No, that's correct: it would not be really efficient to first index duplicates and then remove them afterwards. If I understand right, the db_gone pages have previously been indexed (and were successfully fetched), right? > but "bin/nutch dedu

Re: Gone content not reported to Solr

2015-07-06 Thread Sebastian Nagel
t; > fetcher.server.delay > 0.1 > The number of seconds the fetcher will delay between >successive requests to the same server. Note that this might get >overriden by a Crawl-Delay from a robots.txt and is used ONLY if >fetcher.threads.per.queue is set to 1. > &

Re: Filtering at index time (with a different regex-urlfilter.txt from crawl)

2015-07-10 Thread Sebastian Nagel
Hi Arthur, in principle your approach should work. But like all config files, the indexing URL filter file is loaded from the classpath. An absolute path does not work: ... -Durlfilter.regex.file=/opt/nutch/bin/../conf/regex-urlfilter-index.txt If the file is properly deployed to $NUTCH_HOME/conf/ in l
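
A sketch of a deployment that should work in local mode (file name taken from the command above; the remaining indexer arguments are elided):

  cp regex-urlfilter-index.txt $NUTCH_HOME/conf/
  cd $NUTCH_HOME && ant runtime
  # pass only the resource name so it is resolved from the classpath:
  bin/nutch index -Durlfilter.regex.file=regex-urlfilter-index.txt ...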

Re: Gone content not reported to Solr

2015-07-13 Thread Sebastian Nagel
t; The Queen's Anniversary Prizes 1994, 2002 & 2013 > THE Awards Winners 2007-2013 > > Elite without being elitist > > Follow us on Twitter http://twitter.com/uniofleicester or > visit our Facebook page https://facebook.com/UniofLeicester > > > On Mon, 6 Jul 2015, Sebast

Re: Nutch 1.x Tutorial missing pieces?

2015-07-14 Thread Sebastian Nagel
Hi Sarah, > I got through sections 8.1 and 8.2 and suddenly the tutorial jumps to > “Whole-Web crawling” > and information about very large crawls. you're right, this could be misleading. In fact, there is little difference between crawling a single site or "the whole web", it's merely the seed

Re: Reindexing segments into Solr

2015-07-20 Thread Sebastian Nagel
Hi Arthur, > Any tips on debugging regular expressions against url's would still be handy > though. > Any nice way to take all links and run them through the regex-urlfilter.txt > file > in isolation to see which come out? The easiest way would be to pipe the list of URLs to be checked into the URLFilter
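
A sketch of that (the -allCombined option also appears in the URLFilterChecker thread above; as I recall, accepted URLs are echoed with a leading "+" and rejected ones with "-"):

  cat urls_to_check.txt | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined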

Re: Reindexing segments into Solr

2015-07-21 Thread Sebastian Nagel
e run either of > these tests with this option? > > Thanks, > Arthur. > > On 2015-07-20 20:36, Sebastian Nagel wrote: >> Hi Arthur, >> >>> Any tips on debugging regular expressions against url's would still be >>> handy though. >>

Re: Gone content not reported to Solr

2015-07-23 Thread Sebastian Nagel
.ac.uk > > The Queen's Anniversary Prizes 1994, 2002 & 2013 > THE Awards Winners 2007-2013 > > Elite without being elitist > > Follow us on Twitter http://twitter.com/uniofleicester or > visit our Facebook page https://facebook.com/UniofLeicester > > > On T

Re: A parser failure on a single document may fail crawling job

2015-07-23 Thread Sebastian Nagel
Hi Arkadi, does the problem persist? Which version of Nutch are you using? Can you point to one file or URL to reproduce it? Thanks, Sebastian On 06/26/2015 03:26 PM, Sebastian Nagel wrote: > Hi Arkadi, > > thanks for reporting that. Can you open a Jira ticket [1] to address

Re: Nutch tests from Maven

2015-07-29 Thread Sebastian Nagel
Hi Markus, +1 / why not? It will be rarely used, I guess. And it was surely ok, to remove test classes and dependencies from the "normal" package to make the job file smaller (NUTCH-1803). Maybe the main question is whether to provide a test artifact "officially", or just add a target to publish

Re: Issue when fetching with multiple threads

2015-09-08 Thread Sebastian Nagel
Hi Alex, > Some of the pages on the site requires login. I have enabled > HttpFormAuthentication in the protocal-httpclient plugin. However, looks > like the login page title gets indexed into Solr instead of the actual > page's title. Does this mean that one segment contains multiple records und

[ANNOUNCE] New Nutch committer and PMC - Asitang Mishra

2015-09-09 Thread Sebastian Nagel
Dear all, on behalf of the Nutch PMC it is my pleasure to announce that Asitang Mishra has joined the Nutch team as committer and PMC member. Asitang, please feel free to introduce yourself and to tell the Nutch community about your interests and your relation to Nutch. Congratulations and welcom

Re: Issue when fetching with multiple threads

2015-09-10 Thread Sebastian Nagel
ti-thread fetcher, I meant fetcher.threads.per.queue > 1. (In my > case, I set it to 5). I left fether.parse to the default value (false). > Parsing is done as a separate step after fetching. > > Thanks again for your time. Any further guidance would be greatly > appreciated! >

Re: Compatible Hadoop version with Nutch 1.10

2015-09-14 Thread Sebastian Nagel
Hi, Nutch 1.10 is supposed to run with Hadoop 1.2.0. Nutch 1.11 (to be released soon) will run with 2.4.0, and probably also with newer Hadoop versions. If you need Nutch with a recent Hadoop version right now, you could build it by yourself from trunk. Cheers, Sebastian 2015-09-11 16:14 GMT+02:00 Im

[ANNOUNCE] New Nutch committer and PMC - Sujen Shah

2015-09-15 Thread Sebastian Nagel
Dear all, on behalf of the Nutch PMC it is my pleasure to announce that Sujen Shah has been voted in as committer and member of the Nutch PMC. Sujen, would you mind to introduce yourself to the Nutch community and tell in just a few words about your interests and your plans regarding Nutch? Cong

Re: Tutorial : Index the web with AWS CloudSearch

2015-09-23 Thread Sebastian Nagel
Great! Reads well, straight-forward, and I didn't find any missing detail! Thanks, Julien! 2015-09-23 11:26 GMT+02:00 Julien Nioche : > Hi everyone, > > Just to let you know that we've just published a new tutorial on how to use > Nutch (and StormCrawler) to crawl and index documents into AWS Cl

Re: Regarding whitelist for robots.txt

2015-09-26 Thread Sebastian Nagel
Hi Girish, > in the hadoop.log i see "robots.txt whitelist not configured" This means that the property is somehow not set properly. Shouldn't it be "http.robot.rules.whitelist", see below? Also make sure that the modified nutch-site.xml is deployed. If you modify it in conf/ you have to run "an
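
A nutch-site.xml sketch using the property name suggested above (the value format - a comma-separated list of host names - is an assumption, and the host names are placeholders). Remember to run "ant runtime" after editing conf/:

  <property>
    <name>http.robot.rules.whitelist</name>
    <value>intranet.example.com,staging.example.com</value>
  </property>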

Re: [VOTE] Release Apache Nutch 2.3.1

2015-09-27 Thread Sebastian Nagel
+1 - tests pass - verified signatures - run a test crawl using HBase 0.98.14 The documentation [1] needs to be updated for Gora 0.6.1, right? I also had to copy hbase-common to $NUTCH_HOME/runtime/local/lib/ but that's probably because it's not exactly the same HBase version used by Gora. Sebastian [1

Re: [VOTE] Release Apache Nutch 2.3.1

2015-10-04 Thread Sebastian Nagel
Hi Sherban, > Right now it finds 0 URLs with no errors. Can you specify what's going wrong? It could be anything, even a configuration problem. What did you crawl? Using which storage back-end? Thanks, Sebastian On 10/02/2015 03:02 AM, Drulea, Sherban wrote: > Hi Lewis, > > -1 until I verif

Re: [VOTE] Release Apache Nutch 2.3.1

2015-10-05 Thread Sebastian Nagel
ent.Http - http.content.limit = 65536 > 2015-10-01 18:27:30,292 INFO httpclient.Http - http.agent = nutch Mongo > Solr Crawler/Nutch-2.4-SNAPSHOT > 2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept.language = > en-us,en-gb,en;q=0.7,*;q=0.3 > 2015-10-01 18:27:30,292 INFO ht

Re: OCR images from PDF with Tika

2015-10-08 Thread Sebastian Nagel
Hi, there has been a similar question on the Tika mailing list recently: http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3cdm2pr09mb071346d01729fc9367308e94c7...@dm2pr09mb0713.namprd09.prod.outlook.com%3E If you get Tika to OCR the embedded images, the parse-tika plugin will proba

Re: Apache Nutch Output structure

2015-10-08 Thread Sebastian Nagel
Hi, sorry for the late reply. I've once prepared an overview and also a flow diagram as part of http://www.slideshare.net/sebastian_nagel/aceu2014-snagelwebcrawlingnutch crawl_parse: all crawling-related data from the parsing step used to update CrawlDb: outlinks, scores, signatures, meta data.

Re: OCR images from PDF with Tika

2015-10-09 Thread Sebastian Nagel
Hi, sorry, but I didn't try this by myself, just had in mind that there has been a thread on the Tika mailing list. > What is difference between ./plugins/parse-tika/parse-tika.jar and > ./plugins/parse-tika/tika-parsers-1.8.jar ? parse-tika.jar contains the classes of Nutch's parse-tika plugin

Re: OCR images from PDF with Tika

2015-10-09 Thread Sebastian Nagel
jar Needs some debugging to find out what is wrong. Please, feel free to file a bug report on https://issues.apache.org/jira/browse/NUTCH Thanks, Sebastian On 10/09/2015 06:21 PM, Sebastian Nagel wrote: > Hi, > > sorry, but I didn't try this by myself, just had > in mind that the

Re: [VOTE] Release Apache Nutch 2.3.1

2015-10-15 Thread Sebastian Nagel
tp.proxy.port = 8080 > 2015-10-01 18:27:30,292 INFO httpclient.Http - http.timeout = 1 > 2015-10-01 18:27:30,292 INFO httpclient.Http - http.content.limit = 65536 > 2015-10-01 18:27:30,292 INFO httpclient.Http - http.agent = nutch Mongo > Solr Crawler/Nutch-2.4-SNAPSHOT > 2015-1

Re: Bug: redirected URLs lost on indexing stage?

2015-10-28 Thread Sebastian Nagel
Hi Arkadi, > In my experience, Nutch follows redirects OK (after NUTCH-2124 applied), Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0 > fetches target content, parses and saves it, but loses on the indexing stage. Can you give a concrete example? While testing NUTCH-2

Re: Bug: redirected URLs lost on indexing stage?

2015-11-02 Thread Sebastian Nagel
www.atnf.csiro.au/observers/index.html as seed, it will be > fetched, parsed and indexed successfully even if you set depth to 1. > > Regards, > Arkadi > >> -Original Message- >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] >> Sent: Thursday, 29 Octo

Re: Bug: redirected URLs lost on indexing stage?

2015-11-06 Thread Sebastian Nagel
and crawl a sufficient number of rounds. Cheers, Sebastian On 11/06/2015 05:09 AM, arkadi.kosmy...@csiro.au wrote: > Hi Sebastian, > > I meant #1 and used if http.redirect.max == 3. > > Thanks, > Arkadi > >> -Original Message- >> From: Sebastian Nage

Re: nutch 1.10 crawl fails at indexing with Input path does not exist .../linkdb/current

2015-11-09 Thread Sebastian Nagel
Hi, you're right. This will be fixed in Nutch 1.11. Thanks, Sebastian On 11/09/2015 10:07 PM, Frumpus wrote: > Ok, it seems as though I have run into a version of this problem: > > > [NUTCH-2041] indexer fails if linkdb is missing - ASF JIRA > > | | > | | | | | | > | [NUTCH-2041]

[ANNOUNCE] New Nutch committer and PMC - Michael Joyce

2015-11-10 Thread Sebastian Nagel
Dear all, it is my pleasure to announce that Michael Joyce has joined us as a committer and member of the Nutch PMC. Congratulations on your new role within the Apache Nutch community! And thanks for your contributions and efforts so far, hope to see more! Michael, would you mind telling us about

Re: Nutch doesnt crawl relative links that doesn't start with leading /

2015-11-10 Thread Sebastian Nagel
Hi, Nutch will probably follow the link and fetch test.html prefixed by the base URL. The default is to ignore the '#' and everything after: it's normally a page anchor which must be removed to avoid duplicate content. That's the default. Have a look at https://wiki.apache.org/nutch/AdvancedAj

Re: Nutch - Stop converting & in the url to &amp;

2015-11-10 Thread Sebastian Nagel
Hi, Nutch should convert the &amp; in the href attribute to a bare ampersand and keep it for all succeeding operations. What version of Nutch is used? Are there changes to the default configuration? A trial with a dummy test document on a local Apache httpd: % cat /var/www/test_amp.html test

Re: fetcher.server.delay configuration not working

2015-11-23 Thread Sebastian Nagel
Hi Andrés, hi Roannel, that's correct but the question was why the effective delay is "bigger" than the configured 2.5 sec. Nutch implements the delay as sleeping time after one document has been fetched / before the next document is fetched. The observed 4-5 sec. include the time spent for fetch

Re: Nutch only crawls 2 URLs at a time

2015-12-09 Thread Sebastian Nagel
Hi, > only crawls 2 URLs at a time Sounds like the site has pages from two different hosts (by URL). There are a couple of properties to adjust the load on a single host. Have a look at conf/nutch-default.xml, the property "fetcher.threads.per.queue" and the properties nearby. Cheers, Sebastian
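
A nutch-site.xml sketch of the main knob (the value of 5 is taken from a similar thread on this list; note that more threads per queue trades away politeness towards the server):

  <property>
    <name>fetcher.threads.per.queue</name>
    <value>5</value>
  </property>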

Re: How To Stop Crawling Pages With "Page Redirect Loop"

2015-12-16 Thread Sebastian Nagel
Hi, there is no need for Nutch to detect redirect loops: (A) by default (with http.redirect.max == 0) Nutch just records the redirect targets and fetches them in the next round. The backwards redirect found in the next round is not fetched again because it has already been fetched. (B)
