indexer -nocommit option

2016-05-26 Thread Joseph Naegele
Hi folks, I'm looking for clarification on the index "-nocommit" option. The description says: "do the commits once and for all the reducers in one go (optional)", which sounds unintuitive. The relevant code in IndexerJob.java looks like this: // do the commits once and for all the …
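
The option's effect is easier to see in code. Below is a minimal, self-contained sketch of the semantics using a hypothetical IndexWriter stand-in (not the actual Nutch classes): without -nocommit, the job driver issues one commit per configured writer after all reducers finish, rather than one commit per reducer.

    // Illustrative sketch only; not the verbatim Nutch source.
    import java.util.List;

    public class NoCommitSketch {
        interface IndexWriter { void commit(); }

        // Called once by the job driver after every reducer has finished.
        static void finishJob(List<IndexWriter> writers, boolean noCommit) {
            if (noCommit) {
                return; // leave committing to the caller or the backend's auto-commit
            }
            // "do the commits once and for all the reducers in one go":
            // a single commit per writer at the end of the whole job.
            for (IndexWriter w : writers) {
                w.commit();
            }
        }
    }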

improving distributed indexing performance

2016-06-13 Thread Joseph Naegele
Hi folks, I'm in the process of indexing a large number of docs using Nutch 1.11 and the indexer-elastic plugin. I've observed slow indexing performance and narrowed it down to the map phase and the first part of the reduce phase, which take 80% of the total runtime per segment. Here are some statistics: …

RE: improving distributed indexing performance

2016-06-14 Thread Joseph Naegele
…you could try not to use them for indexing. > LinkDb has long been optional; for the CrawlDb there is > https://issues.apache.org/jira/browse/NUTCH-2184 > > Sebastian > > On 06/13/2016 06:55 PM, Joseph Naegele wrote: > > Hi folks, > > I'm in the process…

RE: improving distributed indexing performance

2016-06-14 Thread Joseph Naegele
…map step: > > Map input records=5115370813 > > Map output records=5115370813 > > Reduce input records=5115370813 > > Reduce output records=2401924 > > That would mean that either your segment contains a large number of > > "unindexable" documents…
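
For scale: taking the quoted counters at face value, only about one reducer input record in 2,100 survives to the index (5,115,370,813 / 2,401,924 ≈ 2,130), which is why the reply suspects a large share of unindexable or filtered-out documents.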

RE: improving distributed indexing performance

2016-06-13 Thread Joseph Naegele
…to use them for indexing. LinkDb has long been optional; for the CrawlDb there is https://issues.apache.org/jira/browse/NUTCH-2184 Sebastian On 06/13/2016 06:55 PM, Joseph Naegele wrote: > Hi folks, > > I'm in the process of indexing a large number of docs using Nutch 1.11 > and the…

RE: indexer -nocommit option

2016-05-27 Thread Joseph Naegele
…Solr, on its side, has configuration that can trigger a commit after every N documents or M milliseconds, or both. Also, IIRC, the Indexer still sends a single commit command at the end of the job regardless, which, I think, is also configurable. On 05/26/2016 07:40 AM, Joseph Naegele wrote: …
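
For reference, the Solr-side behavior described above is the autoCommit block in solrconfig.xml; the values below are illustrative, not recommendations:

    <!-- solrconfig.xml (illustrative values) -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>10000</maxDocs>    <!-- commit after every N documents -->
        <maxTime>60000</maxTime>    <!-- or after M milliseconds, whichever comes first -->
        <openSearcher>false</openSearcher>
      </autoCommit>
    </updateHandler>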

RE: Nutch 2.x for large-scale crawls

2016-06-20 Thread Joseph Naegele
…pp. 4–6, > http://www.usenix.org/event/osdi10/tech/full_papers/Peng.pdf > > [3] > http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html > > On 06/17/2016 03:00 PM, Joseph Naegele wrote: > > Hi folks, > > I am curious as to whether…

Crawling while collecting resources

2016-02-08 Thread Joseph Naegele
My goal is to use Nutch "normally" to crawl, parse, extract links and index said textual content, but with the added goal of fetching and saving *all* resources found at outlinks. It is my understanding that there is no straightforward method for collecting resources this way, i.e. an extension…

ScoringFilters and LinkRank interoperability

2016-02-22 Thread Joseph Naegele
Hi everyone, I have a couple of questions about Nutch's LinkRank tools. The wiki docs for using the WebGraph/LinkRank tools appear to be useful, but I have the following questions: 1. The docs say that, like PageRank, all links start with a common score. Does this mean LinkRank is not…
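
For context, the WebGraph/LinkRank workflow the question refers to is driven from the command line, roughly as below (paths illustrative; see the wiki docs mentioned above for the full option lists):

    bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb
    bin/nutch linkrank -webgraphdb crawl/webgraphdb
    bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb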

RE: Crawling while collecting resources

2016-02-15 Thread Joseph Naegele
…to be parsed and indexed. Thanks, Joe -Original Message- From: Joseph Naegele [mailto:jnaeg...@grierforensics.com] Sent: Monday, February 08, 2016 7:29 PM To: 'user@nutch.apache.org' <user@nutch.apache.org> Subject: Crawling while collecting resources My goal is to use Nutch "normally"…

CSS parser

2016-04-06 Thread Joseph Naegele
Hi everyone, would anyone find a parser for collecting outlinks from CSS (stylesheets) useful? As far as I can tell, Tika doesn't offer this (it looks like Tika 1.12 parses CSS as plain text; correct me if I'm wrong). Modern CSS often contains "url(...)" links to content needed to properly…
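
A parser like the one proposed could start from a simple pattern match. The sketch below is hypothetical (not an existing Nutch plugin) and deliberately naive: it ignores CSS comments, escapes, and @import forms:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CssOutlinkSketch {
        // Matches url(foo.png), url('foo.png'), and url("foo.png")
        private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*['\"]?([^'\")]+)['\"]?\\s*\\)",
                            Pattern.CASE_INSENSITIVE);

        public static List<String> extractUrls(String css) {
            List<String> urls = new ArrayList<>();
            Matcher m = CSS_URL.matcher(css);
            while (m.find()) {
                urls.add(m.group(1).trim());
            }
            return urls;
        }
    }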

collect script tags using parse-tika

2016-04-05 Thread Joseph Naegele
Hi all, I asked this on the Tika user list, but I want to bring it up here as well: The parse-tika plugin is appealing because it offers the ability to use Boilerpipe; however, it doesn't parse…
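
Collecting script tags would typically be done from an HtmlParseFilter, which receives the parsed DOM. The helper below is a hypothetical sketch of the DOM walk such a filter could perform inside its filter(...) method; the Nutch wiring (plugin.xml, Configurable methods) is omitted:

    import java.util.ArrayList;
    import java.util.List;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class ScriptSrcCollector {
        // Collect the src attribute of every <script> element under root.
        public static List<String> collect(Node root) {
            List<String> srcs = new ArrayList<>();
            walk(root, srcs);
            return srcs;
        }

        private static void walk(Node node, List<String> srcs) {
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && "script".equalsIgnoreCase(node.getNodeName())) {
                String src = ((Element) node).getAttribute("src");
                if (!src.isEmpty()) {
                    srcs.add(src);
                }
            }
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                walk(children.item(i), srcs);
            }
        }
    }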

protocol-http or protocol-httpclient?

2016-03-08 Thread Joseph Naegele
I'm using Nutch 1.11. The "plugin.includes" section of nutch-default.xml still states that the protocol-httpclient plugin may present intermittent problems. Is this still the case? What are the problems? There doesn't appear to be any problem crawling HTTPS using the protocol-http plugin. Why do…
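
For reference, switching between the two plugins is a matter of the plugin.includes property in nutch-site.xml; the value below is illustrative (start from the default in your own nutch-default.xml):

    <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>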

Plugin name significant when dependent on other plugins

2016-04-26 Thread Joseph Naegele
Here's an odd one (Nutch 1.11): I haven't tested this with other extension points, but if you extend or depend on the "protocol-http" plugin in a new plugin, the name of the new plugin is significant when ProtocolFactory loads the correct plugin for fetching. In other words: Create a plugin…
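
For readers unfamiliar with the descriptor involved: a protocol plugin declares its id, name, and implementation in its plugin.xml, roughly as in this hypothetical sketch (all ids and class names illustrative); the report above is that ProtocolFactory's lookup is sensitive to more of these fields than one might expect:

    <plugin id="protocol-myhttp" name="My HTTP Protocol"
            version="1.0.0" provider-name="example.org">
      <requires>
        <import plugin="nutch-extensionpoints"/>
        <import plugin="lib-http"/>
      </requires>
      <runtime>
        <library name="protocol-myhttp.jar">
          <export name="*"/>
        </library>
      </runtime>
      <extension id="org.example.protocol.myhttp"
                 name="MyHttpProtocol"
                 point="org.apache.nutch.protocol.Protocol">
        <implementation id="org.example.protocol.myhttp.Http"
                        class="org.example.protocol.myhttp.Http">
          <parameter name="protocolName" value="http"/>
        </implementation>
      </extension>
    </plugin>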

pros/cons of many nodes

2016-05-16 Thread Joseph Naegele
Hi folks, Would anyone be willing to share a few pros/cons of using many nodes vs. 1 very powerful machine for large-scale crawling? Of course many advantages and disadvantages overlap with Hadoop and distributed computing in general, but what I'm actually looking for are good reasons not to…

startUp/shutDown methods for plugins

2016-05-06 Thread Joseph Naegele
Hi folks, I'm using Nutch 1.11. Is it possible to implement plugin instance startUp/shutDown methods for normal extension points? This would allow for cleaning up resources at the end of a plugin instance's lifetime. Thanks, Joe
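
There is no shutDown() in the extension-point interfaces, but one workaround (a hypothetical sketch, not a Nutch API) is a JVM shutdown hook registered when the plugin class loads; note the caveat in the follow-up below that a forcibly killed task JVM may never run it:

    public class PluginCleanupSketch {
        static {
            // Runs at normal JVM exit; skipped on SIGKILL.
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                // close files, flush buffers, release native resources, etc.
                System.out.println("plugin cleanup running");
            }));
        }
    }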

RE: startUp/shutDown methods for plugins

2016-05-10 Thread Joseph Naegele
Markus, you're correct and I admit my use case is a bit exceptional. The one case I thought I needed a "shutdown" hook was when using adaptive threading and/or the fetch bandwidth threshold. When crawling large segments (multi-hour fetches) I assumed that if a slow fetcher is killed it would be…

How to monitor mapreduce Reporter at runtime

2016-04-18 Thread Joseph Naegele
Hi, Is it possible to monitor the Reporter at runtime, maybe via a Hadoop option? I'm specifically looking for the Fetcher's Reporter status (pages/s, bandwidth, etc.). Using Nutch 1.11. Thanks, Joe
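
One hedged suggestion: the Fetcher publishes that status line through the task's reporter, so it appears in the task attempt view of the MapReduce web UI while the job runs; job-level counters can also be polled from the CLI, e.g.:

    mapred job -status <job-id>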

RE: General question about subdomains

2017-02-08 Thread Joseph Naegele
Markus, The example URLs I sent are all the same IP address. This isn't always the case, however, so you're correct that partitioning by IP won't help us. Additionally, we'd like to avoid resolving the IPs of these domains in the first place since most of them resolve to the same IP. We're…

RE: General question about subdomains

2017-02-09 Thread Joseph Naegele
Thanks Markus. I'll put together a list shortly. Is your classifier plugin open-source or available to share? It sounds interesting and very useful. --- Joe Naegele Grier Forensics -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, February 09, …

RE: General question about subdomains

2017-01-13 Thread Joseph Naegele
Markus, Interestingly enough, we do use OpenDNS to filter undesirable content, including parked content. In this case, however, the domain in question isn't tagged in OpenDNS and is therefore "allowed", along with all its subdomains. This particular domain is "hjsjp.com". It's Chinese-owned…

General question about subdomains

2017-01-11 Thread Joseph Naegele
This is more of a general question, not Nutch-specific: Our crawler discovered some URLs pointing to a number of subdomains of a Chinese-owned domain. It then proceeded to discover millions more URLs pointing to other subdomains (hosts) of the same domain. Most of the names appear to be…
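
A common mitigation (illustrative rule, not taken from the thread) is a regex-urlfilter.txt entry that drops the domain and every subdomain in one stroke, using the domain named in the 2017-01-13 follow-up above:

    # drop hjsjp.com and all of its subdomains
    -^https?://([a-zA-Z0-9-]+\.)*hjsjp\.com/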

RE: General question about subdomains

2017-01-13 Thread Joseph Naegele
…the fetch step. HTH, Julien On 11 January 2017 at 14:21, Joseph Naegele <jnaeg...@grierforensics.com> wrote: > This is more of a general question, not Nutch-specific: > > Our crawler discovered some URLs pointing to a number of subdomains of a > Chinese-owned domain…