Hi folks, I'm looking for clarification on the index "-nocommit" option:
The description says: "do the commits once and for all the reducers in one
go (optional)", which sounds unintuitive. The relevant code in
IndexerJob.java looks like this:
// do the commits once and for all the reducers in one go
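If I'm reading it right, the lines that follow are roughly this (a
paraphrased sketch, not the verbatim source; "noCommit" and "writers" are my
approximations of the actual identifiers):

    if (!noCommit) {
        writers.commit();
    }

So "the commits" appears to mean the single commit the indexer sends to the
backend once all reducers have finished; "-nocommit" simply suppresses that
final commit and leaves commit timing to the backend.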
Hi folks,
I'm in the process of indexing a large number of docs using Nutch 1.11 and
the indexer-elastic plugin. I've observed slow indexing performance and
narrowed it down to the map phase and first part of the reduce phase taking
80% of the total runtime per segment. Here are some statistics:
You could try not to use them for indexing.
> LinkDb has been optional for a long time; for the CrawlDb there is
> https://issues.apache.org/jira/browse/NUTCH-2184
>
> Sebastian
>
> On 06/13/2016 06:55 PM, Joseph Naegele wrote:
> > Hi folks,
> >
> > I'm in the process of indexing a large number of docs using Nutch 1.11
> > and the indexer-elastic plugin.
map step:
> > Map input records=5115370813
> > Map output records=5115370813
> > Reduce input records=5115370813
> > Reduce output records=2401924
> >
> > That would mean that either your segment contains a large number of
> > "unindexable" documents
Solr, on its side, has configuration that can trigger a commit after every N
documents, after M milliseconds, or both. Also, IIRC, the Indexer still sends
a single commit command at the end of the job regardless, which, I think, is
also configurable.
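The Solr side of that is the autoCommit block in solrconfig.xml; a minimal
sketch (the values are placeholders, not recommendations):

    <autoCommit>
      <maxDocs>10000</maxDocs>            <!-- commit after every N documents -->
      <maxTime>60000</maxTime>            <!-- commit after M milliseconds -->
      <openSearcher>false</openSearcher>  <!-- don't reopen the searcher on each commit -->
    </autoCommit>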
On 05/26/2016 07:40 AM, Joseph Naegele wrote:
> pp. 4–6, http://www.usenix.org/event/osdi10/tech/full_papers/Peng.pdf
>
> [3]
> http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
>
>
> On 06/17/2016 03:00 PM, Joseph Naegele wrote:
> > Hi folks,
> >
> > I am curious as to whether
My goal is to use Nutch "normally" to crawl, parse, extract links and index
said textual content but with the added goal of fetching and saving *all*
resources found at outlinks. It is my understanding that there is no
straightforward method for collecting resources this way, i.e. an extension
Hi everyone,
I have a couple questions about Nutch's LinkRank tools. The wiki docs for
using the WebGraph/LinkRank tools appear to be useful but I have the
following questions:
1. The docs say, like PageRank, all links start with a common score.
Does this mean LinkRank is not
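For context, the "common score" refers to the PageRank-style setup both tools
share: every node starts with the same value and is then updated iteratively,
roughly as below (d is the damping factor; whether Nutch's LinkRank also
normalizes by the number of nodes is exactly the part I'd like confirmed):

    score(u) = (1 - d) + d * sum over links v -> u of [ score(v) / outdegree(v) ]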
to be parsed and indexed.
Thanks,
Joe
-----Original Message-----
From: Joseph Naegele [mailto:jnaeg...@grierforensics.com]
Sent: Monday, February 08, 2016 7:29 PM
To: 'user@nutch.apache.org' <user@nutch.apache.org>
Subject: Crawling while collecting resources
My goal is to use Nutch "normal
Hi everyone,
Would anyone find useful a parser for collecting outlinks from CSS
(stylesheets)?
As far as I can tell Tika doesn't offer this (it looks like Tika 1.12 parses
CSS as plain text; correct me if I'm wrong). Modern CSS often contains
"url(...)" links to content needed to properly render a page.
Hi all,
I asked this on the Tika user list, but I want to bring it up here as well:
The parse-tika plugin is appealing because it offers the ability to use
Boilerpipe, however it doesn't parse
I'm using Nutch 1.11. The "plugin.includes" section of nutch-default.xml
still states that the protocol-httpclient plugin may present intermittent
problems. Is this still the case? What are the problems?
There doesn't appear to be any problem crawling HTTPS using the
protocol-http plugin. Why do
Here's an odd one (Nutch 1.11):
I haven't tested this with other extension points, but if you extend or
depend on the "protocol-http" plugin in a new plugin, the name of the new
plugin is significant when ProtocolFactory loads the correct plugin for
fetching.
In other words:
Create a plugin
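For concreteness, here is roughly what the plugin.xml of such a plugin looks
like ("protocol-foo" and the Foo class are made-up names for illustration):

    <plugin id="protocol-foo" name="protocol-foo" version="1.0.0" provider-name="example.org">
      <runtime>
        <library name="protocol-foo.jar">
          <export name="*"/>
        </library>
      </runtime>
      <requires>
        <import plugin="nutch-extensionpoints"/>
        <import plugin="protocol-http"/>
      </requires>
      <extension id="org.example.nutch.protocol.foo"
                 name="FooProtocol"
                 point="org.apache.nutch.protocol.Protocol">
        <implementation id="org.example.nutch.protocol.foo.Foo"
                        class="org.example.nutch.protocol.foo.Foo"/>
      </extension>
    </plugin>

The surprising part is that whether ProtocolFactory hands http:// URLs to Foo
or to the original HttpProtocol seems to depend on that plugin id string.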
Hi folks,
Would anyone be willing to share a few pros/cons of using many nodes vs. 1
very powerful machine for large-scale crawling? Of course many advantages
and disadvantages overlap with Hadoop and distributed computing in general,
but what I'm actually looking for are good reasons not to
Hi folks,
I'm using Nutch 1.11. Is it possible to implement plugin instance
startUp/shutDown methods for normal extension points? This would allow for
cleaning up resources at the end of a plugin instance's lifetime.
Thanks,
Joe
Markus, you're correct and I admit my use case is a bit exceptional. The one
case I thought I needed a "shutdown" hook was when using adaptive threading
and/or the fetch bandwidth threshold. When crawling large segments (multi-hour
fetches) I assumed that if a slow fetcher is killed it would be
Hi,
Is it possible to monitor the Reporter at runtime, maybe via a Hadoop
option? I'm specifically looking for the Fetcher's Reporter status (pages/s,
bandwidth, etc.). Using Nutch 1.11.
Thanks,
Joe
Markus,
The example URLs I sent are all the same IP address. This isn't always the
case, however, so you're correct that partitioning by IP won't help us.
Additionally, we'd like to avoid resolving the IPs of these domains in the
first place since most of them resolve to the same IP.
We're considering partitioning by domain or host instead.
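If I have the property name right (worth double-checking), that's controlled
by partition.url.mode in nutch-site.xml, which accepts byHost, byDomain or
byIP; byDomain and byHost avoid the DNS lookups that byIP requires:

    <property>
      <name>partition.url.mode</name>
      <value>byDomain</value>
    </property>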
Thanks Markus. I'll put together a list shortly. Is your classifier plugin
open-source or available to share? It sounds interesting and very useful.
---
Joe Naegele
Grier Forensics
-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Thursday, February 09,
Markus,
Interestingly enough, we do use OpenDNS to filter undesirable content,
including parked content. In this case, however, the domain in question isn't
tagged in OpenDNS and is therefore "allowed", along with all its subdomains.
This particular domain is "hjsjp.com". It's Chinese-owned
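As a stopgap we can block it outright in regex-urlfilter.txt (assuming the
default urlfilter-regex plugin is enabled), with a deny rule along these
lines:

    # exclude hjsjp.com and all of its subdomains
    -^https?://([a-z0-9-]+\.)*hjsjp\.com/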
This is more of a general question, not Nutch-specific:
Our crawler discovered some URLs pointing to a number of subdomains of a
Chinese-owned domain. It then proceeded to discover millions more URLs
pointing to other subdomains (hosts) of the same domain. Most of the names
appear to be
the fetch step
HTH
Julien
On 11 January 2017 at 14:21, Joseph Naegele <jnaeg...@grierforensics.com>
wrote:
> This is more of a general question, not Nutch-specific:
>
> Our crawler discovered some URLs pointing to a number of subdomains of a
> Chinese-owned domain.