Re: Indexing urlmeta fields into Solr 5.5.3 (Was RE: Failing to index from Nutch 1.12 to Solr 5.5.3)

2017-02-06 Thread Michael Coffey
You can create a core manually in the file system, in a specific place where solr looks for cores when it starts up. I have mine in /opt/solr/server/sol. It at least works in solr 5.4.1 (I haven't tried others). The core needs a conf dir and a properties file. The properties file should
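For anyone following this thread, a minimal hand-built core of that kind (paths, configset and core name below are only illustrative, and details can differ between Solr 5.x releases) looks roughly like:

  # assumes Solr home at /opt/solr/server/solr and a core to be named "nutch"
  mkdir -p /opt/solr/server/solr/nutch/conf
  cp $NUTCH_HOME/conf/schema.xml /opt/solr/server/solr/nutch/conf/
  cp /opt/solr/server/solr/configsets/basic_configs/conf/solrconfig.xml /opt/solr/server/solr/nutch/conf/
  echo "name=nutch" > /opt/solr/server/solr/nutch/core.properties
  # restart Solr so core discovery picks up the new core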

webgraph speed

2017-03-01 Thread Michael Coffey
Hello nutchers! I am trying to compute linkrank scores without spending excessive time on the task. My version of the crawl script contains the following line, which is similar to a commented-out line in the bin/crawl script in the 1.12 distribution. __bin_nutch webgraph $commonOptions -filter
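For reference, the full WebGraph/LinkRank sequence that the commented-out block sketches is roughly the following (option names taken from the 1.12 tool usage strings; paths are illustrative):

  # build/update the web graph from the newest segment
  __bin_nutch webgraph $commonOptions -filter -normalize \
      -segment "$CRAWL_PATH"/segments/$SEGMENT -webgraphdb "$CRAWL_PATH"/webgraphdb
  # compute LinkRank scores over the graph
  __bin_nutch linkrank $commonOptions -webgraphdb "$CRAWL_PATH"/webgraphdb
  # write the new scores back into the crawldb
  __bin_nutch scoreupdater $commonOptions -crawldb "$CRAWL_PATH"/crawldb \
      -webgraphdb "$CRAWL_PATH"/webgraphdb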

Re: crawling speed when polite

2016-11-05 Thread Michael Coffey
Yes, after a couple of rounds there are many, many hosts in the crawldb. Here are statistics after a bunch of rounds. It seems like we should be able to have a bunch of threads going. 16/11/05 06:38:45 INFO crawl.CrawlDbReader: Statistics for CrawlDb: /orgs/data/crawldb 16/11/05 06:38:45 INFO

crawling speed when polite

2016-11-04 Thread Michael Coffey
Can anyone point me to some good information on how to optimize crawling speed while maintaining politeness? My current situation is that Nutch is running reliably for me on a single hadoop node. Before bringing up additional nodes, I want to make it go reasonably fast on this one node. At the
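The knobs that usually decide fetch throughput under politeness are the per-queue settings; as a starting point (the values are only examples, not recommendations), something like:

  # more fetcher threads overall, still one connection per host with a 2s delay
  __bin_nutch fetch $commonOptions \
      -D fetcher.threads.fetch=50 -D fetcher.threads.per.queue=1 \
      -D fetcher.server.delay=2.0 -D fetcher.queue.mode=byHost \
      "$CRAWL_PATH"/segments/$SEGMENT -noParsing -threads 50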

Re: indexing to Solr

2016-11-21 Thread Michael Coffey
Sent: Monday, November 21, 2016 10:34 AM Subject: Re: indexing to Solr Hi Michael, On Sat, Nov 19, 2016 at 8:09 AM, <user-digest-h...@nutch.apache.org> wrote: > From: Michael Coffey <mcof...@yahoo.com.invalid> > To: "user@nutch.apache.org" <user@nutch.apache.org

What is the best version of Solr to use with Nutch 1.12?

2016-11-16 Thread Michael Coffey
What is the best version of Solr to use with Nutch 1.12? Does the 6.3.0 version work well?

indexing to Solr

2016-11-18 Thread Michael Coffey
Where can I find up-to-date information on indexing to Solr? When I search the web, I find tutorials that use the deprecated solrindex command. I also find questions where people want to know why it doesn't work. I have a good nutch 1.12 installation on a working hadoop cluster and a Solr 6.3.0
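For the archives: in 1.12 the non-deprecated route is the generic index job with the indexer-solr plugin on the classpath; a typical invocation (the Solr URL and paths are placeholders) looks roughly like:

  $NUTCH_HOME/runtime/deploy/bin/nutch index \
      -D solr.server.url=http://solrhost:8983/solr/nutch \
      "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb \
      "$CRAWL_PATH"/segments/$SEGMENT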

nutch 1.12 and Solr 6.3.0

2016-11-18 Thread Michael Coffey
I decided to plunge ahead with Solr indexing, but so far it doesn't work. The first error I got is listed below. Could it be that I am running JDK 7 on the nutch server and JDK 8 on the Solr server? As far as I know, Nutch 1.x won't work with JDK 8 and Solr 6.3 won't work with JDK less than 8.

Re: How can I Score?

2016-11-15 Thread Michael Coffey
November 15, 2016 12:09 AM Subject: Re: How can I Score? Hi Michael, Replies inline On Sat, Nov 12, 2016 at 7:10 PM, <user-digest-h...@nutch.apache.org> wrote: > From: Michael Coffey <mcof...@yahoo.com.invalid> > To: "user@nutch.apache.org" <user@nutch.apache.or

How can I Score?

2016-11-12 Thread Michael Coffey
When the generator is used with -topN, it is supposed to choose the highest-scoring urls. In my case, all the urls in my db have a score of zero, except the ones injected. How can I cause scores to be computed and stored? I am using the standard crawl script. Do I need to enable the various
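For context, -topN only has something to rank if a scoring plugin is active (scoring-opic is the default in plugin.includes) and if updatedb has run so score contributions reach the outlinks; a hedged sketch of the relevant steps:

  # generate the highest-scoring URLs
  __bin_nutch generate $commonOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments -topN 50000
  # ... fetch and parse the segment ...
  # updatedb applies the score contributions computed at parse time to the crawldb
  __bin_nutch updatedb $commonOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments/$SEGMENT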

Re: nutch 1.12 and Solr 6.3.0

2016-11-19 Thread Michael Coffey
I think this is what Lewis and Furkan know as NUTCH-2267. I get the same problem with Solr 5.5.3. I really would like to know which versions of nutch/solr work together "out of the box". From: Michael Coffey <mcof...@yahoo.com.INVALID> To: "user@nutch.apache.org"

Best version of Hadoop for Nutch 2.3.1

2016-10-31 Thread Michael Coffey
What is the best version of Hadoop to use with Nutch 2.3.1? I see that the runtime/local/lib contains .jar files for hadoop 2.5.2. Does that mean I should use 2.5.2, or would a newer version like 2.6.5 be even better?

Re: Nutch 1.x or 2.x

2016-10-31 Thread Michael Coffey
When you say that 1.x is more stable, what does that mean? From: Markus Jelsma To: "user@nutch.apache.org" Sent: Monday, October 31, 2016 9:39 AM Subject: RE: Nutch 1.x or 2.x Hello - if you want to crawl big, performance is not

Re: Nutch 1.x on hadoop

2016-11-02 Thread Michael Coffey
found it yet. From: Julien Nioche <lists.digitalpeb...@gmail.com> To: "user@nutch.apache.org" <user@nutch.apache.org>; Michael Coffey <mcof...@yahoo.com> Sent: Wednesday, November 2, 2016 9:51 AM Subject: Re: Nutch 1.x on hadoop Michael, See http://

db.ignore.external.links

2016-11-03 Thread Michael Coffey
Does db.ignore.external.links accept only relative urls? I am crawling a site, let's call it http://www.xyz.com. It contains things like http://www.xyz.com/business.html. Those urls don't end up in the crawldb, but ones with relative urls do. Is this normal, or am I confused?
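For reference, db.ignore.external.links is a boolean and, as far as I can tell, it is consulted when outlinks are written out during parsing, so it has to be in effect for the parse (or fetch-with-parsing) step rather than for updatedb; for example:

  # drop outlinks that point to a different host while parsing
  __bin_nutch parse $commonOptions -D db.ignore.external.links=true \
      "$CRAWL_PATH"/segments/$SEGMENT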

Re: Nutch 1.x on hadoop

2016-11-04 Thread Michael Coffey
That makes a lot of sense. I had a problem with the tracking UI that I had to solve by disabling IPV6 on my machine. Now it is working better! From: Julien Nioche <lists.digitalpeb...@gmail.com> To: "user@nutch.apache.org" <user@nutch.apache.org>; Michael Coff

Nutch 1.x on hadoop

2016-11-02 Thread Michael Coffey
I'm having trouble trying to get Nutch 1.12 to run on hadoop 2.7.3. I get a class not found exception for org.apache.nutch.crawl.Crawl, as in the following attempt. $HADOOP_HOME/bin/hadoop jar "/home/mjc/apache-nutch-1.12/runtime/deploy/apache-nutch-1.12.job" org.apache.nutch.crawl.Crawl seed
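As far as I know, org.apache.nutch.crawl.Crawl was removed from 1.x a few releases before 1.12, so the job file genuinely does not contain that class; the replacement is the bin/crawl script, along the lines of (arguments are illustrative):

  # runs inject/generate/fetch/parse/updatedb rounds, submitting the .job to Hadoop
  cd /home/mjc/apache-nutch-1.12/runtime/deploy
  bin/crawl -i -D solr.server.url=http://solrhost:8983/solr/nutch seeds/ crawl/ 2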

Re: Nutch 1.x or 2.x

2016-10-30 Thread Michael Coffey
Newbie question: I am trying to decide between Nutch 1.x or 2.x. The application is to crawl a large portion of the www using a massive number (thousands) of small machines (<= 2GB RAM each). I like the idea of the simpler architecture and pluggable storage backend of 2.x. However, I am

Re: Fetcher "hung while processing"

2016-12-09 Thread Michael Coffey
Sebastian On 12/09/2016 02:15 AM, Michael Coffey wrote: > I sometimes get a bunch of warning messages that say Thread #x hung while > processing > Is this just a normal thing to see occasionally, or should I look to find > some resolution? I do have an example where the same hos

Re: Fetcher "hung while processing"

2016-12-16 Thread Michael Coffey
y sometimes? Thanks, Sebastian On 12/09/2016 04:58 PM, Michael Coffey wrote: > The property fetcher.parse is false and I pass -noParsing to the fetch > command. What other post-fetch actions are there? > > > From: Sebastian Nagel <wastl.na...@googlemail.com>

Re: indexing to Solr

2016-12-17 Thread Michael Coffey
4 AM Subject: Re: indexing to Solr Hi Michael, On Sat, Nov 19, 2016 at 8:09 AM, <user-digest-h...@nutch.apache.org> wrote: > From: Michael Coffey <mcof...@yahoo.com.invalid> > To: "user@nutch.apache.org" <user@nutch.apache.org> > Cc: > Date: Fri, 18 Nov 2

Re: indexing to Solr

2016-12-17 Thread Michael Coffey
mcgibbney <lewi...@apache.org> To: "user@nutch.apache.org" <user@nutch.apache.org> Sent: Monday, November 21, 2016 10:34 AM Subject: Re: indexing to Solr Hi Michael, On Sat, Nov 19, 2016 at 8:09 AM, <user-digest-h...@nutch.apache.org> wrote: > From: Michael Coffe

Fetcher "hung while processing"

2016-12-08 Thread Michael Coffey
I sometimes get a bunch of warning messages that say Thread #x hung while processing Is this just a normal thing to see occasionally, or should I look to find some resolution? I do have an example where the same host shows up on a multitude of these messages, which puzzles me. I think there

Re: nutch 1.12 and Solr 5.4.1

2016-12-19 Thread Michael Coffey
CloudSolrServer, ConcurrentUpdateSolrServer, HttpSolrServer or LBHttpSolrServer respectively. solr.server.url http://solr5-00:8983/solr/nutch-0 Defines the Solr URL into which data should be indexed using the indexer-solr plugin. From: Michael Coffey <m

Re: nutch 1.12 and Solr 5.4.1

2016-12-20 Thread Michael Coffey
in the master branch. I am willing to work on some Java code, if necessary, to help resolve this. At this point, I don't know what to try next, other than switching to ElasticSearch. From: Michael Coffey <mcof...@yahoo.com.INVALID> To: "user@nutch.apache.org" <user@nutch.apache.

Re: nutch 1.12 and Solr 5.4.1

2016-12-22 Thread Michael Coffey
Is it possible to get around this problem by using an older version of Solr or Nutch or both? From: Michael Coffey <mcof...@yahoo.com.INVALID> To: "user@nutch.apache.org" <user@nutch.apache.org>; Furkan KAMACI <furkankam...@gmail.com>; Michael Coffey <mco

Re: nutch 1.12 and Solr 5.4.1

2016-12-22 Thread Michael Coffey
/server/solr-webapp/webapp/WEB-INF/lib/httpclient-4.4.1.jar thanks again From: Furkan KAMACI <furkankam...@gmail.com> To: Michael Coffey <mcof...@yahoo.com> Cc: "user@nutch.apache.org" <user@nutch.apache.org> Sent: Thursday, December 22, 2016 10:29 AM Subject: R

Re: nutch 1.12 and Solr 5.4.1

2016-12-19 Thread Michael Coffey
ent versions of Solr has not helped (6.3.0, 5.5.3, 5.4.1). FWIW, I have same version of Java on both machines. OpenJDK Runtime Environment (IcedTea 2.6.8) (7u121-2.6.8-1ubuntu0.14.04.1) OpenJDK 64-Bit Server VM (build 24.121-b00, mixed mode) From: Michael Coffey <mcof...@yahoo.

Re: nutch 1.12 and Solr 5.4.1

2016-12-19 Thread Michael Coffey
kam...@gmail.com> To: Michael Coffey <mcof...@yahoo.com>; user@nutch.apache.org Sent: Monday, December 19, 2016 4:13 PM Subject: Re: nutch 1.12 and Solr 5.4.1 Hi Michael, Could you check the version of solrj at your Nutch and compare it with version of Solr at your server? Kind Re

Re: Speed of linkDB

2017-04-04 Thread Michael Coffey
smaller, esp. if the linkdb includes also internal links. Best, Sebastian On 04/03/2017 02:08 AM, Michael Coffey wrote: > In my situation, I find that linkdb merge takes much more time than fetch and > parse combined, even though fetch is fully polite. > > What is the standard advice

Speed of linkDB Merge

2017-04-02 Thread Michael Coffey
In my situation, I find that linkdb merge takes much more time than fetch and parse combined, even though fetch is fully polite. What is the standard advice for making linkdb-merge go faster? I call invertlinks like this: __bin_nutch invertlinks "$CRAWL_PATH"/linkdb
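For reference, the LinkDb job can skip its internal normalize/filter passes, which are often the slow part; the flags below are from memory of the 1.x usage string, so check bin/nutch invertlinks with no arguments:

  # invert only the newest segment and skip URL normalization/filtering in the linkdb job
  __bin_nutch invertlinks $commonOptions "$CRAWL_PATH"/linkdb \
      "$CRAWL_PATH"/segments/$SEGMENT -noNormalize -noFilter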

readdb to dump a specific url

2017-03-03 Thread Michael Coffey
I want to find out what the crawldb knows about some specific urls. According to the nutch wiki, I should use nutch readdb with the -url option. But when I do a command like the following, I get nasty "can't find class" exceptions. $NUTCH_HOME/runtime/deploy/bin/nutch readdb
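For reference, the single-URL lookup takes the crawldb path first and then -url; running it from runtime/local may also sidestep the class-loading trouble seen in deploy mode (a hedged sketch):

  $NUTCH_HOME/runtime/local/bin/nutch readdb crawl/crawldb \
      -url http://www.example.com/some/page.html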

crawlDb speed around deduplication

2017-04-27 Thread Michael Coffey
In the standard crawl script, there is a _bin_nutch updatedb command and, soon after that, a _bin_nutch dedup command. Both of them launch hadoop jobs with "crawldb /path/to/crawl/db" in their names (in addition to the actual deduplication job). In my situation, the "crawldb" job launched by

indexer "possible analysis error"

2017-05-01 Thread Michael Coffey
I know this might be more of a SOLR question, but I bet some of you know the answer. I've been using Nutch 1.12 + SOLR 5.4.1 successfully for several weeks, but suddenly I am having frequent problems. My recent changes have been (1) indexing two segments at a time, instead of one, and (2)

Re: crawlDb speed around deduplication

2017-05-01 Thread Michael Coffey
er running the dedup job with the common options fixes your problem. That's easily done: just edit src/bin/crawl and run "ant runtime". Thanks, Sebastian On 04/28/2017 02:54 AM, Michael Coffey wrote: > In the standard crawl script, there is a _bin_nutch updatedb command and, > soon afte
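Concretely, the suggested edit amounts to changing the dedup line in src/bin/crawl to pass the shared options, i.e. something like:

  # before
  __bin_nutch dedup "$CRAWL_PATH"/crawldb
  # after (picks up the -D mapreduce settings carried in $commonOptions)
  __bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb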

Re: indexer "possible analysis error"

2017-05-03 Thread Michael Coffey
ming solr for the misleading messages! Hi Michael, What do you have in your Solr logs? Kind Regards, Furkan KAMACI 2 May 2017 Sal, saat 02:45 tarihinde Michael Coffey <mcof...@yahoo.com.invalid> şunu yazdı: > I know this might be more of a SOLR question, but I bet some of you know >

Re: crawlDb speed around deduplication

2017-05-03 Thread Michael Coffey
as shown by the Hadoop resource manager webapp, see screenshot. It's also indicated from where a configuration property is set. Best, Sebastian On 05/02/2017 12:57 AM, Michael Coffey wrote: > Thanks, I will do some testing with $commonOptions applied to dedup. I > suspect that the dedup-

tuning for speed

2017-05-16 Thread Michael Coffey
I am looking for a methodology for making the crawler cycle go faster. I had expected the run-time to be dominated by fetcher performance but, instead, the greater bulk of the time is taken by linkdb-merge + indexer + crawldb-update + generate-select. Can anyone provide an outline of such a

tuning for speed

2017-05-12 Thread Michael Coffey
I am looking for a methodology for making the crawler cycle go faster. I had expected the run-time to be dominated by fetcher performance but, instead, the greater bulk of the time is taken by linkdb-merge + indexer + crawldb-update + generate-select. Can anyone provide an outline of such a

RE: generating and updating segments

2017-05-23 Thread Michael Coffey
I really need overlapping crawl cycles to take advantage of a non-standard hardware platform I am required to use. I will rephrase my question: If I set generate.update.crawldb=true, will I be able to call the generator more than once without explicitly calling the crawldb-update in between those
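For reference, the property can be passed straight to the generate job; with it enabled, generated URLs get marked in the crawldb so a second generate should skip them even before any updatedb runs (a sketch under that assumption):

  # first segment
  __bin_nutch generate $commonOptions -D generate.update.crawldb=true \
      "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments -topN 50000
  # second segment, generated before updatedb; URLs selected above should be skipped
  __bin_nutch generate $commonOptions -D generate.update.crawldb=true \
      "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments -topN 50000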

generating and updating segments

2017-05-22 Thread Michael Coffey
In search of more effective parallelism, I have been experimenting with different schemes for organizing the nutch jobs. I would like to know if the Generator can work in a way that supports what I'm trying to do. Here is a pseudocode description of one approach. I use variables named curSegs

RE: generating and updating segments

2017-05-24 Thread Michael Coffey
Yes, it's true, it certainly does make things more complicated. For example, now that I turn on generate.update.crawldb, suddenly there is a permissions problem where it tries to create a temp file directory mapred/temp/generate-temp-. Is that a known bug? Hi, Yes, enable

depth scoring filter

2017-09-19 Thread Michael Coffey
I am trying to develop a news crawler and I want to prohibit it from wandering too far away from the seed list that I provide. It seems like I should use the DepthScoringFilter, but I am having trouble getting it to work. After a few crawl cycles, all the _depth_ metadata say either 1 or 1000.
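For anyone wiring this up, the moving parts are roughly the following (property and metadata key names as I understand the scoring-depth plugin; worth verifying against the plugin source):

  # nutch-site.xml: plugin.includes must list scoring-depth, and scoring.depth.max
  #   sets the default maximum depth
  # per-seed override in the seed file, tab-separated metadata after the URL:
  #   http://news.example.com/   _maxdepth_=3
  $NUTCH_HOME/runtime/deploy/bin/nutch inject crawl/crawldb seeds/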

Re: depth scoring filter

2017-09-21 Thread Michael Coffey
van Hemert | alterNET internet BV <ji...@alternet.nl> To: user <user@nutch.apache.org> Sent: Tuesday, September 19, 2017 11:43 PM Subject: Re: depth scoring filter Hi, On 20 September 2017 at 06:36, Michael Coffey <mcof...@yahoo.com.invalid> wrote: > I am trying do de

Re: deletions from index

2017-10-02 Thread Michael Coffey
So, I had these numbers in my index: Num Docs: 189550, Max Docs: 285531, Deleted Docs: 95981. Then I did a crawl and index, which told me indexed (add/update): 13,423. And now I have these numbers in my index: Num Docs: 190785, Max Docs: 223339, Deleted Docs: 32554. So, I am completely confused. I don't

Re: invalid utf8 chars when indexing or cleaning

2017-08-29 Thread Michael Coffey
Does anybody have any thoughts on this? It seems similar to the NUTCH-1016 bug that was fixed in version 1.4. Some more bits of information: the indexer job rarely fails (only 1 of the last 99 segments) but the cleaning job fails every time now. Once again, this is Nutch 1.12 and Solr 5.4.1. I

querying crawldb

2017-09-12 Thread Michael Coffey
Hello Nutchians, I need to be able to query a (nutch 1.x) crawldb for read-only search/sort/summarize purposes, based on combinations of status, fetch_time, score, and things like that. What is a good tool or process for doing such things? Up until now, I've been doing readdb-dump and then
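For the read-only querying part, readdb can at least narrow and reshape the dump before it leaves Hadoop; the flags below are hedged from the 1.x usage string:

  # dump only db_fetched records as CSV for downstream sorting/summarizing
  $NUTCH_HOME/runtime/deploy/bin/nutch readdb crawl/crawldb \
      -dump crawldb-dump -format csv -status db_fetched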

Re: depth scoring filter

2017-09-26 Thread Michael Coffey
09/22/2017 04:57 AM, Michael Coffey wrote: > I am still having trouble with the depth scoring filter, and now I have a > simpler test case. It does work, somewhat, when I give it a list of 50 seed > URLs, but when I give it a very short list, it fails. > I have tried depth.max values i

Re: inject deletes urls from crawldb

2017-09-28 Thread Michael Coffey
If the Inject command does filtering, then the documentation should say so. The page https://wiki.apache.org/nutch/bin/nutch%20inject does not mention any filtering or normalization. I find it very counter-intuitive that an injection operation would delete existing data. Should I edit that

inject deletes urls from crawldb

2017-09-27 Thread Michael Coffey
Perhaps my strangest question yet! Why does Inject delete URLs from the crawldb and how can I prevent it? I was trying to add 2 new sites to an existing crawldb. According to readdb stats, about 10% of my URLs disappeared in the process. (before injecting) 17/09/27 19:22:33 INFO

deletions from index

2017-10-02 Thread Michael Coffey
With my new news crawl, I would like to keep web pages in the index, even after they have disappeared from the web, so I can continue using them in machine-learning processes. I thought I could achieve this by avoiding running cleaning jobs. However, I still notice increasing numbers of

invalid utf8 chars when indexing or cleaning

2017-08-24 Thread Michael Coffey
Lately, I have seen many tasks and jobs fail in Solr when doing nutch index and nutch clean. Messages during indexing look like this. 17/08/24 19:18:59 INFO mapreduce.Job:  map 100% reduce 99% 17/08/24 19:19:36 INFO mapreduce.Job: Task Id : attempt_1502929850483_1329_r_07_2, Status : FAILED

Re: invalid utf8 chars when indexing or cleaning

2017-08-31 Thread Michael Coffey
utf8 chars when indexing or cleaning > > From the logs looks like the error is coming from the Solr side, do you > mind checking/sharing the logs on your Solr server? Can you pin point which > URL is causing the issue? > Best Regards, Jorge > > On Tue, Aug 29, 2017 at 9:2

addBinaryContent and string length must be a multiple of four

2017-10-17 Thread Michael Coffey
I think I have an instance of the known bug https://issues.apache.org/jira/browse/NUTCH-2186 I need to keep raw html in my Solr index (or somewhere) so that an external tool can access it and parse it. So, I added addBinaryContent and base64 to my indexing command. On the very first segment, I
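For anyone reproducing this, the indexer flags involved are roughly these (the Solr URL and paths are placeholders):

  $NUTCH_HOME/runtime/deploy/bin/nutch index \
      -D solr.server.url=http://solrhost:8983/solr/nutch \
      "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb \
      "$CRAWL_PATH"/segments/$SEGMENT -addBinaryContent -base64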

Re: addBinaryContent and string length must be a multiple of four

2017-10-23 Thread Michael Coffey
Thanks for the reply! I'm not sure the best way to illustrate the issue, as I struggle with solr log management within docker. However, here are a few URLs that have exhibited the problem. In each case, Solr complains "Error adding field 'binaryContent'" ... "msg=String length must be a

Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Michael Coffey
I am curious, is it possible to send boilerpipe output to Solr as a separate "plaintext" field, in addition to the usual "content" field (rather than replacing it)? If so, would someone please give an overview of how to do it?

Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Michael Coffey
Also, try the boilerpipe demo online at https://boilerpipe-web.appspot.com/ From: Markus Jelsma To: "user@nutch.apache.org" Sent: Wednesday, November 15, 2017 2:06 PM Subject: RE: [MASSMAIL]RE: Removing

Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Michael Coffey
I found a lot of detail about the boilerpipe algorithm in http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf Seems like very short paragraphs can be a problem, since one of the primary features used for determining boilerplate is the length of a given text block. I would

readseg dump and non-ASCII characters

2017-11-14 Thread Michael Coffey
Greetings Nutchlings, I have been using readseg-dump successfully to retrieve content crawled by nutch, but I have one significant problem: many non-ASCII characters appear as '???' in the dumped text file. This happens fairly frequently in the headlines of news sites that I crawl, for things

Re: readseg dump and non-ASCII characters

2017-11-15 Thread Michael Coffey
HTML encoding (the code is available in Nutch) and then convert the byte[] content using the right encoding. Best, Sebastian On 11/15/2017 02:20 AM, Michael Coffey wrote: > Greetings Nutchlings, > I have been using readseg-dump successfully to retrieve content crawled by > nutch, but I

Re: need to override refetch intervals

2017-11-27 Thread Michael Coffey
intervals and scores in the crawl db. From: Michael Coffey <mcof...@yahoo.com.INVALID> To: User <user@nutch.apache.org> Sent: Friday, November 24, 2017 3:13 PM Subject: need to override refetch intervals In order to achieve the most timely crawling o

need to override refetch intervals

2017-11-24 Thread Michael Coffey
In order to achieve the most timely crawling of news sites, I want to be able to manipulate the refetch intervals and scores in the crawl db. I thought I could accomplish that by re-injecting the urls that should be re-fetched most often. According to the documentation, it seems I should be
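For reference, the injector accepts per-URL metadata in the seed list, which is the documented hook for this; the keys and property below are the ones I believe it understands, tab-separated after each URL:

  # seeds-news/seeds.txt -- per-URL score and refetch interval (seconds):
  #   http://news.example.com/front   nutch.score=10.0   nutch.fetchInterval=14400
  # db.injector.update=true merges the metadata into existing crawldb entries
  $NUTCH_HOME/runtime/deploy/bin/nutch inject \
      -D db.injector.update=true crawl/crawldb seeds-news/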

Re: Not valid URLs in Crawldb through crawlcomplete

2017-11-29 Thread Michael Coffey
I bet that problem affects a lot of people. It certainly has affected me. Why isn't essential filtering ON by default? The bin/crawl script doesn't even have a way for the operator to specify any filtering. And nowhere, in the tutorial, is it mentioned that you need to specify "-filter" to
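For completeness, filtering can be forced on individual steps by hand; on the CrawlDb update, for instance (flag names hedged from the updatedb usage string):

  # apply urlfilters/normalizers to every record while updating the crawldb
  __bin_nutch updatedb $commonOptions "$CRAWL_PATH"/crawldb \
      "$CRAWL_PATH"/segments/$SEGMENT -filter -normalize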

Re: Not valid URLs in Crawldb through crawlcomplete

2017-11-30 Thread Michael Coffey
follows redirects) It's most efficient not to filter the CrawlDb. It's costly to apply the filters again and again: the CrawlDb might be huge (up to billions of URLs), and/or filter rules can be complex. The default does the necessary but avoids unnecessary work. Best, Sebastian On 11/29/2017 05:07 P

purging low-scoring urls

2017-12-04 Thread Michael Coffey
Is it possible to purge low-scoring urls from the crawldb? My news crawl has many thousands of zero-scoring urls and also many thousands of urls with scores less than 0.03. These urls will never be fetched because they will never make it into the generator's topN by score. So, all they do is

Re: readseg dump and non-ASCII characters

2017-12-14 Thread Michael Coffey
07. 4. a more reliable solution would require to detect the HTML encoding (the code is available in Nutch) and then convert the byte[] content using the right encoding. Best, Sebastian On 11/15/2017 02:20 AM, Michael Coffey wrote: > Greetings Nutchlings, > I have been using readseg-du

Re: Removing header,Footer and left menus while crawling

2017-11-14 Thread Michael Coffey
That is a very interesting note. I have been wanting something like that. I use the python-based "newspaper" package but it is not directly compatible with the nutch/hadoop infrastructure. From: Jorge Betancourt To: user@nutch.apache.org Cc:

Blacklisting TLDs

2018-06-14 Thread Michael Coffey
I want to blacklist certain top-level domains for a very large web crawl. I tried using the domainblacklist urlfilter in Nutch 1.12, but that doesn't seem to work. My domainblacklist-urlfilter.txt contains lines like the following. cn jp line.me albooked.com booked.co.il The TLDs do not get
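In case it helps, two things usually matter here: the plugin has to be listed in plugin.includes, and the file takes one domain/suffix per line; a sketch (plugin and file names as I understand them, and the checker invocation is from memory, so worth verifying):

  # nutch-site.xml: plugin.includes should include urlfilter-domainblacklist
  # conf/domainblacklist-urlfilter.txt, one entry per line:
  #   cn
  #   jp
  #   line.me
  # quick check that the combined filters actually reject a URL:
  echo "http://www.example.cn/" | $NUTCH_HOME/runtime/local/bin/nutch \
      org.apache.nutch.net.URLFilterChecker -allCombined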

Re: RE: random sampling of crawlDb urls

2018-05-01 Thread Michael Coffey
Just to clarify: .99 does NOT work fine. It should have rejected most of the records when I specified "((Math.random())>=.99)". I have used expressions not involving Math.random. For example, I can extract records above a specific score with "score>1.0". But the random thing doesn't work even

random sampling of crawlDb urls

2018-05-01 Thread Michael Coffey
I want to extract a random sample of URLS from my big crawldb. I think I should be able to do this using readdb -dump with a Jexl expression, but I haven't been able to get it to work. I have tried several variations of the following command. $NUTCH_HOME/runtime/deploy/bin/nutch readdb
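For reference, the expression goes to the dump via -expr (flag name as I recall it); the score-based form mentioned in the reply above works, and the random-sampling attempt has the same shape, quoting included:

  # records with score above 1.0 (reported to work)
  $NUTCH_HOME/runtime/deploy/bin/nutch readdb crawl/crawldb \
      -dump sample-hi -format csv -expr "score>1.0"
  # attempted ~1% random sample -- the form under discussion in this thread
  $NUTCH_HOME/runtime/deploy/bin/nutch readdb crawl/crawldb \
      -dump sample-rand -format csv -expr "((Math.random())>=.99)"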

Re: addBinaryContent and string length must be a multiple of four

2017-10-20 Thread Michael Coffey
I guess there is no solution or workaround for the addBinaryContent bug, so I have to write code to read directly from segment data. If not writing Java, I guess I have to do readseg-dump and then parse the output text file. -- original message -- I think I have an instance of the known bug

dealing with redirects from http to https

2018-03-09 Thread Michael Coffey
I am having a problem crawling some sites that seem to be transitioning to https. All their links contain http urls and the fetcher gets response code 301 and content that says "the document has moved" because the actual content is accessible only via https. This has been happening for a few
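One knob relevant here (assuming the protocol-http or protocol-httpclient plugin) is http.redirect.max, which makes the fetcher follow a redirect within the same fetch instead of queueing the target for a later round; for example:

  # follow up to 2 redirects immediately; the default 0 records the redirect
  # and leaves the https target for a future generate/fetch cycle
  __bin_nutch fetch $commonOptions -D http.redirect.max=2 \
      "$CRAWL_PATH"/segments/$SEGMENT -noParsing -threads 50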

Re: dealing with redirects from http to https

2018-03-09 Thread Michael Coffey
filter which may cause that the redirect targets are filtered? On 03/09/2018 08:39 PM, Michael Coffey wrote: > I am having a problem crawling some sites that seem to be transitioning to > https. All their links contain http urls and the fetcher gets response code > 301 and content that says &quo

spilled records from reducer

2018-04-12 Thread Michael Coffey
Greetings Nutchlings, I would like to make my generate jobs go faster, and I see that the reducer spills a lot of records. Here are the numbers for a typical long-running reduce task of the generate-select job: 100 million spilled records, 255K input records, 90k output records, 13G file bytes

Re: spilled records from reducer

2018-04-13 Thread Michael Coffey
temporary) is on SSDs; try different compression settings (CrawlDb and temporary data), see mapreduce.output.fileoutputformat.compress.codec, mapreduce.map.output.compress, mapreduce.map.output.compress.codec Best, Sebastian On 04/13/2018 02:52 AM, Michael Coffey wrote: > Greetings Nutchlings,
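The settings listed translate into -D overrides like the following (the values are only examples); raising the map-side sort buffer is the other usual lever against spilled records:

  __bin_nutch generate $commonOptions \
      -D mapreduce.map.output.compress=true \
      -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
      -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
      -D mapreduce.task.io.sort.mb=512 -D mapreduce.task.io.sort.factor=50 \
      "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments -topN 50000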

Re: Is there any way to block the hubpages while crawling

2018-03-20 Thread Michael Coffey
I think you will find that you need different rules for each website and that some amount of maintenance will be needed as the websites change their practices.

how could I identify obsolete segments?

2018-03-23 Thread Michael Coffey
Greetings Nutchlings, How can I identify segments that are no longer useful, now that I have been using AdaptiveFetchSchedule for several months? I have db.fetch.interval.max = 31536000 (365 days), but I know that tons of pages get re-fetched every 30-60 days because I have

Re: how could I identify obsolete segments?

2018-03-23 Thread Michael Coffey
But all the old segment data is still sitting there in hdfs. On Friday, March 23, 2018, 1:34:21 PM PDT, Sebastian Nagel <> wrote: Hi Michael, when segments are merged only the most recent record of one URL is kept. Sebastian On 03/23/2018 09:25 PM, Michael Coffey wrote: >
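For reference, the merge being referred to is the segment merger; a hedged sketch of reclaiming the space by merging everything into one output and removing the originals afterwards:

  # merge all existing segments; only the most recent record per URL survives
  __bin_nutch mergesegs "$CRAWL_PATH"/segments-merged -dir "$CRAWL_PATH"/segments -filter
  # after verifying the merged output:
  #   hadoop fs -rm -r "$CRAWL_PATH"/segments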

Does Nutch work with Hadoop Versions greater than 3.1.3?

2022-06-12 Thread Michael Coffey
Do current 1.x versions of Nutch (1.18, and trunk/master) work with versions of Hadoop greater than 3.1.3? I ask because Hadoop 3.1.3 is from October 2019, and there are many newer versions available. For example, 3.1.4 came out in 2020, and there are 3.2.x and 3.3.x versions that came out this