We happily use that filter just as it is shipped with Nutch. Just enabling it
in plugin.includes works for us. To ease testing you can use the bin/nutch
org.apache.nutch.net.URLFilterChecker tool to test filters.
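For example, the checker reads URLs from stdin and prints a result per URL. A
hedged sketch (the -allCombined flag runs all enabled filters in 1.x; check
your version's usage message):

echo "http://example.com/some/page.html" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

A leading + in the output means the URL passed the filters; a - means it was
rejected.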
-Original message-
From:Bai Shen baishen.li...@gmail.com
Sent: Wed
I think in Nutch 2.x HTMLParseFilter was renamed to ParseFilter. This is
not true for 1.x; see NUTCH-1482.
https://issues.apache.org/jira/browse/NUTCH-1482
-Original message-
From:Tony Mullins tonymullins...@gmail.com
Sent: Wed 12-Jun-2013 14:37
To: user@nutch.apache.org
work
-Original message-
From:Joe Zhang smartag...@gmail.com
Sent: Tue 11-Jun-2013 01:42
To: user user@nutch.apache.org
Subject: Re: using Tika within Nutch to remove boiler plates?
Marcus, do you mind sharing a sample nutch-site.xml?
On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma
Hi,
Yes, you should write a plugin that has a parse filter and an indexing filter. To
ease maintenance you would want to have a file per host/domain containing XPath
expressions, far easier than switch statements that need to be recompiled. The
indexing filter would then index the field values
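To make that concrete, here is a minimal sketch of the parse-filter half,
assuming the Nutch 1.x HtmlParseFilter interface (the class name, the
"myfield" metadata key and the per-host lookup are hypothetical; a matching
indexing filter would read the key back out of the parse metadata):

import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class XPathParseFilter implements HtmlParseFilter {
  private Configuration conf;
  // host -> XPath expression, loaded from the per-host config file in setConf()
  private Map<String, String> expressionsByHost = new HashMap<String, String>();

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    try {
      String host = new URL(content.getUrl()).getHost();
      String expression = expressionsByHost.get(host);
      if (expression != null) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String value = xpath.evaluate(expression, doc);
        // stash the extracted value; the indexing filter turns it into a field
        Parse parse = parseResult.get(content.getUrl());
        parse.getData().getParseMeta().set("myfield", value);
      }
    } catch (Exception e) {
      // never fail the whole parse over one bad expression
    }
    return parseResult;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // load expressionsByHost from a file here, e.g. conf.get("xpath.rules.file")
  }

  public Configuration getConf() {
    return conf;
  }
}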
boilerpipe any more? So what do you
suggest as an alternative?
On Tue, Jun 11, 2013 at 5:41 AM, Markus Jelsma
markus.jel...@openindex.iowrote:
We don't use Boilerpipe anymore, so there's no point in sharing. Just set those two
configuration options in nutch-site.xml as
property
.
On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
Hi,
Yes, you should write a plugin that has a parse filter and indexing
filter. To ease maintenance you would want to have a file per host/domain
containing XPath expressions, far easier than switch
@nutch.apache.org
Subject: Re: using Tika within Nutch to remove boiler plates?
So what in your opinion is the most effective way of removing boilerplates
in Nutch crawls?
On Tue, Jun 11, 2013 at 12:12 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
Yes, Boilerpipe is complex and difficult
Those settings belong in nutch-site.xml. Enable Boilerpipe and set the correct extractor
and it should work just fine.
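For reference, a hedged sketch of the two nutch-site.xml properties meant here
(property names as introduced by NUTCH-961; verify against your
nutch-default.xml):

<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
</property>

Valid algorithm values are DefaultExtractor, ArticleExtractor or CanolaExtractor.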
-Original message-
From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Sent: Sun 09-Jun-2013 20:47
To: user@nutch.apache.org
Subject: Re: using Tika within Nutch to remove
Please don't break existing scripts and support lower and uppercase.
Markus
-Original message-
From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Sent: Fri 31-May-2013 19:11
To: user@nutch.apache.org
Subject: Re: Generator -adddays
Seems like a small CLI syntax bug.
Please
You can either use robots.txt or modify the Fetcher. The Fetcher has a
FetchItemQueue for each queue, which also records the crawl delay for that queue.
A FetchItemQueue is created by FetchItemQueues.getFetchItemQueue(); this is where
it sets the crawl delay for the queue. You can have a lookup table here that
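A heavily hedged sketch of that idea (the FetchItemQueue constructor arguments
are from memory of the 1.x Fetcher and may differ in your version; the lookup
table is hypothetical and could be loaded from a config file):

// inside FetchItemQueues.getFetchItemQueue(String id), where a new queue is built
FetchItemQueue fiq = queues.get(id);
if (fiq == null) {
  // hypothetical per-queue override table: queue id (host/domain/ip) -> delay in ms
  Long custom = delayByQueueId.get(id);
  long delay = (custom != null) ? custom.longValue() : crawlDelay;
  fiq = new FetchItemQueue(conf, maxThreads, delay, minCrawlDelay);
  queues.put(id, fiq);
}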
Hi,
For some reason the fetcher sometimes produces corrupt, unreadable segments. It
then exits with an exception like problem advancing post, or a negative array
size exception, etc.
java.lang.RuntimeException: problem advancing post rec#702
at
Hi,
The 1.x indexer takes a -normalize parameter and there you can rewrite your
URLs. Judging from your patterns the RegexURLNormalizer should be sufficient.
Make sure you use the config file containing that pattern only when indexing,
otherwise they'll end up in the CrawlDB and segments. Use
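A hedged sketch of that setup (1.x solrindex syntax; the paths and the
dedicated normalize file are hypothetical, and the urlnormalizer.regex.file
property points RegexURLNormalizer at an alternative rules file):

bin/nutch solrindex -Durlnormalizer.regex.file=conf/regex-normalize-index.xml \
  http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb \
  crawl/segments/* -normalize

A rule in that file follows the regex-normalize.xml format, wrapped in its
regex-normalize root element:

<regex>
  <pattern>\?sid=[0-9a-f]+</pattern>
  <substitution></substitution>
</regex>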
Rodney,
Those are valid URL's but you clearly don't need them. You can either use
filters to get rid of them or normalize them away. Use the
org.apache.nutch.net.URLNormalizerChecker or URLFilterChecker tools to test
your config.
Markus
-Original message-
From:Rodney Barnett
If Nutch exits with an error then the segment is bad; a failing thread is not
an error that leads to a failed segment. This means the segment is properly
fetched, just that some records failed. Those records will be eligible for
refetch.
Assuming you use the crawl command, the updatedb
To: user@nutch.apache.org
Subject: Re: Does Nutch Checks Whether A Page crawled before or not
Where does Nutch store that information?
2013/3/21 Markus Jelsma:
Nutch selects records that are eligible for fetch. It's either due
Feng Lu, welcome! :)
-Original message-
From:Julien Nioche lists.digitalpeb...@gmail.com
Sent: Mon 18-Mar-2013 13:23
To: user@nutch.apache.org
Cc: d...@nutch.apache.org
Subject: Re: [WELCOME] Feng Lu as Apache Nutch PMC and Committer
Hi Feng,
Congratulations on becoming a
Hi
You can't do this with -slice, but you can merge segments and filter them. This
would mean you'd have to merge the segments for each domain. But that's far too
much work. Why do you want to do this? There may be better ways of achieving
your goal.
-Original message-
From:Jason S
The default heap size of 1G is just enough for a parsing fetcher with 10
threads. The only problem that may arise is too large and complicated PDF files
or very large HTML files. If you generate fetch lists of a reasonable size
there won't be a problem most of the time. And if you want to crawl
Hi,
Regarding politeness, 3 threads per queue is not really polite :)
Cheers
-Original message-
From:jc jvizu...@gmail.com
Sent: Fri 01-Mar-2013 15:08
To: user@nutch.apache.org
Subject: Re: a lot of threads spinwaiting
Hi Roland and lufeng,
Thank you very much for your
require to fetch the page before the time interval is passed?
Thanks very much
- David
On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
If you want records to be fetched at a fixed interval it's easier to inject
them with a fixed fetch interval
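For example, the 1.x Injector accepts tab-separated key=value metadata after
each seed URL; nutch.fetchInterval.fixed (in seconds) pins the interval
regardless of the fetch schedule (a hedged sketch, URL hypothetical):

http://example.com/page.html	nutch.fetchInterval.fixed=86400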
that?
Feng Lu : Thank you for the reference link.
Thanks - David
On Wed, Feb 27, 2013 at 3:22 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
The default or the injected interval? The default interval can be set in
the config (see nutch-default for example). Per-URL intervals can
Yes, it will support that until you run out of memory. But having a million
expressions is not going to work nicely. If you have a lot of expressions but
can divide them into domains I would patch the filter so it only executes
filters that are for a specific domain.
-Original
Yes, my first option is different files for different domains.
The point is how I can link the files with each domain. Do I need to make
some changes in the Nutch code, or does the project have a feature for
that?
On Tue, 26 Feb 2013 10:33:37 +, Markus Jelsma wrote:
Yes
Well, you can always use the DomainStatistics utility to get the raw numbers on
hosts, domains and TLDs, but this won't tell you whether a domain has been
fully crawled because the crawling frontier can always change.
You can be sure that everything (disregarding URL filters) has been crawled if
Something seems to be missing here. It's clear that 1.x has more features and
is a lot more stable than 2.x. Nutch 2.x can theoretically perform a lot better
if you are going to crawl on a very large scale, but I still haven't seen any
numbers to support this assumption. Nutch 1.x can easily
Yes.
-Original message-
From:Amit Sela am...@infolinks.com
Sent: Tue 19-Feb-2013 13:40
To: user@nutch.apache.org
Subject: Crawl script "numberOfRounds"
Is the crawl script's numberOfRounds argument the equivalent of the depth
argument in the crawl command?
Thanks.
Those are added by IndexerMapReduce (or 2.x equivalent) and index-basic. They
contain the crawl datum's signature, the time stamp (see index-basic) and crawl
datum score. If you think you don't need them, you can safely omit them.
-Original message-
From:alx...@aim.com alx...@aim.com
You can use the subcollection indexing filter to set a value for URLs that
match a string. With it you can distinguish them even if they are on the same host
and domain.
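A hedged sketch of the plugin's conf/subcollections.xml (element names from
memory of the sample file shipped with the plugin; the values are
hypothetical). URLs matching a whitelist prefix get the subcollection name
indexed as a field value:

<subcollections>
  <subcollection>
    <name>products</name>
    <id>products</id>
    <whitelist>http://www.example.com/products/</whitelist>
    <blacklist></blacklist>
  </subcollection>
</subcollections>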
-Original message-
From:mbehlok m_beh...@hotmail.com
Sent: Wed 13-Feb-2013 21:20
To: user@nutch.apache.org
Subject:
Hi - also, enough space in your /tmp directory?
Cheers
-Original message-
From:Alexei Korolev alexei.koro...@gmail.com
Sent: Mon 11-Feb-2013 09:27
To: user@nutch.apache.org
Subject: DiskChecker$DiskErrorException
Hello,
Already twice I got this error:
2013-02-08
A parsing fetcher does everything in the mapper. Please check the output()
method around line 1012 onwards:
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup
Parsing, signature, outlink processing (using code in ParseOutputFormat) all
happens
Oh, I'd like to add that the biggest problem is memory and the possibility for
a parser to hang, consume resources, time out everything else and destroy
the segment.
-Original message-
From:Weilei Zhang zhan...@gmail.com
Sent: Sat 09-Feb-2013 23:40
To: user@nutch.apache.org
-Original message-
From:kemical mickael.lume...@gmail.com
Sent: Fri 08-Feb-2013 10:53
To: user@nutch.apache.org
Subject: Best Practice to optimize Parse reduce step / ParseoutputFormat
Hi,
I've been looking for some time now into the reasons for the Parse reduce step
taking a lot of
The /tmp directory is not cleaned up IIRC. You're safe to empty it as long as
you don't have a job running ;)
-Original message-
From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Sent: Fri 08-Feb-2013 20:48
To: user@nutch.apache.org
Subject: Re: Could not find any valid local
Hadoop stores temporary files there, such as shuffled map output data; you need
it! But you can rm -r it after a complete crawl cycle. Do not clear it while a
job is running, or it's going to miss its temp files.
-Original message-
From:Eyeris Rodriguez Rueda eru...@uci.cu
Sent: Fri
Try setting -numFetchers N on the generator.
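For example (1.x generate syntax, paths hypothetical):

bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -numFetchers 4

-numFetchers controls how many fetch lists, and thus parallel fetcher map
tasks, the generator produces.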
-Original message-
From:Sourajit Basak sourajit.ba...@gmail.com
Sent: Mon 28-Jan-2013 11:57
To: user@nutch.apache.org
Subject: Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1
A higher number of per host threads,
Hi
-Original message-
From:Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
Sent: Mon 28-Jan-2013 17:01
To: user@nutch.apache.org
Subject: Solr dinamic fields
Hi:
I'm currently working on a platform to crawl a large amount of PDF files.
Using Nutch (and Tika) I'm able
Hi - I've not yet committed a fix for:
https://issues.apache.org/jira/browse/NUTCH-1449
This will allow you to stop documents from being indexed from within your
indexing filter. Order can be configured using the indexing.filter.order
(or similarly named) configuration directive.
-Original
If you use 1.x and don't merge segments you still have older versions of
documents. There is no active versioning in Nutch 1.x except segment naming and
merging, if you use it.
-Original message-
From:Tejas Patil tejas.patil...@gmail.com
Sent: Wed 23-Jan-2013 09:25
To:
Hi,
-deleteGone relies on segment information to delete records, which is faster
and indeed somewhat on-the-fly. The solrclean command relies on CrawlDB information
and will always work, even if you lost your segments or just periodically delete
old segments.
Cheers
-Original message-
Hi,
In Nutch a `synthetic token` maps to a field/value pair. You need an indexing
filter to read the key/value pair from the parsed metadata and add it as a
field/value pair to the NutchDocument. You may also need a custom parse filter
to extract the data from somewhere and store it to the
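A minimal sketch of the indexing-filter half, assuming the Nutch 1.x
IndexingFilter interface (the class name and the "mytoken" key/field are
hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class TokenIndexingFilter implements IndexingFilter {
  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // read the value a parse filter stored in the parse metadata
    String value = parse.getData().getParseMeta().get("mytoken");
    if (value != null) {
      doc.add("mytoken", value); // becomes a field/value pair in the index
    }
    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}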
/WritingPluginExample),
but I'll add one.
Cheers,
Sebastian
2012/11/30 Markus Jelsma markus.jel...@openindex.io:
Hi
In our case it is really in the segment, and ends up in the index. Are
there any known issues with parse filters? In that filter we do set the
Parse object as class attribute
a shared instance variable references to DOM nodes slipped from one call
of filter() to the other.
Is there a possibility to ensure that every instance of ParseUtil has its
own plugin instances?
Would be worth checking.
Cheers,
Sebastian
On 01/16/2013 06:55 PM, Markus Jelsma wrote
Nice!
Thanks
-Original message-
From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Sent: Mon 14-Jan-2013 20:28
To: d...@nutch.apache.org
Cc: user@nutch.apache.org
Subject: Re: [ANNOUNCE] New Nutch committer and PMC : Tejas Patil
Welcome aboard Tejas
Best
Lewis
On
-Original message-
From:Bayu Widyasanyata bwidyasany...@gmail.com
Sent: Sun 13-Jan-2013 07:34
To: user@nutch.apache.org
Subject: Re: How segments is created?
On Sun, Jan 13, 2013 at 12:47 PM, Tejas Patil tejas.patil...@gmail.comwrote:
Well, if you know that the front
Seems the job file is not deployed to all task trackers and I'm not sure why.
Can you try using the nutch script to run your fetcher?
-Original message-
From:Sourajit Basak sourajit.ba...@gmail.com
Sent: Thu 27-Dec-2012 13:29
To: user@nutch.apache.org
Subject: code changes not
On Thu, Dec 27, 2012 at 7:13 PM, Sourajit Basak
sourajit.ba...@gmail.comwrote:
How do you use the nutch script on a cluster?
On Thu, Dec 27, 2012 at 6:25 PM, Markus Jelsma markus.jel...@openindex.io
wrote:
Can you try using the nutch script to run your fetcher?
Hi - Nutch 1.5 has a -deleteGone switch for the SolrIndexer job. This will
delete permanent redirects and 404s that have been discovered during the
crawl. 1.6 also has a -deleteRobotsNoIndex switch that will delete pages that
have a robots meta tag with a noindex value.
-Original message-
Hi - it depends on the estimated size of your data and the available hardware.
You can simply get the current 1.0.x stable or 1.1.x beta Hadoop version; both
will run fine. The choice is which Nutch to use: 1.x is very stable, has
more features and can be used for very large scale crawls
Hi - curTime does not exceed fetchTime, thus the record is not eligible for
fetch.
-Original message-
From:Jan Philippe Wimmer i...@jepse.net
Sent: Mon 17-Dec-2012 13:31
To: user@nutch.apache.org
Subject: Re: shouldFetch rejected
Hi again.
I still have that issue. I start
The 1.x indexer can filter and normalize.
-Original message-
From:Julien Nioche lists.digitalpeb...@gmail.com
Sent: Mon 17-Dec-2012 15:11
To: user@nutch.apache.org
Subject: Re: How to extend Nutch for article crawling
Hi
See comments below
1. Add article list pages into
Hi - you have to get rid of those URLs via URL filters. If you cannot filter
them out you can set the fetcher time limit (see nutch-default) to limit the
time the fetcher runs, or set the fetcher minimum throughput (see
nutch-default). The latter will abort the fetcher if less than N
-
From:Sourajit Basak sourajit.ba...@gmail.com
Sent: Mon 10-Dec-2012 10:55
To: user@nutch.apache.org
Cc: Markus Jelsma markus.jel...@openindex.io
Subject: Re: fetcher partitioning
Could anyone review this patch for using a pluggable custom partitioner?
For the time being, I have just copied over
,
Sourajit
On Mon, Dec 10, 2012 at 4:23 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
Sourajit,
Looks fine at first glance. A partitioner does not partition between
threads, only mappers. It also makes little sense because in the fetcher
the number of threads can be set, plus the queue
Thanks Lewis! :)
-Original message-
From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Sent: Sat 08-Dec-2012 22:56
To: annou...@apache.org; user@nutch.apache.org
Cc: d...@nutch.apache.org
Subject: [ANNOUNCE] Apache Nutch 1.6 Released
Hi All,
The Apache Nutch PMC are
-Original message-
From:Pratik Garg saytopra...@gmail.com
Sent: Wed 05-Dec-2012 19:17
To: user@nutch.apache.org
Cc: Chirag Goel goel.chi...@gmail.com
Subject: New Scoring
Hi,
Nutch provides a default and new Scoring method for giving score to the
pages. I have couple of
Informáticas
_
-Original message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, 03 December 2012 1:42 PM
To: user@nutch.apache.org
Subject: RE: hung threads in big nutch crawl process
See how the indexchecker fetches URL's:
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java?view=markup
-Original message-
From:Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
Sent: Fri 30-Nov-2012 16:46
To: user@nutch.apache.org
.
On Mon, Nov 26, 2012 at 3:21 AM, Markus Jelsma
markus.jel...@openindex.iowrote:
I checked the code. You're probably not pointing it to a valid path or
perhaps the build is wrong and you haven't used ant clean before building
Nutch. If you keep having trouble you may want to check out
Impossible to say, but perhaps there are more non-200 fetched records. Carefully
look at the fetcher logs and inspect the CrawlDB with the readdb -stats
command.
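For example (paths hypothetical):

bin/nutch readdb crawl/crawldb -stats

prints counts per CrawlDatum status (db_fetched, db_unfetched, db_gone and so
on), and

bin/nutch readdb crawl/crawldb -url http://example.com/page.html

dumps the full record for a single URL.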
-Original message-
From:Joe Zhang smartag...@gmail.com
Sent: Thu 29-Nov-2012 07:04
To: user user@nutch.apache.org
...
(1) update config file to restrict domain crawls - (2) run command that
crawls a domain with changes from config file while not having to rebuild
job file - (3) index to Solr
What would the (general) command be for step (2) is my question.
On Mon, Nov 26, 2012 at 5:16 AM, Markus Jelsma
Hi,
This is a difficult problem in MapReduce, also because one image
URL may be embedded in many documents. There are various methods you could use
to aggregate the records but none I can think of will work very well or are
straightforward to implement.
I think the most
Subject: Re: The "topN" parameter in nutch crawl
How would you characterize the crawling algorithm? Depth-first,
breadth-first, or some heuristic-based?
On Thu, Nov 29, 2012 at 2:10 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
Hi,
None of the three. The topN parameter
alphabetically). It just picks the first
eligible URL in the sorted list. You really should take a good look at the
Generator code; it'll answer most of your questions.
On Thu, Nov 29, 2012 at 3:03 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
Nutch does neither. If scoring is used
Trunk is a directory in svn in which actual development is happening:
http://svn.apache.org/viewvc/nutch/trunk/
-Original message-
From:Joe Zhang smartag...@gmail.com
Sent: Tue 27-Nov-2012 01:46
To: user user@nutch.apache.org
Subject: trunk
In a different thread, Markus suggested
Hi - are you sure you have tabs separating the target and the mapped mimes? Use
the nutch indexchecker tool to quickly test if it works.
-Original message-
From:Eyeris Rodriguez Rueda eru...@uci.cu
Sent: Tue 27-Nov-2012 21:18
To: user@nutch.apache.org
Subject: RE: problem with
original -
From: Markus Jelsma markus.jel...@openindex.io
To: user@nutch.apache.org
Sent: Tuesday, 27 November 2012 15:33:20
Subject: RE: problem with text/html content type of documents appears
application/xhtml+xml in solr index
Hi - are you sure you have tabs separating the target
again
Yes, that's what I've been doing, but ant itself won't produce the official
binary release.
On Mon, Nov 26, 2012 at 2:16 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
just ant will do the trick.
-Original message-
From:Joe Zhang smartag...@gmail.com
Sent: Mon
...@gmail.com
Sent: Sun 25-Nov-2012 08:42
To: Markus Jelsma markus.jel...@openindex.io; user user@nutch.apache.org
Subject: Re: Indexing-time URL filtering again
This does seem a bug. Can anybody help?
On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang smartag...@gmail.com wrote:
Markus, could you
Hi - trunk's `more` indexing filter can map mime types to any target. With it you
can map both (x)html mimes to text/html or to `web page`.
https://issues.apache.org/jira/browse/NUTCH-1262
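A hedged sketch of the tab-separated mapping file introduced there (target
first, then the mime types that map onto it; check NUTCH-1262 for the exact
file name and format):

web page	text/html	application/xhtml+xml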
-Original message-
From:Eyeris Rodriguez Rueda eru...@uci.cu
Sent: Sun 25-Nov-2012 00:48
To:
the patch command
correctly, and re-build nutch. But the problem still persists...
On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma markus.jel...@openindex.io
wrote:
No, this is no bug. As I said, you need either to patch your Nutch or get
the sources from trunk. The -filter parameter
an application/xhtml+xml to text/html in solr index.
-Original message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Sunday, 25 November 2012 4:33 AM
To: user@nutch.apache.org
Subject: RE: problem with text/html content type of documents appears
application
!
On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
You should provide the log output.
-Original message-
From:Joe Zhang smartag...@gmail.com
Sent: Sun 25-Nov-2012 17:27
To: user@nutch.apache.org
Subject: Re: Indexing-time URL filtering again
:30 PM, Markus Jelsma markus.jel...@openindex.io
wrote:
You seem to have an NPE caused by your regex rules, for some weird
reason. If you can provide a way to reproduce it you can file an issue in
Jira. This NPE should also occur if you run the regex tester.
nutch
Hi,
I just tested a small index job that usually writes 1200 records to Solr. It
works fine if I specify -. in a filter (index nothing) and point to it with
-Durlfilter.regex.file=path like you do. I assume you mean by `it doesn't
work` that it filters nothing and indexes all records from the
See:
http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
-Original message-
From:Eyeris Rodriguez Rueda eru...@uci.cu
Sent: Fri 23-Nov-2012 03:29
To: user@nutch.apache.org
Subject: doubts about some propierties on nutch-site.xml file
Hi all.
Hi
-Original message-
From:kiran chitturi chitturikira...@gmail.com
Sent: Sun 18-Nov-2012 18:38
To: user@nutch.apache.org
Subject: Best practices for running Nutch
Hi!
I have been running crawls using Nutch for 13000 documents (protocol http)
on a single machine and it goes on
That's because the object is not set in the constructor. You can access the
Configuration after setConf() is called, so defer your work from the constructor
to this method.
public void setConf(Configuration conf) {
  this.conf = conf;
  // the Configuration is available from here on, e.g. conf.get("my.property")
}
-Original message-
From:Sourajit Basak
You can override some URL filter paths in nutch-site.xml or with command line
options (tools) such as bin/nutch fetch -Durlfilter.regex.file=bla. You can
also set NUTCH_HOME and keep everything separate if you're running it locally.
On Hadoop you'll need separate job files.
-Original
Hi - this should not happen. The only thing I can imagine is that the update
step doesn't succeed, but that would mean nothing is going to be indexed either.
You can inspect a URL using the readdb tool; check before and after.
-Original message-
From:vetus ve...@isac.cat
Sent: Thu
Hi - Sure, check the db.parsemeta.to.crawldb configuration directive.
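For example, in nutch-site.xml (a hedged sketch; myfield is a hypothetical key
your parse filter sets in the parse metadata):

<property>
  <name>db.parsemeta.to.crawldb</name>
  <value>myfield</value>
</property>

The comma-separated keys listed there are copied from the parse metadata into
the CrawlDatum's metadata during updatedb.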
-Original message-
From:Sourajit Basak sourajit.ba...@gmail.com
Sent: Wed 14-Nov-2012 08:10
To: user@nutch.apache.org
Subject: adding custom metadata to CrawlDatum during parse
Is it possible to add custom
In trunk the modified time is based on whether or not the signature has
changed. It makes little sense to rely on HTTP headers because almost no CMS
implements them correctly, and it messes (or allows results to be messed with
on purpose) with an adaptive schedule.
In trunk you can use the Inlink and Inlinks classes: the first for each inlink
and the latter to add the Inlink objects to.
Inlinks inlinks = new Inlinks();
inlinks.add(new Inlink("http://nutch.apache.org/", "Apache Nutch"));
The inlink URL is the key in the key/value pair so you won't see that
Hi - please use the -noFilter option. It is usually useless to filter in the
generator because the URLs have already been filtered in the parse step and/or
update step.
-Original message-
From:Mohammad wrk mhd...@yahoo.com
Sent: Mon 12-Nov-2012 18:43
To: user@nutch.apache.org
Subject:
configuration for about 4 days and all of a sudden one crawl
causes a jump of 100 minutes?
Cheers,
Mohammad
From: Markus Jelsma markus.jel...@openindex.io
To: user@nutch.apache.org user@nutch.apache.org
Sent: Monday, November 12, 2012 11:19:11 AM
Try cleaning your build.
-Original message-
From:j.sulli...@thomsonreuters.com j.sulli...@thomsonreuters.com
Sent: Thu 08-Nov-2012 07:23
To: user@nutch.apache.org
Subject: Tika Parsing not working in the latest version of 2.X?
Just tried the latest 2.X after being away for a
-D as a valid command parameter for solrindex.
On Fri, Nov 2, 2012 at 11:37 AM, Markus Jelsma
markus.jel...@openindex.iowrote:
Ah, I understand now.
The indexer tool can filter as well in 1.5.1 and if you enable the regex
filter and set a different regex configuration file when indexing
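A hedged example of that combination (1.5.1 solrindex syntax from memory;
paths and the dedicated filter file are hypothetical):

bin/nutch solrindex -Durlfilter.regex.file=conf/regex-urlfilter-index.txt \
  http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb \
  crawl/segments/* -filter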
Hi - the timestamp is just the time when a page is being indexed. Not very
useful except for deduplication. If you want to index some publishing date you
must first identify the source of that date and get it out of the webpages. It's
possible to use og:date or other meta tags or perhaps other
-Original message-
From:Joe Zhang smartag...@gmail.com
Sent: Fri 02-Nov-2012 10:04
To: user@nutch.apache.org
Subject: URL filtering: crawling time vs. indexing time
I feel like this is a trivial question, but I just can't get my head
around it.
I'm using nutch 1.5.1 and solr
://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and you
start crawling at mysite.com, you'll get zero results, as there is no match.
On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma
markus.jel...@openindex.iowrote:
-Original message-
From:Joe Zhang smartag
Hi,
There are binary versions of 1.5.1 but not 2.x.
http://apache.xl-mirror.nl/nutch/1.5.1/
About the scripts: you have to build Nutch and then go to the runtime/local
directory to run bin/nutch.
Cheers
-Original message-
From:Dr. Thomas Zastrow p...@thomas-zastrow.de
Sent: Thu
Cheers!
-Original message-
From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Sent: Thu 01-Nov-2012 18:30
To: user@nutch.apache.org
Subject: Re: [crawler-common] infoQ article Apache Nutch 2 Features and
Product Roadmap
Nice one Julien. Its nothing short of a privilege to be
Hi - Yes, the fetch time is the time when the record is eligible for fetch
again.
Cheers,
-Original message-
From:Stefan Scheffler sscheff...@avantgarde-labs.de
Sent: Sat 27-Oct-2012 14:49
To: user@nutch.apache.org
Subject: fetch time
Hi,
When I dump out the crawl db, there
Hi Морозов,
It's a directory containing Hadoop map file(s) that store key/value pairs.
Hadoop's Text class is the key and Nutch's Content class is the value. You would
need Hadoop to easily process the files
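For example, a minimal sketch of dumping such a content directory with the
Hadoop API of that era (the path and class name are hypothetical; each
part-XXXXX map file holds a data file readable as a SequenceFile):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class ContentDumper {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path data = new Path("crawl/segments/20130101000000/content/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();           // the key: the page URL
    Content content = new Content(); // the value: Nutch's Content
    while (reader.next(url, content)) {
      System.out.println(url + "\t" + content.getContentType());
    }
    reader.close();
  }
}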
Hi,
You cannot recover the mapper output as far as I know. But anyway, one should
never have a fetcher running for three days. It's far better to generate a
large number of smaller segments and fetch them sequentially. If an error
occurs, only a small portion is affected. We never run fetchers
Hi,
-Original message-
From:kiran chitturi chitturikira...@gmail.com
Sent: Thu 25-Oct-2012 20:49
To: user@nutch.apache.org
Subject: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type
application/pdf
Hi,
i have built Nutch 2.x in eclipse using this tutorial (
?
http://wiki.apache.org/nutch/FAQ
On Fri, Oct 26, 2012 at 2:10 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
Hi,
You cannot recover the mapper output as far as i know. But anyway, one
should never have a fetcher running for three days. It's far better to
generate a large
-Oct-2012 00:34
To: user@nutch.apache.org
Cc: dkavr...@gmail.com; Markus Jelsma markus.jel...@openindex.io
Subject: Re: RegEx URL Normalizer
Hi,
I am interested in doing this, i.e. only stripping out parameters from a URL
if some other string is found as well; in my case it will be a domain
name. I
Hi - Hadoop can write more records per second than Solr can analyze and store,
especially with multiple reducers (threads in Solr). SolrCloud is notoriously
slow when it comes to indexing compared to a stand-alone setup. However, this
should not be a problem at all as you're not dealing with
Hi
-Original message-
From:Thilina Gunarathne cset...@gmail.com
Sent: Tue 23-Oct-2012 00:38
To: user@nutch.apache.org
Subject: Re: Best practice to index a large crawl through Solr?
Hi Markus,
Thanks a lot for the info.
Hi - Hadoop can write more records per second than Solr
Hi Ye,
-Original message-
From:Ye T Thet yethura.t...@gmail.com
Sent: Thu 18-Oct-2012 15:46
To: user@nutch.apache.org
Subject: Fetcher Thread
Hi Folks,
I have two questions about the fetcher threads in Nutch. The value
fetcher.threads.fetch in the configuration file determines the