RE: Suffix URLFilter not working

2013-06-12 Thread Markus Jelsma
We happily use that filter just as it is shipped with Nutch. Just enabling it in plugin.includes works for us. To ease testing you can use the bin/nutch org.apache.nutch.net.URLFilterChecker to test filters. -Original message- From:Bai Shen baishen.li...@gmail.com Sent: Wed
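Conceptually, a suffix-based URL filter just rejects URLs whose path ends with one of a configured list of suffixes. A minimal Python sketch of that idea (the suffix list and function name are hypothetical, not Nutch's actual SuffixURLFilter):

```python
from urllib.parse import urlparse

BLOCKED_SUFFIXES = {".jpg", ".gif", ".zip"}  # hypothetical suffix list

def suffix_filter(url):
    """Return the URL if it passes, or None to filter it out,
    mirroring the accept/reject contract of a Nutch URL filter."""
    path = urlparse(url).path.lower()
    if any(path.endswith(s) for s in BLOCKED_SUFFIXES):
        return None
    return url
```

The URLFilterChecker tool mentioned above does the equivalent interactively: it reads URLs from stdin and reports which ones the active filters accept.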

RE: HTMLParseFilter equivalent in Nutch 2.2 ???

2013-06-12 Thread Markus Jelsma
I think that for Nutch 2.x, HTMLParseFilter was renamed to ParseFilter. This is not true for 1.x, see NUTCH-1482. https://issues.apache.org/jira/browse/NUTCH-1482 -Original message- From:Tony Mullins tonymullins...@gmail.com Sent: Wed 12-Jun-2013 14:37 To: user@nutch.apache.org

RE: using Tika within Nutch to remove boiler plates?

2013-06-11 Thread Markus Jelsma
work -Original message- From:Joe Zhang smartag...@gmail.com Sent: Tue 11-Jun-2013 01:42 To: user user@nutch.apache.org Subject: Re: using Tika within Nutch to remove boiler plates? Marcus, do you mind sharing a sample nutch-site.xml? On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma

RE: Data Extraction from 100+ different sites...

2013-06-11 Thread Markus Jelsma
Hi, Yes, you should write a plugin that has a parse filter and indexing filter. To ease maintenance you would want to have a file per host/domain containing XPath expressions, far easier than switch statements that need to be recompiled. The indexing filter would then index the field values

RE: using Tika within Nutch to remove boiler plates?

2013-06-11 Thread Markus Jelsma
boilerpipe any more? So what do you suggest as an alternative? On Tue, Jun 11, 2013 at 5:41 AM, Markus Jelsma markus.jel...@openindex.iowrote: we don't use Boilerpipe anymore so no point in sharing. Just set those two configuration options in nutch-site.xml as property

RE: Data Extraction from 100+ different sites...

2013-06-11 Thread Markus Jelsma
. On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma markus.jel...@openindex.iowrote: Hi, Yes, you should write a plugin that has a parse filter and indexing filter. To ease maintenance you would want to have a file per host/domain containing XPath expressions, far easier that switch

RE: using Tika within Nutch to remove boiler plates?

2013-06-11 Thread Markus Jelsma
@nutch.apache.org Subject: Re: using Tika within Nutch to remove boiler plates? So what in your opinion is the most effective way of removing boilerplates in Nutch crawls? On Tue, Jun 11, 2013 at 12:12 PM, Markus Jelsma markus.jel...@openindex.iowrote: Yes, Boilerpipe is complex and difficult

RE: using Tika within Nutch to remove boiler plates?

2013-06-10 Thread Markus Jelsma
Those settings belong to nutch-site. Enable BP and set the correct extractor and it should work just fine. -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Sun 09-Jun-2013 20:47 To: user@nutch.apache.org Subject: Re: using Tika within Nutch to remove

RE: Generator -adddays

2013-05-31 Thread Markus Jelsma
Please don't break existing scripts and support lower and uppercase. Markus -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Fri 31-May-2013 19:11 To: user@nutch.apache.org Subject: Re: Generator -adddays Seems like a small cli syntax bug. Please

RE: How to achieve different fetcher.server.delay configuration for different hosts/sub domains?

2013-05-28 Thread Markus Jelsma
You can either use robots.txt or modify the Fetcher. Fetcher has a FetchItemQueue for each queue, this also records the CrawlDelay for that queue. A FetchItemQueue is created by FetchItemQueues.getFetchItemQueue(), here it sets the CrawlDelay for the queue. You can have a lookup table here that
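The lookup-table idea described above can be sketched as follows. This is illustrative Python, not Nutch's Java FetchItemQueue code; the table contents and names are made up:

```python
DEFAULT_DELAY = 5.0  # seconds, standing in for fetcher.server.delay

# Hypothetical per-host overrides, consulted when a queue is created.
CRAWL_DELAYS = {
    "slow.example.com": 30.0,
    "fast.example.com": 1.0,
}

def delay_for_queue(queue_id):
    """Pick the crawl delay to assign to a newly created fetch queue."""
    return CRAWL_DELAYS.get(queue_id, DEFAULT_DELAY)
```

In the real Fetcher the equivalent decision point is where FetchItemQueues.getFetchItemQueue() constructs a queue and sets its crawl delay.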

Fetcher corrupting some segments

2013-05-27 Thread Markus Jelsma
Hi, For some reason the fetcher sometimes produces corrupt, unreadable segments. It then exits with exceptions like problem advancing post, or a negative array size exception, etc. java.lang.RuntimeException: problem advancing post rec#702 at

RE: rewriting urls that are index

2013-04-22 Thread Markus Jelsma
Hi, The 1.x indexer takes a -normalize parameter and there you can rewrite your URL's. Judging from your patterns the RegexURLNormalizer should be sufficient. Make sure you use the config file containing that pattern only when indexing, otherwise they'll end up in the CrawlDB and segments. Use
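The rewrite rules the RegexURLNormalizer applies are essentially an ordered list of (pattern, substitution) pairs. A small Python sketch of that mechanism, with hypothetical example rules rather than any shipped configuration:

```python
import re

# Ordered rewrite rules; each is applied in turn, like the entries in
# a regex-normalize.xml file.
RULES = [
    (re.compile(r"[?&]sessionid=[^&]*"), ""),   # strip session ids
    (re.compile(r"^http://www\."), "http://"),  # drop a www prefix
]

def normalize(url):
    for pattern, repl in RULES:
        url = pattern.sub(repl, url)
    return url
```

As the message notes, if such rules should only apply at indexing time, they must live in a config file used only by the indexer, or they will also rewrite URLs entering the CrawlDB and segments.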

RE: Period-terminated hostnames

2013-04-18 Thread Markus Jelsma
Rodney, Those are valid URL's but you clearly don't need them. You can either use filters to get rid of them or normalize them away. Use the org.apache.nutch.net.URLNormalizerChecker or URLFilterChecker tools to test your config. Markus -Original message- From:Rodney Barnett

RE: How to Continue to Crawl with Nutch Even An Error Occurs?

2013-03-20 Thread Markus Jelsma
If Nutch exits with an error then the segment is bad; a failing thread is not an error that leads to a failed segment. This means the segment is properly fetched, just that some records failed. Those records will be eligible for refetch. Assuming you use the crawl command, the updatedb

RE: Does Nutch Checks Whether A Page crawled before or not

2013-03-20 Thread Markus Jelsma
To: user@nutch.apache.org Subject: Re: Does Nutch Checks Whether A Page crawled before or not Where does Nutch store that information? 2013/3/21 Markus Jelsma-2 [via Lucene] ml-node+s472066n4049568...@n3.nabble.com Nutch selects records that are eligible for fetch. It's either due

RE: [WELCOME] Feng Lu as Apache Nutch PMC and Committer

2013-03-18 Thread Markus Jelsma
Feng Lu, welcome! :) -Original message- From:Julien Nioche lists.digitalpeb...@gmail.com Sent: Mon 18-Mar-2013 13:23 To: user@nutch.apache.org Cc: d...@nutch.apache.org Subject: Re: [WELCOME] Feng Lu as Apache Nutch PMC and Committer Hi Feng,  Congratulations on becoming a

RE: keep all pages from a domain in one slice

2013-03-05 Thread Markus Jelsma
Hi, You can't do this with -slice but you can merge segments and filter them. This would mean you'd have to merge the segments for each domain. But that's far too much work. Why do you want to do this? There may be better ways of achieving your goal. -Original message- From:Jason S

RE: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-03 Thread Markus Jelsma
The default heap size of 1G is just enough for a parsing fetcher with 10 threads. The only problem that may arise is overly large or complicated PDF files, or very large HTML files. If you generate fetch lists of a reasonable size there won't be a problem most of the time. And if you want to crawl

RE: a lot of threads spinwaiting

2013-03-01 Thread Markus Jelsma
Hi, Regarding politeness, 3 threads per queue is not really polite :) Cheers -Original message- From:jc jvizu...@gmail.com Sent: Fri 01-Mar-2013 15:08 To: user@nutch.apache.org Subject: Re: a lot of threads spinwaiting Hi Roland and lufeng, Thank you very much for your

RE: Nutch Incremental Crawl

2013-02-27 Thread Markus Jelsma
require to fetch the page before the time interval is passed? Thanks very much - David On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma markus.jel...@openindex.iowrote: If you want records to be fetched at a fixed interval its easier to inject them with a fixed fetch interval

RE: Nutch Incremental Crawl

2013-02-27 Thread Markus Jelsma
that? Feng Lu : Thank you for the reference link. Thanks - David On Wed, Feb 27, 2013 at 3:22 PM, Markus Jelsma markus.jel...@openindex.iowrote: The default or the injected interval? The default interval can be set in the config (see nutch-default for example). Per URL's can

RE: regex-urlfilter file for multiple domains

2013-02-26 Thread Markus Jelsma
Yes, it will support that until you run out of memory. But having a million expressions is not going to work nicely. If you have a lot of expressions but can divide them into domains, I would patch the filter so it only executes filters that are for a specific domain. -Original
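The per-domain dispatch suggested above can be sketched like this: instead of running one huge flat list of regexes against every URL, group the patterns by domain and run only the matching group. The structure and patterns here are hypothetical, not a Nutch patch:

```python
import re
from urllib.parse import urlparse

# Hypothetical reject-rules, keyed by the domain they apply to.
FILTERS_BY_DOMAIN = {
    "example.com": [re.compile(r"/private/")],
    "example.org": [re.compile(r"\.pdf$")],
}

def accepts(url):
    """Run only the reject rules registered for this URL's domain."""
    domain = urlparse(url).hostname or ""
    for pattern in FILTERS_BY_DOMAIN.get(domain, []):
        if pattern.search(url):
            return False
    return True
```

With a million expressions split over many domains, each URL then only pays for its own domain's rules instead of the whole list.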

RE: regex-urlfilter file for multiple domains

2013-02-26 Thread Markus Jelsma
Yes, my first option is different files for different domains. The point is: how can I link the files with each domain? Do I need to make some changes in the Nutch code, or does the project have a feature to do that? On Tue, 26 Feb 2013 10:33:37 +, Markus Jelsma wrote: Yes

RE: Nutch status info on each domain individually

2013-02-25 Thread Markus Jelsma
Well, you can always use the DomainStatistics utility to get the raw numbers on hosts, domains and TLDs, but this won't tell you whether a domain has been fully crawled, because the crawling frontier can always change. You can be sure that everything (disregarding url filters) has been crawled if

RE: Differences between 2.1 and 1.6

2013-02-25 Thread Markus Jelsma
Something seems to be missing here. It's clear that 1.x has more features and is a lot more stable than 2.x. Nutch 2.x can theoretically perform a lot better if you are going to crawl on a very large scale, but I still haven't seen any numbers to support this assumption. Nutch 1.x can easily

RE: Crawl script numberOfRounds

2013-02-19 Thread Markus Jelsma
Yes. -Original message- From:Amit Sela am...@infolinks.com Sent: Tue 19-Feb-2013 13:40 To: user@nutch.apache.org Subject: Crawl script "numberOfRounds" Is the crawl script's numberOfRounds argument the equivalent of the depth argument in the crawl command? Thanks.

RE: fields in solrindex-mapping.xml

2013-02-16 Thread Markus Jelsma
Those are added by IndexerMapReduce (or 2.x equivalent) and index-basic. They contain the crawl datum's signature, the time stamp (see index-basic) and crawl datum score. If you think you don't need them, you can safely omit them. -Original message- From:alx...@aim.com alx...@aim.com

RE: Nutch identifier while indexing.

2013-02-13 Thread Markus Jelsma
You can use the subcollection indexing filter to set a value for URL's that match a string. With it you can distinguish them even if they are on the same host and domain. -Original message- From:mbehlok m_beh...@hotmail.com Sent: Wed 13-Feb-2013 21:20 To: user@nutch.apache.org Subject:

RE: DiskChecker$DiskErrorException

2013-02-11 Thread Markus Jelsma
Hi- Also enough space in your /tmp directory? Cheers -Original message- From:Alexei Korolev alexei.koro...@gmail.com Sent: Mon 11-Feb-2013 09:27 To: user@nutch.apache.org Subject: DiskChecker$DiskErrorException Hello, Already twice I got this error: 2013-02-08

RE: performance question: fetcher and parser in separate map/reduce jobs?

2013-02-09 Thread Markus Jelsma
A parsing fetcher does everything in the mapper. Please check the output() method around line 1012 onwards: http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup Parsing, signature, outlink processing (using code in ParseOutputFormat) all happens

RE: performance question: fetcher and parser in separate map/reduce jobs?

2013-02-09 Thread Markus Jelsma
Oh, I'd like to add that the biggest problem is memory and the possibility for a parser to hang, consume resources, time out everything else and destroy the segment. -Original message- From:Weilei Zhang zhan...@gmail.com Sent: Sat 09-Feb-2013 23:40 To: user@nutch.apache.org

RE: Best Practice to optimize Parse reduce step / ParseoutputFormat

2013-02-08 Thread Markus Jelsma
-Original message- From:kemical mickael.lume...@gmail.com Sent: Fri 08-Feb-2013 10:53 To: user@nutch.apache.org Subject: Best Practice to optimize Parse reduce step / ParseoutputFormat Hi, I've been looking for some time now the reasons of Parse reduce taking a lot of

RE: Could not find any valid local directory for output/file.out

2013-02-08 Thread Markus Jelsma
The /tmp directory is not cleaned up IIRC. You're safe to empty it as long as you don't have a job running ;) -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Fri 08-Feb-2013 20:48 To: user@nutch.apache.org Subject: Re: Could not find any valid local

RE: Could not find any valid local directory for output/file.out

2013-02-08 Thread Markus Jelsma
Hadoop stores temporary files there, such as shuffled map output data; you need it! But you can rm -rf it after a complete crawl cycle. Do not clear it while a job is running, or it's going to miss its temp files. -Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Fri

RE: increase the number of fetches at agiven time on nutch 1.6 or 2.1

2013-01-28 Thread Markus Jelsma
Try setting -numFetchers N on the generator. -Original message- From:Sourajit Basak sourajit.ba...@gmail.com Sent: Mon 28-Jan-2013 11:57 To: user@nutch.apache.org Subject: Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1 A higher number of per host threads,

RE: Solr dinamic fields

2013-01-28 Thread Markus Jelsma
Hi -Original message- From:Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu Sent: Mon 28-Jan-2013 17:01 To: user@nutch.apache.org Subject: Solr dinamic fields Hi: I'm currently working on a platform to crawl a large number of PDF files. Using nutch (and tika) I'm able

RE: conditional indexing

2013-01-23 Thread Markus Jelsma
Hi - I've not yet committed a fix for: https://issues.apache.org/jira/browse/NUTCH-1449 This will allow you to stop documents from being indexed from within your indexing filter. Order can be configured using the indexing.filter.order configuration directive (or similar). -Original
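The behaviour NUTCH-1449 enables can be sketched as a filter chain where a filter that returns nothing causes the document to be dropped. Plain Python to show the control flow, not the actual IndexingFilter API:

```python
def language_filter(doc):
    """Hypothetical conditional-indexing filter: only keep English docs."""
    return doc if doc.get("lang") == "en" else None

def run_filters(doc, filters):
    """Apply filters in configured order; None short-circuits indexing."""
    for f in filters:
        doc = f(doc)
        if doc is None:
            return None  # document dropped, never reaches the index
    return doc
```

Filter order matters here for the same reason indexing.filter.order exists: a cheap dropping filter placed first saves the cost of running the rest.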

RE: Nutch support with regards to Deduplication and Document versioning

2013-01-23 Thread Markus Jelsma
If you use 1.x and don't merge segments you still have older versions of documents. There is no active versioning in Nutch 1x except segment naming and merging, if you use it. -Original message- From:Tejas Patil tejas.patil...@gmail.com Sent: Wed 23-Jan-2013 09:25 To:

RE: solrindex deleteGone vs solrclean

2013-01-23 Thread Markus Jelsma
Hi, -deleteGone relies on segment information to delete records, which is faster and indeed somewhat on-the-fly. The solrclean command relies on CrawlDB information and will always work, even if you lost your segment or just periodically delete old segments. Cheers -Original message-

RE: Synthetic Tokens

2013-01-21 Thread Markus Jelsma
Hi, In Nutch a `synthetic token` maps to a field/value pair. You need an indexing filter to read the key/value pair from the parsed metadata and add it as a field/value pair to the NutchDocument. You may also need a custom parser filter to extract the data from somewhere and store it to the
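The flow described above, with a parse filter storing a key/value pair in the parse metadata and an indexing filter copying it onto the document as a field, can be sketched like this. Names and the extraction rule are illustrative, not the Nutch plugin API:

```python
def parse_filter(parse_meta, raw_text):
    """Hypothetical parse filter: extract data and stash it in parse metadata."""
    if "nutch" in raw_text.lower():
        parse_meta["topic"] = "nutch"
    return parse_meta

def indexing_filter(doc, parse_meta):
    """Read the key/value pair back and add it as a field (the 'synthetic token')."""
    if "topic" in parse_meta:
        doc["topic"] = parse_meta["topic"]
    return doc
```

In real Nutch the first half is a ParseFilter writing to the parse metadata and the second an IndexingFilter adding a field to the NutchDocument.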

RE: Wrong ParseData in segment

2013-01-16 Thread Markus Jelsma
/WritingPluginExample), but I'll add one. Cheers, Sebastian 2012/11/30 Markus Jelsma markus.jel...@openindex.io: Hi In our case it is really in the segment, and ends up in the index. Are there any known issues with parse filters? In that filter we do set the Parse object as class attribute

RE: Wrong ParseData in segment

2013-01-16 Thread Markus Jelsma
a shared instance variable references to DOM nodes slipped from one call of filter() to the other. Is there a possibility to ensure that every instance of ParseUtil has it's own plugin instances? Would be worth to check. Cheers, Sebastian On 01/16/2013 06:55 PM, Markus Jelsma wrote

RE: [ANNOUNCE] New Nutch committer and PMC : Tejas Patil

2013-01-14 Thread Markus Jelsma
Nice! Thanks -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Mon 14-Jan-2013 20:28 To: d...@nutch.apache.org Cc: user@nutch.apache.org Subject: Re: [ANNOUNCE] New Nutch committer and PMC : Tejas Patil Welcome aboard Tejas Best Lewis On

RE: How segments is created?

2013-01-13 Thread Markus Jelsma
-Original message- From:Bayu Widyasanyata bwidyasany...@gmail.com Sent: Sun 13-Jan-2013 07:34 To: user@nutch.apache.org Subject: Re: How segments is created? On Sun, Jan 13, 2013 at 12:47 PM, Tejas Patil tejas.patil...@gmail.comwrote: Well, if you know that the front

RE: code changes not reflecting when deployed on hadoop

2012-12-27 Thread Markus Jelsma
Seems the job file is not deployed to all task trackers and I'm not sure why. Can you try using the nutch script to run your fetcher? -Original message- From:Sourajit Basak sourajit.ba...@gmail.com Sent: Thu 27-Dec-2012 13:29 To: user@nutch.apache.org Subject: code changes not

RE: code changes not reflecting when deployed on hadoop

2012-12-27 Thread Markus Jelsma
On Thu, Dec 27, 2012 at 7:13 PM, Sourajit Basak sourajit.ba...@gmail.comwrote: How do you use the nutch script on a cluster ? On Thu, Dec 27, 2012 at 6:25 PM, Markus Jelsma markus.jel...@openindex.io wrote: Can you try using the nutch script to run your fetcher?

RE: Nutch approach for DeadLinks

2012-12-26 Thread Markus Jelsma
Hi - Nutch 1.5 has a -deleteGone switch for the SolrIndexer job. This will delete permanent redirects and 404's that have been discovered during the crawl. 1.6 also has a -deleteRobotsNoIndex that will delete pages that have a robots meta tag with a noindex value. -Original message-

RE: About the version of the nutch

2012-12-24 Thread Markus Jelsma
Hi - it depends on the estimated size of your data and the available hardware. You can simply get the current 1.0.x stable or 1.1.x beta Hadoop version, both will run fine. The choice is which Nutch to use, 1.x is very stable and has more features and can be used for very large scale crawls

RE: shouldFetch rejected

2012-12-17 Thread Markus Jelsma
Hi - curTime does not exceed fetchTime, thus the record is not eligible for fetch. -Original message- From:Jan Philippe Wimmer i...@jepse.net Sent: Mon 17-Dec-2012 13:31 To: user@nutch.apache.org Subject: Re: shouldFetch rejected Hi again. i still have that issue. I start

RE: How to extend Nutch for article crawling

2012-12-17 Thread Markus Jelsma
The 1.x indexer can filter and normalize. -Original message- From:Julien Nioche lists.digitalpeb...@gmail.com Sent: Mon 17-Dec-2012 15:11 To: user@nutch.apache.org Subject: Re: How to extend Nutch for article crawling Hi See comments below 1. Add article list pages into

RE: identify domains from fetch lists taking lot of time.

2012-12-14 Thread Markus Jelsma
Hi - you have to get rid of those URL's via URL filters. If you cannot filter them out you can set the fetcher time limit (see nutch-default) to limit the time the fetcher runs, or set the fetcher minimum throughput (see nutch-default). The latter will abort the fetcher if less than N

RE: fetcher partitioning

2012-12-10 Thread Markus Jelsma
- From:Sourajit Basak sourajit.ba...@gmail.com Sent: Mon 10-Dec-2012 10:55 To: user@nutch.apache.org Cc: Markus Jelsma markus.jel...@openindex.io Subject: Re: fetcher partitioning Could anyone review this patch for using a pluggable custom partitioner ? For the time, I have just copied over

RE: fetcher partitioning

2012-12-10 Thread Markus Jelsma
, Sourajit On Mon, Dec 10, 2012 at 4:23 PM, Markus Jelsma markus.jel...@openindex.iowrote: Sourajit, Looks fine at a first glance. A partitioner does not partition between threads, only mappers. It also makes little sense because in the fetcher number of threads can be set plus the queue

RE: [ANNOUNCE] Apache Nutch 1.6 Released

2012-12-10 Thread Markus Jelsma
Thanks Lewis! :) -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Sat 08-Dec-2012 22:56 To: annou...@apache.org; user@nutch.apache.org Cc: d...@nutch.apache.org Subject: [ANNOUNCE] Apache Nutch 1.6 Released Hi All, The Apache Nutch PMC are

RE: New Scoring

2012-12-05 Thread Markus Jelsma
-Original message- From:Pratik Garg saytopra...@gmail.com Sent: Wed 05-Dec-2012 19:17 To: user@nutch.apache.org Cc: Chirag Goel goel.chi...@gmail.com Subject: New Scoring Hi, Nutch provides a default and new Scoring method for giving score to the pages. I have couple of

RE: hung threads in big nutch crawl process

2012-12-03 Thread Markus Jelsma
Informáticas _ -Original message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Monday, 03 December 2012 1:42 PM To: user@nutch.apache.org Subject: RE: hung threads in big nutch crawl process

RE: Fetch content inside nutch parse

2012-11-30 Thread Markus Jelsma
See how the indexchecker fetches URL's: http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java?view=markup -Original message- From:Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu Sent: Fri 30-Nov-2012 16:46 To: user@nutch.apache.org

RE: Indexing-time URL filtering again

2012-11-29 Thread Markus Jelsma
. On Mon, Nov 26, 2012 at 3:21 AM, Markus Jelsma markus.jel...@openindex.iowrote: I checked the code. You're probably not pointing it to a valid path or perhaps the build is wrong and you haven't used ant clean before building Nutch. If you keep having trouble you may want to check out

RE: size of crawl

2012-11-29 Thread Markus Jelsma
Impossible to say but perhaps there are more non-200 fetched records. Carefully look at the fetcher logs and inspect the crawldb with the readdb -stats command. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Thu 29-Nov-2012 07:04 To: user user@nutch.apache.org

RE: Nutch efficiency and multiple single URL crawls

2012-11-29 Thread Markus Jelsma
... (1) update config file to restrict domain crawls - (2) run command that crawls a domain with changes from config file while not having to rebuild job file - (3) index to Solr What would the (general) command be for step (2) is my question. On Mon, Nov 26, 2012 at 5:16 AM, Markus Jelsma

RE: Access crawled content or parsed data of previous crawled url

2012-11-29 Thread Markus Jelsma
Hi, This is a difficult problem in MapReduce, also because of the fact that one image URL may be embedded in many documents. There are various methods you could use to aggregate the records, but none I can think of will work very well or is straightforward to implement. I think the most

RE: The topN parameter in nutch crawl

2012-11-29 Thread Markus Jelsma
Subject: Re: The "topN" parameter in nutch crawl How would you characterize the crawling algorithm? Depth-first, breadth-first, or some heuristic-based? On Thu, Nov 29, 2012 at 2:10 PM, Markus Jelsma markus.jel...@openindex.iowrote: Hi, None of all three. The topN-parameter

RE: The topN parameter in nutch crawl

2012-11-29 Thread Markus Jelsma
alphabetically). It just picks the first eligible URL in the sorted list. You really should take a good look at the Generator code, it'll answer most of your questions. On Thu, Nov 29, 2012 at 3:03 PM, Markus Jelsma markus.jel...@openindex.iowrote: Nutch does neither. If scoring is used

RE: trunk

2012-11-27 Thread Markus Jelsma
Trunk is a directory in svn in which actual development is happening: http://svn.apache.org/viewvc/nutch/trunk/ -Original message- From:Joe Zhang smartag...@gmail.com Sent: Tue 27-Nov-2012 01:46 To: user user@nutch.apache.org Subject: trunk In a different thread, Markus suggested

RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-27 Thread Markus Jelsma
Hi - are you sure you have tabs separating the target and the mapped mimes? Use the nutch indexchecker tool to quickly test if it works. -Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Tue 27-Nov-2012 21:18 To: user@nutch.apache.org Subject: RE: problem with

RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-27 Thread Markus Jelsma
original - From: Markus Jelsma markus.jel...@openindex.io To: user@nutch.apache.org Sent: Tuesday, 27 November 2012 15:33:20 Subject: RE: problem with text/html content type of documents appears application/xhtml+xml in solr index Hi - are you sure you have tabs separating the target

RE: Indexing-time URL filtering again

2012-11-26 Thread Markus Jelsma
again yes that's wht i've been doing. but ant itself won't produce the official binary release. On Mon, Nov 26, 2012 at 2:16 PM, Markus Jelsma markus.jel...@openindex.iowrote: just ant will do the trick. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Mon

RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
...@gmail.com Sent: Sun 25-Nov-2012 08:42 To: Markus Jelsma markus.jel...@openindex.io; user user@nutch.apache.org Subject: Re: Indexing-time URL filtering again This does seem a bug. Can anybody help? On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang smartag...@gmail.com wrote: Markus, could you

RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-25 Thread Markus Jelsma
Hi - trunk's `more` indexing filter can map MIME types to any target. With it you can map both (x)html MIME types to text/html or to `web page`. https://issues.apache.org/jira/browse/NUTCH-1262 -Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Sun 25-Nov-2012 00:48 To:

RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
the patch command correctly, and re-build nutch. But the problem still persists... On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma markus.jel...@openindex.io wrote: No, this is no bug. As i said, you need either to patch your Nutch or get the sources from trunk. The -filter parameter

RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-25 Thread Markus Jelsma
an application/xhtml+xml to text/html in solr index. -Original message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Sunday, 25 November 2012 4:33 AM To: user@nutch.apache.org Subject: RE: problem with text/html content type of documents appears application

RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
! On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma markus.jel...@openindex.iowrote: You should provide the log output. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 25-Nov-2012 17:27 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again

RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
:30 PM, Markus Jelsma markus.jel...@openindex.io wrote: You seems to have an NPE caused by your regex rules, for some weird reason. If you can provide a way to reproduce you can file an issue in Jira. This NPE should also occur if your run the regex tester. nutch

RE: Indexing-time URL filtering again

2012-11-22 Thread Markus Jelsma
Hi, I just tested a small index job that usually writes 1200 records to Solr. It works fine if i specify -. in a filter (index nothing) and point to it with -Durlfilter.regex.file=path like you do. I assume you mean by `it doesn't work` that it filters nothing and indexes all records from the

RE: doubts about some propierties on nutch-site.xml file

2012-11-22 Thread Markus Jelsma
See: http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html -Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Fri 23-Nov-2012 03:29 To: user@nutch.apache.org Subject: doubts about some propierties on nutch-site.xml file Hi all.
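The AdaptiveFetchSchedule linked above adjusts the re-fetch interval per page: shrink it when the page changed, grow it when it did not, clamped between a minimum and maximum. A rough Python sketch of that rule; the rate and bound values here are illustrative, see the javadoc and nutch-default.xml for the real parameters:

```python
MIN_INTERVAL = 60            # seconds (illustrative lower bound)
MAX_INTERVAL = 30 * 86400    # seconds (illustrative upper bound)
INC_RATE = 0.4               # growth factor when page is unmodified
DEC_RATE = 0.2               # shrink factor when page was modified

def next_interval(interval, modified):
    """Adapt the fetch interval after a fetch, then clamp it."""
    if modified:
        interval = interval * (1.0 - DEC_RATE)
    else:
        interval = interval * (1.0 + INC_RATE)
    return min(max(interval, MIN_INTERVAL), MAX_INTERVAL)
```

Over many cycles, frequently changing pages converge toward the minimum interval and static pages drift toward the maximum.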

RE: Best practices for running Nutch

2012-11-19 Thread Markus Jelsma
Hi -Original message- From:kiran chitturi chitturikira...@gmail.com Sent: Sun 18-Nov-2012 18:38 To: user@nutch.apache.org Subject: Best practices for running Nutch Hi! I have been running crawls using Nutch for 13000 documents (protocol http) on a single machine and it goes on

RE: custom plugin's constructor unable to access hadoop conf

2012-11-16 Thread Markus Jelsma
That's because the object is not set in the constructor. You can access Configuration after setConf() is called. So defer your work in the constructor to this method. public void setConf(Configuration conf) { this.conf = conf; } -Original message- From:Sourajit Basak

RE: site-specific crawling policies

2012-11-16 Thread Markus Jelsma
You can override some URL filter paths in nutch-site or with command line options (tools) such as bin/nutch fetch -Durlfilter.regex.file=bla. You can also set NUTCH_HOME and keep everything separate if you're running it locally. On Hadoop you'll need separate job files. -Original

RE: re-Crawl re-fetch all pages each time

2012-11-15 Thread Markus Jelsma
Hi - this should not happen. The only thing I can imagine is that the update step doesn't succeed, but that would mean nothing is going to be indexed either. You can inspect a URL using the readdb tool; check before and after. -Original message- From:vetus ve...@isac.cat Sent: Thu

RE: adding custom metadata to CrawlDatum during parse

2012-11-14 Thread Markus Jelsma
Hi - Sure, check the db.parsemeta.to.crawldb configuration directive. -Original message- From:Sourajit Basak sourajit.ba...@gmail.com Sent: Wed 14-Nov-2012 08:10 To: user@nutch.apache.org Subject: adding custom metadata to CrawlDatum during parse Is it possible to add custom

RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread Markus Jelsma
In trunk the modified time is based on whether or not the signature has changed. It makes little sense to rely on HTTP headers because almost no CMS implements them correctly, and they mess (or allow things to be messed with on purpose) with an adaptive schedule.
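Signature-based change detection, as described above, boils down to digesting the content and comparing the digest with the stored one; the page counts as modified only when the signature differs. MD5 here stands in for Nutch's configurable Signature implementations (e.g. MD5 of raw content or TextProfileSignature):

```python
import hashlib

def signature(content: bytes) -> str:
    """Digest the fetched/parsed content."""
    return hashlib.md5(content).hexdigest()

def is_modified(old_sig: str, content: bytes) -> bool:
    """A page is 'modified' only when its signature changed."""
    return signature(content) != old_sig
```

Unlike Last-Modified headers, this cannot be faked by a misbehaving server: the content itself has to change.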

RE: Simulating 2.x's page.putToInlinks() in trunk

2012-11-13 Thread Markus Jelsma
In trunk you can use the Inlink and Inlinks classes. The first for each inlink and the latter to add the Inlink objects to. Inlinks inlinks = new Inlinks(); inlinks.add(new Inlink("http://nutch.apache.org/", "Apache Nutch")); The inlink URL is the key in the key/value pair so you won't see that

RE: very slow generator step

2012-11-12 Thread Markus Jelsma
Hi - Please use the -noFilter option. It is usually useless to filter in the generator because they've already been filtered in the parse step and/or update step. -Original message- From:Mohammad wrk mhd...@yahoo.com Sent: Mon 12-Nov-2012 18:43 To: user@nutch.apache.org Subject:

RE: very slow generator step

2012-11-12 Thread Markus Jelsma
configuration for about 4 days and all of a sudden one crawl causes a jump of 100 minutes? Cheers, Mohammad   From: Markus Jelsma markus.jel...@openindex.io To: user@nutch.apache.org user@nutch.apache.org Sent: Monday, November 12, 2012 11:19:11 AM

RE: Tika Parsing not working in the latest version of 2.X?

2012-11-08 Thread Markus Jelsma
Try cleaning your build. -Original message- From:j.sulli...@thomsonreuters.com j.sulli...@thomsonreuters.com Sent: Thu 08-Nov-2012 07:23 To: user@nutch.apache.org Subject: Tika Parsing not working in the latest version of 2.X? Just tried the latest 2.X after being away for a

RE: URL filtering: crawling time vs. indexing time

2012-11-04 Thread Markus Jelsma
-D as a valid command parameter for solrindex. On Fri, Nov 2, 2012 at 11:37 AM, Markus Jelsma markus.jel...@openindex.iowrote: Ah, i understand now. The indexer tool can filter as well in 1.5.1 and if you enable the regex filter and set a different regex configuration file when indexing

RE: timestamp in nutch schema

2012-11-04 Thread Markus Jelsma
Hi - the timestamp is just the time when a page is being indexed. Not very useful except for deduplication. If you want to index some publishing date you must first identify the source of that date and get it out of the webpages. It's possible to use og:date or other meta tags, or perhaps other
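Pulling a publishing date out of page markup, as suggested above for og:date-style meta tags, can be sketched with the standard library's html.parser. This is illustrative only; a real Nutch parse filter would work on the DOM the parser already built:

```python
from html.parser import HTMLParser

class DateMetaExtractor(HTMLParser):
    """Collect the content of a <meta property="og:date"> tag, if present."""
    def __init__(self):
        super().__init__()
        self.date = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("property") == "og:date":
            self.date = a.get("content")

def extract_date(html):
    p = DateMetaExtractor()
    p.feed(html)
    return p.date
```

An indexing filter could then index the extracted value as the document's date field instead of the index-time timestamp.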

RE: URL filtering: crawling time vs. indexing time

2012-11-02 Thread Markus Jelsma
-Original message- From:Joe Zhang smartag...@gmail.com Sent: Fri 02-Nov-2012 10:04 To: user@nutch.apache.org Subject: URL filtering: crawling time vs. indexing time I feel like this is a trivial question, but I just can't get my ahead around it. I'm using nutch 1.5.1 and solr

RE: URL filtering: crawling time vs. indexing time

2012-11-02 Thread Markus Jelsma
://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and you start crawling at mysite.com, you'll get zero results, as there is no match. On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma markus.jel...@openindex.iowrote: -Original message- From:Joe Zhang smartag

RE: Information about compiling?

2012-11-01 Thread Markus Jelsma
Hi, There are binary versions of 1.5.1 but not 2.x. http://apache.xl-mirror.nl/nutch/1.5.1/ About the scripts: you have to build Nutch and then go to the runtime/local directory to run bin/nutch. Cheers -Original message- From:Dr. Thomas Zastrow p...@thomas-zastrow.de Sent: Thu

RE: [crawler-common] infoQ article Apache Nutch 2 Features and Product Roadmap

2012-11-01 Thread Markus Jelsma
Cheers! -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Thu 01-Nov-2012 18:30 To: user@nutch.apache.org Subject: Re: [crawler-common] infoQ article Apache Nutch 2 Features and Product Roadmap Nice one Julien. Its nothing short of a privilege to be

RE: fetch time

2012-10-27 Thread Markus Jelsma
Hi - Yes, the fetch time is the time when the record is eligible for fetch again. Cheers, -Original message- From:Stefan Scheffler sscheff...@avantgarde-labs.de Sent: Sat 27-Oct-2012 14:49 To: user@nutch.apache.org Subject: fetch time Hi, When i dump out the crawl db, there

RE: Format of content file in segments?

2012-10-27 Thread Markus Jelsma
Hi Морозов, It's a directory containing Hadoop map file(s) that store key/value pairs. Hadoop's Text class is the key and Nutch's Content class is the value. You would need Hadoop to easily process the files

RE: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Markus Jelsma
Hi, You cannot recover the mapper output as far as I know. But anyway, one should never have a fetcher running for three days. It's far better to generate a large number of smaller segments and fetch them sequentially. If an error occurs, only a small portion is affected. We never run fetchers

RE: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf

2012-10-26 Thread Markus Jelsma
Hi, -Original message- From:kiran chitturi chitturikira...@gmail.com Sent: Thu 25-Oct-2012 20:49 To: user@nutch.apache.org Subject: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf Hi, i have built Nutch 2.x in eclipse using this tutorial (

RE: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Markus Jelsma
? http://wiki.apache.org/nutch/FAQ On Fri, Oct 26, 2012 at 2:10 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, You cannot recover the mapper output as far as i know. But anyway, one should never have a fetcher running for three days. It's far better to generate a large

RE: RegEx URL Normalizer

2012-10-22 Thread Markus Jelsma
-Oct-2012 00:34 To: user@nutch.apache.org Cc: dkavr...@gmail.com; Markus Jelsma markus.jel...@openindex.io Subject: Re: RegEx URL Normalizer Hi, I am interested in doing this i.e. only strip out parameters from url if some other string is found as well, in my case it will be a domain name. I

RE: Best practice to index a large crawl through Solr?

2012-10-22 Thread Markus Jelsma
Hi - Hadoop can write more records per second than Solr can analyze and store, especially with multiple reducers (threads in Solr). SolrCloud is notoriously slow when it comes to indexing compared to a stand-alone setup. However, this should not be a problem at all as you're not dealing with

RE: Best practice to index a large crawl through Solr?

2012-10-22 Thread Markus Jelsma
Hi -Original message- From:Thilina Gunarathne cset...@gmail.com Sent: Tue 23-Oct-2012 00:38 To: user@nutch.apache.org Subject: Re: Best practice to index a large crawl through Solr? Hi Markus, Thanks a lot for the info. Hi - Hadoop can write more records per second than Solr

RE: Fetcher Thread

2012-10-18 Thread Markus Jelsma
Hi Ye, -Original message- From:Ye T Thet yethura.t...@gmail.com Sent: Thu 18-Oct-2012 15:46 To: user@nutch.apache.org Subject: Fetcher Thread Hi Folks, I have two questions about the Fetcher Thread in Nutch. The value fetcher.threads.fetch in configuration file determines the
