Re: Regex filter - sanity check

2010-11-15 Thread Sebastian Nagel
Hi Eric, I am using a regex filter to just crawl a directory of a remote host: http://www.oyez.org.\cases\ -. I guess the right rules are: # accept only URLs starting with http://www.oyez.org/cases/ +^http://www\.oyez\.org/cases/ # skip everything else -. The Java regular expressions
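
Laid out as they would appear in regex-urlfilter.txt (the two rules are the ones quoted above; the file layout is the usual one):

    # accept only URLs starting with http://www.oyez.org/cases/
    +^http://www\.oyez\.org/cases/
    # skip everything else
    -.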

Re: Regex filter - sanity check

2010-11-16 Thread Sebastian Nagel
Hi Eric, So, the last part with prefix url is that I can add: http://Something.mydomain.com http://mydomainc.com/seasonal/law_school_sucks And that file will tell nutch to follow those URLs as prefixes? Is that correct? The plugin urlfilter-prefix will filter out all URLs which do

Re: skipping invalid segments nutch 1.3

2011-07-21 Thread Sebastian Nagel
Hi Leo, hi Lewis, From the times both the fetching and parsing took, I suspect that maybe Nutch didn't actually fetch the URL. This may be the reason. Empty segments may break some of the crawler steps. But if I'm not wrong it looks like the updatedb command is not quite correct:

Re: 1.4 release - newer hadoop jars

2011-09-30 Thread Sebastian Nagel
can you package 1.4 with updated hadoop jars? I have problems with running nutch in local mode. If I run multiple tasks at once, they delete each other's temporary files. It's worth a try whether newer hadoop libs will fix that. Hi Radim, I don't know whether current versions of hadoop fix this

Re: Fetcher NPE's

2011-10-26 Thread Sebastian Nagel
Hi Markus, the error resembles a problem I've observed some time ago but never managed to open an issue. Opened right now: https://issues.apache.org/jira/browse/NUTCH-1182 The stack you observed is the same. Sebastian On 10/19/2011 05:01 PM, Markus Jelsma wrote: Hi, We sometimes see a

Re: Fwd: Nutch project and my Ph.D. thesis.

2011-11-25 Thread Sebastian Nagel
Hi Sergey, a late answer, but I just read your work and found it very interesting and inspiring, especially your description of a system for the automatic construction of URL filters. Why? - We recently had to set up URL filter and normalization rules for a customer to limit the number of crawled

Re: how to adjust 'content'

2011-12-14 Thread Sebastian Nagel
On 12/14/2011 07:41 AM, Avni, Itamar wrote: Regarding (1) I'd suggest plugging in your own additional implementation of HtmlParseFilter, where you can manipulate the content as you like, and set it back on the returned ParseResult.ParseText. On 12/14/2011 02:11 AM, Hartl, Florian wrote: 2.

Re: Filter by content language ID

2012-01-03 Thread Sebastian Nagel
Hello Alessio, Basically, using (in the filter): String langID = (String) doc.getFieldValue("lang"); I always have a 'null' returned, while the field is correctly added to the index in Solr. Looks like the language-identifier indexing filter is applied after your plug-in. Try to set

Re: Crawl only *.*.us

2012-01-08 Thread Sebastian Nagel
Hi Waleed, in nutch-default.xml: <property> <name>plugin.includes</name> <value>domain-urlfilter.txt</value> </property> No, you have to adapt the property so that among other plugins urlfilter-domain is accepted by the regular expression. E.g.: <property> <name>plugin.includes</name>
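
The completed example might look like this (the plugin list shown is a typical 1.x default with urlfilter-domain added; your version's default may differ):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>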

Re: Aborting with 10 hung threads -ver.2

2012-01-31 Thread Sebastian Nagel
Hi Remi, if this error only occurs for some sites it may be the case that these sites are hosting large documents and serving them slowly. If you do not limit the document size by http.content.limit you may have a look at: https://issues.apache.org/jira/browse/NUTCH-1182 and the properties

Re: Invalid uri?

2012-02-13 Thread Sebastian Nagel
Hi Kaveh, protocol-httpclient does not accept URLs containing white space and other characters which are, strictly speaking, forbidden in URLs and have to be escaped, see http://en.wikipedia.org/wiki/URI_encoding Most browsers accept these URLs and escape the forbidden characters tacitly.

Re: Invalid uri?

2012-02-14 Thread Sebastian Nagel
, Sebastian Nagel wrote: Hi Kaveh, protocol-httpclient does not accept URLs containing white space and other characters which are, strictly speaking, forbidden in URLs and have to be escaped, see http://en.wikipedia.org/wiki/URI_encoding Most browsers accept these URLs and escape the forbidden characters

Re: crawling file system

2012-03-19 Thread Sebastian Nagel
Hi Alessio, you should set the property file.crawl.parent (see below) to false in your nutch-site.xml. Sebastian <property> <name>file.crawl.parent</name> <value>true</value> <description>The crawler is not restricted to the directories that you specified in the Urls file but it is jumping
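
The override in nutch-site.xml would then be (a minimal sketch; only the value changes from the default above):

    <property>
      <name>file.crawl.parent</name>
      <value>false</value>
    </property>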

Re: canonical tag support

2012-03-23 Thread Sebastian Nagel
Hi, there is already an issue open: https://issues.apache.org/jira/browse/NUTCH-710 I've struggled with the rel=canonical tag right now. About 70% of the documents of the crawled site had this tag set. The quick solution was to write a parse filter that extracts the tag and an indexing filter

Re: db_unfetched large number, but crawling not fetching any longer

2012-03-23 Thread Sebastian Nagel
Could you explain what is meant by continuously running crawl cycles? Usually, you run a crawl with a certain depth, a max. number of cycles. If the depth is reached the crawler stops even if there are still unfetched URLs. If generator generates an empty fetch list in one cycle the crawler

Re: Relative urls, interpage href anchors

2012-03-27 Thread Sebastian Nagel
Hi, I had the same problem with this pattern. I think the pattern is intended to remove page anchors while keeping accidentally misplaced query parameters (behind the anchor). In my case, there have been anchor links of the form #action?param1param2 processed by some javascript code.

Re: Bottleneck of my crawls: NativeCodeLoader

2012-03-27 Thread Sebastian Nagel
Hi James, there is a description on how to install native libraries: lib/native/README.txt If installed appropriately native libs are loaded and the warnings will disappear. But are you sure that it's really the library loading that takes the time and not the step run after but without an

Re: Normalizer error: IndexOutOfBoundsException: No group 1

2012-04-02 Thread Sebastian Nagel
Hi Remi, it's not a bug, the substitution pattern is wrong. A captured group $1 is used but nothing is captured. The pattern should be: <pattern>http://google1.com/(.+)</pattern> Now $1 is defined and contains the part matched by .+ Besides, the rule regex
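
In regex-normalize.xml the complete rule would look like this sketch (the substitution target is a hypothetical example; only the pattern is from the thread):

    <regex>
      <pattern>http://google1.com/(.+)</pattern>
      <substitution>http://www.google1.com/$1</substitution>
    </regex>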

Re: how does nutch handle cookies ?

2012-04-05 Thread Sebastian Nagel
Hi Rémy, I'm wondering about how nutch handles cookies defined while fetching a page. 1) are those cookies used when nutch is crawling urls generated from that page? Generally, cookies are ignored. But have a look at https://issues.apache.org/jira/browse/NUTCH-827 Your problem is almost

Re: Is there a way to suppress Javascript outlinks in a page?

2012-04-10 Thread Sebastian Nagel
You could 1) exclude links to *.js documents by URL filters, e.g., add to regex-urlfilter.txt: # exclude JavaScript -\.js$ 2) exclude outlinks from link and script elements in general by adding these to <property> <name>parser.html.outlinks.ignore_tags</name> <value></value> <description>Comma
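
For option 2), a sketch of the property as it might be set in nutch-site.xml (the tag list is an assumption based on the elements named above):

    <property>
      <name>parser.html.outlinks.ignore_tags</name>
      <value>script,link</value>
    </property>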

Re: Getting the parsed HTML content back

2012-04-19 Thread Sebastian Nagel
Hi Vikas, 1) Is there a way (another filter) in which I get the dom structure back which I can then parse with Xpath? The filter method takes the DOM as one of its arguments (DocumentFragment doc). You can use it as input of XPathExpression.evaluate(doc, ...) CAVEAT: XPath is case-sensitive
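
A minimal Java sketch of evaluating an XPath expression against the DocumentFragment handed to HtmlParseFilter#filter (the expression //H1 is a hypothetical example; mind the case-sensitivity caveat above):

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.DocumentFragment;
    import org.w3c.dom.NodeList;

    // inside HtmlParseFilter#filter, where 'doc' is the DocumentFragment argument
    // (throws XPathExpressionException):
    XPath xpath = XPathFactory.newInstance().newXPath();
    // parse-html (NekoHTML) typically upper-cases element names, so match "H1", not "h1"
    NodeList headings = (NodeList) xpath.evaluate("//H1", doc, XPathConstants.NODESET);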

Re: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local..

2012-05-01 Thread Sebastian Nagel
Hi Igor, no disk space on /tmp is one possible reason. The other is: (working in local mode). Are you running multiple instances of Nutch in parallel? If yes, these instances must use disjoint temp directories (hadoop.tmp.dir). There are multiple posts on this list about this topic. Sebastian

Re: Crawl sites with hashtags in url

2012-05-01 Thread Sebastian Nagel
Hi Roberto, as defined in ftp://ftp.rfc-editor.org/in-notes/rfc3986.txt the hash ('#') is used to separate the fragment from the rest of the URL. The RFC explicitly delegates the semantics of the fragment to the media type of the document. In good old HTML the fragment is just an anchor and

Re: Apache Nutch release 1.5 RC2

2012-05-23 Thread Sebastian Nagel
Hi, -1 package name - starting with 1.1 the packages are named apache-nutch-1.x-* Shouldn't we follow this convention? - the top level folder inside is named differently in previous releases, either nutch-1.x (1.3-bin, 1.2-bin) or apache-nutch-1.x (1.4-bin, 1.4-src,

Re: XML parsing

2012-05-24 Thread Sebastian Nagel
Isn't tika responsible for XML parsing? Because I got this: parse.ParserFactory - ParserFactory: Plugin: org.apache.nutch.parse.feed.FeedParser mapped to contentType application/rss+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml. Should I just include xml?

Re: RSS parser

2012-05-24 Thread Sebastian Nagel
(it's too late I know) Have you checked the property http.content.limit (default is only 64kB, RSS feeds are often larger). Looks like the content is truncated: Caused by: com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 300: XML document structures must start and end

Re: New Nutch Committer and PMC member : Sebastian Nagel

2012-05-25 Thread Sebastian Nagel
and smaller improvements on the 1.x branch, and some documentation. Cheers, Sebastian On 05/25/2012 05:56 PM, Julien Nioche wrote: Dear all, It is my pleasure to announce that Sebastian Nagel has joined the Nutch PMC and is a new committer. Sebastian, would you mind telling us about yourself

Re: Cannot run program chmod

2012-05-30 Thread Sebastian Nagel
This is not really a problem of Nutch. The fork to run the command chmod failed because your machine does not have enough memory (RAM + swap). For more information you should google for "error=12, Cannot allocate memory" hadoop error=12 Possible solutions (assuming you are using Linux): look

Re: Getting seed url

2012-06-11 Thread Sebastian Nagel
Hi Sandeep, tracking the seed(s) for a document could be done by a scoring filter. The seed URL must be passed: (0) into CrawlDatum's meta by injectedScore() (alternatively, use additional fields in the seed file: url <tab> seed=url, see the Injector Javadoc), (1) in

Re: Nutch name spyder

2012-06-12 Thread Sebastian Nagel
Hello David, can you specify which version of Nutch you are using? I've run a local test crawl with Nutch 1.5 two weeks ago and just looked into the Apache log file. All seems correct: 127.0.0.1 - - [31/May/2012:22:25:46 +0200] "GET /robots.txt HTTP/1.0" 404 462 "-" "sn-test-crawler/Nutch-1.5"

Re: Getting seed url

2012-06-12 Thread Sebastian Nagel
On 12 June 2012 14:41, Julien Nioche lists.digitalpeb...@gmail.com wrote: That's the idea indeed. The urlmeta plugin allows you to do that simply by setting urlmeta.tags in nutch-site.xml (see nutch-default.xml for description etc...) On 11 June 2012 22:45, Sebastian Nagel wastl.na
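
A sketch of the corresponding nutch-site.xml entry (the tag name "seed" is an assumption matching the seed-file field above; the urlmeta plugin must also be in plugin.includes):

    <property>
      <name>urlmeta.tags</name>
      <value>seed</value>
    </property>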

Re: Behaviour of urlfilter-suffix plug-in when dealing with a URL without filename extension

2012-06-12 Thread Sebastian Nagel
My current workaround would be to delete the .com and .au lines from the configuration file. You could also activate the option +P in suffix-urlfilter.txt: # uncomment the line below to filter on url path #+P The patterns are then applied exclusively to the path of the URL and not to host or
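
A sketch of suffix-urlfilter.txt with the option enabled (the suffix lines are the ones discussed in the thread; the rest of the file is assumed unchanged):

    # filter on url path
    +P
    .com
    .au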

Re: Unable to fetch contents from this particular URL

2012-06-19 Thread Sebastian Nagel
Hi Sandeep, It just fetches the text Analytical Cytometry. It looks like the property http.content.limit is still at its default (64kB) which causes the document to be truncated right after Analytical Cytometry. Unfortunately, truncation is not logged, which would make it easier to locate the reason,
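
To raise the limit, e.g. in nutch-site.xml (the value shown is an arbitrary example; a negative value disables truncation entirely):

    <property>
      <name>http.content.limit</name>
      <value>1048576</value> <!-- 1 MB instead of the 64 kB default -->
    </property>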

Re: Unable to fetch contents from this particular URL

2012-06-19 Thread Sebastian Nagel
it worked. I am able to get all the text. Thank you. It will be really helpful if you/others can guide me with the relative URLs and regular expression problem which I have mentioned in the main post. Regards, Sandeep On Tue, Jun 19, 2012 at 4:28 PM, Sebastian Nagel wastl.na...@googlemail.com wrote

Re: [VOTE] Apache Nutch 1.5.1 Release Candidate

2012-06-26 Thread Sebastian Nagel
-1 The plugin urlnormalizer-host (NUTCH-1319 listed in CHANGES.txt) is missing in the bin package. It also does not build from the src package: it's missing in src/plugin/build.xml of 1.5.1. @Markus: You are right: up to 1.4 there was a top-level folder apache-nutch-1.x/ in the package (src and

Re: Difference between Nutch crawl giving depth='N' and crawling in loop N times with depth='1'

2012-07-12 Thread Sebastian Nagel
Hi Ashish, As far as I understood till now, Nutch triggers the crawl in a loop as many times as the depth value. Please suggest. Yes. For every step (until depth is reached): - generate a list of URLs to be fetched - fetch this list - parse documents and extract outlinks - write these outlink URLs as

Re: How does nutch reflect with HTTP status not 200?

2012-07-18 Thread Sebastian Nagel
Hi, If nutch fetches a page and gets an HTTP status which is not 200 (e.g. 203 307 404 ...), what will it do? First, HTTP status codes are abstracted to a protocol status: - HTTP codes with similar semantics (e.g., 302, 303, 307) are mapped into one protocol status TEMP_MOVED - in addition,

Re: Integrating Nutch

2012-07-22 Thread Sebastian Nagel
conf.set("urlfilter.regex.file", "C:/server/nutch/conf/regex-urlfilter.txt"); conf.set("urlnormalizer.regex.file", "C:/server/nutch/conf/regex-normalize.xml"); I get no exceptions, but the following log entries show up: 12/07/21 14:29:24 ERROR api.RegexURLFilterBase: Can't find resource:

Re: Javadoc incorrect or missing code in 1.5.1 Generator

2012-07-29 Thread Sebastian Nagel
Hi Lewis, the javadoc obviously belongs to the first method generate(Path, Path, int, long, long). This method also uses the two properties generate.filter and generate.normalise. But this method is only referenced by Crawl#run and Benchmark. The third method (with the javadoc) is used by

Re: bin directory empty

2012-08-02 Thread Sebastian Nagel
Hi Luca, it's not normal, it's a bug: https://issues.apache.org/jira/browse/NUTCH-1436 Sorry. Take the tar.gz which contains the missing bin/nutch script. Sebastian On 08/02/2012 01:53 PM, Luca Cavanna wrote: Hello, I just downloaded the 1.5.1 nutch version and found out that the bin

Re: Integrating Nutch

2012-08-02 Thread Sebastian Nagel
One question though: Is there a way to get some more verbose information out of the crawl process than just the logging information? I intend something like the urls crawled, the ones waiting to be crawled, current status etc? Programmatically I can only infer at what stage the process is

Re: crawling site without www

2012-08-04 Thread Sebastian Nagel
Hi Alexei, Because users are lazy some browsers automatically try to add the www (and other stuff) to escape from a "server not found" error, see http://www-archive.mozilla.org/docs/end-user/domain-guessing.html Nutch does no domain guessing. The urls have to be correct and the host name must be

Re: crawling site without www

2012-08-07 Thread Sebastian Nagel
Hi Alexei, I tried a crawl with your script fragment and Nutch 1.5.1 and the URL http://mobile365.ru as seed. It worked, see annotated log below. Which version of Nutch do you use? Check the property db.ignore.external.links (default is false). If true the link from mobile365.ru to
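
For reference, the property with its default (a nutch-site.xml sketch):

    <property>
      <name>db.ignore.external.links</name>
      <value>false</value>
    </property>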

Re: crawling site without www

2012-08-08 Thread Sebastian Nagel
Hi Alexei, So I see just one solution for crawling a limited count of sites with behaviour like on mobile365: limit the scope of sites using regex-urlfilter.txt with a list like this: +^www.mobile365.ru +^mobile365.ru Better: +^https?://(?:www\.)?mobile365\.ru/ or to catch all of mobile365.ru
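
Put together as a regex-urlfilter.txt sketch (the final catch-all reject is an assumption mirroring the usual file layout):

    # accept mobile365.ru with or without www
    +^https?://(?:www\.)?mobile365\.ru/
    # skip everything else
    -.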

Re: CHM Files and Tika

2012-08-09 Thread Sebastian Nagel
Hi Jan, confirmed: Nutch cannot parse, while Tika (same version used by Nutch) can parse chm. The chm parsers are in tika-parser*.jar which is contained in the Nutch package. Any ideas? Sebastian On 08/08/2012 12:03 PM, Jan Riewe wrote: Hey there, i try to parse CHM (Microsoft Help Files)

Re: Happy 10th Birthday Nutch!

2012-08-09 Thread Sebastian Nagel
Hi, I just discovered this nice but really old site http://nutch.sourceforge.net/docs/en/ with translations for a dozen languages. The proposition is still challenging. Sebastian On 08/09/2012 10:31 PM, Lewis John Mcgibbney wrote: Nice one Julien I'm going to update the site with this

Re: limit nutch to all pages within a certain domain

2012-08-12 Thread Sebastian Nagel
However, how is topN determined? It's just the top N unfetched pages sorted by decreasing score. Pages will be re-fetched only after some larger amount of time, 30 days by default, see property db.fetch.interval.default. If I am crawling inside a domain, there will be links from almost every
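
The interval is given in seconds; a sketch of the default setting (30 days):

    <property>
      <name>db.fetch.interval.default</name>
      <value>2592000</value>
    </property>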

Re: limit nutch to all pages within a certain domain

2012-08-12 Thread Sebastian Nagel
of the second cycle have theoretically 1 outlinks. Practically, many targets are shared, so you'll get many fewer outlinks. On Sun, Aug 12, 2012 at 10:27 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: However, how is topN determined? It's just the top N unfetched pages sorted

Re: CHM Files and Tika

2012-08-14 Thread Sebastian Nagel
Hi Jan, opened a Jira issue: https://issues.apache.org/jira/browse/NUTCH-1454 Thanks! Beyond the "can't retrieve parser" error: I've tried a couple of chm files (among them the test files from Tika) but I wasn't able to get Tika to extract content. % java -jar

Re: Escaping URL during redirection

2012-09-09 Thread Sebastian Nagel
Redirects are filtered and normalized. It works for 1.4/1.5 and should for trunk. One subtlety: there is an extra scope for normalization of redirects (fetcher). If scoped normalization rules/expressions are used don't forget to configure this scope with the appropriate regex-normalize rule file

Re: breakpoints in eclipse and nutch 1.5

2012-09-11 Thread Sebastian Nagel
Yes, very much appreciated. Line numbers change frequently between versions. Btw, I switched to use bin/nutch in combination with the Eclipse remote debugger. bin/nutch is very flexible to call exactly that tool you want to debug (parser, URL filter, some custom plugin) and creating launch

Re: tmp folder problem

2012-09-20 Thread Sebastian Nagel
Hi Matteo, have a look at the property hadoop.tmp.dir which allows you to direct the temp folder to another volume with more space on it. For local crawls: - do not share this folder for two simultaneously running Nutch jobs - you have to clean-up the temp folder, esp. after failed jobs (if
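
A nutch-site.xml sketch redirecting the temp folder (the path is hypothetical; use a distinct one per Nutch instance):

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/data/nutch-tmp-instance1</value>
    </property>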

Re: Nutch not crawling jabong

2012-09-24 Thread Sebastian Nagel
Hi, there are plenty of reasons why a document is missing. See http://wiki.apache.org/nutch/DebugTool for a list of possible reasons (sorry, explanations are missing). About the example from jabong. I got 680 outlinks for http://www.jabong.com/men/shoes/mens-sports-shoes/ by calling % nutch

Re: Parse HTML Page with link generated by javascript

2012-10-03 Thread Sebastian Nagel
Hi Alexandre, I try to crawl a website with a menu generated with some javascript code. For example on this website: http://www.beautycenter-riebenbauer.at/ Nutch does not interpret JavaScript but it has a link extractor for JavaScript based on regular expressions, see plugin parse-js. It

Re: Error parsing html

2012-10-09 Thread Sebastian Nagel
I should mention, that I'm using Nutch in a Web-Application. It's possible though it's hard. While debugging I came across the runParser method in ParseUtil class in which the task.get(MAX_PARSE_TIME, TimeUnit.SECONDS); returns null. See

Re: same page fetched severals times in one crawl

2012-10-15 Thread Sebastian Nagel
Hi Pierre, I tried almost the same with just the default settings (only the http-agent is set in nutch-site.xml: it's not Googlebot :-O). All went ok, no documents were crawled twice. I don't know what exactly went wrong and didn't find a definitive hint in your logs. Some suggestions: - the

Re: same page fetched severals times in one crawl

2012-10-16 Thread Sebastian Nagel
... Is it planned to have a script which already handles this generate-fetch-parse-updatedb loop with some tweaks like maximum depth of the crawl, maximum time of the crawl? On 15/10/2012 22:11, Sebastian Nagel wrote: Hi Pierre, I tried almost the same with just the default settings (only

Re: Same pages crawled more than once and slow crawling

2012-10-18 Thread Sebastian Nagel
Hi Luca, I'm using Nutch 2.1 on Linux and I'm having a similar problem to http://goo.gl/nrDLV, my Nutch is fetching the same pages at each round. Um... I failed to reproduce Pierre's problem with - a simpler configuration - HBase as back-end (Pierre and Luca both use mysql) Then I ran

Re: Same pages crawled more than once and slow crawling

2012-10-24 Thread Sebastian Nagel
cd $HOME/nutch-1.X/runtime/local From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current directory. There is a difference whether you run Nutch 1.x from the bin or src package: the former does not contain a runtime/local folder. Sebastian - can you please revert the

Re: can nutch output xml?

2012-10-24 Thread Sebastian Nagel
Hi Mike, afaik, it can't. But it would be really useful for archiving, post-processing, data mining, etc. Have a look at NUTCH-1047 and NUTCH-1088. Currently, you would need to write a class XMLIndexWriter which implements the interface NutchIndexWriter and use it via

Re: Information about compiling?

2012-11-01 Thread Sebastian Nagel
Hi Thomas, I just improved the description on how to run Nutch from the source package in http://wiki.apache.org/nutch/NutchTutorial If you are using Nutch 2.x, you should follow http://wiki.apache.org/nutch/Nutch2Tutorial Thanks, Sebastian On 11/01/2012 10:54 AM, Markus Jelsma wrote: Hi,

Re: Correct syntax for regex-urlfilter.txt - trying to exclude single path results

2012-11-20 Thread Sebastian Nagel
Hi, As far as I know all URLs are resolved long before ever being passed to any filter. The parser is responsible for resolving relative to absolute. Well, my rules with explicit pattern matches for absolute URLs including the protocol and domain failed until I made the protocol and

Re: shouldFetch rejected

2012-11-25 Thread Sebastian Nagel
(because I deleted all the old crawl dirs). In the crawl log I see many pages to fetch, but at the end all of them are rejected. Any ideas? On 24.11.2012 16:36, Sebastian Nagel wrote: I want my crawler to crawl the complete page without setting up schedulers at all. Every crawl process

Re: Wrong ParseData in segment

2012-11-30 Thread Sebastian Nagel
Hi Markus, sounds somewhat similar to NUTCH-1252 but that was rather trivial and easy to reproduce. Sebastian 2012/11/30 Markus Jelsma markus.jel...@openindex.io: Hi, We've got an issue where one in a few thousand records partially contains another record's ParseMeta data. To be specific,

Re: nutch 2.1 command line options

2013-01-06 Thread Sebastian Nagel
While in 1.x all commands show a help message when called as bin/nutch <command>, this is not always the case for 2.x - a known inconsistency (NUTCH-1393). Unfortunately, until this issue is solved I see no other possibility than having a look at the sources to get a definitive list of command options and

Re: problem with nutch2.1 and redirect

2013-01-08 Thread Sebastian Nagel
Hi David, Nutch follows redirects. You should check the URL you are redirected to: http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=a2h&AN=84164637&msid=943330409 If it is - not blocked by URL filters - or by db.ignore.external.links (because it's an external link) - the redirect

Re: Wrong ParseData in segment

2013-01-16 Thread Sebastian Nagel
Hi Markus, right now I have seen this problem in a small test set of 20 documents: - various document types (HTML, PDF, XLS, zip, doc, ods) - small and quite large docs (up to 12 MB) - local docs via protocol-file - fetcher.parse = true - Nutch 1.4, local mode Somehow metadata from one doc

Re: nutch/util/NodeWalker class is not thread safe

2013-01-16 Thread Sebastian Nagel
Hi, Any ideas if this can cause problems? Yes, it can definitely cause problems. I've just observed such a problem in our custom plugin which traverses the DOM tree to extract nodes by CSS3 selectors. And how to make it thread safe? That's hard if not impossible. The inner states (current node,

Re: Wrong ParseData in segment

2013-01-16 Thread Sebastian Nagel
Hi Markus, However, I assumed the plugins were already in a thread-safe environment because each FetcherThread instance has its own instance of ParseUtil. I had similar assumptions but the debug output to investigate my problem is straightforward (the numbers are object hash codes):

Re: Installation of NUTCH on windows7

2013-01-25 Thread Sebastian Nagel
Hi, that's a known problem with Hadoop on Windows / Cygwin: https://issues.apache.org/jira/browse/HADOOP-7682 I don't know whether there is a reliable fix or a workaround but you should search for the error - you are not alone ;-) Sebastian On 01/25/2013 12:49 PM, Revathi R wrote: Hello

Re: Nutch Incremental Crawl

2013-02-01 Thread Sebastian Nagel
Hi David, So even if there is any modification made on a fetched page before this interval and the crawl job is run, it will still not be re-fetched/updated unless this interval is crossed. Yes. That's correct. Is there any way to do immediate update? Yes, provided that you know which

Re: mime type text/plain

2013-02-02 Thread Sebastian Nagel
Hi, the given URL is a redirect (HTTP 303, at least, when I try) with no content (only the HTTP header). Tried with curl and Nutch's parsechecker tool: % bin/nutch parsechecker

Re: Nutch Incremental Crawl

2013-02-04 Thread Sebastian Nagel
, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi David, So even if there is any modification made on a fetched page before this interval and the crawl job is run, it will still not be re-fetched/updated unless this interval is crossed. Yes. That's correct. Is there any way to do

Re: mime type text/plain

2013-02-04 Thread Sebastian Nagel
, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi, the given URL is a redirect (HTTP 303, at least, when I try) with no content (only the HTTP header). Tried with curl and Nutch's parsechecker tool: % bin/nutch parsechecker http://www.nytimes.com/2013/01/31/technology/chinese-hackers

Re: Is there a bug in the crawl script coming with nutch 1.6 ?

2013-02-19 Thread Sebastian Nagel
Hi Amit, hi Lewis, see NUTCH-1500 for details. You can take http://svn.apache.org/repos/asf/nutch/trunk/src/bin/crawl and replace (runtime/local/)bin/crawl of 1.6. It should work. Thanks, anyway! Sebastian On 02/19/2013 06:15 PM, Lewis John Mcgibbney wrote: Hi Amit, I think Seb fixed this

Re: Nutch 1.6 with Java - not loading correct configuration file

2013-02-21 Thread Sebastian Nagel
Hi, So where is Nutch in Java loading the configuration file from? (and how can I overwrite it) – configuration files are found via Java’s classpath – only the first instance of each file found in one of the directories of the classpath is used – settings in nutch-site.xml overwrite
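
A minimal Java sketch of how a client picks up the configuration (NutchConfiguration.create() loads nutch-default.xml and nutch-site.xml via the classpath; the property printed is just an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class ConfCheck {
      public static void main(String[] args) {
        // the conf directory that appears first on the classpath wins
        Configuration conf = NutchConfiguration.create();
        System.out.println(conf.get("http.agent.name"));
      }
    }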

Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-03 Thread Sebastian Nagel
Hi Kiran, there are many possible reasons for the problem. Besides the limits on the number of processes, check the stack size in the Java VM and the system (see java -Xss and ulimit -s). I think in local mode there should be only one mapper and consequently only one thread spent for parsing. So the

Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-03 Thread Sebastian Nagel
, kiran chitturi wrote: Thanks Sebastian for the suggestions. I got around this by using a lower value for topN (2000) than 1. I decided to use a lower value for topN with more rounds. On Sun, Mar 3, 2013 at 3:41 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Kiran, there are many

Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-04 Thread Sebastian Nagel
After all documents are fetched (and possibly parsed) the segment has to be written: finish sorting the data and copy it from the local temp dir (hadoop.tmp.dir) to the segment directory. If IO is a bottleneck this may take a while. Also it looks like you have a lot of content! On 03/04/2013 06:03 AM, kiran

Re: DiskChecker$DiskErrorException

2013-03-04 Thread Sebastian Nagel
Hi Alexei, principally, in local mode you cannot run more than one Hadoop job concurrently, or you have to use disjoint hadoop.tmp.dir properties. There have been a few posts on this list about this topic. I'm not 100% sure whether the commands in your scripts are the reason because they should

Re: parsechecker and redirection

2013-03-25 Thread Sebastian Nagel
Hi Canan, hi Lewis, parsechecker cannot follow redirects, also in trunk / 1.x. It would be nice, at least, if parsechecker would report clearly that there is a redirect. Currently, you have to check content metadata for the redirect target which is easy to overlook. % nutch parsechecker

Re: parsechecker and redirection

2013-03-25 Thread Sebastian Nagel
Hi Lewis, let's address NUTCH-1038, NUTCH-1389, NUTCH-1419, and NUTCH-1501! On 03/25/2013 11:22 PM, Lewis John Mcgibbney wrote: Thanks for clarification on this one Seb. I was aware that you were clued up on this and hoped you would drrop in. On Monday, March 25, 2013, Sebastian Nagel

Re: crawl time for depth param 50 and topN not passed

2013-04-05 Thread Sebastian Nagel
Hi David, What can be crawl time for very big site, given depth param as 50, topN default(not passed ) and default fetch interval as 2mins.. afaik, the default of topN is Long.MAX_VALUE which is very large. So, the size of the crawl is mainly limited by the number of links you get. Anyway, a

Re: Permgen size keeps increasing

2013-04-09 Thread Sebastian Nagel
Hi, It just keeps increasing after each crawling. What does this precisely mean? (a) Are you running one crawl process with many cycles (depth) by launching bin/nutch crawl (org.apache.nutch.crawl.Crawl) (b) or in separate steps (inject, generate, fetch, parse, updatedb, ...)? For (a) see

Re: Nutch 2 hanging after aborting hung threads

2013-04-22 Thread Sebastian Nagel
Hi, more information would be useful: - exact Nutch version (2.?) - how Nutch is called (e.g., via bin/crawl) - details of the configuration, esp. -depth -topN http.content.limit fetcher.parse - storage back-end In general, something is wrong. Maybe some oversized documents are crawled.

Re: Nutch 2 hanging after aborting hung threads

2013-04-23 Thread Sebastian Nagel
several of my files. Also, my server is running Nutch in local mode as well. I don't have a hadoop cluster. On Mon, Apr 22, 2013 at 3:39 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: It's not the documents AFAIK. I'm crawling the same server and it works on my local machine

Re: Nutch 2 hanging after aborting hung threads

2013-04-24 Thread Sebastian Nagel
the entire segment in memory. Has this changed with the move to HBase? Do the files get pushed as soon as they're fetched or does that happen at the end? Thanks. On Tue, Apr 23, 2013 at 3:52 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi, if fetcher.parse is the default (=false

Re: Nutch to index filesystem meta data?

2013-05-13 Thread Sebastian Nagel
That's possible but not out-of-the-box. The available plugin protocol-file does the opposite: - gets the file's raw content to be passed to a parser to extract plain-text content and meta data (author, etc.) - gets some file-specific meta data (e.g., modified time) You have to write your own plugin

Re: Fetcher corrupting some segments

2013-05-27 Thread Sebastian Nagel
Hi Markus, a similar problem was posted some time ago: http://lucene.472066.n3.nabble.com/NegativeArraySizeException-and-quot-problem-advancing-port-rec-quot-during-fetching-tt3994633.html#a3996554 Sebastian On 05/27/2013 11:06 AM, Markus Jelsma wrote: Hi, For some reason the fetcher

Re: IndexWriter Plugin Workflow

2013-06-12 Thread Sebastian Nagel
Hi, I'm writing a custom IndexWriter and I had some questions on the execution workflow. Have a look at NUTCH-1527 and NUTCH-1541. I notice that when I run my index writer plugin the following happens: - the describe String is printed - the .open method is called once - the .write

Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-12 Thread Sebastian Nagel
Hi Tony, you have to register your plugin in src/plugin/build.xml Does your src/plugin/myplugin/plugin.xml properly propagate jar file, extension point and implementing class? And, finally, you have to add your plugin to the property plugin.includes in nutch-site.xml Cheers, Sebastian On

Re: Suffix URLFilter not working

2013-06-12 Thread Sebastian Nagel
Hi Peter, please do not hijack threads. Seed URLs must be fully specified including protocol, e.g.: http://nutch.apache.org/ but not apache.org Sebastian On 06/12/2013 05:08 PM, Peter Gaines wrote: I have installed version 2.2 of nutch on a CentIOS machine and am using the following

Re: Nutch not passing latest CrawlDatum to IndexingFilter plugin

2013-06-18 Thread Sebastian Nagel
Hi Liaokz, After debugging, I could confirm that in CrawlDbReducer.java, Nutch really returns the latest CrawlDatum (at the line output.collect(key, result); the member result has the latest data). I suppose the latest CrawlDatum is written to CrawlDB. Isn't it right? No, or only partially: -

Re: confusion over fetch schedule

2013-06-23 Thread Sebastian Nagel
Hi Joe, Ideally, it should take higher priority than the default interval. This is particularly important for sites such as cnn.com, where the leaf page doesn't really change, but the portal page is updated all the time. AdaptiveFetchSchedule does exactly this: if a page is found modified
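
Enabling it is a one-property change, sketched here for nutch-site.xml (class name as in Nutch 1.x):

    <property>
      <name>db.fetch.schedule.class</name>
      <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    </property>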

Re: Parse reduce stage take forver

2013-06-24 Thread Sebastian Nagel
Hi, I once observed a similar problem: a few 1000 docs per cycle and among them a few hundred with quite many and long outlinks. Parsing was done in the Fetcher (to avoid storing the raw content) and the reduce step took hours. The segments (namely the subdirs containing outlinks) take the size of

Re: [VOTE] Apache Nutch 2.2.1 RC#1

2013-06-27 Thread Sebastian Nagel
+1 On 06/27/2013 08:00 PM, Lewis John Mcgibbney wrote: Hi, It would be greatly appreciated if you could take some time to VOTE on the release candidate for the Apache Nutch 2.2.1 artifacts. This candidate is (amongst other things) a bug fix for NUTCH-1591 - Incorrect conversion of

Re: Fetch iframe from HTML (if exists)

2013-06-27 Thread Sebastian Nagel
Hi Amit, [#document-fragment: null] that does not mean that your DocumentFragment is empty. DocumentFragment.toString() does not print the DOM as XML. How to do this? Have a look at serializeToXML in http://svn.apache.org/viewvc/any23/trunk/core/src/main/java/org/apache/any23/extractor

Re: no digest field avaliable

2013-07-02 Thread Sebastian Nagel
Hi Christian, no field digest showing up in the indexchecker That's correct to some extent. The class behind indexchecker is called IndexingFiltersChecker and it shows the fields added by the configured IndexingFilters. The field digest is added as a field by the class IndexerMapReduce. The digest

Re: no digest field avaliable

2013-07-03 Thread Sebastian Nagel
Hi Christian, with Nutch 1.7 and Solr 3.6.2 and the same for [1], [2] the digest field appears and is filled well. Sebastian On 07/03/2013 09:14 AM, Christian Nölle wrote: On 02.07.2013 22:29, Sebastian Nagel wrote: no field digest showing up in the indexchecker That's correct to some extent

Re: limit to fetch only N pages from each host?

2013-07-05 Thread Sebastian Nagel
I think that should be topN in generate. No, as Markus said: generate.max.count does the job in combination with generate.count.mode. In combination with -depth it's possible to get a limited and almost evenly distributed number of pages per host/domain. <property> <name>generate.max.count</name>
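
A sketch completing the truncated property above, together with its companion (the values are examples; generate.count.mode takes host or domain):

    <property>
      <name>generate.max.count</name>
      <value>100</value>
    </property>
    <property>
      <name>generate.count.mode</name>
      <value>host</value>
    </property>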
