Re: [jira] [Commented] (NUTCH-865) Format source code in unique style

2011-10-23 Thread Sebastian Nagel
Hi, I can only confirm that with Indigo the command line formatter fails on source files using generics. But when launched from the GUI it works. I imported the eclipse-codeformat.xml (Properties Java Code Style Formatter) and run it on the project node (Source Format). 292 files have been

Re: [jira] [Commented] (NUTCH-865) Format source code in unique style

2011-10-24 Thread Sebastian Nagel
at 12:05 AM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi, I can only confirm that with Indigo the command line formatter fails on source files using generics. But when launched from the GUI it works. I imported the eclipse-codeformat.xml (Properties Java Code Style Formatter

Re: CrawlDatumStates diagram lost in wiki

2011-11-18 Thread Sebastian Nagel
Hi Lewis, We'll try to get this sorted out in due course that would be great! maybe you find it on Google images? I tried hard, but I didn't find it. Thanks, Sebastian On 11/18/2011 05:52 PM, Lewis John Mcgibbney wrote: Hi Sebastian, This has happened recently with quite a lot of

contribution to wiki

2011-12-05 Thread Sebastian Nagel
Hi, images and attachments of wiki pages are again viewable. Thanks! But I found that pages are now immutable (at least for me). According to http://wiki.apache.org/nutch/ContributorsGroup I would like to get permission to edit wiki pages. My wiki user name: SebastianNagel Bye, Sebastian

Re: Apache Nutch release 1.5 RC2

2012-05-24 Thread Sebastian Nagel
The tutorial will need to be updated to reflect this change. You are volunteering? ok On 05/24/2012 09:27 AM, Julien Nioche wrote: Hi Seb Moved to dev@ as more relevant [...] - bin should only have the content of runtime/local/ What about runtime/deploy/, esp. nutch-1.5.job ? Does it mean

Re: New Nutch Committer and PMC member : Sebastian Nagel

2012-05-25 Thread Sebastian Nagel
and smaller improvements on the 1.x branch, and some documentation. Cheers, Sebastian On 05/25/2012 05:56 PM, Julien Nioche wrote: Dear all, It is my pleasure to announce that Sebastian Nagel has joined the Nutch PMC and is a new committer. Sebastian, would you mind telling us about yourself

Re: [VOTE] Apache Nutch release 1.5 RC3

2012-05-31 Thread Sebastian Nagel
Hi Lewis, Minor nitpick : the directory /runtime is not necessary as it is built with ANT. Removing it would massively reduce the size of the archive. this applies also to the docs/ folder (15MB uncompressed) for the bin package: +1 Sebastian

Re: [VOTE] Apache Nutch release 1.5 RC3

2012-05-31 Thread Sebastian Nagel
... Thanks Lewis On Thu, May 31, 2012 at 9:29 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Lewis, Minor nitpick : the directory /runtime is not necessary as it is built with ANT. Removing it would massively reduce the size of the archive. this applies also to the docs

Re: [VOTE] Apache Nutch 1.5 release-1.5RC4

2012-05-31 Thread Sebastian Nagel
+1 Sebastian On 05/31/2012 10:37 PM, Lewis John Mcgibbney wrote: Good Evening Everyone, A candidate for the Apache Nutch 1.5 RC4 is available at: http://people.apache.org/~lewismc/apache-nutch-1.5-rc4/ The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz archive of

bin/nutch -core

2012-06-10 Thread Sebastian Nagel
Hi, bin/nutch (help/overview) still contains the option -core which is currently (trunk) not functional: % bin/nutch -core parsechecker Unrecognized option: -core Could not create the Java virtual machine. It's broken since https://issues.apache.org/jira/browse/NUTCH-843 But is it still

Re: VOTE Apache Nutch 2.0 RC1

2012-06-12 Thread Sebastian Nagel
Hi Lewis, my first steps with 2.0 (to be continued, still struggling). Two points (I'll try to give a final vote tomorrow): 1 some guidance would be nice. README.txt points to http://wiki.apache.org/nutch/NutchTutorial which refers to 1.x (I'm using

Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Sebastian Nagel
Hi Lewis, Please see http://wiki.apache.org/nutch/Nutch2Tutorial which is an update of Julien's (I think) page on GORA_HBase. This will get you rocking with HBase. The changes between Cassandra, Accumulo and the other data stores are fairly trivial. I've managed to perform a crawl with 2.0

Re: VOTE Apache Nutch 2.0 RC1

2012-06-14 Thread Sebastian Nagel
We only supply src distributions... Does this principle apply to Nutch 2 as well? Maybe, yes. The situation with the current binary package is uncomfortable: I had to copy/link gora-hbase and hbase jars into lib/ to get nutch running. 2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.com

Re: [VOTE] Apache Nutch 2.0 RC2

2012-06-18 Thread Sebastian Nagel
+1 with a documentation issue about the dependencies: simply copy its HBase core lib from the HBase installation into the local/lib directory. This works for me. Removing lib/hbase-0.90.4.jar and copying hbase-0.94.jar from the HBase installation into lib/ caused an Exception in thread main

ant build: central list of plugins

2012-06-26 Thread Sebastian Nagel
Plugins are registered multiple times in build.xml src/plugins/build.xml default.properties This is error-prone and there are already some inconsistencies (trunk): build.xml: lib-http (given twice in target release) urlfilter-prefix (given twice in target release) default.properties:

Re: [VOTE] Apache Nutch 1.5.1 RC#3

2012-07-08 Thread Sebastian Nagel
+1 Looks perfect and runs. Great work, Lewis! On 07/07/2012 11:07 PM, Lewis John Mcgibbney wrote: *PING* Hi Everyone, I know there have been a good few threads going around with a power of release candidates but I wonder if it is possible to get some feedback on the 1.5.1RC#3 below.

duplicate jar files by plugin dependencies

2012-08-09 Thread Sebastian Nagel
Hi, I just discovered that some jar files in the bin package (1.5.1) and also in nutch.job are packed twice: 2 commons-logging-1.1.1.jar lib parse-tika 2 geronimo-stax-api_1.0_spec-1.0.1.jar lib parse-tika 2 tagsoup-1.2.1.jar parse-html parse-tika 2
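The duplicate check described in this message can be sketched in Python. This is only an illustration, assuming the package has been unpacked so that lib/ and each plugin folder are subdirectories of one root directory; the function name find_duplicate_jars is hypothetical and not part of Nutch.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicate_jars(root):
    """Return {jar name: [directory names]} for jars found in more than one directory.

    Assumption: `root` is an unpacked package whose subdirectories
    (lib/, parse-tika/, ...) each hold their own jar files.
    """
    locations = defaultdict(list)
    # Sort by full path so the directory lists come out in a stable order.
    for jar in sorted(Path(root).rglob("*.jar")):
        locations[jar.name].append(jar.parent.name)
    # Keep only jar names that occur in at least two directories.
    return {name: dirs for name, dirs in locations.items() if len(dirs) > 1}
```

Comparing by file name only is a deliberate simplification: two jars with the same name but different content would still be reported as duplicates.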

Re: svn commit: r1387356 - in /nutch/branches/2.x: CHANGES.txt build.xml

2012-09-18 Thread Sebastian Nagel
Great. On 09/18/2012 10:57 PM, Lewis John Mcgibbney wrote: Hi Seb, I totally forgot about this. I will forward port to 2.1 branch before pushing the release. Thanks Lewis. On Tue, Sep 18, 2012 at 9:52 PM, sna...@apache.org wrote: Author: snagel Date: Tue Sep 18 20:52:08 2012 New

Re: [VOTE] Apache Nutch 2.1 Release Candidate Available

2012-09-27 Thread Sebastian Nagel
+1 * package looks good * sample crawl runs like a charm On 09/21/2012 05:07 PM, Lewis John Mcgibbney wrote: Hi Everyone, A candidate for Apache Nutch 2.1 is available at: http://people.apache.org/~lewismc/apache-nutch-2.1 The release candidate is a src.zip and src.tar.gz ONLY archive

Re: [VOTE] Apache Nutch 2.1 Release Candidate Available

2012-10-01 Thread Sebastian Nagel
Forgot to say: I've run the test crawl with HBase 0.90.5 On 10/01/2012 04:34 PM, Julien Nioche wrote: Would be good to get thumb-ups from people who've tested crawls on other backends (Cassandra, Hbase) before pushing the release. I can't really give a +1 as I've just checked the most obvious

Re: setting modifiedTime in DefaultFetchSchedule

2012-11-15 Thread Sebastian Nagel
Hi Cesare, hmhh... Good catch! The modifiedTime is also set in CrawlDbReducer.reduce right after FetchSchedule.setFetchSchedule is called and the signature hasn't changed compared to the previous fetch, cf. NUTCH-1341. At a first glance, it looks like the modifiedTime is indeed never set with

Re: [DISCUSS] trunk release?

2012-11-22 Thread Sebastian Nagel
+1 to release Now we can hold the 6-month cycle. Chris is right: If we manage to address a couple of the critical issues early next year, we can release earlier. Sebastian On 11/22/2012 06:43 PM, Mattmann, Chris A (388J) wrote: Release early, release often :) I'd say I'd be happy to try and

Re: [VOTE] Apache Nutch 1.6 Release Candidate

2012-11-28 Thread Sebastian Nagel
+1 - source package builds, tests pass - successful test crawl with bin package (20+ URLs, Linux, local mode, Solr 3.6) On 11/23/2012 03:24 PM, lewis john mcgibbney wrote: Hi Everyone, A candidate for the Apache Nutch 1.6 RC#1 is available at:

Re: Outlinks in parse filter

2013-01-29 Thread Sebastian Nagel
Hi Markus, this would mean that urlfilter and urlnormalizer plugins are accessed from parse plugins. At a first glance, sounds somewhat oddish. But it's already the case for the feed parser. We would have to do it for all parse plugins. Since there are not so many that's no argument against.

Re: Outlinks in parse filter

2013-02-01 Thread Sebastian Nagel
Hi Markus, we should be fine right? Yes, even better: FeedParser only contains URLNormalizers and URLFilters objects which get the references to plugin instances themselves via ObjectCache in the constructor. Btw., that's also the way the parse filter plugins are referenced, eg. TikaParser -

Re: Next Release Cycle

2013-04-15 Thread Sebastian Nagel
Hi Lewis, +1 it's time: May for 2.2 and beginning of June for 1.7 to adhere to the 6-month release cycle. After sorting major/critical issues for 1.7 with patches available, I've found: NUTCH-1245 NUTCH-1342 NUTCH-1430 NUTCH-926 NUTCH-1334 NUTCH-1467 which are worth committing. I'll

Re: java.lang.RuntimeException: Filter org.apache.nutch.urlfilter.prefix.PrefixURLFilter not found.

2013-04-22 Thread Sebastian Nagel
Does the property plugin.includes include urlfilter-prefix? Default is only urlfilter-regex. On 04/22/2013 06:28 PM, naveen shukla wrote: Hi All, I got run time exception when i run following command *nutch org.apache.nutch.net.URLFilterChecker -filterName

Re: Unable to parse flv and epub file contents using nutch

2013-05-13 Thread Sebastian Nagel
Now, in order to get or save the files in their actual format, in your case, .flv or .epub files, you will have to write additional program (for example in Java). No, you don't have to: the plugin parse-tika can parse .epub and .flv - see http://tika.apache.org/1.2/formats.html - test it, eg:

fix version 1.7 removed in Jira

2013-05-22 Thread Sebastian Nagel
Hi, please take care not to remove the fix version when applying bulk changes, e.g., 2.2 = 2.3 Alternative fix versions (1.7) are not kept. Luckily Jira is quite powerful, I restored the 1.x fix version using this awful filter: project = NUTCH AND fixVersion in (2.3) AND status = Open AND

Re: [VOTE] Apache Nutch 2.2 Release Candidate

2013-06-04 Thread Sebastian Nagel
+1 (test with hbase) On 06/01/2013 01:17 AM, lewis john mcgibbney wrote: Good Friday Everyone, Glad to get to a stage where we can VOTE on the release of the Apache Nutch 2.2 artifacts. We solved a stack of issues: http://s.apache.org/LPB SVN source tag:

Re: [DISCUSS] Nutch 1.7 ready for release?

2013-06-09 Thread Sebastian Nagel
+1 go ahead! Sebastian On 06/08/2013 11:53 PM, Lewis John Mcgibbney wrote: Thread says it all troops. Best Lewis

Re: Fwd: Nutch Compilation Error with Eclipse

2013-06-11 Thread Sebastian Nagel
Hi Tejas, you should be able to add images as Attachments: there is a tab/link left of More Actions:. Cheers, Sebastian On 06/11/2013 01:30 AM, Tejas Patil wrote: Hi @nutch-dev, I want to put out this [0] tutorial over Nutch wiki. 1. Do you see anything wrong in it or any improvements ?

Re: Feed Plugin Crawl Links

2013-08-07 Thread Sebastian Nagel
Hi Richard, if understood right parse-tika does the job well? Extract content + all links including anchor texts? 1. The plugin parse-tika seems indeed better maintained than feed. 2. plugin feed is special as it treats a rss file as - one master document (the rss file) - many sub-documents

Re: Wiki entry: Tutorials on Nutch

2013-08-29 Thread Sebastian Nagel
Hi Carmen, I was wondering whether the message I sent a little while ago has been seen? Yes, looks like. Sorry. The user CarmenKlaussner has been added to contributors group. Following the common practice I left no space in the user name. Is that ok? Cheers, Sebastian On 08/29/2013 11:15

Re: Nutch 1.7 HTMLParseFilter plugin dev

2013-09-17 Thread Sebastian Nagel
Hello Ivan, Where are the logs? I suppose to see them on the console output while running the hadoop jar nutch.job. Maybe that code is executing on the DataNode?? Yes, these logs should be on the nodes where the tasks have been run. Search for hadoop log location, the answer may depend on the

IOException reading segments with current trunk

2013-09-17 Thread Sebastian Nagel
Hi, recently I got some IO exceptions when reading older segments with recent trunk builds. Did anyone make similar observations? According to the stack it seems possible that NUTCH-1622 causes segments' parse_data to be incompatible between versions? Thanks, Sebastian java.io.IOException: IO

Re: IOException reading segments with current trunk

2013-09-19 Thread Sebastian Nagel
Thanks, good to know. We should add a warning to release notes and CHANGES.txt. Ideally, of course, reading segments should be backward compatible. Sebastian 2013/9/17 Markus Jelsma markus.jel...@openindex.io Yes, we've got trouble with it too, similar exception, but we're did not sync our

Missing nightly API Docs

2013-09-26 Thread Sebastian Nagel
Hi, links from http://nutch.apache.org/ to nightly API Docs https://builds.apache.org/job/Nutch-trunk/javadoc/ https://builds.apache.org/job/Nutch-nutchgora/javadoc/ are broken. Is it still generated? Does anyone know how to fix it? Thanks, Sebastian

Re: [DISCUSS] Release Trunk

2013-12-02 Thread Sebastian Nagel
Hi, +1 to release soon (this year, or early next year) and probably a few others but they could also be done later. At least, these should be done before releasing: NUTCH-1646 IndexerMapReduce to consider DB status NUTCH-1413 Record response time Sebastian On 11/28/2013 05:49 PM, Julien

Re: [DISCUSS] Release Trunk

2014-02-14 Thread Sebastian Nagel
lists.digitalpeb...@gmail.com wrote: Hi guys, At least 2 of the issues that Seb and I had mentioned have now been committed. What about releasing 1.8 from trunk? If so, any volunteers? Julien On 2 December 2013 21:02, Sebastian Nagel wastl.na

Re: Getting statistics about crawled pages

2014-02-19 Thread Sebastian Nagel
Hi Alparslan, You can see the stats in this link: https://developers.google.com/webmasters/state-of-the-web/) We can develop an HTML parser plug-in to provide such an improvement. Nice resource and nice idea. For me that sounds like a combination of the ParserJob and the classic Hadoop word

[DISCUSS] Release 1.8?

2014-03-11 Thread Sebastian Nagel
Hi everyone, NUTCH-1113 and NUTCH-1706 are fixed, broken HostDb (NUTCH-1325) has been removed for now from trunk. No open issues marked for 1.8 are left and everything seems to work! Time to spin a new release candidate? Sebastian

Re: Generator OutOfMemoryError Fix

2014-03-30 Thread Sebastian Nagel
Hi Greg, I am wondering if it would it be possible to integrate this kind of change in the upstream code base? Yes, of course. Please, open an issue in Jira. Ideally, with a patch attached, see: http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_Three:_Using_the_JIRA_and_Developing

Re: Add Field to crawled content for indexing

2014-04-02 Thread Sebastian Nagel
Hi Yann, In Parse type, we don't have getData() so we can't add new metadata. ... So what is the new way to add custom field to index ? Maybe i miss something ... In 2.x data for custom fields can be added to the WebPage's metadata in ParseFilter via page.putToMetadata(Utf8 key, ByteBuffer

Re: upgrading protocol-httpclient to httpclient 4.1.1

2014-04-04 Thread Sebastian Nagel
Hi, does it mean you are (also) addressing NUTCH-1086? Would be great, since this issue is waiting for a solution since long! The reason I picked version 4.1.1 and not the latest is because I noticed it is already in the build/lib dir and I wasn't sure I can use two versions of the jar with

Re: Url validator rejected url because of 2 dots

2014-04-04 Thread Sebastian Nagel
Hi, Url validator plugin reject this kind of url because of .. . I had a look at RFC 2396 and w3c standards. There is no constraint about .. except these /../ and /.. kind of statements. Also Unix systems accept files containing two dots abc..xyz.txt. urlfilter-validator should be relaxed to

Re: upgrading protocol-httpclient to httpclient 4.1.1

2014-04-04 Thread Sebastian Nagel
eclipse: change ivy.xml, close the Eclipse project, call ant eclipse, open project again and press F5 Refresh Sebastian On 04/04/2014 10:56 PM, d_k wrote: On Fri, Apr 4, 2014 at 11:28 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi, does it mean you are (also) addressing NUTCH

Re: Debugging Nutch from Windows

2014-04-23 Thread Sebastian Nagel
Hi Diaa, on Windows Cygwin is required as prerequisite to run Nutch (or other Hadoop-based applications). Cygwin will provide the program chmod. See: http://wiki.apache.org/nutch/GettingNutchRunningWithWindows Sebastian On 04/23/2014 04:57 PM, Diaa Abdallah wrote: Hi, Is there a way to debug

Re: Why are web urls not assumed to be http

2014-04-26 Thread Sebastian Nagel
Hi Diaa, Why doesn't nutch assume that web links that have www. at the beginning are of the http protocol? It would be not a big problem to do so. The url normalizer provides scopes (inject, fetch, etc.): you only have to point the property urlnormalizer.regex.file.inject to a special
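The idea of assuming http for links that start with www. can be illustrated with a small Python sketch. It is not the Nutch regex normalizer and does not reproduce any Nutch rules file; the function name assume_http and the pattern are assumptions chosen for illustration only.

```python
import re

# Hypothetical rule: matches a URL that begins with "www." and therefore has no scheme.
WWW_WITHOUT_SCHEME = re.compile(r"^www\.", re.IGNORECASE)


def assume_http(url):
    """Prefix http:// to URLs that start with www.; leave all other URLs unchanged."""
    if WWW_WITHOUT_SCHEME.match(url):
        return "http://" + url
    return url
```

Anchoring the pattern at the start of the string means URLs that already carry a scheme are never touched.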

Re: About RankingJob for Giraph

2014-05-03 Thread Sebastian Nagel
Hi Talat, At the present our architecture of scoring plugins don't permit. The scoring plugin interface fits into the crawler work and data flow: links are fed into CrawlDb/Webtable, fetch lists are generated, etc. OPIC can be used because it's online. Other link rank algorithms define a

Re: Better Parser Plugin

2014-05-03 Thread Sebastian Nagel
Hi Talat, parse-html uses neko per default, or as alternative tagsoup. Tagsoup is also used by parse-tika. Which parser lib is used internally by parse-html can be set via property parser.html.impl. It will not harm to have more libs available (if they are compatible, also regarding license). If

Re: Better Parser Plugin

2014-05-05 Thread Sebastian Nagel
Hi Talat, thanks for the examples. I've also observed that Neko has some problems even with valid HTML5. Luckily, most pages do not use excessively the syntactic freedom HTML5 allows (not closing tags, leaving implicit tags away). Some problems can be easily fixed (eg., NUTCH-1733), and since

Re: Fixing Nutch 2.x Build on Jenkins

2014-06-19 Thread Sebastian Nagel
Hi Lewis, it seems to be related to NUTCH-1714: WebPage-owned maps (metadata, headers, etc.) are not initialized any more in the constructor. This causes also other tests to fail. The solution would be to replace WebPage page = new WebPage(); by WebPage page = WebPage.newBuilder().build();

Re: New Apache Nutch Site

2014-06-19 Thread Sebastian Nagel
Hi Lewis, looks great! What about the old http://nutch.apache.org/version_control.html Could be useful for users/developers not familiar with Apache resources. Of course, content could be updated. Also I miss the search box of the old version. Shouldn't we add it again? Sebastian On

Re: Fixing Nutch 2.x Build on Jenkins

2014-06-19 Thread Sebastian Nagel
Hi Lewis, a patch is ready, on my machine all tests pass now. Currently, I experience problems with Jira: feel free to open and resolve the issue. Cheers, Sebastian On 06/19/2014 07:58 PM, Lewis John Mcgibbney wrote: Hi Seb, On Thu, Jun 19, 2014 at 1:46 PM, dev-digest-h...@nutch.apache.org

Re: Nearing a 1.9 release?

2014-06-29 Thread Sebastian Nagel
+1 for a release during the next month I plan to address before a release: - 2 issues related to redirects NUTCH-926 https://issues.apache.org/jira/browse/NUTCH-926 NUTCH-1708 https://issues.apache.org/jira/browse/NUTCH-1708 - issues ready for commit: NUTCH-1605

Re: Build failed in Jenkins: Nutch-trunk #2683

2014-07-02 Thread Sebastian Nagel
Build and tests run successfully on my local machine. But it repeatedly fails on ubuntu* Jenkins machines. The error in resolve-test could be related to - changes to test dependencies (NUTCH-1802, NUTCH-1803) - or missing ivy libs in ant installations Any ideas? Sebastian 2014-07-02 6:33

Build failure Nutch-trunk

2014-07-04 Thread Sebastian Nagel
/tree of test, presumably as first dependency of compile-core-test, in parallel to compile-core. Right? I'll fix it over the weekend. But if anybody is faster... You're welcome! Cheers, Sebastian 2014-07-02 17:49 GMT+02:00 Sebastian Nagel wastl.na...@googlemail.com: Build and tests run

Problems running some ant targets on recent trunk

2014-07-16 Thread Sebastian Nagel
Hi, I have some problems running ant targets on recent trunk: % ant runtime fails if run from scratch (after ant clean) but it succeeds after ant test or ant nightly. in a plugin folder, e.g., src/plugin/parse-metatags % ant test The error causing the failure is always:

Re: Push Nutch 1.9

2014-07-30 Thread Sebastian Nagel
+1 sebastian 2014-07-30 10:56 GMT+02:00 Julien Nioche lists.digitalpeb...@gmail.com: Hi Lewis https://issues.apache.org/jira/browse/NUTCH-1755 is more at a discussion stage and can be done later. I have moved it to 1.10 I've just

Nutch @ApacheCon Europe 2014

2014-07-31 Thread Sebastian Nagel
Hi, we're glad to announce that there will be two events dedicated to Nutch at the upcoming ApacheCon Europe http://events.linuxfoundation.org/events/apachecon-europe in Budapest, November 17 - 21, 2014. 1. an introductory talk about Nutch http://sched.co/1nyYa7b as part of the Lucene/Solr

Re: [VOTE] Apache Nutch 1.9 Release Candidate #1

2014-08-16 Thread Sebastian Nagel
+1 * src package: compiles, tests pass * bin package: successfully run small test crawl and indexed to Solr On 08/13/2014 07:31 AM, Lewis John Mcgibbney wrote: Hi user@ dev@, This thread is a VOTE for releasing Apache Nutch 1.9. The release candidate comprises the following components.

Re: Nutch won't fetch the whole page if the Transfer Encoding is chunked

2014-09-17 Thread Sebastian Nagel
Hi, afaics, Julien is right. It's possible to check it via: bin/nutch parsechecker -Dhttp.content.limit=-1 -dumpText \ 'http://search.dangdang.com/?key=%CA%FD%BE%DD%BF%E2' With -Dhttp.content.limit=65534 (also the default) the content is truncated. Best, Sebastian On 09/17/2014 11:32 AM,

Re: Making life easier with Oozie?

2014-09-24 Thread Sebastian Nagel
Hi Edoardo, To make things easy I've used the JavaMain action to execute the classes that the nutch scripts invokes, parametrized as necessary. Ok. That means that each step (inject, generate, fetch, etc.) runs in its own JVM. Right? One thing that I noticed is that I found configuring the

Re: Making life easier with Oozie?

2014-09-28 Thread Sebastian Nagel
of -D options defined externally (either by the bash script, oozie workflow, etc...) What do you think? Best, Edoardo On Wed, Sep 24, 2014 at 3:34 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Edoardo, To make things easy

Re: Can't crawl filesystem with protocol-file plugin - java.lang.NullPointerException

2014-10-27 Thread Sebastian Nagel
Hi, thanks for testing! 1. is /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/ the real path. I.e., are there no symbolic links in the path? The command readlink -f /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/ should show

Re: [Problem solved] Can't crawl filesystem with protocol-file plugin - java.lang.NullPointerException

2014-10-30 Thread Sebastian Nagel
the indexer-solr plugin in File: conf/nutch-site.xml which is not mentioned there. Please add it too, so future users could easily follow it step by step. Best, Mengying (Angela) Wang On Mon, Oct 27, 2014 at 4:29 PM, Sebastian Nagel wastl.na...@googlemail.com

Re: [jira] [Updated] (NUTCH-1644) Should have a parser that uses xpath

2014-11-03 Thread Sebastian Nagel
Hi Albin, you mean NUTCH-1870, right? I'm in the process of reviewing your patch. Just stuck in preparing the boilerplate required to integrate parse-xsl into build, tests, javadoc. I've added the jaxb dependencies to ivy, but the xjb task fails. Presumably, because there is a version mismatch.

Re: Patch reviews for 2.X

2014-11-04 Thread Sebastian Nagel
Hi Lewis, NUTCH-1825 (protocol-http may hang for certain web pages) - I'm running tests in production since one week (with 1.x) I'll check for any regressions in detail and will commit the next days. I'm also in the process of committing NUTCH-1483 Can't crawl filesystem with

Re: [nsf-polar-usc-students] Nutch in Windows: Failed to set permissions of path

2014-11-22 Thread Sebastian Nagel
Hi, that's an Hadoop 1.x problem on Windows 7: https://issues.apache.org/jira/browse/HADOOP-7682 http://mail-archives.apache.org/mod_mbox/nutch-user/201307.mbox/%3c51db1853.3040...@googlemail.com%3E Indeed, using Linux may be the simplest solution, simpler than to down/upgrade Hadoop.

Re: Build failed in Jenkins: Nutch-nutchgora #1264

2014-12-13 Thread Sebastian Nagel
Hi, this problem is reproducible after deleting % rm -rf ~/.ivy2/cache/org.restlet.jse The error stated [ivy:resolve] restlet: bad module name found in http://maven.restlet.org

Re: Build failed in Jenkins: Nutch-nutchgora #1264

2014-12-15 Thread Sebastian Nagel
/org.restlet.ext.jaxrs/2.2.3/org.restlet.ext.jaxrs-2.2.3.pom Thanks 2014-12-13 22:58 GMT+02:00 Sebastian Nagel wastl.na...@googlemail.com: Hi, this problem is reproducible after deleting % rm -rf ~/.ivy2/cache/org.restlet.jse The error stated [ivy:resolve] restlet: bad

Re: [VOTE] Release Apache Nutch 2.3

2015-01-20 Thread Sebastian Nagel
Hi Talat, - AdaptiveFetchSchedular do not work. In default settings float, it needs integer. Confirmed, in nutch-default.xml these two properties are defined as floats but read as integers. Configuration.getInt(name) then returns the default value.
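The pitfall confirmed here (a property written as a float but read as an integer, so that the default is used instead) can be reproduced with a minimal Python emulation. This is a sketch of the behaviour as described in the message, not Hadoop's Configuration code; the function get_int and the property name example.interval are hypothetical.

```python
def get_int(props, name, default):
    """Read an integer property; fall back to `default` if it is missing or not an integer.

    Emulates the behaviour described in the message: a value such as "0.5"
    cannot be parsed as an integer, so the default is returned silently.
    """
    value = props.get(name)
    if value is None:
        return default
    try:
        return int(value.strip())
    except ValueError:
        # A float literal like "0.5" ends up here.
        return default


# Hypothetical property defined as a float in the configuration file.
props = {"example.interval": "0.5"}
effective = get_int(props, "example.interval", 30)
```

In this emulation the configured value never takes effect, which is why the setting appears to "not work" without raising any error.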

Re: org.mortbay.proxy package not found in nutch 1.x, Ref Class - ProxyTestbed

2015-02-11 Thread Sebastian Nagel
Hi, the jetty-client-6.1.22.jar is a dependency needed only for testing. Consequently, it's placed in build/test/lib/ but only if you run the tests, resp. call % ant resolve-test There is also a target % ant eclipse which writes a complete Eclipse project configuration. Sometimes, if

[ANNOUNCE] New Nutch committer and PMC - Jorge Luis Betancourt Gonzalez

2015-02-19 Thread Sebastian Nagel
Dear all, on behalf of the Nutch PMC it is my pleasure to announce that Jorge Luis Betancourt Gonzalez has been voted in as committer and member of the Nutch PMC. Jorge, would you mind telling us about yourself, what you've done so far with Nutch, which areas you think you'd like to get involved,

Re: Option to disable Robots Rule checking

2015-01-28 Thread Sebastian Nagel
Hi Markus, hi Chris, hi Lewis, -1 from me A well-documented property is just an invitation to disable robots rules. A hidden property is also no alternative because it will be soon documented in our mailing lists or somewhere on the web. And shall we really remove or reformulate Our software

Re: [VOTE] Release Apache Nutch 2.3

2015-01-11 Thread Sebastian Nagel
+1 - successful small test crawl with HBase 0.94.26 - verified signatures On 01/09/2015 09:58 AM, Lewis John Mcgibbney wrote: Hi user@ dev@, This thread is a VOTE for releasing Apache Nutch 2.3. Quite incredibly we addressed 143 issues as per the release report

Re: Problem with redirection

2015-03-22 Thread Sebastian Nagel
Hi Mahmoud, which version of Nutch 2.x is used exactly? Are all URLs in the redirect chain really accepted by URL filters? Do URL normalizers change URLs (esp. ;jsessionid=...)? Thanks, Sebastian On 03/20/2015 10:56 PM, Mahmoud Gzawi wrote: Hi everyone, I have a problem with redirection

Re: TestGDALParser.testParseBasicInfo and TestGDALParser.testParseMetadata errors

2015-03-22 Thread Sebastian Nagel
Hi, maybe this thread is better at dev@tika since it's about building Tika. Btw., I can successfully build Tika trunk/1.8. Looks like something system-specific, similar to TIKA-1503: gdalinfo is installed, but fails to parse a certain file format. Thanks, Sebastian On 03/22/2015 08:26 AM,

Re: Filter rejecting url

2015-03-17 Thread Sebastian Nagel
Hi, the reason is clearly in the URL filters. The single injected URL does not pass the filter: InjectorJob: total number of urls rejected by filters: 1 InjectorJob: total number of urls injected after normalization and filtering: 0 Please, check which URL filters are activated via property

Re: HTTP Post Authentication

2015-03-12 Thread Sebastian Nagel
Hi Tizy, this should help: https://wiki.apache.org/nutch/HttpPostAuthentication http://svn.apache.org/repos/asf/nutch/trunk/conf/httpclient-auth.xml.template For more details you could also check https://issues.apache.org/jira/browse/NUTCH-827 https://issues.apache.org/jira/browse/NUTCH-1943

Re: [DISCUSS] Release Apache Nutch 1.10

2015-03-31 Thread Sebastian Nagel
Hi all, want to bring this on the agenda again. It's time - NUTCH-1925 (upgrade Tika) is done - we have 8 remaining issues [1] - they either are (relate to) new features (common crawl dumper, docker file, urlnormalizer-slash) - or minor issues or improvements which could possibly be

Re: questions about the webui packages

2015-02-24 Thread Sebastian Nagel
Hi, yes, there is a Nutch server providing a REST Api and a web app client to run Nutch (as result of our participation in GSoc 2014 by Fjodor Vershinin). There are some limitations: - only 2.x for now (please, follow NUTCH-1040 for a 1.x port) - not complete (e.g., cannot configure a crawl) For

[ANNOUNCE] New Nutch committer and PMC - Giuseppe Totaro

2015-04-24 Thread Sebastian Nagel
Dear all, it is my pleasure to announce that Giuseppe Totaro has joined us as committer and member of the Nutch PMC. Congratulations on your new role within the Apache Nutch community! Giuseppe, would you mind telling us about yourself, and what you are doing with Nutch, what you plan to do,

[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-21 Thread sebastian-nagel
Github user sebastian-nagel commented on a diff in the pull request: https://github.com/apache/nutch/pull/78#discussion_r42675133 --- Diff: src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java --- @@ -0,0 +1,168 @@ +/** + * Licensed

Re: [VOTE] Apache Nutch 1.11 Release Candidate #1

2015-10-26 Thread Sebastian Nagel
+0 What about the source package *-src.zip and the tar.gz packages (*-bin.tar.gz, *-src.tar.gz)? The PGP key B876884A is missing in http://www.apache.org/dist/nutch/KEYS It is contained in https://people.apache.org/keys/group/nutch.asc We should - either update the first (it's also not in

Re: [NUTCH] Invalid 'Modified time' in crawl db

2015-11-08 Thread Sebastian Nagel
Hi, that might look strange but it's not a bug. It could be improved, see below, simply because it's not obvious - I also stumbled over this point some time ago. It also pops up from time to time on the mailing lists, see references below. - when indexing the modified time (sent by the server)

Re: [DISCUSS] Release 1.11 RC #1 (70 issues fixed)

2015-10-19 Thread Sebastian Nagel
Hi Chris, hi Markus, +1 to release now / during the next days > Going to try for a Tika 1.11 release candidate 1 today too. Does this mean to wait until Tika has been released and to update parse-tika as well? > NUTCH-2064 is too important to miss another release > especially if you are using

[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel
Github user sebastian-nagel commented on a diff in the pull request: https://github.com/apache/nutch/pull/78#discussion_r42160948 --- Diff: src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java --- @@ -0,0 +1,168 @@ +/** + * Licensed

[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel
Github user sebastian-nagel commented on a diff in the pull request: https://github.com/apache/nutch/pull/78#discussion_r42160759 --- Diff: src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java --- @@ -0,0 +1,168 @@ +/** + * Licensed

[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel
Github user sebastian-nagel commented on a diff in the pull request: https://github.com/apache/nutch/pull/78#discussion_r42163943 --- Diff: conf/nutch-default.xml --- @@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using

[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel
Github user sebastian-nagel commented on a diff in the pull request: https://github.com/apache/nutch/pull/78#discussion_r42164537 --- Diff: conf/nutch-default.xml --- @@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using

[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel
Github user sebastian-nagel commented on a diff in the pull request: https://github.com/apache/nutch/pull/78#discussion_r42161108 --- Diff: src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java --- @@ -0,0 +1,168 @@ +/** + * Licensed

Re: Normalize before inject

2015-10-10 Thread Sebastian Nagel
Hi, the Injector already normalizes URLs, so no extra setup is required. In case you want to have special rules for injected URLs (e.g., strip "index.html"), it's possible to configure a separate rules file for this scope by: urlnormalizer.regex.file.inject regex-normalize-inject.xml Name of the
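
The property and file name quoted in this message appear to have lost their XML tags in the archive. A minimal sketch of how they might look in conf/nutch-site.xml, assuming the usual Hadoop-style property layout; the description is cut off in the snippet and is left out here rather than guessed:

```xml
<!-- Sketch only: name and value are taken from the message above.
     Placing this in conf/nutch-site.xml is an assumption. -->
<property>
  <name>urlnormalizer.regex.file.inject</name>
  <value>regex-normalize-inject.xml</value>
</property>
```

The rules file itself would then hold the inject-only rules. The rule below is a hypothetical example for the "index.html" case mentioned in the message, written in the format of the stock regex-normalize.xml; it is not taken from the original mail:

```xml
<?xml version="1.0"?>
<!-- Hypothetical conf/regex-normalize-inject.xml -->
<regex-normalize>
  <regex>
    <!-- Strip a trailing "index.html", keeping the directory slash. -->
    <pattern>/index\.html$</pattern>
    <substitution>/</substitution>
  </regex>
</regex-normalize>
```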

[ANNOUNCE] New Nutch committer and PMC - Asitang Mishra

2015-09-09 Thread Sebastian Nagel
Dear all, on behalf of the Nutch PMC it is my pleasure to announce that Asitang Mishra has joined the Nutch team as committer and PMC member. Asitang, please feel free to introduce yourself and to tell the Nutch community about your interests and your relation to Nutch. Congratulations and

Re: Introducing myself (Aron Ahmadia)

2015-09-14 Thread Sebastian Nagel
Welcome, Aron! Thanks for the introduction and the many links! On 09/14/2015 04:25 PM, Aron Ahmadia wrote: > Hi Folks, > > Since I'll be spending some time with the Nutch REST API and the 1.x code > base, I figured I'd send a > quick introduction email to the Nutch developer's mailing list. >

[ANNOUNCE] New Nutch committer and PMC - Sujen Shah

2015-09-15 Thread Sebastian Nagel
Dear all, on behalf of the Nutch PMC it is my pleasure to announce that Sujen Shah has been voted in as committer and member of the Nutch PMC. Sujen, would you mind introducing yourself to the Nutch community and telling us in just a few words about your interests and your plans regarding Nutch?

Re: Subscribe

2015-09-30 Thread Sebastian Nagel
Hi Manali, please send the subscription mail to dev-subscr...@nutch.apache.org Thanks, Sebastian On 09/29/2015 10:34 PM, Manali Shah wrote: > Hello, > > I would like to subscribe to the mailing list. > > Best, > Manali

Re: Redirection in nutch

2015-10-04 Thread Sebastian Nagel
Hi, yes, this is a bug which was fixed in the commit you mentioned but has reappeared. Sorry, see https://issues.apache.org/jira/browse/NUTCH-2124, you'll also find a patch there. The fix will be included in 1.11 for sure. Thanks, Sebastian On 10/03/2015 09:22 AM, Taichi Ho wrote: > Hi,

Re: Permission to edit Nutch Whitelist Robots.

2015-09-27 Thread Sebastian Nagel
Hi Ayesha, you should now be able to edit the content of the Nutch wiki. Cheers and happy editing! Sebastian On 09/27/2015 08:33 PM, Ayesha Sabah Hasan wrote: > Hi, > > My username is: ayeshahasan and i'd like to get permission to edit the Nutch > Wiki. > > Thanks, > Ayesha
