Re: IOException Hadoop

2012-11-14 Thread Lewis John Mcgibbney
Hi Prashant, Please take a look at either the Nutch or the Hadoop user@ lists. I've seen and reported on this previously so it should not be too hard to find. hth Lewis On Wed, Nov 14, 2012 at 6:07 PM, Prashant Ladha prashant.la...@gmail.com wrote: Hi, I am trying to set up Nutch via Eclipse.

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Additionally, please see this issue below and if you are able please provide feedback based on the patch. https://issues.apache.org/jira/browse/NUTCH-1486 hth Lewis On Tue, Nov 13, 2012 at 8:57 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote: I'm not a regular Solr user, but here are some

Re: Simulating 2.x's page.putToInlinks() in trunk

2012-11-13 Thread Lewis John Mcgibbney
Nice one, gentlemen, thank you very much. Best Lewis On Tue, Nov 13, 2012 at 11:39 AM, Markus Jelsma markus.jel...@openindex.io wrote: In trunk you can use the Inlink and Inlinks classes. The first for each inlink and the latter to add the Inlink objects to. Inlinks inlinks = new Inlinks()

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
your experience on the issue would be excellent. Best Lewis On Tue, Nov 13, 2012 at 1:13 PM, Erol Akarsu eaka...@gmail.com wrote: Lewis, Have you checked it in to SVN? Where will I get this patch? Erol Akarsu On Tue, Nov 13, 2012 at 6:57 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Hi, On Tue, Nov 13, 2012 at 2:36 PM, Erol Akarsu eaka...@gmail.com wrote: Lewis, I applied the patch you told me about. I replaced the schema.xml of my Solr 4 installation with schema-solr4.xml. The Solr 4.0 system is up and running and I can see its web page at http://localhost:8080/sol40. You would need to

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Hi, On Tue, Nov 13, 2012 at 3:45 PM, Erol Akarsu eaka...@gmail.com wrote: Where is this script? bin folder has only nutch script. https://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl I am using nutch 2.1 not trunk. Does it make any difference on behavior of nutch script? I

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Hi, On Tue, Nov 13, 2012 at 4:22 PM, Erol Akarsu eaka...@gmail.com wrote: Nov 13, 2012 11:11:48 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id=[org.apache.nutch:http/, ] The

Re: rss feed plugin seems broken (1.5.1)

2012-11-09 Thread Lewis John Mcgibbney
Hi, Can you please open an issue for this. I can confirm that without adding some additional dependencies I get the following when attempting to parse an rss feed [0] which I have saved locally. lewis@lewis-desktop:~/ASF/trunk/runtime/local$ ./bin/nutch plugin feed

Re: Slides of Nutch talk at ApacheCon EU 2012

2012-11-09 Thread Lewis John Mcgibbney
Hi Julien, Link to it from the wiki maybe? Safe journey home. Lewis On Fri, Nov 9, 2012 at 9:44 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi guys, For those of you who could not make it to the ApacheCon in Sinsheim, here are the slides of my talk on Nutch

Call to Arms Use Gora 0.3-SNAPSHOT dependencies in 2.x HEAD

2012-11-06 Thread Lewis John Mcgibbney
Hi All, We recently committed a rather major patch over in GORA which now provides a WebServices API, enabling Gora to persist data into (currently supported) Amazon's DynamoDB [0]. Other WebServices such as Google App Engine and Microsoft Azure, etc have also been discussed but these will be

Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails

2012-11-05 Thread Lewis John Mcgibbney
Hi Kiran, Thanks for the persistence on this one, it is greatly appreciated. Please feel free to open an issue... may be best to even open over on the GORA jira? Best Lewis On Mon, Nov 5, 2012 at 2:50 PM, kiran chitturi chitturikira...@gmail.com wrote: I have just tested with Hbase as

Re: URL filtering: crawling time vs. indexing time

2012-11-04 Thread Lewis John Mcgibbney
http://hadoop.apache.org/docs/r1.0.3/commands_manual.html#Generic+Options hth On Sun, Nov 4, 2012 at 9:15 AM, Markus Jelsma markus.jel...@openindex.io wrote: Just try it. With -D you can override Nutch and Hadoop configuration properties. -Original message- From:Joe Zhang

Re: nutch 2.x and hbas

2012-11-02 Thread Lewis John Mcgibbney
Hi Kiran, Have you updated the Nutch ivy.xml with the gora-hbase artifact and then compiled the code? The tutorial most certainly works. Lewis On Thu, Nov 1, 2012 at 9:59 PM, kiran chitturi chitturikira...@gmail.com wrote: Hi, I am trying to configure Nutch GORA with Hbase as shown in the tutorial
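
For readers following this thread, the ivy.xml change referred to above usually amounts to a dependency entry along the following lines, followed by a rebuild with ant runtime. This is only a sketch and is not quoted from the thread: the rev value is an assumption and should match the Gora release your Nutch version builds against.

```xml
<!-- ivy.xml in the Nutch 2.x source root: sketch of the gora-hbase dependency. -->
<!-- ASSUMPTION: rev="0.2.1" is illustrative; use the Gora version your Nutch release expects. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default" />
```

The Nutch2Tutorial also covers the matching datastore settings, so it is worth checking those alongside this entry before recompiling.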

Re: Getting a NullPointerException in Nutch 2.1

2012-11-02 Thread Lewis John Mcgibbney
Hi, On Fri, Nov 2, 2012 at 2:43 PM, kiran chitturi chitturikira...@gmail.com wrote: I am not sure what versions of HBase are compatible with Nutch I would advise you to read the Nutch2Tutorial again. Install and configure HBase. You can get it here (N.B. Gora 0.2 uses HBase 0.90.4, however

Re: Getting a NullPointerException in Nutch 2.1

2012-11-02 Thread Lewis John Mcgibbney
Hi, On Fri, Nov 2, 2012 at 5:36 PM, cocofan coco...@mailbolt.com wrote: 2012-11-01 14:46:52,027 ERROR security.UserGroupInformation - PriviledgedActionException as:cocofan I've never seen this Exception before...honestly. cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException:

Re: Working with Twitter

2012-11-01 Thread Lewis John Mcgibbney
Hi Dan, Actually no. It was a project I never ended up getting my teeth into (sigh). I am going to try this later on today though, so keep this thread alive and we will see where it goes. Lewis On Thu, Nov 1, 2012 at 10:00 AM, dan danv...@gmail.com wrote: Hi lewis, any success with this one

Re: error at parse command in Nutch 2.x : java.sql.BatchUpdateException: data exception: string data, right truncation

2012-11-01 Thread Lewis John Mcgibbney
Hi Kiran, Did you ever get anywhere with this one? Lewis On Tue, Oct 16, 2012 at 10:30 PM, kiran chitturi chitturikira...@gmail.com wrote: Hi, I am using Nutch 2.x series with updated tika dependencies with hsql database. I have run the commands 'inject,generate,fetch' and after that when

Re: [crawler-common] infoQ article Apache Nutch 2 Features and Product Roadmap

2012-11-01 Thread Lewis John Mcgibbney
Nice one Julien. It's nothing short of a privilege to be part of the various communities and working alongside you guys. Have a great night. Lewis On Thu, Nov 1, 2012 at 11:39 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi all, Apologies for cross posting. Srini Penchikala has

Re: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Lewis John Mcgibbney
I really think this should be in the FAQs? http://wiki.apache.org/nutch/FAQ On Fri, Oct 26, 2012 at 2:10 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, You cannot recover the mapper output as far as I know. But anyway, one should never have a fetcher running for three days. It's

Re: nutch on AWS EMR.

2012-10-26 Thread Lewis John Mcgibbney
Hi, On Thu, Oct 25, 2012 at 3:03 PM, manubharghav manubharg...@gmail.com wrote: Will providing a core-site.xml overwriting some of the permission in core-default.xml in hadoop jar help ?? It's certainly something I would try. Also have you tried using the Nutch script at all? If you can get

Re: Nutch2.1 problems

2012-10-26 Thread Lewis John Mcgibbney
Hi, On Tue, Oct 23, 2012 at 2:42 PM, Mouradk mourad...@gmail.com wrote: This sits in a urls/seed.txt in NUTCH_HOME (not runtime folder but the home folder generated after unzipping). Please put the urls directory (with the seed file for bootstrapping) into /runtime/local and run the command

Re: Nutch2.1 problems

2012-10-23 Thread Lewis John Mcgibbney
Hi, On Tue, Oct 23, 2012 at 11:53 AM, Mouradk mourad...@gmail.com wrote: I uploaded Nutch 2.1 and tried to get it started but no luck so far. I am running it on local with Hbase 0.90.6. HBase compatibility should be fine. In all honesty we *should* probably upgrade to one of the newer

Re: Crawling Time

2012-10-23 Thread Lewis John Mcgibbney
Hi, Stefan, To date this is not implemented. I would suggest that this is the case due to the requirement to design custom crawls. It would be relatively trivial to get it dumped from within your crawl script. Lewis On Tue, Oct 23, 2012 at 2:04 PM, Stefan Scheffler sscheff...@avantgarde-labs.de

Re: Nutch 2.x, MySQL and readhostdb command.

2012-10-22 Thread Lewis John Mcgibbney
Hi James, On Mon, Oct 22, 2012 at 1:28 AM, j.sulli...@thomsonreuters.com wrote: I've figured this out...somewhat. The issue causing the error was that I was running MySQL with UTF-8 as default and needed to increase the size of the primarykey column in gora-sql-mapping.xml to 768 (which I

Re: nutch/hadoop/solr

2012-10-20 Thread Lewis John Mcgibbney
Hi, On Fri, Oct 19, 2012 at 6:23 PM, sumarlidason sumarlida...@gmail.com wrote: So, I made some changes to gora.properties, and now im getting null pointer exception.. Do you wish to detail when and how you are getting this Exception? It almost seems to me at this point nutch needs a DB for

Re: Nutch 2.x, MySQL and readhostdb command.

2012-10-20 Thread Lewis John Mcgibbney
Hi James, Have you attempted to make any changes to the host table config in gora-sql-mapping? Lewis On Fri, Oct 19, 2012 at 10:26 AM, j.sulli...@thomsonreuters.com wrote: Could somebody confirm if the bin/nutch readhostdb command works with MySQL. I am trying to figure out if it is broke

Re: building from src

2012-10-20 Thread Lewis John Mcgibbney
Hi, There are a number of major issues with your attempts to get Nutch working. Please check out our wiki for tutorials on Nutch. Only Nutch distributions obtained from the official Apache resources are supported e.g. mirrors... and development versions available from our SVN area. All of these

Re: nutch-2.0-fetcher fails in reduce stage

2012-10-16 Thread Lewis John Mcgibbney
Hi Alex, I've seen similar exceptions numerous times [0] when running the Gora test suite against HBase however this _always_ occurred against an HBase version other than the officially supported version of HBase (which is 0.90.4) when behind a local proxy so I am immediately tempted to speculate

Re: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content

2012-10-16 Thread Lewis John Mcgibbney
Hi Kiran, If you apply the patch to your 2.x branch, then make sure that 'ant runtime' is executed. Please also make sure that the tika 1.1 dependency _does_not_ exist in your runtime /lib directory as this may conflict with expected results. If you could update the ticket it would be excellent.

Re: response time

2012-10-16 Thread Lewis John Mcgibbney
Hi, I would also direct you at an issue [0] and set of patches for trunk and 2.x which use the inet socket to obtain the host IP address if this is required. It would not be very difficult to get this patch to also/either obtain the response time I would not think... Also if anyone feels like

Re: Issue with crawling FTP with Nutch 1.4

2012-10-16 Thread Lewis John Mcgibbney
Hi Guys, I'm sure that these issues should be logged in our Jira as they not only sound serious but also ship with reasonable sounding possible solutions. If any of you feel like opening a ticket(s), it would be great... patches are always welcome. Lewis On Sat, Oct 13, 2012 at 12:14 AM, Tejas

Re: Search in specific website

2012-10-16 Thread Lewis John Mcgibbney
Hi Tolga, Please take this to the Solr user@ list. Thank you Lewis On Tue, Oct 16, 2012 at 12:13 PM, Tolga to...@ozses.net wrote: Hi, I've tried url:fass\.sabanciuniv\.edu AND content:this, and I got results from both my URLs. What to do? Regards, On 10/13/2012 12:48 AM, Alejandro

Re: Search in specific website

2012-10-16 Thread Lewis John Mcgibbney
comment to head over to Solr lists... hth Lewis On Tue, Oct 16, 2012 at 2:01 PM, Tolga to...@ozses.net wrote: Solr sent me to Nutch list, but okay. Thanks, On 10/16/2012 02:27 PM, Lewis John Mcgibbney wrote: Hi Tolga, Please take this to the Solr user@ list. Thank you Lewis On Tue

Re: How to crawl a large index

2012-10-11 Thread Lewis John Mcgibbney
Hi, After every crawl iteration check out your webdb with the readdb tool. There is plenty linked to from the wiki on this topic. Check your urlfilters as well, as they are an important area. hth Lewis On Fri, Oct 5, 2012 at 6:08 PM, Hailong Yang hailong.yang1...@gmail.com wrote: Dear all, I am trying

Re: SqlStore in Nutch 2.1

2012-10-10 Thread Lewis John Mcgibbney
To confirm, the gora-sql-0.1.1-incubating artifact available on Maven Central IS interoperable with the Nutch 2.1 release. It has not yet been developed and brought up to date so has been disabled in more recent Gora releases. Thank you Lewis On Mon, Oct 8, 2012 at 7:32 PM, Paul Dhaliwal

Re: Nutch 2.1 More Plugin -- A better fall back value for date field

2012-10-05 Thread Lewis John Mcgibbney
Hi James, I think this is a fair suggestion. I would please ask you to open an issue and submit your patch which would be very welcome indeed. As you mention, it would be interesting to check the metadata for the value however your initial suggestion is also valid imho. Thanks Lewis On Fri,

[ANNOUNCE] Apache Nutch 2.1 Released

2012-10-05 Thread lewis john mcgibbney
Good Afternoon Everyone, The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.1. This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive which is growing in popularity amongst the community. As well as

Re: Nutch 2.1 fields

2012-10-04 Thread Lewis John Mcgibbney
Hi James, On Thu, Oct 4, 2012 at 2:59 AM, j.sulli...@thomsonreuters.com wrote: Lewis and Chris, Agree that The Index Structure page is very useful documentation. I went through the fields/plugins listed in your link using Nutch 2.1 rc and most work. I was able to get positive results for

Re: Nutch 2.1 fields

2012-10-04 Thread Lewis John Mcgibbney
[0] https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java On Thu, Oct 4, 2012 at 7:36 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi James, On Thu, Oct 4, 2012 at 2:59 AM, j.sulli

[RESULT] Was Re: [PING] [VOTE] Apache Nutch 2.1 Release Candidate Available

2012-10-04 Thread Lewis John Mcgibbney
Good Afternoon Everyone, We are glad to announce that the result for the Apache Nutch release-2.1 RC#1 was successful and has passed with the following VOTE's 4x +1 Release this package as Apache Nutch 2.1 Sebastian Nagel* Chris Mattmann* Lewis McGibbney* James Sullivan 0x -1 Do not release

Re: doubt about nutch 1.5.1

2012-10-04 Thread Lewis John Mcgibbney
Hi Eyeris, On Thu, Oct 4, 2012 at 6:24 PM, Eyeris Rodriguez Rueda eru...@uci.cu wrote: Hi all. I want to use Nutch 1.5.1 version, I have downloaded nutch 1.5.1(bin) and src also, I think you've just uncovered a problem with the .zip archives e.g. that the nutch script is not present in the /bin

Re: Nutch 2.1 Advice, thoughts and comments on crawl performance, indexing and deployment?

2012-10-03 Thread Lewis John Mcgibbney
Hi Matt, I know there is a pile of stuff to add to this but for the time being (until I dive into your response in detail) please see below On Tue, Oct 2, 2012 at 11:17 PM, Matt MacDonald m...@nearbyfyi.com wrote: Hi, ... 5) What value should I set for gora.buffer.read.limit? Currently it's

Re: gora.properties not found when running in Hadoop

2012-10-02 Thread Lewis John Mcgibbney
@ Ian, Apologies, this one slipped through the net. On Wed, Sep 26, 2012 at 8:26 PM, Ian Truslove ian.trusl...@nsidc.org wrote: ubuntu:~/apache-nutch-svn-2.1$ ~/hadoop-1.0.3/bin/hadoop jar build/apache-nutch-2.1.job org.apache.nutch.crawl.Crawler urls -dir urls -depth 3 -topN 5 The

Re: Run nutch 1.3 in eclipse

2012-10-02 Thread Lewis John Mcgibbney
Hi CarinaBambina, There was a bug with 1.5 so we released 1.5.1, can you please try this instead and get back to us with your results. Thank you Lewis On Tue, Oct 2, 2012 at 4:25 PM, CarinaBambina carina.rei...@yahoo.de wrote: I'm having the same problem with Nutch 1.5. I also checked all

Re: Error parsing html

2012-10-02 Thread Lewis John Mcgibbney
Hi, For starters can you please use 1.5.1. On Tue, Oct 2, 2012 at 4:32 PM, CarinaBambina carina.rei...@yahoo.de wrote: Hi, i'm curious if you have come up with any solution yet? As i'm having the exact same problem! When i start the crawl the entered Url is parsed perfectly, but for all

Re: NullPointerException

2012-10-02 Thread Lewis John Mcgibbney
Hi Chris, One of the main problems here is that we very rarely know which version of Nutch you are using, what nature of configuration and in what kind of deployment.. the truth is that this makes it difficult for us to help you out. This is also applicable to any Hadoop, Solr, HBase, Cassandra,

Re: Nutch 2.1 fields

2012-10-02 Thread Lewis John Mcgibbney
Hi Chris, Please see here [0] for the most up-to-date account of the fields for building your Solr index. I tried to bring this bang up to date a while back and more recently when writing some trivial plugin tests however please shout about anything which is not correct and we can edit

[PING] [VOTE] Apache Nutch 2.1 Release Candidate Available

2012-10-01 Thread Lewis John Mcgibbney
Hi All, Anyone else for this VOTE? Sorry to be a pest! Thanks Lewis On Fri, Sep 21, 2012 at 4:07 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Everyone, A candidate for Apache Nutch 2.1 is available at: http://people.apache.org/~lewismc/apache-nutch-2.1 The release

Re: Building Nutch 2.0

2012-10-01 Thread Lewis John Mcgibbney
Hi Chris, On Mon, Oct 1, 2012 at 3:27 PM, Christopher Gross cogr...@gmail.com wrote: unzipped untarred it, I don't think you need to do both! BUILD FAILED /tmp/nutch-2.0/build.xml:72: Specify at least one source--a file or resource collection. Mmmm... can you even try moving it out of

Re: Building Nutch 2.0

2012-10-01 Thread Lewis John Mcgibbney
Hi Chris, On Mon, Oct 1, 2012 at 4:17 PM, Christopher Gross cogr...@gmail.com wrote: I moved it to a different directory, same errors. Mmm. I'm stumped here. What OS are you on? I'll try 2.1 and see if that works any better. Please do and get back to us with your results as we are currently

Re: Building Nutch 2.0

2012-10-01 Thread Lewis John Mcgibbney
with a 2.x version of Nutch anytime soon, I just wanted to make sure that when I'm ready to deploy it'll be a full release, so if you're really pushing for 2.1 to be out soon, then that's what I'll work with. -- Chris On Mon, Oct 1, 2012 at 11:31 AM, Lewis John Mcgibbney lewis.mcgibb

Re: Building Nutch 2.0

2012-10-01 Thread Lewis John Mcgibbney
Hi Chris, On Mon, Oct 1, 2012 at 7:09 PM, Christopher Gross cogr...@gmail.com wrote: We have ports blocked on our box, so that may be causing issues with Ivy (which is why I prefer just standard ant and having all the required jars sitting in a lib directory). Well the pro of having Nutch

Re: Building Nutch 2.0

2012-10-01 Thread Lewis John Mcgibbney
Hi Chris, On Mon, Oct 1, 2012 at 8:52 PM, Christopher Gross cogr...@gmail.com wrote: OK, I added the port being used by hbase to iptables, and now I'm farther. I'm getting: 12/10/01 19:44:17 ERROR fetcher.FetcherJob: Fetcher: No agents listed in 'http.agent.name' property. But I do have an
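
The fetcher reports this error when it finds no value for http.agent.name in the configuration it actually loads. A minimal sketch of the setting follows, assuming it goes into the conf/nutch-site.xml of the runtime being executed (for example runtime/local/conf); the agent name shown is a placeholder, not a value from this thread.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: overrides for values defined in nutch-default.xml. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- ASSUMPTION: placeholder value; choose a name that identifies your crawler. -->
    <value>MyTestCrawler</value>
  </property>
</configuration>
```

If the property is set in the source tree's conf directory but the error persists, it may be worth checking that the runtime copy of the file was regenerated, for instance by re-running ant runtime.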

Re: patches to parse-metatag plugin to save mutliValues

2012-10-01 Thread Lewis John Mcgibbney
Hi Kiran On Mon, Oct 1, 2012 at 7:46 PM, kiran chitturi chitturikira...@gmail.com wrote: I have made an improvement in patches for the parse-metatags plugin and posted the patches here. https://issues.apache.org/jira/browse/NUTCH-1467 Great work! Can this plugin be included in nutch-2.0 ?

Re: Fix for binary operator expected error

2012-09-28 Thread Lewis John Mcgibbney
Hi Bai, If you could use the script @NUTCH-1087 [0] and provide insight into your findings it would be very much appreciated. It is the intention to integrate this into 2.x once it has been tested enough. The glitch you highlight is exactly the type of stuff we need to find. Thanks Lewis [0]

Re: Fix for binary operator expected error

2012-09-28 Thread Lewis John Mcgibbney
. Changing the brackets made the error go away, but I still wasn't able to get nutch to run until I removed the extraneous job files. On Fri, Sep 28, 2012 at 10:03 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Bai, If you could use the script @NUTCH-1087 [0] and provide insight

Re: java.lang.Runtime Exception: Database is not supported yet (Nutch 2.0)

2012-09-27 Thread Lewis John Mcgibbney
Hi Kiran, On Thu, Sep 27, 2012 at 3:24 PM, kiran chitturi chitturikira...@gmail.com wrote: Is it because nutch 2.0 does not support postgresql ? I have setup postgresql the same way as mysql. Yes, AFAIK HSQLDB and MySQL are the only SQL implementations currently supported. The idea

Re: Is SFTP supported / working?

2012-09-27 Thread Lewis John Mcgibbney
Hi, AFAIK this plugin has not been used extensively with Nutch 2.x however here are some of my early observations which should get it working. 1. The plugin's plugin.xml and java source quotes code from the jsch package [0] so you will need to grab that and make it available... please see below

[VOTE] Apache Nutch 2.1 Release Candidate Available

2012-09-21 Thread Lewis John Mcgibbney
Hi Everyone, A candidate for Apache Nutch 2.1 is available at: http://people.apache.org/~lewismc/apache-nutch-2.1 The release candidate is a src.zip and src.tar.gz ONLY archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-2.1/ We release Nutch 2.1 in this fashion due

Re: HTTP Authentication (basic) in Nutch 1.5

2012-09-20 Thread Lewis John Mcgibbney
Hi Max, On Thu, Sep 20, 2012 at 1:44 PM, Max Dzyuba max.dzy...@comintelli.com wrote: Sorry for many emails. Lewis, thanks again for a hint about parsechecker tool. No hassle, I am glad you got it sorted and yes the parsechecker is a great tool + saves you a bunch of time. Best Lewis

Re: Nutch2 + Cassandra

2012-09-20 Thread Lewis John Mcgibbney
Hi Again, On Wed, Sep 19, 2012 at 8:39 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: On Wed, Sep 19, 2012 at 1:54 PM, Žygimantas Medelis zzy...@gmail.com wrote: It's the problem with gora v0.2.1 which does not work with current nutch 2. I've just run a medium sized focused crawl

Re: Nutch2 + Cassandra

2012-09-20 Thread Lewis John Mcgibbney
Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Again, On Wed, Sep 19, 2012 at 8:39 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: On Wed, Sep 19, 2012 at 1:54 PM, Žygimantas Medelis zzy...@gmail.com wrote: It's the problem with gora v0.2.1 which does not work with current nutch 2. I've

Re: HTTP Authentication (basic) in Nutch 1.5

2012-09-19 Thread Lewis John Mcgibbney
Hi, On Wed, Sep 19, 2012 at 3:37 PM, Max Dzyuba max.dzy...@comintelli.com wrote: 2012-09-19 16:26:16,106 INFO httpclient.HttpMethodDirector - No credentials available for BASIC 'realm'@host.org:80 I don't understand why Nutch complains about No credentials available for BASIC

Re: Nutch2 + Cassandra

2012-09-19 Thread Lewis John Mcgibbney
Hi, On Wed, Sep 19, 2012 at 1:54 PM, Žygimantas Medelis zzy...@gmail.com wrote: It's the problem with gora v0.2.1 which does not work with current nutch 2. Can you elaborate on what you think is wrong here? To give you some insight here. Between Gora 0.2 and 0.2.1 a substantial effort was put

Re: HTTP Authentication (basic) in Nutch 1.5

2012-09-19 Thread Lewis John Mcgibbney
Best tool to use is the parsechecker, it is a quick neat way to see whether your protocol/fetch/authentication is working then whether your parser is extracting the text and metadata you require. On Wed, Sep 19, 2012 at 8:30 PM, Max Dzyuba max.dzy...@comintelli.com wrote: Hi Lewis, I used that

Re: Nutch2 + Cassandra

2012-09-18 Thread Lewis John Mcgibbney
Hi, On Tue, Sep 18, 2012 at 2:34 PM, Žygimantas Medelis zzy...@gmail.com wrote: Commands I am issuing Can you read your db and see if there are any pages pending a fetch? Also I was getting NullPointerException on inject before changing conf/gora-cassandra-mapping.xml from: class

Re: Nutch 2 solrindex fails with no error

2012-09-15 Thread Lewis John Mcgibbney
Solr logs? On Fri, Sep 14, 2012 at 9:33 PM, Bai Shen baishen.li...@gmail.com wrote: I have a nutch 2 setup that I got working with solr about a month ago. I had to shelve it for a little while and I've recently come back to it. Everything seems to be working fine except for the solr

Re: how to index the size of document ?

2012-09-15 Thread Lewis John Mcgibbney
Hi, Try index-more http://wiki.apache.org/nutch/FAQ#How_can_I_find_out.2BAC8-display_the_size_and_mime_type_of_the_hits_that_a_search_returns.3F hth Lewis On Fri, Sep 14, 2012 at 9:22 PM, Eyeris Rodriguez Rueda eru...@uci.cu wrote: Hi, all. I am using nutch and solr since 1 year and i need

Re: Nutch/Solr - Pdf getting indexed but content is not showing in solr

2012-09-14 Thread Lewis John Mcgibbney
Hi, On Fri, Sep 14, 2012 at 12:59 AM, dpverma patia...@gmail.com wrote: I am using tomcat6, nutch1.1 and solr1.4 For starters this is probably your main mistake! I would seriously urge you to upgrade your Nutch distribution. I've just used the parsechecker with -dumpText and your URL and I get

Re: Nutch talk accepted at ApacheCon Europe

2012-09-13 Thread Lewis John Mcgibbney
Hi, Nice one Julien, yeah I hope to see you and others in Sinsheim in November and am looking forward to attending your talk... and the lots of lager afterwards. Best Lewis On Thu, Sep 13, 2012 at 11:39 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi, I'd just like to mention that I

Re: Hadoop and Nutch

2012-09-12 Thread Lewis John Mcgibbney
Hi, On Wed, Sep 12, 2012 at 1:41 PM, Stefan Scheffler sscheff...@avantgarde-labs.de wrote: I try to run nutch 2.0 on a hadoop cluster and get the following exception. HADOOP_CLASSPATH=lib/apache-nutch-1.6-SNAPSHOT.jar hadoop org.apache.nutch.crawl.Crawl urls -dir test -depth 2 -topN 5 The

Re: nutch crawling file system SOLVED

2012-09-11 Thread Lewis John Mcgibbney
Hi, Take a look at ftp.content.limit property in nutch-default.xml and set it accordingly in nutch-site.xml Thanks Lewis On Tue, Sep 11, 2012 at 12:20 AM, dpverma patia...@gmail.com wrote: Can you pls let me know how you solved your problem? I am also getting the same error which you had.
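
The override suggested above could look like the sketch below. The property name comes from the message; the value is an assumption chosen for illustration, so check the property's description in nutch-default.xml for the exact semantics and default before relying on it.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: site-specific override of a value from nutch-default.xml. -->
<configuration>
  <property>
    <name>ftp.content.limit</name>
    <!-- ASSUMPTION: illustrative limit in bytes (about 10 MB); larger content is truncated. -->
    <value>10485760</value>
  </property>
</configuration>
```

Since this thread concerns crawling a file system, the analogous file.content.limit property in nutch-default.xml may be the one that applies to file: URLs; that is worth verifying against your own configuration.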

Re: breakpoints in eclipse and nutch 1.5

2012-09-11 Thread Lewis John Mcgibbney
Hi Kiran, I think having line numbers is a bad option for an exploration of the codebase. It really does us no favours as the codebase changes through time. Currently (even without looking at them) I can most certainly tell you that they will not be 100% accurate. If you are able to provide more

Re: Nutch 2.x trunk, focused domain crawl that contains links with HTTP redirects pointing to external domains

2012-09-09 Thread Lewis John Mcgibbney
Hi Matt, I don't know if you got my message on the github mirror issue. If you could get the patch uploaded to a new Nutch Jira ticket (unless one is already open) then I will be very happy to test as some free time now means I am able to test a few patches. Thanks Lewis On Sat, Sep 8, 2012

Re: Query SolrIndex for Id

2012-09-08 Thread Lewis John Mcgibbney
You can of course change the source and destination field mappings so that you don't need to query URL ids. This is a workaround though and doesn't fully address the issue of querying by URL id. Lewis On Sat, Sep 8, 2012 at 2:22 PM, Alaak al...@gmx.de wrote: Hi, I have a problem with the way

Re: Malformed URL: '', skipping (java.net.MalformedURLException

2012-09-06 Thread Lewis John Mcgibbney
Hi, On Thu, Sep 6, 2012 at 5:50 AM, gaurav.gupta gaurav.gu...@edynamic.info wrote: C:\nutch\local\conf\crawl-urlfilter.txt as specified in my above post. This no longer exists... that might be a problem -- Lewis

Re: Crawl errors

2012-09-05 Thread Lewis John Mcgibbney
'title'='Sabancı Üniversitesi' Is it because of 'Sabancı Üniversitesi'? SOLR/example/solr/conf/schema.xml specifies UTF-8 Regards, On 09/04/2012 05:04 PM, Lewis John Mcgibbney wrote: I don't think you have your HSQLDB server running, this is an essential requirement to store the crawldb

Re: Running Junit test

2012-09-05 Thread Lewis John Mcgibbney
Hi Vijith, On Wed, Sep 5, 2012 at 5:55 AM, Vijith vijithkv...@gmail.com wrote: Are you able to submit a patch for this? you mean a patch for the build.xml file... surely I can. Excellent :0) I also noticed that there is currently no way to copy the compiled plugin test cases through to

Re: Malformed URL: '', skipping (java.net.MalformedURLException

2012-09-05 Thread Lewis John Mcgibbney
I think you've incorrectly passed your regex- as your seed URL list when you've injected. As a side note it is always VERY helpful to provide basic info such as the Nutch version, the steps you took to reproduce the error, etc... basic stuff. hth Lewis On Wed, Sep 5, 2012 at 10:16 AM,

Re: Crawl errors

2012-09-04 Thread Lewis John Mcgibbney
I don't think you have your HSQLDB server running, this is an essential requirement to store the crawldb, WebPage and Host data etc. You can follow the various tutorials here to get you going http://wiki.apache.org/nutch/#Nutch_2.X_tutorial.28s.29 hth Lewis On Tue, Sep 4, 2012 at 2:27 PM, Tolga

Re: Running Junit test

2012-09-04 Thread Lewis John Mcgibbney
If you look at lines 395-399 in build.xml [0] you need to add <copy file="${test.src.dir}/crawl-tests.xml" todir="${test.build.classes}"/> <copy file="${test.src.dir}/domain-urlfilter.txt" todir="${test.build.classes}"/> <copy

Re: Running Junit test

2012-09-03 Thread Lewis John Mcgibbney
Before building runtime/local you need to ensure the tests are executed and all of the resources are present in the build directory. So please run ant test, then ant runtime; all of the test resources should then be moved to the runtime/local directory. The runtime target does NOT rely on the test

Re: recrawl a URL?

2012-08-30 Thread Lewis John Mcgibbney
Hi Max, On Tue, Aug 28, 2012 at 3:24 PM, Max Dzyuba max.dzy...@comintelli.com wrote: Is it possible to use the same crawldb but store segment data in a different directory for consecutive crawls using the bin/nutch crawl command? I thought that there is no option to specify the path to crawldb

Re: local file system crawl, unable to fetch file name containing CJK letter.

2012-08-30 Thread Lewis John Mcgibbney
Hi Ye, If you could contribute this to the community as a patch it would be greatly appreciated. If you need any help with this then please ping us on dev@nutch and we will be more than happy to help you out. Thank you in advance Lewis On Thu, Aug 30, 2012 at 2:14 PM, Ye T Thet

Re: bin/nutch

2012-08-29 Thread Lewis John Mcgibbney
Sorry, speech marks. Just run ant runtime. It most certainly works; if it does not then there is something wrong with your local copy. On Wed, Aug 29, 2012 at 7:18 AM, Tolga to...@ozses.net wrote: What brackets? I don't see brackets. On 08/28/2012 03:39 PM, Lewis John Mcgibbney wrote: I

Re: local file system crawl, unable to fetch file name containing CJK letter.

2012-08-29 Thread Lewis John Mcgibbney
Please have a look at the discussion below http://www.mail-archive.com/user@nutch.apache.org/msg04176.html It should help you out.. or point you in the correct direction at least. hth Lewis On Wed, Aug 29, 2012 at 1:13 PM, ytthet yethura.t...@gmail.com wrote: Hi Folks, I am indexing local

Re: Nutch - SMB protocol

2012-08-29 Thread Lewis John Mcgibbney
What version of Nutch is this? Lewis On Wed, Aug 29, 2012 at 9:58 AM, xpow swirja...@gmail.com wrote: Hello, I've tried to use the protocol-smb plugin with nutch. The nutch read and parsed the documents correctly, but afterward, when it hit the crawldb, crawl.CrawlDbReducer, i got a lot of

Re: Nutch - SMB protocol

2012-08-29 Thread Lewis John Mcgibbney
In the SVN area can you point me to the protocol plugin please? http://svn.apache.org/repos/asf/nutch/ Thank you Lewis On Wed, Aug 29, 2012 at 3:22 PM, Matteo Simoncini sicc...@gmail.com wrote: Sorry, I forgot it. 1.5 Matteo 2012/8/29 Lewis John Mcgibbney lewis.mcgibb...@gmail.com

Re: Nutch - SMB protocol

2012-08-29 Thread Lewis John Mcgibbney
Simoncini sicc...@gmail.com wrote: I'm not so familiar with SVN. Is this what you mean? http://svn.apache.org/repos/asf/nutch/branches/branch-1.5/ Matteo 2012/8/29 Lewis John Mcgibbney lewis.mcgibb...@gmail.com In the SVN area can you point me to the protocol plugin please? http

Re: Distributed Fetching

2012-08-29 Thread Lewis John Mcgibbney
Please see the tutorial and search on the user lists (you can find plenty of info on this via our website) http://www.mail-archive.com/user%40nutch.apache.org/ http://wiki.apache.org/nutch/#Other_Tutorial.28s.29 On Wed, Aug 29, 2012 at 4:22 PM, makaveli91ro makaveli9...@yahoo.com wrote: Hello

Re: bin/nutch

2012-08-27 Thread Lewis John Mcgibbney
try ant runtime This will generate the runtime deployment(s) you require to get going, however it _does_not_ give you a ready to rock deployment. You should check out the following tutorials below http://wiki.apache.org/nutch/Nutch2Tutorial http://nlp.solutions.asia/?p=180 Lewis On Mon, Aug

Re: Nutch 2.0 error

2012-08-27 Thread Lewis John Mcgibbney
. On Sun, Aug 26, 2012 at 3:39 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Robert, On Sun, Aug 26, 2012 at 5:25 AM, Robert Irribarren rob...@algorithms.io wrote: org.apache.solr.common.SolrException: Server Error Server Error ... Please read this [0] before

Re: recrawl a URL?

2012-08-27 Thread Lewis John Mcgibbney
The crawldb needs to receive updates of data in fetched segments; once you generate, it will calculate what needs to be fetched in the next iteration. It is OK to store segments in different locations but typically you would want to maintain one crawldb for all of your records... unless of course you

Re: bin/nutch

2012-08-27 Thread Lewis John Mcgibbney
:46 PM, Tolga to...@ozses.net wrote: Do I need HBase as well? On 08/27/2012 03:00 PM, Lewis John Mcgibbney wrote: try ant runtime This will generate the runtime deployment(s) you require to get going, however it _does_not_ give you a ready to rock deployment. You should check out

Re: Content of size X was truncated to Y

2012-08-27 Thread Lewis John Mcgibbney
Further to Markus' comments please also see <property> <name>parser.skip.truncated</name> <value>true</value> <description>Boolean value for whether we should skip parsing for truncated documents. By default this property is activated due to extremely high levels of CPU which parsing can sometimes

Re: running main() in plugins?

2012-08-26 Thread Lewis John Mcgibbney
You can easily run any plugin from the terminal using ./bin/nutch plugin. In the case of the HtmlParser main() method you would want to do ./bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser $pathToLocalFile. You have actually identified an improvement which we could do with

Re: Nutch 2.0 error

2012-08-26 Thread Lewis John Mcgibbney
Hi Robert, On Sun, Aug 26, 2012 at 5:25 AM, Robert Irribarren rob...@algorithms.io wrote: org.apache.solr.common.SolrException: Server Error Server Error ... Please read this [0] before posting to the list. It saves both you and us loads of time and also means there is less unnecessary

Re: speed of fetcher in nutch-2.0

2012-08-23 Thread Lewis John Mcgibbney
Hi, @Alxsss I hope Walter's suggestion(s) help you out here. @Walter I've added your model answer to the wiki [0]; this is a great response and I just couldn't help but add it. Thank you Lewis [0]

Re: Happy 10th Birthday Nutch!

2012-08-22 Thread Lewis John Mcgibbney
. Proud to have been around since 2005 (7 of them!) :) Cheers, Chris On Aug 9, 2012, at 1:31 PM, Lewis John Mcgibbney wrote: Nice one Julien, I'm going to update the site with this as it's a pretty huge milestone @Apache and a lot of projects and current developers owe

Re: Nutch Crawling for Videos

2012-08-22 Thread Lewis John Mcgibbney
Hi Robert, There is a parse-swf plugin for Nutch which uses the JavaSWF library [0] to parse such files (of what version I am not currently aware) and I can confirm that it does work, e.g. when used from the command line I can obtain parse data from within a local swf file. I am not sure if this
