Re: Unable to inject seeds with

2013-11-20 Thread Lewis John Mcgibbney
Hi Jon, On Wed, Nov 20, 2013 at 11:08 AM, user-digest-h...@nutch.apache.org wrote: I do see that stuff is getting into Accumulo but, in my unexperienced opinion, it looks like the map method is never getting called in the job. I'm not sure if this is supposed to happen after the

Re: Unable to inject seeds with

2013-11-15 Thread Lewis John Mcgibbney
Hi Jon, As you've guessed by now this is not so much a Nutch specific problem. I'm CC'ing user@gora in here as well. On Fri, Nov 15, 2013 at 8:05 PM, user-digest-h...@nutch.apache.org wrote: I was wrong. So changing the gora.datastore.accumulo.user property caused the inject to finish and on

Re: Unable to inject seeds with

2013-11-14 Thread Lewis John Mcgibbney
Hi Jon, On Thu, Nov 14, 2013 at 4:15 PM, user-digest-h...@nutch.apache.org wrote: Unable to inject seeds with 29017 by: Jon Uhal First, here is my environment: Hadoop 1.2.1 Accumulo 1.4.4 Zookeeper 3.4.5 Gora 0.3 Solr 4.5.1 All software revisions look fine so good start :)

Re: Unable to inject seeds with

2013-11-14 Thread Lewis John Mcgibbney
Hi Jon, Glad to hear that you're making some more progress! On Thu, Nov 14, 2013 at 8:45 PM, user-digest-h...@nutch.apache.org wrote: So I think it has to do with Accumulo somehow. I reverted the conf/gora.properties setting for mock from false to: gora.datastore.accumulo.mock=true and

Re: whst does the host table do in nutch2.2.1?

2013-11-08 Thread Lewis John Mcgibbney
Hi Edward, On Fri, Nov 8, 2013 at 8:46 PM, user-digest-h...@nutch.apache.org wrote: user Digest 8 Nov 2013 20:46:59 -0000 Issue 2099 As to the host table, I am not quite sure about its function, like in which step (inject, generate, fetch, parse, updatedb, updatehostdb) is this host table get

Re: user Digest 5 Nov 2013 13:29:55 -0000 Issue 2097

2013-11-05 Thread Lewis John Mcgibbney
Hi Olle, On Tue, Nov 5, 2013 at 1:29 PM, user-digest-h...@nutch.apache.org wrote: user Digest 5 Nov 2013 13:29:55 -0000 Issue 2097 Hi Lewis, Just a quick question - I'm having a slight problem with the NUTCH-828v3 patch. I check out nutch trunk, make sure it runs ok, then apply the patch.

Re: NUTCH-828 fetch filter

2013-11-03 Thread Lewis John Mcgibbney
Hi Olle, On Sun, Nov 3, 2013 at 9:56 AM, user-digest-h...@nutch.apache.org wrote: user Digest 3 Nov 2013 09:56:44 -0000 Issue 2096 Re: user Digest 30 Oct 2013 00:57:14 -0000 Issue 2094 28926 by: Lewis John Mcgibbney 28929 by: Olle Romo Thanks for the reply :) I just

Re: user Digest 30 Oct 2013 00:57:14 -0000 Issue 2094

2013-11-02 Thread Lewis John Mcgibbney
Hi Olle, On Wed, Oct 30, 2013 at 12:57 AM, user-digest-h...@nutch.apache.org wrote: NUTCH-828 fetch filter 28911 by: Olle Romo Has anyone been able to make the nutch-828 patch fetch filter work with = 1.7? Have you tried taking the patches and manually going through them

Re: server ip

2013-10-24 Thread Lewis John Mcgibbney
Hi Yasin Julien, On Thu, Oct 24, 2013 at 8:06 PM, user-digest-h...@nutch.apache.org wrote: add it to the metadata at the protocol level. Have you checked that there isn't a patch for that in Jira? If not please create one Yeah there is a patch for this which has stagnated somewhat. You

Re: Can't find Hadoop executable

2013-10-23 Thread Lewis John Mcgibbney
Hi sujit, On Tue, Oct 22, 2013 at 8:03 PM, user-digest-h...@nutch.apache.org wrote: can't find Hadoop executable 28860 by: sujit rai Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode. This is not a Nutch specific problem. There are numerous threads

Re: Release Plan

2013-10-17 Thread Lewis John Mcgibbney
Hi, On Wed, Oct 16, 2013 at 9:50 AM, user-digest-h...@nutch.apache.org wrote: Releases of Nutch 2 tend to follow releases of Gora as it relies on it an awful lot. We're working on the Avro upgrade in Gora and finally (after a long time away from the task at hand) I am finding time to

Re: some questions about nutch from a new user...

2013-10-02 Thread Lewis John Mcgibbney
Hi Patrick, On Sat, Sep 28, 2013 at 10:10 PM, user-digest-h...@nutch.apache.org wrote: 1. I use this command to start the crawling, as stated in the tutorial /bin/bash ./bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/2 So when the crawled pages will be sent to Solr for

Re: Nutch talk at Lucene/SOLR Revolution EU 2013

2013-09-25 Thread Lewis John Mcgibbney
Nice Julien. Looking forward to seeing these talks online. Gutted I will not be there. Best Lewis On Wed, Sep 25, 2013 at 12:52 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi, I will be giving a talk on Nutch at Lucene/SOLR Revolution in Dublin (4/7 Nov). There should be quite

Re: updatedb crashing

2013-08-29 Thread Lewis John Mcgibbney
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_One:_Using_the_Mailing_Lists On Thursday, August 29, 2013, Ralf R. Kotowski r...@enlle.com wrote: Nutch 2.2.1 on Mysql Got now a large number of docs, I interrupted Fetch several times with [cntrl]+C and now I get this when running

Re: strange message while running updatedb?

2013-08-29 Thread Lewis John Mcgibbney
Which version of Nutch do you use here, Kaveh? Can you paste the full stack trace? On Wednesday, August 28, 2013, kaveh minooie ka...@plutoz.com wrote: I was wondering if this has happened to anyone else. every once in a while my update map task fails with hadoop showing this message: Too many

Re: Empty webpage metadata in IndexingFilter, but not empty in database

2013-08-29 Thread Lewis John Mcgibbney
hi Brian there's no doubt that we should add this to plugin central. do you have an interest in doing so? good to hear you got it working lewis On Monday, August 26, 2013, brian4 bqu...@gmail.com wrote: Figured out the issue was I was not explicitly including the metadata field in the indexing

Re: strange message while running updatedb?

2013-08-29 Thread Lewis John Mcgibbney
are attached. the one that has all the stack trace in it is the one that actually finished successfully. On 08/29/2013 09:16 AM, Lewis John Mcgibbney wrote: Which version of Nutch do you use here, Kaveh? Can you paste the full stack trace? On Wednesday, August 28, 2013, kaveh minooie ka

Re: How nutch2.2 to parse rss?

2013-08-29 Thread Lewis John Mcgibbney
Hi Jonathan, This has been a long outstanding issue IIRC. I have not used Nutch for feed crawling for a while if I am honest, and I honestly can't recall when and if I have done it with 2.x. You will see [0], that by default the plugin is not actually initialized. So for starters you should

Re: How nutch2.2 to parse rss?

2013-08-29 Thread Lewis John Mcgibbney
ParseResult! Thank you! -- Original Message -- From: lewis john mcgibbney [via Lucene] ml-node+s472066n4087394...@n3.nabble.com; Sent: Friday, August 30, 2013, 9:34 AM To: 基勇 252637...@qq.com; Subject: Re: How nutch2.2 to parse rss? Hi Jonathan, This has been a long

Re: Nutch - Front end?

2013-08-28 Thread Lewis John Mcgibbney
ajax-solr On Wednesday, August 28, 2013, Ralf R. Kotowski r...@enlle.com wrote: So, basically Drupal becomes the Front-end? Interesting -Original Message- From: Nicholas Roberts [mailto:niccolo.robe...@gmail.com] Sent: Thursday, August 29, 2013 1:55 AM To: user@nutch.apache.org

Re: Nutch not crawling fully

2013-08-26 Thread Lewis John Mcgibbney
to [Wrong Password] It seems as though the password is not going correctly to the proxy server. I have set all required proxy parameters correctly in nutch-site.xml. Any clues? Suresh. -Original Message- From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent

Re: Empty webpage metadata in IndexingFilter, but not empty in database

2013-08-23 Thread Lewis John Mcgibbney
Hi Brian, When getting your metadata from the WebPage are you obtaining the full map, e.g. page.getMetadata(), or are you trying to get a Key and obtain the ByteBuffer value, e.g. page.getFromMetadata(Utf8 Key)? The latter will return null if nothing is present which is normal but means that the

Re: How to ask Nutch to get value of extra fields in IndexerJob/IndexerMapper?

2013-08-23 Thread Lewis John Mcgibbney
Hi Jeffery, Sorry about length of time to respond. Did you get a solution here? I wonder if this has to do with your crawlId? I would definitely say that an indexing plugin is the way to go here. Put the outlinks to avro map in WebPage then get them and add them to your doc. On Thursday, August

Re: 2.x vs. 1.x speed

2013-08-23 Thread Lewis John Mcgibbney
I am sure that Renato (if he is watching) can plugin maybe as well. We find in Gora that in every sense of the word, native Hadoop stores such as Avro, HBase and Accumulo when we execute a query with GoraInputFormat via getPartitions we retrieve GoraInputSplits natively which means splits are

Re: Nutch Solr empty but no error messages

2013-08-22 Thread Lewis John Mcgibbney
Hi Tracy, Logs are always your friend. Take it step by step [0], look at your logs and read the web db after every step to see whats going on. hth Lewis [0] http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling On Thu, Aug 22, 2013 at 1:44 PM, tracy

Re: Parse and DBUpdate Exception

2013-08-21 Thread Lewis John Mcgibbney
Hi Ward, The main problem with using this set up seems to have been the gora-sql-mapping.xml config file. The one which ships with Nutch was only a guide and has been proven time after time to be unsuitable for many set ups. This being said, it should be noted that the entire gora-sql module is

Re: Update documentation

2013-08-21 Thread Lewis John Mcgibbney
have time to use tools, not time to contribute much. I was merely pointing out the lack of documentation for Nutch v2. On Tue, Aug 20, 2013 at 4:36 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Yes please Andrew. If you can stick some time in to this then it would be greatly

Re: Display Document Count Added To Solr Server

2013-08-21 Thread Lewis John Mcgibbney
Nice work. The patch looks good and I would be +1 to getting it in to the codebase. Thanks Lewis On Wednesday, August 21, 2013, kamaci furkankam...@gmail.com wrote: Currently you can not see how many documents are added to Solr Server. One could see how many documents are added to Solr server

Re: Automating nutch installation

2013-08-19 Thread Lewis John Mcgibbney
John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Andrew, It seems that from the core Nutch team the time and desire is not there right now to push forward with your proposals. This DOES NOT mean that the proposals will be ignored. I would FULLY back convenient packaging

Re: Automating nutch installation

2013-08-18 Thread Lewis John Mcgibbney
of others' lives, a whole lot easier. I'll see what I can get done during downtime, probably prioritizing Homebrew first. On Fri, Aug 16, 2013 at 12:06 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Andrew, Have you written and maintained Debian packages before? I opened a Jira

Re: Automating nutch installation

2013-08-16 Thread Lewis John Mcgibbney
Hi Andrew, Have you written and maintained Debian packages before? I opened a Jira issue for this a while back and as far as I know it is a far from trivial process but I would be very very interested to get a Debian package for Nutch and look at a mac package if so required. There is an issue of

Re: Nutch doen't crawl all links

2013-08-15 Thread Lewis John Mcgibbney
http.content.length override? I haven't checked your URL (although I do like your taste in music :) ) but this is a possible source. hth Lewis On Thursday, August 15, 2013, porcelet jeremy.ponce...@outlook.com wrote: Hello, i'm trying to index all beatles tabs from www.ultimate-guitar.com but

Re: SolrIndexerJob connection reset - job failed

2013-08-13 Thread Lewis John Mcgibbney
Hi Brian, I've never seen this before. I found this however http://web.archiveorange.com/archive/v/L9Ul807Yu77D5QW7PGPn I know posting links to resolve problems is not ideal... but as I said I've never seen it before. Interesting though that this happens intermittently same as in the issue

Re: crawlID doesn't work?

2013-08-12 Thread Lewis John Mcgibbney
Hi Kaveh, No, you're not missing anything... crawlID is not equal to the Cassandra keyspace (keyspaces by default set to webpage for webdb and host for hostdb); instead the crawlId can be used to generate, identify, maintain, etc. different datasets which can all belong to the same keyspace. If you

Re: Hbase is able to connect to Zookeeper but the connection closes immediatly

2013-08-09 Thread Lewis John Mcgibbney
Hi Ralf, AFAICS this would be much better suited to hbase user list. Sorry I can't help more On Friday, August 9, 2013, Ralf R. Kotowski r...@enlle.com wrote: Nutch 2.2.1 Hbase 0.90.4 Solr 4.4.0 Fedora Core 19 Sun Java (latest) Error Msg: Hbase is able to connect to Zookeeper but the

Re: need help with store.CassandraStore

2013-08-09 Thread Lewis John Mcgibbney
Hi Kaveh, N.B. Taking this to user@gora and after this mail please drop user@nutch Quick question, is your cassandra server up and running at default port 9160? On Fri, Aug 9, 2013 at 3:36 PM, kaveh minooie ka...@plutoz.com wrote: Hi Everyone So I don't know if I am doing something wrong

Re: file:/// URLS with spaces in path

2013-08-07 Thread Lewis John Mcgibbney
baishen.li...@gmail.com wrote: Is it possible to run a web server and connect to them that way? That was what I ended up doing. On Tue, Aug 6, 2013 at 4:58 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, Struggling with this one. And yes I acknowledge that it is not really

Re: file:/// URLS with spaces in path

2013-08-07 Thread Lewis John Mcgibbney
that way? That was what I ended up doing. On Tue, Aug 6, 2013 at 4:58 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, Struggling with this one. And yes I acknowledge that it is not really a Nutch based question but hopefully someone can help... I have a directory

Re: protocol-file org.apache.nutch.protocol.file.FileError: File Error: 404

2013-08-07 Thread Lewis John Mcgibbney
Hi Tejas, Thanks this looks like the key ;) On Tue, Aug 6, 2013 at 9:51 PM, Tejas Patil tejas.patil...@gmail.com wrote: Hi Lewis, Can you try the patch attached over here: https://issues.apache.org/jira/browse/NUTCH-1483 Thanks, Tejas On Tue, Aug 6, 2013 at 7:24 PM, Lewis John Mcgibbney

Re: protocol-file org.apache.nutch.protocol.file.FileError: File Error: 404

2013-08-07 Thread Lewis John Mcgibbney
PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, Now using Nutch trunk 1.8-SNAPSHOT HEAD Back at this tonight. When attempting to fetch file://home/law/Downloads/asf/solr-4.3.1/example/e001 (notice two slashes) which contains loads of HTML files, I get

Re: Re: Parameter 'depth' is still supported in 2.2.1?

2013-08-07 Thread Lewis John Mcgibbney
Hi Rui, Please open a Jira for this and patch up 2.3-SNAPSHOT if you are able. You are right, it's probably about time to get rid of the class and entry within the bin/nutch script... or at least to log a HUGE WARN message when the class is invoked to say that it is deprecated and should not be

Re: Re: Re: Parameter 'depth' is still supported in 2.2.1?

2013-08-07 Thread Lewis John Mcgibbney
like to do that, but it seems I don't have permission to commit the code. Could someone give me the access? Thanks. Rui At 2013-08-08 10:30:21,Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Rui, Please open a Jira for this and patch up 2.3-SNAPSHOT if you are able. You

Re: 2.x vs. 1.x speed

2013-08-06 Thread Lewis John Mcgibbney
There is a benchmark class... which can be invoked from the nutch script I think. We can maybe extend this for some provisional benchmarks and post the stats on the Nutch wiki/site? wdyt? On Tuesday, August 6, 2013, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Otis, That certainly

Re: file:/// URLS with spaces in path

2013-08-06 Thread Lewis John Mcgibbney
remain in their native path form. On Tue, Aug 6, 2013 at 1:58 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, Struggling with this one. And yes I acknowledge that it is not really a Nutch based question but hopefully someone can help... I have a directory path as follows

protocol-file org.apache.nutch.protocol.file.FileError: File Error: 404

2013-08-06 Thread Lewis John Mcgibbney
Hi, Now using Nutch trunk 1.8-SNAPSHOT HEAD Back at this tonight. When attempting to fetch file://home/law/Downloads/asf/solr-4.3.1/example/e001 (notice two slashes) which contains loads of HTML files, I get the error as below. Fetcher: throughput threshold retries: 5 -finishing thread

Re: SolrClean not available in nutch 2.x

2013-08-01 Thread Lewis John Mcgibbney
Thanks. Great job. On Wed, Jul 31, 2013 at 5:45 PM, claudiuchis claudiuchi...@gmail.com wrote: Hi Lewis, I've created patch NUTCH-1294-v3.patch. Here are the steps I followed: $ svn checkout http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1 $ cd release-2.2.1 $ patch -p0

Re: SolrClean not available in nutch 2.x

2013-07-31 Thread Lewis John Mcgibbney
Hi Claudiu, Can you please attach your new patch if possible to the issue and we can try it out. I would be keen to get this in to the codebase. Thank you very much for getting back here. Best Lewis On Wed, Jul 31, 2013 at 2:42 PM, claudiuchis claudiuchi...@gmail.com wrote: Hi Lewis, The

Re: Deleting Duplicates works fine on one solr core, but not on antother - Nutch 1.5

2013-07-30 Thread Lewis John Mcgibbney
Makes perfect sense. I wonder if this is something to do with the Solr side? Do you have some logs you can view? On Tue, Jul 30, 2013 at 9:48 AM, dogrdon dgor...@planning.org wrote: oh, sorry, I just obscured it for the purposes of posting because I did not want to publish our solr

Re: Help with 'read data'

2013-07-30 Thread Lewis John Mcgibbney
Great. On Tue, Jul 30, 2013 at 9:16 AM, Weder Carlos Vieira weder.vie...@gmail.com wrote: Hello Lewis, I changed the ivy.xml GORA 0.3 to 0.2.1 version. Weder On Tue, Jul 30, 2013 at 1:09 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, AFAIK, MySQL

Re: SolrClean not available in nutch 2.x

2013-07-30 Thread Lewis John Mcgibbney
https://issues.apache.org/jira/browse/NUTCH-1294 Would be really really great if you could try this out and comment on this issue. Another tool we would then need to port to pluggable indexing. hth Lewis On Tue, Jul 30, 2013 at 11:10 AM, claudiuchis claudiuchi...@gmail.com wrote: Hi folks, I

Re: SolrClean not available in nutch 2.x

2013-07-30 Thread Lewis John Mcgibbney
I did not port, I merely helped a bit :) Dan Rosher was the driving force behind this one! Thanks for any feedback. Best On Tue, Jul 30, 2013 at 11:23 AM, claudiuchis claudiuchi...@gmail.com wrote: Hi Lewis. Thank you for porting SolrClean to the 2.x branch. I'll apply the patch and let you

Re: SolrClean not available in nutch 2.x

2013-07-30 Thread Lewis John Mcgibbney
Hi, On Tue, Jul 30, 2013 at 3:29 PM, claudiuchis claudiuchi...@gmail.com wrote: ... snip ... 4. I applied the patch cd /usr/local/nutch-2.2.1 patch -p0 NUTCH-1294-v2.patch The patch didn't update src/bin/nutch and conf/log4j.properties for some reason. I've updated these manually. If

Re: crawl time details of a particular domain

2013-07-29 Thread Lewis John Mcgibbney
pandey devangpande...@gmail.com wrote: @lewis ... Thanx for replying. Thing is using readdb I can read my crawldb but it shows only fetcher time. How can i find Fetcher start and end time as suggested by you.. On Sat, Jul 27, 2013 at 6:05 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com

Re: Deleting Duplicates works fine on one solr core, but not on antother - Nutch 1.5

2013-07-28 Thread Lewis John Mcgibbney
Hi, On Sun, Jul 28, 2013 at 2:38 PM, dogrdon dgor...@planning.org wrote: 2013-07-26 16:55:31,593 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://domain:port/solr/core0/ ... Caused by: java.lang.NullPointerException at This Solr URL looks incorrect.

Re: Nutch 2.2 - Exception in thread 'main' [org.apache.gora.sql.store.SqlStore]

2013-07-26 Thread Lewis John Mcgibbney
Please look at recent list archives. SqlStore is deprecated. Thanks Lewis On Friday, July 26, 2013, EarthMan huangrong...@gmail.com wrote: Hello Weder, Have you solved this problem with nutch 2.2? If yes can you share the solution? thank you. I get the same error below: Exception in thread

Re: crawl time details of a particular domain

2013-07-26 Thread Lewis John Mcgibbney
by looking at the fetcher start time and finish time Sir. There should be methods in TimingUtil to help you. hth On Friday, July 26, 2013, devang pandey devangpande...@gmail.com wrote: Hello , I am working on nutch 1.4 to crawl certain domains . Now after successful crawling I want to get start

Re: Nutch 2.2.1 Freezing / Deadlocked During Generator Job

2013-07-24 Thread Lewis John Mcgibbney
Hi Brian, Gora =0.3 deprecates the gora-sql 0.1.1-incubating artifact. This means Nutch 2.2.1 and MySQL/HSQLDB are incompatible. Lewis On Wed, Jul 24, 2013 at 12:42 PM, brian4 bqu...@gmail.com wrote: It definitely has nothing to do with HBase - I switched to use MySQL and I am still having

Re: Null Pointer Exception trying to run Nutch

2013-07-24 Thread Lewis John Mcgibbney
Hi, On Wed, Jul 24, 2013 at 2:02 PM, band_master swirlanalyt...@gmail.com wrote: After reading up a bit more, I see the 'crawl' function is deprecated in Nutch2 in favor of a java file located in 'bin/crawl' that executes each command in sequence. It is a replacement script which chains the

Re: Null Pointer Exception trying to run Nutch

2013-07-23 Thread Lewis John Mcgibbney
Hi band_master, On Tue, Jul 23, 2013 at 1:20 PM, band_master swirlanalyt...@gmail.com wrote: I am having trouble, though, getting Nutch to work. I can successfully inject urls, but there seems to be an error in the Hadoop log around parsing UTF8 characters. How are you coming to this

Re: Nutch Plugin Runtime Classpath

2013-07-23 Thread Lewis John Mcgibbney
Hi Alex, About now is a good time to read how Nutch deals with classloading. Navigate to plugin central on the wiki and you will see the documentation. hth you out Lewis On Tuesday, July 23, 2013, AC Nutch acnu...@gmail.com wrote: Hi All, I'm attempting to build a Nutch plugin on Nutch 1.7

Re: Nutch Plugin Runtime Classpath

2013-07-23 Thread Lewis John Mcgibbney
added to the plugin class-loader. However, that doesn't appear to be the case - I must be missing something, but I'm not sure what that is...? Alex On Tue, Jul 23, 2013 at 11:30 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Alex, About now is a good time to read how

Re: Re: [2.2.1] What does inject job do?

2013-07-21 Thread Lewis John Mcgibbney
On Sat, Jul 20, 2013 at 10:58 PM, Rui Gao gaorui...@163.com wrote: I checked the DB, the URL is already in DB. The plugin property is configured like this: <property> <name>plugin.folders</name> <value>./src/plugin,./plugins</value> </property> Any reason at all that you have two directories listed? Do you

Re: [2.2.1] org.apache.hadoop.hbase.MasterNotRunningException

2013-07-20 Thread Lewis John Mcgibbney
Hi Rui, You can't use this version of HBase in your search stack with Nutch 2.x right now. You need to downgrade quite significantly to 0.90.x Thanks Lewis On Saturday, July 20, 2013, Rui Gao gaorui...@163.com wrote: Hi, I try to setup Nutch2.2.1 + hbase-0.94.9 + eclipse + cygwin on Windows

Re: [2.2.1] What does inject job do?

2013-07-20 Thread Lewis John Mcgibbney
Hi Rui, On Saturday, July 20, 2013, Rui Gao gaorui...@163.com wrote: So, what direction will Nutch go? Will it co-operate with relationship database or will it only work on non-relationship database like hbase? This has nothing to do with Nutch. It has everything to do with Apache Gora and we

Re: Nutch 2.2.1 parse (slow?)

2013-07-20 Thread Lewis John Mcgibbney
Hi Martin, On Saturday, July 20, 2013, Martin Aesch martin.ae...@googlemail.com wrote: I have about 25K URLs per map task and around 8M URLs total All 6 mappers run and have continuously output. The aggregated parse rate is 100URLs/sec. wow this is painstakingly slow indeed. This was similar

Re: [2.2.1] What does inject job do?

2013-07-20 Thread Lewis John Mcgibbney
, there's a link talking about how to integrate nutch with mysql: http://nlp.solutions.asia/?p=362 Do you have any suggestion? Thanks. Best Regards, Rui At 2013-07-11 03:53:12,Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Rui, This should not work. The SqlStore module and support

Re: [2.2.1] What does inject job do?

2013-07-20 Thread Lewis John Mcgibbney
John Mcgibbney lewis.mcgibb...@gmail.com wrote: Please read the exception trace. You are running on Hadoop? You need to ensure that your plugins.directory points to the right path. There is also a mention of a missing job file. Please ensure that your nutch job file is on the Hadoop jobtracker

Re: Nutch 2.2.1 Freezing / Deadlocked During Generator Job

2013-07-19 Thread Lewis John Mcgibbney
Hi Brian, On Thursday, July 18, 2013, brian4 bqu...@gmail.com wrote: On one machine, nutch just suddenly started freezing during the generator job. Are these continuous crawls? What values do you have set for generate.max.count? I ask as calls must be made to the backend to determine a limit for

Re: Why aren't my path exclusions getting excluded in the Nutch index to Solr?

2013-07-19 Thread Lewis John Mcgibbney
Hi, On Fri, Jul 19, 2013 at 9:43 AM, dogrdon dgor...@planning.org wrote: +^http://www.oursite.com/([a-z0-9\-A-Z]*\/)* in the regex-urlfilter.txt file should be -^http://www.oursite.com/([a-z0-9\-A-Z]*\/)* Thanks Lewis
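A rough illustration of why the leading sign matters: as far as I understand it, the regex URL filter reads regex-urlfilter.txt top to bottom, lets the first matching rule decide (+ accepts, - rejects), and drops any URL that matches no rule. The Python sketch below only mimics that behaviour. It is not Nutch code; the function names, the trailing `+.` catch-all rule and the example.org URL are assumptions added for the demonstration.

```python
import re

# Rules in regex-urlfilter.txt style. The first line is the corrected rule
# from the reply above; the "+." catch-all is an assumed final rule.
RULES_TEXT = r"""
# skip everything under oursite.com
-^http://www.oursite.com/([a-z0-9\-A-Z]*\/)*
# accept anything else
+.
"""

def load_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule that matches; reject if none does."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

rules = load_rules(RULES_TEXT)
print(accepts(rules, "http://www.oursite.com/private/page.html"))
print(accepts(rules, "http://www.example.org/index.html"))
```

With a leading `+` on the first rule, the same oursite.com URL would be accepted instead, which matches the symptom described in the thread.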

Re: Nutch 2.2.1 Freezing / Deadlocked During Generator Job

2013-07-19 Thread Lewis John Mcgibbney
Hi Brian, On Friday, July 19, 2013, brian4 bqu...@gmail.com wrote: No not continuous or large-scale. Crawls are just run each day. The machine that has the freezing issue was the one I was planning to use to do the daily crawls. Think this is most certainly a local config bottleneck. Nutch

Re: Why aren't my path exclusions getting excluded in the Nutch index to Solr?

2013-07-19 Thread Lewis John Mcgibbney
Sorry I misunderstood completely. You can enable filtering (and normalizing) for the solr-indexer job in trunk http://wiki.apache.org/nutch/bin/nutch%20solrindex This will enable you to crawl everything but restrict what gets sent down to the index from your crawdl. hth Lewis On Friday, July

Re: Nutch 2.2.1 and Nutch 1.7

2013-07-19 Thread Lewis John Mcgibbney
Hi, People are using both. People are finding bugs, improving code and making better software with every release we push. It is fair to say that 1.x is a mature and production ready piece of software. There is absolutely no doubt about this. 2.x has come a long way over the last few releases.

Re: Nutch 2.2.1 parse (slow?)

2013-07-19 Thread Lewis John Mcgibbney
Hi Martin, Have you checked that all mappers are working while parsing job is running? How many URLs are you trying to parse here? On Friday, July 19, 2013, Martin Aesch martin.ae...@googlemail.com wrote: Dear nutchers, Having Nutch 2.2.1/HBase 0.90.6/Hadoop 1.1.2/6Mappers/6Reducers/Core

Re: Issue in generating URLs for re-fetching once db.fetch.interval.max elapses

2013-07-19 Thread Lewis John Mcgibbney
I am not looking at the code. Can you explain what you're expecting to happen please? On Thursday, July 18, 2013, vivekvl vive...@yahoo.com wrote: I found a issue in shouldFetch() of AbstractFetchSchedule class. (Nutch 2.1) Here even when (fetchTime - curTime maxInterval * 1000L), the method

Re: How to configure SolrDeDup Job to run per batch Id not entire index?

2013-07-18 Thread Lewis John Mcgibbney
Hi Tony, On Thursday, July 18, 2013, Tony Mullins tonymullins...@gmail.com wrote: Currently in Nutch2.x SolrDeDup job runs on entire index. Is it possible to configure it to run against the current batch Id ? It will be possible. There are various issues open (and patches) for 2.3 which deal

Re: Storing Nutch crawled data in database

2013-07-12 Thread Lewis John Mcgibbney
Hi, Please grab the most recent Nutch 2.2.1 release from our downloads page. A description of the codebase is available on the Nutch home page. You can use this with different NoSQL backends. Tutorials are available on the Nutch wiki. hth Lewis On Fri, Jul 12, 2013 at 4:02 AM, devang pandey

Re: How to run unit tests for a single plugin in 2.x

2013-07-12 Thread Lewis John Mcgibbney
Hi Brian, On Fri, Jul 12, 2013 at 11:57 AM, brian4 bqu...@gmail.com wrote: What am I doing wrong? You're doing nothing wrong. We would need to submit a patch for this to get it working. Currently, when the plugins are compiled and tested, I *think* that the generated plugin test

Re: Two questions about Nutch

2013-07-12 Thread Lewis John Mcgibbney
Hi, On Friday, July 12, 2013, Yves S. Garret yoursurrogate...@gmail.com wrote: 1 - Is there a web-gui interface that will enable me to look over the different search terms that I can use and what searches are going on? If so, how can I solve this problem? Nutch 2.x has a REST interface via

Re: Exception in nutch...

2013-07-12 Thread Lewis John Mcgibbney
The gora-sql artifact is now deprecated. Please read your ivy.xml descriptor for reasoning and logic. We advise you to use another storage mechanism... the options are also in the ivy.xml descriptor. hth Lewis On Thursday, July 11, 2013, Ramakrishna ramakrishna...@dioxe.com wrote: When i use

Re: Exception while running Nutch

2013-07-12 Thread Lewis John Mcgibbney
Please check the syntax you are using for the cli arguments. It is all wrong. You can see correct usage syntax on nutch tutorial or on command line options. hth Lewis On Friday, July 12, 2013, Ramakrishna ramakrishna...@dioxe.com wrote: Injector: starting at 2013-07-12 18:17:41 Injector:

Re: any changes to Nutch 2.2.1 webpage table

2013-07-10 Thread Lewis John Mcgibbney
No 2.2.1 webpage schema will be the same. Nutch 2.1 introduced the concept of batchId for URLs but there is no change in most recent. Thanks Lewis On Tuesday, July 9, 2013, A Laxmi a.lakshmi...@gmail.com wrote: Hello, I could use Nutch 1.6 and 2.1 without any issues in the past. However, now

Re: nutch status code

2013-07-10 Thread Lewis John Mcgibbney
Can you show the relevant part of the segment dump? On Wed, Jul 10, 2013 at 4:10 AM, devang pandey devangpande...@gmail.com wrote: hello I am using readseg command to read a segment corresponding to a particular url . But output contains nutch status 67 . What exactly is nutch status 67

Re: boost field is always 0.0 in nutch 2.x after custom scoring filter

2013-07-10 Thread Lewis John Mcgibbney
Hi, On Tue, Jul 9, 2013 at 1:03 AM, imran khan imrankhan.x...@gmail.com wrote: I have gone through the source code of this plugin but couldn't find any code which could be affect the value of boost field. Assuming that you are using 2.2.1 or 2.X HEAD, the boost field as assigned to the

Re: [2.2.1] What does inject job do?

2013-07-10 Thread Lewis John Mcgibbney
Hi Rui, This should not work. The SqlStore module and support for it is now deprecated within Apache Gora. If you would like to downgrade to use Nutch 2.1, then you can use older Gora artifacts but this is not recommended. Thanks Lewis On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao gaorui...@163.com

Re: Number of mappers in a distributed mode

2013-07-03 Thread Lewis John Mcgibbney
Please look for mapred-site.xml in the Hadoop conf directory. You can specify mapred.reduce.tasks there and set an int for this value. You will need to restart the JobTracker for this to kick in, I would imagine. On Wednesday, July 3, 2013, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Hi When

Re: Distributed mode and java/lang/OutOfMemoryError

2013-07-02 Thread Lewis John Mcgibbney
In mapred-site.xml. It is your MapReduce configuration override. hth On Tuesday, July 2, 2013, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Thanks a lot Markus! Where do we define this parameter, please? Benjamin On Tue, Jul 2, 2013 at 4:28 PM, Markus Jelsma

Re: no digest field avaliable

2013-07-02 Thread Lewis John Mcgibbney
Which version of Nutch are you using, please? On Tuesday, July 2, 2013, Christian Nölle noe...@uni-wuppertal.de wrote: Hi everybody, I have a problem concerning solrdedup. We have a field digest in Solr, and the solrindex-mapping for digest is fine as well, but there is no field digest showing up in the

Re: Distributed mode and java/lang/OutOfMemoryError

2013-07-02 Thread Lewis John Mcgibbney
directory? Benjamin On Tue, Jul 2, 2013 at 5:10 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: In mapred-site.xml. It is your MapReduce configuration override. hth On Tuesday, July 2, 2013, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Thanks a lot Markus! Where do

Re: [VOTE] Apache Nutch 2.2.1 RC#1

2013-07-02 Thread Lewis John Mcgibbney
Renato Marroquín Mogrovejo (nb) Sebastian Nagel Julien Nioche Lewis John McGibbney [ ] +/-0, fine, but consider to fix few issues before... [ ] -1, nope, because... (and please explain why) WOW, great VOTE'ing turnout. Thank you so much to everyone for reviewing this RC. I will progress

[RESULT] WAS Re: [VOTE] Apache Nutch 2.2.1 RC#1

2013-07-02 Thread Lewis John Mcgibbney
Sorry team, this should have been a [RESULT] thread. Thanks Lewis On Tue, Jul 2, 2013 at 9:08 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: In the famous words of Truman: good morning, good afternoon, good evening and good night... to all Nutch'ers!!! I would like to bring

[ANNOUNCE] Apache Nutch v2.2.1 Released

2013-07-02 Thread Lewis John Mcgibbney
Good Afternoon Everyone, The Apache Nutch PMC are very pleased to announce the immediate release of Apache Nutch v2.2.1. We advise all current users and developers of the 2.X series to upgrade to this release ASAP. Apache Nutch is an open source web-search software project. Stemming from Apache

Re: Nutch scalability tests

2013-07-02 Thread Lewis John Mcgibbney
Hi, Please try http://s.apache.org/mo, specifically the generate.max.count property. Many, many URLs are unfetched here... look into the logs and see what is going on. This is really quite bad, and there is most likely one/a small number of reasons which ultimately determine why so many URLs are

Re: Nutch scalability tests

2013-07-02 Thread Lewis John Mcgibbney
Hi, On Tue, Jul 2, 2013 at 3:53 PM, h b hb6...@gmail.com wrote: So, I tried this with the generate.max.count property set to 5000, rebuild ant; ant jar; ant job and reran fetch. It still appears the same, first 79 reducers zip through and the last one is crawling, literally... Sorry I

Re: INTEGRATION OF NUTCH AND SOLR

2013-07-02 Thread Lewis John Mcgibbney
Hi Avilash, It is extremely difficult to comment here. We need information on what's actually happening. Your description is a bit of a black box. Can you please look in hadoop.log and the Solr logs as well? This will give you an indication of how many documents are/were written down to Solr. thank you

Re: Questions/issues with nutch

2013-07-01 Thread Lewis John Mcgibbney
Is there a temporary file within the urls directory, something like seed.txt~? On Monday, July 1, 2013, h b hb6...@gmail.com wrote: Hi, I started to inspect the content of the crawled html. I have 2 urls in my seed.txt. So I should just have 2 documents in my solr response, right? I dropped

Re: nutch2.x in cluster mode ?

2013-06-30 Thread Lewis John Mcgibbney
Yes, it's as simple as that. The JobTracker takes care of delegation of tasks, therefore there is no need for Nutch to be present on every node. Hadoop and HBase (or whichever backend you choose) are a different case. On Sunday, June 30, 2013, Tejas Patil tejas.patil...@gmail.com wrote: I have never
