Re: Can't run Nutch2 on Hadoop2 (Nutch 2.x + Hadoop 2.4.0 + HBase 0.94.18 + Gora 0.5 + Avro 1.7.6)

2014-10-09 Thread Lewis John Mcgibbney
Hi Alex, On Thu, Oct 9, 2014 at 9:45 AM, user-digest-h...@nutch.apache.org wrote: I can't help but think that you have too many moving pieces here! Few of which now appear to be 'stable' enough. I would highly encourage you to look at https://issues.apache.org/jira/browse/NUTCH-1843 This is

Re: Crawled data not inserting in the tables

2014-09-30 Thread Lewis John Mcgibbney
Can you also make sure that the cluster name and fully qualified address and port agree between the mapping and gora.properties? Thanks. On Tuesday, September 30, 2014, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Hi Kartik, If TTL hasn't been set or if it has been set to 0, then

Re: Question about Nutch Wicket

2014-09-27 Thread Lewis John Mcgibbney
Hi Folks, On Thu, Sep 25, 2014 at 10:30 PM, user-digest-h...@nutch.apache.org wrote: I never used nutch web admin. Web admin that you used, is very old. Maybe you can use our brand new web admin development ( https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-841). Now it is

DOCUMENTATION - Nutch and Hidden Services

2014-09-23 Thread Lewis John Mcgibbney
Hi Folks, I've added a document for crawling hidden services .onion sites present within the Tor network. The documentation is available on the Nutch wiki https://wiki.apache.org/nutch/SetupNutchAndTor Hope some folks find this helpful. Thank you to Roger Dingledine from Tor for his patience and

Re: Running Crawls via REST API

2014-09-16 Thread Lewis John Mcgibbney
Hi Johannes On Tue, Sep 16, 2014 at 10:19 AM, user-digest-h...@nutch.apache.org wrote: is it possible to have nutch as a kind of stand-alone crawl server only spoken to via the REST API? Yes this is possible. We just finished a Google Summer of Code project which addresses exactly this via

Re: Revisiting Loops Job in Nutch Trunk

2014-09-16 Thread Lewis John Mcgibbney
Hi Markus, On Tue, Sep 16, 2014 at 10:19 AM, user-digest-h...@nutch.apache.org wrote: Hi - So you are not using it for scoring right, but to inspect the graph of the web. Yeah, I think that this is a pretty accurate statement. Then there's certainly no need to weed out loops using the

Re: Revisiting Loops Job in Nutch Trunk

2014-09-11 Thread Lewis John Mcgibbney
Hi Markus, On Wed, Sep 10, 2014 at 10:28 PM, user-digest-h...@nutch.apache.org wrote: Weird, i didn't see my own mail arriving on the list, i sent it via kmail but am on webmail now, which seems to work. sigh ;) Anyway, for vertical search on a whole website i would rely on your

Re: Revisiting Loops Job in Nutch Trunk

2014-09-10 Thread Lewis John Mcgibbney
are popular, so that means large scale. On Wednesday 10 September 2014 07:43:34 Lewis John Mcgibbney wrote: Hi Markus, On Wed, Sep 10, 2014 at 2:00 AM, user-digest-h...@nutch.apache.org wrote: Hey Lewis, We didn't use it in the end, but did run the LinkRank on large amounts

Re: making nutch compatible with hadoop 2

2014-09-09 Thread Lewis John Mcgibbney
Hi Sachin, On Tue, Sep 9, 2014 at 8:38 AM, user-digest-h...@nutch.apache.org wrote: hi all i am trying to make nutch2.3 compatible with hadoop 2 so now i am facing some problems. I have configured apache gora0.4 and hbase 0.94 with nutch2.3 so now when i inject the urls in the database a

Re: Nutch FAQ

2014-09-09 Thread Lewis John Mcgibbney
Hi Julien, Apologies about delay, this thread is old now but still important and relevant. On Mon, Sep 1, 2014 at 2:45 AM, user-digest-h...@nutch.apache.org wrote: Our FAQ page [http://wiki.apache.org/nutch/FAQ] needs a bit of an update. Some of the items on it are now irrelevant (search and

Re: Cassandra and Nutch 2.X not coding in UTF8

2014-09-08 Thread Lewis John Mcgibbney
Hi cervenkovab, This is an inherent design choice we made whilst developing gora-cassandra module to what it is now. Ultimately we store all data as a Byte Array. CQLSH subsequently gets data as it is within Cassandra. Therefore no decoding is done on the client side before the data is presented

Re: [ANNOUNCE] GSoC Create a Wicket-based Web Application for Nutch Project SUCCESSFUL

2014-09-01 Thread Lewis John Mcgibbney
Hi Martin, On Mon, Sep 1, 2014 at 2:45 AM, user-digest-h...@nutch.apache.org wrote: Thank you for working on this project, Феодор! I hope you enjoyed working on it and you will continue contributing to Open Source projects. +1 Nutch community please let us know when you are ready to

[ANNOUNCE] GSoC Create a Wicket-based Web Application for Nutch Project SUCCESSFUL

2014-08-31 Thread lewis john mcgibbney
Hi Folks, I thought I would make an announcement regarding a project which has been ongoing over the summer and which has now successfully passed the Google Summer of Code 2014 program. Our student Fjodor Vershinin approached the Nutch community some time in February of this year to express his

Nutch 2.X Vagrant WAS Re: [RELEASE] Apache Nutch 1.9

2014-08-29 Thread Lewis John Mcgibbney
Hi Nicholas, NOTE: Thread name has changed to reflect diversion on topic. On Fri, Aug 29, 2014 at 6:01 AM, user-digest-h...@nutch.apache.org wrote: will you use config management like ansible backing vagrant? Well thanks for the links here. The github repos they have indicates that they

Re: [RELEASE] Apache Nutch 1.9

2014-08-29 Thread Lewis John Mcgibbney
Hi Julien, On Fri, Aug 29, 2014 at 6:01 AM, user-digest-h...@nutch.apache.org wrote: Just out of interest, what sort of analytics do you do and why is it better to do it in 2.x than 1.x? Nowhere did I say it was better or worse than in 1.X. Let me be clear here. I use Nutch 2.X, as I

Re: [RELEASE] Apache Nutch 1.9

2014-08-28 Thread Lewis John Mcgibbney
Hi Mo, On Thu, Aug 28, 2014 at 3:33 PM, user-digest-h...@nutch.apache.org wrote: Sorry for the late reply. Me included. This email was lost in the pile! I use Nutch 2.x as it enables me to do analytics over the data I am crawling. This is my justification for trying to maintain and further

Re: Nutch Ant-Ivy build issue resolving HBase dependencies

2014-08-19 Thread Lewis John Mcgibbney
Hi Azhar, On Tue, Aug 19, 2014 at 5:16 AM, user-digest-h...@nutch.apache.org wrote: As suggested in the report, I dropped my ~/.ivy2 folder, re-ran and was able to build successfully. Once your repo is in a broken state then you're stuck with the hostname added to revision issue. Fantastic.

[RELEASE] Apache Nutch 1.9

2014-08-18 Thread Lewis John Mcgibbney
Hi Everyone, The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.9, we advise all current users and developers of the 1.X series to upgrade to this release. Apache Nutch is a highly extensible and scalable open source web crawler software project. Nutch is a

Re: Nutch Ant-Ivy build issue resolving HBase dependencies

2014-08-18 Thread Lewis John Mcgibbney
Hi Azhar, On Mon, Aug 18, 2014 at 1:01 AM, user-digest-h...@nutch.apache.org wrote: Subject: Nutch Ant-Ivy build issue resolving HBase dependencies Hi I'm having a problem with resolving dependencies while building Nutch 2.2.1. Have added the dependency in ivy.xml to use gora-hbase.

[RESULT] WAS Re: [VOTE] Apache Nutch 1.9 Release Candidate #1

2014-08-16 Thread Lewis John Mcgibbney
Afternoon Troops, 72hrs has come and gone therefore I am closing the VOTE'ing. Results can be seen below. [4] +1 Push the release, I am happy :) Julien Nioche * Lu Feng * Sebastian Nagel * Lewis John McGibbney * [ ] +0 I am not bothered either way [ ] -1 I am not happy with this release

Re: How to index the plugin field in nutch with solr?

2014-08-12 Thread Lewis John Mcgibbney
Hi, On Tue, Aug 12, 2014 at 1:33 AM, user-digest-h...@nutch.apache.org wrote: Hi, everyone: I integrate nutch/solr/hbase to construct a search engine, it works well, except that some fields in the schema.xml are not indexed to solr. The fields in !-- core fields -- and !-- fields for

[VOTE] Apache Nutch 1.9 Release Candidate #1

2014-08-12 Thread Lewis John Mcgibbney
-10-28 Lewis John McGibbney (CODE SIGNING KEY) lewi...@apache.org Please VOTE as follows [ ] +1 Push the release, I am happy :) [ ] +0 I am not bothered either way [ ] -1 I am not happy with this release candidate (please state why) Firstly thank you to everyone that contributed to Nutch, it is greatly

Re: [VOTE] Apache Nutch 1.9 Release Candidate #1

2014-08-12 Thread Lewis John Mcgibbney
VOTE'ing will be open for 'at-least' 72 hours to allow people enough time to cast their VOTE's. Thanks Lewis On Tue, Aug 12, 2014 at 10:31 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi user@ dev@, This thread is a VOTE for releasing Apache Nutch 1.9. The release candidate

Re: Run Nutch and Hbase of different nodes

2014-08-07 Thread Lewis John Mcgibbney
Hi Hung, Nutch 2.X ships with an hbase-site.xml file. https://svn.apache.org/repos/asf/nutch/branches/2.x/conf/hbase-site.xml.template Can you not use that for your configuration? On Thu, Aug 7, 2014 at 6:44 PM, user-digest-h...@nutch.apache.org wrote: We have been trying Nutch for 2 days,

Re: [New Nutch Plugin] Delegate fetching to Selenium/Firefox for those jobs where you neeeeed javascript parsing

2014-07-30 Thread Lewis John Mcgibbney
Hi Mohammed, On Wed, Jul 30, 2014 at 11:46 AM, user-digest-h...@nutch.apache.org wrote: Available at: https://github.com/momer/nutch-selenium-grid-plugin This looks fantastic. Are you interested in bringing it into the codebase? I think that this would be very useful to many users of

Re: NUTCH + MongoDB

2014-07-24 Thread Lewis John Mcgibbney
Hi Muhamad, On Thu, Jul 24, 2014 at 4:25 AM, user-digest-h...@nutch.apache.org wrote: Anyone ever had MongoDB as storage in NUTCH 2.2.1 ?. Advice me please. I personally have not, no, but there is absolutely no reason why you can't. I know that the authors of the module e.g. Dictanova,

[FEEDBACK] Improving Content on the Nutch WebSite

2014-06-30 Thread Lewis John Mcgibbney
Hi Folks, As part of an ongoing project this summer within my work, I am working with Apache Nutch and therefore promoting the software as much as possible to everyone I meet. We received some feedback** on the site content and design and I wondered if people would mind reviewing the feedback and

Re: [FEEDBACK] Improving Content on the Nutch WebSite

2014-06-30 Thread Lewis John Mcgibbney
... and the link https://docs.google.com/spreadsheets/d/1FKD30fzojIDrSQ03qPztLpjToaIohT9_FMuGca6kxug/edit?usp=sharing On Mon, Jun 30, 2014 at 3:07 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Folks, As part of an ongoing project this summer within my work, I am working

Re: updatedb deletes all metadata except _csh_

2014-06-26 Thread Lewis John Mcgibbney
Hi Alex, On Thu, Jun 26, 2014 at 5:48 AM, user-digest-h...@nutch.apache.org wrote: I already came up with similar changes to the code as in this patch. Only suggestion to this patch's code is that to move checking if url exists in the datastore under if (!additionsAllowed) {

GSoC Nutch REST API Documentation

2014-06-25 Thread Lewis John Mcgibbney
Hi Folks, For those that are interested, I would like to guide you towards the growing documentation our student Fjodor is producing. https://wiki.apache.org/nutch/NutchRESTAPI Thanks Fjodor... keep this coming, this is great and extremely helpful. Best Lewis -- *Lewis*

Re: updatedb deletes all metadata except _csh_

2014-06-24 Thread Lewis John Mcgibbney
Hi Alex, I am really sorry for not making the connection here. On Tue, Jun 24, 2014 at 12:31 AM, user-digest-h...@nutch.apache.org wrote: So far, this looks like a bug in updatedb when filtering with batchId. I could only find one solution, to check if new pages are in the datastore and

Re: updatedb deletes all metadata except _csh_

2014-06-18 Thread Lewis John Mcgibbney
Hi Alex, On Tue, Jun 17, 2014 at 2:06 PM, user-digest-h...@nutch.apache.org wrote: I am using nutch-2.x with GORA_97. You mean GORA-94, the Avro upgrade? With which gora- backend please? Further investigation shows that DbUpdateReducer calls inlinkedScoreData.clear(); I see this on

#nutch on IRC

2014-06-18 Thread Lewis John Mcgibbney
Hi Folks, I've opened a channel on IRC for Nutch. It's at #nutch For those of you interested in joining the room via browser, you can do so here http://webchat.freenode.net/ Thanks Lewis -- *Lewis*

Re: #nutch on IRC

2014-06-18 Thread Lewis John Mcgibbney
Yep Do you fancy making your first commit to the new CMS? ;) On Wed, Jun 18, 2014 at 10:51 AM, Markus Jelsma mar...@openindex.io wrote: Cool Lewis. If this is there to stay, shouldn't we advertise it on our homepage? Markus On Wednesday, June 18, 2014 10:34:13 AM Lewis John Mcgibbney

Re: #nutch on IRC

2014-06-18 Thread Lewis John Mcgibbney
UPDATE We are on #nutchbot Someone took #nutch already! See you there. On Wed, Jun 18, 2014 at 10:34 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Folks, I've opened a channel on IRC for Nutch. It's at #nutch For those of you interested in joining the room via browser

New Apache Nutch Site

2014-06-10 Thread Lewis John Mcgibbney
Hi Folks, I recently attacked [0] which now enables us to run our site as a content management system as opposed to a static host for primitive documentation. Hopefully the site is easier to navigate now and of course it is MUCH easier for us to maintain as a community. Please see the README

Re: Injector works. But generator and fetcher don't work.

2014-06-06 Thread Lewis John Mcgibbney
Hi, On Thu, Jun 5, 2014 at 11:19 PM, user-digest-h...@nutch.apache.org wrote: # skip URLs containing certain characters as probable queries, etc. #-[?*!@=] Is this your preference? Anyways, you need to try and debug why there ends up being no Map Input records for the Generate phase. I

Re: Injector works. But generator and fetcher don't work.

2014-06-05 Thread Lewis John Mcgibbney
which version of Nutch are you using? Nutch 2 what? On Thu, Jun 5, 2014 at 12:14 PM, Manikandan Saravanan manikan...@thesocialpeople.net wrote: Dear Lewis, I’m running Nutch 2 on a Hadoop 1.2.1 cluster (2 nodes). I’m using Cassandra as my backend datastore . I’m trying to crawl one link as

Re: Injector works. But generator and fetcher don't work.

2014-06-05 Thread Lewis John Mcgibbney
It looks like the InjectorJob phase successfully injects your 1 URL in to Cassandra Keyspace. On Thu, Jun 5, 2014 at 12:14 PM, Manikandan Saravanan manikan...@thesocialpeople.net wrote: 14/06/05 15:01:02 INFO mapred.JobClient: Map input records=1 ... 14/06/05 15:01:02 INFO

Re: Getting started/Tutorial

2014-05-29 Thread Lewis John Mcgibbney
Hi Karl-Philipp, On Thu, May 29, 2014 at 12:50 AM, user-digest-h...@nutch.apache.org wrote: 2. downloaded and started HBase, shell is running, test database creation successful Please make sure for Nutch version 2.2.1 you are using HBase version 0.90.4. Please also make sure that you use

Re: Solr Deduplicate - Class Not Found Exception

2014-05-28 Thread Lewis John Mcgibbney
Hi, On Mon, May 26, 2014 at 11:20 AM, Manikandan Saravanan manikan...@thesocialpeople.net wrote: I’m running Nutch 2 Which version? Do you have the code packaged in to the .job jar? You need to look in there and see? It seems that it is not there.

Crawl Email Server with IMAPS or POP3

2014-05-16 Thread Lewis John Mcgibbney
Hi Folks, Has anyone done this before? Is email archiving something which we can do or not? I've been playing around with Geronimo's Javamail library and wondered if we could use it as Protocol extensions for the above protocols. Any thoughts? Lewis -- *Lewis*

Re: Nutch 2.x from svn.

2014-05-12 Thread Lewis John Mcgibbney
Hi BlackIce, On Sun, May 11, 2014 at 9:20 AM, user-digest-h...@nutch.apache.org wrote: Subject: Nutch 2.x from svn. java.lang.Exception: java.lang.IllegalStateException: Target host must not be null, or set in parameters. Which version of httpcore is included in /lib or on classpath?

Re: Nutch 1.8 Solrindexer failingBlackIce

2014-05-12 Thread Lewis John Mcgibbney
Hi BlackIce, On Sun, May 11, 2014 at 9:20 AM, user-digest-h...@nutch.apache.org wrote: You are correct, I did some research and found it to be a TIKA issue, it is fixed by setting the Title field to multivalued in schema.xml. I think by default the Nutch schema should be updated

Re: Nutch 2.3 ?

2014-05-07 Thread Lewis John Mcgibbney
Hi BlackIce, On Sat, May 3, 2014 at 10:52 PM, user-digest-h...@nutch.apache.org wrote: Any idea on when Nutch 2.3 will be released? Thnx We have a roadmap here https://issues.apache.org/jira/browse/NUTCH/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel What would be

Re: Solr 4.7 Schema?

2014-05-07 Thread Lewis John Mcgibbney
Hi BlackIce, On Sat, May 3, 2014 at 10:52 PM, user-digest-h...@nutch.apache.org wrote: Does anyone have a good nutch/solr 4.7 schema file? What about the one here http://svn.apache.org/repos/asf/nutch/branches/2.x/conf/schema.xml ??? Lewis

Re: Nutch 2.2.1: Web Content size of a particular website

2014-04-25 Thread Lewis John Mcgibbney
Hi, On Fri, Apr 25, 2014 at 11:15 AM, user-digest-h...@nutch.apache.org wrote: From what you said earlier, Isn't that the same as contentLength in index-more plugin which is determined according to the type of download page? Pretty much ;) It would be interesting to see if you could use

[ANNOUNCEMENT] Apache Gora 0.4 Release

2014-04-23 Thread Lewis John Mcgibbney
Good Afternoon Everyone, The Apache Gora team are very proud to announce the immediate release of Gora 0.4 which is a major release for the project. The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key

Re: [ANNOUNCEMENT] Apache Gora 0.4 Release

2014-04-23 Thread Lewis John Mcgibbney
Hi Julien, On Wed, Apr 23, 2014 at 1:56 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Great news! Well done and thanks to everyone involved. I am sure this will be popular with the Nutch 2.x users. +1 BTW I can smell a rematch of

[ANNOUNCE] NUTCH-841 Accepted into Google Summer of Code

2014-04-23 Thread Lewis John Mcgibbney
Hi Folks, A quick message to say that by the end of the summer it looks like we will have a killer Wicket-based Web Application for the Nutch 2.x branch as the project was successfully accepted into this year's GSoC program. :) :) :) Thanks Lewis -- *Lewis*

[ANNOUNCE] crawler-commons 0.4 is released

2014-04-11 Thread Lewis John Mcgibbney
11th April 2014 - crawler-commons 0.4 is released We are glad to announce the 0.4 release of Crawler Commons. Amongst other improvements, this release includes support for Googlebot-compatible regular expressions in URL specifications, further improvements to robots.txt parsing and an upgrade of

Re: nutch-2.x with hbase filter option

2014-04-11 Thread Lewis John Mcgibbney
Hi Alex, Regarding your findings and the code you've posted. Can you please open an issue on Jira, posting in the code changes (or even better a patch :) ). If you could link it to the Gora 0.4 upgrade issue in Nutch Jira it would be excellent. Thanks v much Alex. Lewis On Fri, Apr 11, 2014 at

Re: Crawl Anonymously

2014-04-09 Thread Lewis John Mcgibbney
Hi, On Wed, Apr 9, 2014 at 8:43 AM, user-digest-h...@nutch.apache.org wrote: Subject: Re: Crawl Anonymously Set up an anonymous proxy server and configure nutch to crawl through the proxy. http://wiki.apache.org/nutch/SetupProxyForNutch

Re: Nutch 2.2.1: Web Content size of a particular website

2014-04-09 Thread Lewis John Mcgibbney
Hi On Wed, Apr 9, 2014 at 8:43 AM, user-digest-h...@nutch.apache.org wrote: I might not be thinking in the right direction so need some help. Is there a way to find an approximate web content size of a particular website in Nutch 2.2.1?

Re: How to stop crawling in middle and start it from it was stopped

2014-04-07 Thread Lewis John Mcgibbney
Hi reddibabu, On Mon, Apr 7, 2014 at 7:20 AM, user-digest-h...@nutch.apache.org wrote: I am using Nutch 1.7. Nutch setup was on Linux Box. My requirement is to stop nutch crawling in middle and start it from where it was stopped. Here, I don't want to see missing crawled data. Is there any

Re: user Digest 1 Apr 2014 06:34:32 -0000 Issue 2184

2014-04-01 Thread Lewis John Mcgibbney
Hi Alex, On Tue, Apr 1, 2014 at 7:34 AM, user-digest-h...@nutch.apache.org wrote: I have applied the patch to the current trunk Here is the output of ant ...SNIP... Also, here are files that import/use StateManager class, which seems was removed from GORA_94 ...SNIP... Thanks.

[WELCOME] Nutch PMC Welcomes Talat Uyarer to PMC and Committer

2014-04-01 Thread Lewis John Mcgibbney
Hi Folks, We are pleased to announce that the Nutch PMC recently VOTE'd to extend an invitation to Talat inviting him to join our PMC. His ongoing mailing list contributions and code contributions (mostly) to the 2.X branch have been evident for some time now and we are really glad to have him on

Re: user Digest 27 Mar 2014 08:54:48 -0000 Issue 2182

2014-03-30 Thread Lewis John Mcgibbney
Hi Shane, On Sat, Mar 29, 2014 at 10:15 PM, user-digest-h...@nutch.apache.org wrote: MYSQL version 5.6.16 Nutch version 2.2 I would always suggest that you use the most up-to-date version of Nutch. For the 2.x branch that is 2.2.1. This may or may not have been fixed. Support for gora-sql

Re: user Digest 27 Mar 2014 08:54:48 -0000 Issue 2182

2014-03-30 Thread Lewis John Mcgibbney
Hi Shane, On Sat, Mar 29, 2014 at 10:15 PM, user-digest-h...@nutch.apache.org wrote: How do you use the readdb command when using MYSQL there is no crawldb created ? A physical crawldb residing on HDFS is non-existent. Its equivalent in 2.x is the WebPage table which you will see is created

Re: nutch-2.x with hbase filter option

2014-03-30 Thread Lewis John Mcgibbney
Hi alxsss, On Sat, Mar 29, 2014 at 10:15 PM, user-digest-h...@nutch.apache.org wrote: I downloaded GORA_94 branch and with libs from it I get 14/03/27 11:21:19 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: test_urls Exception in thread main java.lang.NoClassDefFoundError:

[GSoC] Deadline for Student Applications

2014-03-20 Thread Lewis John Mcgibbney
Hi everyone, This is just a friendly reminder that there are just about 48 hours left until the deadline for student applications this year.* Late proposals will not be accepted for any reason. It's great to see some applications coming in. Best Lewis

Re: Book of Nutch

2014-03-19 Thread Lewis John Mcgibbney
Hi Folks, On Wed, Mar 19, 2014 at 7:49 PM, user-digest-h...@nutch.apache.org wrote: Re: Book of Nutch So what was the issue with this book? You can see some rather interesting reviews online http://s.apache.org/qy

Re: Interleaved nutch crawls locks crawldb

2014-03-19 Thread Lewis John Mcgibbney
Hi anupamk, On Tue, Mar 18, 2014 at 2:45 AM, user-digest-h...@nutch.apache.org wrote: While running the two crawlers concurrently I have run into problems and nutch sometimes throws an IOException saying that the .locked file exists in crawldb. While one of crawl script tries to

Re: Nutch 2.2.1 pseudo dist, errors

2014-03-19 Thread Lewis John Mcgibbney
Hi BlackIce, On Wed, Mar 19, 2014 at 3:07 PM, user-digest-h...@nutch.apache.org wrote: HI, My first try to run Nutch in pseudo dist, when trying to run any nutch command from the /runtime/deploy folder I get following error: Which version of Hadoop? Check the classpath for the offending

Re: Probleme with nutch inject blocked

2014-03-19 Thread Lewis John Mcgibbney
Hi a.ciccia04, On Wed, Mar 19, 2014 at 3:07 PM, user-digest-h...@nutch.apache.org wrote: Im working with apache-nutch-2.2.1, hbase-0.90.4 solr-4.7.0 You've not stated how you've configured your stack. You've not mentioned how many machines you run with. This may simply be an IO problem

Re: solrindex Content instead of ParseText ?

2014-03-17 Thread Lewis John Mcgibbney
Hi anupamk, On Sat, Mar 15, 2014 at 11:59 PM, user-digest-h...@nutch.apache.org wrote: I would like to know if I can configure nutch to solrindex the Content:: part of the record rather than ParseText:: part. I really don't know but it would be nice to make this configurable. It would

[RESULTS] WAS Re: [VOTE] Release Apache Nutch 1.8RC#2

2014-03-16 Thread Lewis John Mcgibbney
to everyone who contributed to Nutch 1.8 development drive. Best Lewis On Tue, Mar 11, 2014 at 10:17 PM, lewis john mcgibbney lewi...@apache.org wrote: Hi user@ dev@, This thread is a VOTE for releasing Apache Nutch 1.8 RC#2. The release candidate comprises the following components

[ANNOUNCEMENT] Apache Nutch 1.8 Release

2014-03-16 Thread Lewis John Mcgibbney
Good Evening, The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.8. Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:

Re: nutch vs hadoop package versions

2014-03-05 Thread Lewis John Mcgibbney
Hi, On Wed, Mar 5, 2014 at 4:32 AM, user-digest-h...@nutch.apache.org wrote: There is a property to set the precedence for the jars in the submitted MapReduce job over the Hadoop jars, I dont recall the name of the property but you can try googling it. mapreduce.job.user.classpath.first =

Re: [VOTE] Apache Nutch 1.8 Release Candidate #1

2014-03-05 Thread Lewis John Mcgibbney
Hi Markus, On Wed, Mar 5, 2014 at 4:32 AM, user-digest-h...@nutch.apache.org wrote: -1 we still need to fix the last issue of the new segment merger test and the indexing issues. We also introduced the hostdb in 1.8 but it doesnt seem to work. Releasing 1.8 now is releasing a broken nutch

[VOTE] Apache Nutch 1.8 Release Candidate #1

2014-03-04 Thread Lewis John Mcgibbney
Hi All Nutch'ers, This thread is a VOTE for releasing Apache Nutch 1.8. The release candidate comprises the following components. * A staging repository [0] containing various Maven artifacts * A branch-1.8 of the trunk code [1] * The tagged source upon which we are VOTE'ing [2] * Finally, the

Re: Nutch API - conf id in create job

2014-02-26 Thread Lewis John Mcgibbney
Hi Mateusz, On Wed, Feb 26, 2014 at 9:59 AM, user-digest-h...@nutch.apache.org wrote: I'm investigating Nutch REST api Nice (I couldn't find any documentation). Yeah, this is something which will hopefully become part of this year's GSoC if the project gets a go ahead. Right now it is

Re: Common Crawl's Move to Apache Nutch

2014-02-21 Thread Lewis John Mcgibbney
Hi Julien, On Fri, Feb 21, 2014 at 3:09 PM, user-digest-h...@nutch.apache.org wrote: Hi, Just in case you missed it, here is a blog post from Jordan Mendelson on how they moved to Nutch : http://commoncrawl.org/common-crawl-move-to-nutch/ Julien I think a significant portion of this

Re: Nutch 2.2.1 can not index to solr

2014-02-12 Thread Lewis John Mcgibbney
Hi Gavin, On Wed, Feb 12, 2014 at 9:24 AM, user-digest-h...@nutch.apache.org wrote: ParserJob: starting ParserJob: resuming:false ParserJob: forced reparse:false ParserJob: parsing all Parsing http://www.tianya.cn/ Parsing http://www.163.com/ Parsing http://www.hao123.com/

Re: Nutch - Hadoop Help

2014-02-05 Thread Lewis John Mcgibbney
Hi Manikandan, On Wed, Feb 5, 2014 at 7:36 AM, Manikandan Saravanan manikan...@thesocialpeople.net wrote: I'm getting this when running the crawl script right after the parse phase Exception in thread main java.lang.IllegalArgumentException: usage: (-crawlId id) Something wrong with

Re: Nutch - Hadoop Help

2014-02-04 Thread Lewis John Mcgibbney
://thesocialpeople.net On 4 February 2014 at 1:28:39 am, Lewis John Mcgibbney (lewis.mcgibb...@gmail.com) wrote: Hi Manikandan, On Mon, Feb 3, 2014 at 3:45 PM, user-digest-h...@nutch.apache.org wrote: And then, I'm running this: $HADOOP_HOME/bin/hadoop jar /usr/local/nutch

Re: Nutch - Hadoop Help

2014-02-03 Thread Lewis John Mcgibbney
Hi Manikandan, On Mon, Feb 3, 2014 at 3:45 PM, user-digest-h...@nutch.apache.org wrote: And then, I'm running this: $HADOOP_HOME/bin/hadoop jar /usr/local/nutch/nutch.job org.apache.nutch.crawl.Crawler dmoz -dir /user/hduser/crawl -depth 3 -topN 5000 You're using the Crawler class. This is

Re: exception when trying to run nutch 2.2.1 on hadoop

2014-01-31 Thread Lewis John Mcgibbney
Hi Alberto, On Fri, Jan 31, 2014 at 5:26 AM, user-digest-h...@nutch.apache.org wrote: I noticed that the crawl script is generating the batch id but the Crawler class don't. So I changed nutch's code to generate the batch id and the problem solved. My questions: 1. I guess that there is a

Re: NoClassDefFoundError: org/cyberneko/html/parsers/DOMFragmentParser when using HtmlParser

2014-01-22 Thread Lewis John Mcgibbney
Hi d_k, On Wed, Jan 22, 2014 at 2:18 PM, user-digest-h...@nutch.apache.org wrote: nekohtml-0.9.5.jar and nekohtml-1.9.17.jar and I suspect that the file that's actually loaded is 0.9.5. So do I. If I execute 'ant clean' followed by 'ant runtime' (the patch was already applied) the only

Re: Crawling Websites for Links

2014-01-22 Thread Lewis John Mcgibbney
Hi Teague, On Wed, Jan 22, 2014 at 2:18 PM, user-digest-h...@nutch.apache.org wrote: @Markus: When you say that the problem may be with url filters, what can I do about that? By default Nutch uses a regex urlfilter for filtering out URLs which we assume will ultimately mess up your crawlDB.

Re: Nutch 2.2.1 missing inbound link when using HBase

2014-01-22 Thread Lewis John Mcgibbney
Hi weishenyun, On Wed, Jan 22, 2014 at 2:18 PM, user-digest-h...@nutch.apache.org wrote: Hi lewis, I also found that there is something wrong in the DBUpdaterReducer. See below code block: snip... This sentence 'page.putToInlinks(new Utf8(inlink.getUrl()), new

Nutch 2.x HEAD + gora-core gora-cassandra 0.4-SNAPSHOT (trunk)

2014-01-22 Thread Lewis John Mcgibbney
Hi Folks, Sorry for cross posting... Simple question. Is anyone using the above combination? When I try and fetch a batchId e.g. ./bin/fetch 1390426083-1144459470, sometimes I am unable to fetch pages and my logging indicates 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s,

Re: NoClassDefFoundError: org/cyberneko/html/parsers/DOMFragmentParser when using HtmlParser

2014-01-20 Thread Lewis John Mcgibbney
Hi d_k, On Mon, Jan 20, 2014 at 11:39 AM, user-digest-h...@nutch.apache.org wrote: Posting back as promised. :-) Great I just encountered the error java.lang.NoClassDefFoundError: org/cyberneko/html/parsers/DOMFragmentParser and applied the patch NUTCH-1253-2.x-v2.patch from NUTCH-1253

Fwd: ApacheCon NA 2014 Travel Assistance Applications now open!

2014-01-17 Thread Lewis John Mcgibbney
Hi user@/dev@, Please see the message below regarding travel assistance opportunities for people wishing to attend Apache Con 2014 which will be held in Denver, Colorado, April 7-9, 2014. Kind Regards Lewis -- Forwarded message -- From: lewis john mcgibbney lewi...@apache.org

Re: Cannot run program chmod : too many open files

2014-01-16 Thread Lewis John Mcgibbney
Hi Yann, On Thu, Jan 16, 2014 at 6:09 AM, user-digest-h...@nutch.apache.org wrote: Anything I can do on my side at this point? If you can reproduce this at your end, and suspect that others should be able to reproduce it as well, then I would suggest logging a ticket in our Jira. Please make

Re: need help about urlfilter

2014-01-16 Thread Lewis John Mcgibbney
Hi Jason, On Thu, Jan 16, 2014 at 6:09 AM, user-digest-h...@nutch.apache.org wrote: I had tried +^http://www.cancer.gov/cancertopics/druginfo http://www.cancer.gov/cancertopics/druginfo/lungcancer.* or +^http://www.cancer.gov/cancertopics/druginfo/
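For readers following the urlfilter threads in this listing, here is a minimal sketch of how rules of the "+pattern" / "-pattern" form are usually evaluated. It is written in Python rather than against Nutch itself, so it only approximates the Java regex behaviour, and it assumes first-match-wins semantics with rejection when no rule matches. The names parse_rules and accepts are hypothetical and are not part of any Nutch API.

```python
import re


def parse_rules(text):
    """Parse urlfilter-style rules: one '+regex' or '-regex' per line.

    Hypothetical helper for illustration only; blank lines and lines
    starting with '#' are skipped.
    """
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first matching rule is a '+' rule.

    Assumption: rules are tried in order, the first pattern found
    anywhere in the URL decides, and a URL matching no rule is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(
        "# keep only the druginfo section\n"
        "+^http://www.cancer.gov/cancertopics/druginfo/\n"
        "-.\n"
    )
    for url in (
        "http://www.cancer.gov/cancertopics/druginfo/lungcancer",
        "http://www.cancer.gov/cancertopics/",
    ):
        print(url, accepts(rules, url))
```

Under these assumptions, rule order matters: a broad "-" rule placed before the "+" rule would reject everything, which is one thing worth checking when a crawl unexpectedly produces no URLs.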

Re: Nutch 2.2.1 missing inbound link when using HBase

2014-01-16 Thread Lewis John Mcgibbney
Hi weishenyun, On Thu, Jan 16, 2014 at 6:09 AM, user-digest-h...@nutch.apache.org wrote: I have tried to use Nutch 2.2.1 recently. Using HBase as storage and I found that the column family il(inbound link) was missing. I have set db.update.max.inlinks = 1000 but none of il was there. Do

New Wiki Page - WorkingWithGoraSnapshots

2014-01-07 Thread Lewis John Mcgibbney
Hi Folks, It's become obvious that folks who are dabbling with Nutch 2.x deployments are either using, or wish to use, stable Gora SNAPSHOTs, e.g. Gora trunk 0.4-SNAPSHOT at time of writing. I put together this wiki page [0] for those people. Please send comments to the user@nutch list and we will

Re: Using Gora SNAPSHOT with Nutch

2014-01-07 Thread Lewis John Mcgibbney
Hi Manikandan, Please see https://issues.apache.org/jira/browse/NUTCH-1696 Can you please try this patch out and comment on the issue. I hope this works for you. Thank you Lewis On Tue, Jan 7, 2014 at 6:05 PM, Manikandan Saravanan manikan...@thesocialpeople.net wrote: Hi, I’m running Nutch

Re: Exception in NUTCH 2.2.1

2013-12-23 Thread Lewis John Mcgibbney
Hi Rajni, On Sun, Dec 22, 2013 at 12:39 PM, user-digest-h...@nutch.apache.org wrote: what can be error. Please try updating the sql database with the updatedb command. I would also advise you against using the crawl command and subsequent class. Please chain together individual commands and

Re: Exception in NUTCH 2.2.1

2013-12-19 Thread Lewis John Mcgibbney
Hi rk_sharma, On Wed, Dec 18, 2013 at 9:40 PM, user-digest-h...@nutch.apache.org wrote: Hi i am using nutch on rhel-5 and facing an exception [root@localhost local]# bin/nutch crawl urls -dir crawl -depth 3 -topN 5 InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora

Re: In reference to http://www.mail-archive.com/user@nutch.apache.org/msg09999.html (Get HTML content generated by Javascript)

2013-12-19 Thread Lewis John Mcgibbney
Hi Nibal, On Sun, Dec 15, 2013 at 11:26 PM, user-digest-h...@nutch.apache.org wrote: of Single Page Web-apps and JavaScript-only web-applications is sky-rocketing. Well, isn't this a high priority issue? It would appear not. Unless folk provide patches then core contributors have not

Re: Exception in NUTCH 2.2.1

2013-12-19 Thread Lewis John Mcgibbney
Hi Rajni, On Thu, Dec 19, 2013 at 3:53 PM, user-digest-h...@nutch.apache.org wrote: Thanks for your suggestion. I want to use MySQL as the storage backend for Gora in Nutch-2.2.1. Is there any proper documentation that can be followed? For the last two days I have been searching for that but could not find any

Re: Effective way to crawling seed and discover new urls.

2013-12-13 Thread Lewis John Mcgibbney
Hi Nguyen, On Fri, Dec 13, 2013 at 4:28 AM, user-digest-h...@nutch.apache.org wrote: I am crawling a list of home pages to discover new articles; the crawler will stop at depth 1. But at depth 1, the crawler still adds many new urls with depth 2, so even if I only crawl up to depth 1 the crawldb still

discrepancies in using Tika parser and DOMFragmentParser

2013-12-13 Thread Lewis John Mcgibbney
Hi, In the process of addressing and porting NUTCH-840 [0], I've discovered a couple of anomalies. Within org.apache.nutch.tika.TestDOMContentutils#setup trunk uses org.cyberneko.html.parsers.DOMFragmentParser like so: private static void setup() throws Exception { conf =

[ANNOUNCE] Dublin NoSQL Meetup – Apache Gora and the Oracle NoSQL database

2013-12-12 Thread Lewis John Mcgibbney
Hi Folks, A quick post here to promote an event Apostolos Giannakidis (Apache Gora's GSoC student this year) and I will be speaking at in Dublin this coming Monday. Event info and registration can be found below

Re: Manipulating Nutch 2.2.1 scoring system

2013-12-10 Thread Lewis John Mcgibbney
Hi Talat, On Sat, Dec 7, 2013 at 5:44 PM, user-digest-h...@nutch.apache.org wrote: Hi Vangelis, I drew a Nutch Software Architecture diagram. Maybe it can help you. https://drive.google.com/file/d/0B2kKrOleEOkRQllaTGdRZGFMY2M/edit?usp=sharing Talat Would you be interested in

Re: NoClassDefFoundError: org/cyberneko/html/parsers/DOMFragmentParser when using HtmlParser

2013-12-10 Thread Lewis John Mcgibbney
Hi d_k, Can you please check out this issue https://issues.apache.org/jira/browse/NUTCH-1253 I uploaded a patch on Feb 7th 2013 which has not been tested but which I hope will fix this issue. Can you please read up on the Jira issue and test the patch? Please also see my comments below On Tue,

Re: Unsuccessful fetch/parse of large page with many outlinks

2013-12-10 Thread Lewis John Mcgibbney
Hi, On Tue, Dec 10, 2013 at 8:46 PM, user-digest-h...@nutch.apache.org wrote: So this leaves me with a question. Are there recommendations for a properly configured User-Agent string that identifies an instance of a Nutch Crawler and does not run afoul of a firewall like this? Using the

Re: Too many link with status=1

2013-11-20 Thread Lewis John Mcgibbney
Hi vagkarv, On Wed, Nov 20, 2013 at 11:08 AM, user-digest-h...@nutch.apache.org wrote: Hi! I use apache-nutch-2.2.1 and for my database mysql. I run the crawl script and each time I check my database, I see that the number of links with status=1 are many more than the number of links with
