Re: Deploy nutch on existing Hadoop cluster

2013-02-21 Thread Lewis John Mcgibbney
Welcome to the world of post 1.3 Nutch ;) On Thursday, February 21, 2013, Amit Sela am...@infolinks.com wrote: I basically just built with ant and copied the contents of deploy (job file + nutch and crawl scripts) to nutch folder in my hadoop-user directory on the master. I changed the crawl

Re: gora zookeeper error

2013-02-21 Thread Lewis John Mcgibbney
by gora 0.2.1. I am actually using 0.90.6 for my hbase, but I don't know how to modify the ivy.xml file to accomplish that. thanks, On 02/21/2013 10:39 AM, Lewis John Mcgibbney wrote: http://s.apache.org/WbG sorry for ridiculous size of font hth On Thu, Feb 21, 2013 at 10:31 AM, kaveh minooie ka

Re: gora zookeeper error

2013-02-21 Thread Lewis John Mcgibbney
: cvc-complex-type.3.2.2: Attribute 'rev' is not allowed to appear in element 'include'. in file:/source/nutch/nutch/ivy/ivy.xml it is the same with exclude tag included as well. On 02/21/2013 11:19 AM, Lewis John Mcgibbney wrote: replace <dependency org="org.apache.gora" name="gora-hbase"
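
A note on the validation error above: in Ivy, rev is an attribute of the dependency element, not of include or exclude. The fragment below is a minimal sketch of how a different HBase version might be pinned in ivy.xml, assuming the goal is HBase 0.90.6 alongside gora-hbase 0.2.1 as mentioned in this thread. The exclusion and the explicit HBase dependency are illustrative assumptions and are not taken from the messages.

```xml
<!-- Sketch only: rev belongs on <dependency>; <exclude> accepts org and module, not rev. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default">
  <!-- Assumption: drop the HBase version that gora-hbase pulls in transitively. -->
  <exclude org="org.apache.hbase" module="hbase"/>
</dependency>
<!-- Assumption: declare the HBase version that matches the running cluster. -->
<dependency org="org.apache.hbase" name="hbase" rev="0.90.6" conf="*->default"/>
```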

Re: Nutch 1.6 with Java - not loading correct configuration file

2013-02-21 Thread Lewis John Mcgibbney
http://svn.apache.org/repos/asf/nutch/tags/release-1.6/src/java/org/apache/nutch/util/NutchConfiguration.java On Thu, Feb 21, 2013 at 12:03 PM, imehesz imeh...@gmail.com wrote: hello, I finally crossed all the terminal issues and I can run Nutch and Solr with no problems from the command

Re: gora zookeeper error

2013-02-21 Thread Lewis John Mcgibbney
) at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:118) this is output of a nutch inject command. BTW, what is snappy? On 02/21/2013 12:17 PM, Lewis John Mcgibbney wrote: Try this <dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default"

Re: nutch with cassandra internal network usage

2013-02-20 Thread Lewis John Mcgibbney
Hi Roland, You say you start a fetch run; does this mean the FetcherJob or GeneratorJob? What kind of settings do you run your Nutch server with? On Wednesday, February 20, 2013, Roland rol...@rvh-gmbh.de wrote: Hi list, we're experimenting with nutch 2.1 and cassandra 1.2.1 (on ? hosts).

Re: nutch with cassandra internal network usage

2013-02-20 Thread Lewis John Mcgibbney
: batchId: 1361367698-1708119958 FetcherJob: threads: 40 FetcherJob: parsing: true FetcherJob: resuming: false FetcherJob : timelimit set for : -1 --Roland On 20.02.2013 19:44, Lewis John Mcgibbney wrote: Hi Roland, You say you start a fetch run, does this mean the FetcherJob

Re: nutch with cassandra internal network usage

2013-02-20 Thread Lewis John Mcgibbney
Hi, Please head over to most recent thread on dev@ for potential improvements for the Generator* code. Thanks for invoking this discussion, it is well overdue. Lewis On Wed, Feb 20, 2013 at 12:55 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Alex, On Wed, Feb 20, 2013

Re: nutch with cassandra internal network usage

2013-02-20 Thread Lewis John Mcgibbney
- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 20, 2013 12:56 pm Subject: Re: nutch with cassandra internal network usage Hi Alex, On Wed, Feb 20, 2013 at 11:54 AM, alx...@aim.com wrote: The generator also does not have filters

Re: Nutch 2.1 / Hbase / Gora / Solr

2013-02-20 Thread Lewis John Mcgibbney
Hi Raja, There are certainly issues with the 2.x branch (of which 2.1 is the most recent release). Dependencies are managed via Ivy, so to build 2.1, just use the ant runtime target. You can see the Gora artifacts here http://search.maven.org/#search|ga|1|gora On Wed, Feb 20, 2013 at 9:14 PM,

Re: Slow parse on hadoop

2013-02-19 Thread Lewis John Mcgibbney
Hi, NUTCH-1420 is now committed, so you can update your local copy of Nutch 2.x if you are working from HEAD source. So there was another issue here where the parse was only running on one node in the cluster. Is this also the case with you? On Tue, Feb 19, 2013 at 2:48 PM, t_gra

Re: Slow parse on hadoop

2013-02-19 Thread Lewis John Mcgibbney
On Tue, Feb 19, 2013 at 3:40 PM, t_gra alexey.tiga...@gmail.com wrote: I tried skipping pages with large content size, and it figured that ALL my pages have content 125981292 bytes long (and probably the same contents). And this is okay? I don't really understand. BTW, what number of

Re: Nutch 2.1 over Hadoop 1.0.3 and HBase 0.94.2

2013-02-18 Thread Lewis John Mcgibbney
wrote: So what (stable) version of Nutch and which architecture would best fit my cluster? Is there a quick (simplified) deployment if I already have a running cluster and I don't want to change its existing data or configuration? Thanks. On Fri, Feb 15, 2013 at 12:42 AM, Lewis John

Re: slf4j issue with nutch 2.x over hadoop 1.1.1

2013-02-18 Thread Lewis John Mcgibbney
for solr and zookeeper it is not affecting the slf4j? thanks, On 02/16/2013 09:42 AM, Lewis John Mcgibbney wrote: A solution would be to manually prune the dependencies which are fetched via Ivy. If old slf4j dependencies are fetched for Hadoop via Ivy then maybe we need to make the exclusions

Re: fields in solrindex-mapping.xml

2013-02-17 Thread Lewis John Mcgibbney
? Thanks. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Sat, Feb 16, 2013 10:58 am Subject: Re: fields in solrindex-mapping.xml In short, it helps with searching when you can slice your data using

Re: slf4j issue with nutch 2.x over hadoop 1.1.1

2013-02-16 Thread Lewis John Mcgibbney
NUTCH-XX remove unused db.max.inlinks from nutch-default.xml trunk a7a1b41 NUTCH-1521 CrawlDbFilter pass null url to urlNormalizers kaveh@d1r2n2:/2locos/source/nutch/nutch.git$ i am using branch 2.x On 02/15/2013 06:02 PM, Lewis John Mcgibbney wrote: Hi Kaveh, Two seconds please. First

Re: fields in solrindex-mapping.xml

2013-02-16 Thread Lewis John Mcgibbney
to include digest, tstamp, boost and batchid fields in solrindex? Thanks. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Fri, Feb 15, 2013 4:21 pm Subject: Re: fields in solrindex-mapping.xml Hi Alex, OK so we

Re: Dump of WebDB in 2.x

2013-02-16 Thread Lewis John Mcgibbney
presentable without any issues but i am not sure if we have any special characters within our content. I can check and tell you more on monday when i go back to work. I use Nutch-2.x with Hbase. Kiran. On Sat, Feb 16, 2013 at 3:01 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote

Re: Slow parse on hadoop

2013-02-16 Thread Lewis John Mcgibbney
Can you dump your webdb and check what the various fields are like? Can you read these in an editor? I think there may be some problems with the serializers in gora-cassandra but I am not sure yet. Lewis On Saturday, February 16, 2013, t_gra alexey.tiga...@gmail.com wrote: Hi All, Experiencing

Re: Nutch 2.1 different batch id (null)

2013-02-15 Thread Lewis John Mcgibbney
And you want to get to the bottom of the batchId = null? You haven't actually asked a question here. On Thursday, February 14, 2013, Dragan Menoski dragan.meno...@x3mlabs.com wrote: Hi, I try to set Nutch 2.1 and Solr 4.0 with MySQL database, according to the instruction in this link:

Re: fields in solrindex-mapping.xml

2013-02-15 Thread Lewis John Mcgibbney
Hi Alex, So we can track this one. https://issues.apache.org/jira/browse/NUTCH-1532 Thanks Lewis On Fri, Feb 15, 2013 at 4:21 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Alex, OK so we can certainly remove segment from 2.x solr-index-mapping.xml. It would however be nice

Re: slf4j issue with nutch 2.x over hadoop 1.1.1

2013-02-15 Thread Lewis John Mcgibbney
Hi Kaveh, Two seconds please. First let's set something straight. Nutch trunk is from here [0]; Nutch 2.x is from here [1]. Which one do you use? On Fri, Feb 15, 2013 at 4:53 PM, kaveh minooie ka...@plutoz.com wrote: but here is my problem. I tried to build the nutch using ver 1.4.3 of the

Re: How to get page content of crawled pages

2013-02-15 Thread Lewis John Mcgibbney
was it was storing only the text of the page and not the full html content of the page. How do i store the full html content of the page also? Hope to see the patches soon. Thanks lewis john mcgibbney wrote Certainly. I am currently reviewing the code and will hopefully have patches

Re: Nutch 2.1 over Hadoop 1.0.3 and HBase 0.94.2

2013-02-14 Thread Lewis John Mcgibbney
Hi Amit, On Thu, Feb 14, 2013 at 6:24 AM, Amit Sela am...@infolinks.com wrote: I already have a running Hadoop cluster with Hadoop 1.0.3 and HBase 0.94.2, and I saw that Nutch 2.1 with Gora supports HBase as backend. First things first. We cannot guarantee that Gora and subsequently Nutch

Re: fields in solrindex-mapping.xml

2013-02-14 Thread Lewis John Mcgibbney
Hi Alex, Tstamp represents fetch time, used for deduplication. Boost is for scoring-opic and link. This is required in 2.x as well. I don't have the code right now, but you can try removing digest and segment. To me they both look legacy. There is a wiki page on index structure which you can

Fwd: [GSoC Mentors Announce] Google Summer of Code 2013

2013-02-11 Thread Lewis John Mcgibbney
Hi All, This year again I will be getting involved in GSoC program. If you are interested in participating please get in touch on the relevant dev@ list and we can initiate discussion. See you on dev@ Best Lewis -- Forwarded message -- From: Carol Smith Date: Monday, February 11,

Re: Content Truncation in Nutch 2.1/MySQL

2013-02-10 Thread Lewis John Mcgibbney
. On Thu, Feb 7, 2013 at 5:12 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: So the problem for you is resolved? The main (typical) problem here is in the underlying gora-sql library and some rather difficult to master gora-sql-mapping.xml constraints. Hope all is resolved

Re: How to get page content of crawled pages

2013-02-09 Thread Lewis John Mcgibbney
and not the full html content of the page. How do i store the full html content of the page also? Hope to see the patches soon. Thanks lewis john mcgibbney wrote Certainly. I am currently reviewing the code and will hopefully have patches for Nutch trunk cooked up for tomorrow. I'll update

Re: Could not find any valid local directory for output/file.out

2013-02-08 Thread Lewis John Mcgibbney
+1 This is a ridiculous size of tmp for a crawldb of minimal size. There is clearly something wrong On Friday, February 8, 2013, Tejas Patil tejas.patil...@gmail.com wrote: I dont think there is any such property. Maybe its time for you to cleanup /tmp :) Thanks, Tejas Patil On Fri, Feb

Re: Could not find any valid local directory for output/file.out

2013-02-08 Thread Lewis John Mcgibbney
Is truncating content not a possibility? By default, parsing is skipped for truncated docs IIRC. On Fri, Feb 8, 2013 at 4:18 PM, Eyeris Rodriguez Rueda eru...@uci.cu wrote: I have an idea of what was the problem, there is a url that contain a repository of pdf documents and nutch delay and
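
The truncation behaviour mentioned above is controlled from nutch-site.xml. The fragment below is a sketch, assuming the property names as they appear in the 1.x nutch-default.xml (http.content.limit and parser.skip.truncated); the values are illustrative only, so check the defaults shipped with your version.

```xml
<!-- nutch-site.xml sketch; the values are illustrative assumptions. -->
<property>
  <name>http.content.limit</name>
  <!-- Maximum number of bytes kept per fetched document; larger content is truncated. -->
  <value>65536</value>
</property>
<property>
  <name>parser.skip.truncated</name>
  <!-- When true, documents whose content was truncated are not parsed. -->
  <value>true</value>
</property>
```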

Re: Content Truncation in Nutch 2.1/MySQL

2013-02-07 Thread Lewis John Mcgibbney
It will produce more output on the fetcher part of your hadoop.log, not on the parsechecker tool itself; that is why you are seeing nothing more. Are you still having problems with the truncation aspect? Lewis On Thu, Feb 7, 2013 at 11:07 AM, Ward Loving w...@appirio.com wrote: Lewis:

Re: Could not find any valid local directory for output/file.out

2013-02-07 Thread Lewis John Mcgibbney
machine and 50 GB for solr machine. Please some advice or explanation will be accepted. Thanks for your time. - Original Message - From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user@nutch.apache.org Sent: Thursday, February 7, 2013 13:06:11 Subject: Re: Could

Re: Content Truncation in Nutch 2.1/MySQL

2013-02-07 Thread Lewis John Mcgibbney
. Not as simple to detect when you've loaded data previously. Thanks for your assistance. On Thu, Feb 7, 2013 at 3:01 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: It will produce more output on the fetcher part of your hadoop.log not on the parsechecker tool itself

Re: Nutch 2.1 + HBase cluster settings

2013-02-06 Thread Lewis John Mcgibbney
Please let us know how you get on as we can add this to the 2.x errors section of the wiki. Thanks and good luck with the problem. Lewis On Wed, Feb 6, 2013 at 4:45 PM, k4200 k4...@kazu.tv wrote: Hi Lewis, Thanks for your reply. 2013/2/7 Lewis John Mcgibbney lewis.mcgibb...@gmail.com: Hi

Re: Nutch 1.6 +solr 4.1.0

2013-02-06 Thread Lewis John Mcgibbney
Hi, We are not good to go with Solr 4.1 yet. There are changes required to schema.xml as well as the indexer package in nutch to accommodate api changes in 4.1. Please check our Jira for these issues. I am happy to help with the update however it will block some other proposed changes to the

Re: performance question: fetcher and parser in separate map/reduce jobs?

2013-02-06 Thread Lewis John Mcgibbney
I've eventually added this to our FAQ's http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F This should explain for you. Lewis On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang zhan...@gmail.com wrote: Hi I have a performance question: why fetcher and parser is staged in

Re: Content Truncation in Nutch 2.1/MySQL

2013-02-06 Thread Lewis John Mcgibbney
Can you use the parsechecker tool with fetcher.verbose overridden as true and the same settings on one of the (HTML?) documents giving you bother? The gora-sql-0.1.1-incubating module is becoming a real pain to be honest. On Wed, Feb 6, 2013 at 6:44 PM, Ward Loving w...@appirio.com wrote:

Re: Nutch 2.1 + HBase cluster settings

2013-02-06 Thread Lewis John Mcgibbney
with Nutch. I replaced hbase-0.90.4.jar with hbase-0.90.6-cdh3u5.jar and the problem resolved. Regards, Kaz 2013/2/7 Lewis John Mcgibbney lewis.mcgibb...@gmail.com: Please let us know how you get on as we can add this to the 2.x errors section of the wiki. Thanks and good luck with the problem

Re: Nutch 1.6 +solr 4.1.0

2013-02-06 Thread Lewis John Mcgibbney
Nice, thanks for letting us know. I take it you were using an amended schema? On Wed, Feb 6, 2013 at 7:46 PM, alx...@aim.com wrote: Hi, Not sure about solrdedup, but solrindex worked for me in nutch-1.4 with solr-4.1.0. Alex. -Original Message- From: Lewis John Mcgibbney

Re: performance question: fetcher and parser in separate map/reduce jobs?

2013-02-06 Thread Lewis John Mcgibbney
to the discussion. Thanks for the input Ken. Lewis On Wed, Feb 6, 2013 at 8:21 PM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi Lewis, On Feb 6, 2013, at 6:50pm, Lewis John Mcgibbney wrote: I've eventually added this to our FAQ's http://wiki.apache.org/nutch/FAQ

Re: Parsing error : java.lang.NoClassDefFoundError: org/cyberneko/html/LostText

2013-02-06 Thread Lewis John Mcgibbney
On Wed, Feb 6, 2013 at 9:35 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, Two observations here 1) Did you try any versions more recent than 1.9.12? I assume you are talking about the net.sourceforge.nekohtml groupId artifact [0] as opposed to the nekohtml groupId artifact [1]? 2

Re: Usage of db.max.inlinks property in nutch-site.xml in 2.x

2013-02-05 Thread Lewis John Mcgibbney
Done. Committed @ r1442838 in 2.x HEAD Thanks Lewis On Tue, Feb 5, 2013 at 12:05 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote: Absolutely. We should remove any unused property that is not in the planning for (re)implementing. On Tue, Feb 5, 2013 at 2:12 AM, Lewis John Mcgibbney

Re: Nutch 2.0 and HBase 0.90.4

2013-02-04 Thread Lewis John Mcgibbney
Hi Adriana, Thanks for the update, I've added the solution to our wiki for others to consult in the future http://s.apache.org/jcs Thank you for getting back to us on this one. Lewis On Mon, Feb 4, 2013 at 2:18 AM, Adriana Farina adriana.farin...@gmail.com wrote: I solved my issue and I want

Re: 2.x : Links with 404 status are not being updated from db_unfetched to db_gone

2013-02-04 Thread Lewis John Mcgibbney
Hi Kiran, You are using 2.x still? On Mon, Feb 4, 2013 at 8:57 AM, kiran chitturi chitturikira...@gmail.com wrote: The file clearly shows that urls with status 1 have the protocolStatus(NOT FOUND). Those seeds are never moved to status (db_gone) that is status 3 if i am correct. Did

Re: Usage of db.max.inlinks property in nutch-site.xml in 2.x

2013-02-04 Thread Lewis John Mcgibbney
This looks like a bit of deprecation in nutch-default.xml then. We can remove the unused property? On Mon, Feb 4, 2013 at 8:10 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote: Hi Lewis, The relevant property seems to be db.update.max.inlinks On Fri, Feb 1, 2013 at 4:27 AM, Lewis John

Re: How to get page content of crawled pages

2013-02-02 Thread Lewis John Mcgibbney
to inherit all public methods from NutchIndexWriter Can you help me with that? Then i can rebuild and check if it works. lewis john mcgibbney wrote As you will see the code has not been amended in a year or so. The positive side is that you only seem to be getting one issue with javac On Tue

Re: mime type text/plain

2013-01-31 Thread Lewis John Mcgibbney
Can you briefly describe the problem here Sourajit? On Thu, Jan 31, 2013 at 9:01 AM, Sourajit Basak sourajit.ba...@gmail.com wrote: Seems to be related to NUTCH-374 but that shows as fixed. I have set Nutch to accept unlimited content size this page is gzip encoded. On Thu, Jan 31, 2013

Re: Nutch 2.0 and HBase 0.90.4

2013-01-31 Thread Lewis John Mcgibbney
Hi Adriana, On Thu, Jan 31, 2013 at 3:03 AM, Adriana Farina adriana.farin...@gmail.com wrote: Searching on google, I've found that it can be an issue due to /etc/hosts, but it's correctly configured: 127.0.0.1 crawler1a localhost.localdomain localhost where crawler1a is the

Usage of db.max.inlinks property in nutch-site.xml in 2.x

2013-01-31 Thread Lewis John Mcgibbney
Hi All, Is it just me, or do we actually use the following property in 2.x anywhere? <property> <name>db.max.inlinks</name> <value>1</value> <description>Maximum number of Inlinks per URL to be kept in LinkDb. If invertlinks finds more inlinks than this number, only the first N inlinks will

Re: mime type text/plain

2013-01-31 Thread Lewis John Mcgibbney
And your regex rules? So is the URL fetched? On Thu, Jan 31, 2013 at 8:47 PM, Sourajit Basak sourajit.ba...@gmail.com wrote: Here it goes. Try to dump the content from this url with the following settings.

Re: Nutch 2.0 updatedb and gora query

2013-01-30 Thread Lewis John Mcgibbney
They should be under the 'il' field... can you confirm if your inlinks are under the 'ol' field please? On Wed, Jan 30, 2013 at 10:43 AM, alx...@aim.com wrote: I see that inlinks are saved as ol in hbase. Alex. -Original Message- From: kiran chitturi chitturikira...@gmail.com

Re: Nutch 2.0 updatedb and gora query

2013-01-30 Thread Lewis John Mcgibbney
Hi Kiran, On Wed, Jan 30, 2013 at 11:10 AM, kiran chitturi chitturikira...@gmail.com wrote: I have checked the database after the dbupdate job is ran and i could see only markers, signature and fetch fields. Which Gora artifacts are you using? We've recently fixed a bug in gora-cassandra [0]

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

2013-01-30 Thread Lewis John Mcgibbney
You are not getting very many URLs! On Tue, Jan 29, 2013 at 8:29 PM, peterbarretto peterbarrett...@gmail.com wrote: 2013-01-29 08:44:35,014 INFO crawl.CrawlDbReader - TOTAL urls: 96404 2013-01-29 08:44:35,018 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 85672

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

2013-01-27 Thread Lewis John Mcgibbney
Increase the number of threads when fetching. Also please see nutch-default.xml for partitioning of urls; if you know your target domains you may wish to adapt the policy. Lewis On Sunday, January 27, 2013, peterbarretto peterbarrett...@gmail.com wrote: I want to increase the number of urls fetched at
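
As a sketch of the two knobs mentioned above, the fragment below overrides the fetcher thread count and the URL partitioning mode in nutch-site.xml. It assumes the 1.x property names fetcher.threads.fetch and partition.url.mode; the values are illustrative and should be tuned against your own politeness constraints.

```xml
<!-- nutch-site.xml sketch; the values are illustrative assumptions. -->
<property>
  <name>fetcher.threads.fetch</name>
  <!-- Total number of fetcher threads per fetch task. -->
  <value>50</value>
</property>
<property>
  <name>partition.url.mode</name>
  <!-- How URLs are partitioned across fetch lists: byHost, byDomain or byIP. -->
  <value>byHost</value>
</property>
```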

Re: A question about injecting urls from a MySQL database rather than a text file

2013-01-22 Thread Lewis John Mcgibbney
This has certainly been explained in the past, however I can't find the archived thread. In short currently it is not possible. I think it would be a nice feature for the injector though. On Tuesday, January 22, 2013, 刘兆贵 liuzhao...@126.com wrote: Dear, I have a question, could you kind help

Re: Size limit for fetched pages

2013-01-18 Thread Lewis John Mcgibbney
Lewis John Mcgibbney lewis.mcgibb...@gmail.com: Hi Kaz, On Sat, Jan 12, 2013 at 1:09 AM, k4200 k4...@kazu.tv wrote: Here are the questions: 1. How to fix this? I'm guessing changing the block size in HBase would fix the problem, but I don't know how. gora.properties, perhaps

Re: Nutch - ElasticSearch example

2013-01-16 Thread Lewis John Mcgibbney
It should be added that currently this functionality is available only on the 2.x branch courtesy of Ferdy. Lewis On Wednesday, January 16, 2013, Stanislav Orlenko orlenko.s...@gmail.com wrote: Hi bin/nutch elasticindex $elasticClusterName -reindex it is enough for me use bin/nutch

Re: Nutch 2.x : readdb command dump

2013-01-16 Thread Lewis John Mcgibbney
Hi Kiran, For this I think you are looking at diving further into the Gora API and codebase. As you can see around line 232 [0], the Query is set and executed based on the key. What you wish to do would possibly encompass setting fields via the Gora Query API. There are some other useful methods

Re: nutch 2.x recrawl re-crawl

2013-01-14 Thread Lewis John Mcgibbney
Hi Bayu, Yes it will run fine on 1.6. Lewis On Sun, Jan 13, 2013 at 10:24 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: On Mon, Jan 14, 2013 at 6:45 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Markus implemented an extension of the AdaptiveFetchSchedule [0] which

Re: nutch 2.x recrawl re-crawl

2013-01-14 Thread Lewis John Mcgibbney
Hi, On Mon, Jan 14, 2013 at 3:56 AM, J. Gobel jj.go...@gmail.com wrote: Hi Lewis, Thanks for your mail. My ideal goal would be to crawl the index.php several times per day, and fetch the new urls from that page and parse them. Then I know 'for sure' that my index is up to date. Sounds

Re: Not all parsed docs is indexed inconsistent parsed docs.

2013-01-12 Thread Lewis John Mcgibbney
Hi, On Fri, Jan 11, 2013 at 3:12 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: We can see that some of parse processes were not completed successfully. Yes I see this. I also see that you have a http.proxy.port = 8080 but no proxy host and that the protocol-httpclient plugin is not
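
A proxy port on its own has no effect without a host. The fragment below is a sketch of the paired settings in nutch-site.xml, assuming the standard http.proxy.host and http.proxy.port properties; the host name is a hypothetical placeholder, and the port is the one quoted in the message.

```xml
<!-- nutch-site.xml sketch; proxy.example.com is a hypothetical placeholder. -->
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
```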

Re: Crawling NCP with Nutch

2013-01-11 Thread Lewis John Mcgibbney
Hi Till, Currently no. You would need to write your own implementation. You can look at the protocol-* plugins in the link below for some guidance http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/ Hth Lewis On Friday, January 11, 2013, Till Plumbaum till.plumb...@dai-labor.de wrote: Hi,

Re: Not all parsed docs is indexed inconsistent parsed docs.

2013-01-10 Thread Lewis John Mcgibbney
Hi, java.io.IOException: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver If you look at ivy.xml [0] you will see that the mysql-connector-java dependency is commented out. Please uncomment it, then build Nutch 2.x src again. This will download the dependency and make it available on
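
For orientation, the commented-out entry in the 2.x ivy.xml looks roughly like the sketch below. The rev shown is an assumption; keep whatever revision your checkout of ivy.xml already names, and simply remove the surrounding comment markers before rebuilding.

```xml
<!-- Sketch of the uncommented dependency; the rev is an assumption, use the one in your ivy.xml. -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
```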

Re: Not all parsed docs is indexed inconsistent parsed docs.

2013-01-10 Thread Lewis John Mcgibbney
11, 2013 at 7:01 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, java.io.IOException: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver If you look at ivy.xml [0] you will see that the mysql-connector-java dependency is commented out. Please uncomment

Re: Parsing error : java.lang.NoClassDefFoundError: org/cyberneko/html/LostText

2013-01-09 Thread Lewis John Mcgibbney
Can you include the contents of your parse-plugins.xml file please? The following two lines of logging look off to me On Tue, Jan 8, 2013 at 10:38 PM, Arcondo Dasilva arcondo.dasi...@gmail.com wrote: application/xhtml+xml via parse$ parse-plugins.xml, but no$ seems that tika cannot parse

Re: nutch 2.1 nutchserver documentation

2013-01-08 Thread Lewis John Mcgibbney
Hi Michael, There is very little on this however IIRC it can be done using REST calls. By default, if you initiate the Nutch server from the nutch script, it starts a local Jetty server running Nutch from which crawls can be executed via REST calls. By no means is this a feature of Nutch which

Re: nutch 2.1 and session cookies

2013-01-08 Thread Lewis John Mcgibbney
Hi Michael, So far there has been no discussion on this topic with specific focus on adding the functionality. I also notice that NUTCH-827 is not marked for inclusion in 2.2. I would urge you to open another issue describing your approach and suggested solution specifically for 2.x... if this is

Re: differences between nutch 1 and nutch 2

2013-01-08 Thread Lewis John Mcgibbney
Hi David, The best resources we have for this can be found on the wiki. These explain quite a bit about the respective Nutch tools (Injector, Generator, etc.) and how they are implemented in 2.x. http://wiki.apache.org/nutch/Nutch2Crawling On Tue, Jan 8, 2013 at 4:07 AM, Michael Gang

Re: Parsing error : java.lang.NoClassDefFoundError: org/cyberneko/html/LostText

2013-01-08 Thread Lewis John Mcgibbney
Hi Arcondo, On Mon, Jan 7, 2013 at 10:12 PM, Arcondo Dasilva arcondo.dasi...@gmail.com wrote: My question: why can't I use Tika to parse Html instead of Neko? Is it possible to get rid of Neko or is it mandatory? I would urge you to override the parsing logic in parse-plugins.xml [0]
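
The override suggested above could be sketched as follows. This assumes the usual parse-plugins.xml layout, in which each mimeType element lists the parser plugins to try in order, and it assumes parse-tika is also enabled through plugin.includes; only the two HTML-related entries are shown, and the rest of the file is left as shipped.

```xml
<!-- Fragment of conf/parse-plugins.xml; sketch only, other entries unchanged. -->
<mimeType name="text/html">
  <!-- Route HTML to the Tika-based parser instead of parse-html. -->
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
```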

Re: Nutch 2.1 crash with solr

2013-01-08 Thread Lewis John Mcgibbney
Is this from a crawl command or from the bin script... or something else? Your input arguments are not complete. The -batch X switch will not work for anything, as such a parameter simply does not exist. Are you aware of how you ended up with the batchId being null? What version of 2.x are you

Re: nutch javascript capabilities

2013-01-08 Thread Lewis John Mcgibbney
Hi Michael, On Tue, Jan 8, 2013 at 7:15 AM, Michael Gang michaelg...@gmail.com wrote: JavaScript (for extracting links only?) (parse-js) Yes, both in and outlinks if present. I don't understand what this exactly means. Let's say if i have a link a onclick=do_something or a jquery

Re: Not all parsed docs is indexed inconsistent parsed docs.

2013-01-07 Thread Lewis John Mcgibbney
Hi Bayu, On Sat, Jan 5, 2013 at 7:43 AM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Anyone can give me a hint? In parallel I changed to use nutch 1.6 binary and works well. But curious to use the latest of nutch 2.1. Please check out the latest 2.x branch here [0]. This uses Tika 1.2

Re: Parsing error : java.lang.NoClassDefFoundError: org/cyberneko/html/LostText

2013-01-07 Thread Lewis John Mcgibbney
Hi Arcondo, Is this still a problem? Lewis On Tue, Jan 1, 2013 at 12:50 PM, Arcondo Dasilva arcondo.dasi...@gmail.com wrote: Hello, I'm still getting the error even after ant clean and an entire rebuild. I cannot parse a site and getting this error java.util.concurrent.ExecutionException:

Re: Re: Nutch 2.1 crash with solr

2013-01-07 Thread Lewis John Mcgibbney
that. But I still don't understand the concept of 'batch id'. Besides, is it the right direction to capture 'batch' argument in command line? Thanks. At 2012-12-19 22:07:23,Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, Currently the batchID is originally set by the GeneratorJob

Re: apache-nutch-*.jar packed inside job file (v1.5.1)

2013-01-07 Thread Lewis John Mcgibbney
Hi Sourajit, You're suggesting that there is a clear case of compiled code duplication? If this is the case I have no idea and further if this actually is the case then we could address it... however I would be surprised if this were the case. Any ideas anyone? Lewis On Fri, Dec 28, 2012 at

Re: generate.max.count was not affected

2013-01-06 Thread Lewis John Mcgibbney
:01 AM, Tejas Patil tejas.patil...@gmail.com wrote: Hey Lewis, Yes. Thats a good idea. There are so many properties in nutch-default.xml and having the deprecated ones adds to the confusion. Thanks, Tejas Patil On Sat, Jan 5, 2013 at 11:12 PM, Lewis John Mcgibbney

Re: nutch 2.1 command line options

2013-01-06 Thread Lewis John Mcgibbney
Hi Jc, This is correct. The command line parameters differ in key tools, the generator being one. I think we would be best to document this on the wiki as well as attempting to implement useful command line options to stdout for all tools in 2.x, this would shadow the verbose and more helpful

Re: generate.max.count was not affected

2013-01-05 Thread Lewis John Mcgibbney
I think it would be good to phase out some of the deprecated configuration properties if possible. We have had several stable releases with these props included... Lewis On Jan 5, 2013 6:22 PM, Tejas Patil tejas.patil...@gmail.com wrote: The generate.max.per.host is deprecated but still is used
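
For readers migrating off the deprecated property, a sketch of the replacement pair in nutch-site.xml follows. It assumes the 1.x property names generate.max.count and generate.count.mode; the values are illustrative only.

```xml
<!-- nutch-site.xml sketch; the values are illustrative assumptions. -->
<property>
  <name>generate.max.count</name>
  <!-- Maximum number of URLs per host or domain in a single fetch list; -1 means no limit. -->
  <value>100</value>
</property>
<property>
  <name>generate.count.mode</name>
  <!-- Whether generate.max.count is counted per host or per domain. -->
  <value>host</value>
</property>
```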

Re: Nutch2.1 + Hsql2.2.9 java.sql.BatchUpdateException: data exception: string data, right truncation

2013-01-03 Thread Lewis John Mcgibbney
Hi Rui, The gora-sql backend is not stable so please do not be surprised if things do not work flawlessly. I would urge you to have a look at the gora-sql-mapping.xml file [0] and check the respective field values for the columns you are attempting to map. This aside, I would use the following

Re: Native Hadoop library not loaded and Cannot parse sites contents

2013-01-03 Thread Lewis John Mcgibbney
Hi Arcondo, As Tejas pointed out, the jar is not on the classpath. This should be automated by the Ant and Ivy configuration in Nutch however if it is not then simply manually enforce it. Lewis On Wed, Jan 2, 2013 at 9:43 PM, Arcondo arcondo.dasi...@gmail.com wrote: Hello, I made an ant

Re: Native Hadoop library not loaded and Cannot parse sites contents

2013-01-03 Thread Lewis John Mcgibbney
...@gmail.com wrote: Thanks for the explanation. I'm more a functional guy with no solid background in Java. Could you give some details on how to enforce it manually ? Thanks in advance, Arcondo On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: the jar

Re: Crawling localhost Webapps - regex- urfilter query

2012-12-19 Thread Lewis John Mcgibbney
This sounds most like non-existence of robots.txt on the webserver. Lewis On Wed, Dec 19, 2012 at 5:26 AM, Rajani Maski rajinima...@gmail.com wrote: Hi Tejas, I found out the reason for why the blog was not getting crawled : http://rajinimaski.blogspot.in/ This is because of the proxy

Re: Nutch 2.1 crash with solr

2012-12-19 Thread Lewis John Mcgibbney
Hi, Currently the batchID is originally set by the GeneratorJob#run() method @line 169 [0], you will see that this can also be overridden by the generate.batch.id property in nutch-site.xml Currently if you look at line 117 in the crawl script [1] you will see that there is a TODO to capture the
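
The override mentioned above might look like the sketch below in nutch-site.xml. The property name comes from the message; the value is only an example, borrowed from the batchId format logged elsewhere in this archive.

```xml
<!-- nutch-site.xml sketch; the value is an illustrative assumption. -->
<property>
  <name>generate.batch.id</name>
  <value>1361367698-1708119958</value>
</property>
```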

Re: Parsing of document types

2012-12-12 Thread Lewis John Mcgibbney
Hi James, One of the plugins in Nutch uses Tika 1.2 as a parser wrapper. The list of Tika formats can be found below http://tika.apache.org/1.2/formats.html hth Lewis On Wed, Dec 12, 2012 at 4:02 PM, James Ford simon.fo...@gmail.com wrote: Hello, Which document types can nutch parse? I know

Re: Best way to extract content from a web page

2012-12-12 Thread Lewis John Mcgibbney
Hi, You can take a look at around line 102 in the ParserChecker tool [0] for details on how to find desired fields and display them. hth Lewis [0] https://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java?view=markup On Wed, Dec 12, 2012 at 2:12 AM, alw37

Re: MoreIndexingFilter last-modified time from protocol-file docx

2012-12-12 Thread Lewis John Mcgibbney
Hi, I think this is a bug and should be logged however as it is a rather specific use case (with an older version of Nutch), I wonder if you can confirm this with trunk? It would be great to log it against 1.7 (and/or 2.2) so we can work towards a solution. Best Lewis On Tue, Dec 11, 2012 at

Re: Web pages parsed status

2012-12-11 Thread Lewis John Mcgibbney
Hi Renato, OK here we go :0) On Mon, Dec 10, 2012 at 3:44 PM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: I did notice that these pages weren't fetched but the thing is that I do want them to be fetched without actually having to fetch, and parse them individually with the

Re: [ANNOUNCE] Apache Nutch 1.6 Released

2012-12-09 Thread Lewis John Mcgibbney
Hi Eyeris, Yeah I'll fix this, thank you for pointing this out. For reference the link is below. Thank you Lewis http://www.apache.org/dist/nutch/1.6/CHANGES_1.6.txt On Sun, Dec 9, 2012 at 4:19 AM, Eyeris Rodriguez Rueda eru...@uci.cu wrote: Thanks for this news Lewis, I was checking but

Re: Web pages parsed status

2012-12-08 Thread Lewis John Mcgibbney
Hi Renato, Firstly are you on 2.x? If so what gora- storage backend are you on? If not what version of 1.x are you using? After fetching have you parsed the pages? How are you executing your crawl cycle? The one step command/script or individually via a custom script? We advise against using

Re: [VOTE] Apache Nutch 1.6 Release Candidate

2012-11-28 Thread Lewis John Mcgibbney
Hi Julien, Thanks for initial review On Wed, Nov 28, 2012 at 10:11 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: - CHANGES.txt contains dates in both MM/DD/ and DD/MM/ formats. Shall we write the month in text form e.g. 7th July 2012 from now on? Yes I am +1 for your

Re: doubts about some propierties on nutch-site.xml file

2012-11-23 Thread Lewis John Mcgibbney
Lovely Javadoc Andrzej On Fri, Nov 23, 2012 at 7:32 AM, Markus Jelsma markus.jel...@openindex.io wrote: See: http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html -Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Fri 23-Nov-2012

Re: java.io.IOException: Job failed!

2012-11-23 Thread Lewis John Mcgibbney
Hi attabi225, This is really a question for the user@ list so I have copied everyone in. Firstly, I would please ask you to use 1.5.1, we found a major (but tiny) bug in 1.5 which renders it as a release we will forget about for the time being ;0) On Fri, Nov 23, 2012 at 1:48 PM,

[VOTE] Apache Nutch 1.6 Release Candidate

2012-11-23 Thread lewis john mcgibbney
Hi Everyone, A candidate for the Apache Nutch 1.6 RC#1 is available at: http://people.apache.org/~lewismc/apache_nutch_1.6/ The release candidate is a src.zip, src.tar.gz, bin.zip and bin.tar.gz archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-1.6 Further, a

Re: Jenkins build is back to normal : Nutch-trunk #2019

2012-11-21 Thread Lewis John Mcgibbney
Hi, Moving this thread to user@; it has nothing to do with Nutch development. On Wed, Nov 21, 2012 at 5:08 AM, Bhagya n.bhagyalaks...@gmail.com wrote: Hi, Thanks for your reply. I installed cygwin on my windows. I am getting the below error while running ant command. Please find the

Re: Run nutch 2.1 in distributed mode

2012-11-21 Thread Lewis John Mcgibbney
Hi Donald, I would advise you to re-generate the nutch job archive, as it appears that your settings are not included within the job file you are trying to deploy on your hadoop setup/cluster. You can do this by running ant job (after making changes to the files in conf) from $NUTCH_HOME hth

Re: Get full content in a plugin extending HTMLParseFilter

2012-11-21 Thread Lewis John Mcgibbney
Hi Jorge, On Wed, Nov 21, 2012 at 3:21 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: I'm just building the plugin in this machine and testing on Ubuntu GNU/Linux 12.04 works just fine. Excellent Would this be worth starting an issue? It seems to be just in my particular

Re: IOException Hadoop

2012-11-21 Thread Lewis John Mcgibbney
Hi, A more general question... is anyone using Nutch with Windows 7 successfully? It might be nice to get a trunk and 2.x build on the Windows Jenkins slaves just so we have an idea of this. I've not been near Windows in years sorry. Lewis On Wed, Nov 21, 2012 at 9:12 PM, Prashant Ladha

Re: timestamp in nutch/solr

2012-11-21 Thread Lewis John Mcgibbney
Hi Joe, On Wed, Nov 21, 2012 at 9:25 PM, Joe Zhang smartag...@gmail.com wrote: Are you saying that as long as I crawl some page once, nutch will go and refetch the page in 30 days by default, without me running the command again? No, this is impossible (unless you have an automated job

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-14 Thread Lewis John Mcgibbney
Hi Erol, What exactly did you do to get it working correctly? I am keen to find out as I will not be able to retry with 2.x deployment until later in the week. Thanks Lewis On Wed, Nov 14, 2012 at 3:08 PM, Erol Akarsu eaka...@gmail.com wrote: Lewis, I finally ran Nutch 2.1 and SOLR 4.0

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-14 Thread Lewis John Mcgibbney
it off These 2 changes cleared the issue. Erol Akarsu On Wed, Nov 14, 2012 at 10:49 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Erol, What exactly did you do to get it working correctly? I am keen to find out as I will not be able to retry with 2.x deployment
