Re: What's the current status of upgrading nutch 1.* trunk to solr 4?

2013-05-02 Thread Lewis John Mcgibbney
-httpclient plugin. And I don't see any handling of this in the patch. lewis john mcgibbney wrote Hi, Have you looked at the patch for NUTCH-1486? this is not just schema changes. The patch is for 2.x but the process of porting it to new pluggable indexing architecture for trunk is trivial

Re: What's the current status of upgrading nutch 1.* trunk to solr 4?

2013-05-02 Thread Lewis John Mcgibbney
I've updated 1487 with the patch. Please test and get back to us. It would be great to upgrade to Solr 4.x prior to pushing Nutch 1.7 Thank you so much Lewis On Thu, May 2, 2013 at 9:09 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi adfel70, It is only a patch

Re: Solrindex -all not working correctly

2013-05-01 Thread Lewis John Mcgibbney
What version are you using? If you can I would advise you to upgrade to 2.x HEAD. On Wed, May 1, 2013 at 4:32 AM, Bai Shen baishen.li...@gmail.com wrote: My crawl loop consists of the following. generate -topN fetch -all parse -all updatedb solrindex -all With the fetch and parse the

Re: HBase 0.94.6 and Nutch 2.1

2013-05-01 Thread Lewis John Mcgibbney
In short, Gora needs to upgrade the use of HBase API to more recent version. If you are able and willing to do so, we would be very very happy to have you contribute to Gora. https://issues.apache.org/jira/browse/GORA-201 On Wed, May 1, 2013 at 11:41 AM, AC Nutch acnu...@gmail.com wrote: Hello

Re: Example crawl script Nutch 2.1

2013-04-30 Thread Lewis John Mcgibbney
Hi James, Please look for NUTCH-1545 capture batchid... If you could review and use this patch it would be very very helpful. thank you lewis On Tuesday, April 30, 2013, James Ford simon.fo...@gmail.com wrote: Thanks for your answer! I think I will create my own modified crawlscript then. But

Re: Remove fetched files from HBase after parse

2013-04-30 Thread Lewis John Mcgibbney
I would most likely agree with Tejas. Either that or you could use the delete and deleteByQuery operations for http://gora.apache.org/docs/current/apidocs-0.2.1/index.html?org/apache/gora/hbase/store/HBaseStore.html It depends on how you intend to use the software. hth On Tue, Apr 30, 2013 at

Re: Nutch 2.1 different batch id (null)

2013-04-30 Thread Lewis John Mcgibbney
Hi, There is a pretty difficult aspect to this problem which makes it difficult for others/me to address. There are a number of variables which may (depending on your task execution between crawls) change the possibility/probability of some MARK not being present. The core problem here within the

Re: Nutch 2.1 different batch id (null)

2013-04-30 Thread Lewis John Mcgibbney
I've opened NUTCH-1567 to track and address this. https://issues.apache.org/jira/browse/NUTCH-1567 On Tue, Apr 30, 2013 at 9:39 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, There is a pretty difficult aspect to this problem which makes it difficult for others/me to address

Re: Nutch 2 hanging after aborting hung threads

2013-04-30 Thread Lewis John Mcgibbney
That would be very much appreciated. Lewis On Tue, Apr 30, 2013 at 5:00 AM, Bai Shen baishen.li...@gmail.com wrote: I'll let you know if I figure out any good defaults. Thanks. On Sat, Apr 27, 2013 at 5:30 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Bai, On Thu

Re: Proper way to stop a crawl safely - Nutch 1.6 from Hadoop 1.1.1

2013-04-30 Thread Lewis John Mcgibbney
Hi, For reference, ideally you should fetch many smaller segments. This prevents many baddies. This sounds brutal, but I would just kill it. You lose one segment... hopefully. Lewis On Tue, Apr 30, 2013 at 4:20 PM, AC Nutch acnu...@gmail.com wrote: Hello All, I've been looking around for a

Re: Nutch 1.6 Processing of fetcher.max.crawl.delay

2013-04-27 Thread Lewis John Mcgibbney
Hi, @Tejas, you will remember the work undertaken on NUTCH-1284 (the patch for which you submitted included the fix for NUTCH-1042) relates to this. I am not sure if the situations are identical, but they are closely linked by the looks of it. @ianin, can you look at the commentary and provide

Re: Nutch 1.6 Processing of fetcher.max.crawl.delay

2013-04-27 Thread Lewis John Mcgibbney
. -Original Message- From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Saturday, April 27, 2013 3:30 PM To: user@nutch.apache.org Subject: Re: Nutch 1.6 Processing of fetcher.max.crawl.delay Hi, @Tejas, you will remember the work undertaken on NUTCH-1284 (the patch

Re: Nutch 2 hanging after aborting hung threads

2013-04-27 Thread Lewis John Mcgibbney
Hi Bai, On Thu, Apr 25, 2013 at 4:33 AM, Bai Shen baishen.li...@gmail.com wrote: Well, I still ended up having to set a content limit. Which is why I'm wondering how the Nutch Gora integration works. I didn't see a lot of documentation on it. So far Nutch seems to be running okay with

Re: [nutch 2.1 with mysql] different batch id (null)

2013-04-26 Thread Lewis John Mcgibbney
- inject - fetch The second inject will leave entries in the db without fetchmarks seen by the fetcher later. --Roland On Fri, Apr 26, 2013 at 12:30 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Additionally, why do we log.DEBUG that there is a different batch id ( + mark

Re: solrdedup NullPointerException

2013-04-26 Thread Lewis John Mcgibbney
I just found out this was logged by Markus many moons ago https://issues.apache.org/jira/browse/NUTCH-992 It would be nice if you could update this Jira issue with any progress you are able to make on it. I am not able to help right now sorry. Lewis On Fri, Apr 26, 2013 at 2:14 PM, brian4

Re: [nutch 2.1 with mysql] different batch id (null)

2013-04-26 Thread Lewis John Mcgibbney
Hi All, I went ahead and added some documentation to the wiki on this topic http://s.apache.org/Jb6 Please add to it where you see fit. I still think that the logging is incorrect on this one. On Fri, Apr 26, 2013 at 12:47 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote

Re: [nutch 2.1 with mysql] different batch id (null)

2013-04-25 Thread Lewis John Mcgibbney
, Apr 25, 2013 at 7:31 AM, Carmine Paternoster carmine...@gmail.com wrote: Hi Lewis, thank you very much for your answer. I do not know how, but I solved it. The different batch id (null) message no longer appears. In any case, I'm using Nutch 2.1 Good day, Carmine 2013/4/24 Lewis John Mcgibbney

Re: Unable to crawl a series of pages in tutorial

2013-04-24 Thread Lewis John Mcgibbney
Yes On Wed, Apr 24, 2013 at 11:41 AM, Yves S. Garret yoursurrogate...@gmail.com wrote: The dmoz directory, it should be located here, yes? ${APACHE_NUTCH_HOME}/runtime/local On Tue, Apr 23, 2013 at 10:09 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: The DmozParser should

Re: Unable to crawl a series of pages in tutorial

2013-04-24 Thread Lewis John Mcgibbney
, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: The DmozParser should have created a flat file similar to a bootstrap file which you can inject. The flat file should be inside the dmoz directory (if you've followed the tutorial). Please make sure the file is present

Re: GENERAL PROBLEMS LEARNING TO USE NUTCH

2013-04-24 Thread Lewis John Mcgibbney
Hi, CC: user@nutch.apache.org Questions like this should really go to the user@ list; you have a much better chance of being helped there, as there are many, many eyes. On Wed, Apr 24, 2013 at 8:57 AM, d...@e-sentry.net wrote: I would be really grateful if you could provide some links on the

Re: [nutch 2.1 with mysql] different batch id (null)

2013-04-24 Thread Lewis John Mcgibbney
Hi Carmine, CC: user@nutch.apache.org On Wed, Apr 24, 2013 at 3:13 AM, Carmine Paternoster carmine...@gmail.com wrote: I configured Nutch and mySql following this guide (http://nlp.solutions.asia/?p=180). Everything worked fine, but at some point in the database I find all elements with

Re: Unable to crawl a series of pages in tutorial

2013-04-24 Thread Lewis John Mcgibbney
Please reread my previous comments On Wed, Apr 24, 2013 at 3:14 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: If you are using Nutch 2.x then drop the arguments for crawl/crawldb as Nutch does not maintain a local crawldb in 2.x. We delegate Gora to deal

Re: Unable to crawl a series of pages in tutorial

2013-04-24 Thread Lewis John Mcgibbney
Hi Yves, On Wed, Apr 24, 2013 at 3:07 PM, Yves S. Garret yoursurrogate...@gmail.com wrote: The issue that I'm having a hard time with at the moment is that I don't understand how Gora would replace crawldb here (as in what the commands would be to do this). I'm going to keep looking for how

Re: Nutch 2 hanging after aborting hung threads

2013-04-23 Thread Lewis John Mcgibbney
can you please give examples of the files which were truncated? thank you Lewis On Tuesday, April 23, 2013, Bai Shen baishen.li...@gmail.com wrote: I just set http.content.limit back to the default and my fetch completed successfully on the server. However, it truncated several of my files.

Re: Any way to run tasks after Nutch is done executing?

2013-04-23 Thread Lewis John Mcgibbney
://wiki.apache.org/nutch/NutchTutorial#A3.1_Using_the_Crawl_Command On Tue, Apr 23, 2013 at 3:30 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Just write a crawl script? Effectively that's all the crawl script is, just chaining together logical tasks. The one provided with Nutch
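The point made in this reply, that a crawl script is just logical tasks chained together, can be sketched as follows. This is a minimal illustration in Python, not the script shipped with Nutch. The names `run_steps`, `shell_runner`, `CRAWL_STEPS` and `POST_STEPS` are hypothetical, and the command arguments are placeholders that depend on your Nutch version and setup.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical step lists: the sub-command names mirror a typical crawl cycle,
# but the exact arguments are placeholders and depend on your Nutch version.
CRAWL_STEPS = [
    ["bin/nutch", "generate", "-topN", "1000"],
    ["bin/nutch", "fetch", "-all"],
    ["bin/nutch", "parse", "-all"],
    ["bin/nutch", "updatedb"],
]

# Whatever should happen once the crawl has finished (placeholder command).
POST_STEPS = [["echo", "crawl finished"]]


def shell_runner(cmd: Sequence[str]) -> int:
    """Run one command as a child process and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_steps(
    steps: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = shell_runner,
) -> List[str]:
    """Run the steps in order and stop at the first non-zero exit code."""
    completed = []
    for step in steps:
        code = runner(step)
        if code != 0:
            raise RuntimeError(
                "step failed with exit code %d: %s" % (code, " ".join(step))
            )
        completed.append(" ".join(step))
    return completed
```

Calling `run_steps(CRAWL_STEPS + POST_STEPS)` from the Nutch runtime directory would then run the follow-up task only if every crawl step succeeded; the runner is injectable so the chaining logic can be exercised without Nutch installed.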

Re: Crawling and Hadoop problem

2013-04-23 Thread Lewis John Mcgibbney
://maximilianomarin.com Celular: (+56 9) 780 688 91 2013/4/22 Lewis John Mcgibbney lewis.mcgibb...@gmail.com Can you look into the job archive and see what is wrong? Maybe you need to rebuild the job archive ant job from ${NUTCH.HOME} On Mon, Apr 22, 2013 at 1:09 PM, Maximiliano Marin conta

Re: Unable to crawl a series of pages in tutorial

2013-04-23 Thread Lewis John Mcgibbney
The DmozParser should have created a flat file similar to a bootstrap file which you can inject. The flat file should be inside the dmoz directory (if you've followed the tutorial). Please make sure the file is present, and that the CLI syntax is correct. If you are using Nutch 2.x then drop the

Re: Any way to run tasks after Nutch is done executing?

2013-04-23 Thread Lewis John Mcgibbney
, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Yves, We advise to use this script and modify it for your own needs http://svn.apache.org/repos/asf/nutch/trunk/src/bin/crawl hth Lewis On Tue, Apr 23, 2013 at 12:52 PM, Yves S. Garret yoursurrogate...@gmail.com wrote

Re: Error Nutch2 and HBase

2013-04-23 Thread Lewis John Mcgibbney
Hi Maximiliano, This version of HBase is most likely not compatible with Gora. HBase version is: 0.94.2-cdh4.2.0 On Tue, Apr 23, 2013 at 8:08 PM, Maximiliano Marin conta...@maximilianomarin.com wrote: Hello: First I want to give thanks for all the replies in my last thread. Now I am trying to

Re: Error Nutch2 and HBase

2013-04-23 Thread Lewis John Mcgibbney
, Virtualization MCTS: SQL Server 2008, Implementation and Maintenance Web: http://maximilianomarin.com Celular: (+56 9) 780 688 91 2013/4/24 Lewis John Mcgibbney lewis.mcgibb...@gmail.com Hi Maximiliano, This version of HBase is most likely not compatible with Gora. HBase version is: 0.94.2

Re: [Exception in thread main java.io.IOException: Job failed!]

2013-04-22 Thread Lewis John Mcgibbney
logging explicitly states that no solrUrl is set. On Sunday, April 21, 2013, kiran chitturi chitturikira...@gmail.com wrote: Hi Mick, Since this is an error with Indexing, Can you check the logs from Solr side ? On Sun, Apr 21, 2013 at 4:15 AM, micklai lailixi...@gmail.com wrote: HI

Re: Crawling and Hadoop problem

2013-04-22 Thread Lewis John Mcgibbney
run your job jar from within the runtime/deploy directory. On Monday, April 22, 2013, Maximiliano Marin conta...@maximilianomarin.com wrote: Hello guys: I am trying to run nutch over Hadoop. Everything was ok. I modified files by the tutorial that I have already read and in the moment of make

Re: Crawling and Hadoop problem

2013-04-22 Thread Lewis John Mcgibbney
9) 780 688 91 2013/4/22 Lewis John Mcgibbney lewis.mcgibb...@gmail.com run your job jar from within the runtime/deploy directory. On Monday, April 22, 2013, Maximiliano Marin conta...@maximilianomarin.com wrote: Hello guys: I am trying to run nutch over Hadoop. Everything

Re: need legends for fetch reduce jobtracker ouput

2013-04-22 Thread Lewis John Mcgibbney
hi Tejas, this is a real excellent reply and very useful. it would be really great if we could somehow have this kind of low level information readily available on the Nutch wiki. On Monday, April 22, 2013, Tejas Patil tejas.patil...@gmail.com wrote: Fetcher threads try to get a fetch item (url)

Re: need legends for fetch reduce jobtracker ouput

2013-04-22 Thread Lewis John Mcgibbney
22, 2013 at 8:09 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: hi Tejas, this is a real excellent reply and very useful. it would be really great if we could somehow have this kind of low level information readily available on the Nutch wiki. On Monday, April 22, 2013, Tejas

Re: Whether Nutch AdaptiveFetchSchedule can do recrawling automatically?

2013-04-18 Thread Lewis John Mcgibbney
Hi Senthilkumar, In short, search recrawl from the Nutch wiki to find an external blog post on recrawling with Nutch. If you have anything to add to the post contact the author. If on the other hand you need clarification on anything then ping us here Hth Lewis On Thursday, April 18, 2013,

Re: Whether Nutch AdaptiveFetchSchedule can do recrawling automatically?

2013-04-18 Thread Lewis John Mcgibbney
Hi Raja, The FetchSchedule [0] defines the contract for implementations that manipulate fetch times and re-fetch intervals. FetchScheduleFactory [1] caches the instance in the ObjectCache. The Interface and classes (respectively) do not automate or semi-automate actual scheduling e.g. execute the
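To make that division of labour concrete: a fetch schedule only computes when a page is next due, and something external (cron, a crawl script) still has to run the crawl. The sketch below is a simplified Python illustration of an adaptive re-fetch interval, not Nutch's AdaptiveFetchSchedule code. The rates, the bounds and every name in it are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptivePolicy:
    # All values are illustrative assumptions, not settings read from Nutch.
    inc_rate: float = 0.4        # grow the interval when a page is unchanged
    dec_rate: float = 0.2        # shrink the interval when a page has changed
    min_interval: float = 60.0   # seconds
    max_interval: float = 365 * 24 * 3600.0  # seconds


def next_interval(current: float, modified: bool,
                  policy: AdaptivePolicy = AdaptivePolicy()) -> float:
    """Return the new re-fetch interval in seconds, clamped to the bounds."""
    if modified:
        candidate = current * (1.0 - policy.dec_rate)
    else:
        candidate = current * (1.0 + policy.inc_rate)
    return max(policy.min_interval, min(policy.max_interval, candidate))


def next_fetch_time(fetched_at: float, current: float, modified: bool,
                    policy: AdaptivePolicy = AdaptivePolicy()) -> float:
    """Return the timestamp at which the page becomes due again."""
    return fetched_at + next_interval(current, modified, policy)
```

Nothing here triggers a fetch: the functions only answer "when is this page due?", which is why a recurring job is still needed to run the generate and fetch cycle.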

Question about Nutch and Hadoop

2013-04-16 Thread Lewis John Mcgibbney
Hi Alexander, Please feel free to sign up to our wiki (please provide one of the dev team with your uid) and link to your documentation. Best Lewis On Monday, April 15, 2013, Alexander Chepurnoy kusht...@yahoo.com wrote: You can find those files under Hadoop folder. Working with Hadoop+Nutch is

Re: Trying to output to db in MS-SQL on Azure

2013-04-16 Thread Lewis John Mcgibbney
Hi Yves, This has nothing to do with Nutch. It strictly has to do with Gora. That was my justification for moving a similar thread (it may actually even have been this one) over to user@gora. As Renato explained, by the looks of it Microsoft Azure platform has a client library which enables you to

Re: Trying to output to db in MS-SQL on Azure

2013-04-16 Thread Lewis John Mcgibbney
Hi Yves, On Tue, Apr 16, 2013 at 1:43 PM, Yves S. Garret yoursurrogate...@gmail.com wrote: Thanks for your reply. Forgive me for being so clueless, but there's much that I still don't know about Apache Nutch (and Hadoop for that matter, but I'm learning). Not at all, I am learning as well.

Re: Indexing to Solr4.2 with nutch 1.6

2013-04-10 Thread Lewis John Mcgibbney
, SolrConstants.TIMESTAMP_FIELD, SolrConstants.DIGEST_FIELD); On Tue, Apr 9, 2013 at 9:15 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Before we do the upgrade we need to consolidate all of these use cases. What criteria do we want to review and accept

Setting up nutch 1.6 with Solr 4.2

2013-04-09 Thread Lewis John Mcgibbney
help. Thanks. On Mon, Apr 8, 2013 at 10:31 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Amit, I recently updated NUTCH-1486 [0] with a patch to work against Solr 4.2.1. You will be able to pull stuff from this patch and push it into your Solr 4 schema file, etc. I

Re: Indexing to Solr4.2 with nutch 1.6

2013-04-09 Thread Lewis John Mcgibbney
(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.Child.main(Child.java:249) Thanks. On Mon, Apr 8, 2013 at 10:33 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: It would probably be best to describe what you've tried here, possibly a paste of your schema, what you've done

Re: Setting up nutch 1.6 with Solr 4.2

2013-04-08 Thread Lewis John Mcgibbney
Hi Amit, I recently updated NUTCH-1486 [0] with a patch to work against Solr 4.2.1. You will be able to pull stuff from this patch and push it into your Solr 4 schema file, etc. I will begin work on upgrading trunk to work with Solr 4 shortly... maybe this afternoon. If you are able to help with

Re: Indexing to Solr4.2 with nutch 1.6

2013-04-08 Thread Lewis John Mcgibbney
It would probably be best to describe what you've tried here, possibly a paste of your schema, what you've done (if anything) to the Nutch source to get it working with Solr 4, etc. The stack trace you get would also be beneficial. Thank you Lewis On Mon, Apr 8, 2013 at 4:13 AM, Amit Sela

Re: How to get page content of crawled pages

2013-04-02 Thread Lewis John Mcgibbney
Hi Peter, The patch attached to the issue is for trunk. If you were able to make a patch for 2.x and upload it to the issue that would be great. There are API differences so I can tell you that even though the mongodb indexer classes have been applied, it will most likely be a fruitless effort.

Re: Re: What urls does Nutch crawl?

2013-04-02 Thread Lewis John Mcgibbney
in an additional Parse plugin in order to prevent nutch from crawling the outlinks in the article page? At 2013-01-15 13:31:11, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: I take it you are updating the database with the crawl data? This will mark all links extracted during

Re: error using generate in 2.x

2013-03-30 Thread Lewis John Mcgibbney
(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) If I revert to previous release it works fine. Thanks. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Fri, Mar 29, 2013 4:30 pm

Re: error using generate in 2.x

2013-03-29 Thread Lewis John Mcgibbney
Hi Kaveh, Firstly, as logged below, Gora attempts to associate your HBase table configuration with specified tables (from within gora-hbase-mapping.xml) however it seems that your case satisfies the condition if (!tableName.equals(tableNameFromMapping)) meaning that the table name is not equal to
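The condition quoted in this message can be illustrated with a small sketch. It is a Python approximation written for this listing, not Gora's code. The rule that a crawl id is prepended to the default table name `webpage` is an assumption, and `resolve_table_name` and `check_mapping` are hypothetical helpers.

```python
from typing import Optional

DEFAULT_TABLE = "webpage"  # default table name mentioned later in the thread


def resolve_table_name(crawl_id: Optional[str],
                       base: str = DEFAULT_TABLE) -> str:
    """Assumed naming rule: prefix the table with the crawl id when one is given."""
    return "%s_%s" % (crawl_id, base) if crawl_id else base


def check_mapping(table_name: str,
                  table_name_from_mapping: str) -> Optional[str]:
    """Mirror the quoted check: return a message only when the names differ."""
    if table_name != table_name_from_mapping:
        return "table name %r differs from mapping file table name %r" % (
            table_name, table_name_from_mapping)
    return None
```

Under that assumption, passing a crawl id yields a table name that no longer matches the one in the mapping file, which is the situation the quoted condition detects; leaving the crawl id out makes the two names agree.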

Re: error using generate in 2.x

2013-03-29 Thread Lewis John Mcgibbney
when I omit the -crawlId parameter (forcing it to use the default name webpage), and more importantly it is new. I haven't had this problem before; it just started happening 2 days ago when I pulled the latest commits to the 2.x branch. On 03/29/2013 09:50 AM, Lewis John Mcgibbney wrote: Hi

Re: error using generate in 2.x

2013-03-29 Thread Lewis John Mcgibbney
(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) If I revert to previous release it works fine. Thanks. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com

Re: Root slash being stripped from file path

2013-03-28 Thread Lewis John Mcgibbney
the patch a try and see if that fixes my issue. On Wed, Mar 27, 2013 at 4:29 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Nutch version please? Sebastian and others worked on this a while ago. I don't know about the progress on it. There is most certainly open/resolved tickets

Re: Root slash being stripped from file path

2013-03-27 Thread Lewis John Mcgibbney
Nutch version please? Sebastian and others worked on this a while ago. I don't know about the progress on it. There is most certainly open/resolved tickets for it on Jira please look there. Thank you Lewis On Wed, Mar 27, 2013 at 12:26 PM, Bai Shen baishen.li...@gmail.com wrote: I'm trying to

Re: parsechecker and redirection

2013-03-25 Thread Lewis John Mcgibbney
Hi Canan, Thank you for bringing this up, I just noticed that 2.x does not have the configurable property in nutch-default.xml: <property> <name>http.redirect.max</name> <value>0</value> <description>The maximum number of redirects the fetcher will follow when trying to fetch a page. If set to

Re: parsechecker and redirection

2013-03-25 Thread Lewis John Mcgibbney
://www.apachecon.eu/ ... There is already NUTCH-1419: report redirect and do not parse. @Lewis: I'll review the latest patch soon, so we can sort this out. @Canan: feel free to open a new Jira to make parsechecker follow redirects. Thanks! Sebastian On 03/25/2013 10:27 PM, Lewis John Mcgibbney wrote

Re: parsechecker and redirection

2013-03-25 Thread Lewis John Mcgibbney
, NUTCH-1389, NUTCH-1419, and NUTCH-1501! On 03/25/2013 11:22 PM, Lewis John Mcgibbney wrote: Thanks for clarification on this one Seb. I was aware that you were clued up on this and hoped you would drop in. On Monday, March 25, 2013, Sebastian Nagel wastl.na...@googlemail.com wrote

Re: parsechecker and redirection

2013-03-25 Thread Lewis John Mcgibbney
Hi Alex, We need to fix this. Can you please open an issue in the Jira and we can address? Thank you very much in advance. Lewis On Mon, Mar 25, 2013 at 4:53 PM, alx...@aim.com wrote: Hello, I would like to let you know that, currently nutch -2.x does not index redirected pages, independent

Google Summer of Code 2013 - Giraph implementation of Nutch LinkRank Algorithm

2013-03-24 Thread Lewis John Mcgibbney
Hi All, After some discussion and drumming up of interest within the Giraph community, I've logged a Google Summer of Code issue [0] for this topic. We are looking for interested students to come forward and participate in the effort. I logged this over in Giraph as there was no GSoC effort

Re: waitForCompletion Error

2013-03-23 Thread Lewis John Mcgibbney
Can you please turn logging to DEBUG, then step through the job. Provide any observations please. Lewis On Sat, Mar 23, 2013 at 5:38 PM, kamaci furkankam...@gmail.com wrote: After crawling when I run that command: bin/nutch solrindex http://localhost:8983/solr -index Sometimes I get that

Re: waitForCompletion Error

2013-03-23 Thread Lewis John Mcgibbney
On the thread you pointed to Sebastian provides some clues on how to properly DEBUG the issue. You can try to DEBUG the issue. By this I mean actually DEBUGGING it, not just setting logging to DEBUG and hoping for excellent results.. this will unfortunately not happen. Can you please confirm your

Re: How to resolve Error: Could not find or load main class org.apache.nutch.crawl.Crawler in Windows 7

2013-03-18 Thread Lewis John Mcgibbney
Hi Prasanna, I would like to note for the record that I do not know of anyone running 2.X series within windows environment so I am keen to help you get this working. Once you build the project source, please make sure that the generated .job file is on your classpath along with the other

Re: Any plans to make nutch 1.x support solr cloud?

2013-03-17 Thread Lewis John Mcgibbney
Hi, You are always encouraged to look at our Jira instance before asking questions. It really helps both you and us solve problems efficiently. Please check out https://issues.apache.org/jira/browse/NUTCH-1377 And comment where you can. When we eventually do the entire out of the box upgrade to

[WELCOME] Feng Lu as Apache Nutch PMC and Committer

2013-03-16 Thread lewis john mcgibbney
Hi Everyone, On behalf of the Nutch PMC I would like to announce and welcome Feng Lu on board as PMC and Committer on the project. Amongst others, Feng has been an important part of the Nutch development over the last while and we would like to welcome him. @Feng, Please feel free to say a bit

Re: Run Nutch Crawl in Eclipse

2013-03-15 Thread Lewis John Mcgibbney
Nutch provides you with a pretty fine grained (common) logging mechanism. If you check out conf/log4j.properties you can alter specific tools, or the entire logging policy to obtain the coarseness you require. In this instance, I would either set the logging for Injector to DEBUG, or of course

Re: Understanding fetch MapReduce job counters and logs

2013-03-15 Thread Lewis John Mcgibbney
Hi Amit, I know this thread is a bit old now, however it is also something which bugged me when I was looking into something else (InjectorJob counters). On Tue, Mar 5, 2013 at 3:16 AM, Amit Sela am...@infolinks.com wrote: And summing all counters does not equal the total map input...

Re: Run Nutch Crawl in Eclipse

2013-03-15 Thread Lewis John Mcgibbney
Hi Mustafa, 1. Always tell us what version of the software you are using. It also helps to mention whether you are using a binary version or src. 2. Please read the responses from users@; you haven't answered which version of Nutch you're using. 3. As I explained, if you check out

How to identify seed URL for a given record from Webpage

2013-03-14 Thread Lewis John Mcgibbney
seedurl as one of the metadata. I was looking for some plugin which I could use but in this case I did not find any suitable plugin. Regards, Anand. On 13 March 2013 22:40, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Anand, The first step is to look at the issue over on NUTCH

Re: How to identify seed URL for a given record from Webpage

2013-03-13 Thread Lewis John Mcgibbney
Hi Anand, The first step is to look at the issue over on NUTCH-1533. If you feel like addressing anything then please do. This particular issue has nothing to do with Gora, or Hadoop so you will not need to look at any of the code there. I will also be working on that issue when I get some time.

Re: Nutch : Wiki Section updates

2013-03-13 Thread Lewis John Mcgibbney
Hi Kiran Please send me your wiki uid On Wed, Mar 13, 2013 at 9:56 PM, kiran chitturi chitturikira...@gmail.com wrote: Hi! I have noticed that there are certain sections of Nutch wiki that are not up to date. I am planning to update these pages with some pointers to the mailing list

Re: Nutch : Wiki Section updates

2013-03-13 Thread Lewis John Mcgibbney
Hi Kiran On Wed, Mar 13, 2013 at 9:56 PM, kiran chitturi chitturikira...@gmail.com wrote: I am planning to update these pages with some pointers to the mailing list discussion which give valuable information and also JIRA's. Nice Second, Does anyone have suggestions on improving/updating

Re: How to identify seed URL for a given record from Webpage

2013-03-11 Thread Lewis John Mcgibbney
There are numerous methods to do this. * You can either assign some metadata to each URL when injecting and bootstrapping the system * You could embed some meta tags or other distinguishing feature in the URLs and use the facilities (existing or available in Jira) to identify these pages. * You may
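The first option above, attaching metadata to each URL at injection time, can be sketched as a small helper that writes seed-file lines. It assumes the injector accepts tab-separated key=value pairs after the URL, which should be checked against your Nutch version. The helper `seed_line` and the key `seed_group` are hypothetical names used only for illustration.

```python
def seed_line(url: str, **metadata: str) -> str:
    """Build one seed-file line: the URL followed by tab-separated key=value pairs."""
    pairs = ["%s=%s" % (key, value) for key, value in sorted(metadata.items())]
    return "\t".join([url] + pairs)


def write_seed_file(path: str, entries) -> int:
    """Write (url, metadata dict) entries to a seed file; return the line count."""
    lines = [seed_line(url, **meta) for url, meta in entries]
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)
```

Tagging every seed with its own value means that any record carrying the same key later in the pipeline can be traced back to the seed it was injected with, provided the metadata is carried through to that record.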

Re: How to identify seed URL for a given record from Webpage

2013-03-11 Thread Lewis John Mcgibbney
Do you have an interest to work on implementing NUTCH-1533? I would be happy to work on this as well. Lewis On Mon, Mar 11, 2013 at 7:39 PM, Anand Bhagwat abbhagwa...@gmail.com wrote: Thanks for the information. I guess using the batch id is a good idea. On 11 March 2013 21:50, Lewis John

Re: mapred.FileOutputCommitter - Output path is null in cleanup

2013-03-09 Thread Lewis John Mcgibbney
Hi Marcel, The WARN can be ignored. Really, it occurs when we commit a job and do the clean up of a temporary directory. This is not a problem. On Wed, Mar 6, 2013 at 6:56 AM, mma m...@aufwind.cc wrote: Is there a possibility to get more information in the hadoop logfile? Not in this

[ANNOUNCEMENT] Welcome Kiran Chitturi as Apache Nutch PMC and Committer

2013-03-09 Thread lewis john mcgibbney
Hi All, Over the last while we have been aware of Kiran's ongoing contribution to the Nutch community. It is with great pleasure that we invite Kiran to join the Nutch PMC and also take up Committer role. @Kiran, please feel free to say a bit about yourself and introduce what brought you to

Re: keep all pages from a domain in one slice

2013-03-08 Thread Lewis John Mcgibbney
mergesegs operation. I think this could be a useful feature to many Nutch users. I can see that I won't get any more assistance here. Thanks, Jason On Mar 6, 2013, at 6:18 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Jason, There is nothing I can see here which concerns

Re: Nutch 1.6 from Java via HttpServlet

2013-03-06 Thread Lewis John Mcgibbney
The invocation Exception means that something further down is the problem. It looks to be the presence of your URLNormalizer. Make sure the configuration is all fine, make sure that the resources are available. This is not a problem with Nutch code, rather how you are using Nutch in your own code.

Re: Continue Nutch Crawling After Exception

2013-03-05 Thread Lewis John Mcgibbney
Hi, On Tue, Mar 5, 2013 at 7:22 AM, raviksingh ravisingh.air...@gmail.com wrote: I am new to Nutch. I have already configured Nutch with MYSQL. I have few questions : I would like to start by saying that this is not a great idea. If you read this list you will see why. 1. Currently I am

Re: Rest API for Nutch 2.x

2013-03-05 Thread Lewis John Mcgibbney
Documentation - No prior art - yes - http://www.mail-archive.com/user@nutch.apache.org/msg06927.html Jira issues - NUTCH-932 Please let us know how you get on. Getting some concrete documentation for this would be excellent. Thank you Lewis On Tue, Mar 5, 2013 at 7:33 AM, Anand Bhagwat

Re: keep all pages from a domain in one slice

2013-03-05 Thread Lewis John Mcgibbney
Hi Jason, There is nothing I can see here which concerns Nutch. Try solr lists please. Thank you Lewis On Tuesday, March 5, 2013, Stubblefield Jason mr.jason.stubblefi...@gmail.com wrote: I have several Solr 3.6 instances that for various reasons, I don't want to upgrade to 4.0 yet. My index

Re: Nutch 2.1 crawling step by step and crawling command differences

2013-03-04 Thread Lewis John Mcgibbney
Hi, If you look at the crawl script iirc there is no way to programmatically obtain the generated batchId(s) from the generator. This sounds like the source of the problem. As Kiran said though, the Nutch crawl script is the way forward ;) On Monday, March 4, 2013, kiran chitturi

Re: Nutch 1.6 : How to reparse Nutch segments ?

2013-03-04 Thread Lewis John Mcgibbney
Please don't go ahead and delete the parse directories just yet before you hear back from others. My suggestion would be to try and delete a subsection of the directories and see if this is possible. Have you changed some configuration and now want to parse out some more content/structure? On

Re: Nutch 1.6 : How to reparse Nutch segments ?

2013-03-04 Thread Lewis John Mcgibbney
for 1.x like 2.x has. Regards, Kiran. On Mon, Mar 4, 2013 at 4:51 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Please don't go ahead and delete the parse directories just yet before you hear back from others. My suggestion would be to try and delete a subsection

Re: Problem compiling FeedParser plugin with Nutch 2.1 source

2013-03-02 Thread Lewis John Mcgibbney
successfully with plugins. On Thu, Feb 28, 2013 at 4:58 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: honestly, I think we should get this fixed. Can someone please explain to me why we don't build every plugin within Nutch 2.x? I think we should. On Thu

Re: Problem compiling FeedParser plugin with Nutch 2.1 source

2013-03-01 Thread Lewis John Mcgibbney
tried to build 2.x with Eclipse i) Feed ii) parse-swf iii) parse-ext iv) parse-zip v) parse-metatags ( I wrote patch for this earlier, NUTCH-1478) The above plugins need to be ported to build 2.x successfully with plugins. On Thu, Feb 28, 2013 at 4:58 PM, Lewis John Mcgibbney

Re: Problem compiling FeedParser plugin with Nutch 2.1 source

2013-02-28 Thread Lewis John Mcgibbney
This shouldn't be happening but we are aware (the Jira instance reflects this) that there are some existing compatibility issues with Nutch 2.x HEAD. IIRC Kiran had a patch integrated which dealt with some of these issues. What I have to ask is what JDK are you using? I use 1.6.0_25 (I really need

Re: Problem compiling FeedParser plugin with Nutch 2.1 source

2013-02-28 Thread Lewis John Mcgibbney
to be ported due to the API changes. https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel On Thu, Feb 28, 2013 at 3:26 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: This shouldn't be happening but we are aware (the Jira

Re: Something for the weekend

2013-02-28 Thread Lewis John Mcgibbney
must have an awesome cluster to run this :) Thanks, Tejas Patil On Thu, Feb 28, 2013 at 12:06 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, I pushed a real simple script which I use as a cron job to bootstrap Apache Nutch with 1M URLs every day. For those wanting

Re: why is nutch2.1 trying to parse the same documents again and again?

2013-02-27 Thread Lewis John Mcgibbney
Have you looked at the java code? I am curious (and confused) about this different batch id (null) logging and want to either get rid of it... or better... make it more informative which would address both of our concerns. I would like not only to document this in the java code but also on the

Re: nutch-2.1 with hbase - any good tool for querying results?

2013-02-27 Thread Lewis John Mcgibbney
What for? What do you want? We are discussing (in the Gora community) making a gora-pig module so that there is a unified mechanism for doing pig driven inference of the data you hold in gora-* stores. Are you interested in engaging in that conversation? In all honesty (although indirectly linked)

Re: why is nutch2.1 trying to parse the same documents again and again?

2013-02-27 Thread Lewis John Mcgibbney
Hi On Wednesday, February 27, 2013, adfel70 adfe...@gmail.com wrote: Yes I looked at the code. Great I saw that the shouldProccess() check is performed on each file in the mapper. In nutch1.* I got used to an approach in which only a set of urls is processed in each cycle. Is nutch2.*

Re: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

2013-02-27 Thread Lewis John Mcgibbney
Hi, It is clear that Nutch (or more specifically Nutch 2.x) is not interoperable with this or some Hadoop distributions, in this case it is CDH4. It is not an easy problem to address from a community-to-work ratio point of view, especially with Nutch 2.x where there are multiple libraries which we

Re: Eclipse Error

2013-02-27 Thread Lewis John Mcgibbney
Glad to confirm that it was something wrong with your local windows environment Danilo and that it is now fixed. I tried to get nightly windows 7 builds running for Nutch on the Apache build infrastructure but I've been unable to do so yet. On Wed, Feb 27, 2013 at 9:31 AM, Danilo Fernandes

Re: nutch-2.1 with hbase - any good tool for querying results?

2013-02-26 Thread Lewis John Mcgibbney
We will be working on better support (gora-pig adapter) for this functionality in Apache Gora 0.3. For now Kiran's suggestion is by far the best. Thank you Lewis On Tue, Feb 26, 2013 at 10:17 AM, kiran chitturi chitturikira...@gmail.com wrote: I found apache pig [1] convenient to use with HBase

Re: Eclipse Error

2013-02-26 Thread Lewis John Mcgibbney
What is the problem? There is a community here that can help... if we know what is wrong! On Tue, Feb 26, 2013 at 7:44 AM, Danilo Fernandes dan...@kelsorfernandes.com.br wrote: I tried both and neither works! :( -Original Message- From: kiran chitturi

Re: Eclipse Error

2013-02-26 Thread Lewis John Mcgibbney
shed some light. Thanks, Tejas Patil On Tue, Feb 26, 2013 at 10:32 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: What is the problem? There is a community here that can help... if we know what is wrong! On Tue, Feb 26, 2013 at 7:44 AM, Danilo Fernandes dan

Re: migrating from 1.x to 2.x

2013-02-26 Thread Lewis John Mcgibbney
Hi kaveh, Size of crawl database is not an issue with regards to migration between Nutch versions, it is a compatibility issue which you need to be concerned about. There are no tools currently available in Nutch (as far as I know) to read URLs from hdfs and import/inject your crawl data into your

Re: Differences between 2.1 and 1.6

2013-02-25 Thread Lewis John Mcgibbney
Hi Danilo, You can check out the architecture changes here http://wiki.apache.org/nutch/#Nutch_2.x Nutch trunk (1.7-SNAPSHOT) is here http://svn.apache.org/repos/asf/nutch/trunk/ 2.x is here http://svn.apache.org/repos/asf/nutch/branches/2.x/ On Mon, Feb 25, 2013 at 1:56 PM, Danilo Fernandes

Re: Differences between 2.1 and 1.6

2013-02-25 Thread Lewis John Mcgibbney
Hi Markus, This is very useful, thank you. Lewis On Mon, Feb 25, 2013 at 3:08 PM, Markus Jelsma markus.jel...@openindex.io wrote: Something seems to be missing here. It's clear that 1.x has more features and is a lot more stable than 2.x. Nutch 2.x can theoretically perform a lot better if you

Re: nutch with cassandra internal network usage

2013-02-22 Thread Lewis John Mcgibbney
e.g. about a large webtable which would have to be entirely passed to mapreduce even if only a handful of entries are to be processed. Makes sense? Julien On 21 February 2013 01:52, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Those filters are applied only to URLs which do

Re: nutch with cassandra internal network usage

2013-02-21 Thread Lewis John Mcgibbney
records? Thanks. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 20, 2013 12:56 pm Subject: Re: nutch with cassandra internal network usage Hi Alex, On Wed, Feb 20, 2013 at 11
