Re: Updating the documentation for crawl via 2.x

2013-06-30 Thread Lewis John Mcgibbney
The individual command line options on the wiki deal with this no? http://wiki.apache.org/nutch/CommandLineOptions I am a huge fan of making Nutch easier to use, and that it is more obvious from a user perspective. I am also however very keen to write and maintain documentation which stands the

Re: Dependant lib in a plugin

2013-06-30 Thread Lewis John Mcgibbney
Hi, Please see plugin central for all your plugin therapy ;) https://wiki.apache.org/nutch/PluginCentral Also for your dependencies please see, most noticeably, parse-tika. Take a look at plugin.xml http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/parse-tika/ hth Lewis On Sun, Jun 30, 2013

[ANNOUNCE] Apache Nutch v1.7 Released

2013-06-28 Thread lewis john mcgibbney
Hi All, The Apache Nutch PMC are extremely pleased to announce the immediate release of Apache Nutch v1.7. Apache Nutch is an open source web-search software project. Stemming from Apache Lucene http://lucene.apache.org/java/, it now builds on Apache Solr http://lucene.apache.org/solr/ adding

Re: Questions/issues with nutch

2013-06-27 Thread Lewis John Mcgibbney
is the one I had not gone through until I checked your response, but I do not find answers to any of my questions (directly/indirectly) in it. Ok On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Hemant, I strongly advise you to take some

Re: Questions/issues with nutch

2013-06-27 Thread Lewis John Mcgibbney
-site.xml was changed to use AvroStore as storage class and job was rebuilt, and I reran inject, the output of which still shows that it is trying to use Memstore. On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: The Gora MemStore was introduced

[VOTE] Apache Nutch 2.2.1 RC#1

2013-06-27 Thread Lewis John Mcgibbney
Hi, It would be greatly appreciated if you could take some time to VOTE on the release candidate for the Apache Nutch 2.2.1 artifacts. This candidate is (amongst other things) a bug fix for NUTCH-1591 - Incorrect conversion of ByteBuffer to String. The bug fix solved 8 issues:
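
The snippet above identifies NUTCH-1591 only by its title, "Incorrect conversion of ByteBuffer to String". As a minimal sketch of how that kind of conversion can go wrong in general, assume a heap buffer whose backing array is larger than its readable region. This is an illustration only, not the actual Nutch patch, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not code from Nutch or from the NUTCH-1591 patch.
final class ByteBufferText {

    // Naive: decodes the whole backing array, ignoring position and limit.
    static String naive(ByteBuffer buf) {
        return new String(buf.array(), StandardCharsets.UTF_8);
    }

    // Safer: decodes only the readable region. Working on a duplicate
    // leaves the caller's position and limit untouched.
    static String readable(ByteBuffer buf) {
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        byte[] backing = "xxhelloyy".getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.wrap(backing, 2, 5); // readable region is "hello"
        System.out.println(naive(buf));    // includes the bytes outside the region
        System.out.println(readable(buf)); // only the readable region
    }
}
```

Decoding a duplicate rather than the buffer itself is a design choice in this sketch: later readers of the same buffer still see the original position.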

Re: Depth level 5 crawling issue

2013-06-27 Thread Lewis John Mcgibbney
Hi, Can you please try this http://s.apache.org/wIC Thanks Lewis On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf jamshaid...@gmail.com wrote: Hi, I'm using nutch 2.x with HBase and tried to crawl http://www.halliburton.com/en-US/default.page site for depth level 5. Following is the

Re: Nutch 2.1/Cassandra - Nutch 2.2.1/Cassandra

2013-06-27 Thread Lewis John Mcgibbney
Hi Martin, Thanks for the mail. Please see my answers inline. On Thu, Jun 27, 2013 at 12:53 PM, Martin Aesch martin.ae...@googlemail.com wrote: Should I switch from 2.1/Cassandra to 2.2.1/Cassandra? Once we release, yes. We are currently VOTE'ing on the release of 2.2.1. Some background here, in

Re: Fetch iframe from HTML (if exists)

2013-06-26 Thread Lewis John Mcgibbney
It looks like you're on a pre-1.3 version of Nutch here. It is highly recommended to upgrade. Thanks Lewis On Wednesday, June 26, 2013, Amit Sela am...@infolinks.com wrote: I did succeed in parsing using content and iterating over every line but I'd prefer to do it with DocumentFragment. my

Re: Fetch iframe from HTML (if exists)

2013-06-26 Thread Lewis John Mcgibbney
ended up using org.jsoup.nodes.Document document = Jsoup.parse(content); List<org.jsoup.nodes.Node> childNodes = document.childNodes(); On Wed, Jun 26, 2013 at 7:19 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: It looks like you're on a pre-1.3 version of Nutch here. It is highly

Re: Id based crawling with nutch2.x/hbase and multiple webpage tables

2013-06-26 Thread Lewis John Mcgibbney
Hi Tony, On Wed, Jun 26, 2013 at 1:13 AM, Tony Mullins tonymullins...@gmail.com wrote: As you said UpdateDBJob doesn't expect crawlId, No I didn't. I said it doesn't *use* (well, not until yesterday it didn't) the crawlId parameter. This is now fixed within the crawl script. and will be

Re: Id based crawling with nutch2.x/hbase and multiple webpage tables

2013-06-26 Thread Lewis John Mcgibbney
On Wed, Jun 26, 2013 at 4:30 AM, Tony Mullins tonymullins...@gmail.com wrote: Is it possible to crawl with crawlId but have HBase only create the 'webpage' table without the crawlId prefix, just like Cassandra does? I can't understand this question Tony. And my other problems of DBUpdateJob's

Re: Crawl in Nutch2.2

2013-06-26 Thread Lewis John Mcgibbney
...@gmail.com wrote: Hi Lewis, Thanks for your reply I just set the values: gora.datastore.default=org.apache.gora.hbase.store.HBaseStore I already removed the Hbase table in the past. Can it be a cause? Benjamin On Tue, Jun 25, 2013 at 7:34 PM, Lewis John Mcgibbney lewis.mcgibb

[ANNOUNCE] Apache Nutch v1.7 Released

2013-06-26 Thread Lewis John Mcgibbney
N.B. Previous message doesn't seem to have been mod'd through under my @ apache.org address so resending ;) It has however been distributed to annou...@apache.org already Hi All, The Apache Nutch PMC are extremely pleased to announce the immediate release of Apache Nutch v1.7. Apache Nutch is

Re: Questions/issues with nutch

2013-06-26 Thread Lewis John Mcgibbney
Hi Hemant, I strongly advise you to take some time to look through the Nutch Tutorial for 1.x and 2.x. http://wiki.apache.org/nutch/NutchTutorial http://wiki.apache.org/nutch/Nutch2Tutorial Also please see the FAQs, which you will find very useful. http://wiki.apache.org/nutch/FAQ Thanks

Re: Crawl in Nutch2.2

2013-06-25 Thread Lewis John Mcgibbney
Have you changed from the default MemStore gora storage to something else? On Tuesday, June 25, 2013, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: thanks Tejas Yes, I checked the logs and no Error appears in them I let the http.content.limit and parser.html.impl with their

Re: Solr job is not processing document with nutch 2.x hbase backend

2013-06-25 Thread Lewis John Mcgibbney
Hi Jamshaid, There is/are identical thread(s) currently open about this exact issue by Tony. Please chime in on them instead of opening brand new threads. Thank you http://www.mail-archive.com/user%40nutch.apache.org/ On Tue, Jun 25, 2013 at 5:38 AM, Jamshaid Ashraf jamshaid...@gmail.com wrote:

Re: Id based crawling with nutch2.x/hbase and multiple webpage tables

2013-06-25 Thread Lewis John Mcgibbney
Hi Tony, On Tue, Jun 25, 2013 at 1:10 AM, Tony Mullins tonymullins...@gmail.com wrote: So what should I do now to run my complete cycle of Nutch2.x jobs and insert my docs to Solr ? I'm not using HBase as backend however I know that as per the crawl script, the updatedb doesn't use crawlId

Re: Solr dedup command shows error

2013-06-25 Thread Lewis John Mcgibbney
org.apache.hadoop*hadoop*.mapreduce.Mapper this to import org.apache.hadoop.mapreduce.Mapper. Thanks Regards, Jamshaid On Tue, Jun 25, 2013 at 12:13 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Jamshaid, Please see the Jira issue and patch for this. https

Re: Nutch-Solr 4.3 schema

2013-06-25 Thread Lewis John Mcgibbney
the above field in Nutch's schema-solr4 will solve the issue and it will work out of the box with the pre-defined examples of Solr 4.3.0. Tony. On Mon, Jun 17, 2013 at 11:04 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Please note that we will be dropping the schema-solr4 in Nutch

Re: Solr dedup command shows error

2013-06-24 Thread Lewis John Mcgibbney
Hi Jamshaid, Please see the Jira issue and patch for this. https://issues.apache.org/jira/browse/NUTCH-1571 I would like to commit a patch for 2.x regarding how we were writing bytes, if you can test this patch then we can maybe add it as well and push 2.2.1 Thank you Lewis On Monday, June 24,

Re: Parse reduce stage take forver

2013-06-24 Thread Lewis John Mcgibbney
I may as well drop this one in here, I opened an issue a while back to discuss inherent differences in ordering of filtering and normalization between 1.x and 2.x codebases specifically within Generator* classes. https://issues.apache.org/jira/browse/NUTCH-1373 I am not sure how/if this applies to

[RESULT] WAS Re: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-24 Thread Lewis John Mcgibbney
to fix few issues before... [0] -1, nope, because... (and please explain why) GREAT, I'll push the release artifacts, promote the Maven staging repos and make the announcements. Thank you very much to everyone that VOTE'd. Great work Lewis On Thu, Jun 20, 2013 at 2:48 PM, lewis john mcgibbney lewi

Re: When the webpage write the column mtdt:_csh_ in the hbase?

2013-06-24 Thread Lewis John Mcgibbney
Hi, law@CEE279Law3-Linux:~/Downloads/asf/2.x$ find . | xargs grep _csh_ ./src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/.svn/text-base/OPICScoringFilter.java.svn-base: private final static Utf8 CASH_KEY = new Utf8("_csh_");

Re: need legends for fetch reduce jobtracker ouput

2013-06-22 Thread Lewis John Mcgibbney
, Apr 23, 2013 at 1:07 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: I agree. I can sort this tomorrow. @Kiran, Are we still working on the addition of documentation contributors via the contributors and admin group since the most recent lockdown? Tejas should be added to both groups

Re: need legends for fetch reduce jobtracker ouput

2013-06-22 Thread Lewis John Mcgibbney
#What_do_the_numbers_in_the_fetcher_log_indicate_.3F On Sat, Jun 22, 2013 at 12:54 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Sounds great Tejas. Wow this is a late shift. If you can commit your fetcher diagnostics it would be great Tejas. On Saturday, June 22, 2013, Tejas Patil tejas.patil...@gmail.com wrote: What

Re: Most stable backend for Nutch 2.x

2013-06-22 Thread Lewis John Mcgibbney
Hi Imran, HBase 0.90.x thank you Lewis On Saturday, June 22, 2013, imran khan imrankhan.x...@gmail.com wrote: Greetings, I have seen many mails here about people having different issues with different backends with Nutch 2.x So which backend is most suited /stable with Nutch 2.x and also

Re: Why webPage.getContent().array() is returning html of all pages in seed.txt ?

2013-06-21 Thread Lewis John Mcgibbney
result. I really don't know what else to do here !!! Could you please try any simple ParseFilter with latest Nutch2.x. ? Thanks, Tony On Fri, Jun 21, 2013 at 12:36 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: And the rest of the webpage fields actually. Are you

Re: Nutch 2.x with HBase backend errors

2013-06-21 Thread Lewis John Mcgibbney
Hi Tony, The second bullet point on the tutorial states that Gora works with the 0.90.X HBase branch (yes, this is old). It is known not to work with the 0.94.X branch. Please try with the 0.90 branch. Thanks Lewis On Fri, Jun 21, 2013 at 8:12 AM, Tony Mullins tonymullins...@gmail.com wrote: Hi ,

Re: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-21 Thread Lewis John Mcgibbney
that it has been committed. Would be good to fix it if we can. The code compiles and passes the test. +1 to release Thanks Julien On 20 June 2013 22:48, lewis john mcgibbney lewi...@apache.org wrote: Hi, Please VOTE on the release of the Apache Nutch 1.7 artifacts. As always

Re: Nutch 2.x with HBase backend errors

2013-06-21 Thread Lewis John Mcgibbney
On Fri, Jun 21, 2013 at 11:40 AM, Tony Mullins tonymullins...@gmail.com wrote: Thanks guys for your help and support. No hassle, great to have you poking around and using the software. We know there is work to be done. Thank you I'll try it now with HBase 0.90.x. Let us know how you get on.

Re: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-21 Thread Lewis John Mcgibbney
and passes the test. +1 to release Thanks Julien On 20 June 2013 22:48, lewis john mcgibbney lewi...@apache.org wrote: Hi, Please VOTE on the release of the Apache Nutch 1.7 artifacts. As always, we solved a bunch of issues: http://s.apache.org/1zE http

Re: Get HTML content generated by Javascript

2013-06-21 Thread Lewis John Mcgibbney
Hi, Nearly all of this page is generated by JS right? Right now my answer is no. We fetch then parse page source... which in this case is mostly all JS. The magic happens in the browser. ... Lewis On Tue, Jun 18, 2013 at 10:59 PM, Deals Collect dealscoll...@gmail.com wrote: Hi all, Can Nutch

Re: Inconsistencies in use of ParseStatus in 2.x

2013-06-21 Thread Lewis John Mcgibbney
Forget this. I am tripping and the low counters were directly in relation to NUTCH-1591 Sorry Lewis On Wed, Jun 19, 2013 at 5:04 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, We define the structure of ParseStatus [0] in our WebPage JSON schema [1]. All good so far. What

Re: confusion over fetch schedule

2013-06-21 Thread Lewis John Mcgibbney
Hi Joe, In 1.x Markus and Julien IIRC committed a real nice patch a while back which allows you to achieve what I think you are after. Please look at this thread http://www.mail-archive.com/user@nutch.apache.org/msg08738.html You will find piles of stuff on the user archive about this kinda

Re: Slow parse on hadoop

2013-06-21 Thread Lewis John Mcgibbney
Thanks Jason for posting this to user@ list. For those using Cassandra (1.1.2) and gora-cassandra please patch your copy of Nutch 2.x with Jason's patch. It would be real great if we could get some feedback on this as I am of the opinion that it certainly justifies a point oh release for 2.x.

Re: Repeated html while parsing sites in ParseFilter

2013-06-20 Thread Lewis John Mcgibbney
Hi, There is an open thread on the user list for this right now. Please look in the recent archive. I think it would be best to take this conv over there. Lewis On Thursday, June 20, 2013, Jamshaid Ashraf jamshaid...@gmail.com wrote: Hi All, I'm using Nutch 2.x/Cassandra and I have 3 urls in

Re: Why webPage.getContent().array() is returning html of all pages in seed.txt ?

2013-06-20 Thread Lewis John Mcgibbney
Maybe an obvious question Tony, but have you tried stepping through this and debugging your code? There is another thread which appeared today, which basically is the same problem as you have. I am struggling to see how there are parsefilter plugin implementations shipped with 2.x which do not

Re: Why webPage.getContent().array() is returning html of all pages in seed.txt ?

2013-06-20 Thread Lewis John Mcgibbney
And the rest of the webpage fields actually. Are you getting multiple values for each field or is it just for content? On Thursday, June 20, 2013, Tony Mullins tonymullins...@gmail.com wrote: Hi, Did anyone get a chance to look at the pointed out issue ? Just would like to know that is this a

[VOTE] Apache Nutch 1.7 Release Candidate

2013-06-20 Thread lewis john mcgibbney
Hi, Please VOTE on the release of the Apache Nutch 1.7 artifacts. As always, we solved a bunch of issues: http://s.apache.org/1zE SVN source tag: http://svn.apache.org/repos/asf/nutch/tags/release-1.7/ Staging repo: https://repository.apache.org/content/repositories/orgapachenutch-044/

Re: Synchronization Consistency of data in ParseFilter and IndexingFIlter

2013-06-20 Thread Lewis John Mcgibbney
Hi Tony, You are using Cassandra backend right? I think it's safe to say that there are lingering bugs in gora-cassandra. I am getting some dodgy behaviour using Cassandra 1.1.2 during large crawls. On Tue, Jun 18, 2013 at 12:40 AM, Tony Mullins tonymullins...@gmail.com wrote: I have debugged

Re: run nutch-1.6 in eclipse

2013-06-19 Thread Lewis John Mcgibbney
Mustafa, Please read this thoroughly http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_One:_Using_the_Mailing_Lists Once we understand from a well detailed email, what the problem is, we are more than willing to help. Until then it is really difficult to help you out. Sorry. Lewis On

Inconsistencies in use of ParseStatus in 2.x

2013-06-19 Thread Lewis John Mcgibbney
Hi, We define the structure of ParseStatus [0] in our WebPage JSON schema [1]. All good so far. What is not good (or not clear to me at least), is how we currently use methods within this class to define Hadoop counters for the parsing tasks. I parse large amounts of URLs, but the counters on one

Re: Why webPage.getContent().array() is returning html of all pages in seed.txt ?

2013-06-18 Thread Lewis John Mcgibbney
Hi Tony, On Tue, Jun 18, 2013 at 11:49 AM, Tony Mullins tonymullins...@gmail.com wrote: ...instead of returning html of the current page it is returning me the url of all the pages in seed.txt I suspect that this should not be happening at all! Could you please try entering 2 or more

Re: Generate map task not fully completing in 2.x

2013-06-18 Thread Lewis John Mcgibbney
This also happened during fetching stage as well. JobTracker shows that Fetcher was successful, but that 98.30% of the Map was complete. This looks like user@nutch is the wrong list for this. I will take it over to MR. Lewis On Tue, Jun 18, 2013 at 8:05 PM, Lewis John Mcgibbney lewis.mcgibb

Re: Nutch-Solr 4.3 schema

2013-06-17 Thread Lewis John Mcgibbney
expects _version_ field and that was missing in my schema. And the patch also doesn't include this field in Schema-Solr4.xml. Besides that I was also missing some .jars in my Solr class path. Tony. On Mon, Jun 17, 2013 at 12:14 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote

Re: Nutch-Solr 4.3 schema

2013-06-17 Thread Lewis John Mcgibbney
Please note that we will be dropping the schema-solr4 in Nutch and merging the content into schema.xml. It is not good for us to maintain two schemas. Thanks Lewis On Sunday, June 16, 2013, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Tony, If you're able to patch this up and submit

Re: DBUpdateJob failed - Exception job failed: name=update-table,

2013-06-17 Thread Lewis John Mcgibbney
Hi Tony, Which gora backend are you on, including the version of the backend itself please? I use Gora 0.3 with gora-cassandra on some cron jobs and injected your URLs into my db. All works fine. I did notice that these pages have a hellish lot of content which is not displayed on the page. Loads

Re: DBUpdateJob failed - Exception job failed: name=update-table,

2013-06-17 Thread Lewis John Mcgibbney
, Jun 17, 2013 at 11:21 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Tony, Which gora backend are you on, including the version of the backend itself please? I use Gora 0.3 with gora-cassandra on some cron jobs and injected your URLs into my db. All works fine. I

Re: Solrindex job failed !

2013-06-17 Thread Lewis John Mcgibbney
Please read the log message. It needs to be a unique field. Correct this in your schema and you should be good to go. hth On Mon, Jun 17, 2013 at 6:35 AM, kamal11 kkroyal@gmail.com wrote: I am also facing the same error. The nutch log says ERROR solr.SolrIndexer - java.io.IOException: Job

Re: Nutch scoring question again

2013-06-16 Thread Lewis John Mcgibbney
Yes Joe this is correct. On Sun, Jun 16, 2013 at 12:03 PM, Joe Zhang smartag...@gmail.com wrote: Thanks. with regards to (2), is this score the boost we see in solr index? On Sun, Jun 16, 2013 at 10:38 AM, Ahme Emre Aladağ emre.ala...@agmlab.com wrote: Note: I'm a newbie. As far as

Re: what is stored in the hbase after inject job

2013-06-14 Thread Lewis John Mcgibbney
Hi, On Thursday, June 13, 2013, RS tinyshr...@163.com wrote: Thanks a lot. 1. Is there a document describing the column symbols (like column=s:s)? There are a lot of symbols I cannot understand. Please check your gora-hbase-mapping.xml; this is where field names and qualifiers are defined. It

Re: MalformedURLException

2013-06-13 Thread Lewis John Mcgibbney
I would not advise you to use the MemStore. So far the purpose of this is not to store persistent data, it is mainly used for testing... as explained in nutch-default.xml. There are other alternatives which will be much more useful for your deployment. On Thu, Jun 13, 2013 at 5:36 AM, Peter

Re: tstamp and date field -- future dates???

2013-06-13 Thread Lewis John Mcgibbney
Hi James, Can you please point us to the patch and we will try to get it in to the codebase? Thanks... and sorry if we didn't look into the patch yet. Lewis On Thu, Jun 13, 2013 at 4:29 AM, James Sullivan james.brian.sulli...@gmail.com wrote: Please ignore this E-mail. It only happens in the

Re: Nutch-Solr 4.3 schema

2013-06-13 Thread Lewis John Mcgibbney
Hi Tony, Please see https://issues.apache.org/jira/browse/NUTCH-1486 Thanks Lewis On Thu, Jun 13, 2013 at 8:04 AM, Tony Mullins tonymullins...@gmail.com wrote: Hi, Does anyone have an updated Nutch-Solr schema for the new Solr 4.3? The existing NutchconfSolr 4 schema doesn't work with Solr 4.3.

Re: Nutch Compilation Error with Eclipse

2013-06-10 Thread Lewis John Mcgibbney
Hi, It is (IMHO) kind of fruitless running the crawl class (which is deprecated now and we highly suggest you use and amend the /src/bin/crawl script for your usecase) within Eclipse. You will learn far more setting breakpoints within individual classes and watching them execute on that basis. I

Issues on Compiling Nutch 2.x with Eclipse

2013-06-10 Thread Lewis John Mcgibbney
rebuild workspace. BTW: On which packages / classes do you see red dots ? On Sun, Jun 9, 2013 at 9:23 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Tony, This source has literally just been released. The tutorial on the Nutch wiki has also just been updated

Re: using Tika within Nutch to remove boiler plates?

2013-06-09 Thread Lewis John Mcgibbney
of Nutch? Which config file should I touch? On Sat, Jun 8, 2013 at 10:56 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Joe, https://issues.apache.org/jira/browse/NUTCH-961 On Saturday, June 8, 2013, Joe Zhang smartag...@gmail.com wrote: Can somebody please point me

Re: using Tika within Nutch to remove boiler plates?

2013-06-09 Thread Lewis John Mcgibbney
Hi Joe, I've not used this feature, it would be great if one of the others could chime in here. From what I can infer from the correspondence on the issue, and the available patches, you should be applying the most recent one uploaded by Markus [0] as your starting point. This is dated as

Re: [RESULT] WAS: Re: [VOTE] Apache Nutch 2.2 Release Candidate

2013-06-08 Thread Lewis John Mcgibbney
Hi Julien, Dynamite. I will release today. Lewis On Saturday, June 8, 2013, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Lewis, The md5, asc and sha are now correct. Thanks for fixing it. Have a nice weekend Julien On 7 June 2013 21:16, Lewis John Mcgibbney lewis.mcgibb

[ANNOUNCE] Apache Nutch 2.2 Released

2013-06-08 Thread lewis john mcgibbney
Good Afternoon Everyone, The Apache Nutch PMC are extremely pleased to announce the immediate release of Apache Nutch v2.2. Apache Nutch is an open source web-search software project. Stemming from Apache Lucene http://lucene.apache.org/java/, it now builds on Apache

Re: using Tika within Nutch to remove boiler plates?

2013-06-08 Thread Lewis John Mcgibbney
Hi Joe, https://issues.apache.org/jira/browse/NUTCH-961 On Saturday, June 8, 2013, Joe Zhang smartag...@gmail.com wrote: Can somebody please point me to some sample code? Thanks much! -- *Lewis*

[RESULT] WAS: Re: [VOTE] Apache Nutch 2.2 Release Candidate

2013-06-07 Thread Lewis John Mcgibbney
) Kiran Chitturi Feng Lu Julien Nioche Sebastian Nagel Lewis John McGibbney [0] +/-0, fine, but consider to fix few issues before... [0] -1, nope, because... (and please explain why) This is an excellent VOTE'ing count so thank you to everyone that took the time to review the release candidate

Re: Nutch 2.1 - parsechecker: small output

2013-06-05 Thread Lewis John Mcgibbney
Additionally we've harmonized the behaviour of 2.x and 1.x so that the next releases will be consistent. We should hopefully be releasing 2.2 very very soon. You can look on the recent archive of this list to find the release candidate artifacts for 2.2 Lewis On Wednesday, June 5, 2013, feng lu

Re: How to add field to index

2013-06-05 Thread Lewis John Mcgibbney
Yes I will defo try and fix this. Could you please log a Jira for it and we will get round to it when we can. Thanks for reporting the issue here so far. On Wed, Jun 5, 2013 at 11:04 AM, stone2dbone antoinette.d.stan...@gmail.com wrote: Lewis, I've been told by someone who knows Java (I

Re: How to add field to index

2013-06-04 Thread Lewis John Mcgibbney
Hi, On Tue, Jun 4, 2013 at 11:40 AM, stone2dbone antoinette.d.stan...@gmail.com wrote: Okay, IndexFiltersChecker shows the value of my added field is '[Ljava.lang.String;@15d1c817'. What might be causing this? So you've not added any code only configuration right? It is kinda difficult
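
A value such as '[Ljava.lang.String;@15d1c817' has the shape Java produces when toString() is called on a String[] itself rather than on its elements. A minimal sketch of the difference, using only the JDK; the class name is hypothetical and nothing here is taken from IndexFiltersChecker or from the plugin under discussion.

```java
import java.util.Arrays;

// Hypothetical illustration of how a String[] ends up printed as "[Ljava.lang.String;@...".
final class ArrayToStringDemo {

    // Object.toString() on an array: type tag plus identity hash, not the contents.
    static String viaObjectToString(String[] values) {
        return values.toString();
    }

    // Arrays.toString() renders the elements.
    static String viaArraysToString(String[] values) {
        return Arrays.toString(values);
    }

    // String.join() renders the elements with a chosen separator.
    static String joined(String[] values) {
        return String.join(",", values);
    }

    public static void main(String[] args) {
        String[] values = {"a", "b"};
        System.out.println(viaObjectToString(values));
        System.out.println(viaArraysToString(values)); // [a, b]
        System.out.println(joined(values));            // a,b
    }
}
```

If a field value looks like the first line of output, some code path is probably stringifying an array where it meant to add each element.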

Re: How to setup HBase as backend

2013-06-04 Thread Lewis John Mcgibbney
Nutch jobs locally. On Tue, Jun 4, 2013 at 3:25 AM, Yves S. Garret yoursurrogate...@gmail.com wrote: One more question, would it matter what version of Hadoop that I have? On Thu, May 30, 2013 at 6:57 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: In all

Re: crawl job failed error

2013-06-04 Thread Lewis John Mcgibbney
Hi, On Tue, Jun 4, 2013 at 7:42 PM, RS tinyshr...@163.com wrote: InjectorJob: total number of urls injected after normalization and filtering: 0 Nothing is injected here. Please review your URL filters and try again. Lewis

Re: Re: crawl job failed error

2013-06-04 Thread Lewis John Mcgibbney
. [local ]$ cat urls/seed.txt http://nutch.apache.org/ Thanks hechuan At 2013-06-05 11:43:38,Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, On Tue, Jun 4, 2013 at 7:42 PM, RS tinyshr...@163.com wrote: InjectorJob: total number of urls injected after normalization and filtering

Re: How to add field to index

2013-06-04 Thread Lewis John Mcgibbney
Hi, On Tue, Jun 4, 2013 at 3:21 PM, stone2dbone antoinette.d.stan...@gmail.com wrote: I must still add a field to each document? Please clarify. index-static should work as described out of the box. Make sure that you have a comma-separated list of fields in the form name:value within
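
The reply above describes the expected value as a comma-separated list of fields in the form name:value. A minimal sketch of how such a string breaks down into fields; the parsing logic, class name, and sample field names are assumptions for illustration and are not the index-static plugin's own code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for a "name:value,name:value" list; not taken from Nutch.
final class StaticFieldSpec {

    static Map<String, String> parse(String spec) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            String trimmed = entry.trim();
            int colon = trimmed.indexOf(':');
            if (colon <= 0) {
                continue; // skip empty entries and entries with no field name
            }
            // Split at the first colon only, so values may themselves contain colons.
            fields.put(trimmed.substring(0, colon).trim(),
                       trimmed.substring(colon + 1).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample field names and values are made up.
        System.out.println(parse("source:intranet, lang:en"));
    }
}
```

A quick check like this on the configured string can show whether a missing colon or stray separator is the reason a static field never appears.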

Nutch not crawling fully

2013-06-04 Thread Lewis John Mcgibbney
Hi, It is clear that for the configuration you are running NTLM is not authenticating properly. I would run the Http class with TRACE logging activated, this will show the credentials you are after. You should also note the documentation in nutch-default.xml which explicitly states NOTE: For

Re: How to add field to index

2013-06-03 Thread Lewis John Mcgibbney
Hi, You can quickly check which fields will be added to your NutchDocument before being passed to Solr using the IndexFiltersChecker tool. This tool is available in both trunk and 2.x codebases and can be invoked from the command line interface. On Monday, June 3, 2013, stone2dbone

[VOTE] Apache Nutch 2.2 Release Candidate

2013-06-02 Thread lewis john mcgibbney
Good Friday Everyone, Glad to get to a stage where we can VOTE on the release of the Apache Nutch 2.2 artifacts. We solved a stack of issues: http://s.apache.org/LPB SVN source tag: http://svn.apache.org/repos/asf/nutch/tags/release-2.2/ Staging repo:

Re: Generator -adddays

2013-05-31 Thread Lewis John Mcgibbney
Seems like a small cli syntax bug. Please submit a patch and we can commit. Thanks Lewis On Friday, May 31, 2013, Bai Shen baishen.li...@gmail.com wrote: Two quick questions. 1. Why is the parameter -adddays and not -addDays? 2. Should it be changed to match the other parameters or is it

Re: Generator -adddays

2013-05-31 Thread Lewis John Mcgibbney
+1 On Fri, May 31, 2013 at 12:13 PM, Markus Jelsma markus.jel...@openindex.io wrote: Please don't break existing scripts and support lower and uppercase. Markus -Original message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Fri 31-May-2013 19:11 To:
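
Markus's request is to accept both spellings so that existing scripts keep working. A minimal sketch of one way to do that with a case-insensitive comparison; the class and method are hypothetical and this is not the Generator's actual argument handling.

```java
// Hypothetical argument parser; not code from the Nutch Generator.
final class AddDaysArgs {

    // Returns the number of days following an -adddays / -addDays flag, or 0 if absent.
    static long parseAddDays(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-adddays".equalsIgnoreCase(args[i])) {
                return Long.parseLong(args[i + 1]);
            }
        }
        return 0L;
    }

    public static void main(String[] args) {
        System.out.println(parseAddDays(new String[] {"-adddays", "7"})); // 7
        System.out.println(parseAddDays(new String[] {"-addDays", "3"})); // 3
    }
}
```

Note that equalsIgnoreCase also accepts spellings such as -ADDDAYS; matching exactly two literal forms would be stricter if that matters.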

Re: Error in resolving some dependencies

2013-05-31 Thread Lewis John Mcgibbney
+1 dynamite Tejas On Fri, May 31, 2013 at 2:57 PM, Tejas Patil tejas.patil...@gmail.com wrote: Hi Kiran, Happy to know :) Have you faced any problems with it ? I am in middle of editing the wiki page and your comments might help me do that. Thanks, Tejas On Fri, May 31, 2013 at 2:55

Re: Altering webpage ?

2013-05-30 Thread Lewis John Mcgibbney
Hi, Just heads up for the event where you do need to add to the nested Metadata structure within the WebPage.avsc, you can merely write your changes and utilise the ant 'generate-gora-src' target from the build script. The GoraCompiler will then compile everything in /src/gora to the path you

Re: How to setup HBase as backend

2013-05-30 Thread Lewis John Mcgibbney
, similar issue: http://bin.cakephp.org/view/180499048 I've left the defaults for config as they were, except this is in gora.properties in apache nutch. gora.datastore.default=org.apache.gora.hbase.store.HBaseStore On Wed, May 29, 2013 at 7:40 PM, Lewis John Mcgibbney

Re: How to setup HBase as backend

2013-05-30 Thread Lewis John Mcgibbney
around some jar files? On Thu, May 30, 2013 at 6:35 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Make sure that everything is compiled and you are running from runtime or with the Jar in hadoop On Thu, May 30, 2013 at 3:00 PM, Yves S. Garret yoursurrogate...@gmail.com wrote

Re: error crawling

2013-05-29 Thread Lewis John Mcgibbney
Hi Chris, Please check out NUTCH-1545 We'll hopefully be committing this today(ish) and it will hopefully be included in the 2.2 RC which I am about to cut. Your feedback would be great. Thanks On Wednesday, May 29, 2013, Christopher Gross cogr...@gmail.com wrote: I did make some modifications

Re: How to setup HBase as backend

2013-05-29 Thread Lewis John Mcgibbney
This is incompatible. On Wed, May 29, 2013 at 1:59 PM, Yves S. Garret yoursurrogate...@gmail.com wrote: Hi all, I'm using HBase 0.94.7 and Nutch 2.1. On Wed, May 29, 2013 at 4:55 PM, Adriana Farina adriana.farin...@gmail.com wrote: Hi Yves, as Tejas said, your issue is almost

Re: How to setup HBase as backend

2013-05-29 Thread Lewis John Mcgibbney
.X and Nutch 2.1 work? On Wed, May 29, 2013 at 5:05 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: This is incompatible. On Wed, May 29, 2013 at 1:59 PM, Yves S. Garret yoursurrogate...@gmail.com wrote: Hi all, I'm using HBase 0.94.7 and Nutch 2.1. On Wed, May

Re: Extracting status code from hbase

2013-05-29 Thread Lewis John Mcgibbney
This is most certainly better aimed at either Gora or HBase lists. Obtaining better (and consistent) understanding and of course abstracting users from such data structures is what we have been addressing in current Gora development. (See GORA-174) You will want to look specifically at some of the

Re: Extracting status code from hbase

2013-05-29 Thread Lewis John Mcgibbney
OH, BTW I meant to refer you to the test in line 178 of [0]. testPutNested hth Lewis On Wed, May 29, 2013 at 7:07 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: This is most certainly better aimed at either Gora or HBase lists. Obtaining better (and consistent) understanding

Explanation of RegexURLFIlterTestBase benchmark's

2013-05-23 Thread Lewis John Mcgibbney
Hi All, A really nice aspect of the regex (urlfilter-automaton and urlfilter-regex) plugin implementations in Nutch is that there is a small but very useful RegexURLFilterBaseTest [0] which compares benchmarks for simple regex parsing. The results we get are as follows urls automaton

Re: Explanation of RegexURLFIlterTestBase benchmark's

2013-05-23 Thread Lewis John Mcgibbney
, yadda, yadda. On Thu, May 23, 2013 at 1:57 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi All, A really nice aspect of the regex (urlfilter-automaton and urlfilter-regex) plugin implementations in Nutch is that there is a small but very useful RegexURLFilterBaseTest [0

Re: Explanation of RegexURLFIlterTestBase benchmark's

2013-05-23 Thread Lewis John Mcgibbney
Hi Kirby, On Thu, May 23, 2013 at 6:36 PM, Kirby Bohling kirby.bohl...@gmail.com wrote: Not that I think you need them in particular, but it seems like Nutch could be doing plenty of benchmarking, and micro benchmarking in particular. I agree with this. It is not my goal to attack this head

Re: Nutch 2.1: extension point ParseFilter: doc is null

2013-05-23 Thread Lewis John Mcgibbney
Hi Martin, I am struggling to understand how the DocumentFragment (populated either by private methods parseTagSoup or parseNeko depending on your config in nutch-site.xml) is null! What you don't mention is some problem you are having? I can't DEBUG the code tonight but I am interested to see

Re: Nutch 2.1 - Unauthorized

2013-05-22 Thread Lewis John Mcgibbney
Hi Feng, Where is the patch please? Thank you very much Lewis On Wednesday, May 22, 2013, feng lu amuseme...@gmail.com wrote: Hi Daniel, Nutch 2.x does not currently support Solr authentication. I have already opened an issue and added a patch; you can apply it and try again. Thanks On Wed, May

Re: error crawling

2013-05-20 Thread Lewis John Mcgibbney
Please search the mailing list for the HBase logging. There was a conversation on this reasonably recently. Please see my other response for the rest. hth Lewis On Monday, May 20, 2013, Christopher Gross cogr...@gmail.com wrote: Ok, so the crawlId isn't like the directories used in the 1.x

Re: nutch crawl

2013-05-20 Thread Lewis John Mcgibbney
Hi Chris, Please see the documentation I put up on the wiki for this phenomenon http://wiki.apache.org/nutch/ErrorMessagesInNutch2#Nutch_logging_shows_Skipping_http:.2F.2FmyurlForParsing.com.3B_different_batch_id_.28null.29 Also, please search the mailing list for a recent discussion on the

Re: error crawling

2013-05-20 Thread Lewis John Mcgibbney
Hi Chris, On Mon, May 20, 2013 at 10:21 AM, Christopher Gross cogr...@gmail.com wrote: Lewis -- Is the DEBUG something set in the conf/log4j.properties file? I have the rootLogger set to INFO,DRFA and the threshold is ALL. Everything else is INFO or WARN (no DEBUGs to be found.) Well yes

[REQUEST] (NUTCH-1569) Upgrade 2.x to Gora 0.3

2013-05-19 Thread Lewis John Mcgibbney
Hi All, I submitted a patch to upgrade the Nutch 2.x Branch codebase to the newly released Gora 0.3. The patch can be found here [0]. It would be excellent if folks could please test this patch and provide feedback to the dev@ list. The feedback will be very helpful in allowing us to progress

Re: Getting error while running nutch in eclips in window environment

2013-05-18 Thread Lewis John Mcgibbney
You need to follow the tutorial here http://wiki.apache.org/nutch/RunNutchInEclipse If after reading this thoroughly you have some problems please let us know about them. Thank you Lewis On Thu, May 16, 2013 at 12:07 PM, harsh yadav harsh.m...@gmail.com wrote: 2013-05-17 00:33:09,376 WARN

Re: Status of Elasticsearch indexer?

2013-05-18 Thread Lewis John Mcgibbney
Hi Chris, Thanks for getting on the list and discussing these aspects of development :0) From my perspective there are a number of observations BRANCH 2.x * NUTCH-1568 [0] is ripe for development. My sole justification for not addressing this is that we wish to push Nutch 2.2 and it is safe to

Re: nutch

2013-05-15 Thread Lewis John Mcgibbney
Hi Shobha, This is merely a class loading problem. You need to ensure that the class is available on your classpath. Although the problem you are having has nothing to do with this, my advice is to not use Nutch 1.2. Best Lewis On Wed, May 15, 2013 at 5:01 AM, Shobha shobhaendig...@gmail.com

Re: Solrindex -all not working correctly

2013-05-15 Thread Lewis John Mcgibbney
solrindex -all it still indexes everything, not just the newly parsed items. On Wed, May 1, 2013 at 2:13 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: What version are you using? If you can I would advise you to upgrade to 2.x HEAD. On Wed, May 1, 2013 at 4:32 AM, Bai

Re: NUTCH1.2 ,the specific format of the dump text file?

2013-05-13 Thread Lewis John Mcgibbney
there is a good chance that the dump is in a foreign language, however this depends on which language you consider as foreign and what language it actually is. AFAIK the encoding should be inferred directly from the page or document markup, however failing this there is a default fall back of

Re: Passing content type last modified from nutch to solr

2013-05-08 Thread Lewis John Mcgibbney
Hi Kneerosh, Golden rule of posting: which Nutch and which Solr versions are you using? Add index-more to your plugin configuration and it will get you two out of three. As for author, if the markup is there it is trivial. Lewis On Tuesday, May 7, 2013, kneerosh roshni_rajago...@yahoo.co.in wrote:

Re: What's the current status of upgrading nutch 1.* trunk to solr 4?

2013-05-02 Thread Lewis John Mcgibbney
Hi, Have you looked at the patch for NUTCH-1486? This is not just schema changes. The patch is for 2.x but the process of porting it to the new pluggable indexing architecture for trunk is trivial. Can you do this? Currently the patch uses ConcurrentUpdateSolrServer where it should use just solr
