Re: Problems indexing to solr 3.5 from nutch 1.8

2015-09-03 Thread Lewis John Mcgibbney
Hi Paddy, Some comments in addition to my response. You should try upgrading to Nutch 1.10 when we release very shortly. There has been so much work done since 1.8 that you can benefit from. Keep your ears peeled here for a release candidate and then eventual release. Please see response below.

Re: Work remaining for the next 2.x release?

2015-08-28 Thread Lewis John Mcgibbney
Hi David, On Wed, Aug 26, 2015 at 5:05 AM, user-digest-h...@nutch.apache.org wrote: Is there any general feeling towards how close the next 2.x release is? Yes. I feel strongly about getting one out there as soon as we have stabilized the remaining issues

Re: Test fail

2015-08-25 Thread Lewis John Mcgibbney
understand it. 2015-08-25 8:02 GMT+03:00 Lewis John Mcgibbney lewis.mcgibb...@gmail.com: Hi Cihad, Which version of Nutch 2.X are you working with when you get these errors? On Sat, Aug 15, 2015 at 11:04 AM, user-digest-h

Re: tika problem

2015-08-24 Thread Lewis John Mcgibbney
Hi Alp, On Tue, Aug 11, 2015 at 10:03 PM, user-digest-h...@nutch.apache.org wrote: While trying to use 2.4.x for Tika 1.8 (to use tesseract for ocr, actually), tika could not parse application/pdf files. The mapping is correct, in the plugin-xml, * are routed to tika, and the log states that

Re: Test fail

2015-08-24 Thread Lewis John Mcgibbney
Hi Cihad, Which version of Nutch 2.X are you working with when you get these errors? On Sat, Aug 15, 2015 at 11:04 AM, user-digest-h...@nutch.apache.org wrote: I run TestInjector. But there is an exception as follows: java.util.NoSuchElementException at

Re: Indexing crawlId

2015-08-24 Thread Lewis John Mcgibbney
Hi Amir, On Fri, Jul 17, 2015 at 3:08 AM, user-digest-h...@nutch.apache.org wrote: I'm trying to find a way to know, for every document, which crawl job issued it. I thought of indexing the crawlId as part of the indexed data, and I thought of using the index-metadata plugin with index.db

Re: Problem with crawling for nutch 2.3

2015-08-24 Thread Lewis John Mcgibbney
Hi Divjot, Please see reply below On Wed, Jul 15, 2015 at 1:13 AM, user-digest-h...@nutch.apache.org wrote: I have compiled nutch 2.3 code with gora 0.6 and using cloudera Hbase as backend database. The code compiles fine and I am able to run it using the bin/crawl command. The problem is

Re: Elastic search date, tstamp etc.

2015-08-24 Thread Lewis John Mcgibbney
Hi Alp, On Tue, Aug 11, 2015 at 10:03 PM, user-digest-h...@nutch.apache.org wrote: Hello, [snip] 1. nutch 2.3 sets the timestamp to a month later. date is 1970 Tried to use index-more, but still lastmodified date is null. Investigating the elasticsearch map, date, tstamp fields are set

Re: 2.3.1 and version control

2015-08-24 Thread Lewis John Mcgibbney
Hi Alp, On Tue, Jul 21, 2015 at 10:20 PM, user-digest-h...@nutch.apache.org wrote: I would like to use Tesseract OCR within nutch, in order to parse scanned pdf files (assuming this is the correct (and only?) way of doing that). Skimming through the previous emails, I noticed the support is

Re: KeeperErrorCode = ConnectionLoss for /hbase/master

2015-08-18 Thread Lewis John Mcgibbney
Hi Lê Văn Thiệp, On Fri, Jul 10, 2015 at 4:43 AM, user-digest-h...@nutch.apache.org wrote: Subject: Re: KeeperErrorCode = ConnectionLoss for /hbase/master Hi Lewis John Mcgibbney I am using Nutch 2.x, Gora 0.5, and HBase 0.9.4.x Thanks for your help! Did you ever get this sorted out

Re: Nutch 2.3 : Backend datastorage problem

2015-08-18 Thread Lewis John Mcgibbney
Hi Alexandre, Apologies for the hellishly long time before I've picked up this message! Current status of 2.X branch is that it is in need of some attention and major upgrades to key dependencies. This is inherited through the dependency upon Apache Gora, as we need to release Apache Gora 0.6.1

Re: Nutch tests from Maven

2015-07-29 Thread Lewis John Mcgibbney
Hi Markus, On Tue, Jul 28, 2015 at 5:54 AM, user-digest-h...@nutch.apache.org wrote: Hello - Nutch does not ship unit tests anymore as Maven artifacts, hence we cannot use CrawlDBTestUtil in external projects. Should we ship them? Or just copy the utils? What do you think? Markus I would

Re: Help regarding installation of nutch-gui

2015-07-09 Thread Lewis John Mcgibbney
Hi Aditya, The code and documentation you've referenced below is ancient. If you want to use Nutch with a GUI, you need to use Nutch 2.X [0]. You need to investigate both the nutchserver [1] and webapp [2]. hth Lewis [0] http://svn.apache.org/repos/asf/nutch/branches/2.x/ [1]

Re: KeeperErrorCode = ConnectionLoss for /hbase/master

2015-07-09 Thread Lewis John Mcgibbney
Hi ThiepLV, Which version of Nutch, Gora, Hadoop and HBase are you on? On Wed, Jul 8, 2015 at 4:23 AM, user-digest-h...@nutch.apache.org wrote: I am running InjectorJob by tutorial https://wiki.apache.org/nutch/RunNutchInEclipse, but I receive the following: 2015-07-07 22:33:38,269 ERROR

True Value of fetchQueues.totalSize

2015-06-23 Thread Lewis John Mcgibbney
Hi Folks, It is very common for us to see logging such as the following fetching http://www.eclap.eu/drupal/?q=en-US/node/65229/og/forumsort=ascorder=Topic -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414 What I've noticed for some time is that fetchQueues.totalSize never seems to

Re: Four+ questions Nutch, Solr, and Accumulo

2015-06-22 Thread Lewis John Mcgibbney
Hi Geoffry, On Fri, Jun 19, 2015 at 10:06 AM, user-digest-h...@nutch.apache.org wrote: I started with Nutch yesterday and have come up with four+ questions if answered will help me on my way. 1. Is it correct Nutch 2.3 does not work with Solr 5.2.1? There seems to be a dependency

Re: Running Nutch using from Dynamic Web Project

2015-06-22 Thread Lewis John Mcgibbney
Hi Alex, On Fri, Jun 19, 2015 at 10:06 AM, user-digest-h...@nutch.apache.org wrote: Is there any recommended or better way of running Nutch 1.x jobs from Dynamic Web Project. You mean using the Java API's we provide on Maven Central?

Re: Nutch 2.3 server job status listener?

2015-06-22 Thread Lewis John Mcgibbney
Hi Jessica, On Fri, Jun 19, 2015 at 10:06 AM, user-digest-h...@nutch.apache.org wrote: I'm writing a Java application that uses the Nutch REST API to execute the crawl cycle. I need to be able to call the next job only when the previous job is finished. Right now, the only way I know to

Re: Nutch crawls not appearing in Kibana

2015-06-22 Thread Lewis John Mcgibbney
Hi Jessica and Brooks, On Fri, Jun 19, 2015 at 10:06 AM, user-digest-h...@nutch.apache.org wrote: [snip] Notice the 'prevFetchTime' field has been updated to show the next date when this URL should be crawled (30 days from now - July 19). I assume this is exactly what SHOULD

Re: Nutch crawls not appearing in Kibana

2015-06-22 Thread Lewis John Mcgibbney
Actually please just see https://issues.apache.org/jira/browse/NUTCH-2045 If you guys could test it would be great. lewis On Mon, Jun 22, 2015 at 11:21 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Jessica and Brooks, On Fri, Jun 19, 2015 at 10:06 AM, user-digest-h

Re: Nutch crawls not appearing in Kibana

2015-06-17 Thread Lewis John Mcgibbney
Hi Brooks On Wed, Jun 17, 2015 at 7:59 AM, user-digest-h...@nutch.apache.org wrote: Hi all, [snip] First things first, can you verify your Elasticsearch settings in nutch-site.xml? e.g. https://github.com/apache/nutch/blob/2.x/conf/nutch-default.xml#L1240-L1291 Please make sure that

Re: 2.3 REST API and batchId

2015-06-17 Thread Lewis John Mcgibbney
Hi Jessica On Wed, Jun 17, 2015 at 7:59 AM, user-digest-h...@nutch.apache.org wrote: I'm having trouble understanding the concept of a batch and which elements of the crawl cycle require a batchId. A batch ID is essentially the same as a segment in the Nutch 1.X branch. It defines a type of

Re: Nutch 2.3 with HDFS as storage

2015-06-17 Thread Lewis John Mcgibbney
Hi Ankit, On Wed, Jun 17, 2015 at 7:59 AM, user-digest-h...@nutch.apache.org wrote: Hi All, Is it possible to store data into HDFS directly without using hbase while crawling with Apache Nutch 2.3 Yes it is, please see the AvroStore and DataFileAvroStore Gora implementations for writing

Re: REST API for crawling

2015-06-15 Thread Lewis John Mcgibbney
Hi Jessica, On Fri, Jun 12, 2015 at 7:10 AM, user-digest-h...@nutch.apache.org wrote: Hello. I am trying to test out the 2.3 REST API using curl, but I'm having trouble with the commands. [snip] Did you get this issue sorted out? Are there any more problems? The issue with casting to Long

Re: Can Nutch crawling shortened url?

2015-06-13 Thread Lewis John Mcgibbney
Hi Ankit, On Mon, Jun 8, 2015 at 2:13 AM, user-digest-h...@nutch.apache.org wrote: I tried it with 1.10, but the shortened URLs still don't get followed through. Have you tried changing logging level to TRACE within conf/log4j.properties? This may provide more detail for you. I think

Re: Deduplication -- custom Signature

2015-06-02 Thread Lewis John Mcgibbney
Hi Breno, On Tue, Jun 2, 2015 at 1:38 AM, user-digest-h...@nutch.apache.org wrote: We are indexing several domains for a specific project, which may contain duplicated content (e.g. pdf files). The users of the system come from different organisations and wonder why the content is not

Nutch errors on VirtualBox shared folders

2015-06-02 Thread Lewis John Mcgibbney
Hi Folks, I wanted to post to this list some observations and findings we've experienced regarding the above topic and how Nutch is behaving. [0] Essentially, this comes down to the following By default, Vagrant maps the 'source' directory on the host machine to /vagrant on the client. This is

Re: Deduplication -- custom Signature

2015-05-31 Thread Lewis John Mcgibbney
Hi Breno, On Sun, May 31, 2015 at 12:30 AM, user-digest-h...@nutch.apache.org wrote: I've implemented a custom domain aware Signature to be used in the deduplication phase. Nice! Out of curiosity can you share what your use case is? I would be really interested to hear more as I am

Re: about language extraction for zip documents

2015-05-31 Thread Lewis John Mcgibbney
Hi, On Sun, May 31, 2015 at 12:30 AM, user-digest-h...@nutch.apache.org wrote: Hi community. I'm using Nutch 1.9 and Solr 4.10. I use Nutch to parse zip documents, but the language field is empty in Solr for all of these documents and this is a problem for me. ParseZip plugin use tika to

Re: Nutch 2.X vs. 1.X

2015-05-31 Thread Lewis John Mcgibbney
Hi Chaushu, On Sun, May 31, 2015 at 12:30 AM, user-digest-h...@nutch.apache.org wrote: I'm using Nutch 1.9 with Solr 4.10 I wanted to ask what are the advantages of Nutch 2 vs. Nutch 1 and if I use Solr, there is a reason why should I use Nutch 2. Nutch 1.X branch is the more maintained of

Re: ClassPathException sending topN argument for /job/create using Nutch 2.x RESTApi

2015-05-21 Thread Lewis John Mcgibbney
Hi Alex, On Wed, May 20, 2015 at 1:03 AM, user-digest-h...@nutch.apache.org wrote: Hi Lewis, I am using Nutch 2.3 Grand. Thank you for the context. The patch is available at https://issues.apache.org/jira/browse/NUTCH-2019 If you could test against Nutch 2.X HEAD it would be ideal. Lewis

Re: Solr as backend in Nutch 2.3? Which Hbase in 2.3

2015-05-20 Thread Lewis John Mcgibbney
Hi Ralf, On Wed, May 20, 2015 at 1:03 AM, user-digest-h...@nutch.apache.org wrote: So by simply changing the Gora backend it should work? Thank you! I'll try it out soon. Yes. Exactly. If you have problems then get us here. We will be making a release of Gora very soon and have fixed a

Re: Nutch-1741 in GSOC 2015

2015-05-18 Thread Lewis John Mcgibbney
and my application is accepted. The main reason why I have chosen the Nutch Project for GSOC is knowing the Nutch closely. My subject is Nutch-1741 - Support of Sitemaps in Nutch 2.x [1]. Thanks Lewis John McGibbney and Talat Uyarer for being my mentors on this process. I hope I can contribute

Re: Nutch-1741 in GSOC 2015

2015-05-18 Thread Lewis John Mcgibbney
[3] https://issues.apache.org/jira/secure/attachment/12707721/SitemapCrawlerLifeCycle.pdf Kind Regards 2015-05-19 1:16 GMT+03:00 Cihad Guzel cguz...@gmail.com: Ok Lewis, I signed up to wiki, my wiki username: cihadguzel Thanks 2015-05-18 23:44 GMT+03:00 Lewis John Mcgibbney

Re: Nutch 2.3 and elasticsearch

2015-05-14 Thread Lewis John Mcgibbney
Hi Saurabh On Wed, May 13, 2015 at 7:38 PM, user-digest-h...@nutch.apache.org wrote: But when I run runtime/local/bin/nutch index -all I get: SolrIndexerJob: java.lang.RuntimeException: job failed: name=Indexer, jobid=job_local830597808_0001 at

Re: GSoC 2015

2015-05-14 Thread Lewis John Mcgibbney
Hi Halil, On Wed, May 13, 2015 at 7:38 PM, user-digest-h...@nutch.apache.org wrote: I had applied the GSoC this year for Nutch Project. Recently I got an email that my application is accepted. My subject is Giving HTML5 support for Apache Nutch 2.x. Lewis John McGibbney and Talat Uyarer

Re: ClassPathException sending topN argument for /job/create using Nutch 2.x RESTApi

2015-05-12 Thread Lewis John Mcgibbney
Hi Alex, Which version of Nutch 2.x are you using? Yes I think this is a bug and a patch would be great. Thanks Lewis On Sat, May 9, 2015 at 4:31 PM, user-digest-h...@nutch.apache.org wrote: Hi Lewis, Thanks for replying, I will try and open a ticket after I'm sure its a Nutch bug and

Re: Where is index-static plugin in nutch 2.x?

2015-05-12 Thread Lewis John Mcgibbney
Hi Luigi, On Mon, May 11, 2015 at 5:53 PM, user-digest-h...@nutch.apache.org wrote: Hi Luigi, Which type of static file are you talking about? In 2.x every file is stored in the data store. IndexingJob can index data store rows. The index-static plugin has not been ported to 2.X. If you would like to do

Re: ClassPathException sending topN argument for /job/create using Nutch 2.x RESTApi

2015-05-07 Thread Lewis John Mcgibbney
Hi Alex, On Thu, May 7, 2015 at 11:44 AM, user-digest-h...@nutch.apache.org wrote: Hi All, I'm getting java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long when sending topN argument for /job/create using Nutch 2.x RESTApi. Does anyone know how to fix that?

[ANNOUNCEMENT] Apache Nutch 1.10 Release

2015-05-07 Thread Lewis John Mcgibbney
Hi Everyone, The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.10, we advise all current users and developers of the 1.X series to upgrade to this release. Nutch is a well matured, production ready Web crawler. Nutch 1.x enables fine grained configuration,

[RESULT] WAS Re: [VOTE] Release Apache Nutch 1.10

2015-05-06 Thread Lewis John Mcgibbney
Hi Folks, I would like to close off the VOTE'ing for the Nutch 1.10 release candidate as below. The VOTE'ing resulted in the following [4] +1 Push the release, I am happy :) Lewis John McGibbney Sebastian Nagel Jorge Luis Betancourt González Chris Mattmann [1] +0 I am not bothered either way

Re: [RESULT] WAS Re: [VOTE] Release Apache Nutch 1.10

2015-05-06 Thread Lewis John Mcgibbney
Hi Folks, The results should have been [4] +1 Push the release, I am happy :) Lewis John McGibbney * Sebastian Nagel * Jorge Luis Betancourt González * Chris Mattmann * Julien Nioche * Asitang Mishra [1] +0 I am not bothered either way John Lafitte [0] -1 I am not happy with this release

Reverse Geocoding with Nutch 1.10

2015-04-30 Thread Lewis John Mcgibbney
Hi user@ dev@, Check out some of the services we've implemented in Nutch 1.x. The blog post introduces how we can use MaxMind's GeoIP services to implement reverse geocoding for server IP addresses. Enjoy http://blog.maxmind.com Lewis -- *Lewis*

Re: NUTCH REST API for distributed mode

2015-04-29 Thread Lewis John Mcgibbney
Hi Dzmitry, On Sat, Apr 25, 2015 at 5:08 AM, user-digest-h...@nutch.apache.org wrote: I found out that there is a REST API for 2.x branch. However it works only for local hadoop mode. How did you verify this? Is there any way to work with REST API in hadoop distributed mode? I suppose

Re: [ANNOUNCE] New Nutch committer and PMC - Guiseppe Totaro

2015-04-29 Thread Lewis John Mcgibbney
Dynamite Giuseppe! On Sat, Apr 25, 2015 at 5:08 AM, user-digest-h...@nutch.apache.org wrote:

Re: Nutch 2.3.1 HBASE Invalid Field Values

2015-04-29 Thread Lewis John Mcgibbney
Hi Arthur, On Sat, Apr 25, 2015 at 5:08 AM, user-digest-h...@nutch.apache.org wrote: My Nutch is 2.3 with Gora and Hbase, below are the sample field values I have scanned from HBase here: [snip] Q: Is there a way to configure Nutch/Gora/HBase so it will store the value like following

[VOTE] Release Apache Nutch 1.10

2015-04-29 Thread Lewis John Mcgibbney
Hi user@ dev@, This thread is a VOTE for releasing Apache Nutch 1.10. The release candidate comprises the following components. * A staging repository [0] containing various Maven artifacts * A branch-1.10 of the trunk code [1] * The tagged source upon which we are VOTE'ing [2] * Finally, the release

Re: Possible Mismatch Variable Name in nutch-default.xml

2015-04-23 Thread Lewis John Mcgibbney
Hi Jeff, On Thu, Apr 23, 2015 at 10:06 AM, user-digest-h...@nutch.apache.org wrote: I am going through the nutch-default.xml file to learn and understand where and how each of the config values are utilized. The subcollection property in nutch-default.xml is: property

Re: webpage.p table is empty

2015-04-13 Thread Lewis John Mcgibbney
Hi Okello, On Fri, Apr 10, 2015 at 8:31 PM, user-digest-h...@nutch.apache.org wrote: I've a setup of Nutch 2.3 with Cassandra 2.0.2. Everything seems to be working fine save for one little issue. I'm using the crawl script. The 'p' table is empty even though in the logs I can see there's no

Re: Nutch and (Postgre|My)SQL

2015-04-13 Thread Lewis John Mcgibbney
Hi Andrzej, On Fri, Apr 10, 2015 at 8:31 PM, user-digest-h...@nutch.apache.org wrote: I can't find any information how to do the correct setup with any SQL database. Does someone have any idea what I'm doing wrong? Is the setup using SQL database actually possible? It's safe to say that

Re: bin/nutc webgraph in 2.x

2015-04-13 Thread Lewis John Mcgibbney
Hi Melih, On Fri, Apr 10, 2015 at 8:31 PM, user-digest-h...@nutch.apache.org wrote: Based on https://wiki.apache.org/nutch/CommandLineOptions, bin/nutch webgraph is not available for Nutch 2.x. I would like to use this feature; how could I achieve this in Nutch 2.3? Thanks. You would need

[DEADLINE] Google Summer of Code Deadline Approaching Soon

2015-03-25 Thread Lewis John Mcgibbney
Hi All, The deadline for this year's GSoC student submissions is approaching fast and I would be very keen to see more proposals from the communities above. I've been involved on and off with several students from across all of the above communities hence the reason I am emailing these lists. I

Re: HTTP Post Authentication

2015-03-12 Thread Lewis John Mcgibbney
Hi Tizy, On Thu, Mar 12, 2015 at 12:20 AM, user-digest-h...@nutch.apache.org wrote: Is there any detailed step by step explanation on how to implement HTTPPostAuthentication on Nutch 1.10? https://github.com/apache/nutch/blob/trunk/conf/httpclient-auth.xml.template#L61-L105

Re: Nutch 2.3 Build Error, Please help

2015-03-12 Thread Lewis John Mcgibbney
Hi Arthur, On Thu, Mar 12, 2015 at 12:20 AM, user-digest-h...@nutch.apache.org wrote: I downloaded http://svn.apache.org/repos/asf/nutch/branches/2.x/ and re-ran the compilation, but still got the error. Question: Are the following dependencies correctly set in my ivy.xml? dependency

Re: need a little bit apache nutch ..

2015-03-05 Thread Lewis John Mcgibbney
Hopefully this makes better sense. Lewis On Thursday, March 5, 2015, Gaplan gap...@gmail.com wrote: Thanks for the answer Lewis. I can't understand this: "Also please ensure that your urlfilter permits '?' In URLS entries". How can I do that? On Thu, Mar 5, 2015 at 10:17 PM, Lewis John Mcgibbney

need a little bit apache nutch ..

2015-03-05 Thread Lewis John Mcgibbney
Hi, Please see http://wiki.apache.org/nutch/FAQ#Nutch_doesn.27t_crawl_relative_URLs.3F_Some_pages_are_not_indexed_but_my_regex_file_and_everything_else_is_okay_-_what_is_going_on.3F Also please ensure that your urlfilter permits '?' In URLS entries Hth Lewis On Thursday, March 5, 2015, Gaplan

Re: Nutch 2 with Cassandra as a storage is not crawling data properly

2015-03-03 Thread Lewis John Mcgibbney
Hi Folks Sumant, On Sun, Mar 1, 2015 at 1:14 PM, user-digest-h...@nutch.apache.org wrote: Do you think its the issue of fetch job and parser job ? It is a bug with gora-cassandra which I've logged at the issue below and I am working on a fix right now.

Re: getting Not implemented by the DistributedFileSystem FileSystem implementation

2015-03-03 Thread Lewis John Mcgibbney
Hi yeshwanth, On Tue, Mar 3, 2015 at 1:48 PM, user-digest-h...@nutch.apache.org wrote: any pointers on how to resolve this issue. Yes, please see NUTCH-1946, I just uploaded another patch which is working for me. I am working my way through the Cassandra bug which is a real PITA. Thanks

Can anyone fetch this page?

2015-02-27 Thread Lewis John Mcgibbney
Hi Folks, I was getting 500 internal server error using Nutch trunk when attempting to fetch content from this domain. http://www.nature.com Just for detail, Nature.com is a catalogue of journals and science resources, including the journal *Nature*. Publishes science news and articles across a

Re: questions about the webui packages

2015-02-25 Thread Lewis John Mcgibbney
Hi lujinhong, On Wed, Feb 25, 2015 at 3:06 PM, user-digest-h...@nutch.apache.org wrote: I found some codes in package “org.apache.nutch.webui” in the nutch source. What are these codes for? They are using the Web Administration UI powered by the Nutch 2.X REST API which is

Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data

2015-02-25 Thread Lewis John Mcgibbney
Hi Jonathan, There are another two threads ongoing, namely http://www.mail-archive.com/user%40nutch.apache.org/msg13237.html and http://www.mail-archive.com/user%40nutch.apache.org/msg13235.html Please monitor those links and we can take it from there. I would strongly suggest that you set

Re: Nutch 2 with Cassandra as a storage is not crawling data properly

2015-02-25 Thread Lewis John Mcgibbney
Hi sumant, I've pasted your Hadoop counters below. It would appear that for the ParseJob task, no record is being passed as the input to the MR framework. This is the issue. There is a problem between FetcherJob and ParserJob. Can you readdb between fetching and parsing? If you get out a record

Re: Nutch 2 with Cassandra as a storage is not crawling data properly

2015-02-24 Thread Lewis John Mcgibbney
Hi Sumant, Please see my replies below On Mon, Feb 23, 2015 at 10:11 PM, user-digest-h...@nutch.apache.org wrote: I am using Nutch 2.x using Cassandra as storage. Currently I am just crawling only one website, and data is getting loaded to Cassandra in byte code format. When I use readdb

Re: Nutch 2.3 with Cassandra, not crawling beyond initial seed link.

2015-02-24 Thread Lewis John Mcgibbney
Hi Chris, Please see responses inline On Mon, Feb 23, 2015 at 10:11 PM, user-digest-h...@nutch.apache.org wrote: I am using Nutch with Cassandra to perform web crawling, both for the first time. I have yet been able to retrieve links beyond the first initial seed link. I am using

Re: Nutch 2.3 Build Error, Please help

2015-02-22 Thread Lewis John Mcgibbney
Hi Arthur, This is due to restlet removing some of their dependencies from public consumption I think! It is out of our hands and happened after we released the 2.3 release. Without knowing which backend you are trying to use, I would suggest that you upgrade to the 2.3.1 branch which is the live

Re: about indexing to multiple solr servers

2015-02-22 Thread Lewis John Mcgibbney
Hi On Fri, Feb 20, 2015 at 1:04 PM, user-digest-h...@nutch.apache.org wrote: Thanks Lewis for your answer. I have read the post and it is great that NUTCH-1480 was assigned to Markus. I agree with you that maybe it will be done in Nutch 1.10 trunk; however, it is no problem if it is for 1.11. I

[ANNOUNCE] Apache Gora 0.6 Released

2015-02-19 Thread Lewis John Mcgibbney
Hi Folks, The Apache Gora team are pleased to announce the immediate availability of Apache Gora 0.6. This release addresses a modest 47 issues http://s.apache.org/gora-0.6 with some being major improvements, new functionality and dependency upgrades. Most notably the release involves key

Re: about indexing to multiple solr servers

2015-02-18 Thread Lewis John Mcgibbney
Hi Eyeris, On Wed, Feb 18, 2015 at 12:10 PM, user-digest-h...@nutch.apache.org wrote: I have a question and sorry if it is a trivial thing. Is there any way to index in multiple Solr servers (at least 2) using Nutch 1.9? I have configured Solr with one master and 2 slaves, but I need 2

Re: Need to crawl the site that requires flash to be enabled

2015-02-08 Thread Lewis John Mcgibbney
Hi Kartik and Alexis, On Fri, Feb 6, 2015 at 5:19 AM, user-digest-h...@nutch.apache.org wrote: The site you're trying to crawl is a Flash website. Unfortunately that will be a problem for Nutch. Nutch doesn't render the page, only fetches it. It won't load Flash, CSS or JS that are included

[INVITATION] Apache Nutch Google Summer of Code 2015

2015-02-05 Thread Lewis John Mcgibbney
Hi Folks, The Nutch team are currently on the lookout for interested students willing to engage in this year's Google Summer of Code Program [0]. What is GSoC? A global program that offers students stipends to write code for open source projects. In 2014 the Apache Nutch project participated in a

Re: Compiling Nutch 2.3 for Mongo (or Solr)

2015-02-04 Thread Lewis John Mcgibbney
Hi Alexis, On Wed, Feb 4, 2015 at 5:14 AM, user-digest-h...@nutch.apache.org wrote: I've had some luck compiling for Mongo, but I get a NullPointerException while injecting seeds. What version of MongoDb are you using? Supported version is 2.12.2, this is suited to recent Nutch 2.3. It

Re: Nutch IRI URIs

2015-01-28 Thread Lewis John Mcgibbney
Hi Talat, On Wed, Jan 28, 2015 at 10:07 PM, user-digest-h...@nutch.apache.org wrote: Subject: Nutch IRI URIs Hi all, Do you have any idea how Nutch can handle IRI URIs? My experience using IRIs is limited to the legal informatics domain where they are used pretty extensively in legal

Re: nutch2.3 with hbase 0.90.4

2015-01-28 Thread Lewis John Mcgibbney
Hi Zein, Please see the release announcement regarding versioning for backend datastore support http://nutch.apache.org/#22-january-2015-nutch-23-release On Wed, Jan 28, 2015 at 10:07 PM, user-digest-h...@nutch.apache.org wrote: I am trying to configure nutch2.3 with hbase 0.90.4 on ubuntu

Re: Problems with web sites using HTTPS in Nutch 1.9

2015-01-26 Thread Lewis John Mcgibbney
Hey Yoniel, On Thu, Jan 22, 2015 at 9:00 AM, user-digest-h...@nutch.apache.org wrote: Lewis, I have reviewed the httpclient-configuration but the main problem is that I can't crawl an HTTPS site that uses a self-signed certificate. How can I fix this problem? Did you see this thread and

ProtocolStatus 16 'Exception' for particular domain

2015-01-26 Thread Lewis John Mcgibbney
Hi Folks, I'm working on obtaining forum data posted for various topics from across a number of web sites. An example would be the technology-related posts from http://www.hackforums.net. If I take the above site as an example, and attempt to use parsechecker, I get the following with protocol-http

Re: How to collect links with Apache Nutch 2.3

2015-01-24 Thread Lewis John Mcgibbney
Hi Adamantios, On Sat, Jan 24, 2015 at 2:05 PM, user-digest-h...@nutch.apache.org wrote: How to tell Apache Nutch 2.3 to go through all http://URL/?pg={X} pages, with {X} going from 1 to 348, ^(0?[1-9]|[1-4][0-9]|348)$ Please try the above substituting your variable with the proposed regex.
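Before putting a numeric-range pattern into a URL filter, it can help to list exactly which page numbers it accepts. The sketch below does that in Python for the regex quoted above; the syntax used is also valid in Java regular expressions. The second pattern, `full_range`, and the helper `matched_pages` are illustrative assumptions and do not come from the thread: `full_range` is one possible way to cover every integer from 1 to 348.

```python
import re

# Regex exactly as quoted in the reply above.
quoted = re.compile(r"^(0?[1-9]|[1-4][0-9]|348)$")

# Hypothetical alternative (not from the thread): covers 1-9, 10-99,
# 100-299, 300-339 and 340-348, i.e. every integer from 1 to 348.
full_range = re.compile(r"^([1-9]|[1-9][0-9]|[12][0-9]{2}|3[0-3][0-9]|34[0-8])$")


def matched_pages(pattern, upper=400):
    """Return the integers in 0..upper whose decimal form the pattern accepts."""
    return [n for n in range(0, upper + 1) if pattern.match(str(n))]


quoted_pages = matched_pages(quoted)
full_pages = matched_pages(full_range)

# Summarise what each pattern accepts.
print("quoted regex accepts", len(quoted_pages), "page numbers")
print("full_range accepts", len(full_pages), "page numbers")
```

Comparing the two lists shows whether a pattern leaves gaps in the intended range before it is used against real URLs.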

Re: conf files

2015-01-24 Thread Lewis John Mcgibbney
Hi Hesham, On Sat, Jan 24, 2015 at 2:05 PM, user-digest-h...@nutch.apache.org wrote: Are all the configuration files for Nutch in the conf directory? Yes. Also, if I want to have a set of configurations for some URLs and another set of configurations for other URLs I have to create a new

[ANNOUNCE] Apache Nutch 2.3 Release

2015-01-23 Thread Lewis John Mcgibbney
Hi Folks, Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.3. This release bears the fruits of the first Nutch Google Summer of Code program engagement resulting in a Web Application for the Nutch 2.3 REST API. The release also includes upgrades to Gora dependencies

[RESULT] WAS Re: [VOTE] Release Apache Nutch 2.3

2015-01-22 Thread Lewis John Mcgibbney
Hi Everyone, I am closing off this VOTE thread. The VOTE'ing progressed with the following outcome [4] +1 Push the release, I am happy :) Lewis John McGibbney * Renato Marroquín Mogrovejo Sebastian Nagel * Talat Uyarer * [1] +0 I am not bothered either way John Lafitte [0] -1 I am not happy

Re: [VOTE] Release Apache Nutch 2.3

2015-01-21 Thread Lewis John Mcgibbney
Hi Talat, On Sun, Jan 18, 2015 at 3:49 AM, user-digest-h...@nutch.apache.org wrote: I finish my review yet. - AdaptiveFetchSchedular does not work. In default settings float, it needs integer. Please log an issue and set a fix version, this is trivial to fix but a bug which is essential to

Re: Nutch 2.3

2015-01-21 Thread Lewis John Mcgibbney
Hi Shadi, On Thu, Jan 15, 2015 at 6:30 AM, user-digest-h...@nutch.apache.org wrote: Thanks, how can I add avro support? In short you cannot. The next task, if you want to add SQL support back into Gora and subsequently Nutch 2.X, is to write the SQL backend for Gora as suggested here

Re: Problems with web sites using HTTPS in Nutch 1.9

2015-01-20 Thread Lewis John Mcgibbney
Hi Yoniel, Please read the following https://wiki.apache.org/nutch/HttpAuthenticationSchemes#Need_Help.3F If nothing here provides you with a better idea then please write back to us here. Put simply we need more information regarding how your httpclient-configuration has been set up. Thanks

Re: [VOTE] Release Apache Nutch 2.3

2015-01-15 Thread Lewis John Mcgibbney
. Lewis On Fri, Jan 9, 2015 at 3:58 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi user@ dev@, This thread is a VOTE for releasing Apache Nutch 2.3. Quite incredibly we addressed 143 issues as per the release report http://s.apache.org/nutch_2.3 The release candidate comprises

Re: gora-sql module

2015-01-11 Thread Lewis John Mcgibbney
Hi Tamer, On Fri, Jan 9, 2015 at 6:38 PM, user-digest-h...@nutch.apache.org wrote: Guys I came across the thread from Feb 2013 at http://mail-archives.apache.org/mod_mbox/nutch-user/201302.mbox/%3CCAPp-OAu2HT82H8hHZ4B=zxch2+29ncjbfv+wagfp3wdpzex...@mail.gmail.com%3E As I'm trying to use

Differences between parse-html and parse-tika for generation of parse metadata

2015-01-10 Thread Lewis John Mcgibbney
Hi Folks, Is the aim to have identical output from parse-tika and parse-html for rendering of parse metadata? With Nutch 1.10-SNAPSHOT with no local source code modifications, if we take the following page [0], and turn metatags.names to wildcard *, with parse-tika I get Parse Metadata:

Re: Differences between parse-html and parse-tika for generation of parse metadata

2015-01-10 Thread Lewis John Mcgibbney
BOOM https://issues.apache.org/jira/browse/NUTCH-1815 On Sat, Jan 10, 2015 at 10:15 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Folks, Is the aim to have identical output from parse-tika and parse-html for rendering of parse metadata? With Nutch 1.10-SNAPSHOT with no local

nutchpy

2015-01-09 Thread Lewis John Mcgibbney
Hi Folks, Just wanted to make folk aware of some work Continuum Analytics have been doing on bringing Nutch to the Python community. https://github.com/ContinuumIO/nutchpy Continuum are the folks behind most of the scientific Python stuff you've ever used. If you've used Python before, then

[VOTE] Release Apache Nutch 2.3

2015-01-09 Thread Lewis John Mcgibbney
Hi user@ dev@, This thread is a VOTE for releasing Apache Nutch 2.3. Quite incredibly we addressed 143 issues as per the release report http://s.apache.org/nutch_2.3 The release candidate comprises the following components. * A staging repository [0] containing various Maven artifacts * A

Re: Problems with DomainStatistics

2015-01-07 Thread Lewis John Mcgibbney
Hi Markus, On Wed, Jan 7, 2015 at 7:42 PM, user-digest-h...@nutch.apache.org wrote: Hi - it is a strange piece indeed. You cannot just tell it where the crawldb is, you need to tell it where the directory is, so specifying current is ok, but not part-* Thanks very much. I'll cook a patch up

Re: Potential Bug in 2.X HostDbUpdateReducer

2015-01-07 Thread Lewis John Mcgibbney
Hi Renato, On Thu, Dec 11, 2014 at 5:52 AM, user-digest-h...@nutch.apache.org wrote: From quickly checking out the code (Host.java + HostDB + HostDBUpdateReducer) it would seem like there is a bug exactly where you pointed. LOGGED!!! https://issues.apache.org/jira/browse/NUTCH-1907 WOW

Re: Help regarding headings plugin

2015-01-07 Thread Lewis John Mcgibbney
Hi Krishna, On Thu, Dec 11, 2014 at 5:52 AM, user-digest-h...@nutch.apache.org wrote: When I dump data from segments, I am getting the entire HTML data. Should it not be just the headings read from crawling? Why am I getting the entire data? Please help me. Thanks in advance. No this is

Problems with DomainStatistics

2015-01-07 Thread Lewis John Mcgibbney
Hi Folks, Does anyone else have problems with the DomainStatistics [0] tool? I use it as follows ./bin/nutch domainstats /usr/local/.../crawldb/old/part-0/ output tld Although it is generated, nothing is written to the output directory ./bin/nutch domainstats

Potential Bug in 2.X HostDbUpdateReducer

2014-12-08 Thread Lewis John Mcgibbney
Hi Folks, I was looking into the code within Nutch 2.X HostDbUpdateReducer and 'think' I've discovered a bug in the way we output Host data. https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/host/HostDbUpdateReducer.java#L87 I feel that the following code

Re: org.apache.solr.common.SolrException, unknown field 'host'

2014-12-04 Thread Lewis John Mcgibbney
Hey Arthur, On Thu, Dec 4, 2014 at 4:27 AM, user-digest-h...@nutch.apache.org wrote: Any idea why the field ‘host’ is not loaded by SOLR? Apologies for missing this thread. Have you tried restarting your Solr core? I would suggest that you move your old log(s) to an archive directory then

Re: org.apache.solr.common.SolrException, unknown field 'host'

2014-12-04 Thread Lewis John Mcgibbney
Hi Arthur, Additionally, I would suggest that you try both the parse checker and index checker tools on the offending URL http://nutch.apache.org/apidocs/apidocs-1.1/allclasses-frame.html] unknown field 'host' On Thu, Dec 4, 2014 at 4:27 AM, user-digest-h...@nutch.apache.org wrote: user

Re: Nutch 2.X question

2014-11-06 Thread Lewis John Mcgibbney
Copying in user@ On Thu, Nov 6, 2014 at 6:37 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi amit, On Thu, Nov 6, 2014 at 1:54 PM, dev-digest-h...@nutch.apache.org wrote: I have a small question about Nutch 2.X source code, i hope this is the right mailing list for that. i

Re: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.

2014-10-30 Thread Lewis John Mcgibbney
Hi Segar, On Wed, Oct 29, 2014 at 11:40 PM, user-digest-h...@nutch.apache.org wrote: Follow the following steps: 1. Execute 'ant job' i.e. open build.xml and execute 'runtime(default)' target. It will generate 'runtime' folder in project. 2. Open nutch-default.xml and update

Re: SOLR + Nutch save the seeds in Solr

2014-10-20 Thread Lewis John Mcgibbney
Hi Pablo, This question has been raised a number of times on the user@nutch list; you can use the archives linked to from the Nutch website. I would suggest that the seed be populated to a new page metadata, which could then be added via an indexing filter. There may be other ways for achieving

Re: Integrating Nutch search functionality into a Java application

2014-10-17 Thread Lewis John Mcgibbney
Hi ozzy19 On Fri, Oct 17, 2014 at 11:09 AM, user-digest-h...@nutch.apache.org wrote: Running the code on this url: http://wiki.apache.org/nutch/JavaDemoApplication I get the following message: Found 0 hits. how so? why do not you search pages that contain the keyword? This page is very

[ANNOUNCEMENT] crawler-commons 0.5 is released

2014-10-15 Thread Lewis John Mcgibbney
15th October 2014 - crawler-commons 0.5 is released We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves Sitemap parsing as well as an upgrade to Apache Tika 1.6 http://tika.apache.org. See the CHANGES.txt
