Hey,
I finally solved it! It had to do with my Cassandra cluster. My Hadoop and
Cassandra clusters were in two different datacenters, which caused Cassandra
requests to time out. And that meant the generate phase didn’t have any input!
Works like a charm now :)
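In case it helps anyone else: with the gora-cassandra backend, the server list Nutch connects to lives in conf/gora.properties, so that is the place to make sure you are pointing at nodes in the same datacenter as Hadoop. A minimal sketch (property names as in gora-cassandra bundled with Nutch 2.x; host names are placeholders):

```shell
# Sketch of conf/gora.properties for the Cassandra backend.
# Host:port values are placeholders -- use Cassandra nodes
# co-located with the Hadoop cluster to avoid cross-DC timeouts.
cat > gora.properties <<'EOF'
gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
gora.cassandrastore.servers=cass-dc1-node1:9160,cass-dc1-node2:9160
EOF
cat gora.properties
```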
Regards
--
Manikandan Saravanan
}, FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1135, FILE_BYTES_WRITTEN=157112, HDFS_BYTES_WRITTEN=86}, File Output Format Counters={BYTES_WRITTEN=86}
14/06/05 15:14:19 INFO crawl.WebTableReader: TOTAL urls:0
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople
I built it from Nutch 2.2.1 (src-tar.gz).
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople
On 6 June 2014 at 1:03:18 am, Lewis John Mcgibbney (lewis.mcgibb...@gmail.com)
wrote:
which version of Nutch are you using?
Nutch 2 what?
On Thu, Jun 5, 2014 at 12:14 PM, Manikandan
<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
</regex-normalize>
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople
On 6 June 2014 at 1:54:02 am, Lewis John Mcgibbney (lewis.mcgibb...@gmail.com)
wrote:
I suspect that your generator
snapshot=1992785920
14/05/28 07:19:33 INFO mapred.JobClient: Map output records=0
14/05/28 07:19:33 INFO mapred.JobClient: SPLIT_RAW_BYTES=877
14/05/28 07:19:33 INFO solr.SolrIndexerJob: SolrIndexerJob: done.
Am I missing anything?
--
Manikandan Saravanan
Architect - Technology
Hi,
I’m running Nutch 2 on a Hadoop 1.2.1 cluster with 2 nodes. I’m running Solr 4
separately on a box and I replaced Solr’s schema with Nutch’s Solr-4 schema.
When I run a crawl, I get the following error at the end of the job:
14/05/26 14:08:32 INFO solr.SolrDeleteDuplicates:
statistics: done
What am I missing? My regex and normalise filters are allowing all URL
patterns. I’m trying to do a whole web crawl.
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople
I’m using the crawl script you had given before [0]. What might be wrong?
[0] https://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople
On 5 February 2014 at 3:21:25 pm, Lewis John Mcgibbney
(lewis.mcgibb...@gmail.com) wrote:
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople
On 4 February 2014 at 3:11:36 pm, Lewis John Mcgibbney
(lewis.mcgibb...@gmail.com) wrote:
https://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
On Tue, Feb 4, 2014 at 7:04 AM, Manikandan Saravanan
manikan
I’m using the crawl script that you had linked earlier.
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople
On 4 February 2014 at 7:43:49 pm, Manikandan Saravanan
(manikan...@thesocialpeople.net) wrote:
Okay, the crawl runs well for the most part:
I’m running the crawl script
Can you help me out?
I think there’s something wrong with what we’re passing to bin/nutch updatedb
in the crawl script.
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople
On 4 February 2014 at 8:00:24 pm, Manikandan Saravanan
(manikan...@thesocialpeople.net) wrote:
I’m using
I’m getting this when running the crawl script right after the parse phase
Exception in thread "main" java.lang.IllegalArgumentException: usage: (-crawlId <id>)
Something wrong with updatedb?
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople
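For anyone landing here from a search: if memory serves, the 2.x updatedb job expects a batch id (or -all) in front of the -crawlId flag, and the IllegalArgumentException above is what it throws when that first argument is missing. A command sketch, not runnable outside a Nutch 2.x checkout, with placeholder values:

```
# Hypothetical fix, assuming the 2.x DbUpdaterJob argument order:
bin/nutch updatedb -all -crawlId webcrawl      # update from every fetched batch
# or, using the batch id printed by the generate step:
bin/nutch updatedb <batchId> -crawlId webcrawl
```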
On 5 February 2014 at 1:20:31 am
] https://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople
How do I run the crawl script on hadoop?
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople
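The reply that follows is truncated in this archive. Per the Nutch 2.x tutorial, the crawl script is run from runtime/deploy after building the job jar with ant; launched from there, each phase is submitted to whatever cluster your Hadoop configuration names. A sketch with placeholder arguments (seed directory, crawl id, Solr URL, number of rounds):

```
# From the top of a Nutch 2.x source checkout:
ant runtime                 # builds runtime/local and runtime/deploy
cd runtime/deploy
# bin/crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>
bin/crawl urls/ webcrawl http://solrhost:8983/solr/ 2
```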
On 4 February 2014 at 1:28:39 am, Lewis John Mcgibbney
(lewis.mcgibb...@gmail.com) wrote:
Hi Manikandan,
On Mon, Feb 3, 2014 at 3:45 PM, user-digest-h...@nutch.apache.org wrote:
snapshot bundled but I’m
not getting other dependencies like thrift etc. Kindly help me with a permanent
solution to this problem.
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople