Re: Nutch didn't (fail) to create new segment dir

2014-02-14 Thread Tejas Patil
The logs say this:
 Generator: 0 records selected for fetching, exiting ...
This is because there were no urls that the generator could select to form a
segment.

 Injector: total number of urls injected after normalization and
filtering: 0
Inject did NOT add anything to the crawldb. Check if you are over-filtering
the input urls. Also verify that the urls you are injecting are valid. From
the logs it looks like there were just 4 urls in the seeds file.
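
One quick way to check the filters (a rough sketch; the path to your seeds
file is an assumption) is to pipe the seed urls through the configured filter
chain and see which ones survive:

  # each url is printed prefixed with + (accepted) or - (rejected by a filter)
  bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls/seed.txt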

Thanks,
Tejas


On Fri, Feb 14, 2014 at 4:43 PM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:

 Hi,

 From what I know, nutch generate will create a new segment directory
  every round nutch runs.

 I have a problem (it never happened before) where nutch won't create a new
  segment.
  It always only fetches and parses the latest segment.
 - from the logs:
 2014-02-15 07:20:02,036 INFO  fetcher.Fetcher - Fetcher: segment:
 /opt/searchengine/nutch/BappenasCrawl/segments/20140205213835

 Even though I repeat the processes (generate, fetch, parse, update)
 many times.

 What should I check in the nutch configuration? Any hints to solve
 this problem?

 I use nutch 1.7.

 And here is the part of hadoop log file:
 http://pastebin.com/kpi48gK6

 Thank you.

 --
 wassalam,
 [bayu]



Re: HTML tag filtering

2014-02-13 Thread Tejas Patil
That means that there were changes to the source files since the patch was
created. You need to manually apply the changes from the patch to the source
files.
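
A rough way to do that by hand (using the reject file from your log):

  cat src/plugin/build.xml.rej    # shows the hunk that failed to apply
  vi src/plugin/build.xml         # add those changes manually
  ant clean runtime               # rebuild to confirm everything compiles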

Thanks,
Tejas


On Thu, Feb 13, 2014 at 12:02 AM, Markus Källander 
markus.kallan...@nasdaqomx.com wrote:

 Hi,

 Trying to run the patch command and get this error:

 $ patch -p0 < blacklist_whitelist_plugin.patch
 (Stripping trailing CRs from patch; use --binary to disable.)
 patching file
 src/plugin/index-blacklist-whitelist/src/java/at/scintillation/nutch/BlacklistWhitelistIndexer.java
 (Stripping trailing CRs from patch; use --binary to disable.)
 patching file
 src/plugin/index-blacklist-whitelist/src/java/at/scintillation/nutch/BlacklistWhitelistParser.java
 (Stripping trailing CRs from patch; use --binary to disable.)
 patching file src/plugin/index-blacklist-whitelist/README.txt
 (Stripping trailing CRs from patch; use --binary to disable.)
 patching file src/plugin/index-blacklist-whitelist/build.xml
 (Stripping trailing CRs from patch; use --binary to disable.)
 patching file src/plugin/index-blacklist-whitelist/ivy.xml
 (Stripping trailing CRs from patch; use --binary to disable.)
 patching file src/plugin/index-blacklist-whitelist/plugin.xml
 (Stripping trailing CRs from patch; use --binary to disable.)
 patching file src/plugin/build.xml
 Hunk #1 FAILED at 62 (different line endings).
 1 out of 1 hunk FAILED -- saving rejects to file src/plugin/build.xml.rej

 Any hints? I try to patch it in the source for the tagged 1.7 release.

 Markus Källander

 Mobile +46 73 622 0547





 -Original Message-
 From: Tejas Patil [mailto:tejas.patil...@gmail.com]
 Sent: den 13 februari 2014 01:40
 To: user@nutch.apache.org
 Subject: Re: HTML tag filtering

 On Wed, Feb 12, 2014 at 6:04 AM, Markus Källander 
 markus.kallan...@nasdaqomx.com wrote:

  Hi,
 
  The patch seems to fulfil my needs, but how do I use it with Nutch 1.7


 From your local trunk checkout, run these commands in shell:
 *wget
 https://issues.apache.org/jira/secure/attachment/12495393/blacklist_whitelist_plugin.patch*
 *patch -p0 < blacklist_whitelist_plugin.patch* *ant clean runtime*

 Now you have successfully applied that patch to your local copy of nutch
 codebase. That patch is old and I am not sure if it would compile correctly
 so you have to look in the codebase and tweak it.

 Thanks,
 Tejas

 ? Is the patch not released yet?
 
  Markus Källander
 
  Mobile +46 73 622 0547
 
 
 
 
  -Original Message-
  From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
  Sent: den 11 februari 2014 17:44
  To: user@nutch.apache.org
  Subject: Re: HTML tag filtering
 
  Hi Markus,
 
   in short, you have to write a parse filter plugin which does the following
   in the filter(...) method:
   1. traverse the DOM tree and construct a clean text by skipping
   certain content. See o.a.n.utils.NodeWalker and
   o.a.n.parse.html.DOMContentUtils.getTextHelper(...) (part of the parse-html
   plugin)
   2. then replace the old plain text in ParseResult by the new clean text
 
  Maybe this issue can help (there is also a patch but I'm not sure
  whether it's working and fulfills your needs):
   https://issues.apache.org/jira/browse/NUTCH-585
 
  Sebastian
 
  On 02/11/2014 04:24 PM, Markus Källander wrote:
   Hi,
  
   How do I skip indexing of HTML tags with certain id:s or css
   classes? I
  am using Nutch 1.7.
  
   Thanks
   Markus
  
 
 



Re: sizing guide

2014-02-13 Thread Tejas Patil
On Wed, Feb 12, 2014 at 11:08 PM, Deepa Jayaveer deepa.jayav...@tcs.com wrote:

 Thanks for your reply.
   I started off PoC with Nutch-MySQL. Planned to move to Nutch 2.1 with
 Hbase
 once I get a fair idea about Nutch.
  For our use case, I need to crawl large documents for around 100 web sites
  weekly, and our functionality demands crawling on a daily or even hourly
  basis to extract specific information from around 20 different hosts. Say,
  we need to extract product details from a retailer's site.
  In that case, we need to recrawl the pages to get the latest information.

  As you mentioned, I can do a batch delete of the crawled html data once
  I extract the information from the crawled data. I can expect the
  crawled data to be roughly around 1 TB (it could be deleted on a scheduled
  basis).


If you process the data as soon as it is available, then you might not need to
have 1 TB... unless Nutch gets that much data in a single fetch cycle.


 Will these sizing be fine for Nutch installation in production?
 4 Node Hadoop cluster with 2 TB storage each
 64 GB RAM each
 10 GB heap


Looks fine. You need to monitor the crawl for the first week or two to
know whether you need to change this setup.


  Apart from that, we need to do HBase data sizing to store the product
  details (which
  would be around 400 GB of data).
  Can I use the same HBase cluster to store the extracted data where Nutch
  is running?


Yes you can. HBase is a black box to me and it would have a bunch of its
own configs which you could tune.


 Can you please let me know your suggestion or recommendations.


 Thanks and Regards
 Deepa Devi Jayaveer
 Mobile No: 9940662806
 Tata Consultancy Services
 Mailto: deepa.jayav...@tcs.com
 Website: http://www.tcs.com
 
 Experience certainty.   IT Services
 Business Solutions
 Consulting
 



 From:
 Tejas Patil tejas.patil...@gmail.com
 To:
 user@nutch.apache.org user@nutch.apache.org
 Date:
 02/13/2014 05:58 AM
 Subject:
 Re: sizing guide



 If you are looking for a specific Nutch 2.1 + MySQL combination, I think
  that
  there won't be any on the project wiki.

 There is no perfect answer for this as it depends on these factors (this
 list may go on):
 - Nature of data that you are crawling: small html files or large
 documents.
 - Is it a continuous crawl or few levels ?
 - Are you re-crawling urls ?
 - How big is the crawl space ?
 - Is it an intranet crawl? How frequently do the pages change?

 Nutch 1.x would be a perfect fit for prod level crawls. If you still want
 to use Nutch 2.x, it would be better to switch to some other datastore
 (eg.
 HBase).

 Below are my experiences with two use cases wherein Nutch was used over
 prod with Nutch 1.x:

 (A) Targeted crawl of a single host
 In this case I wanted to get the data crawled quickly and didn't bother
 about the updates that would happen to the pages. I started off with a
 five
 node Hadoop cluster but later did the math that it won't get my work done
 in few days (remember that you need to have a delay between successive
 requests which the server agrees on else your crawler is banned). Later I
 bumped the cluster to 15 nodes. The pages were HTML files with size
 roughly
 200k. The crawled data roughly needed 200GB and I had storage of about
 500GB.

 (B) Open crawl of several hosts
 The configs and memory settings were driven by the prod hardware. I had a
 4
 node hadoop cluster with 64 GB RAM each. 4 GB heap configured for every
 hadoop job with an exception of generate job which needed more heap (8-10
 GB). There was no need to store the crawled data and every batch was
 deleted as soon as it was processed. That said, the disk had a capacity of
 2 TB.

 Thanks,
 Tejas

 On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer
 deepa.jayav...@tcs.com wrote:

  Hi ,
  I am using Nutch 2.1 with MySQL. Is there a sizing guide available for
 Nutch
  2.1?
  Are there any recommendations that could be given on sizing memory, CPU and
  disk space for crawling?
 
  Thanks and Regards
  Deepa Devi Jayaveer
  Mobile No: 9940662806
  Tata Consultancy Services
  Mailto: deepa.jayav...@tcs.com
  Website: http://www.tcs.com
  
  Experience certainty.   IT Services
  Business Solutions
  Consulting
  

Re: how can I download the source code of Nutch's dependency jars

2014-02-12 Thread Tejas Patil
Have you tried this ?
http://java.dzone.com/articles/ivy-how-retrieve-source-codes

Thanks,
Tejas


On Wed, Feb 12, 2014 at 12:43 AM, Gavin 274614...@qq.com wrote:

 Maven can do this.
  How can I do this with ivy?




 -- Original --
 From:  274614348;274614...@qq.com;
 Date:  Wed, Feb 12, 2014 04:37 PM
 To:  useruser@nutch.apache.org;

  Subject:  how can I download the source code of Nutch's dependency jars



  How can I download the dependency jars' source code with ivy?

  Thanks a lot!



Re: sizing guide

2014-02-12 Thread Tejas Patil
If you are looking for a specific Nutch 2.1 + MySQL combination, I think that
there won't be any on the project wiki.

There is no perfect answer for this as it depends on these factors (this
list may go on):
- Nature of data that you are crawling: small html files or large documents.
- Is it a continuous crawl or few levels ?
- Are you re-crawling urls ?
- How big is the crawl space ?
- Is it an intranet crawl? How frequently do the pages change?

Nutch 1.x would be a perfect fit for prod level crawls. If you still want
to use Nutch 2.x, it would be better to switch to some other datastore (eg.
HBase).

Below are my experiences with two use cases wherein Nutch was used over
prod with Nutch 1.x:

(A) Targeted crawl of a single host
In this case I wanted to get the data crawled quickly and didn't bother
about the updates that would happen to the pages. I started off with a five
node Hadoop cluster but later did the math that it won't get my work done
in few days (remember that you need to have a delay between successive
requests which the server agrees on else your crawler is banned). Later I
bumped the cluster to 15 nodes. The pages were HTML files with size roughly
200k. The crawled data roughly needed 200GB and I had storage of about
500GB.

(B) Open crawl of several hosts
The configs and memory settings were driven by the prod hardware. I had a 4
node hadoop cluster with 64 GB RAM each. 4 GB heap configured for every
hadoop job with an exception of generate job which needed more heap (8-10
GB). There was no need to store the crawled data and every batch was
deleted as soon as it was processed. That said, the disk had a capacity of
2 TB.

Thanks,
Tejas

On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer deepa.jayav...@tcs.com wrote:

 Hi ,
 I am using Nutch 2.1 with MySQL. Is there a sizing guide available for Nutch
 2.1?
 Are there any recommendations that could be given on sizing memory, CPU and
 disk space for crawling?

 Thanks and Regards
 Deepa Devi Jayaveer
 Mobile No: 9940662806
 Tata Consultancy Services
 Mailto: deepa.jayav...@tcs.com
 Website: http://www.tcs.com
 
 Experience certainty.   IT Services
 Business Solutions
 Consulting
 





Re: HTML tag filtering

2014-02-12 Thread Tejas Patil
On Wed, Feb 12, 2014 at 6:04 AM, Markus Källander 
markus.kallan...@nasdaqomx.com wrote:

 Hi,

 The patch seems to fulfil my needs, but how do I use it with Nutch 1.7


From your local trunk checkout, run these commands in shell:
*wget
https://issues.apache.org/jira/secure/attachment/12495393/blacklist_whitelist_plugin.patch*
*patch -p0 < blacklist_whitelist_plugin.patch*
*ant clean runtime*

Now you have successfully applied that patch to your local copy of nutch
codebase. That patch is old and I am not sure if it would compile correctly
so you have to look in the codebase and tweak it.

Thanks,
Tejas

? Is the patch not released yet?

 Markus Källander

 Mobile +46 73 622 0547




 -Original Message-
 From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
 Sent: den 11 februari 2014 17:44
 To: user@nutch.apache.org
 Subject: Re: HTML tag filtering

 Hi Markus,

  in short, you have to write a parse filter plugin which does the following
  in the filter(...) method:
  1. traverse the DOM tree and construct a clean text by skipping certain
  content. See o.a.n.utils.NodeWalker and
  o.a.n.parse.html.DOMContentUtils.getTextHelper(...) (part of the parse-html
  plugin)
  2. then replace the old plain text in ParseResult by the new clean text

 Maybe this issue can help (there is also a patch but I'm not sure whether
 it's working and fulfills your needs):
  https://issues.apache.org/jira/browse/NUTCH-585

 Sebastian

 On 02/11/2014 04:24 PM, Markus Källander wrote:
  Hi,
 
  How do I skip indexing of HTML tags with certain id:s or css classes? I
 am using Nutch 1.7.
 
  Thanks
  Markus
 




Re: Nutch 2.2.1 Build stuck while trying to access http://ant.apache.org/ivy/

2014-02-08 Thread Tejas Patil
This has more to do with ant than with nutch. Here is a wild idea:

Grab a linux box without any internet restrictions, download nutch on it
and build it. In the user home, there would be a hidden directory .ivy2
which is a local ivy cache. Create a tarball of the same and scp it over to
your work machine, extract it in the home directory and then run the nutch build.
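
Something along these lines (a sketch; host names and paths are placeholders):

  # on the unrestricted box, after building nutch once so ~/.ivy2 is populated
  cd ~ && tar czf ivy2-cache.tar.gz .ivy2
  scp ivy2-cache.tar.gz you@work-machine:~

  # on the restricted work machine
  cd ~ && tar xzf ivy2-cache.tar.gz
  cd nutch && ant clean runtime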

PS: I have never done this for ivy but for maven and it had worked.

~tejas


On Fri, Feb 7, 2014 at 2:18 PM, A Laxmi a.lakshmi...@gmail.com wrote:

 Hi,

 I am having issues building Nutch 2.2.1 behind my company firewall. My
 build gets stuck here:

 [ivy:resolve] :: loading settings :: file =
 ~/nutchtest/nutch/ivy/ivysettings.xml

 When I contacted the hosting admin, they said - Ant is trying to download
 files from internet and it will have problems with our firewalls. You will
 either have to download the files yourself and then scp/sftp them to the
 machine. Unfortunately we don't have an http proxy.


 From further digging, I could see Ant is trying to access this link
 http://ant.apache.org/ivy/. Could anyone please advise what I should do to
 make Ant compile Nutch without accessing the internet? I can download
 required files from http://ant.apache.org/ivy/ and scp/sftp to the server
 but I am not sure what files to download and where to put them?

 Thanks for your help!!



Re: Strange: Nutch didn't crawl level 2 (depth 2) pages

2014-02-02 Thread Tejas Patil
On Sun, Feb 2, 2014 at 5:54 PM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:

 Hi Tejas,

 It works, and it's great! :)
 After reconfiguring and many rounds of generate, fetch, parse & update, the
 pages on the 2nd level are being crawled.

 One question: is it fine and correct if I modify my current
 crawler+indexing script into this pseudo-skeleton:

 
 # example number of levels / depth (loop)
 LOOP=4

 nutch-inject()

 loop[ <= $LOOP]
 {
 nutch-generate()
 nutch-fetch(a_segment)
 nutch-parse(a_segment)
 nutch-updatedb(a_segment)
 }

 nutch-solrindex()

 I don't think that this should be a problem. Remember to pass all the
segments generated in the crawl loop to the solrindex job using the -dir
option.
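
For example, with the paths from your script (a sketch of that final step):

  $NUTCH/bin/nutch solrindex http://localhost:8080/solr/bappenasgoid/ \
      $NUTCH/BappenasCrawl/crawldb -dir $NUTCH/BappenasCrawl/segments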



 Thank you!


 On Mon, Jan 27, 2014 at 3:46 AM, Bayu Widyasanyata
  bwidyasany...@gmail.com wrote:

  OK I will apply it first and update the result.
 
  Thanks.-
 
 
  On Sun, Jan 26, 2014 at 11:01 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:
 
   Please copy this at the end (but above the closing tag '</configuration>')
  in
   your $NUTCH/conf/nutch-site.xml:
 
   <property>
     <name>http.content.limit</name>
     <value>9</value>
   </property>

   <property>
     <name>http.timeout</name>
     <value>2147483640</value>
   </property>

   <property>
     <name>db.max.outlinks.per.page</name>
     <value>9</value>
   </property>
 
  Please check if the url got fetched correctly after every round:
  For the first round with seed as http://bappenas.go.id, after
 updatedb
  job, run these to check if they are in the crawldb. The first url must
  be
  db_fetched while the second one must be db_unfetched:
 
  bin/nutch readdb YOUR_CRAWLDB -url http://bappenas.go.id/
  bin/nutch readdb YOUR_CRAWLDB -url
 
 
 http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/
 
  Now crawl for the next depth. After the updatedb job, check if the second
 url
  got fetched using the same command again. ie.
  bin/nutch readdb YOUR_CRAWLDB -url
 
 
 http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/
 
  Note that if there was any redirection, you need to look for the target
   url
   in the redirection chain and use that url for further debugging. Verify whether
   the content you got for that url has the text "Liberal Party" in the parsed
  output using this command:
 
  bin/nutch readseg -get LATEST_SEGMENT
 
 
 http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/
 
  For larger segments, you might get an OOM error. In that case, take
 the
  entire segment dump using:
  bin/nutch readseg -dump LATEST_SEGMENT  OUTPUT
 
  After all this is verified and everything looks good from the crawling
  side, run solrindex and check if you get the query results. If not, then
  there was a problem while indexing the stuff.
 
  Thanks,
  Tejas
 
 
  On Sun, Jan 26, 2014 at 9:09 AM, Bayu Widyasanyata
  bwidyasany...@gmail.comwrote:
 
   Hi,
  
   I just realized that my nutch didn't crawl the articles/pages (depth
 2)
   which shown on frontpage.
   My target URL is: http://bappenas.go.id
  
    As shown on that frontpage (top right, below the slider banners) there
   is a
    text link:
  
   Kerjasama Pembangunan Indonesia-Australia Setelah PM Tony Abbot
   and its URL:
  
  
 
 http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/?kid=1390691937
  
    I tried to search with the keyword "Liberal Party" (with quotes) which
   appears
    on the link (page) above but got no result :(
  
   Following is the search link queried:
  
  
 
 http://bappenas.go.id/index.php/bappenas_search/result?q=%22Liberal+Party%22
  
   I use individual script to crawl below:
  
   ===
   # Defines env variables
   export JAVA_HOME=/opt/searchengine/jdk1.7.0_45
   export PATH=$JAVA_HOME/bin:$PATH
   NUTCH=/opt/searchengine/nutch
  
   # Start by injecting the seed url(s) to the nutch crawldb:
   $NUTCH/bin/nutch inject $NUTCH/BappenasCrawl/crawldb
  $NUTCH/urls/seed.txt
  
   # Generate fetch list
   $NUTCH/bin/nutch generate $NUTCH/BappenasCrawl/crawldb
   $NUTCH/BappenasCrawl/segments
  
   # last segment
   export SEGMENT=$NUTCH/BappenasCrawl/segments/`ls -tr
   $NUTCH/BappenasCrawl/segments|tail -1`
  
   # Launch the crawler!
   $NUTCH/bin/nutch fetch $SEGMENT -noParsing
  
   # Parse the fetched content:
   $NUTCH/bin/nutch parse $SEGMENT
  
   # We need to update the crawl database to ensure that for all future
   crawls, Nutch only checks the already crawled pages, and only fetches
  new
   and changed pages.
   $NUTCH/bin/nutch updatedb $NUTCH/BappenasCrawl/crawldb $SEGMENT
 -filter
   -normalize
  
   # Indexing our crawl DB with solr
    $NUTCH/bin/nutch solrindex http://localhost:8080/solr/bappenasgoid/ $NUTCH/BappenasCrawl/crawldb -dir $NUTCH/BappenasCrawl/segments
   ===
  
   I run this script daily

Re: Email and blogs crawling

2014-01-28 Thread Tejas Patil
Nutch has these protocols implemented : http, https, ftp, file. As long as
you get links to your documents in those schemes, Nutch would do the crawl.

Thanks,
Tejas



On Tue, Jan 28, 2014 at 10:07 PM, rashmi maheshwari 
maheshwari.ras...@gmail.com wrote:

 I could crawl internet webpages and local directory folders to some extent.

 How do I implement crawling of email and intranet blogs?

 --
 Rashmi
 Be the change that you want to see in this world!



Re: Order of robots file

2014-01-24 Thread Tejas Patil
Hi Markus,
I am trying to understand the problem you described. You meant that with
the original Nutch's robots parsing code, the robots file below allowed
your crawler to crawl stuff:

User-agent: *
Disallow: /

User-agent: our_crawler
Allow: /

But now that you started using the change from NUTCH-1031 [0] (i.e. delegation
of robots parsing to crawler commons), it blocked your crawler. To make
things work, you had to change your robots file to this:

User-agent: our_crawler
Allow: /

User-agent: *
Disallow: /

Did I understand the problem correctly ?

[0] : https://issues.apache.org/jira/browse/NUTCH-1031

Thanks,
Tejas


On Fri, Jan 24, 2014 at 7:29 PM, Markus Jelsma
markus.jel...@openindex.io wrote:

 Hi,

 I am attempting to merge some Nutch changes back to our own. We aren't
 using Nutch's CrawlerCommons impl but the old stuff. But because of the
 recording of response time and rudimentary SSL support I decided to move it
 back to our version. Suddenly I realized a local crawl does not work
 anymore, it seems because of the order of the robots definitions.

 For example:

 User-agent: *
 Disallow: /

 User-agent: our_crawler
 Allow: /

 Does not allow our crawler to fetch URL's. But

 User-agent: our_crawler
 Allow: /

 User-agent: *
 Disallow: /

 Does! This was not the case before, anyone here aware of this? By design?
 Or is it a flaw?

 Thanks
 Markus



Re: Order of robots file

2014-01-24 Thread Tejas Patil
I am working on the scenario you just pointed out. By Apache Nutch, do you
mean the current codebase with CC or the version before that?

CC differs from the original nutch code as CC has kind of a greedy approach
wherein it tries to get a match / mismatch after every line it sees from
the robots file. Back when I was working on the delegation of robots
parsing to Crawler Commons (CC), I remember that there was a difference in
the semantics of the original parsing code and CC's implementation for multiple
robots agents.
Here was my observation at that time:
https://issues.apache.org/jira/browse/NUTCH-1031?focusedCommentId=13558217page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13558217

~tejas


On Fri, Jan 24, 2014 at 8:11 PM, Markus Jelsma
markus.jel...@openindex.io wrote:

 Tejas, the problem exists in Apache Nutch as well. We'll take localhost as
 example and the following config and robots.txt


 # cat /var/www/robots.txt
 User-agent: *
 Disallow: /

 User-agent: nutch
 Allow: /



 config:
    <property>
      <name>http.agent.name</name>
      <value>Mozilla</value>
    </property>
    <property>
      <name>http.agent.version</name>
      <value>5.0</value>
    </property>
    <property>
      <name>http.robots.agents</name>
      <value>nutch,*</value>
    </property>
    <property>
      <name>http.agent.description</name>
      <value>compatible; NutchCrawler</value>
    </property>
    <property>
      <name>http.agent.url</name>
      <value>+http://example.org/</value>
    </property>


 URL: http://localhost/
 Version: 7
 Status: 3 (db_gone)
 Fetch time: Mon Mar 10 15:36:48 CET 2014
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 3888000 seconds (45 days)
 Score: 0.0
 Signature: null
 Metadata:
 _pst_=robots_denied(18), lastModified=0



 Can you confirm?





 -Original message-
  From:Markus Jelsma markus.jel...@openindex.io
  Sent: Friday 24th January 2014 15:29
  To: user@nutch.apache.org
  Subject: RE: Order of robots file
 
  Hi, sorry for being unclear. You understand correctly, I had to change
 the robots.txt order and put our crawler ABOVE User-Agent: *.
 
  I have tried a unit test for lib-http to demonstrate the problem but it
 does not fail. I think I did something wrong in merging the code base. I'll
 look further.
 
  markus@midas:~/projects/apache/nutch/trunk$ svn diff
 src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
  Index:
 src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
  ===
  ---
 src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
   (revision 1560984)
  +++
 src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
   (working copy)
  @@ -50,6 +50,23 @@
     + "" + CR
     + "User-Agent: *" + CR
     + "Disallow: /foo/bar/" + CR;   // no crawl delay for other agents
  +
  +  private static final String ROBOTS_STRING_REVERSE =
  +  "User-Agent: *" + CR
  +  + "Disallow: /foo/bar/" + CR   // no crawl delay for other agents
  +  + "" + CR
  +  + "User-Agent: Agent1 #foo" + CR
  +  + "Disallow: /a" + CR
  +  + "Disallow: /b/a" + CR
  +  + "#Disallow: /c" + CR
  +  + "Crawl-delay: 10" + CR  // set crawl delay for Agent1 as 10 sec
  +  + "" + CR
  +  + "" + CR
  +  + "User-Agent: Agent2" + CR
  +  + "Disallow: /a/bloh" + CR
  +  + "Disallow: /c" + CR
  +  + "Disallow: /foo" + CR
  +  + "Crawl-delay: 20" + CR;

     private static final String[] TEST_PATHS = new String[] {
       "http://example.com/a",
  @@ -80,6 +97,29 @@
     /**
     * Test that the robots rules are interpreted correctly by the robots rules parser.
     */
  +  public void testRobotsAgentReverse() {
  +    rules = parser.parseRules(testRobotsAgent, ROBOTS_STRING_REVERSE.getBytes(), CONTENT_TYPE, SINGLE_AGENT);
  +
  +    for (int counter = 0; counter < TEST_PATHS.length; counter++) {
  +      assertTrue("testing on agent (" + SINGLE_AGENT + "), and "
  +          + "path " + TEST_PATHS[counter]
  +          + " got " + rules.isAllowed(TEST_PATHS[counter]),
  +          rules.isAllowed(TEST_PATHS[counter]) == RESULTS[counter]);
  +    }
  +
  +    rules = parser.parseRules(testRobotsAgent, ROBOTS_STRING_REVERSE.getBytes(), CONTENT_TYPE, MULTIPLE_AGENTS);
  +
  +    for (int counter = 0; counter < TEST_PATHS.length; counter++) {
  +      assertTrue("testing on agents (" + MULTIPLE_AGENTS + "), and "
  +          + "path " + TEST_PATHS[counter]
  +          + " got " + rules.isAllowed(TEST_PATHS[counter]),
  +          rules.isAllowed(TEST_PATHS[counter]) == RESULTS[counter]);
  +    }
  +  }
  +
  +  /**
  +  * Test that the robots rules are interpreted correctly by the robots rules parser.
  +  */
     public void testRobotsAgent() {
       rules = parser.parseRules(testRobotsAgent, ROBOTS_STRING.getBytes(), CONTENT_TYPE, SINGLE_AGENT);
 
 
  -Original 

Re: WrongRegionException after updatedb

2014-01-23 Thread Tejas Patil
This is tied to HBase and not Nutch. It would be beneficial if you get the
complete stack trace and post it over at the HBase user group too.

~tejas


On Thu, Jan 23, 2014 at 6:49 PM, cervenkovab cervenko...@gmail.com wrote:

 I run generate-fetch-parse and after update I got this exception.

 2014-01-23 13:40:56,905 ERROR store.HBaseStore - Failed 747 actions:
 WrongRegionException: 747 times, servers with issues: server.eu:43556,
 2014-01-23 13:40:56,905 ERROR store.HBaseStore -
 [Ljava.lang.StackTraceElement;@12101d00

 Can you please help me understand where the problem could be?

 using versions: Nutch 2.2.1, Hbase 0.90.6




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/WrongRegionException-after-updatedb-tp4112982.html
 Sent from the Nutch - User mailing list archive at Nabble.com.



Re: How to Get Links With Nutch

2014-01-22 Thread Tejas Patil
Correct me if I am wrong: you want the anchor text and the outlink. Right?
If you crawl the seed url for depth 1 using Nutch 1.x and then get a
segment dump of the segment generated after the crawl, it should have that
information.
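
Something like this (a rough sketch; the segment name is whatever your crawl
produced):

  bin/nutch readseg -dump crawl/segments/20140122103000 dump_out \
      -nofetch -nogenerate -nocontent -noparsetext
  less dump_out/dump   # each ParseData record lists its outlinks with toUrl and anchor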


On Wed, Jan 22, 2014 at 9:46 PM, Teague James teag...@insystechinc.com wrote:

 I am trying to use Nutch to crawl a site and return all of the links that
 are on a page. As a simple example, the page might look like this if its
 address were www.example.com and each of the items in [brackets] were
 links
 of some sort - relative or full URLs:

 Article 1 text blah blah blah [Read more]
 Download [Article 1 PDF]
 Article 2 text blah blah blah [Read more]
 Download [Article 2 PDF]
 In partnership with [Some Partner]
 [Home]|[Articles]|[Contact Us]

 What I want to get is a list of all the links and destination URLs,
 something like:
 [Read more] /article1
 [Article 1 PDF] /pdfs/article1.pdf
 [Read more] /article2
 [Article 2 PDF] /pdfs/article2.pdf
 [Some Partner] www.somepartner.com
 [Home] /home
 [Articles] /articles
 [Contact Us] /contact us

 Note that a lot of the links are relative. I don't care whether I can get
 only the relative /article1 or the full www.example.com/article1 and I
 do not necessarily need Nutch to go to each of those links and crawl them.
 I
 just want Nutch to report on all of the links on the page.

 Can anyone offer me any advice on how to accomplish this?




Re: How to Get Links With Nutch

2014-01-22 Thread Tejas Patil
On Wed, Jan 22, 2014 at 10:33 PM, Teague James teag...@insystechinc.com wrote:

 Tejas,

 Thanks for your response, that is exactly correct. Ultimately I want to be
 able to index the Nutch crawl with Solr to make it all searchable. After
 doing my crawl, I use:

 bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb
 crawl/linkdb -dir crawl/segments/

 But I do not get all of the anchors. I get some anchors as a comma
 delimited
 list in the anchor field.


HTML parser in Nutch is doing this.


 I do not get any of the outlinks.

 I think that the indexer would only index the parsed content from the
segments and the outlinks won't be included. That's why you see that
happening.

I did a dump with readdb of the crawldb and found that the links I want are
 there. I will take a look at doing a segment dump as you suggest.


ok. Look out for the outlinks section in the segment dump.


 Will that
 make these outlinks available to Solr or are there additional steps I need
 to take?

 I am not sure if there is a better way than this: Write your own HTML
parser (or tweak the one provided with Nutch) which would just emit the
outlinks along with their anchor text.

-Original Message-

 Correct me if I am wrong: You want the anchor text and the outlink. Right ?
 If you crawl the seed url for depth 1 using Nutch 1.x and then get a
 segment
 dump of the segment generated after crawl, it should have that information.


 On Wed, Jan 22, 2014 at 9:46 PM, Teague James
 teag...@insystechinc.comwrote:

  I am trying to use Nutch to crawl a site and return all of the links
  that are on a page. As a simple example, the page might look like this
  if its address were www.example.com and each of the items in
  [brackets] were links of some sort - relative or full URLs:
 
  Article 1 text blah blah blah [Read more] Download [Article 1 PDF]
  Article 2 text blah blah blah [Read more] Download [Article 2 PDF] In
  partnership with [Some Partner] [Home]|[Articles]|[Contact Us]
 
  What I want to get is a list of all the links and destination URLs,
  something like:
  [Read more] /article1
  [Article 1 PDF] /pdfs/article1.pdf
  [Read more] /article2
  [Article 2 PDF] /pdfs/article2.pdf
  [Some Partner] www.somepartner.com
  [Home] /home
  [Articles] /articles
  [Contact Us] /contact us
 
  Note that a lot of the links are relative. I don't care whether I can
  get only the relative /article1 or the full
  www.example.com/article1 and I do not necessarily need Nutch to go to
 each of those links and crawl them.
  I
  just want Nutch to report on all of the links on the page.
 
  Can anyone offer me any advice on how to accomplish this?
 
 




Request for reviewing HostDb and Sitemap features

2014-01-21 Thread Tejas Patil
Hi,

Is anyone interested in reviewing or trying out the patch for these new
features ? I have recently updated [0] and [1] and would like to hear back
comments on the same.

[0] : https://issues.apache.org/jira/browse/NUTCH-1325
[1] : https://issues.apache.org/jira/browse/NUTCH-1465

Thanks,
Tejas


Re: The problem caused by failed with: java.io.IOException: unzipBestEffort returned null

2013-12-29 Thread Tejas Patil
I tried to debug the issue. You are hit by
https://issues.apache.org/jira/browse/NUTCH-1647

Thanks,
Tejas


On Fri, Dec 27, 2013 at 8:54 AM, yan wang dayank...@gmail.com wrote:

 Hi, guys
   Yesterday, I tried to crawl a website (a Chinese website) with some
 seed links like this:
 http://www.ccgp.gov.cn/cggg/dfbx/gkzb/default_4.shtml
 but the crawl process failed because of a problem shown as following:
 fetching http://www.ccgp.gov.cn/cggg/dfbx/gkzb/default_4.shtml (queue
 crawl delay=5000ms)
 fetch of http://www.ccgp.gov.cn/cggg/dfbx/gkzb/default_4.shtml failed
 with: java.io.IOException: unzipBestEffort returned null
 At first, I used nutch-1.5.1 to crawl the website and had the above
 problem, then I changed to use nutch-1.7 to do it again but it failed again.
 Now, I totally have no idea how to handle the problem!
 I would really appreciate any feedback!

 -Yan Wang


Re: Using ParseUtils in MR job (not as part of nutch crawl)

2013-12-22 Thread Tejas Patil
On Sun, Dec 22, 2013 at 4:39 AM, Amit Sela am...@infolinks.com wrote:

 Hi all,

 I'm trying to use the nutch ParseUtil to parse nutch Content with
 parse-tika and parse-html


By nutch content, do you mean a nutch segment? Please try using the 'bin/nutch
parse' command instead.

but I keep getting:

 RuntimeException: x point org.apache.nutch.parse.Parser not found


This smells like some problem in loading the plugins.


 I'm running this in a MR outside of the nutch crawl jobs, and when I run it
 in IDE I have to add the build/ directory to project classpath in order to
 solve it.


The bin/nutch script generates the appropriate classpath before invoking the
class. You can get the value of the CLASSPATH formed by the script and try to
reproduce the same in the IDE. Glad that you found a way around it.
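
One way to see that classpath without running a real job (a sketch; it just
traces the script and grabs the first CLASSPATH assignment, and the parse
command itself only prints its usage):

  bash -x bin/nutch parse 2>&1 | grep -m1 'CLASSPATH='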


 I hoped distributing the apache-nutch-1.7.jar (version I use) to data nodes
 classpath directories would help, I even added parse-plugins.xml but it
 won't do...

 I hope that you were running from runtime/deploy for distributed mode.
No need to distribute the jar. Hadoop does that for you. Even the configs
are inside the runtime/deploy/apache-nutch-1.XX-.job file.

Anyone managed that ?

 Thanks,

 Amit.



Re: Crawling a specific site only

2013-12-18 Thread Tejas Patil
You need to provide the topN parameter to run Generate; you can't skip it. What
I meant was to set its value to more than 2000.
Note: The max allowable value for topN is (2^63)-1. Don't exceed that.
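
For example (a sketch; I am assuming the 2.x generate job accepts -topN
directly from bin/nutch, as in your script):

  bin/nutch generate -topN 1000000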

Thanks,
Tejas


On Wed, Dec 18, 2013 at 2:14 AM, Vangelis karv karvouni...@hotmail.com wrote:

 Thanks for the support guys! I'll crawl again with
 generate.count.mode=host and generate.max.count=-1. Although, if I don't set
 -topN in the nutch script it won't let me run GeneratorJob.

  Subject: RE: Crawling a specific site only
  From: markus.jel...@openindex.io
  To: user@nutch.apache.org
  Date: Wed, 18 Dec 2013 09:38:04 +
 
   Increase it to a reasonably high value or don't set it at all; it will
 then attempt to crawl as much as it can. Also check generate.count.mode and
 generate.max.count.
 
 
  -Original message-
   From:Vangelis karv karvouni...@hotmail.com
   Sent: Wednesday 18th December 2013 9:56
   To: user@nutch.apache.org
   Subject: RE: Crawling a specific site only
  
   Can you be a little more specific about that, Tejas?
  
Date: Tue, 17 Dec 2013 23:32:46 -0800
Subject: Re: Crawling a specific site only
From: tejas.patil...@gmail.com
To: user@nutch.apache.org
   
 You should bump up the value of topN instead of setting it to 2000. That
  would
 make a lot of the urls eligible for fetching.
   
Thanks,
Tejas
   
   
On Tue, Dec 17, 2013 at 3:02 AM, Vangelis karv 
  karvouni...@hotmail.com wrote:
   
 Markus and Wang thank you very much for your fast responses. I
 forgot to
 mention that i use nutch 2.2.1 and mysql. Both DomainFilter and
 ignore.external.links ideas are awesome! What really bothers me is
 that
 dreaded -topN. I really want to live without it! :) I hate it
 when I open
 my database and I see that i have for example 2000 links
 unfetched, which
 means they are not parsed-useless, and only 2000 fetched.

  Subject: Re: Crawling a specific site only
  From: wangyi1...@gmail.com
  To: user@nutch.apache.org
  Date: Tue, 17 Dec 2013 18:53:55 +0800
 
  HI
  Just set
   <name>db.ignore.external.links</name>
   <value>true</value>
   and run the crawl script several times; the default number of pages to
  be added is 50,000.
  be added is 50,000.
 
  Is it right?
  Wang
 
 
  -Original Message-
  From: Vangelis karv karvouni...@hotmail.com
  Reply-to: user@nutch.apache.org
  To: user@nutch.apache.org user@nutch.apache.org
  Subject: Crawling a specific site only
  Date: Tue, 17 Dec 2013 12:15:00 +0200
 
  Hi again! My goal is to crawl a specific site. I want to crawl
 all the
 links that exist under that site. For example, if i decide to crawl
 http://www.uefa.com/, I want to parse all its inlinks(photos,
 videos,
 htmls etc) and not only the best scoring urls for this site= topN.
 So, my
 question here is: how can we tell Nutch to crawl everything in a
 site and
 not only the sites that have the best score?
 
 
 


  




Re: Memory leak when crawling repeatedly?

2013-12-17 Thread Tejas Patil
You should use the bin/crawl script instead of directly invoking Crawl()

Thanks,
Tejas


On Tue, Dec 17, 2013 at 7:04 AM, yann yann1...@yahoo.com wrote:

 Thanks Julien, I will give it a try and report back.

 Is there sample code in trunk on what to replace the Crawl() with?

 Yann



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Memory-leak-when-crawling-repeatedly-tp4106960p4107114.html
 Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Crawling a specific site only

2013-12-17 Thread Tejas Patil
You should bump up the value of topN instead of setting it to 2000. That would
make a lot of the urls eligible for fetching.

Thanks,
Tejas


On Tue, Dec 17, 2013 at 3:02 AM, Vangelis karv karvouni...@hotmail.com wrote:

 Markus and Wang thank you very much for your fast responses. I forgot to
 mention that I use nutch 2.2.1 and mysql. Both DomainFilter and
 ignore.external.links ideas are awesome! What really bothers me is that
 dreaded -topN. I really want to live without it! :) I hate it when I open
 my database and I see that I have, for example, 2000 links unfetched (which
 means they are not parsed - useless) and only 2000 fetched.

  Subject: Re: Crawling a specific site only
  From: wangyi1...@gmail.com
  To: user@nutch.apache.org
  Date: Tue, 17 Dec 2013 18:53:55 +0800
 
  HI
  Just set
  <name>db.ignore.external.links</name>
  <value>true</value>
  and run the crawl script several times; the default number of pages to
  be added is 50,000.
 
  Is it right?
  Wang
 
 
  -Original Message-
  From: Vangelis karv karvouni...@hotmail.com
  Reply-to: user@nutch.apache.org
  To: user@nutch.apache.org user@nutch.apache.org
  Subject: Crawling a specific site only
  Date: Tue, 17 Dec 2013 12:15:00 +0200
 
  Hi again! My goal is to crawl a specific site. I want to crawl all the
 links that exist under that site. For example, if i decide to crawl
 http://www.uefa.com/, I want to parse all its inlinks(photos, videos,
 htmls etc) and not only the best scoring urls for this site= topN. So, my
 question here is: how can we tell Nutch to crawl everything in a site and
 not only the sites that have the best score?
 
 
 




Re: Memory leak when crawling repeatedly?

2013-12-16 Thread Tejas Patil
Did you see the logs and figure out from the stack trace which portion of the
code is responsible for the OOM?

Thanks,
Tejas


On Mon, Dec 16, 2013 at 9:32 AM, yann yann1...@yahoo.com wrote:

 Hi guys,

 I'm writing a server / rest API for Nutch, but I'm running into a memory
 leak issue.

 I simplified the problem down to this: crawling a site repeatedly (as
 below)
 will eventually run out of memory; when looking at the running JVM with
 VisualVM, the permGen space grows indefinitely at the same rate, until it
 runs out and the application crashes.

 I suspect there is a memory leak in Nutch or in Hadoop, as I wouldn't
 expect
 the code below to grow its memory footprint indefinitely.

 The code:

 while (true) {
   Configuration configuration = NutchConfiguration.create();
   String crawlArg = "config/urls/dev -dir crawls/dev -threads 5 -depth 2 -topN 100";
   ToolRunner.run(configuration, new Crawl(), MiscUtils.tokenize(crawlArg));
 }

 Anything I can do on my side to fix this?

 Thanks for all comments,

 Yann




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Memory-leak-when-crawling-repeatedly-tp4106960.html
 Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Nutch with YARN (aka Hadoop 2.0)

2013-12-09 Thread Tejas Patil
 I am not able to locate the logs either to confirm that. Can you please
let me know how to retrieve logs from Nutch on Hadoop?
You should be able to see the jobs over the Hadoop UI, and for each mapper and
reducer you can view the logs in the browser by clicking on the job.

 YARN supposed to be backward compatible?
It is. Nutch has not migrated fully to support the new MapReduce API from
which YARN was sprung.


On Mon, Dec 9, 2013 at 3:11 PM, S.L simpleliving...@gmail.com wrote:

 Isn't YARN supposed to be backward compatible?

 Sent from my HTC Inspire™ 4G on AT&T

 - Reply message -
 From: Julien Nioche lists.digitalpeb...@gmail.com
 To: user@nutch.apache.org user@nutch.apache.org
 Cc: d...@nutch.apache.org d...@nutch.apache.org
 Subject: Nutch with YARN (aka Hadoop 2.0)
 Date: Mon, Dec 9, 2013 3:54 am


 I don't think Nutch has been fully ported to the new mapreduce API which is
 a prerequisite for running it on Hadoop 2.
I can't think of a reason why the performance would be any different
 with Yarn.

 Julien


 On 9 December 2013 06:42, Tejas Patil tejas.patil...@gmail.com wrote:

  Has anyone tried out running Nutch over YARN ? If so, were there were any
  performance gains with the same ?
 
  Thanks,
  Tejas
 



 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble



Re: Nutch Hadoop Job plugins property

2013-12-09 Thread Tejas Patil
When you run Nutch over Hadoop ie. deploy mode, you use the job file
(apache-nutch-1.X.job). This is nothing but a big fat zip file
containing (you can unzip it and verify yourself) :
(a) all the nutch classes compiled,
(b) config files and
(c) dependent jars

When hadoop launches map-reduce jobs for nutch:
1. This nutch job file is copied over to the node where your job is
executed (say map task),
2. It is unpacked
3. Nutch gets the nutch-site.xml and nutch-default.xml, loads the configs.
4. By default, plugin.folders is set to plugins which is a relative path.
It would search the plugin classes in the classpath under a directory named
plugins.
5. The plugins directory is under a directory named classes which is in
the classpath (this is inside the extracted job file). Now, required plugin
classes are loaded from here and everything runs fine.

In short: Leave it as it is. It should work over Hadoop by default.
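
A quick way to convince yourself (a sketch; the job file name depends on your
build) is to list the job file and look for the bundled plugins:

  unzip -l runtime/deploy/apache-nutch-1.7.job | grep "classes/plugins" | head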

Thanks,
Tejas

On Mon, Dec 9, 2013 at 4:54 PM, S.L simpleliving...@gmail.com wrote:

 What should the plugins property be set to when running Nutch as a
 Hadoop job?

 I just created a deploy mode jar by running the ant script. I see that the
 value of the plugins property is being copied and used from the
 configuration into the hadoop job. While it seems to be getting the plugins
 directory because Hadoop is being run on the same machine, I am sure it
 will fail when moved to a different machine.

 How should I set the plugins property so that it is relative to the hadoop
 job?

 Thanks



Re: Nutch with YARN (aka Hadoop 2.0)

2013-12-09 Thread Tejas Patil
On Mon, Dec 9, 2013 at 4:50 PM, S.L simpleliving...@gmail.com wrote:

 It is. Nutch has not migrated fully to support the new MapReduce API
 from
 which YARN was sprung.

 Yes, Nutch 1.7 is using the old API. Does this mean that Nutch 1.7 is
 incompatible with Hadoop 2.2? I am thinking it is compatible in its
 current state and that's why they are calling Hadoop 2.2 backward
 compatible. Isn't that the case?

 Well, I have not run it with the latest 2.X hadoop version, but you are right
about backward compatibility. They would not make the old code break
unless there is some strong reason. Having said that, if Nutch keeps using the old
deprecated hadoop API, you won't get any benefits of the changes done in
newer map reduce versions.

PS: Here is the relevant nutch jira for up-gradation to new Hadoop API
https://issues.apache.org/jira/browse/NUTCH-1219

Thanks,
Tejas


 On Mon, Dec 9, 2013 at 6:22 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:

   I am not able to locate the logs either to confirm that. Can you
 please
  let me know how to retrieve logs from Nutch on Hadoop?
  You should have seen the jobs over Hadoop UI and for each mapper and
  reducer you can view the logs over the browser by clicking on the job.
 
   YARN supposed to be backward compatible?
  It is. Nutch has not migrated fully to support the new MapReduce API from
  which YARN was sprung.
 
 
  On Mon, Dec 9, 2013 at 3:11 PM, S.L simpleliving...@gmail.com wrote:
 
   Isnt ,YARN supposed to be backward compatible?
  
    Sent from my HTC Inspire™ 4G on AT&T
  
   - Reply message -
   From: Julien Nioche lists.digitalpeb...@gmail.com
   To: user@nutch.apache.org user@nutch.apache.org
   Cc: d...@nutch.apache.org d...@nutch.apache.org
   Subject: Nutch with YARN (aka Hadoop 2.0)
   Date: Mon, Dec 9, 2013 3:54 am
  
  
   I don't think Nutch has been fully ported to the new mapreduce API
 which
  is
   a prerequisite for running it on Hadoop 2.
   I can't think of a reason why that the performance would be any
 different
   with Yarn.
  
   Julien
  
  
   On 9 December 2013 06:42, Tejas Patil tejas.patil...@gmail.com
 wrote:
  
Has anyone tried out running Nutch over YARN ? If so, were there were
  any
performance gains with the same ?
   
Thanks,
Tejas
   
  
  
  
   --
  
   Open Source Solutions for Text Engineering
  
   http://digitalpebble.blogspot.com/
   http://www.digitalpebble.com
   http://twitter.com/digitalpebble
  
 



Re: load plugin from jar file

2013-12-09 Thread Tejas Patil
I am not familiar with Clojure at all. Nutch's plugin loading code is tricky,
and hacking it to invoke Clojure code straight away would be significant
work. My feeling is that if you cook up a plugin in java and then
call Clojure
code through this java wrapper, it might work.

Thanks,
Tejas


On Mon, Dec 9, 2013 at 5:03 PM, Olle Romo oller...@metasound.ch wrote:

 Hi All,

 According to NUTCH-609 there were some talk about allowing plugins to load
 from jar files. Looks like that was from a few years ago. Can I write a
 plugin in Clojure and have it load ok?

 Best,
 Olle




Re: Nutch Hadoop Job plugins property

2013-12-09 Thread Tejas Patil
On Mon, Dec 9, 2013 at 6:07 PM, S.L simpleliving...@gmail.com wrote:

 Thanks for a great reply!

 Right now I have a 4 urls in my seed file with domains d1,d2,d3,d4.


 I see that when the nutch job is being run on Hadoop it's only picking up
 URLs for d4; there does not seem to be any parallelism.


I would recommend running all phases of nutch INDIVIDUALLY and looking into
the logs for the generate and fetch phases. Set the log level for generate to
DEBUG.
One possible reason: all urls of host 'd4' had a higher score than the other
ones. This is less likely to be the cause since your topN value is large.


 I am running the Nutch job using the following command.

 bin/hadoop jar
 /home/general/workspace/nutch/runtime/deploy/apache-nutch-1.8-SNAPSHOT.job
 org.apache.nutch.crawl.Crawl urls -dir crawldirectory -depth 1000
 -topN 3


I am not sure but I think that the crawl command is deprecated. You might
have to use the 'bin/crawl' script instead.
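
Something like this (a sketch; if I remember the 1.x script correctly it takes
the seed dir, crawl dir, solr url and number of rounds, and the solr url here
is a placeholder):

  bin/crawl urls crawldirectory http://localhost:8983/solr/ 2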





 On Mon, Dec 9, 2013 at 8:16 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  When you run Nutch over Hadoop ie. deploy mode, you use the job file
  (apache-nutch-1.X.job). This is nothing but a big fat zip file
  containing (you can unzip it and verify yourself) :
  (a) all the nutch classes compiled,
  (b) config files and
  (c) dependent jars
 
  When hadoop launches map-reduce jobs for nutch:
  1. This nutch job file is copied over to the node where your job is
  executed (say map task),
  2. It is unpacked
  3. Nutch gets the nutch-site.xml and nutch-default.xml, loads the
 configs.
  4. By default, plugin.folders is set to plugins which is a relative
 path.
  It would search the plugin classes in the classpath under a directory
 named
  plugins.
  5. The plugins directory is under a directory named classes which is
 in
  the classpath (this is inside the extracted job file). Now, required
 plugin
  classes are loaded from here and everything runs fine.
 
  In short: Leave it as it is. It should work over Hadoop by default.
 
  Thanks,
  Tejas
 
  On Mon, Dec 9, 2013 at 4:54 PM, S.L simpleliving...@gmail.com wrote:
 
   What should be the plugins property be set to when running Nutch as a
   Hadoop job ?
  
   I just created a deploy mode jar running the ant script , I see that
 the
   value of the plugins property is being copied and used from the
   confiuration into the hadoop job. While it seems to be getting the
  plugins
   directory  because Hadoop is being run on the same machine , I am sure
 it
   will fail when moved to a different machine.
  
   How should I set the plugins property so that it is relative to the
  hadoop
   job?
  
   Thanks
  
 



Re: Unsuccessful fetch/parse of large page with many outlinks

2013-12-09 Thread Tejas Patil
I think that you narrowed it down and most probably it's some
bug/incompatibility in the HTTP library which nutch uses to talk with the
server. Were both the servers where you hosted the url running IIS 6.0? If yes,
then there is more to it :)

Thanks,
Tejas


On Mon, Dec 9, 2013 at 3:32 PM, Iain Lopata ilopa...@hotmail.com wrote:

 Out of ideas at this point.

 I can retrieve the page with Curl
 I can retrieve the page with Wget
 I can view the page in my browser
 I can retrieve the page by opening a socket from a PHP script
 I can retrieve the page with nutch if I move the page to another host

 But

 Any page I try and fetch from www.friedfrank.com with Nutch reads just 198
 bytes and then closes the stream.

 Debug code inserted in HttpResponse and WireShark both show that this is
 the
 case.

 Could someone else please try and fetch a page from this host from your
 config?

 My suspicion is that it is related to this host being on IIS 6.0 with this
 problem being a potential cause: http://support.microsoft.com/kb/919797

 -Original Message-
 From: Iain Lopata [mailto:ilopa...@hotmail.com]
 Sent: Monday, December 09, 2013 7:36 AM
 To: user@nutch.apache.org
 Subject: RE: Unsuccessful fetch/parse of large page with many outlinks

 Parses 652 outlinks from the ebay url without any difficulty.

 Didn't want to change the title and thereby break this thread, but at this
 point, and as stated in my last post,  I am reasonably confident that for
 some reason the InputReader in HttpResponse.java sees the stream as closed
 after reading only 198 bytes.  Why I do not know.

 -Original Message-
 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: Sunday, December 08, 2013 11:44 PM
 To: user@nutch.apache.org
 Subject: Re: Unsuccessful fetch/parse of large page with many outlinks

 I faced a similar problem with this page
 http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 when I was
 running Nutch from within eclipse. I was able to crawl all the outlinks
 successfully when I ran nutch as a jar outside of eclipse; at that point it
 was considered to be an issue with running it in eclipse.

 Can you please try this URL with your setup? It has at least 600+
 outlinks.


 On Sun, Dec 8, 2013 at 10:07 PM, Iain Lopata ilopa...@hotmail.com wrote:

  Some further analysis - no solution.
 
  The pages in question do not return a Content-Length header.
 
  Since the http.content.limit is set to -1, http-protocol sets the
  maximum read length to 2147483647.
 
  At line 231 of HttpResponse.java the loop:
 
   for (int i = in.read(bytes); i != -1 && length + i <= contentLength; i = in.read(bytes))
 
  executes once and once only and returns a stream of just 198 bytes.
  No exceptions are thrown.
 
  So, I think, the question becomes why would this connection close
  before the end of the stream?  It certainly seems to be server
  specific since I can retrieve the file successfully from a different
  host domain.
 
  -Original Message-
  From: Tejas Patil [mailto:tejas.patil...@gmail.com]
  Sent: Sunday, December 08, 2013 2:29 PM
  To: user@nutch.apache.org
  Subject: Re: Unsuccessful fetch/parse of large page with many outlinks
 
   debug code that I have inserted in a custom filter shows that the
   file
  that was retrieved is only 198 bytes long.
  I am assuming that this code did not hinder the crawler. A better way
  to see the content would be to take a segment dump [0] and then
  analyse it.
  Also, turn on DEBUG mode of the log4j for the http protocol classes
  and fetcher class.
 
   attempted to crawl it from that site and it works fine, retrieving
   all
  597KB and parsing it successfully.
  You mean that you ran a nutch crawl with the problematic url as a seed
  and used the EXACT same config on both machines. One machine gave
  perfect content and the other one was not. Note that using EXACT same
  config over these 2 runs is important.
 
   the page has about 350 characters of LineFeeds CarriageRetruns and
   spaces
  No way. The HTTP request gets a byte stream as response. Also, had it
  been the case that LF or CR chars create problem, then it must hit
  nutch irrespective of from which machine you run nutch...but thats not
  what your experiments suggest.
 
  [0] : http://wiki.apache.org/nutch/bin/nutch_readseg
 
 
 
  On Sun, Dec 8, 2013 at 11:23 AM, Iain Lopata ilopa...@hotmail.com
 wrote:
 
   I do not know whether this would be a factor, but I have noticed
   that the page has about 350 characters of LineFeeds, CarriageReturns
   and spaces before the <!DOCTYPE declaration.  Could this be causing
   a problem for
   http-protocol in some way?   However, I can't explain why the same file
  with
   the same LF, CR and whitespace would read correctly from a different
  host.
  
   -Original Message-
   From: Iain Lopata [mailto:ilopa...@hotmail.com]
   Sent: Sunday, December 08, 2013 12:06 PM
   To: user@nutch.apache.org
   Subject: Unsuccessful fetch/parse of large page

Re: Unsuccessful fetch/parse of large page with many outlinks

2013-12-08 Thread Tejas Patil
 debug code that I have inserted in a custom filter shows that the file
that was retrieved is only 198 bytes long.
I am assuming that this code did not hinder the crawler. A better way to
see the content would be to take a segment dump [0] and then analyse it.
Also, turn on DEBUG mode of the log4j for the http protocol classes and
fetcher class.
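
For the log4j part, something like this (a sketch; I am assuming the standard
logger names for the fetcher class and the protocol-http plugin package):

  echo 'log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG' >> conf/log4j.properties
  echo 'log4j.logger.org.apache.nutch.protocol.http=DEBUG' >> conf/log4j.properties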

 attempted to crawl it from that site and it works fine, retrieving all
597KB and parsing it successfully.
You mean that you ran a nutch crawl with the problematic url as a seed and
used the EXACT same config on both machines. One machine gave perfect
content and the other one did not. Note that using the EXACT same config over
these 2 runs is important.

 the page has about 350 characters of LineFeeds, CarriageReturns and spaces
No way. The HTTP request gets a byte stream as response. Also, had it been
the case that LF or CR chars created a problem, then it must hit nutch
irrespective of which machine you run nutch from...but that's not what your
experiments suggest.

[0] : http://wiki.apache.org/nutch/bin/nutch_readseg



On Sun, Dec 8, 2013 at 11:23 AM, Iain Lopata ilopa...@hotmail.com wrote:

 I do not know whether this would be a factor, but I have noticed that the
 page has about 350 characters of LineFeeds, CarriageReturns and spaces
 before
 the <!DOCTYPE declaration.  Could this be causing a problem for
 http-protocol in some way?   However, I can't explain why the same file with
 the same LF, CR and whitespace would read correctly from a different host.

 -Original Message-
 From: Iain Lopata [mailto:ilopa...@hotmail.com]
 Sent: Sunday, December 08, 2013 12:06 PM
 To: user@nutch.apache.org
 Subject: Unsuccessful fetch/parse of large page with many outlinks

 I am running Nutch 1.6 on Ubuntu Server.



 I am experiencing a problem with one particular webpage.



 If I use parsechecker against the problem url the output shows (host name
 changed to example.com):



 

 fetching: http://www.example.com/index.cfm?pageID=12

 text/html

 parsing: http://www.example.com/index.cfm?pageID=12

 contentType: text/html

 signature: a9c640626fcad48caaf3ad5f94bea446

 -

 Url

 ---

 http://www.example.com/index.cfm?pageID=12

 -

 ParseData

 -

 Version: 5

 Status: success(1,0)

 Title:

 Outlinks: 0

 Content Metadata: Date=Sun, 08 Dec 2013 17:32:33 GMT
 Set-Cookie=CFTOKEN=96208061;path=/ Content-Type=text/html; charset=UTF-8
 Connection=close X-Powered-By=ASP.NET Server=Microsoft-IIS/6.0

 Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

 



 However, this page has 3775 outlinks.



 If I run a crawl with this page as a seed the log file shows that the file
 was fetched successfully, but debug code that I have inserted in a custom
 filter shows that the file that was retrieved is only 198 bytes long.  For
 some reason the file would seem to be truncated or otherwise corrupted.



 I can retrieve the file with wget and can see that the file is 597KB.



 I copied the file that I retrieved with wget to another web server and
 attempted to crawl it from that site and it works fine, retrieving all
 597KB
 and parsing it successfully.  This would suggest that my current
 configuration does not have a problem processing this large file.



 I have checked the robots.txt file on the original host and it allows
 retrieval of this web page.



 Other relevant configuration settings may be:



 <property>

 <name>http.content.limit</name>

 <value>-1</value>

 </property>

 <property>

  <name>http.timeout</name>

  <value>6</value>

  <description></description>

 </property>



 Any ideas on what to check next?







Nutch with YARN (aka Hadoop 2.0)

2013-12-08 Thread Tejas Patil
Has anyone tried out running Nutch over YARN ? If so, were there were any
performance gains with the same ?

Thanks,
Tejas


Re: Manipulating Nutch 2.2.1 scoring system

2013-12-07 Thread Tejas Patil
Hi Vangelis,

You can write your own implementation of scoring and make nutch use it
via a plugin.
- Go through [0] to understand how to write a custom plugin
- For scoring, your class should implement the ScoringFilter interface [1]
  (a rough layout sketch follows below).

[0] : http://wiki.apache.org/nutch/WritingPluginExample
[1] :
http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/scoring/ScoringFilter.java
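
Very roughly, a new scoring plugin ends up laid out like this ("scoring-myfilter" and the
package below are made-up placeholders, not an existing plugin; [0] has the full walk-through):

src/plugin/scoring-myfilter/plugin.xml     # declares an extension of the org.apache.nutch.scoring.ScoringFilter point
src/plugin/scoring-myfilter/build.xml
src/plugin/scoring-myfilter/ivy.xml
src/plugin/scoring-myfilter/src/java/org/example/nutch/MyScoringFilter.java   # implements ScoringFilter

Then add an <ant dir="scoring-myfilter" target="deploy"/> entry to src/plugin/build.xml, run
"ant runtime", and list "scoring-myfilter" in the plugin.includes property of conf/nutch-site.xml
so that the plugin actually gets loaded.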

Thanks,
Tejas


On Sat, Dec 7, 2013 at 4:32 AM, Talat UYARER talat.uya...@agmlab.comwrote:

 Hi Olle,

 I don't know of any existing working diagram for Nutch 1.x. But if you can
 make one, we will be glad. :)

 Talat


 06-12-2013 14:00 tarihinde, Olle Romo yazdı:

 Hi Talat,

 I'm at an early stage of learning Nutch and your diagram is _very_
 helpful. Would you happen to have a diagram for 1.x too? Or is there not
 much difference at the architecture level?

 Best,
 Olle


 On Dec 5, 2013, at 9:07 PM, Talat UYARER talat.uya...@agmlab.com wrote:

  Hi Vangelis,

  I drew a Nutch Software Architecture diagram. Maybe it can help you.

 https://drive.google.com/file/d/0B2kKrOleEOkRQllaTGdRZGFMY2M/
 edit?usp=sharing

 Talat


 05-12-2013 19:09 tarihinde, Vangelis karv yazdı:

 It is clear that OPICScoringfilter does all the work of creating the
 scores for the urls. I just wanted to know if it is possible to implement
 another function for scoring and if there are any available on the
 Internet. Does anybody know where exactly in the code Nutch calls that
 function (Generator, Fetcher, Parser)?








Re: Manipulating Nutch 2.2.1 scoring system

2013-12-07 Thread Tejas Patil
I think that the CrawlDatum of each url contains its score. You can get a
crawldb dump and see that.
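
Something along these lines for 2.x ("score_dump" is just a placeholder output directory;
add -crawlId <id> if you injected with one):

bin/nutch readdb -dump score_dump
grep -i score score_dump/part*

Each dumped record should show the page's score, so you can check what the scoring filter
produced before worrying about getting it into Solr.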

Tejas


On Sat, Dec 7, 2013 at 1:37 PM, Ing. Jorge Luis Betancourt Gonzalez 
jlbetanco...@uci.cu wrote:

 How can I send the linkrank or opic scoring into solr/hbase ?

 - Mensaje original -
 De: Tejas Patil tejas.patil...@gmail.com
 Para: user@nutch.apache.org
 Enviados: Sábado, 7 de Diciembre 2013 12:44:16
 Asunto: Re: Manipulating Nutch 2.2.1 scoring system

 Hi Vangelis,

 You can write your own implementation of scoring and make nutch to use it
 via a plugin.
 - Go through [0] to understand how to write a custom plugin
 - For scoring, you class should implement the ScoringFilter interface [1].

 [0] : http://wiki.apache.org/nutch/WritingPluginExample
 [1] :

 http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/scoring/ScoringFilter.java

 Thanks,
 Tejas


 On Sat, Dec 7, 2013 at 4:32 AM, Talat UYARER talat.uya...@agmlab.com
 wrote:

  Hi Olle,
 
  I dont know happened any working diagram for Nutch 1.x But If you can, we
  will be glad. :)
 
  Talat
 
 
  06-12-2013 14:00 tarihinde, Olle Romo yazdı:
 
  Hi Talat,
 
  I'm at an early stage of learning Nutch and your diagram is _very_
  helpful. Would you happen to have a diagram for 1.x too? Or is there not
  much difference at he architecture level?
 
  Best,
  Olle
 
 
  On Dec 5, 2013, at 9:07 PM, Talat UYARER talat.uya...@agmlab.com
 wrote:
 
   Hi Vangelis,
 
  I draw a Nutch Software Architecture diagram. Maybe it can be help you.
 
  https://drive.google.com/file/d/0B2kKrOleEOkRQllaTGdRZGFMY2M/
  edit?usp=sharing
 
  Talat
 
 
  05-12-2013 19:09 tarihinde, Vangelis karv yazdı:
 
  It is clear that OPICScoringfilter does all the job in creating the
  scores for the urls. I just wanted to know if it is possible to
 implement
  another function for scoring and if there are any available on the
  Internet. Does anybody know where exactly in the code, Nutch calls
 that
  function(Generator,Fetcher,Parser)?
 
 
 
 
 
 


 
 III Escuela Internacional de Invierno en la UCI del 17 al 28 de febrero
 del 2014. Ver www.uci.cu


 
 III Escuela Internacional de Invierno en la UCI del 17 al 28 de febrero
 del 2014. Ver www.uci.cu



Re: Cannot run program /bin/ls: java.io.IOException: error=11, Resource temporarily unavailable

2013-12-07 Thread Tejas Patil
java.io.IOException: error=11, Resource temporarily unavailable can be
attributed to 2 reasons:

Too many processes are already running (
http://stackoverflow.com/questions/8384000/java-io-ioexception-error-11) OR
Too many files open (
http://stackoverflow.com/questions/15494749/java-lang-outofmemoryerror-unable-to-create-new-native-thread-for-big-data-set
)

Coming to your load/environment: Nutch / Hadoop won't leave behind stale
processes or open files which would pile up after running several rounds.
After every round, it is expected to clear up things. To verify this, have
a script to capture #live-processes and #open-file-handles periodically
while nutch is running.
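
A minimal sketch of such a script (the interval and log path are arbitrary choices):

while true; do
  echo "$(date) processes=$(ps -e --no-headers | wc -l) open_files=$(lsof | wc -l)"
  sleep 30
done >> /tmp/nutch-resources.log

If both counters keep climbing across rounds, that would explain the error=11; if they stay
flat, the ulimits are probably just too low for a single round.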

Thanks,
Tejas

On Sat, Dec 7, 2013 at 2:19 PM, Martin Aesch martin.ae...@googlemail.comwrote:

 Dear Jon, dear nutchers,

 thanks - do you remember by chance more about the backgrounds of that
 problem? I am using nutch-1.7 and have currently the same issue while
 parsing.

 Nutch-1.7 in pseudo distributed mode, 32GB total, 768M per
 mapper/reducer task, 8G for hadoop, 2G for nutch. 6 mappers, max total
 in segment 10 URLs. According to the log, each URL takes 0-1 ms to
 be parsed.


 Suddenly, the 1min load of the machine goes up to 200 and higher, even
 (1/5/15-200-500-1200) I could see. But there is moderate CPU-usage, low
 IO-Wait and ~50 percent idle.

 Currently, I am running under same conditions, but only 10k URLs per
 segment. Up to now for 30 generate-fetch-parse-update-cycles no problem.

 I am already a veteran with ulimit problems and set values (ulimit -n:
 25, ulimit -u 32) very high. Now I am out of ideas.

 Any ideas, suggestions?

 Cheers,
 Martin

 -Original Message-
 From: Jon Uhal jonu...@gmail.com
 Reply-to: user@nutch.apache.org
 To: user@nutch.apache.org
 Subject: Cannot run program /bin/ls: java.io.IOException: error=11,
 Resource temporarily unavailable
 Date: Wed, 20 Nov 2013 16:47:33 -0500

 I just wanted to leave this here since it took me way too long to figure
 out. For some people, this might be an obvious problem, but since it wasn't
 to me, I want to make sure anyone else that gets this can have this answer.

 I kept getting the following error when I was running a crawl. For me, it
 was consistently happening, but I couldn't find any similar issues or
 solutions on the typical sites. The closest thing I could find was this:
 http://www.nosql.se/2011/10/hadoop-tasktracker-java-lang-outofmemoryerror/

 Below is the error I was seeing. This is just one of several exceptions
 that would happen during the parse but in the end, the parse step would
 have too many errors and fail the Nutch error limit.

 13/11/20 20:14:19 INFO parse.ParseSegment: ParseSegment: segment:
 test/segments/20131120201240
 13/11/20 20:14:20 INFO mapred.FileInputFormat: Total input paths to process
 : 2
 13/11/20 20:14:21 INFO mapred.JobClient: Running job: job_201311202006_0017
 13/11/20 20:14:22 INFO mapred.JobClient:  map 0% reduce 0%
 13/11/20 20:14:34 INFO mapred.JobClient:  map 40% reduce 0%
 13/11/20 20:14:36 INFO mapred.JobClient:  map 50% reduce 0%
 13/11/20 20:14:36 INFO mapred.JobClient: Task Id :
 attempt_201311202006_0017_m_01_0, Status : FAILED
 java.lang.RuntimeException: Error while running command to get file
 permissions : java.io.IOException: Cannot run program /bin/ls:
 java.io.IOException: error=11, Resource temporarily unavailable
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
 at org.apache.hadoop.util.Shell.run(Shell.java:182)
 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
 at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
 at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:712)
 at

 org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:448)
 at

 org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:431)
 at
 org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:267)
 at

 org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at

 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
 at org.apache.hadoop.mapred.Child.main(Child.java:249)
 Caused by: java.io.IOException: java.io.IOException: error=11, Resource
 temporarily unavailable
 at java.lang.UNIXProcess.init(UNIXProcess.java:148)
 at java.lang.ProcessImpl.start(ProcessImpl.java:65)
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
 ... 15 more

 at

Re: Possible Bug Nutch 1.7 Crawl Script

2013-09-05 Thread Tejas Patil
Thanks for reporting that. I think this point was brought up over [0] but
was left off. Could you try this out and tell us if it works ?
SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1`

[0] :
https://issues.apache.org/jira/browse/NUTCH-1087?focusedCommentId=13554353page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13554353



On Thu, Sep 5, 2013 at 5:18 AM, Scheffel, Aaron aaron.schef...@washpost.com
 wrote:

 The 1.7 crawl script has the following line 125

 SEGMENT=`ls -l $CRAWL_PATH/segments/ | sed -e s/ /\\n/g | egrep 20[0-9]+
 | sort -n | tail -n 1`

 There is an ls -l  there which is causing the script to behave badly
 (not work) at least on OS X. A simple ls seems to fix it.

 -Aaron



Re: nutch data from HBase to Oracle

2013-09-05 Thread Tejas Patil
This question is orthogonal to nutch. You could write code to read data
from HBase and then write to Oracle.


On Thu, Sep 5, 2013 at 10:51 AM, A Laxmi a.lakshmi...@gmail.com wrote:

 Since HBase was the only stable datastore option recommeded for nutch
 2.2.1, I wanted to know if it is possible to migrate/move crawled data
 stored in HBase to Oracle. The reason I am interested to have data
 available in Oracle is because I want to utilize some of the features in
 Oracle db such as triggers what HBase cannot offer.

 Please let me know if it is possible to move crawled data from HBase to
 Oracle?



Re: Reply: Aborting with 10 hung threads?

2013-08-30 Thread Tejas Patil
Could you check if urls could make it to the crawldb through the inject
operation ? They may have been filtered out by the regex urlfilter.
Run your urls against:

bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
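
The filter reads URLs from stdin and echoes each one back prefixed with '+' (accepted) or
'-' (rejected), so for example (adjust the seed file path to yours):

cat urls/seed.txt | bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter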



On Fri, Aug 30, 2013 at 1:42 AM, Jonathan.Wei 252637...@qq.com wrote:

 I run bin/nutch readdb -stats.
 return this message for me:


 hadoop@nutch1:/data/projects/clusters/apache-nutch-2.2/runtime/local$
 bin/nutch readdb -stats
 WebTable statistics start
 Statistics for WebTable:
 jobs:   {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats,
 counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce
 Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0,
 REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0,
 COMMITTED_HEAP_BYTES=449839104, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1146,
 COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0,
 COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=0,
 VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=0},
 FileSystemCounters={FILE_BYTES_READ=914496, FILE_BYTES_WRITTEN=1036378},
 File Output Format Counters ={BYTES_WRITTEN=98
 TOTAL urls: 0
 WebTable statistics: done
 jobs:   {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats,
 counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce
 Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0,
 REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0,
 COMMITTED_HEAP_BYTES=449839104, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1146,
 COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0,
 COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=0,
 VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=0},
 FileSystemCounters={FILE_BYTES_READ=914496, FILE_BYTES_WRITTEN=1036378},
 File Output Format Counters ={BYTES_WRITTEN=98
 TOTAL urls: 0



 Why is 0 urls?
 The urls file has 216 urls!




 -- Original Message --
 From: kaveh minooie [via Lucene] ml-node+s472066n408744...@n3.nabble.com
 ;
 Sent: Friday, August 30, 2013, 4:30 PM
 To: 基勇 252637...@qq.com;

 Subject: Re: Aborting with 10 hung threads?



 so fetch does hang when there is nothing for it to fetch. the most
 likely thing that has happened here is that your inject command did not
 go through successfully. you can check it by looking in to your hbase
 and see if the webpage table has been created and has values (your urls
 that you injected) in it. alliteratively you can just run 'nutch readdb
 -stats' and see what you get. if there was nothing there double check
 your config files.

 On 08/30/2013 12:12 AM, Jonathan.Wei wrote:
  And I ran bin/nutch inject urls and bin/nutch generate -topN250.
  I checked the hbase, but there is no data!
 
  Where is the problem?
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/Aborting-with-10-hung-threads-tp4087433p4087438.html
  Sent from the Nutch - User mailing list archive at Nabble.com.







 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Aborting-with-10-hung-threads-tp4087450.html
 Sent from the Nutch - User mailing list archive at Nabble.com.



Re: How nutch2.2 to parse rss?

2013-08-29 Thread Tejas Patil
AFAIK, the RSS plugin in 2.x isn't migrated.. I mean its code was copied from
the 1.x trunk and would need modifications to get things working with 2.x.
That's why it was disabled in the build file.



On Thu, Aug 29, 2013 at 6:34 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Jonathan,
 This has been a long outstanding issue IIRC.
 I have not used Nutch for feed crawling for a while if I am honest, and I
 honestly can't recall when and if I have done it with 2.x.
 You will see [0], that by default the plugin is not actually initialized.
 So for starters you should uncomment the various targets within this file
 [0] to get it working and to have it cleaned up etc.
 You can then try building... but I have a feeling that it will not build.
 Please check on our Jira for issues related to this... there may be patches
 but I am not sure.
 Kiran did some work a while back IIRC concerning getting following plugins
 to compile and run

  ant dir=feed target=deploy/
  ant dir=parse-ext target=deploy/
  ant dir=parse-swf target=deploy/
  ant dir=parse-zip target=deploy/

 But there is more work to be done.
 Please keep us updated on this on. Sorry for late reply.

 [0]
 http://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/build.xml


 On Thu, Aug 29, 2013 at 1:29 AM, Jonathan.Wei 252637...@qq.com wrote:

  Hello everybody!
   I want to use nutch2.2 to parse RSS!
   But nutch2.x is different from nutch1.x, so I don't know how to parse
  rss! Can you help me?
 
 
  Using the crawl command to grab 24 URLs, the result says Aborting with 10
  hung
  threads.
  The log content is:
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 466 0 kb/s,
 13
  URLs in 1 queues
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 461 0 kb/s,
 13
  URLs in 1 queues
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 455 0 kb/s,
 13
  URLs in 1 queues
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 450 0 kb/s,
 13
  URLs in 1 queues
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 444 0 kb/s,
 13
  URLs in 1 queues
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 439 0 kb/s,
 13
  URLs in 1 queues
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 434 0 kb/s,
 13
  URLs in 1 queues
  Aborting with 10 hung threads.
 
  What causes this?
 
  How I can fix it?
 
  Thank you!
 
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/How-nutch2-2-to-parse-rss-tp4087168.html
  Sent from the Nutch - User mailing list archive at Nabble.com.
 



 --
 *Lewis*



Re: How nutch2.2 to parse rss?

2013-08-29 Thread Tejas Patil
The 1.x RSS plugin works post this jira (
https://issues.apache.org/jira/browse/NUTCH-1494). There is an open jira (
https://issues.apache.org/jira/browse/NUTCH-1515) for its 2.x counterpart



On Thu, Aug 29, 2013 at 8:10 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 yeah there is work to be done here for sure. there must be an issue open
 for this?

 On Thursday, August 29, 2013, Jonathan.Wei 252637...@qq.com wrote:
  Thanks!
  I'll try it!
  But I have a feeling that it will not build either,
  because some class files are not found in nutch2.2!
  For example:
  ParseData
  ParseResult!
 
 
 
 
  Thank you!
 
 
 
 
  -- Original Message --
  From: lewis john mcgibbney [via Lucene]
 ml-node+s472066n4087394...@n3.nabble.com;
  Sent: Friday, August 30, 2013, 9:34 AM
  To: 基勇 252637...@qq.com;

  Subject: Re: How nutch2.2 to parse rss?
 
 
 
  Hi Jonathan,
  This has been a long outstanding issue IIRC.
  I have not used Nutch for feed crawling for a while if I am honest, and I
  honestly can't recall when and if I have done it with 2.x.
  You will see [0], that by default the plugin is not actually initialized.
  So for starters you should uncomment the various targets within this file
  [0] to get it working and to have it cleaned up etc.
  You can then try building... but I have a feeling that it will not build.
  Please check on our Jira for issues related to this... there may be
 patches
  but I am not sure.
  Kiran did some work a while back IIRC concerning getting following
 plugins
  to compile and run
 
   ant dir=feed target=deploy/
   ant dir=parse-ext target=deploy/
   ant dir=parse-swf target=deploy/
   ant dir=parse-zip target=deploy/
 
  But there is more work to be done.
  Please keep us updated on this on. Sorry for late reply.
 
  [0]
 http://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/build.xml
 
 
  On Thu, Aug 29, 2013 at 1:29 AM, Jonathan.Wei [hidden email] wrote:
 
  Hello!Every body!
   I want to use nutch2.2 to parse RSS !
   But nutch2.x different with nutch1.x!So I down know how to parse
  rss!Can you help me?
 
 
  Use crawl command grab 24 URL, but the results suggestAborting with 10
  hung
  threads.
  log content is :
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 466 0 kb/s,
 13
  URLs in 1 queues
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 461 0 kb/s,
 13
  URLs in 1 queues
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 455 0 kb/s,
 13
  URLs in 1 queues
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 450 0 kb/s,
 13
  URLs in 1 queues
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 444 0 kb/s,
 13
  URLs in 1 queues
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 439 0 kb/s,
 13
  URLs in 1 queues
  0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 434 0 kb/s,
 13
  URLs in 1 queues
  Aborting with 10 hung threads.
 
  What causes this?
 
  How I can fix it?
 
  Thank you!
 
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/How-nutch2-2-to-parse-rss-tp4087168.html
  Sent from the Nutch - User mailing list archive at Nabble.com.
 
 
 
 
  --
  *Lewis*
 
 
 
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/How-nutch2-2-to-parse-rss-tp4087399.html
  Sent from the Nutch - User mailing list archive at Nabble.com.

 --
 *Lewis*



Re: Issues Running Nutch 1.7 in Eclipse-- Please Help

2013-08-19 Thread Tejas Patil
The logs say:
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.

Please get the segment dump and analyse it for the outlinks extracted. Also
check your filters.


On Sun, Aug 18, 2013 at 8:02 PM, S.L simpleliving...@gmail.com wrote:

 Hello All,

 I am running Nutch 1.7 in eclipse and I start out with the Crawl job with
 the following settings.

 Main Class :org.apache.nutch.crawl.Crawl
 Arguments : urls -dir crawl -depth 10 -topN 10

 In the urls directory I have only one URL http://www.ebay.com and I
 expect the whole website to be crawled , however I get the following log
 output and the crawl seems to stop after a few urls are fetched.

 I use the nutch-default.xml and have already set http.content.limit to -1
 in it as mentioned in the other message in this mailing list. However  the
 crawl stops after a few URLs are fetched, please see the log below and
 advise.

 I am running eclipse on CentOS 6.4/Nutch 1.7

 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in

 [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in

 [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
 explanation.
 solrUrl is not set, indexing will be skipped...
 crawl started in: crawl
 rootUrlDir = urls
 threads = 10
 depth = 10
 solrUrl=null
 topN = 10
 Injector: starting at 2013-08-18 22:48:45
 Injector: crawlDb: crawl/crawldb
 Injector: urlDir: urls
 Injector: Converting injected urls to crawl db entries.
 Injector: total number of urls rejected by filters: 1
 Injector: total number of urls injected after normalization and filtering:
 1
 Injector: Merging injected urls into crawl db.
 Injector: finished at 2013-08-18 22:48:47, elapsed: 00:00:02
 Generator: starting at 2013-08-18 22:48:47
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: topN: 10
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls for politeness.
 Generator: segment: crawl/segments/20130818224849
 Generator: finished at 2013-08-18 22:48:51, elapsed: 00:00:03
 Fetcher: starting at 2013-08-18 22:48:51
 Fetcher: segment: crawl/segments/20130818224849
 Using queue mode : byHost
 Fetcher: threads: 10
 Fetcher: time-out divisor: 2
 QueueFeeder finished: total 1 records + hit by time limit :0
 Using queue mode : byHost
 Using queue mode : byHost
 Using queue mode : byHost
 Using queue mode : byHost
 Using queue mode : byHost
 Using queue mode : byHost
 Using queue mode : byHost
 Using queue mode : byHost
 Using queue mode : byHost
 Using queue mode : byHost
 Fetcher: throughput threshold: -1
 Fetcher: throughput threshold retries: 5
 fetching http://www.ebay.com/ (queue crawl delay=5000ms)
 -finishing thread FetcherThread, activeThreads=8
 -finishing thread FetcherThread, activeThreads=8
 -finishing thread FetcherThread, activeThreads=2
 -finishing thread FetcherThread, activeThreads=3
 -finishing thread FetcherThread, activeThreads=4
 -finishing thread FetcherThread, activeThreads=5
 -finishing thread FetcherThread, activeThreads=6
 -finishing thread FetcherThread, activeThreads=7
 -finishing thread FetcherThread, activeThreads=1
 -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
 -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
 -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
 -finishing thread FetcherThread, activeThreads=0
 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
 -activeThreads=0
 Fetcher: finished at 2013-08-18 22:48:56, elapsed: 00:00:05
 ParseSegment: starting at 2013-08-18 22:48:56
 ParseSegment: segment: crawl/segments/20130818224849
 Parsed (15ms):http://www.ebay.com/
 ParseSegment: finished at 2013-08-18 22:48:57, elapsed: 00:00:01
 CrawlDb update: starting at 2013-08-18 22:48:57
 CrawlDb update: db: crawl/crawldb
 CrawlDb update: segments: [crawl/segments/20130818224849]
 CrawlDb update: additions allowed: true
 CrawlDb update: URL normalizing: true
 CrawlDb update: URL filtering: true
 CrawlDb update: 404 purging: false
 CrawlDb update: Merging segment data into db.
 CrawlDb update: finished at 2013-08-18 22:48:58, elapsed: 00:00:01
 Generator: starting at 2013-08-18 22:48:58
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: topN: 10
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: 0 records selected for fetching, exiting ...
 Stopping at depth=1 - no more URLs to fetch.
 LinkDb: starting at 2013-08-18 22:48:59
 LinkDb: linkdb: crawl/linkdb
 LinkDb: URL normalize: true
 LinkDb: URL filter: true
 LinkDb: internal links will be ignored.
 LinkDb: adding segment:
 

Re: Issues Running Nutch 1.7 in Eclipse-- Please Help

2013-08-19 Thread Tejas Patil
As I said earlier, take a dump of the segment. Use the dump option of the
command here (http://wiki.apache.org/nutch/bin/nutch_readseg). Once that's
done, in the output file, look for the entry for the parent link (
http://www.ebay.com/) and check its status, what was crawled, what was
parsed, were any outlinks extracted.
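
Something like this (the segment name is the one from your log; "segdump" is an arbitrary
output directory, and if I recall correctly the merged text ends up in segdump/dump):

bin/nutch readseg -dump crawl/segments/20130818224849 segdump
grep -A 30 "http://www.ebay.com/" segdump/dump

The record for the seed should tell you whether the fetched content looks like the real page
and whether its ParseData lists any outlinks at all.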


On Mon, Aug 19, 2013 at 2:22 PM, S.L simpleliving...@gmail.com wrote:

 How can I analyze the segment dump from Nutch? It has a number of folders,
 it seems. Can you please let me know which specific folder in the segments
 folder I need to look into? Also, the index and the data files are not
 exactly text files, so it is hard to make any sense out of them.

 I am using the default regex-filter that comes with nutch 1.7 , I have not
 changed that.

 Thank You.


 On Mon, Aug 19, 2013 at 4:07 AM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  The logs say:
  Generator: 0 records selected for fetching, exiting ...
  Stopping at depth=1 - no more URLs to fetch.
 
  Please get the segment dump and analyse it for the outlinks extracted.
 Also
  check your filters.
 
 
  On Sun, Aug 18, 2013 at 8:02 PM, S.L simpleliving...@gmail.com wrote:
 
   Hello All,
  
   I am running Nutch 1.7 in eclipse and I start out with the Crawl job
 with
   the following settings.
  
   Main Class :org.apache.nutch.crawl.Crawl
   Arguments : urls -dir crawl -depth 10 -topN 10
  
   In the urls directory I have only one URL http://www.ebay.com and I
   expect the whole website to be crawled , however I get the following
 log
   output and the crawl seems to stop after a few urls are fetched.
  
   I use the nutch-default.xml and have already set http.content.limit to
 -1
   in it as mentioned in the other message in this mailing list. However
   the
   crawl stops after a few URLs are fetched, please see the log below and
   advise.
  
   I am running eclipse on CentOS 6.4/Nutch 1.7
  
   SLF4J: Class path contains multiple SLF4J bindings.
   SLF4J: Found binding in
  
  
 
 [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   SLF4J: Found binding in
  
  
 
 [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
   explanation.
   solrUrl is not set, indexing will be skipped...
   crawl started in: crawl
   rootUrlDir = urls
   threads = 10
   depth = 10
   solrUrl=null
   topN = 10
   Injector: starting at 2013-08-18 22:48:45
   Injector: crawlDb: crawl/crawldb
   Injector: urlDir: urls
   Injector: Converting injected urls to crawl db entries.
   Injector: total number of urls rejected by filters: 1
   Injector: total number of urls injected after normalization and
  filtering:
   1
   Injector: Merging injected urls into crawl db.
   Injector: finished at 2013-08-18 22:48:47, elapsed: 00:00:02
   Generator: starting at 2013-08-18 22:48:47
   Generator: Selecting best-scoring urls due for fetch.
   Generator: filtering: true
   Generator: normalizing: true
   Generator: topN: 10
   Generator: jobtracker is 'local', generating exactly one partition.
   Generator: Partitioning selected urls for politeness.
   Generator: segment: crawl/segments/20130818224849
   Generator: finished at 2013-08-18 22:48:51, elapsed: 00:00:03
   Fetcher: starting at 2013-08-18 22:48:51
   Fetcher: segment: crawl/segments/20130818224849
   Using queue mode : byHost
   Fetcher: threads: 10
   Fetcher: time-out divisor: 2
   QueueFeeder finished: total 1 records + hit by time limit :0
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   Fetcher: throughput threshold: -1
   Fetcher: throughput threshold retries: 5
   fetching http://www.ebay.com/ (queue crawl delay=5000ms)
   -finishing thread FetcherThread, activeThreads=8
   -finishing thread FetcherThread, activeThreads=8
   -finishing thread FetcherThread, activeThreads=2
   -finishing thread FetcherThread, activeThreads=3
   -finishing thread FetcherThread, activeThreads=4
   -finishing thread FetcherThread, activeThreads=5
   -finishing thread FetcherThread, activeThreads=6
   -finishing thread FetcherThread, activeThreads=7
   -finishing thread FetcherThread, activeThreads=1
   -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
   -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
   -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
   -finishing thread FetcherThread, activeThreads=0
   -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
   -activeThreads=0
   Fetcher: finished at 2013-08-18 22:48:56, elapsed: 00:00:05
   ParseSegment: starting at 2013-08-18 22:48:56
   ParseSegment: segment: crawl/segments

Re: protocol-file org.apache.nutch.protocol.file.FileError: File Error: 404

2013-08-06 Thread Tejas Patil
Hi Lewis,
Can you try the patch attached over here:
https://issues.apache.org/jira/browse/NUTCH-1483

Thanks,
Tejas


On Tue, Aug 6, 2013 at 7:24 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi,
 Now using Nutch trunk 1.8-SNAPSHOT HEAD
 Back at this tonight. When attempting to fetch

 file://home/law/Downloads/asf/solr-4.3.1/example/e001 (notice two slashes)

 which contains loads of HTML files, I get the error as below.


 Fetcher: throughput threshold retries: 5
 -finishing thread FetcherThread, activeThreads=1
 org.apache.nutch.protocol.file.FileError: File Error: 404
 at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
 fetch of file://home/law/Downloads/asf/solr-4.3.1/example/e001 failed with:
 org.apache.nutch.protocol.file.FileError: File Error: 404
 -finishing thread FetcherThread, activeThreads=0
 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
 -activeThreads=0
 Fetcher: finished at 2013-08-06 18:59:00, elapsed: 00:00:02

 I then deleted the crawldb changed the seed URL to

 file:/home/law/Downloads/asf/solr-4.3.1/example/e001 (notice one slash)

 But when I eventually get fetching after a few rounds of generate, fetch,
 parse, updatedb, I am landed with

 fetching file:/home/law/Downloads/asf/solr-4.3.1/example/5428_03.html
 (queue crawl delay=500ms)
 org.apache.nutch.protocol.file.FileError: File Error: 404
 at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
 fetch of file:/home/law/Downloads/asf/solr-4.3.1/example/5428_03.html
 failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
 fetching file:/home/law/Downloads/asf/solr-4.3.1/example/5094_08.html
 (queue crawl delay=500ms)
 org.apache.nutch.protocol.file.FileError: File Error: 404
 at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
 fetch of file:/home/law/Downloads/asf/solr-4.3.1/example/5094_08.html
 failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

 Same as before... this happens with every single URL in the directory I am
 trying to crawl.

 Any advice here please?
 Thanks
 Lewis

 --
 *Lewis*



Re: Way to fetch only new sites

2013-08-01 Thread Tejas Patil
Nutch 2.1 officially had support for MySQL as a datastore. There were a lot of
issues reported with MySQL and so in the newer version, i.e. 2.2.X, the MySQL
support is removed. I would recommend using HBase as it's the most stable
backend amongst all supported ones.


On Thu, Aug 1, 2013 at 7:01 AM, Jayadeep Reddy
jayad...@ehealthaccess.comwrote:

 Thank you Julien,
 Will get hbase and try to crawl.


 On Thu, Aug 1, 2013 at 7:10 PM, A Laxmi a.lakshmi...@gmail.com wrote:

  Julien - whatever you are saying about Nutch 2.x and SQL - does it apply
  for the recent release 2.2.1 as well?
 
 
  On Thu, Aug 1, 2013 at 9:38 AM, Julien Nioche 
  lists.digitalpeb...@gmail.com
   wrote:
 
   If you are using Nutch 2.x then you are actually accessing the SQL
  storage
   via Apache GORA. The SQL backend in GORA does not work and it is not
   advised to use it. If you want to use Nutch 2 then use a different
  backend
   like HBase or Cassandra or use Nutch 1.x
  
   On 1 August 2013 14:32, Jayadeep Reddy jayad...@ehealthaccess.com
  wrote:
  
No Julien Using Mysql
   
   
On Thu, Aug 1, 2013 at 7:00 PM, Julien Nioche 
lists.digitalpeb...@gmail.com
 wrote:
   
 What GORA backend are you using?


 On 1 August 2013 14:03, Jayadeep Reddy jayad...@ehealthaccess.com
 
wrote:

  I am using Nutch 2.1 every time I run crawl from dmoz directory
 my
 existing
  crawled pages in the database are fetched again(Taking long
 time/).
   Is
  there a way to crawl only new sites.
 
  Thank you
 
  --
  Jayadeep Reddy.S,
  M.D  C.E.O
  e Health Access Pvt.Ltd
  www.ehealthaccess.com
  Hyderabad-Chennai-Banglore
  http://www.youtube.com/watch?v=0k5LX8mw6Sk
 



 --
 *
 *Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble

   
   
   
--
Jayadeep Reddy.S,
M.D  C.E.O
e Health Access Pvt.Ltd
www.ehealthaccess.com
Hyderabad-Chennai-Banglore
http://www.youtube.com/watch?v=0k5LX8mw6Sk
   
  
  
  
   --
   *
   *Open Source Solutions for Text Engineering
  
   http://digitalpebble.blogspot.com/
   http://www.digitalpebble.com
   http://twitter.com/digitalpebble
  
 



 --
 Jayadeep Reddy.S,
 M.D  C.E.O
 e Health Access Pvt.Ltd
 www.ehealthaccess.com
 Hyderabad-Chennai-Banglore
 http://www.youtube.com/watch?v=0k5LX8mw6Sk



Re: Duplicate Fetches for Fetch Job

2013-07-25 Thread Tejas Patil
1.x has speculative execution turned off:
Fetcher.java:1328:job.setSpeculativeExecution(false);

but 2.x doesn't. It makes sense to do that. I don't see any good reason to
not have it in 2.x. Could you open a jira for this and upload a patch ?
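
In the meantime, a possible stopgap (untested, and assuming FetcherJob is launched through
ToolRunner so the generic Hadoop -D options are honoured) is to pass the property per run:

bin/nutch fetch -D mapred.reduce.tasks.speculative.execution=false -all

or to set the same property to false in conf/nutch-site.xml so it lands in the job configuration.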


On Wed, Jul 24, 2013 at 11:40 PM, Talat UYARER talat.uya...@agmlab.comwrote:

 Hi,

 We are using nutch for high volume crawls. We noticed that FetcherJob
 ReduceTask fetches some websites multiple times for long lasting queues. I
 have discovered the reason for this is the
 mapred.reduce.tasks.speculative.execution
 setting in hadoop. This defaults to true. I suggest this value should
 be false for FetcherJob. What do you think?

 Talat



Re: Nutch 2.2.1 - scripts crawl and nutch

2013-07-12 Thread Tejas Patil
bin/nutch : allows to run individual commands separately.
bin/crawl : contains calls to the bin/nutch script and invokes nutch
commands required for a typical nutch crawl cycle. This makes life easy for
users as you need not know the internal phases (and thus commands) of nutch
and yet run a crawl.
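
For example (arguments are placeholders; check the usage line printed by bin/crawl, since it
differs between 1.x and 2.x):

bin/crawl urls/ TestCrawl http://localhost:8983/solr/ 2

is roughly the same as running, for each of the 2 rounds:

bin/nutch inject urls/        # first round only
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb

followed by a solrindex call at the end.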


On Fri, Jul 12, 2013 at 8:09 AM, A Laxmi a.lakshmi...@gmail.com wrote:

 Hello,

 I have installed Nutch 2.2.1 without any issues. However, I could find two
 scripts crawl and nutch instead of one script - nutch  like in
 earlier releases.

 Could anyone tell me why we have two scripts? what is the advantage of
 using one over the other?

 Thanks for your help!



Re: Nutch scalability tests

2013-07-03 Thread Tejas Patil
 The second run, still shows 1 reduce running, although it shows as 100%
complete, so my thought is it is writing out to the disk, though it has
been about 30+ minutes.
 This one reducers log on the jobtracker however, is empty.

This is weird. There can be an explanation for the first line: the data crawled
was large so dumping would take a lot of time, but as you said there were
very few urls so it should not take 30+ mins unless you crawled some super
large files.
Have you checked the job attempts for the job ? If there are no logs there
then there is something weird going on with your cluster.
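
If the jobtracker UI shows nothing, the raw task attempt logs usually sit on the tasktracker
node under the Hadoop log directory, something like (a sketch; the exact layout depends on
your Hadoop version):

ls $HADOOP_LOG_DIR/userlogs/
# drill down to the reduce attempt of the fetch job; each attempt dir holds stdout, stderr and syslog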


On Wed, Jul 3, 2013 at 8:32 AM, h b hb6...@gmail.com wrote:

 oh and yes, generate.max.count is set to 5000


 On Wed, Jul 3, 2013 at 8:29 AM, h b hb6...@gmail.com wrote:

  I dropped my webpage database, restarted with 5 seed urls. First fetch
  completed in a few seconds. The second run, still shows 1 reduce running,
  although it shows as 100% complete, so my thought is it is writing out to
  the disk, though it has been about 30+ minutes.
  Again, I had 80 reducers, when I look at the log of these reducers in the
  hadoop jobtracker, I see
 
  0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0
 URLs in 0 queues
 
  in all of them, which leads me to think that the completed 79 reducers
 actually fetched nothing, which might explain why this 1 stuck reducer is
 working so hard.
 
  This may be expected, since I am crawling a single domain. This one
 reducers log on the jobtracker however, is empty. Don't know what to make
 of that.
 
 
 
 
 
  On Tue, Jul 2, 2013 at 4:15 PM, Lewis John Mcgibbney 
  lewis.mcgibb...@gmail.com wrote:
 
  Hi,
 
  On Tue, Jul 2, 2013 at 3:53 PM, h b hb6...@gmail.com wrote:
 
   So, I tried this with the generate.max.count property set to 5000,
  rebuild
   ant; ant jar; ant job and reran fetch.
   It still appears the same, first 79 reducers zip through and the last
  one
   is crawling, literally...
  
 
  Sorry I should have been more explicit. This property does not directly
  affect fetching. It is used when GENERATING fetch lists. Meaning that it
  needs to be present and acknowledged at the generate phase... before
  fetching is executed.
  Besides this, is there any progress being made at all on the last
 reduce?
  if you look at your CPU (and heap) for the box this is running on, it is
  usual to notice high levels for both of these respectively. Maybe this
  output writer is just taking a good while to write data down to HDFS...
  assuming you are using 1.x.
 
 
  
   As for the logs, I mentioned on one of my earlier threads that when I
  run
   from the deploy directory, I am not getting any logs generated.
   I looked for the logs directory under local as well as under deploy,
 and
   just to make sure, also in the grid. I do not see the logs directory.
  So I
   created it manually under deploy before starting fetch, and still
 there
  is
   nothing in this directory,
  
  
  OK so when you run Nutch as a deployed job in your logs are present
 within
  $HADOOP_LOG_DIR... you can check some logs on the JobTracker WebApp e.g.
  you will be able to see the reduce tasks for the fetch job and you will
  also be able to see varying snippets or all of the log here.
 
 
 



Re: Nutch scalability tests

2013-07-03 Thread Tejas Patil
The steps you performed are right.

Did you get the log for that one hardworking reducer ? It will give us a hint
why the job took so long. Ideally you should get logs for every job and its
attempts. If you cannot get the log for that reducer, then I feel that your
cluster is having some problem and this needs to be addressed.


On Wed, Jul 3, 2013 at 8:47 AM, h b hb6...@gmail.com wrote:

 Hi Tejas, looks like we were typing at the same time
 So anyway, my job ended fine, just to be sure what I am doing is right, I
 have cleared the db and started another round again. If I stumble again,
 will respond back on this thread.


 On Wed, Jul 3, 2013 at 8:43 AM, Tejas Patil tejas.patil...@gmail.com
 wrote:

   The second run, still shows 1 reduce running, although it shows as 100%
  complete, so my thought is it is writing out to the disk, though it has
  been about 30+ minutes.
   This one reducers log on the jobtracker however, is empty.
 
  This is weird. There can be a explanation for first line: The data
 crawled
  was large so dumping would take a lot of time but as you said there were
  very less urls so it should not take 30+ mins unless you crawled some
 super
  large files.
  Have you checked the job attempts for the job ? If there are no logs
 there
  then there is something weird going on with your cluster.
 
 
  On Wed, Jul 3, 2013 at 8:32 AM, h b hb6...@gmail.com wrote:
 
   oh and yes, generate.max.count is set to 5000
  
  
   On Wed, Jul 3, 2013 at 8:29 AM, h b hb6...@gmail.com wrote:
  
I dropped my webpage database, restarted with 5 seed urls. First
 fetch
completed in a few seconds. The second run, still shows 1 reduce
  running,
although it shows as 100% complete, so my thought is it is writing
 out
  to
the disk, though it has been about 30+ minutes.
Again, I had 80 reducers, when I look at the log of these reducers in
  the
hadoop jobtracker, I see
   
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0
   URLs in 0 queues
   
in all of them, which leads me to think that the completed 79
 reducers
   actually fetched nothing, which might explain why this 1 stuck reducer
 is
   working so hard.
   
This may be expected, since I am crawling a single domain. This one
   reducers log on the jobtracker however, is empty. Don't know what to
 make
   of that.
   
   
   
   
   
On Tue, Jul 2, 2013 at 4:15 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:
   
Hi,
   
On Tue, Jul 2, 2013 at 3:53 PM, h b hb6...@gmail.com wrote:
   
 So, I tried this with the generate.max.count property set to 5000,
rebuild
 ant; ant jar; ant job and reran fetch.
 It still appears the same, first 79 reducers zip through and the
  last
one
 is crawling, literally...

   
Sorry I should have been more explicit. This property does not
  directly
affect fetching. It is used when GENERATING fetch lists. Meaning
 that
  it
needs to be present and acknowledged at the generate phase... before
fetching is executed.
Besides this, is there any progress being made at all on the last
   reduce?
if you look at your CPU (and heap) for the box this is running on,
 it
  is
usual to notice high levels for both of these respectively. Maybe
 this
output writer is just taking a good while to write data down to
  HDFS...
assuming you are using 1.x.
   
   

 As for the logs, I mentioned on one of my earlier threads that
 when
  I
run
 from the deploy directory, I am not getting any logs generated.
 I looked for the logs directory under local as well as under
 deploy,
   and
 just to make sure, also in the grid. I do not see the logs
  directory.
So I
 created it manually under deploy before starting fetch, and still
   there
is
 nothing in this directory,


OK so when you run Nutch as a deployed job in your logs are present
   within
$HADOOP_LOG_DIR... you can check some logs on the JobTracker WebApp
  e.g.
you will be able to see the reduce tasks for the fetch job and you
  will
also be able to see varying snippets or all of the log here.
   
   
   
  
 



Re: Integration of Apache-nutch and eclipse.

2013-07-03 Thread Tejas Patil
Have you looked at http://wiki.apache.org/nutch/RunNutchInEclipse ?
This has recently been updated and worked for several people on the
user-group. It has some cool screen shots which would make your life easy
setting up Nutch with eclipse.


On Wed, Jul 3, 2013 at 12:39 AM, Ramakrishna ramakrishna...@dioxe.comwrote:

 Guys.. I'm extremely sorry for posting/asking the same doubt again. Even after
 reading many documents I didn't get how to integrate nutch and eclipse.
 I have the apache-nutch-2.2 and eclipse-juno versions. Please tell me step
 by step how to integrate eclipse and nutch, with documentation. If possible
 please send me screenshots of every step from the beginning File-new-java
 project.. Also I'm working on Windows.. My colleague did a simple project
 without using cygwin/svn.. so please don't tell me again to use these
 software/tools and don't send any links again.. I am already fed up with those
 things.

 Thanks in advance.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Integration-of-Apache-nutch-and-eclipse-tp4075003.html
 Sent from the Nutch - User mailing list archive at Nabble.com.



Re: a plugin extending IndexWriter

2013-07-01 Thread Tejas Patil
From the info you gave, it's hard to tell. Can you look at
src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
and compare. Looking at an indexing plugin which works will help you figure out
if you missed something.

Also, had you added the entry for your plugin into nutch-site.xml -
plugin.includes property before running ?
The default value is:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
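
So if your plugin id were, say, "indexer-mywriter" (a made-up id for illustration), the value
would become something like:

protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|indexer-mywriter|scoring-opic|urlnormalizer-(pass|regex|basic)

where "indexer-mywriter" has to match the id declared in your plugin.xml, otherwise your
writer is never activated.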


On Mon, Jul 1, 2013 at 6:59 AM, Sznajder ForMailingList 
bs4mailingl...@gmail.com wrote:

  I wrote a plugin implementing IndexWriter

 It compiles and creates the jar as requested.

 However, Nutch does not succeed in finding my IndexWriter in its constructor:

 this.indexWriters = (IndexWriter[]) objectCache
 .getObject(IndexWriter.class.getName());

 After this call the indexWriters field is null...





 Where should I define that ?

 Best regards
 Benjamin



Re: Updating the documentation for crawl via 2.x

2013-06-30 Thread Tejas Patil
I think that the wiki page was made with the intention that users knew about
1.x and would now be switching to 2.x. So it had only the gora and
datastore setup steps. I agree with you that it should contain a complete set
of steps.

*@dev:* Unless there is any objection or better suggestion, I would get
this done in coming days.

On Sun, Jun 30, 2013 at 4:14 AM, Sznajder ForMailingList 
bs4mailingl...@gmail.com wrote:

 Hi

 I think we may update the documentation of crawl instructions

 Currently, the instructions stop at the inject step.

 And we are supposed to follow the instructions in Nutch 1.x

 However in these instructions, the syntax is quite different
 For example:

 bin/nutch generate

 does not expect crawldb and segments path etc...

 I think an update would be very useful.

 Benjamin



Re: Crawl in Nutch2.2

2013-06-30 Thread Tejas Patil
I think that you are hitting something that one of the users faced a few days
back. Can you try the things mentioned here:

http://mail-archives.apache.org/mod_mbox/nutch-user/201306.mbox/%3CCAFKhtFwPozH3dokk%2B_bZKqVT81h86aCpQzbL4rR4U3wZ-%2BOmHg%40mail.gmail.com%3E


On Sun, Jun 30, 2013 at 5:10 AM, Sznajder ForMailingList 
bs4mailingl...@gmail.com wrote:

 Thanks a lot for your help

 however, I still did not resolve this issue...


 I attach here the logs after 2 rounds of
 generate/fetch/parse/updatedb

 the DB still contains only the seed url , not more...




 On Thu, Jun 27, 2013 at 12:37 AM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

 Try each step with a crawlId and see if this provides you with better
 results.

 Unless you truncated all data between Nutch tasks then you should be
 seeing
 more data in HBase.
 As Tejas asked... what do the logs say?


 On Wed, Jun 26, 2013 at 3:40 AM, Sznajder ForMailingList 
 bs4mailingl...@gmail.com wrote:

  Hi Lewis,
 
  Thanks for your reply
 
  I just set the values:
 
   gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
 
 
  I already removed the Hbase table in the past. Can it be a cause?
 
  Benjamin
 
 
 
 
  On Tue, Jun 25, 2013 at 7:34 PM, Lewis John Mcgibbney 
  lewis.mcgibb...@gmail.com wrote:
 
   Have you changed from the default MemStore gora storage to something
  else?
  
   On Tuesday, June 25, 2013, Sznajder ForMailingList 
   bs4mailingl...@gmail.com
   wrote:
thanks Tejas
   
 Yes, I checked the logs and no Error appears in them
   
I let the http.content.limit and parser.html.impl with their default
value...
   
Benajmin
   
   
On Tue, Jun 25, 2013 at 6:14 PM, Tejas Patil 
 tejas.patil...@gmail.com
   wrote:
   
Did you check the logs (NUTCH_HOME/logs/hadoop.log) for any
 exception
  or
error messages ?
Also you might have a look at these configs in nutch-site.xml
 (default
values are in nutch-default.xml):
http.content.limit and parser.html.impl
   
   
On Tue, Jun 25, 2013 at 7:04 AM, Sznajder ForMailingList 
bs4mailingl...@gmail.com wrote:
   
 Hello

 I installed Nutch 2.2 on my linux machine.

 I defined the seed directory with one file containing:
 http://en.wikipedia.org/
 http://edition.cnn.com/


 I ran the following:
 sh bin/nutch inject ~/DataExplorerCrawl_gpfs/seed/

 After this step:
 the call
 -bash-4.1$ sh bin/nutch readdb -stats

 returns
 TOTAL urls: 2
 status 0 (null):2
 avg score:  1.0


 Then, I ran the following:
 bin/nutch generate -topN 10
 bin/nutch fetch -all
 bin/nutch parse -all
 bin/nutch updatedb
 bin/nutch generate -topN 1000
 bin/nutch fetch -all
 bin/nutch parse -all
 bin/nutch updatedb


 However, the stats call after these steps is still:
 the call
 -bash-4.1$ sh bin/nutch readdb -stats
 status 5 (status_redir_perm):   1
 max score:  2.0
 TOTAL urls: 3
 avg score:  1.334



 Only 3 urls?!
 What do I miss?

 thanks

 Benjamin

   
   
  
   --
   *Lewis*
  
 



 --
 *Lewis*





Re: nutch2.x in cluster mode ?

2013-06-30 Thread Tejas Patil
I have never used 2.x on prod but this is what I would do:
The datastore backend needs to be setup on the cluster. Even Hadoop must be
installed. Export all relevant environment variables. Nutch 2.x source must
be downloaded to the master node. Then modify the required configs and run
ant runtime to create nutch binaries inside NUTCH_HOME/runtime/deploy.
Trigger the crawl command from the master node.
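
Roughly, on the master node (an untested sketch, paths are illustrative only):

export HADOOP_HOME=/opt/hadoop          # wherever your hadoop 1.x lives
export PATH=$HADOOP_HOME/bin:$PATH
cd $NUTCH_HOME
ant runtime
cd runtime/deploy
bin/crawl ...                           # or the individual bin/nutch commands; they now submit jobs to the cluster

The storage settings (gora.properties, hbase-site.xml on the classpath) must of course point
at the cluster and not at localhost.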


On Sun, Jun 30, 2013 at 1:16 AM, Tony Mullins tonymullins...@gmail.comwrote:

 Tejas, that [0] is for nutch 1.x which uses hdfs for its data storage. And
 the new nutch 2.x uses hbase (backend) which is already based on hadoop
 (hdfs). If I deploy my hbase in cluster mode on 3 different nodes then do I
 still need to deploy nutch 2.x on these 3 nodes as well ?

 Could you please care to add some little more information for nutch2.x +
 hbase + hadoop ?

 Regards,
 Khan

 [0] : http://wiki.apache.org/nutch/NutchHadoopTutorial


 On Sat, Jun 29, 2013 at 10:50 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  On Sat, Jun 29, 2013 at 10:36 AM, imran khan imrankhan.x...@gmail.com
  wrote:
 
   Greetings,
  
   Is there any guide for setting up nutch2.x in cluster mode ?
  
 
  [0] is a relevant wiki page .. which has not been updated since a long
  time.
  I am guessing that you have already tried running in local mode as given
 in
  [1]. For cluster mode, have hadoop 1.2.0 setup and its variables
 exported,
  set nutch configs as per your requirements, run 'ant' and then run nutch
  commands from $NUTCH_HOME/runtime/deploy
 
 
   And which versions of hadopp nutch2.x/hbase works well in cluster mode
 ?
  
 
  Use Nutch 2.2 and HBase 0.90.x
 
  
   Regards,
   Khan
  
 
  [0] : http://wiki.apache.org/nutch/NutchHadoopTutorial
  [1] : http://wiki.apache.org/nutch/Nutch2Tutorial
 



Re: Questions/issues with nutch

2013-06-30 Thread Tejas Patil
I am curious to know why you needed the raw html content instead of parsed
stuff. Search engines are meant to index parsed text. The data to be stored
and indexed reduces after parsing.


On Sat, Jun 29, 2013 at 9:20 PM, h b hb6...@gmail.com wrote:

 Thanks Tejas,
 I have just 2 urls in my seed file, and the second run of fetch ran for a
 few hours. I will verify if I got what I wanted.

 Regarding the raw html, it's an ugly hack, so I did not really create a
 patch. But this is what I did


 In
 src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
 getParse method,

   //text = sb.toString();
   text = new String(page.getContent().array());

 Would be nice to make this as a configuration in the plugin xml.

 Other thing I will try soon is to extract the content only for a specific
 depth.



 On Sat, Jun 29, 2013 at 12:49 AM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  Yes. Nutch would parse the HTML and extract the content out of it.
 Tweaking
  around the code surrounding the parser would have made that happen. If
 you
  did something else, would you mind sharing it ?
 
  The depth is used by the Crawl class in 1.x which is deprecated in 2.x.
  Use bin/crawl instead.
  While running the bin/crawl script, the numberOfRounds option is
  nothing but the depth till which you want the crawling to be performed.
 
  If you want to use the individual commands instead, run generate - fetch
  - parse - update multiple times. The crawl script internally does the
  same thing.
  eg. If you want to fetch till depth 3, this is how you could do:
  inject - (generate - fetch - parse - update)
- (generate - fetch - parse - update)
- (generate - fetch - parse - update)
 - solrindex
 
  On Fri, Jun 28, 2013 at 7:24 PM, h b hb6...@gmail.com wrote:
 
   Ok, I tweaked the code a bit to extract the html as is from the parser,
  to
   realize that it is too much of a text and too much depth of crawling.
 So
  I
   am looking to see if I can somehow limit the depth. Nutch 1.x docs
  mention
   about the -depth parameter. However, I do not see this in the
   nutch-default.xml under Nutch 2.x. The -topN is used for number of
 links
   per depth. So for Nutch 2.x where/how do I set the depth?
  
  
   On Fri, Jun 28, 2013 at 11:32 AM, h b hb6...@gmail.com wrote:
  
Ok, SO i also got this work with Solr 4 no errors, I think the key
 was
   not
using a crawl id.
I had to comment the updatelog in solrconfig.xml because I got some
_version_ related error.
   
My next questions is, my solr document, or for that matter even the
  hbase
value of the html content is 'not html'. It appears that nutch is
extracting out text only. How do I retain the html content as is.
   
   
   
   
   
   
On Fri, Jun 28, 2013 at 10:54 AM, Tejas Patil 
  tejas.patil...@gmail.com
   wrote:
   
Kewl !!
   
I wonder why org.apache.solr.common.SolrException: undefined field
   text
happens.. Anybody who can throw light on this ?
   
   
On Fri, Jun 28, 2013 at 10:45 AM, h b hb6...@gmail.com wrote:
   
 Thanks Tejas
 I tried these steps, One step I added, was updatedb

 *bin/nutch updatedb*

 Just to be consistent with the doc, and your suggestion on some
  other
 thread, I used solr 3.6 instead of 4.x
 I copied the schema.xml from nutch/conf (rootlevel) and started
  solr.
   It
 failed with

 SEVERE: org.apache.solr.common.SolrException: undefined field text


 One of the google thread, suggested I ignore this error, so I
  ignored
and
 indexed anyway

 So now I got it to work. Playing some more with the queries




 On Fri, Jun 28, 2013 at 9:52 AM, Tejas Patil 
   tejas.patil...@gmail.com
 wrote:

  The storage.schema.webpage seems messed up but I don't have ample time
  now to look into it. Here is what I would suggest to get things working:

  [1] Remove all the old data from HBase

  (I assume that HBase is running while you do this)
  cd $HBASE_HOME
  ./bin/hbase shell

  In the HBase shell, use "list" to see all the tables, delete all of those
  related to Nutch (ones named as *webpage).
  Remove them using "disable" and "drop" commands.

  eg. if one of the tables is "webpage", you would run this:
  disable 'webpage'
  drop 'webpage'

  [2] Run crawl
  I assume that you have not changed storage.schema.webpage in
  nutch-site.xml and nutch-default.xml. If yes, revert it to:

  <property>
    <name>storage.schema.webpage</name>
    <value>webpage</value>
    <description>This value holds the schema name used for Nutch web db.
    Note that Nutch ignores the value in the gora mapping files, and uses
    this as the webpage schema name.
    </description>

Re: How to use Nutch 2.2.1 with Solr

2013-06-30 Thread Tejas Patil
On Sun, Jun 30, 2013 at 5:57 AM, Hung Nguyen Dang 
nguyendanghung2...@gmail.com wrote:

 Hello,

 I'm new to use nutch, but I found that, the document is confuse with me,
 Could you please help me show a basic document step by step to config and
 run nutch?
 Do wee need to config :
 1. Hadoop,


http://hadoop.apache.org/docs/stable/single_node_setup.html
http://hadoop.apache.org/docs/stable/cluster_setup.html


 2. HBase


http://hbase.apache.org/book/quickstart.html


 3. Gora


https://wiki.apache.org/nutch/Nutch2Tutorial


 to run nutch?

 Thanks,
 Nguyen Dang Hung



Re: Questions/issues with nutch

2013-06-29 Thread Tejas Patil
Yes. Nutch would parse the HTML and extract the content out of it. Tweaking
around the code surrounding the parser would have made that happen. If you
did something else, would you mind sharing it ?

The depth is used by the Crawl class in 1.x which is deprecated in 2.x.
Use bin/crawl instead.
While running the bin/crawl script, the numberOfRounds option is
nothing but the depth till which you want the crawling to be performed.

If you want to use the individual commands instead, run generate - fetch
- parse - update multiple times. The crawl script internally does the
same thing.
eg. If you want to fetch till depth 3, this is how you could do:
inject - (generate - fetch - parse - update)
  - (generate - fetch - parse - update)
  - (generate - fetch - parse - update)
   - solrindex
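As concrete 2.x commands, that would be something like (topN and the Solr URL are example values):

  bin/nutch inject urls/
  # repeat the next four commands once per depth level:
  bin/nutch generate -topN 1000
  bin/nutch fetch -all
  bin/nutch parse -all
  bin/nutch updatedb
  # after the last round:
  bin/nutch solrindex http://localhost:8983/solr/ -all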

On Fri, Jun 28, 2013 at 7:24 PM, h b hb6...@gmail.com wrote:

 Ok, I tweaked the code a bit to extract the html as is from the parser, to
 realize that it is too much of a text and too much depth of crawling. So I
 am looking to see if I can somehow limit the depth. Nutch 1.x docs mention
 about the -depth parameter. However, I do not see this in the
 nutch-default.xml under Nutch 2.x. The -topN is used for number of links
 per depth. So for Nutch 2.x where/how do I set the depth?


 On Fri, Jun 28, 2013 at 11:32 AM, h b hb6...@gmail.com wrote:

  Ok, SO i also got this work with Solr 4 no errors, I think the key was
 not
  using a crawl id.
  I had to comment the updatelog in solrconfig.xml because I got some
  _version_ related error.
 
  My next questions is, my solr document, or for that matter even the hbase
  value of the html content is 'not html'. It appears that nutch is
  extracting out text only. How do I retain the html content as is.
 
 
 
 
 
 
  On Fri, Jun 28, 2013 at 10:54 AM, Tejas Patil tejas.patil...@gmail.com
 wrote:
 
  Kewl !!
 
  I wonder why org.apache.solr.common.SolrException: undefined field
 text
  happens.. Anybody who can throw light on this ?
 
 
  On Fri, Jun 28, 2013 at 10:45 AM, h b hb6...@gmail.com wrote:
 
   Thanks Tejas
   I tried these steps, One step I added, was updatedb
  
   *bin/nutch updatedb*
  
   Just to be consistent with the doc, and your suggestion on some other
   thread, I used solr 3.6 instead of 4.x
   I copied the schema.xml from nutch/conf (rootlevel) and started solr.
 It
   failed with
  
   SEVERE: org.apache.solr.common.SolrException: undefined field text
  
  
   One of the google thread, suggested I ignore this error, so I ignored
  and
   indexed anyway
  
   So now I got it to work. Playing some more with the queries
  
  
  
  
   On Fri, Jun 28, 2013 at 9:52 AM, Tejas Patil 
 tejas.patil...@gmail.com
   wrote:
  
 The storage.schema.webpage seems messed up but I don't have ample time
 now to look into it. Here is what I would suggest to get things working:

 [1] Remove all the old data from HBase

 (I assume that HBase is running while you do this)
 cd $HBASE_HOME
 ./bin/hbase shell

 In the HBase shell, use "list" to see all the tables, delete all of those
 related to Nutch (ones named as *webpage).
 Remove them using "disable" and "drop" commands.

 eg. if one of the tables is "webpage", you would run this:
 disable 'webpage'
 drop 'webpage'

 [2] Run crawl
 I assume that you have not changed storage.schema.webpage in
 nutch-site.xml and nutch-default.xml. If yes, revert it to:

 <property>
   <name>storage.schema.webpage</name>
   <value>webpage</value>
   <description>This value holds the schema name used for Nutch web db.
   Note that Nutch ignores the value in the gora mapping files, and uses
   this as the webpage schema name.
   </description>
 </property>

 Run crawl commands:
 bin/nutch inject urls/
 bin/nutch generate -topN 5 -noFilter -adddays 0
 bin/nutch fetch -all -threads 5
 bin/nutch parse -all

 [3] Perform indexing
 I assume that you have Solr set up and NUTCH_HOME/conf/schema.xml copied in
 ${SOLR_HOME}/example/solr/conf/. See bullets 4-6 in [0] for details.
 Start solr and run the indexing command:
 bin/nutch solrindex $SOLR_URL -all
   
[0] : http://wiki.apache.org/nutch/NutchTutorial
   
Thanks,
Tejas
   
On Thu, Jun 27, 2013 at 1:47 PM, h b hb6...@gmail.com wrote:
   
 Ok, so avro did not work quite well for me, I got a test grid with
   hbase,
 and I started using that for now. All steps ran without errors
 and I
   see
my
 crawled doc in hbase.
 However, after running the solr integration, and querying solr, I
  get
back
 nothing. Index files look very tiny. The one thing I noted is a
  message
 during almost every step

 13/06/27 20:37:53 INFO store.HBaseStore: Keyclass and nameclass
  match
   but
 mismatching table names  mappingfile schema is 'webpage' vs

Re: nutch2.x in cluster mode ?

2013-06-29 Thread Tejas Patil
On Sat, Jun 29, 2013 at 10:36 AM, imran khan imrankhan.x...@gmail.comwrote:

 Greetings,

 Is there any guide for setting up nutch2.x in cluster mode ?


[0] is a relevant wiki page, which has not been updated in a long time.
I am guessing that you have already tried running in local mode as given in
[1]. For cluster mode, have hadoop 1.2.0 setup and its variables exported,
set nutch configs as per your requirements, run 'ant' and then run nutch
commands from $NUTCH_HOME/runtime/deploy


 And which versions of hadoop/nutch2.x/hbase work well in cluster mode?


Use Nutch 2.2 and HBase 0.90.x


 Regards,
 Khan


[0] : http://wiki.apache.org/nutch/NutchHadoopTutorial
[1] : http://wiki.apache.org/nutch/Nutch2Tutorial


Re: NUTCH, SOLR and HBase integration

2013-06-28 Thread Tejas Patil
Try using Solr 3.x and follow the steps (4-6) given in
http://wiki.apache.org/nutch/NutchTutorial
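In short, those steps boil down to something like this (assuming the default Solr 3.x example layout and port):

  cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/example/solr/conf/
  cd $SOLR_HOME/example && java -jar start.jar
  bin/nutch solrindex http://localhost:8983/solr/ -all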


On Thu, Jun 27, 2013 at 11:39 PM, Mariam Salloum
mariam.sall...@gmail.comwrote:

 Hi Tejas,

 Thanks for your response. I'm using the latest version solr-4.3.1.


 On Jun 27, 2013, at 11:10 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  Which version of SOLR are you using ? It should go well with Solr 3.x
 
  http://wiki.apache.org/nutch/NutchTutorial
 
 
  On Thu, Jun 27, 2013 at 11:02 PM, Mariam Salloum
  mariam.sall...@gmail.comwrote:
 
  I'm having problems with integrating SOLR and NUTCH. I have done the
  following:
 
  1 - Installed/configured NUTCH, SOLR, and HBase.
 
  2 - The crawl script did not work for me, so I'm using the step-by-step
  commands
 
  3 - I ran inject, generate, fetch, and parse and all ran successfully.
 I'm
  able to see the table in HBase and see the fetch and parse flags set for
  the entries.
 
  4 - I copied the /conf/schema.xml from the Nutch directory into the SOLR
  config directory and verified its using the right schema.xml file.
 
  5 - I made sure that I updated schema.xml to set indexed and stored
  property to true
  field name=content type=text stored=true indexed=true/
 
  6 - Finally, I started SOLR and tried running bin/nutch solrindex …
 
  SOLR runs without errors (checked the solr.log). However, nothing is
  loaded to SOLR. It states number of documents loaded is 0, and the query
  *:* returns nothing.
 
  What could be the problem? Any ideas will be appreciated.
 
  Thanks
 
  Mariam




Re: Questions/issues with nutch

2013-06-28 Thread Tejas Patil
 directory structure.
  Make changes to conf/nutch-site.xml, build the job jar, navigate to
  runtime/deploy, run the code.
  It's easier to make the job jar and scripts in deploy available to the
  job
  tracker.
  You also didn't comment on the counters for the inject job. Do you see
  any?
  Best
  Lewis
 
  On Wednesday, June 26, 2013, h b hb6...@gmail.com wrote:
   Here is an example of what I am saying about the config changes not
  taking
   effect.
  
   cd runtime/deploy
   cat ../local/conf/nutch-site.xml
   ..
  
 property
   namestorage.data.store.class/name
   valueorg.apache.gora.avro.store.AvroStore/value
 /property
   .
  
   cd ../..
  
   ant job
  
   cd runtime/deploy
   bin/nutch inject urls -crawlId crawl1
   .
   13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class
   org.apache.gora.memory.store.MemStore as the Gora storage class.
   .
  
   So the nutch-site.xml was changed to use AvroStore as storage class
 and
  job
   was rebuilt, and I reran inject, the output of which still shows that
  it
  is
   trying to use Memstore.
  
  
  
  
  
  
  
  
   On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney 
   lewis.mcgibb...@gmail.com wrote:
  
   The Gora MemStore was introduced to deal predominantly with test
  scenarios.
   This is justified as the 2.x code is pulled nightly and after every
  commit
   and tested.
   It is not thread safe and should not be used (until we fix some
  issues)
   for any kind of serious deployment.
   From your inject task on the job tracker, you will be able to see
   'urls_injected' counters which represent the number of urls actually
   persisted through Gora into the datastore.
   I understand that HBase is not an option. Gora should also support
  writing
   the output into Avro sequence files... which can be pumped into
 hdfs.
  We
   have done some work on this so I suppose that right now is as good a
  time
   as any for you to try it out.
   use the default datastore as org.apache.gora.avro.store.AvroStore I
  think.
   You can double check by looking into gora.properties
   As a note, you should use nutch-site.xml within the top level conf
   directory for all your Nutch configuration. You should then create a
  new
   job jar for use in hadoop by calling 'ant job' after the changes are
  made.
   hth
   Lewis
  
   On Wednesday, June 26, 2013, h b hb6...@gmail.com wrote:
The quick responses flowing are very encouraging. Thanks Tejas.
Tejas, as I mentioned earlier, in fact I actually ran it step by
  step.
   
So first I ran the inject command and then the readdb with dump
  option
   and
did not see anything in the dump files, that leads me to say that
  the
inject did not work.I verified the regex-urlfilter and made sure
  that
  my
url is not getting filtered.
   
I agree that the second link is about configuring HBase as a
  storageDB.
However, I do not have Hbase installed and dont foresee getting it
installed any sooner, hence using HBase for storage is not a
 option,
  so I
am going to have to stick to Gora with memory store.
   
   
   
   
On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil 
  tejas.patil...@gmail.com
   wrote:
   
On Wed, Jun 26, 2013 at 9:53 PM, h b hb6...@gmail.com wrote:
   
 Thanks for the response Lewis.
 I did read these links, I mostly followed the first link and
  tried
   both
the
 3.2 and 3.3 sections. Using the bin/crawl gave me null pointer
   exception
on
 solr, so I figured that I should first deal with getting the
  crawl
   part
to
 work and then deal with solr indexing. Hence I went back to
  trying
  it
 stepwise.

   
You should try running the crawl using individual commands and
 see
  where
the problem is. The nutch tutorial which Lewis pointed you to had
  those
commands. Even peeking into the bin/crawl script would also help
  as it
calls the nutch commands.
   

 As for the second link, it is more about using HBase as store
  instead
   of
 gora. This is not really a option for me yet, cause my grid
 does
  not
   have
 hbase installed yet. Getting it done is not much under my
 control

   
HBase is one of the datastores supported by Apache Gora. That
  tutorial
speaks about how to configure Nutch (actually Gora) to use HBase
  as a
backend. So, its wrong to say that the tutorial was about HBase
 and
  not
Gora.
   

 the FAQ link is the one I had not gone through until I checked
  your
 response, but I do not find answers to any of my questions
 (directly/indirectly) in it.

   
Ok
   




 On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney 
  *Lewis*
  
  
 
  --
  *Lewis*
 
 
 
 



Re: Id based crawling with nutch2.x/hbase and multiple webpage tables

2013-06-28 Thread Tejas Patil
On Thu, Jun 27, 2013 at 12:24 AM, Tony Mullins tonymullins...@gmail.comwrote:

 I am grateful for the help community is giving me and I wont be able to do
 it without their help.

 When I was using Cassandra, it only created a single 'webpage' table; if I
 ran my jobs without crawlId (directly from eclipse) or with crawlId it
 always used the same 'webpage' table.
 This is not the case with HBase, as HBase creates a table like
 'crawlId_webpage' , so what I was saying is it possible to achieve the same
 behavior (Cassandra's)  with Hbase  ( to make HBase only create single
 'webpage' table even if I give crawlId to my bin/crawl script ) ?


You can customize your bin/crawl script to get that done. Currently it
passes the crawlId argument to the nutch commands. You can check the
usage of those commands and figure out whether they accept -all.
AFAIK, the fetch and parse commands have an -all param which you can use.
Updatedb does not need it, as by default it works over all batches.
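For example, the per-round part of the script could be reduced to something like this (topN and thread count are placeholder values), so that no crawlId is passed and HBase keeps the single 'webpage' table:

  bin/nutch generate -topN 1000
  bin/nutch fetch -all -threads 10
  bin/nutch parse -all
  bin/nutch updatedb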

And I think this log is generated due to the same issue I mentioned above :
 Keyclass and nameclass match but mismatching table names  mappingfile
 schema is 'webpage' vs actual schema 'C11_webpage' , assuming they are the
 same.


I have no clue what this is about. I will be looking into this in coming
days.


 And what do you meant by the status of URLs ?


Those indicate the status of the url. [0] is a shameless plug of my answer
over stackoverflow which tells what each status stands for.

These are the logs when I run my job for the first time ( Inject -
 generate - fetch - parse - DBUpdate) and for 2 or 3 depth levels (
 generate - fetch - parse - DBUpdate)

 I always get these
 *status:2 (status_fetched)*
 fetchTime:0
 prevFetchTime:0
 fetchInterval:0
 retriesSinceFetch:0
 modifiedTime:0
 prevModifiedTime:0
 protocolStatus:(null)


 Thanks again for your help.
 Tony.


[0]  :
http://stackoverflow.com/questions/16853155/where-can-i-find-documentation-about-nutch-status-codes/16869165#16869165




 On Thu, Jun 27, 2013 at 2:33 AM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

  On Wed, Jun 26, 2013 at 4:30 AM, Tony Mullins tonymullins...@gmail.com
  wrote:
 
  
    Is it possible to crawl with crawlId but HBase only creates a 'webpage'
  table
   without crawlId prefix , just like Cassandra does?
  
 
  I can't understand this question Tony.
 
 
  
   And my other problems of DBUpdateJob's exception on some random urls
 and
   repeating/mixed html of all urls present in seed.txt are also resolved
   (disappeared) with HBase backend.
  
 
  Good
 
 
   Am I suppose to get proper values here or these are the expected output
  in
   ParseFilter plugin ?
  
   What is the status of the URLs which have the null or 0 values for the
  fields you posted?
 
 
 
   PS. Now I am getting correct HTML in ParseFilter with HBase backend.
  
   Good
 



Re: Questions/issues with nutch

2013-06-27 Thread Tejas Patil
On Wed, Jun 26, 2013 at 10:26 PM, h b hb6...@gmail.com wrote:

 The quick responses flowing are very encouraging. Thanks Tejas.
 Tejas, as I mentioned earlier, in fact I actually ran it step by step.

 So first I ran the inject command and then the readdb with dump option and
 did not see anything in the dump files, that leads me to say that the
 inject did not work.I verified the regex-urlfilter and made sure that my
 url is not getting filtered.

  and you see nothing interesting in the logs. Oh boy... If this happens
w/o any config changes over the distribution (apart from http.agent.name),
then it should have been reported by now. You might set the loggers to a
lower level to get more details. I have a feeling that the most likely reason is
that the datastore used is buggy.

I agree that the second link is about configuring HBase as a storageDB.
 However, I do not have Hbase installed and dont foresee getting it
 installed any sooner, hence using HBase for storage is not a option, so I
 am going to have to stick to Gora with memory store.

Ok. There were Jiras logged regarding memory store not working correctly
(it was in reference to junits failing). Lewis / Renato might have
more knowledge about it. Being honest, I doubt that anybody out there
is actually using memstore. HBase seems to be the most cheered backend.




 On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  On Wed, Jun 26, 2013 at 9:53 PM, h b hb6...@gmail.com wrote:
 
   Thanks for the response Lewis.
   I did read these links, I mostly followed the first link and tried both
  the
   3.2 and 3.3 sections. Using the bin/crawl gave me null pointer
 exception
  on
   solr, so I figured that I should first deal with getting the crawl part
  to
   work and then deal with solr indexing. Hence I went back to trying it
   stepwise.
  
 
  You should try running the crawl using individual commands and see where
  the problem is. The nutch tutorial which Lewis pointed you to had those
  commands. Even peeking into the bin/crawl script would also help as it
  calls the nutch commands.
 
  
   As for the second link, it is more about using HBase as store instead
 of
   gora. This is not really a option for me yet, cause my grid does not
 have
   hbase installed yet. Getting it done is not much under my control
  
 
  HBase is one of the datastores supported by Apache Gora. That tutorial
  speaks about how to configure Nutch (actually Gora) to use HBase as a
  backend. So, its wrong to say that the tutorial was about HBase and not
  Gora.
 
  
   the FAQ link is the one I had not gone through until I checked your
   response, but I do not find answers to any of my questions
   (directly/indirectly) in it.
  
 
  Ok
 
  
  
  
  
   On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney 
   lewis.mcgibb...@gmail.com wrote:
  
Hi Hemant,
I strongly advise you to take some time to look through the Nutch
   Tutorial
for 1.x and 2.x.
http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/Nutch2Tutorial
Also please see the FAQ's, which you will find very very useful.
http://wiki.apache.org/nutch/FAQ
   
Thanks
Lewis
   
   
On Wed, Jun 26, 2013 at 5:18 PM, h b hb6...@gmail.com wrote:
   
 Hi,
 I am first time user of nutch. I installed
 nutch(2.2)/solr(4.3)/hadoop(0.20) and got started to crawl a single
 webpage.

 I am running nutch step by step. These are the problems I came
  across -

 1. Inject did not work, i..e the url does not reflect in the
 webdb(gora-memstore). The way I verify this is after running
 inject,
  i
run
 readdb with dump. This created a directory in hdfs with 0 size part
   file.

 2. config files - This confused me a lot. When run from deploy
   directory,
 does nutch use the config files from local/conf? Changes made to
 local/conf/nutch-site.xml did not take effect after editing this
  file.
   I
 had to edit this in order to get rid of the 'http.agent.name'
  error. I
 finally ended up hard-coding this in the code, rebuilding and
 running
   to
 keep going forward.

 3. how to interpret readdb - Running readdb -stats, shows a lot out
output
 but I do not see my url from seed.txt in there. So I do not know if
  the
 entry in webdb actually reflects my seed.txt at all or not.

 4. logs - When nutch is run from the deploy directory, the
logs/hadoop.log
 is not generated anymore, not locally, nor on the grid. I tried to
  make
it
 verbose by changing log4j.properties to DEBUG, but still had not
 file
 generated.

 Any help with this would help me move forward with nutch.

 Regards
 Hemant

   
   
   
--
*Lewis*
   
  
 



Re: Questions/issues with nutch

2013-06-27 Thread Tejas Patil
Hi Lewis,
Thanks for details.

One quickie: By using memstore as the datastore, will the results be
persisted across runs ? I mean, after injecting stuff, where would the
crawl datums get stored on to the disk so that the generate phase gets
those ? I believe that memstore won't do it and would give up everything
once the process ends.


On Wed, Jun 26, 2013 at 11:06 PM, Tejas Patil tejas.patil...@gmail.comwrote:

 On Wed, Jun 26, 2013 at 10:26 PM, h b hb6...@gmail.com wrote:

 The quick responses flowing are very encouraging. Thanks Tejas.
 Tejas, as I mentioned earlier, in fact I actually ran it step by step.

 So first I ran the inject command and then the readdb with dump option and
 did not see anything in the dump files, that leads me to say that the
 inject did not work.I verified the regex-urlfilter and made sure that my
 url is not getting filtered.

   and you see nothing interesting in the logs. Oh boy... If this happens
 w/o any config changes over the distribution (apart from http.agent.name),
 then it should have been reported by now. You might set the loggers to a
 lower level to get more details. I have a feeling that the most likely reason is
 that the datastore used is buggy.

 I agree that the second link is about configuring HBase as a storageDB.
 However, I do not have Hbase installed and dont foresee getting it
 installed any sooner, hence using HBase for storage is not a option, so I
 am going to have to stick to Gora with memory store.

 Ok. There were Jiras logged regarding memory store not working correctly
 (it was in reference to junits failing). Lewis / Renato might have
 more knowledge about it. Being honest, I doubt that anybody out there
 is actually using memstore. HBase seems to be the most cheered backend.




 On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  On Wed, Jun 26, 2013 at 9:53 PM, h b hb6...@gmail.com wrote:
 
   Thanks for the response Lewis.
   I did read these links, I mostly followed the first link and tried
 both
  the
   3.2 and 3.3 sections. Using the bin/crawl gave me null pointer
 exception
  on
   solr, so I figured that I should first deal with getting the crawl
 part
  to
   work and then deal with solr indexing. Hence I went back to trying it
   stepwise.
  
 
  You should try running the crawl using individual commands and see where
  the problem is. The nutch tutorial which Lewis pointed you to had those
  commands. Even peeking into the bin/crawl script would also help as it
  calls the nutch commands.
 
  
   As for the second link, it is more about using HBase as store instead
 of
   gora. This is not really a option for me yet, cause my grid does not
 have
   hbase installed yet. Getting it done is not much under my control
  
 
  HBase is one of the datastores supported by Apache Gora. That tutorial
  speaks about how to configure Nutch (actually Gora) to use HBase as a
  backend. So, its wrong to say that the tutorial was about HBase and not
  Gora.
 
  
   the FAQ link is the one I had not gone through until I checked your
   response, but I do not find answers to any of my questions
   (directly/indirectly) in it.
  
 
  Ok
 
  
  
  
  
   On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney 
   lewis.mcgibb...@gmail.com wrote:
  
Hi Hemant,
I strongly advise you to take some time to look through the Nutch
   Tutorial
for 1.x and 2.x.
http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/Nutch2Tutorial
Also please see the FAQ's, which you will find very very useful.
http://wiki.apache.org/nutch/FAQ
   
Thanks
Lewis
   
   
On Wed, Jun 26, 2013 at 5:18 PM, h b hb6...@gmail.com wrote:
   
 Hi,
 I am first time user of nutch. I installed
 nutch(2.2)/solr(4.3)/hadoop(0.20) and got started to crawl a
 single
 webpage.

 I am running nutch step by step. These are the problems I came
  across -

 1. Inject did not work, i..e the url does not reflect in the
 webdb(gora-memstore). The way I verify this is after running
 inject,
  i
run
 readdb with dump. This created a directory in hdfs with 0 size
 part
   file.

 2. config files - This confused me a lot. When run from deploy
   directory,
 does nutch use the config files from local/conf? Changes made to
 local/conf/nutch-site.xml did not take effect after editing this
  file.
   I
 had to edit this in order to get rid of the 'http.agent.name'
  error. I
 finally ended up hard-coding this in the code, rebuilding and
 running
   to
 keep going forward.

 3. how to interpret readdb - Running readdb -stats, shows a lot
 out
output
 but I do not see my url from seed.txt in there. So I do not know
 if
  the
 entry in webdb actually reflects my seed.txt at all or not.

 4. logs - When nutch is run from the deploy directory, the
logs/hadoop.log
 is not generated anymore, not locally, nor on the grid. I tried

Re: Segments / Database in Nutch 2.X

2013-06-27 Thread Tejas Patil
On Thu, Jun 27, 2013 at 3:38 AM, Sznajder ForMailingList 
bs4mailingl...@gmail.com wrote:

 Hi

 I do not see the usage of Segments in nutch 2.x

 In addition, I do not see DB path .


segments and crawldb are notions in 1.x representing directories on the
filesystem which hold the crawler's data (those are nothing but Hadoop's Map
files and Sequence files).
2.x leverages datastores to store the crawled data. A table is created in
the datastore to hold all the information.


 In such a setup, how can we run two separate crawls, one starting from url1
 and the second from another seed, for example?


You could specify different crawlIDs. Being honest, I have never tried
running multiple crawls at the same time with 2.x.
It's not seen as a good thing to do, as mentioned by Julien in this thread:
http://lucene.472066.n3.nabble.com/Concurrently-running-multiple-nutch-crawls-td3166207.html
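For what it's worth, two independent crawls would be started with different ids like this (the seed dirs and ids are placeholders; on HBase each id gets its own table, e.g. one_webpage):

  bin/nutch inject urls_one/ -crawlId one
  bin/nutch inject urls_two/ -crawlId two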


 Benjamin



Re: nutch issues

2013-06-27 Thread Tejas Patil
On Thu, Jun 27, 2013 at 4:30 AM, devang pandey devangpande...@gmail.comwrote:

 I am quite new to nutch. I have crawled a site successfully using nutch 1.2


You should use the latest version (1.7) as it has many bug fixes and
enhancements.


 and extracted a segment dump with the *readseg* command, but the issue is that the dump
 contains a lot of information other than urls and outlinks; also, if I want to
 analyse it, a manual approach needs to be adopted.


Did you use the general options ? Those are

-nocontent    ignore content directory
-nofetch      ignore crawl_fetch directory
-nogenerate   ignore crawl_generate directory
-noparse      ignore crawl_parse directory
-noparsedata  ignore parse_data directory
-noparsetext  ignore parse_text directory

To see the usage, just run bin/nutch readseg w/o any params.

It would be really great
 if there were any utility or plugin which exports links with outlinks in a machine
 readable format like csv or sql. Please suggest


The dump option of the readseg command gives you a dump of the segment
in a plain text file which is human readable. You could run some shell
commands to convert it into the form you want.
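For example (the segment name is a placeholder and the exact field labels in the dump may differ between versions):

  bin/nutch readseg -dump crawl/segments/20130627000000 dumpdir -nocontent -nofetch -nogenerate -noparsetext
  grep -E 'URL::|outlink:' dumpdir/dump > links.txt   # then massage into csv with awk/sed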


Re: Questions/issues with nutch

2013-06-27 Thread Tejas Patil
What is the datastore in gora.properties ?

http://wiki.apache.org/nutch/Nutch2Tutorial
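For reference, the default store is picked up there via the gora.datastore.default key, e.g. for the AvroStore you are trying:

  gora.datastore.default=org.apache.gora.avro.store.AvroStore

or, for HBase:

  gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

That value and storage.data.store.class in nutch-site.xml should agree.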


On Wed, Jun 26, 2013 at 11:37 PM, h b hb6...@gmail.com wrote:

 Here is an example of what I am saying about the config changes not taking
 effect.

 cd runtime/deploy
 cat ../local/conf/nutch-site.xml
 ..

   property
 namestorage.data.store.class/name
 valueorg.apache.gora.avro.store.AvroStore/value
   /property
 .

 cd ../..

 ant job

 cd runtime/deploy
 bin/nutch inject urls -crawlId crawl1
 .
 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class
 org.apache.gora.memory.store.MemStore as the Gora storage class.
 .

 So the nutch-site.xml was changed to use AvroStore as storage class and job
 was rebuilt, and I reran inject, the output of which still shows that it is
 trying to use Memstore.








 On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

  The Gora MemStore was introduced to deal predominantly with test
 scenarios.
  This is justified as the 2.x code is pulled nightly and after every
 commit
  and tested.
  It is not thread safe and should not be used (until we fix some issues)
  for any kind of serious deployment.
  From your inject task on the job tracker, you will be able to see
  'urls_injected' counters which represent the number of urls actually
  persisted through Gora into the datastore.
  I understand that HBase is not an option. Gora should also support
 writing
  the output into Avro sequence files... which can be pumped into hdfs. We
  have done some work on this so I suppose that right now is as good a time
  as any for you to try it out.
  use the default datastore as org.apache.gora.avro.store.AvroStore I
 think.
  You can double check by looking into gora.properties
  As a note, you should use nutch-site.xml within the top level conf
  directory for all your Nutch configuration. You should then create a new
  job jar for use in hadoop by calling 'ant job' after the changes are
 made.
  hth
  Lewis
 
  On Wednesday, June 26, 2013, h b hb6...@gmail.com wrote:
   The quick responses flowing are very encouraging. Thanks Tejas.
   Tejas, as I mentioned earlier, in fact I actually ran it step by step.
  
   So first I ran the inject command and then the readdb with dump option
  and
   did not see anything in the dump files, that leads me to say that the
   inject did not work.I verified the regex-urlfilter and made sure that
 my
   url is not getting filtered.
  
   I agree that the second link is about configuring HBase as a storageDB.
   However, I do not have Hbase installed and dont foresee getting it
   installed any sooner, hence using HBase for storage is not a option,
 so I
   am going to have to stick to Gora with memory store.
  
  
  
  
   On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil 
 tejas.patil...@gmail.com
  wrote:
  
   On Wed, Jun 26, 2013 at 9:53 PM, h b hb6...@gmail.com wrote:
  
Thanks for the response Lewis.
I did read these links, I mostly followed the first link and tried
  both
   the
3.2 and 3.3 sections. Using the bin/crawl gave me null pointer
  exception
   on
solr, so I figured that I should first deal with getting the crawl
  part
   to
work and then deal with solr indexing. Hence I went back to trying
 it
stepwise.
   
  
   You should try running the crawl using individual commands and see
 where
   the problem is. The nutch tutorial which Lewis pointed you to had
 those
   commands. Even peeking into the bin/crawl script would also help as it
   calls the nutch commands.
  
   
As for the second link, it is more about using HBase as store
 instead
  of
gora. This is not really a option for me yet, cause my grid does not
  have
hbase installed yet. Getting it done is not much under my control
   
  
   HBase is one of the datastores supported by Apache Gora. That tutorial
   speaks about how to configure Nutch (actually Gora) to use HBase as a
   backend. So, its wrong to say that the tutorial was about HBase and
 not
   Gora.
  
   
the FAQ link is the one I had not gone through until I checked your
response, but I do not find answers to any of my questions
(directly/indirectly) in it.
   
  
   Ok
  
   
   
   
   
On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:
   
 Hi Hemant,
 I strongly advise you to take some time to look through the Nutch
Tutorial
 for 1.x and 2.x.
 http://wiki.apache.org/nutch/NutchTutorial
 http://wiki.apache.org/nutch/Nutch2Tutorial
 Also please see the FAQ's, which you will find very very useful.
 http://wiki.apache.org/nutch/FAQ

 Thanks
 Lewis


 On Wed, Jun 26, 2013 at 5:18 PM, h b hb6...@gmail.com wrote:

  Hi,
  I am first time user of nutch. I installed
  nutch(2.2)/solr(4.3)/hadoop(0.20) and got started to crawl a
  single
  webpage.
 
  I am running nutch step

Re: 2.x Eclipse Breakpoints

2013-06-27 Thread Tejas Patil
Updated the wiki :)


On Mon, Jun 24, 2013 at 11:34 PM, Tejas Patil tejas.patil...@gmail.comwrote:

 This is based on my personal experience and ain't an exhaustive list. I am
 sure that other folks on @user will have more suggestions. I will put this
 on the relevant wiki page shortly.

 FetcherReducer$FetcherThread run() : line 487 : LOG.info("fetching " + fit.url ...
                                    : line 519 : final ProtocolStatus status = output.getStatus();

 GeneratorMapper : map() : line 53
 GeneratorReducer : reduce() : line 53
 OutlinkExtractor : getOutlinks() : line 84


 On Mon, Jun 24, 2013 at 6:44 PM, Prashant Ladha 
 prashant.la...@gmail.comwrote:

 Hi,
 Just wanted to share a feedback in the Nutch-Eclipse setup.
 The Debug Nutch in Eclipse section has breakpoints related to Nutch 1.x

 If anyone can document the helpful breakpoints of 2.x, that would helpful.

 [0] - http://wiki.apache.org/nutch/RunNutchInEclipse

 P.S.:  Is this the right forum for discussing these topics?





Re: Questions/issues with nutch

2013-06-26 Thread Tejas Patil
On Wed, Jun 26, 2013 at 9:53 PM, h b hb6...@gmail.com wrote:

 Thanks for the response Lewis.
 I did read these links, I mostly followed the first link and tried both the
 3.2 and 3.3 sections. Using the bin/crawl gave me null pointer exception on
 solr, so I figured that I should first deal with getting the crawl part to
 work and then deal with solr indexing. Hence I went back to trying it
 stepwise.


You should try running the crawl using individual commands and see where
the problem is. The nutch tutorial which Lewis pointed you to had those
commands. Even peeking into the bin/crawl script would also help as it
calls the nutch commands.


 As for the second link, it is more about using HBase as store instead of
 gora. This is not really a option for me yet, cause my grid does not have
 hbase installed yet. Getting it done is not much under my control


HBase is one of the datastores supported by Apache Gora. That tutorial
speaks about how to configure Nutch (actually Gora) to use HBase as a
backend. So, its wrong to say that the tutorial was about HBase and not
Gora.


 the FAQ link is the one I had not gone through until I checked your
 response, but I do not find answers to any of my questions
 (directly/indirectly) in it.


Ok





 On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

  Hi Hemant,
  I strongly advise you to take some time to look through the Nutch
 Tutorial
  for 1.x and 2.x.
  http://wiki.apache.org/nutch/NutchTutorial
  http://wiki.apache.org/nutch/Nutch2Tutorial
  Also please see the FAQ's, which you will find very very useful.
  http://wiki.apache.org/nutch/FAQ
 
  Thanks
  Lewis
 
 
  On Wed, Jun 26, 2013 at 5:18 PM, h b hb6...@gmail.com wrote:
 
   Hi,
   I am first time user of nutch. I installed
   nutch(2.2)/solr(4.3)/hadoop(0.20) and got started to crawl a single
   webpage.
  
   I am running nutch step by step. These are the problems I came across -
  
   1. Inject did not work, i..e the url does not reflect in the
   webdb(gora-memstore). The way I verify this is after running inject, i
  run
   readdb with dump. This created a directory in hdfs with 0 size part
 file.
  
   2. config files - This confused me a lot. When run from deploy
 directory,
   does nutch use the config files from local/conf? Changes made to
   local/conf/nutch-site.xml did not take effect after editing this file.
 I
   had to edit this in order to get rid of the 'http.agent.name' error. I
   finally ended up hard-coding this in the code, rebuilding and running
 to
   keep going forward.
  
   3. how to interpret readdb - Running readdb -stats, shows a lot out
  output
   but I do not see my url from seed.txt in there. So I do not know if the
   entry in webdb actually reflects my seed.txt at all or not.
  
   4. logs - When nutch is run from the deploy directory, the
  logs/hadoop.log
   is not generated anymore, not locally, nor on the grid. I tried to make
  it
   verbose by changing log4j.properties to DEBUG, but still had not file
   generated.
  
   Any help with this would help me move forward with nutch.
  
   Regards
   Hemant
  
 
 
 
  --
  *Lewis*
 



Re: 2.x Eclipse Breakpoints

2013-06-25 Thread Tejas Patil
This is based on my personal experience and ain't an exhaustive list. I am
sure that other folks on @user will have more suggestions. I will put this
on the relevant wiki page shortly.

FetcherReducer$FetcherThread run() : line 487 : LOG.info("fetching " + fit.url ...
                                   : line 519 : final ProtocolStatus status = output.getStatus();

GeneratorMapper : map() : line 53
GeneratorReducer : reduce() : line 53
OutlinkExtractor : getOutlinks() : line 84


On Mon, Jun 24, 2013 at 6:44 PM, Prashant Ladha prashant.la...@gmail.comwrote:

 Hi,
 Just wanted to share a feedback in the Nutch-Eclipse setup.
 The Debug Nutch in Eclipse section has breakpoints related to Nutch 1.x

 If anyone can document the helpful breakpoints of 2.x, that would helpful.

 [0] - http://wiki.apache.org/nutch/RunNutchInEclipse

 P.S.:  Is this the right forum for discussing these topics?



Re: FBA / Cookies

2013-06-24 Thread Tejas Patil
AFAIK, Nutch would support proxy authentication but won't do Form based
authentication.


On Mon, Jun 24, 2013 at 2:57 AM, jonathan_ou...@mcafee.com wrote:

  Hello there,

 I’m sure this has been asked before, however looking online and in
 archived e-mails I cannot find a recent response so…

 Does Nutch support Form Based Authentication via credentials and stored
 cookies?

 I’m aware that there are a  number of challenges; detecting when a page
 request has been redirected to the authentication page, detecting when
 cookies time out etc.  But I was interested to hear if Nutch now supports
 this feature.

 Regards

 Jonathan Oulds
 Software Engineer / PSC – Drive Encryption
 McAfee, Inc.


Re: need legends for fetch reduce jobtracker ouput

2013-06-22 Thread Tejas Patil
What will be the right place to add a note about the explanation of the
Fetcher logs ?
I think in FAQs wiki page under fetching section [0] ?

[0] : http://wiki.apache.org/nutch/FAQ#Fetching


On Mon, Apr 22, 2013 at 10:37 PM, kiran chitturi
chitturikira...@gmail.comwrote:

 Yes Lewis. It would be the best way for the permissions right now.

 I will add Tejas once he shares his wiki uid.



 On Tue, Apr 23, 2013 at 1:07 AM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

  I agree.
  I can sort this tomorrow.
  @Kiran,
  Are we still working to addition of documentation contributers via
  contributers and admin group since the most recent lockdown?
  Tejas should be added to both groups.
  @Tejas please drop one of us your wiki uid whenever it suits.
  Lewis
 
  On Monday, April 22, 2013, Tejas Patil tejas.patil...@gmail.com wrote:
   Hi Lewis,
  
   Thanks !!
   I have huge respect for those who engineered the Fetcher class (esp. of
   1.x) as its simply *awesome* and complex piece of code.
   I can polish my post more so that it comes to the wiki quality. I
 don't
   have access to wiki. Can you provide me the same ?
  
   Thanks,
   Tejas
  
  
   On Mon, Apr 22, 2013 at 8:09 PM, Lewis John Mcgibbney 
   lewis.mcgibb...@gmail.com wrote:
  
   hi Tejas,
   this is a real excellent reply and very useful.
   it would be really great if we could somehow have this kind of low
 level
   information readily available on the Nutch wiki.
  
   On Monday, April 22, 2013, Tejas Patil tejas.patil...@gmail.com
  wrote:
Fetcher threads try to get a fetch item (url) from a queue of all
 the
   fetch
items (this queue is actually a queue of queues. For details see
 [0]).
  If
   a
thread doesnt get a fetch-item, it spinwaits for 500ms before
 polling
  the
queue again.
The '*spinWaiting*' count tells us how many threads are in their
spinwaiting state at a given instance.
   
The '*active*' count tells us how many threads are currently
  performing
   the
activities related to the fetch of a fetch-item. This involves
 sending
requests to the server, getting the bytes from the server, parsing,
   storing
etc..
   
'*pages*' is a count for total pages fetched till a given point.
'*errors*' is a count for total errors seen.
   
*Next comes pages/s:*
First number comes from this:
 ((((float)pages.get())*10)/elapsed)/10.0
   
second one comes from this:
(actualPages*10)/10.0
   
actualPages holds the count of pages processed in the last 5 secs
  (when
   the
calculation is done).
   
First number can be seen as the overall speed for that execution.
 The
 second number can be regarded as the instantaneous speed as it just
 uses
   the
#pages in last 5 secs when this calculation is done. See lines
 818-830
  in
[0].
   
*Next comes the kb/s* values which are computed as follows:
 ((((float)bytes.get())*8)/1024)/elapsed
 (((float)actualBytes)*8)/1024
   
This is similar to that of pages/sec. See lines 818-830 in [0].
   
'*URLs*' indicates how many urls are pending and '*queues*' indicate
  the
number of queues present. Queues are formed on the basis on hostname
  or
   ip
depending on the configuration set.
   
See FetcherReducer.java [0] for more details.
   
[0] :
   
  
  
 
 
 http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java?view=markup
   
   
On Mon, Apr 22, 2013 at 6:09 PM, kaveh minooie ka...@plutoz.com
  wrote:
   
could someone please tell me one more time, in this line:
0/20 spinwaiting/active, 53852 pages, 7612 errors, 4.1 12 pages/s,
  2632
7346 kb/s, 989 URLs in 5 queues  reduce
   
what are the two numbers before pages/s and two numbers before
 kb/s?
   
thanks,
   
   
  
   --
   *Lewis*
  
  
 
  --
  *Lewis*
 



 --
 Kiran Chitturi

 http://www.linkedin.com/in/kiranchitturi



Re: need legends for fetch reduce jobtracker ouput

2013-06-22 Thread Tejas Patil
Thanks Lewis. The info about the fetcher log has been added to the FAQ page here [0].

[0] :
https://wiki.apache.org/nutch/FAQ#What_do_the_numbers_in_the_fetcher_log_indicate_.3F


On Sat, Jun 22, 2013 at 12:54 AM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Sounds great Tejas.
 Wow this is a late shift.
 If you can commit your fetcher diagnostics it would be great Tejas.

 On Saturday, June 22, 2013, Tejas Patil tejas.patil...@gmail.com wrote:
  What will be the right place to add a note about the explanation of the
  Fetcher logs ?
  I think in FAQs wiki page under fetching section [0] ?
 
  [0] : http://wiki.apache.org/nutch/FAQ#Fetching
 
 
  On Mon, Apr 22, 2013 at 10:37 PM, kiran chitturi
  chitturikira...@gmail.comwrote:
 
  Yes Lewis. It would be the best way for the permissions right now.
 
  I will add Tejas once he shares his wiki uid.
 
 
 
  On Tue, Apr 23, 2013 at 1:07 AM, Lewis John Mcgibbney 
  lewis.mcgibb...@gmail.com wrote:
 
   I agree.
   I can sort this tomorrow.
   @Kiran,
   Are we still working to addition of documentation contributers via
   contributers and admin group since the most recent lockdown?
   Tejas should be added to both groups.
   @Tejas please drop one of us your wiki uid whenever it suits.
   Lewis
  
   On Monday, April 22, 2013, Tejas Patil tejas.patil...@gmail.com
 wrote:
Hi Lewis,
   
Thanks !!
I have huge respect for those who engineered the Fetcher class (esp.
 of
1.x) as its simply *awesome* and complex piece of code.
I can polish my post more so that it comes to the wiki quality. I
  don't
have access to wiki. Can you provide me the same ?
   
Thanks,
Tejas
   
   
On Mon, Apr 22, 2013 at 8:09 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:
   
hi Tejas,
this is a real excellent reply and very useful.
it would be really great if we could somehow have this kind of low
  level
information readily available on the Nutch wiki.
   
On Monday, April 22, 2013, Tejas Patil tejas.patil...@gmail.com
   wrote:
 Fetcher threads try to get a fetch item (url) from a queue of all
  the
fetch
 items (this queue is actually a queue of queues. For details see
  [0]).
   If
a
 thread doesnt get a fetch-item, it spinwaits for 500ms before
  polling
   the
 queue again.
 The '*spinWaiting*' count tells us how many threads are in their
 spinwaiting state at a given instance.

 The '*active*' count tells us how many threads are currently
   performing
the
 activities related to the fetch of a fetch-item. This involves
  sending
 requests to the server, getting the bytes from the server,
 parsing,
storing
 etc..

 '*pages*' is a count for total pages fetched till a given point.
 '*errors*' is a count for total errors seen.

 *Next comes pages/s:*
 First number comes from this:
  ((((float)pages.get())*10)/elapsed)/10.0

 second one comes from this:
 (actualPages*10)/10.0

 actualPages holds the count of pages processed in the last 5 secs
   (when
the
 calculation is done).

 First number can be seen as the overall speed for that execution.
  The
  second number can be regarded as the instantaneous speed as it just
  uses
the
 #pages in last 5 secs when this calculation is done. See lines
  818-830
   in
 [0].

 *Next comes the kb/s* values which are computed as follows:
  ((((float)bytes.get())*8)/1024)/elapsed
  (((float)actualBytes)*8)/1024

 This is similar to that of pages/sec. See lines 818-830 in [0].

 '*URLs*' indicates how many urls are pending and '*queues*'
 indicate
   the
 number of queues present. Queues are formed on the basis on
 hostname
   or
ip
 depending on the configuration set.

 See FetcherReducer.java [0] for more details.

 [0] :

   
   
  
  
 

 http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java?view=markup


 On Mon, Apr 22, 2013 at 6:09 PM, kaveh minooie ka...@plutoz.com
 
   wrote:

 could someone please tell me one more time, in this line:
 0/20 spinwaiting/active, 53852 pages, 7612 errors, 4.1 12
 pages/s,
   2632
 7346 kb/s, 989 URLs in 5 queues  reduce

 what are the two numbers before pages/s and two numbers before
  kb/s?

 thanks,


   
--
*Lewis*
   
   
  
   --
   *Lewis*
  
 
 
 
  --
  Kiran Chitturi
 
  http://www.linkedin.com/in/kiranchitturi
 
 

 --
 *Lewis*



Re: Most stable backend for Nutch 2.x

2013-06-22 Thread Tejas Patil
Yup. HBase 0.90.x is the best datastore for 2.x


On Sat, Jun 22, 2013 at 8:19 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Imran,
 HBase 0.90.x
 thank you
 Lewis

 On Saturday, June 22, 2013, imran khan imrankhan.x...@gmail.com wrote:
  Greetings,
 
  I have seen many mails here about people having different issues with
  different backends with Nutch 2.x
 
  So which backend is most suited /stable with Nutch 2.x and also which
  version of that suited/stable backend.
 
  Please ignore the 'according to your business requirement' factor as my
  most important requirement is that I want to run Nutch 2.x smoothly
 without
  any issues.
 
  Regards,
  Khan
 

 --
 *Lewis*



Re: Nutch 2.x with HBase backend errors

2013-06-21 Thread Tejas Patil
As mentioned in [0], use an older (0.90.x) version of HBase. Unfortunately,
the HBase folks have removed the link from the downloads page. You can grab the
source code from [1] and build it.

[0] : http://wiki.apache.org/nutch/Nutch2Tutorial
[1] : https://svn.apache.org/repos/asf/hbase/tags/0.90.4/
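A rough sketch of building it (assuming Maven is installed; the exact build goal may differ for that tag):

  svn checkout https://svn.apache.org/repos/asf/hbase/tags/0.90.4/ hbase-0.90.4
  cd hbase-0.90.4
  mvn -DskipTests package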


On Fri, Jun 21, 2013 at 11:01 AM, Tony Mullins tonymullins...@gmail.comwrote:

 In site
 http://wiki.apache.org/nutch/Nutch2Tutorial?action=showredirect=GORA_HBase
 its said that
 N.B. It's possible to encounter the following exception:
 java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration;
 this is caused by the fact that sometimes the hbase TEST jar is deployed in
 the lib dir. To resolve this just copy the lib over from your installed
 HBase dir into the build lib dir. (This issue is currently in progress).

 I have tried copying the HBase 0.94.8 lib in Nutch2.x/build/lib but still
 the same error.

 In my Nutch2.x/build/lib there are older versions of zookeeper and hbase
 and if I try to remove them from newer jars in HBase/lib then too it
 doesn't work.

 Please suggest me what else should I do.

 Thanks,
 Tony.


 On Fri, Jun 21, 2013 at 8:12 PM, Tony Mullins tonymullins...@gmail.com
 wrote:

  Hi ,
 
  After getting some errors with Cassandra backend with Nutch2.x , I am
  trying now HBase. I have installed HBase 94.8 and have also created
 sample
  table in it.
 
  After following these links
 
  http://wiki.apache.org/nutch/RunNutchInEclipse
 
 http://wiki.apache.org/nutch/Nutch2Tutorial?action=showredirect=GORA_HBase
 
  I am getting this error when I try to run my first injector job:
 
  Exception in thread main java.lang.NoClassDefFoundError:
  org/apache/hadoop/hbase/HBaseConfiguration
   at
 org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:108)
   at
 
 org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
   at
 
 org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
   at
 
 org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
   at
  org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75
 
  I have noticed one strange thing: while we tell gora.properties that our
  cassandra is running at localhost:9160, we don't do such a thing in the case of
  HBase and just tell it that our default datastore is HBase.
 
  So is there any missing step in these tutorials which could cause this
  exception ?
 
  Thanks,
  Tony.
 
 
 



Re: A bug in the crawl secript in Nutch 1.6

2013-06-21 Thread Tejas Patil
Thanks Joe for pointing it out. There was a jira [0] for this bug and the
change is already present in the trunk.

[0] : https://issues.apache.org/jira/browse/NUTCH-1500


On Fri, Jun 21, 2013 at 7:11 PM, Joe Zhang smartag...@gmail.com wrote:

 The new crawl script is quite useful. Thanks for the addition.

 It comes with a bug, though:


 Line 169:
  $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb
 $CRAWL_PATH/linkdb $SEGMENT

 should be:

  $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb
 $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT

 instead.



Re: confusion over fetch schedule

2013-06-21 Thread Tejas Patil
On Fri, Jun 21, 2013 at 7:07 PM, Joe Zhang smartag...@gmail.com wrote:

 Sorry, Nutch is certainly aware of page modification, and it does capture
 lastModified.

Nutch does capture the last modified field but I am not sure if its
value is used later on. I remember that it was not being used for any logic in
older versions but I need to confirm whether the code has been modified to take that
into account.

The real question is, can nutch get lastModified of a page
 before fetching, and use it to make fetching decisions (e.g,, whether or
 not to override the default interval)?


No. Nutch won't look up the lastModified of a page before fetching its
content.



 On Fri, Jun 21, 2013 at 6:27 PM, Joe Zhang smartag...@gmail.com wrote:

  If I don't change the default value of db.fetch.interval.default, which
 is
  30 days, does it mean that the URL in the db won't be refetched before
 the
  due time even if it has been modified? In other words, is Nutch aware of
  page modification?
 



Re: confusion over fetch schedule

2013-06-21 Thread Tejas Patil
I just checked the current code and it seems to me that lastModified
(aka Modified time in the CrawlDatum class) is not used for any further logic. If you want
to customize the fetch interval for a subset of pages, do as Lewis
suggested, i.e. specify a customized fetch interval for the main pages in
the inject command [0].

[0] : http://wiki.apache.org/nutch/bin/nutch_inject
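For example, a seed entry can carry its own interval as metadata (the value is in seconds and is tab-separated from the url; the url is just an example):

  http://www.cnn.com/	nutch.fetchInterval=3600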


On Fri, Jun 21, 2013 at 8:06 PM, Joe Zhang smartag...@gmail.com wrote:

 Thanks, guys. So, just to confirm, lastModified is not used in the fetching
 logic at all.

 Ideally, it should take higher priority than the default interval. This is
 particularly important for sites such as cnn.com, whether the leaf page
 doesn't really change, but the portal page is updated all the time.

 On Fri, Jun 21, 2013 at 7:40 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  On Fri, Jun 21, 2013 at 7:07 PM, Joe Zhang smartag...@gmail.com wrote:
 
   Sorry, Nutch is certainly aware of page modification, and it does
 capture
   lastModified.
 
  Nutch does capture the last modified field but I am not sure if its
  value is used later on. I remember that it was not being used for any logic in
  older versions but I need to confirm whether the code has been modified to take that
  into account.
 
  The real question is, can nutch get lastModified of a page
   before fetching, and use it to make fetching decisions (e.g,, whether
 or
   not to override the default interval)?
  
 
  No. Nutch won't lookup for the lastModified of a page before fetching its
  content.
 
  
  
   On Fri, Jun 21, 2013 at 6:27 PM, Joe Zhang smartag...@gmail.com
 wrote:
  
If I don't change the default value of db.fetch.interval.default,
 which
   is
30 days, does it mean that the URL in the db won't be refetched
 before
   the
due time even if it has been modified? In other words, is Nutch aware
  of
page modification?
   
  
 



Re: run nutch-1.6 in eclipse

2013-06-19 Thread Tejas Patil
Hi Mustafa,

Those steps were added recently and work for the current Nutch trunk.
They won't work for Nutch 1.6. Please start from step #1, i.e. checkout Nutch
from the repo. If you still face issues, get back to us with precise details
about the error and at which step it occurs.

Thanks,
Tejas


On Wed, Jun 19, 2013 at 1:37 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Mustafa,
 Please read this thoroughly

 http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_One:_Using_the_Mailing_Lists
 Once we understand from a well detailed email, what the problem is, we are
 more than willing to help.
 Until then it is really difficult to help you out. Sorry.
 Lewis


 On Wed, Jun 19, 2013 at 1:30 PM, Mustafa Elkhiat melkh...@gmail.com
 wrote:

  Hi
  i wana run nutch-1.6 in eclipse i follow guide
  http://wiki.apache.org/nutch/RunNutchInEclipse but having errors
  please any one can help me to run nutch in eclipse in detail
 



 --
 *Lewis*



Re: How to get raw HTML in @override Filter method in ParseFilter class

2013-06-14 Thread Tejas Patil
Hey Jamshaid,
We cannot see any screenshot being attached. Could you upload it somewhere
and share the url ?


On Thu, Jun 13, 2013 at 11:25 PM, Jamshaid Ashraf jamshaid...@gmail.comwrote:

 Hi,

 Thanks for prompt reply!

 I have set debug point on following line in plugin code in eclipse but get
 source not found screen when debugging plugin code in eclipse. Please see
 attached screen shot.

 String content = new String(page.getContent().array());

 What might cause this to happen and how can I fix it?

 Regards,
 Jamshaid


 On Thu, Jun 13, 2013 at 8:34 PM, feng lu amuseme...@gmail.com wrote:

 Hi

 I checked the ParseFilter interface in Nutch 2.x like this.

 Parse filter(String url, WebPage page, Parse parse,HTMLMetaTags metaTags,
 DocumentFragment doc);

 you can through this method to get the raw content of html page.

 String content = new String(page.getContent().array());

 and get the parsed text through parse.getText() method.
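
 As a rough, untested sketch (the class name, package and CSS selector below
 are made up, and it assumes the Jsoup jar is available to the plugin), a
 filter built on that signature could look like this:

    import java.nio.ByteBuffer;
    import java.util.Collection;
    import java.util.EnumSet;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseFilter;
    import org.apache.nutch.storage.WebPage;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.w3c.dom.DocumentFragment;

    public class JsoupExtractFilter implements ParseFilter {

      private static final Logger LOG =
          LoggerFactory.getLogger(JsoupExtractFilter.class);
      private Configuration conf;

      @Override
      public Parse filter(String url, WebPage page, Parse parse,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
        ByteBuffer raw = page.getContent();   // raw bytes as fetched
        if (raw != null) {
          String html = new String(raw.array(),
              raw.arrayOffset() + raw.position(), raw.remaining());
          Document jsoupDoc = Jsoup.parse(html, url);
          // "div.article-body" is a made-up selector; replace it with whatever
          // element you actually want to extract.
          String extracted = jsoupDoc.select("div.article-body").text();
          LOG.info("Extracted from {}: {}", url, extracted);
          // Persisting the value (e.g. into page metadata or via an
          // IndexingFilter) is the integration step left out of this sketch.
        }
        return parse;
      }

      @Override
      public Collection<WebPage.Field> getFields() {
        // ask Nutch to load the raw content field for this filter
        return EnumSet.of(WebPage.Field.CONTENT);
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }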





 On Thu, Jun 13, 2013 at 11:10 PM, Jamshaid Ashraf jamshaid...@gmail.com
 wrote:

  Hi,
 
  Since I'm using nutch 2.2 ParseFilter plugin and I need to extract
 custom
  information from parsed raw html (preferably using JSoup) ... but I
 still
  could't find out how to get the raw html in @override filter () method
 . As
  all the examples I have found are in Nutch 1.x api and doens't work with
  new Nutch 2.x api.
 
 
  Thanks in advance!
 
  Regards,
  Jamshaid
 



 --
 Don't Grow Old, Grow Up... :-)





Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-14 Thread Tejas Patil
Sure, I will make a note of it. Do you have any other suggestions that
would make the documentation better?


On Thu, Jun 13, 2013 at 10:29 PM, Tony Mullins tonymullins...@gmail.comwrote:

 Thanks again Tejas.
 This worked. I think these all configurations should be in wiki.. so the
 new users specially who are coming from non-java background (like me) could
 benifit from this and create less traffic in mailing lists :)

 Thanks.
 Tony.


 On Thu, Jun 13, 2013 at 10:38 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  I can't see the image that you attached.
 
  Anyways, if you are running via command line (ie. runtime/local): set
  plugin.folders to plugins in
  NUTCH_HOME/runtime/local/conf/nutch-site.xml. For running from
  Eclipse, set plugin.folders to the absolute path of directory where the
  plugins are generated (ie. NUTCH_HOME/build/plugins) in
  NUTCH_HOME/conf/nutch-site.xml
 
  On Thu, Jun 13, 2013 at 5:38 AM, Tony Mullins tonymullins...@gmail.com
  wrote:
 
   Tejas,
  
   I can now successfully run the plugin from terminal like bin/nutch
   parsechecker http://www.google.nl
  
   But if I try to run my code directly from eclipse , with main class as
   'org.apache.nutch.parse.ParserChecker' and program arguments as '
   http://www.google.nl' it fails with same exception of ClassNotFound.
  
   Please see the attached image.
  
   [image: Inline image 1]
  
  
   I have tried 'ant clean' in my Nutch2.2 source...  but same error !!!
  
   Could you please help me fixing this issue.
  
   Thanks,
   Tony
  
  
  
  
   On Thu, Jun 13, 2013 at 2:23 PM, Tony Mullins 
 tonymullins...@gmail.com
  wrote:
  
   Thank you very much Tejas. It worked. :)
  
   Just wondering why did you ask me to remove the 'plugin.folders' from
   conf/nutch-site.xml ?
   And the problem was due to bad cache/runtime build ?
  
   Thank you again !!!
   Tony.
  
  
  
   On Thu, Jun 13, 2013 at 1:47 PM, Tejas Patil 
 tejas.patil...@gmail.com
  wrote:
  
   I don't see any attachments with the mail.
  
   Anyways, you need to:
   1. remove all your changes from conf/nutch-default.xml. Make it in
 sync
   with svn. (rm conf/nutch-default.xml  svn up
 conf/nutch-default.xml)
   2. In conf/nutch-site.xml, remove the entry for plugin.folders
   3. run ant clean runtime
  
   Now try again.
  
  
   On Thu, Jun 13, 2013 at 1:39 AM, Tony Mullins 
  tonymullins...@gmail.com
   wrote:
  
Hi Tejas,
   
Thanks for pointing out the problem. I have changed the package to
kaqqao.nutch.selector and have also modified the package in java
  source
files as package kaqqao.nutch.selector;
   
But I am still getting the ClassNotFound exception... please see
   attached
images !!!
   
Please note that I am using fresh Nutch 2.2 source without
 additional
patch ... do I need to apply any patch to run this ?
   
Thanks,
Tony.
   
   
   
On Thu, Jun 13, 2013 at 1:16 PM, Tejas Patil 
  tejas.patil...@gmail.com
   wrote:
   
The package structure you actually have is:
*kaqqao.nutch.plugin.selector;*
   
In src/plugin/element-selector/plugin.xml you have defined it as:
   
   extension
  id=*kaqqao.nutch.selector*.HtmlElementSelectorIndexer
  name=Nutch Blacklist and Whitelist Indexing Filter
  point=org.apache.nutch.indexer.IndexingFilter
  implementation id=HtmlElementSelectorIndexer
  class=*kaqqao.nutch.selector*
.HtmlElementSelectorIndexer/
   /extension
   
It aint the same and thats why it cannot load that class at
 runtime.
   Make
it consistent and try again.
It worked at my end after changing the package structure to
kaqqao.nutch.selector
   
   
On Wed, Jun 12, 2013 at 11:45 PM, Tony Mullins 
   tonymullins...@gmail.com
wrote:
   
 Hi Tejas,

 I am following this example
 https://github.com/veggen/nutch-element-selector. And now I
 have
   tried
 this example without any changes to my  fresh source of Nutch
 2.2.

 Attached is my patch ( change set) on fresh Nutch 2.2 source.
 Kindly review it and please let me know if I am missing
 something.

 Thanks,
 Tonny


 On Thu, Jun 13, 2013 at 11:19 AM, Tejas Patil 
   tejas.patil...@gmail.com
wrote:

 Weird. I would like to have a quick peek into your changes.
 Maybe
   you
are
 doing something wrong which is hard to predict and figure out
 by
   asking
 bunch of questions to you over email. Can you attach a patch
 file
   of
your
 changes ? Please remove the fluff from it and only keep the
 bare
essential
 things in the patch. Also, if you are working for some company,
   make
sure
 that you attaching some code here should not be against your
 organisational
 policy.

 Thanks,
 Tejas

 On Wed, Jun 12, 2013 at 11:03 PM, Tony Mullins 
tonymullins...@gmail.com
 wrote:

  I have done

Re: Nutch 2.2 - Exception in thread 'main' [org.apache.gora.sql.store.SqlStore]

2013-06-14 Thread Tejas Patil
There was some discussion about that a few months back and I am not aware of
the exact root cause behind it.
See
http://lucene.472066.n3.nabble.com/Nutch-2-1-different-batch-id-null-td4040592.html
http://lucene.472066.n3.nabble.com/Re-nutch-2-1-with-mysql-different-batch-id-null-td4058698.html

There is Jira to track the same:
https://issues.apache.org/jira/browse/NUTCH-1567



On Thu, Jun 13, 2013 at 2:11 PM, Weder Carlos Vieira weder.vie...@gmail.com
 wrote:

 mhmmm got it...

 Tejas can you please explain to me why I put some URL inside urls/seed.txt
 and many pages inside that urls aren't parsed?

 Example:
 Skipping http://wiki.creativecommons.org/Integrate; different batch id
 (null)
 Skipping http://wiki.creativecommons.org/LRMI; different batch id (null)
 Skipping http://wiki.creativecommons.org/Marking; different batch id
 (null)

 This pages are example of many others pages that aren't parsed.
 Like that, there are many other pages that I wanted to be read and recorded
 in the database.


 Thanks again.



 On Thu, Jun 13, 2013 at 6:04 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  Those are all images which wont get parsed by Nutch.
 
 
  On Thu, Jun 13, 2013 at 1:33 PM, Weder Carlos Vieira 
  weder.vie...@gmail.com
   wrote:
 
  
   I extracted 1 row of this urls returned...
  
   It attached in excel format.
  
  
  
 



Re: Nutch job fails with the exception Caused by: java.io.IOException: Could not obtain block: blk

2013-06-14 Thread Tejas Patil
Thanks for sharing. This seems to be more on the HBase side than on the
Nutch side.


On Thu, Jun 13, 2013 at 11:52 PM, vivekvl vive...@yahoo.com wrote:

 Few references for this issue..

 http://hbase.apache.org/book.html#dfs.datanode.max.xcievers

 http://www.larsgeorge.com/2012/03/hadoop-hbase-and-xceivers.html

 http://blog.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Nutch-job-fails-with-the-exception-Caused-by-java-io-IOException-Could-not-obtain-block-blk-tp4069322p4070438.html
 Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Refrence 3rd Party Jar in Nutch 2.x source

2013-06-14 Thread Tejas Patil
Adding things to the classpath in Eclipse won't help. Look into
NUTCH_HOME/src/plugin/parse-swf/plugin.xml. That plugin uses an external
jar: javaswf.jar. This applies if you want the jar to live with the codebase.
If your jar needs to be picked up from a Maven repository, you will have to
specify it in ivy.xml and plugin.xml.
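
As a rough illustration (the artifact, version and jar names below are
placeholders), the plugin-level ivy.xml and plugin.xml entries look roughly
like this:

    <!-- ivy.xml of your plugin: pull the jar from a Maven repository -->
    <dependency org="org.jsoup" name="jsoup" rev="1.7.2" conf="*->default"/>

    <!-- plugin.xml: make the jars visible to the plugin classloader -->
    <runtime>
      <library name="my-plugin.jar">
        <export name="*"/>
      </library>
      <library name="jsoup-1.7.2.jar"/>
    </runtime>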


On Fri, Jun 14, 2013 at 12:43 AM, imran khan imrankhan.x...@gmail.comwrote:

 I need that jar in one of my custom Nutch Plugin.


 On Fri, Jun 14, 2013 at 12:20 PM, imran khan imrankhan.x...@gmail.com
 wrote:

  Greetings,
 
  I want to add 3rd party jar file in my existing Nutch 2.x source. I have
  tried adding it in Libraries dialog of my Eclipse via 'Add External Jars'
  but I am still getting of pakacage errors on import of that jar on
  sourcefile  where I want to use this Jar.
 
  Could you please help me in adding 3rd party Jar to my Nutch 2.x source
 in
  Eclipse.
 
  Regards
  Imran
 



Re: Refrence 3rd Party Jar in Nutch 2.x source

2013-06-14 Thread Tejas Patil
On Fri, Jun 14, 2013 at 1:58 AM, imran khan imrankhan.x...@gmail.comwrote:

 Ok for ivy dependency I will add my 3rd party jar in dependencies , where
 do I have to define it in plugin.xml ?


Look at how commons-net dependency is defined here:
http://svn.apache.org/viewvc/nutch/trunk/src/plugin/protocol-ftp/plugin.xml?revision=1175075view=markup



 And for 2nd option like parse-swf plugin , I will just place my jar in lib
 directory of plugin and add its refrence in runtime ?

Yes. Add a reference in plugin.xml as done here:
http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-swf/plugin.xml?revision=1175075view=markup


 Regards,
 Imran


 On Fri, Jun 14, 2013 at 1:19 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  Adding things to the classpath in Eclipse won't help. Look into
  NUTCH_HOME/src/plugin/parse-swf/plugin.xml. That plugin uses an external
  jar : javaswf.jar. This applies if you want the jar to be with the
  codebase.
  If your jar needs to be picked up from maven repo, you will have to
 specify
  it in ivy.xml and plugin.xml.
 
 
  On Fri, Jun 14, 2013 at 12:43 AM, imran khan imrankhan.x...@gmail.com
  wrote:
 
   I need that jar in one of my custom Nutch Plugin.
  
  
   On Fri, Jun 14, 2013 at 12:20 PM, imran khan imrankhan.x...@gmail.com
   wrote:
  
Greetings,
   
I want to add 3rd party jar file in my existing Nutch 2.x source. I
  have
tried adding it in Libraries dialog of my Eclipse via 'Add External
  Jars'
but I am still getting of pakacage errors on import of that jar on
sourcefile  where I want to use this Jar.
   
Could you please help me in adding 3rd party Jar to my Nutch 2.x
 source
   in
Eclipse.
   
Regards
Imran
   
  
 



Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-13 Thread Tejas Patil
Weird. I would like to have a quick peek at your changes. Maybe you are
doing something wrong which is hard to predict and figure out by asking a
bunch of questions over email. Can you attach a patch file of your
changes? Please remove the fluff from it and only keep the bare essential
things in the patch. Also, if you are working for some company, make sure
that attaching code here is not against your organisation's
policy.

Thanks,
Tejas

On Wed, Jun 12, 2013 at 11:03 PM, Tony Mullins tonymullins...@gmail.comwrote:

 I have done this all. Created my plugin's ivy.xml , plugin.xml , build,xml
 . Added the entry in nutch-site.xml and srcpluginbuild.xml.
 But I am still getting PluginRuntimeException:
 java.lang.ClassNotFoundException


 Is there any other configuration that I am missing or its Nutch 2.2 issues
 ?

 Thanks,
 Tony.


 On Thu, Jun 13, 2013 at 1:09 AM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  Here is the relevant wiki page:
  http://wiki.apache.org/nutch/WritingPluginExample
 
  Although its old, I think that it will help.
 
 
  On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel 
  wastl.na...@googlemail.com
   wrote:
 
   Hi Tony,
  
   you have to register your plugin in
src/plugin/build.xml
  
   Does your
src/plugin/myplugin/plugin.xml
   properly propagate jar file,
   extension point and implementing class?
  
   And, finally, you have to add your plugin
   to the property plugin.includes in nutch-site.xml
  
   Cheers,
   Sebastian
  
   On 06/12/2013 07:48 PM, Tony Mullins wrote:
Hi,
   
I am trying simple ParseFilter plugin in Nutch 2.2. And I can build
 it
   and
also the srcpluginbuild.xml successfully. But its .jar file is not
   being
created in my runtimelocalpluginsmyplugin directory.
   
And on running
bin/nutch parsechecker http://www.google.nl;
 I get this error  java.lang.RuntimeException:
org.apache.nutch.plugin.PluginRuntimeException:
java.lang.ClassNotFoundException:
com.xyz.nutch.selector.HtmlElementSelectorFilter
   
If I go to MyNutch2.2Source/build/myplugin , I can see plugin's jar
  with
test  classes directory created there. If I copy .jar  from here and
   paste
it to my runtimelocalpluginsmyplugin directory with plugin.xml
 file
   then
too I get the same exception of class not found.
   
I have not made any changes in srcpluginbuild-plugin.xml.
   
Could you please guide me that what is I am doing wrong here ?
   
Thanks,
Tony
   
  
  
 



Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-13 Thread Tejas Patil
A side note (not related to the problem you faced, but a general rule with
nutch):
In your patch I saw that you had changes in both nutch-default.xml and
nutch-site.xml. Please modify only nutch-site.xml and
keep nutch-default.xml as it is.


On Thu, Jun 13, 2013 at 1:07 AM, Tejas Patil tej...@apache.org wrote:

 The package structure you actually have is:
 *kaqqao.nutch.plugin.selector;*

 In src/plugin/element-selector/plugin.xml you have defined it as:

extension id=*kaqqao.nutch.selector*.HtmlElementSelectorIndexer
   name=Nutch Blacklist and Whitelist Indexing Filter
   point=org.apache.nutch.indexer.IndexingFilter
   implementation id=HtmlElementSelectorIndexer
   class=*kaqqao.nutch.selector*
 .HtmlElementSelectorIndexer/
/extension

 It aint the same and thats why it cannot load that class at runtime. Make
 it consistent and try again.
 It worked at my end after changing the package structure to
 kaqqao.nutch.selector


 On Wed, Jun 12, 2013 at 11:45 PM, Tony Mullins 
 tonymullins...@gmail.comwrote:

 Hi Tejas,

 I am following this example
 https://github.com/veggen/nutch-element-selector. And now I have tried
 this example without any changes to my  fresh source of Nutch 2.2.

 Attached is my patch ( change set) on fresh Nutch 2.2 source.
 Kindly review it and please let me know if I am missing something.

 Thanks,
 Tonny


 On Thu, Jun 13, 2013 at 11:19 AM, Tejas Patil 
 tejas.patil...@gmail.comwrote:

 Weird. I would like to have a quick peek into your changes. Maybe you are
 doing something wrong which is hard to predict and figure out by asking
 bunch of questions to you over email. Can you attach a patch file of your
 changes ? Please remove the fluff from it and only keep the bare
 essential
 things in the patch. Also, if you are working for some company, make sure
 that you attaching some code here should not be against your
 organisational
 policy.

 Thanks,
 Tejas

 On Wed, Jun 12, 2013 at 11:03 PM, Tony Mullins tonymullins...@gmail.com
 wrote:

  I have done this all. Created my plugin's ivy.xml , plugin.xml ,
 build,xml
  . Added the entry in nutch-site.xml and srcpluginbuild.xml.
  But I am still getting PluginRuntimeException:
  java.lang.ClassNotFoundException
 
 
  Is there any other configuration that I am missing or its Nutch 2.2
 issues
  ?
 
  Thanks,
  Tony.
 
 
  On Thu, Jun 13, 2013 at 1:09 AM, Tejas Patil tejas.patil...@gmail.com
  wrote:
 
   Here is the relevant wiki page:
   http://wiki.apache.org/nutch/WritingPluginExample
  
   Although its old, I think that it will help.
  
  
   On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel 
   wastl.na...@googlemail.com
wrote:
  
Hi Tony,
   
you have to register your plugin in
 src/plugin/build.xml
   
Does your
 src/plugin/myplugin/plugin.xml
properly propagate jar file,
extension point and implementing class?
   
And, finally, you have to add your plugin
to the property plugin.includes in nutch-site.xml
   
Cheers,
Sebastian
   
On 06/12/2013 07:48 PM, Tony Mullins wrote:
 Hi,

 I am trying simple ParseFilter plugin in Nutch 2.2. And I can
 build
  it
and
 also the srcpluginbuild.xml successfully. But its .jar file is
 not
being
 created in my runtimelocalpluginsmyplugin directory.

 And on running
 bin/nutch parsechecker http://www.google.nl;
  I get this error  java.lang.RuntimeException:
 org.apache.nutch.plugin.PluginRuntimeException:
 java.lang.ClassNotFoundException:
 com.xyz.nutch.selector.HtmlElementSelectorFilter

 If I go to MyNutch2.2Source/build/myplugin , I can see plugin's
 jar
   with
 test  classes directory created there. If I copy .jar  from
 here and
paste
 it to my runtimelocalpluginsmyplugin directory with plugin.xml
  file
then
 too I get the same exception of class not found.

 I have not made any changes in srcpluginbuild-plugin.xml.

 Could you please guide me that what is I am doing wrong here ?

 Thanks,
 Tony

   
   
  
 






Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-13 Thread Tejas Patil
The package structure you actually have is:
*kaqqao.nutch.plugin.selector;*

In src/plugin/element-selector/plugin.xml you have defined it as:

   <extension id="*kaqqao.nutch.selector*.HtmlElementSelectorIndexer"
              name="Nutch Blacklist and Whitelist Indexing Filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="HtmlElementSelectorIndexer"
                      class="*kaqqao.nutch.selector*.HtmlElementSelectorIndexer"/>
   </extension>

It isn't the same, and that's why the class cannot be loaded at runtime. Make
them consistent and try again.
It worked at my end after changing the package structure to
kaqqao.nutch.selector


On Wed, Jun 12, 2013 at 11:45 PM, Tony Mullins tonymullins...@gmail.comwrote:

 Hi Tejas,

 I am following this example
 https://github.com/veggen/nutch-element-selector. And now I have tried
 this example without any changes to my  fresh source of Nutch 2.2.

 Attached is my patch ( change set) on fresh Nutch 2.2 source.
 Kindly review it and please let me know if I am missing something.

 Thanks,
 Tonny


 On Thu, Jun 13, 2013 at 11:19 AM, Tejas Patil tejas.patil...@gmail.comwrote:

 Weird. I would like to have a quick peek into your changes. Maybe you are
 doing something wrong which is hard to predict and figure out by asking
 bunch of questions to you over email. Can you attach a patch file of your
 changes ? Please remove the fluff from it and only keep the bare essential
 things in the patch. Also, if you are working for some company, make sure
 that you attaching some code here should not be against your
 organisational
 policy.

 Thanks,
 Tejas

 On Wed, Jun 12, 2013 at 11:03 PM, Tony Mullins tonymullins...@gmail.com
 wrote:

  I have done this all. Created my plugin's ivy.xml , plugin.xml ,
 build,xml
  . Added the entry in nutch-site.xml and srcpluginbuild.xml.
  But I am still getting PluginRuntimeException:
  java.lang.ClassNotFoundException
 
 
  Is there any other configuration that I am missing or its Nutch 2.2
 issues
  ?
 
  Thanks,
  Tony.
 
 
  On Thu, Jun 13, 2013 at 1:09 AM, Tejas Patil tejas.patil...@gmail.com
  wrote:
 
   Here is the relevant wiki page:
   http://wiki.apache.org/nutch/WritingPluginExample
  
   Although its old, I think that it will help.
  
  
   On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel 
   wastl.na...@googlemail.com
wrote:
  
Hi Tony,
   
you have to register your plugin in
 src/plugin/build.xml
   
Does your
 src/plugin/myplugin/plugin.xml
properly propagate jar file,
extension point and implementing class?
   
And, finally, you have to add your plugin
to the property plugin.includes in nutch-site.xml
   
Cheers,
Sebastian
   
On 06/12/2013 07:48 PM, Tony Mullins wrote:
 Hi,

 I am trying simple ParseFilter plugin in Nutch 2.2. And I can
 build
  it
and
 also the srcpluginbuild.xml successfully. But its .jar file is
 not
being
 created in my runtimelocalpluginsmyplugin directory.

 And on running
 bin/nutch parsechecker http://www.google.nl;
  I get this error  java.lang.RuntimeException:
 org.apache.nutch.plugin.PluginRuntimeException:
 java.lang.ClassNotFoundException:
 com.xyz.nutch.selector.HtmlElementSelectorFilter

 If I go to MyNutch2.2Source/build/myplugin , I can see plugin's
 jar
   with
 test  classes directory created there. If I copy .jar  from here
 and
paste
 it to my runtimelocalpluginsmyplugin directory with plugin.xml
  file
then
 too I get the same exception of class not found.

 I have not made any changes in srcpluginbuild-plugin.xml.

 Could you please guide me that what is I am doing wrong here ?

 Thanks,
 Tony

   
   
  
 





Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-13 Thread Tejas Patil
I don't see any attachments with the mail.

Anyway, you need to:
1. remove all your changes from conf/nutch-default.xml. Make it in sync
with svn (rm conf/nutch-default.xml && svn up conf/nutch-default.xml)
2. In conf/nutch-site.xml, remove the entry for plugin.folders
3. run ant clean runtime

Now try again.


On Thu, Jun 13, 2013 at 1:39 AM, Tony Mullins tonymullins...@gmail.comwrote:

 Hi Tejas,

 Thanks for pointing out the problem. I have changed the package to
 kaqqao.nutch.selector and have also modified the package in java source
 files as package kaqqao.nutch.selector;

 But I am still getting the ClassNotFound exception... please see attached
 images !!!

 Please note that I am using fresh Nutch 2.2 source without additional
 patch ... do I need to apply any patch to run this ?

 Thanks,
 Tony.



 On Thu, Jun 13, 2013 at 1:16 PM, Tejas Patil tejas.patil...@gmail.comwrote:

 The package structure you actually have is:
 *kaqqao.nutch.plugin.selector;*

 In src/plugin/element-selector/plugin.xml you have defined it as:

extension id=*kaqqao.nutch.selector*.HtmlElementSelectorIndexer
   name=Nutch Blacklist and Whitelist Indexing Filter
   point=org.apache.nutch.indexer.IndexingFilter
   implementation id=HtmlElementSelectorIndexer
   class=*kaqqao.nutch.selector*
 .HtmlElementSelectorIndexer/
/extension

 It aint the same and thats why it cannot load that class at runtime. Make
 it consistent and try again.
 It worked at my end after changing the package structure to
 kaqqao.nutch.selector


 On Wed, Jun 12, 2013 at 11:45 PM, Tony Mullins tonymullins...@gmail.com
 wrote:

  Hi Tejas,
 
  I am following this example
  https://github.com/veggen/nutch-element-selector. And now I have tried
  this example without any changes to my  fresh source of Nutch 2.2.
 
  Attached is my patch ( change set) on fresh Nutch 2.2 source.
  Kindly review it and please let me know if I am missing something.
 
  Thanks,
  Tonny
 
 
  On Thu, Jun 13, 2013 at 11:19 AM, Tejas Patil tejas.patil...@gmail.com
 wrote:
 
  Weird. I would like to have a quick peek into your changes. Maybe you
 are
  doing something wrong which is hard to predict and figure out by asking
  bunch of questions to you over email. Can you attach a patch file of
 your
  changes ? Please remove the fluff from it and only keep the bare
 essential
  things in the patch. Also, if you are working for some company, make
 sure
  that you attaching some code here should not be against your
  organisational
  policy.
 
  Thanks,
  Tejas
 
  On Wed, Jun 12, 2013 at 11:03 PM, Tony Mullins 
 tonymullins...@gmail.com
  wrote:
 
   I have done this all. Created my plugin's ivy.xml , plugin.xml ,
  build,xml
   . Added the entry in nutch-site.xml and srcpluginbuild.xml.
   But I am still getting PluginRuntimeException:
   java.lang.ClassNotFoundException
  
  
   Is there any other configuration that I am missing or its Nutch 2.2
  issues
   ?
  
   Thanks,
   Tony.
  
  
   On Thu, Jun 13, 2013 at 1:09 AM, Tejas Patil 
 tejas.patil...@gmail.com
   wrote:
  
Here is the relevant wiki page:
http://wiki.apache.org/nutch/WritingPluginExample
   
Although its old, I think that it will help.
   
   
On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel 
wastl.na...@googlemail.com
 wrote:
   
 Hi Tony,

 you have to register your plugin in
  src/plugin/build.xml

 Does your
  src/plugin/myplugin/plugin.xml
 properly propagate jar file,
 extension point and implementing class?

 And, finally, you have to add your plugin
 to the property plugin.includes in nutch-site.xml

 Cheers,
 Sebastian

 On 06/12/2013 07:48 PM, Tony Mullins wrote:
  Hi,
 
  I am trying simple ParseFilter plugin in Nutch 2.2. And I can
  build
   it
 and
  also the srcpluginbuild.xml successfully. But its .jar file
 is
  not
 being
  created in my runtimelocalpluginsmyplugin directory.
 
  And on running
  bin/nutch parsechecker http://www.google.nl;
   I get this error  java.lang.RuntimeException:
  org.apache.nutch.plugin.PluginRuntimeException:
  java.lang.ClassNotFoundException:
  com.xyz.nutch.selector.HtmlElementSelectorFilter
 
  If I go to MyNutch2.2Source/build/myplugin , I can see plugin's
  jar
with
  test  classes directory created there. If I copy .jar  from
 here
  and
 paste
  it to my runtimelocalpluginsmyplugin directory with
 plugin.xml
   file
 then
  too I get the same exception of class not found.
 
  I have not made any changes in srcpluginbuild-plugin.xml.
 
  Could you please guide me that what is I am doing wrong here ?
 
  Thanks,
  Tony
 


   
  
 
 
 





Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-13 Thread Tejas Patil
I can't see the image that you attached.

Anyway, if you are running via the command line (i.e. runtime/local): set
plugin.folders to plugins in
NUTCH_HOME/runtime/local/conf/nutch-site.xml. For running from
Eclipse, set plugin.folders to the absolute path of the directory where the
plugins are generated (i.e. NUTCH_HOME/build/plugins) in
NUTCH_HOME/conf/nutch-site.xml.
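
Expressed as properties, that is roughly the following (the Eclipse path is of
course machine-specific):

    <!-- runtime/local/conf/nutch-site.xml, when running from the command line -->
    <property>
      <name>plugin.folders</name>
      <value>plugins</value>
    </property>

    <!-- conf/nutch-site.xml, when running from Eclipse -->
    <property>
      <name>plugin.folders</name>
      <value>/path/to/nutch/build/plugins</value>
    </property>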

On Thu, Jun 13, 2013 at 5:38 AM, Tony Mullins tonymullins...@gmail.comwrote:

 Tejas,

 I can now successfully run the plugin from terminal like bin/nutch
 parsechecker http://www.google.nl

 But if I try to run my code directly from eclipse , with main class as
 'org.apache.nutch.parse.ParserChecker' and program arguments as '
 http://www.google.nl' it fails with same exception of ClassNotFound.

 Please see the attached image.

 [image: Inline image 1]


 I have tried 'ant clean' in my Nutch2.2 source...  but same error !!!

 Could you please help me fixing this issue.

 Thanks,
 Tony




 On Thu, Jun 13, 2013 at 2:23 PM, Tony Mullins tonymullins...@gmail.comwrote:

 Thank you very much Tejas. It worked. :)

 Just wondering why did you ask me to remove the 'plugin.folders' from
 conf/nutch-site.xml ?
 And the problem was due to bad cache/runtime build ?

 Thank you again !!!
 Tony.



 On Thu, Jun 13, 2013 at 1:47 PM, Tejas Patil tejas.patil...@gmail.comwrote:

 I don't see any attachments with the mail.

 Anyways, you need to:
 1. remove all your changes from conf/nutch-default.xml. Make it in sync
 with svn. (rm conf/nutch-default.xml  svn up conf/nutch-default.xml)
 2. In conf/nutch-site.xml, remove the entry for plugin.folders
 3. run ant clean runtime

 Now try again.


 On Thu, Jun 13, 2013 at 1:39 AM, Tony Mullins tonymullins...@gmail.com
 wrote:

  Hi Tejas,
 
  Thanks for pointing out the problem. I have changed the package to
  kaqqao.nutch.selector and have also modified the package in java source
  files as package kaqqao.nutch.selector;
 
  But I am still getting the ClassNotFound exception... please see
 attached
  images !!!
 
  Please note that I am using fresh Nutch 2.2 source without additional
  patch ... do I need to apply any patch to run this ?
 
  Thanks,
  Tony.
 
 
 
  On Thu, Jun 13, 2013 at 1:16 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:
 
  The package structure you actually have is:
  *kaqqao.nutch.plugin.selector;*
 
  In src/plugin/element-selector/plugin.xml you have defined it as:
 
 extension id=*kaqqao.nutch.selector*.HtmlElementSelectorIndexer
name=Nutch Blacklist and Whitelist Indexing Filter
point=org.apache.nutch.indexer.IndexingFilter
implementation id=HtmlElementSelectorIndexer
class=*kaqqao.nutch.selector*
  .HtmlElementSelectorIndexer/
 /extension
 
  It aint the same and thats why it cannot load that class at runtime.
 Make
  it consistent and try again.
  It worked at my end after changing the package structure to
  kaqqao.nutch.selector
 
 
  On Wed, Jun 12, 2013 at 11:45 PM, Tony Mullins 
 tonymullins...@gmail.com
  wrote:
 
   Hi Tejas,
  
   I am following this example
   https://github.com/veggen/nutch-element-selector. And now I have
 tried
   this example without any changes to my  fresh source of Nutch 2.2.
  
   Attached is my patch ( change set) on fresh Nutch 2.2 source.
   Kindly review it and please let me know if I am missing something.
  
   Thanks,
   Tonny
  
  
   On Thu, Jun 13, 2013 at 11:19 AM, Tejas Patil 
 tejas.patil...@gmail.com
  wrote:
  
   Weird. I would like to have a quick peek into your changes. Maybe
 you
  are
   doing something wrong which is hard to predict and figure out by
 asking
   bunch of questions to you over email. Can you attach a patch file
 of
  your
   changes ? Please remove the fluff from it and only keep the bare
  essential
   things in the patch. Also, if you are working for some company,
 make
  sure
   that you attaching some code here should not be against your
   organisational
   policy.
  
   Thanks,
   Tejas
  
   On Wed, Jun 12, 2013 at 11:03 PM, Tony Mullins 
  tonymullins...@gmail.com
   wrote:
  
I have done this all. Created my plugin's ivy.xml , plugin.xml ,
   build,xml
. Added the entry in nutch-site.xml and srcpluginbuild.xml.
But I am still getting PluginRuntimeException:
java.lang.ClassNotFoundException
   
   
Is there any other configuration that I am missing or its Nutch
 2.2
   issues
?
   
Thanks,
Tony.
   
   
On Thu, Jun 13, 2013 at 1:09 AM, Tejas Patil 
  tejas.patil...@gmail.com
wrote:
   
 Here is the relevant wiki page:
 http://wiki.apache.org/nutch/WritingPluginExample

 Although its old, I think that it will help.


 On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel 
 wastl.na...@googlemail.com
  wrote:

  Hi Tony,
 
  you have to register your plugin in
   src/plugin/build.xml
 
  Does your
   src/plugin/myplugin/plugin.xml

Re: Nutch 1.6 on CDH4.2.1

2013-06-13 Thread Tejas Patil
 How should I define the Hue job so that it recognizes Nutch's .job jar
file and/or make the CDH4 Hue consistent with the hadoop/hdfs shell
commands?
Could you try posting to the Hue and CDH4 user groups? We don't promise
compatibility across the several Hadoop distributions out there.
See https://issues.apache.org/jira/browse/NUTCH-1447

On Thu, Jun 13, 2013 at 7:39 AM, Byte Array byte.arra...@gmail.com wrote:

 Hello!

 I am trying to run a simple crawl with Nutch 1.6 on CDH4.2.1 on Centos 6.2
 cluster.

 First I had problems with
 # hadoop jar apache-nutch-1.6.job org.apache.nutch.fetcher.Fetcher
 /nutch/1.6/crawl/segments/20130613095319
 which was returning:
  java.lang.RuntimeException: problem advancing post rec#0
 at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1183)
 at

 org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:255)
 at

 org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:251)
 at

 org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:40)
 at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:506)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:447)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:447)
 Caused by: java.io.IOException: can't find class:
 org.apache.nutch.protocol.ProtocolStatus because
 org.apache.nutch.protocol.ProtocolStatus
 at

 org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:206)
 . . .
 Also, I noticed inconsistency between the file system shown with hdfs dfs
 -ls and the one shown in CDH4 Hue GUI. The former seems to simply create
 the folders/files locally and is not aware of the ones I create through Hue
 GUI.
 Therefore, I suspected that the job is not properly running on the CDH4
 cluster and used Hue GUI to create /user/admin/Nutch-1.6 folder and
 urls/seed.txt and upload the Nutch 1.6 .job file (previously configured and
 built with ant in Eclipse).
 When I submit the job through Hue it logs ClassNotFoundException, although
 I properly defined path to the .job file on the hdfs and the class name in
 that file:
 ...
 Failing Oozie Launcher, Main class [org.apache.nutch.crawl.Injector],
 exception invoking main(), java.lang.ClassNotFoundException: Class
 org.apache.nutch.crawl.Injector not found
 java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
 org.apache.nutch.crawl.Injector not found
 ...
 How should I define the Hue job so that it recognizes Nutch's .job jar file
 and/or make the CDH4 Hue consistent with the hadoop/hdfs shell commands?
 This thread looks related:
 http://www.mail-archive.com/user@nutch.apache.org/msg07603.html


 Thank you



Re: what is stored in the hbase after inject job

2013-06-13 Thread Tejas Patil
row-key = url

column=f:fi  : fetchInterval (the delay between re-fetches of a
page)
column=f:ts  : fetchTime (indicates when the url will be eligible
for fetching)
column=mk:_injmrk_   : markers
column=mk:dist
column=mtdt:_csh_: metadata
column=s:s   : status (is the url fetched, unfetched, newly
injected, gone, redirected etc..)
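
As a rough, untested sketch, the same row can be read back programmatically
through Gora (error handling omitted; the URL is the one from your seed):

    import org.apache.gora.store.DataStore;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.storage.StorageUtils;
    import org.apache.nutch.storage.WebPage;
    import org.apache.nutch.util.NutchConfiguration;
    import org.apache.nutch.util.TableUtil;

    public class DumpOneRow {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        DataStore<String, WebPage> store =
            StorageUtils.createWebStore(conf, String.class, WebPage.class);
        // row keys are reversed URLs, e.g. com.xinhuanet.www:http/
        String key = TableUtil.reverseUrl("http://www.xinhuanet.com/");
        WebPage page = store.get(key);
        if (page != null) {
          System.out.println("status=" + page.getStatus()
              + " fetchTime=" + page.getFetchTime()
              + " fetchInterval=" + page.getFetchInterval());
        }
        store.close();
      }
    }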



On Thu, Jun 13, 2013 at 6:40 AM, RS tinyshr...@163.com wrote:

 I do not what is sotred in the hbase after inject a website.
 When I use the hbase shell  $ scan 'webpage'  , there are :
 hbase(main):028:0 scan '1_webpage'
 ROW  COLUMN+CELL
  com.xinhuanet.www:http/ column=f:fi, timestamp=1371110099941,
 value=\x00'\x8D\x00
  com.xinhuanet.www:http/ column=f:ts, timestamp=1371110099941,
 value=\x00\x00\x01?\x87\xBA\x0A
  com.xinhuanet.www:http/ column=mk:_injmrk_,
 timestamp=1371110099941, value=y
  com.xinhuanet.www:http/ column=mk:dist,
 timestamp=1371110099941, value=0
  com.xinhuanet.www:http/ column=mtdt:_csh_,
 timestamp=1371110099941, value=?\x80\x00\x00
  com.xinhuanet.www:http/ column=s:s, timestamp=1371110099941,
 value=?\x80\x00\x00
 1 row(s) in 0.0300 seconds


 So, is only 6 column are setted in the hbase ? And what is the real data
 stored in it?
 I find that in the source code, there is a WebPage Class.  I could not
 understand all, but I think there should be 24 fileds in the hbase for each
 webside.
   public static final String[] _ALL_FIELDS =
 {baseUrl,status,fetchTime,prevFetchTime,fetchInterval,retriesSinceFetch,modifiedTime,prevModifiedTime,protocolStatus,content,contentType,prevSignature,signature,title,text,parseStatus,score,reprUrl,headers,outlinks,inlinks,markers,metadata,batchId,};


 Thanks
 HeChuan




Re: Installing Nutch1.6 on Windows7

2013-06-13 Thread Tejas Patil
On Thu, Jun 13, 2013 at 4:50 AM, Andrea Lanzoni a.lanz...@alice.it wrote:

 Hi Lewis and thanks for your reply. I will try to be as much detailed as
 possible to allow you to understand the shortcomings I incurred during
 installation of Nutch and Solr on my PC, which is a 64 bit running Windows
 7.

 I canvassed wiki apache and found this link:

 http://wiki.apache.org/nutch/FabioGiavazzi/HowtoGettingNutchRunningonWindows

 Even though it explains Nutch 1.2 installation on Windows 7 I thought it
 might fit for Nutch 1.6 as well.

 Everything went smooth until step 4 of the presentation. Step 4 puzzled me
 and decided to skip it.

 I went on and made the changes in XML as stated in ensuing steps.

 Problem arouse at Step 7 where it reads: Go to *nutch*-1.2\conf\ and edit
 the file crawl-urlfilter.txt


Those steps are old; skip step #7 and proceed. (In Nutch 1.6, crawl-urlfilter.txt
no longer exists; URL filtering is configured in conf/regex-urlfilter.txt instead.)
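
A typical regex-urlfilter.txt entry that restricts the crawl to one host looks
like this sketch (example.com is a placeholder):

    # accept only pages from example.com, reject everything else
    +^http://([a-z0-9-]+\.)*example\.com/
    -.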


 I didn't find it in my Nutch directories and got into a cul de sac.


PS: I just learned what cul de sac means !! thanks for adding to my vocab
:)


 Presently in my PC I have installed as follows:

 C:\Users\Andrea\Documents\apache-tomcat-7.0.40\apache-tomcat-7.0.40

 C:\cygwin\home\apache-nutch-1.6-bin\apache-nutch-1.6

 C:\cygwin\home\solr-4.2.0\solr-4.2.0


 I am also asking your advice on this:  is it only my laziness/stubborness
 to continue to install them on Windows, in other terms would you suggest to
 drop Windows and install Ubuntu and restart the procedure in the new op.
 system?
 I happened only once to come across Ubuntu, does it permit to host
 simultaneously and use on the same PC Windows and Ubuntu itself?


You can have Ubuntu + Windows installed on the same m/c. See this:
http://www.ubuntu.com/download/desktop/install-ubuntu-with-windows

It's super easy to set up Ubuntu with that, and it's worth spending an hour
or two.


 Thanks for your welcome opinion.
 Andrea




 Il 13/06/2013 01:49, Lewis John Mcgibbney ha scritto:

 Hi Andrea,
 Please describe the problem here. There is an absence of any detail about
 what is wrong here.
 Thanks

 On Wednesday, June 12, 2013, Andrea Lanzoni a.lanz...@alice.it wrote:

 Hi everyone, I am a newcomer to Nutch and Solr and, after studying

 literature available on web, I tried to install them on _Windows 7_.

 I have not been able to match the few instructions on the wikiapache site

 nor I could find a guide updated to Nutch 1.6 but only for older versions

 I tried by following old versions guides on the web but never succeeded

 in the installation, often because of differences from what the guide read
 and what I saw on screen.

 I followed the steps  by installing:
 - Tomcat
 - Java jdk 7
 - Cygwin, Nutch 1.6 and Solr 4

 Everything went apparently smooth and I copied Nutch and Solr in:
 C:\cygwin\home\apache-nutch-1.6-bin
 and
 C:\cygwin\home\solr-4.2.0\solr-4.2.0

 Whilst the two folders: jdk1.7.0_21 and jre7, are within the Java folder

 in Programs directory

 I apologize for my dumbness but I couldn't find how to manage it. If

 somebody has a clear and detailed step by step pattern to follow for
 installing Nutch 1.6 and Solr 4 I would be very grateful.

 Thanks in advance.
 Andrea Lanzoni





Re: Nutch 2.2 - Exception in thread 'main' [org.apache.gora.sql.store.SqlStore]

2013-06-13 Thread Tejas Patil
On Thu, Jun 13, 2013 at 12:41 PM, Weder Carlos Vieira 
weder.vie...@gmail.com wrote:

 Hello everyone!

 This is my first mail here.

Welcome !!


 I want to know more and more about nutch and share what a find out by
 myself with you. Thanks if someone can help me too.

 I was trying to use nutch. First I setup and test nutch 2.1 and its works
 fine, but many of crawled urls was saved on MySQL with null value, just few
 url with status=2. I don't understand that but I go on...

 Next I try to setup and use (test)  Nutch 2.2, in this case when I start
 command to initiate crawl I get this error below.

 Exception in thread main java.lang.ClassNotFoundException:
 org.apache.gora.sql.store.SqlStore
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) .

 The 'SqlStore' class was removed from nutch 2.2? because on Nutch 2.1 this
 error doesn't appear.

Yes. Nutch 2.2 uses Apache Gora 0.3, which has dropped its support for
MySQL as a data store. Follow https://wiki.apache.org/nutch/Nutch2Tutorial



 -

 In the other hand I want to ask a second question, How can I improve
 configuration of Nutch 2.1 (that Works fine) to fetch more and more url
 without 'null values'.

What do you mean by w/o null values ?




 Thanks a lot.

 Weder



Re: Nutch 2.2 - Exception in thread 'main' [org.apache.gora.sql.store.SqlStore]

2013-06-13 Thread Tejas Patil
On Thu, Jun 13, 2013 at 1:04 PM, Weder Carlos Vieira weder.vie...@gmail.com
 wrote:

 After Nutch 2.2 with gora 0.3 mysql will not be more supported?

Nutch 2.2, which uses Gora 0.3 by default, won't support MySQL. There might be
a possibility of making it work by tweaking dependencies, but I have
never tried it. See lines 102-112 in ivy/ivy.xml.


 I want mean that crawl doesn't parsing many urls and I don't know why.

Can you share a few of those URLs?



 Weder


 On Thu, Jun 13, 2013 at 4:49 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  On Thu, Jun 13, 2013 at 12:41 PM, Weder Carlos Vieira 
  weder.vie...@gmail.com wrote:
 
   Hello everyone!
  
   This is my first mail here.
  
  Welcome !!
 
  
   I want to know more and more about nutch and share what a find out by
   myself with you. Thanks if someone can help me too.
  
   I was trying to use nutch. First I setup and test nutch 2.1 and its
 works
   fine, but many of crawled urls was saved on MySQL with null value, just
  few
   url with status=2. I don't understand that but I go on...
  
   Next I try to setup and use (test)  Nutch 2.2, in this case when I
 start
   command to initiate crawl I get this error below.
  
   Exception in thread main java.lang.ClassNotFoundException:
   org.apache.gora.sql.store.SqlStore
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366) .
  
   The 'SqlStore' class was removed from nutch 2.2? because on Nutch 2.1
  this
   error doesn't appear.
  
  Yes. Nutch 2.2 uses Apache Gora 0.3 which has deprecated their support
 for
  mySQL as a data store. Follow
 https://wiki.apache.org/nutch/Nutch2Tutorial
 
 
  
   -
  
   In the other hand I want to ask a second question, How can I improve
   configuration of Nutch 2.1 (that Works fine) to fetch more and more url
   without 'null values'.
  
  What do you mean by w/o null values ?
 
  
  
  
   Thanks a lot.
  
   Weder
  
 



Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-12 Thread Tejas Patil
Here is the relevant wiki page:
http://wiki.apache.org/nutch/WritingPluginExample

Although it's old, I think it will help.


On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel wastl.na...@googlemail.com
 wrote:

 Hi Tony,

 you have to register your plugin in
  src/plugin/build.xml

 Does your
  src/plugin/myplugin/plugin.xml
 properly propagate jar file,
 extension point and implementing class?

 And, finally, you have to add your plugin
 to the property plugin.includes in nutch-site.xml

 Cheers,
 Sebastian

 On 06/12/2013 07:48 PM, Tony Mullins wrote:
  Hi,
 
  I am trying simple ParseFilter plugin in Nutch 2.2. And I can build it
 and
  also the srcpluginbuild.xml successfully. But its .jar file is not
 being
  created in my runtimelocalpluginsmyplugin directory.
 
  And on running
  bin/nutch parsechecker http://www.google.nl;
   I get this error  java.lang.RuntimeException:
  org.apache.nutch.plugin.PluginRuntimeException:
  java.lang.ClassNotFoundException:
  com.xyz.nutch.selector.HtmlElementSelectorFilter
 
  If I go to MyNutch2.2Source/build/myplugin , I can see plugin's jar with
  test  classes directory created there. If I copy .jar  from here and
 paste
  it to my runtimelocalpluginsmyplugin directory with plugin.xml file
 then
  too I get the same exception of class not found.
 
  I have not made any changes in srcpluginbuild-plugin.xml.
 
  Could you please guide me that what is I am doing wrong here ?
 
  Thanks,
  Tony
 




Re: Nutch Compilation Error with Eclipse

2013-06-11 Thread Tejas Patil
If you want to find out the Java class corresponding to any command, just
peek inside the src/bin/nutch script; at the bottom you will find a
switch-case with a case corresponding to each command. For 2.x, here are
the important classes:

inject - org.apache.nutch.crawl.InjectorJob
generate - org.apache.nutch.crawl.GeneratorJob
fetch - org.apache.nutch.fetcher.FetcherJob
parse - org.apache.nutch.parse.ParserJob
updatedb - org.apache.nutch.crawl.DbUpdaterJob

Create a separate launcher for each of these. Running them without any input
parameters will show you the usage of these commands.
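
For reference, a launcher for the inject step boils down to something like this
rough sketch (equivalent to running "bin/nutch inject urls"; the seed directory
name is a placeholder):

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.nutch.crawl.InjectorJob;
    import org.apache.nutch.util.NutchConfiguration;

    public class InjectLauncher {
      public static void main(String[] args) throws Exception {
        // run the 2.x InjectorJob exactly as bin/nutch would
        int res = ToolRunner.run(NutchConfiguration.create(),
                                 new InjectorJob(), new String[] { "urls" });
        System.exit(res);
      }
    }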


Re: Issues on Compiling Nutch 2.x with Eclipse

2013-06-10 Thread Tejas Patil
Hi Tony,

That tutorial is based on an earlier Nutch version. Please follow
http://wiki.apache.org/nutch/RunNutchInEclipse#Checkout_Nutch_in_Eclipse.
There have been recent changes to that wiki page, and the new steps
take care of getting automation.jar and the other dependencies in place.


On Sun, Jun 9, 2013 at 11:58 PM, Tony Mullins tonymullins...@gmail.comwrote:

 Hi ,

 The last try I made was with this tutorial '
 https://sites.google.com/site/profilerajanimaski/webcrawlers/run-nutch-in-eclipse'
 ,
 after following word to word ( which didn't work for me) then I made some
 modifications to it as for step 11 I added  'bin' , 'gora' , 'java' ,'test'
 , 'testprocess' , 'testresources' . And for step 14 I couldn't find
 'src/plugin/url-filter-automation/lib/automation.jar' in my source.

 And when I try to run main 'Crawler' project it says there are errors and
 give me option to proceed with errors and when I proceed with errors  I am
 getting this error:

 InjectorJob: Using class org.apache.gora.memory.store.MemStore as the
 Gora storage class.
 InjectorJob: total number of urls rejected by filters: 0
 InjectorJob: total number of urls injected after normalization and
 filtering: 0
 Exception in thread main java.lang.RuntimeException: job failed:
 name=generate: null, jobid=job_local_0002...
 .
 

 So please help me what I am doing wrong here or guide me to a tutorial
 which works
 If the latest Nutch 2.2 source doesn't work with these tutorials then
 which version of 2.x will work and how ?

 Thanks.
 Tony


 On Mon, Jun 10, 2013 at 7:20 AM, Tejas Patil tejas.patil...@gmail.comwrote:

 Could you try closing and re-opening the eclipse and then let eclipse
 rebuild workspace. BTW: On which packages / classes do you see red dots ?


 On Sun, Jun 9, 2013 at 9:23 AM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

  Hi Tony,
  This source has literally just been released. The tutorial on the Nutch
  wiki has also just been updated but you need to follow it closely and
 pay
  attention to each step. It sounds like the red dots problem your having
 is
  explained in the 2nd to last bullet point below
 
 
 http://wiki.apache.org/nutch/RunNutchInEclipse#Checkout_Nutch_in_Eclipse
 
  Also, you've not actually said what went wrong!
  Lewis
 
 
  On Sunday, June 9, 2013, Tony Mullins tonymullins...@gmail.com wrote:
   Hi,
  
   I am new to Nutch. I am trying to use Nutch with Cassandra and have
   successfully build the Nutch 2.x (
   http://svn.apache.org/repos/asf/nutch/branches/2.x/).
  
   But I get errors ( different errors after following different
 tutorials)
   when I try to run it directly from Eclipse ( I am on CentOS 6.4) , I
 have
   tried to follow these tutorials to run Nutch source from Eclipse but
 no
  use.
  
   http://wiki.apache.org/nutch/RunNutchInEclipse
   run nutch in eclipse | profilerajanimaski
  
 http://jarpit83.blogspot.com/2012/07/configuring-nutch-in-eclipse.html
   http://techvineyard.blogspot.com/2010/12/build-nutch-20.html
  
   Whatever I do,  I get red * on my source and it doesn't get run by
   Eclipse , but it always get build successfully using Ant.
  
   Plaaase help me here, could any one please guide me to single web
   tutorial which actually could help me compile and run latest Nutch 2.x
  with
   Eclipse (Juno) on CentOS.
  
   Thanksss.
   Tony.
  
 
  --
  *Lewis*
 





Re: Issues on Compiling Nutch 2.x with Eclipse

2013-06-10 Thread Tejas Patil
I have created a Google doc [0] with several snapshots describing how to
set up Nutch 2.x + Eclipse. It is different from the one on the wiki
page and tailored for Nutch 2.x. Please try it out and let us know if you
still have issues with it. Based on your comments, I will add the same to
the Nutch wiki.

[0] :
https://docs.google.com/document/d/1qvJwrZ9Sc0NAF9p3ie4uV7JsfCHxnrh9QF19HINw48c/edit?usp=sharing


On Mon, Jun 10, 2013 at 11:32 AM, Tejas Patil tejas.patil...@gmail.comwrote:

 yes.

 - Close the project in Eclipse. Right-click on the project, click on
   Properties and get the location of the project.
 - Go to that location in a terminal.
 - Run 'ant eclipse'. (Note that you need to have Apache Ant
   (http://ant.apache.org/manual/index.html) installed and configured.)

 After going command line, you might as well do this:
 Specify the GORA backend in nutch-site.xml, uncomment its dependency in
 ivy/ivy.xml and ensure that the store you selected is set as the default
 datastore in gora.properties.
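
 For the Cassandra backend, for example, that boils down to roughly the
 following (localhost:9160 is the usual default; adjust to your setup):

    <!-- conf/nutch-site.xml -->
    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.cassandra.store.CassandraStore</value>
    </property>

    # conf/gora.properties
    gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
    gora.cassandrastore.servers=localhost:9160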


 On Mon, Jun 10, 2013 at 11:21 AM, Tony Mullins 
 tonymullins...@gmail.comwrote:

 Hi,

 So the latest Nutch2.x includes the Teja's Patch (
 https://issues.apache.org/jira/browse/NUTCH-1577) , means if I have
 latest
 source then it already has that patch.

 Now can some one please help me here what is meant by the 2nd last step
 'Run 'ant eclipse'  on http://wiki.apache.org/nutch/RunNutchInEclipse.

 Do I need to go to the location where source is and give ant command 'ant
 -f build.xml' , or its something else ???
 And after refreshing the source, Eclipse would let compile and run my
 code ?

 Thanks,
 Tony


 On Mon, Jun 10, 2013 at 6:56 PM, Tony Mullins tonymullins...@gmail.com
 wrote:

  Hi Lewis,
 
  I understand this, that there may be something wrong on my end. And as I
  said I get different errors on running Nutch 2.x with Eclipse, after
  following different tutorials.
 
  My background is in .NET and I might will just move to JAVA , just
 because
  of this project (Nutch). But at the moment I am having difficult time
  understanding the 'setup/configuration' required to run Nutch in
 Eclipse.
 
  When you say '...*you may find it convenient to patch
 
  your dist with Tejas' Eclipse ant target and simply run 'ant eclipse'
 from
  within your terminal prior to doing a file, import, existing projects
 in to
  workspace from within Eclipse..*.'
 
  which patch do I need to get and how to apply it ?
  And by running 'ant eclipse' , do you mean dropping build.xml to Ant
  window in Eclipse , OR building the Nutch source by using the ant -f
  build.xml command in terminal ?  ( by the way I have done both and both
  successfully builds the source , but eclipse doesn't run the source).
 
  So could you please guide me here in more details, I would be really
  grateful to you and Nutch community.
 
  Thanks,
  Tony.
 
 
  On Mon, Jun 10, 2013 at 6:38 PM, Lewis John Mcgibbney 
  lewis.mcgibb...@gmail.com wrote:
 
  Hi Tony,
  These issues stem from your environment not being correct.
  I, as many other, have been able to DEBUG and  develop Nutch 1.7 and
 2.x
  series from within Eclipse.
  As you are working with 2.x source, you may find it convenient to patch
  your dist with Tejas' Eclipse ant target and simply run 'ant eclipse'
 from
  within your terminal prior to doing a file, import, existing projects
 in
  to
  workspace from within Eclipse.
  I can guarantee you, the reason the tutorial is on the Nutch wiki is
  because as some stage, someone (many many people), somewhere have
 found it
  useful for developing Nutch in Eclipse. I don't want to sound like a
  baloon
  here, but your java security exceptions are not a problem with Nutch...
  it's your environment.
  hth
 
  On Monday, June 10, 2013, Tony Mullins tonymullins...@gmail.com
 wrote:
   Hi ,
   Ok now I have followed this tutorial word by word.
 
 http://wiki.apache.org/nutch/RunNutchInEclipse#Checkout_Nutch_in_Eclipse.
  
   After getting new source 2.2 , I have build it using Ant - which was
  successful then set the configurations and comment the 'hsqldb'
 dependency
  and uncomment the cassandra dependency ( as I want to run it against
  cassandra). After doing this all when I run the code from eclipse I get
  error
   Exception in thread main java.lang.SecurityException: Prohibited
  package name: java.org.apache.nutch.crawl
   at java.lang.ClassLoader.preDefineClass(ClassLoader.java:649)
   at java.lang.ClassLoader.defineClass(ClassLoader.java:785)
   at
 
 
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  
   and have red '*' all over my code. Please see the attached image.
  
   Now what I do ?
   Please any one could tell me that is it even possible to
  compile/run/debug latest Nutch 2.x branch from Eclipse ?
  
   I need help here...
  
   Tony !!!
  
   On Mon, Jun 10, 2013 at 12:15 PM, Tejas Patil 
 tejas.patil...@gmail.com
  
  wrote:
  
   Hi Tony

Re: Nutch Compilation Error with Eclipse

2013-06-10 Thread Tejas Patil
I have created a Google doc [0] with several snapshots describing how to
set up Nutch 2.x + Eclipse. It is different from the one on the wiki
page and tailored for Nutch 2.x. Please try it out and let us know if you
still have issues with it. Based on your comments, I will add the same to
the Nutch wiki.

[0] :
https://docs.google.com/document/d/1qvJwrZ9Sc0NAF9p3ie4uV7JsfCHxnrh9QF19HINw48c/edit?usp=sharing


On Mon, Jun 10, 2013 at 6:23 AM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi,
 It is (IMHO) kind of fruitless running the crawl class (which is deprecated
 now and we highly suggest you use and amend the /src/bin/crawl script for
 your usecase) within Eclipse. You will learn far more setting breakpoints
 within individual classes and watching them execute on that basis. I notice
 you've not provided an URL directory to the crawl argument anyway so you
 will need to  sort this one out.
 Best
 Lewis

 On Monday, June 10, 2013, Jamshaid Ashraf jamshaid...@gmail.com wrote:
  I'm performing following tasks:
 
  Commands in Arguments tab:
 
  Program Arguments=urls -dir crawl -depth 3 -topN 50
 
  VM Arguments:-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
 
  And then just running the code.
 
  Regards,
  Jamshaid
 
 
  On Mon, Jun 10, 2013 at 4:54 PM, Sznajder ForMailingList 
  bs4mailingl...@gmail.com wrote:
 
  Hi
 
  Which task do you try to launch?
 
  Benjamin
 
 
  On Mon, Jun 10, 2013 at 1:57 PM, Jamshaid Ashraf jamshaid...@gmail.com
  wrote:
 
   Hi,
  
    I am new to Nutch. I am trying to use Nutch with Cassandra and have
    successfully built Nutch 2.x, but it shows the following error when I run it
    from the latest Eclipse.
  
  
   java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
   at
  org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
   at
 org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
   at
  
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260).
  
    I would be grateful for any help someone can provide.
  
  
   Thanks.
  
 
 

 --
 *Lewis*



Re: Nutch Compilation Error with Eclipse

2013-06-10 Thread Tejas Patil
Hi Jamshaid,
The simplified steps with snapshots are now added to the Nutch wiki [0]. It
would be helpful if you could try those out and let us know about any
improvements or corrections you think are needed.

PS: A few images look shrunken. I will fix that soon.

[0] : https://wiki.apache.org/nutch/RunNutchInEclipse


On Mon, Jun 10, 2013 at 2:58 PM, Tejas Patil tejas.patil...@gmail.comwrote:

 I have created a Google doc [0] with several snapshots describing how to
 set up Nutch 2.x + Eclipse. This is different from the one on the wiki
 page and tailored for Nutch 2.x. Please try it out and let us know if you
 still have issues with it. Based on your comments, I will add the same
 to the Nutch wiki.

 [0] :
 https://docs.google.com/document/d/1qvJwrZ9Sc0NAF9p3ie4uV7JsfCHxnrh9QF19HINw48c/edit?usp=sharing


 On Mon, Jun 10, 2013 at 6:23 AM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

 Hi,
 It is (IMHO) kind of fruitless to run the crawl class (which is deprecated
 now; we highly suggest you use and amend the /src/bin/crawl script for
 your use case) within Eclipse. You will learn far more by setting breakpoints
 within individual classes and watching them execute on that basis. I notice
 you've not provided a URL directory to the crawl argument anyway, so you
 will need to sort that out.
 Best
 Lewis

 On Monday, June 10, 2013, Jamshaid Ashraf jamshaid...@gmail.com wrote:
  I'm performing following tasks:
 
  Commands in Arguments tab:
 
  Program Arguments=urls -dir crawl -depth 3 -topN 50
 
  VM Arguments:-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
 
  And then just running the code.
 
  Regards,
  Jamshaid
 
 
  On Mon, Jun 10, 2013 at 4:54 PM, Sznajder ForMailingList 
  bs4mailingl...@gmail.com wrote:
 
  Hi
 
  Which task do you try to launch?
 
  Benjamin
 
 
  On Mon, Jun 10, 2013 at 1:57 PM, Jamshaid Ashraf 
 jamshaid...@gmail.com
  wrote:
 
   Hi,
  
    I am new to Nutch. I am trying to use Nutch with Cassandra and have
    successfully built Nutch 2.x, but it shows the following error when I run it
    from the latest Eclipse.
  
  
   java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
   at
 
 org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
   at
 org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
   at
  
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260).
  
    I would be grateful for any help someone can provide.
  
  
   Thanks.
  
 
 

 --
 *Lewis*





Re: Issues on Compiling Nutch 2.x with Eclipse

2013-06-10 Thread Tejas Patil
Hi Tony,

The simplified steps with snapshots are now added to the Nutch wiki [0]. It
would be helpful if you could try those out and let us know about any
improvements or corrections you think are needed.

PS: A few images look shrunken. I will fix that soon.

[0] : https://wiki.apache.org/nutch/RunNutchInEclipse


On Mon, Jun 10, 2013 at 2:57 PM, Tejas Patil tejas.patil...@gmail.comwrote:

 I have created a Google doc [0] with several snapshots describing how to
 set up Nutch 2.x + Eclipse. This is different from the one on the wiki
 page and tailored for Nutch 2.x. Please try it out and let us know if you
 still have issues with it. Based on your comments, I will add the same
 to the Nutch wiki.

 [0] :
 https://docs.google.com/document/d/1qvJwrZ9Sc0NAF9p3ie4uV7JsfCHxnrh9QF19HINw48c/edit?usp=sharing


 On Mon, Jun 10, 2013 at 11:32 AM, Tejas Patil tejas.patil...@gmail.comwrote:

 yes.

- Close the project in Eclipse. Right-click on the project, click on
Properties and get the location of the project.
- Go to that location in a terminal.
- Run 'ant eclipse'. (Note that you need to have Apache Ant
(http://ant.apache.org/manual/index.html) installed and configured.)

  While you are at the command line, you might as well do this:
  specify the GORA backend in nutch-site.xml, uncomment its dependency in
  ivy/ivy.xml and ensure that the store you selected is set as the default
  datastore in gora.properties.
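 
  A rough sketch of what that can look like for a Cassandra setup (the property
  and class names below are the usual Nutch 2.x / Gora ones; double-check them
  against your checkout before relying on them):
 
  conf/nutch-site.xml:
    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.cassandra.store.CassandraStore</value>
    </property>
 
  ivy/ivy.xml: uncomment the gora-cassandra dependency line.
 
  conf/gora.properties:
    gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore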


 On Mon, Jun 10, 2013 at 11:21 AM, Tony Mullins 
 tonymullins...@gmail.comwrote:

 Hi,

  So the latest Nutch 2.x includes Tejas' patch (
  https://issues.apache.org/jira/browse/NUTCH-1577), which means that if I have
  the latest source then it already has that patch.
 
  Now can someone please help me with what is meant by the second-to-last step,
  'Run ant eclipse', on http://wiki.apache.org/nutch/RunNutchInEclipse?
 
  Do I need to go to the location where the source is and run the ant command
  'ant -f build.xml', or is it something else?
  And after refreshing the source, will Eclipse then let me compile and run my
  code?

 Thanks,
 Tony


 On Mon, Jun 10, 2013 at 6:56 PM, Tony Mullins tonymullins...@gmail.com
 wrote:

  Hi Lewis,
 
   I understand that there may be something wrong on my end. And as I said,
   I get different errors when running Nutch 2.x with Eclipse after following
   different tutorials.
  
   My background is in .NET and I might just move to Java because of this
   project (Nutch). But at the moment I am having a difficult time
   understanding the setup/configuration required to run Nutch in Eclipse.
 
  When you say '...*you may find it convenient to patch
 
  your dist with Tejas' Eclipse ant target and simply run 'ant eclipse'
 from
  within your terminal prior to doing a file, import, existing projects
 in to
  workspace from within Eclipse..*.'
 
   Which patch do I need to get and how do I apply it?
   And by running 'ant eclipse', do you mean dropping build.xml into the Ant
   window in Eclipse, OR building the Nutch source using the 'ant -f build.xml'
   command in a terminal? (By the way, I have done both and both successfully
   build the source, but Eclipse doesn't run it.)
 
   So could you please guide me here in more detail? I would be really
   grateful to you and the Nutch community.
 
  Thanks,
  Tony.
 
 
  On Mon, Jun 10, 2013 at 6:38 PM, Lewis John Mcgibbney 
  lewis.mcgibb...@gmail.com wrote:
 
  Hi Tony,
  These issues stem from your environment not being correct.
   I, as many others, have been able to DEBUG and develop the Nutch 1.7 and 2.x
   series from within Eclipse.
  As you are working with 2.x source, you may find it convenient to
 patch
  your dist with Tejas' Eclipse ant target and simply run 'ant eclipse'
 from
  within your terminal prior to doing a file, import, existing projects
 in
  to
  workspace from within Eclipse.
   I can guarantee you, the reason the tutorial is on the Nutch wiki is
   because at some stage, someone (many, many people), somewhere has found it
   useful for developing Nutch in Eclipse. I don't want to sound like a balloon
   here, but your Java security exceptions are not a problem with Nutch...
   it's your environment.
  hth
 
  On Monday, June 10, 2013, Tony Mullins tonymullins...@gmail.com
 wrote:
   Hi ,
   Ok now I have followed this tutorial word by word.
 
  http://wiki.apache.org/nutch/RunNutchInEclipse#Checkout_Nutch_in_Eclipse.
  
    After getting the new 2.2 source, I built it using Ant - which was
   successful - then set the configurations, commented out the 'hsqldb' dependency
   and uncommented the cassandra dependency (as I want to run it against
   Cassandra). After doing all this, when I run the code from Eclipse I get the
   following error:
   Exception in thread main java.lang.SecurityException: Prohibited
  package name: java.org.apache.nutch.crawl
   at java.lang.ClassLoader.preDefineClass(ClassLoader.java:649)
   at java.lang.ClassLoader.defineClass(ClassLoader.java:785)
   at
 
 
 java.security.SecureClassLoader.defineClass

Re: Error in NutchHadoopTutorial

2013-06-08 Thread Tejas Patil
Thanks Wahaj for the correction. The wiki page has been updated accordingly.


On Sat, Jun 8, 2013 at 1:23 AM, Wahaj Ali wahaj...@gmail.com wrote:

 Hello,
 Just wanted to bring to your notice that there is a slight error in
 the NutchHadoopTutorial (http://wiki.apache.org/nutch/NutchHadoopTutorial
 ).
 The command given under Performing a Nutch Crawl is:


 hadoop jar nutch-${version}.jar org.apache.nutch.crawl.Crawl urls -dir
 urls -depth 3 -topN 5

 It should be:


 hadoop jar nutch-${version}.jar org.apache.nutch.crawl.Crawl urls -dir
 crawl -depth 3 -topN 5

 This is also consistent with the line immediately after it, which says:

 We are using the nutch crawl command. The urls dir is the urls directory
 that we added to the distributed filesystem. The -dir crawl is the output
 directory.
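 
 For context, a minimal sketch of the surrounding steps on the cluster (the
 directory names are just the tutorial's examples, and 'hadoop fs -put' is the
 usual way the urls directory ends up on the distributed filesystem):
 
   hadoop fs -put urls urls
   hadoop jar nutch-${version}.jar org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 5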

 Regards,
 Wahaj



Re: Unable to crawl google search results

2013-06-04 Thread Tejas Patil
Do you mean to turn off the robots processing?
See the comment by Andrzej on [0]:

 The goal of Nutch is to implement a well-behaved crawler that obeys robot
rules and netiquette. Your patch simply disables these control mechanisms.
If it works for you and you can risk the wrath of webmasters, that's fine,
you are free to use this patch - but Nutch as a project cannot encourage
such practice.

[0] : https://issues.apache.org/jira/browse/NUTCH-938


On Tue, Jun 4, 2013 at 2:58 PM, Yves S. Garret
yoursurrogate...@gmail.comwrote:

 One more question, is it ever a good idea to set this property
 protocol.plugin.check.robots in nutch-site.xml to false?


 On Tue, Jun 4, 2013 at 5:30 PM, Yves S. Garret
 yoursurrogate...@gmail.comwrote:

  Got another issue.  When I run my crawler over google search results, I
  see
  _nothing_ in my HBase table... why?
 
  This is what I'm trying to crawl:
 
 
 https://www.google.com/#output=searchsclient=psy-abq=xboxoq=xboxgs_l=hp.3..0l4.648.1180.0.1354.4.4.0.0.0.0.213.547.0j2j1.3.0...0.0...1c.1.15.psy-ab.jd107GllWZwpbx=1bav=on.2,or.r_cp.r_qf.bvm=bv.47380653,d.eWUfp=13d973d49a29d61dbiw=1280bih=635
 
  Here are my logs:
  http://bin.cakephp.org/view/1619245280
 
  Here is my $NUTCH_HOME/conf/nutch-site.xml:
  http://bin.cakephp.org/view/1304119856
 
  And the output that I see when I run the crawler:
  http://bin.cakephp.org/view/260103467
 
  In nutch-site.xml, I have all of the needed plugin.includes, I believe...
 



Re: [REQUEST] (NUTCH-1569) Upgrade 2.x to Gora 0.3

2013-06-03 Thread Tejas Patil
CDH4? Nope. We support Apache Hadoop only and give no guarantees for any
commercial Hadoop distributions out there.


On Mon, Jun 3, 2013 at 7:08 AM, adfel70 adfe...@gmail.com wrote:

 Hi,
 does this patch solve the issue with CDH4 and HBase?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/REQUEST-NUTCH-1569-Upgrade-2-x-to-Gora-0-3-tp4064544p4067815.html
 Sent from the Nutch - User mailing list archive at Nabble.com.


