Re: Usage of nutch:
Hi Julien, any update on the MongoDB plugin for Nutch? Using https://github.com/ctjmorgan/nutch-mongodb-indexer is a problem for me, as I don't know how to create a new package and I can't find the ivy folders. It's way too complex for a non-Java developer. Currently I have installed Nutch 1.6 on my Windows machine and I need to integrate it with MongoDB.

Julien Nioche-4 wrote:
On 16 November 2011 20:27, ctjmorgan <cmorgan@...> wrote:
Recently went through a similar situation. Check out the two projects below that I posted up to GitHub. Hope they help...
https://github.com/ctjmorgan/nutch-mongdb-parser
https://github.com/ctjmorgan/nutch-mongodb-indexer
The first allows you to prepare a set of URLs contained in MongoDB for Nutch to crawl.

Not sure "parser" is the right name for it; it sounds more like a variant of the injector (haven't looked at the code, though).

The second indexes the information from Nutch into MongoDB in the same way the SolrIndexer works.

There are plans for a pluggable indexing backend so that we can send the documents to [SOLR|ElasticSearch|...]. This would allow us to expose the MongoDB indexer as a plugin instead of piggybacking on the SOLR code. Thanks for sharing these links; it's always interesting to know what people do with/around Nutch.

Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

--
View this message in context: http://lucene.472066.n3.nabble.com/Usage-of-nutch-tp1894986p3513843.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/Usage-of-nutch-tp1894986p4036407.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Installation of NUTCH on windows7
Hi,
Changing the Hadoop jar file to a lower version solved the issue. I removed hadoop-core-1.0.3.jar from the lib folder and replaced it with the hadoop-core-0.20.2.jar file.

Sebastian Nagel wrote:
Hi, that's a known problem with Hadoop on Windows / Cygwin: https://issues.apache.org/jira/browse/HADOOP-7682
I don't know whether there is a reliable fix or a workaround, but you should search for the error - you are not alone ;-)
Sebastian

On 01/25/2013 12:49 PM, Revathi R wrote:
Hello, I am trying to install Nutch on Windows 7 and I got an error like this:

D:\Nutch-1\apache-nutch-1.6-bin\NUTCH TEST\Nutch 1.6\win32\bin> nutch crawl D:\Nutch-1\apache-nutch-1.6-bin\NUTCH TEST\URLs -dir D:\Nutch-1\apache-nutch-1.6-bin\NUTCH TEST\Nutch 1.6\win32\bin
File Not Found
The system cannot find the file specified.
solrUrl is not set, indexing will be skipped...
crawl started in: D:/Nutch-1/apache-nutch-1.6-bin/NUTCH TEST/Nutch 1.6/win32/bin
rootUrlDir = D:/Nutch-1/apache-nutch-1.6-bin/NUTCH TEST/URLs
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-01-25 15:46:14
Injector: crawlDb: D:/Nutch-1/apache-nutch-1.6-bin/NUTCH TEST/Nutch 1.6/win32/bin/crawldb
Injector: urlDir: D:/Nutch-1/apache-nutch-1.6-bin/NUTCH TEST/URLs
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-revathi_ramanadham\mapred\staging\revathi_ramanadham818841982\.staging to 0700
    at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Regards,
Revathi R.

--
View this message in context: http://lucene.472066.n3.nabble.com/Installation-of-NUTCH-on-windows7-tp4036210.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/Installation-of-NUTCH-on-windows7-tp4036210p4036404.html
Sent from the Nutch - User mailing list archive at Nabble.com.
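For anyone wanting to try the same jar swap, the file operations look roughly like this. The sketch below runs against throwaway directories so it is safe to try anywhere; on a real install, LIB would be the lib folder of your apache-nutch-1.6 directory, and you would first download hadoop-core-0.20.2.jar yourself (e.g. from a Maven repository) - the temp dirs and touch calls here are only stand-ins.

```shell
# Stand-in directories so this sketch runs without a Nutch install.
LIB=$(mktemp -d)   # plays the role of apache-nutch-1.6/lib
DL=$(mktemp -d)    # plays the role of your download folder
touch "$LIB/hadoop-core-1.0.3.jar"   # the jar shipped with Nutch 1.6
touch "$DL/hadoop-core-0.20.2.jar"   # the older jar that avoids the 0700 error

# The swap itself: remove the 1.0.x jar, drop in the 0.20.2 one.
rm "$LIB/hadoop-core-1.0.3.jar"
cp "$DL/hadoop-core-0.20.2.jar" "$LIB/"
ls "$LIB"
```

After the swap, rerun the crawl command; the setPermission code path that fails on Windows (HADOOP-7682) is not present in 0.20.2.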
Re: bin/nutch
I get a similar error for Nutch 2.1; how do I fix it?

Buildfile: C:\apache-nutch-2.1\build.xml
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-download-unchecked:
ivy-init-antlib:
ivy-init:
init:
clean-lib:
[delete] Deleting directory C:\apache-nutch-2.1\build\lib
resolve-default:
[ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = C:\apache-nutch-2.1\ivy\ivysettings.xml
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
copy-libs:
compile-core:
[javac] C:\apache-nutch-2.1\build.xml:97: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 181 source files to C:\apache-nutch-2.1\build\classes
[javac] warning: [path] bad path element C:\apache-nutch-2.1\build\lib\activation.jar: no such file or directory
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\APIInfoResource.java:23: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.Get;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\APIInfoResource.java:24: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.ServerResource;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\APIInfoResource.java:26: error: cannot find symbol
[javac] public class APIInfoResource extends ServerResource {
[javac]                                      ^
[javac]   symbol: class ServerResource
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\AdminResource.java:23: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.Get;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\AdminResource.java:24: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.ServerResource;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\AdminResource.java:28: error: cannot find symbol
[javac] public class AdminResource extends ServerResource {
[javac]                                    ^
[javac]   symbol: class ServerResource
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:22: error: package org.restlet.data does not exist
[javac] import org.restlet.data.Form;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:23: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.Delete;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:24: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.Get;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:25: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.Post;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:26: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.Put;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:27: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.ServerResource;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:29: error: cannot find symbol
[javac] public class ConfResource extends ServerResource {
[javac]                                   ^
[javac]   symbol: class ServerResource
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\DbReader.java:29: error: package org.apache.avro.util does not exist
[javac] import org.apache.avro.util.Utf8;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\DbReader.java:30: error: package org.apache.gora.query does not exist
[javac] import org.apache.gora.query.Query;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\DbReader.java:31: error: package org.apache.gora.query does not exist
[javac] import org.apache.gora.query.Result;
[javac]        ^
[javac]
Re: increase the number of fetches at a given time on nutch 1.6 or 2.1
I tried increasing the number of threads to 50 but the speed is not affected. I tried changing the partition.url.mode value to byDomain and fetcher.queue.mode to byDomain, but it still does not help the speed. It seems to get URLs from 2 domains now and the other domains are not getting crawled. Is this due to the URL score? If so, how do I crawl URLs from all the domains?

lewis john mcgibbney wrote:
Increase the number of threads when fetching. Also please see nutch-default.xml for partitioning of URLs; if you know your target domains you may wish to adapt the policy.
Lewis

On Sunday, January 27, 2013, peterbarretto <peterbarretto08@...> wrote:
I want to increase the number of URLs fetched at a time in Nutch. I have around 10 websites to crawl, so how can I crawl all the sites at a time? Right now I am fetching 1 site with a fetch delay of 2 seconds, but it is too slow. How do I concurrently fetch from different domains?

--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
*Lewis*

--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
Sent from the Nutch - User mailing list archive at Nabble.com.
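For reference, the settings mentioned above live in conf/nutch-site.xml (overriding nutch-default.xml). A sketch of the by-domain configuration under discussion - the property names are the ones named in this thread, and the thread count of 50 is just the value tried above, not a recommendation:

```xml
<!-- nutch-site.xml (sketch): partition and queue by domain instead of by host -->
<property>
  <name>partition.url.mode</name>
  <value>byDomain</value>
</property>
<property>
  <name>fetcher.queue.mode</name>
  <value>byDomain</value>
</property>
<!-- global fetcher thread count -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
</property>
```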
Re: JAVA_HOME is not set
Tried escaping the whitespace but it still did not work, so I installed Java in another folder and now the installation works just fine.

Stefan Scheffler wrote:
Hi,
Try to escape the whitespace in "Program Files". I think it should look like "Program\ Files", but I am not sure.
Regards, Stefan

On 25.01.2013 19:51, Gora Mohanty wrote:
On 25 January 2013 16:05, peterbarretto <peterbarretto08@...> wrote:
I still get the below error after setting the JAVA_HOME variable:
<http://lucene.472066.n3.nabble.com/file/n4036204/nutch_java_home_error.png>

Not sure how much experience you have had with Unix-style shell quoting, but this would have been amenable to a simple Google search for "cygwin export variable space". Here is a helpful link: http://cygwin.com/ml/cygwin/2005-08/msg01278.html - but I do not have a Cygwin installation to actually try this out.
Regards, Gora

--
View this message in context: http://lucene.472066.n3.nabble.com/JAVA-HOME-is-not-set-tp617447p4036999.html
Sent from the Nutch - User mailing list archive at Nabble.com.
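For the record, the two quoting approaches that usually work under Cygwin are shown below: quote the whole value, or use the 8.3 short name for "Program Files", which contains no space at all. The JDK path is an assumed stock install location - adjust it to your own.

```shell
# Option 1: quote the value so the space survives.
export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk1.6.0"
echo "$JAVA_HOME"   # expand it quoted everywhere you use it, too

# Option 2: the 8.3 short name for "Program Files" avoids the space entirely.
export JAVA_HOME="/cygdrive/c/PROGRA~1/Java/jdk1.6.0"
echo "$JAVA_HOME"
```

Reinstalling Java into a path without spaces, as done above, sidesteps the problem the same way option 2 does.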
Re: increase the number of fetches at a given time on nutch 1.6 or 2.1
Hi Tejas,
I changed generate.count.mode to domain and generate.max.count to 100, but it still shows the queue mode as byHost and not byDomain.

peterbarretto wrote:
Hi Tejas,
The fetcher.threads.per.host property has been deprecated and replaced with fetcher.threads.per.queue. I am not sure if fetcher.threads.per.queue will help the fetching, as the generator only generates the fetchlist from 2 or 3 domains. How can I tell the generator to create a fetchlist with an equal number of URLs from all domains? I am sure there are URLs from the other domains, but I guess since their URL score is lower it fetches from only 2 domains. I will try increasing fetcher.threads.per.queue to 5, see if the fetch speed is increased, and let you know.

Tejas Patil wrote:
Hey Peter,
I am guessing that you have just increased the global thread count. Have you also increased fetcher.threads.per.host? This will improve the crawl rate, as multiple threads can hit the same site. Don't make it too high or else the system will get overloaded. The Nutch wiki has an article [0] about the potential reasons for slow crawls and some good suggestions.
[0]: https://wiki.apache.org/nutch/OptimizingCrawls
Thanks, Tejas Patil

On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <peterbarretto08@...> wrote:
I tried increasing the number of threads to 50 but the speed is not affected. [...]

--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036976.html
Sent from the Nutch - User mailing list archive at Nabble.com.
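For reference, the generator settings being discussed would go into conf/nutch-site.xml like this. These property names are the ones from the thread; 100 is the value Peter tried, and Tejas later suggests -1 (unlimited) for generate.max.count:

```xml
<!-- group the fetchlist by domain rather than by host -->
<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>
<!-- cap on URLs per domain in one fetchlist -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>
```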
Re: How to get page content of crawled pages
I have tried the repo https://github.com/ctjmorgan/nutch-mongodb-indexer and it does not work. I guess this is because it is stated to be for Nutch 1.3 and I am using 1.6. I get the below output when I try to rebuild:

Buildfile: C:\nutch-16\build.xml
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-download-unchecked:
ivy-init-antlib:
ivy-init:
init:
clean-lib:
[delete] Deleting directory C:\nutch-16\build\lib
resolve-default:
[ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = C:\nutch-16\ivy\ivysettings.xml
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
copy-libs:
compile-core:
[javac] C:\nutch-16\build.xml:96: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to C:\nutch-16\build\classes
[javac] warning: [path] bad path element C:\nutch-16\build\lib\activation.jar: no such file or directory
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:7: warning: [deprecation] JobConf in org.apache.hadoop.mapred has been deprecated
[javac] import org.apache.hadoop.mapred.JobConf;
[javac]        ^
[javac] C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18: error: MongodbWriter is not abstract and does not override abstract method delete(String) in NutchIndexWriter
[javac] public class MongodbWriter implements NutchIndexWriter{
[javac]        ^
[javac] C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:23: warning: [deprecation] JobConf in org.apache.hadoop.mapred has been deprecated
[javac] public void open(JobConf job, String name) throws IOException {
[javac]                  ^
[javac] 1 error
[javac] 4 warnings

I have already crawled some URLs now and I need to move those to MongoDB. Is there an easy-to-use code to do that? I am new to Java, so I will require all the steps of how to add the code.

Jorge Luis Betancourt Gonzalez wrote:
I suppose you can write a custom indexer to store the data in MongoDB instead of Solr; I think there is an open repo on GitHub about this.

- Original message -
From: peterbarretto <peterbarretto08@...>
To: user@.apache
Sent: Tuesday, 29 January 2013 8:46:04
Subject: Re: How to get page content of crawled pages

Hi,
Is there a way I can dump the URL and URL content into MongoDB?

Klemens Muthmann wrote:
Hi,
Super, that works. Thank you. I thereby also found the class that shows how to achieve this within Java code, which is org.apache.nutch.segment.SegmentReader. Thanks again and bye,
Klemens

On 22.11.2010 10:49, Hannes Carl Meyer wrote:
Hi Klemens,
You should run ./bin/nutch readseg! For example:
./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext
Kind regards from Hannover,
Hannes

On Mon, Nov 22, 2010 at 9:23 AM, Klemens Muthmann <klemens.muthmann@...> wrote:
Hi,
I did a small crawl of some pages on the web and want to get the raw HTML content of these pages now. Reading the documentation in the wiki, I guess this content might be somewhere under crawl/segments/20101122071139/content/part-0. I also guess I can access this content using the Hadoop API as described here: http://wiki.apache.org/nutch/Getting_Started
However, I have absolutely no idea how to configure:
MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);
The Hadoop documentation is not very helpful either. May someone please point me in the right direction to get the page content?

Thank you and regards,
Klemens Muthmann
--
Dipl.-Medieninf. Klemens Muthmann
Wissenschaftlicher Mitarbeiter
Technische Universität Dresden
Fakultät Informatik
Institut für Systemarchitektur
Lehrstuhl Rechnernetze
01062 Dresden
Tel.: +49 (351) 463-38214
Fax: +49 (351) 463-38251
E-Mail: klemens.muthmann@...

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037023.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037283.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: increase the number of fetches at a given time on nutch 1.6 or 2.1
08:49:34,476 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Registered Extension-Points:
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2013-01-29 08:49:34,546 INFO fetcher.Fetcher - Using queue mode : byHost
2013-01-29 08:49:34,548 INFO fetcher.Fetcher - Using queue mode : byHost
2013-01-29 08:49:34,548 INFO fetcher.Fetcher - fetching http://www.example.com
2013-01-29 08:49:34,549 INFO fetcher.Fetcher - Using queue mode : byHost

Tejas Patil wrote:
Hey Peter, give a bigger value for the topN parameter. Also, use:

<property>
  <name>generate.max.count</name>
  <value>-1</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>

Not sure why you see the queue mode as byHost and not byDomain. Did it print that in the logs? I should have asked you this before: are you using Nutch 1.x or 2.x?
thanks, Tejas Patil

On Tue, Jan 29, 2013 at 12:08 AM, peterbarretto <peterbarretto08@...> wrote:
Hi Tejas,
I changed generate.count.mode to domain and generate.max.count to 100, but it still shows the queue mode as byHost and not byDomain. [...]

--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2
Re: increase the number of fetches at a given time on nutch 1.6 or 2.1
Hi Lewis,
Regarding "You are not getting very many URLs!" - should I increase fetcher.server.delay from 2 to 5 seconds? I did not quite get what you meant by that. I want roughly an equal number of URLs in the fetchlist from every domain, so that I can fetch more URLs at a time.

lewis john mcgibbney wrote:
You are not getting very many URLs!

On Tue, Jan 29, 2013 at 8:29 PM, peterbarretto <peterbarretto08@...> wrote:
2013-01-29 08:44:35,014 INFO crawl.CrawlDbReader - TOTAL urls: 96404
2013-01-29 08:44:35,018 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 85672

--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4037612.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to get page content of crawled pages
Hi Lewis,
I am new to Java and I don't know how to inherit all the public methods from NutchIndexWriter. Can you help me with that? Then I can rebuild and check if it works.

lewis john mcgibbney wrote:
As you will see, the code has not been amended in a year or so. The positive side is that you only seem to be getting one issue with javac.

On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@...> wrote:
C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18: error: MongodbWriter is not abstract and does not override abstract method delete(String) in NutchIndexWriter
[javac] public class MongodbWriter implements NutchIndexWriter{

Sort this error out by inheriting all public methods from NutchIndexWriter for starters. I take it you are not developing from within Eclipse? As this would have been flagged up immediately. This should at least enable you to compile the code.

"I have already crawled some URLs now and I need to move those to MongoDB. Is there an easy-to-use code to do that?"

Not apart from hacking the code as you are already doing. The code you are pulling is not part of the official Nutch codebase and, to be honest, a few of us didn't even know about it until you brought it to our attention :0) There is no silver bullet here; just take your time and we will get it working.
Lewis

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: increase the number of fetches at a given time on nutch 1.6 or 2.1
Hi Tejas,
I am currently running Nutch 1.6 on Windows 7 (Pentium dual core 2.8 GHz, 2 GB RAM). I will be using Amazon EC2 servers later for crawling. What was your hardware when you ran 4 million URLs with 80 GB of data? Will Nutch 2.1 give a faster crawl speed than 1.6?

Tejas Patil wrote:
I had run crawls with topN as large as 4 million while having a crawldb of ~80 GB. It worked fine without any such issue. Maybe the hardware / cluster you have is not capable of handling a load above 500. Note that if topN is low, then no matter how many fetcher threads you create, you won't be able to increase the number of crawls. Also, as there is a considerable amount of time spent in the generate and update phases, the overall crawl rate will be low. If you are planning to use the same machine, you will have to work with lower values (and thus expect a lower crawl rate).
thanks, Tejas Patil

On Wed, Jan 30, 2013 at 8:06 PM, Lewis John Mcgibbney <lewis.mcgibbney@...> wrote:
You are not getting very many URLs!

On Tue, Jan 29, 2013 at 8:29 PM, peterbarretto <peterbarretto08@...> wrote:
2013-01-29 08:44:35,014 INFO crawl.CrawlDbReader - TOTAL urls: 96404
2013-01-29 08:44:35,018 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 85672

--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4037637.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to get page content of crawled pages
Hi Lewis,
I managed to get the code working by adding the below function to MongodbWriter.java, inside the public class MongodbWriter implements NutchIndexWriter:

public void delete(String key) throws IOException {
    return;
}

And the crawled data is getting stored in MongoDB. The only issue is that it stores only the text of the page and not the full HTML content. How do I store the full HTML content of the page as well? Hope to see the patches soon.
Thanks

lewis john mcgibbney wrote:
Certainly. I am currently reviewing the code and will hopefully have patches for Nutch trunk cooked up for tomorrow. I'll update this thread likewise.
Thanks, Lewis

On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@...> wrote:
Hi Lewis,
I am new to Java and I don't know how to inherit all the public methods from NutchIndexWriter. Can you help me with that? Then I can rebuild and check if it works. [...]

--
Lewis

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039401.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to get page content of crawled pages
Hi Lewis, I downloaded the Nutch copy from http://apache.techartifact.com/mirror/nutch/1.6/

lewis john mcgibbney wrote: Hi, Once I get access to my office I am going to build the patches from trunk. Is it trunk that you are using? Thanks Lewis

On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@> wrote: Hi Lewis, I managed to get the code working by adding the function below to MongodbWriter.java, inside the public class MongodbWriter implements NutchIndexWriter:

public void delete(String key) throws IOException { return; }

With that, the crawled data was stored in MongoDB. The only issue is that it stores only the text of the page, not the full HTML content. How do I also store the full HTML content of the page? Hope to see the patches soon. Thanks

lewis john mcgibbney wrote: Certainly. I am currently reviewing the code and will hopefully have patches for Nutch trunk cooked up for tomorrow. I'll update this thread likewise. Thanks Lewis

On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@> wrote: Hi Lewis, I am new to Java and I don't know how to inherit all public methods from NutchIndexWriter. Can you help me with that? Then I can rebuild and check if it works.

lewis john mcgibbney wrote: As you will see, the code has not been amended in a year or so. The positive side is that you only seem to be getting one issue with javac.

On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@> wrote:

C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18: error: MongodbWriter is not abstract and does not override abstract method delete(String) in NutchIndexWriter
[javac] public class MongodbWriter implements NutchIndexWriter{

Sort this error out by inheriting all public methods from NutchIndexWriter for starters. I take it you are not developing from within Eclipse? That would have flagged this up immediately. This should at least enable you to compile the code.

I have already crawled some URLs now and I need to move those to MongoDB. Is there easy-to-use code to do that? Not apart from hacking the code as you are already doing. The code you are pulling is not part of the official Nutch codebase and, to be honest, a few of us didn't even know about it until you brought it to our attention :0) There is no silver bullet here; just take your time and we will get it working. Lewis

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039613.html Sent from the Nutch - User mailing list archive at Nabble.com.
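The javac error quoted above is the standard "is not abstract and does not override abstract method" failure: a class implementing an interface must supply a body for every abstract method, and adding even a no-op delete() makes it compile. A minimal sketch of that fix, using a hypothetical stand-in interface (the real org.apache.nutch.indexer.NutchIndexWriter in Nutch 1.x declares other methods not shown here, and a real writer would talk to a MongoDB collection):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Nutch's NutchIndexWriter interface,
// reduced to two methods purely to illustrate the compile error.
interface IndexWriterSketch {
    void write(String key, String doc) throws IOException;
    void delete(String key) throws IOException;
}

// Without delete(), javac reports: "MongodbWriterSketch is not abstract
// and does not override abstract method delete(String)". Giving every
// abstract method a body -- even a no-op, as was done in
// MongodbWriter.java -- makes the class concrete and compilable.
class MongodbWriterSketch implements IndexWriterSketch {
    // Records operations so this sketch's behaviour is observable;
    // a real implementation would insert/remove MongoDB documents.
    final List<String> ops = new ArrayList<>();

    @Override
    public void write(String key, String doc) throws IOException {
        ops.add("write:" + key);
    }

    @Override
    public void delete(String key) throws IOException {
        ops.add("delete:" + key);
    }

    public static void main(String[] args) throws IOException {
        MongodbWriterSketch w = new MongodbWriterSketch();
        w.write("http://example.com/", "page text");
        w.delete("http://example.com/");
        System.out.println(w.ops);
    }
}
```

The same shape applies to the actual class: keep the existing write logic and add an override for each remaining abstract method of the interface, even if the body does nothing.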
Re: How to get page content of crawled pages
Hi Lewis, is this patch done?

lewis john mcgibbney wrote: Hi, Once I get access to my office I am going to build the patches from trunk. Is it trunk that you are using? Thanks Lewis

[earlier quoted messages snipped]

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040596.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to get page content of crawled pages
Thanks for the patch, Lewis. Where do I make the pom.xml changes? I can't find the file. Also, in 1.6, the command below returns the HTML content:

./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch -nogenerate -noparse -noparsedata -noparsetext

I haven't built the patch changes as I can't find the pom.xml file.

lewis john mcgibbney wrote: https://issues.apache.org/jira/browse/NUTCH-1528 This is the MongoDB indexer patch ported to trunk. Can I mention that there is usually no timeline on these things, e.g. feature requests. I'm sure you can appreciate that we are all extremely busy at work with an array of other things, so if it takes a bit of time, that's OK. The world goes on and keeps spinning, even if we are getting bombarded by meteorites in Russia!!! Please check out the patch and comment accordingly. Regarding your issue with the full page content, I am not sure this is currently available in Nutch trunk without you writing some code. Full HTML markup is certainly stored in 2.x... but I don't know whether you are prepared to move to 2.x for your operations? hth Lewis

On Fri, Feb 15, 2013 at 1:58 AM, peterbarretto <peterbarretto08@> wrote: Hi Lewis, is this patch done?

[earlier quoted messages snipped]

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040944.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to get page content of crawled pages
Hi Lewis, I have never used a patch before, but after searching a bit I managed to apply the patch in Cygwin. (I had to reinstall Cygwin with the patch tool, as the patch command was not present in the previous install.) I applied the patch, skipping the pom.xml file, and it worked: I can copy all the crawled URLs to MongoDB. I can get the HTML content of crawled URLs from the readseg -dump command in Nutch 1.6, so I guess it will be possible to get the full HTML along with just the text part?

lewis john mcgibbney wrote: Hi Peter

On Saturday, February 16, 2013, peterbarretto <peterbarretto08@gmail.> wrote: Where do I make the pom.xml changes? I can't find the file.

What are you talking about? I made a patch which pulls everything for you. There should be no changes required.

I haven't built the patch changes as I can't find the pom.xml file.

The Maven project file is in the root of the project. We do not build Nutch with Maven; currently for development we use Ant tasks and Ivy for dependencies.

[earlier quoted messages snipped]

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4041066.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to get page content of crawled pages
Hi Lewis, I tried applying the patch on 2.1 but it gives the error below:

patching file pom.xml
patching file ivy/ivy.xml
Hunk #1 succeeded at 34 with fuzz 2 (offset 4 lines).
patching file src/bin/nutch
Hunk #1 FAILED at 61.
Hunk #2 succeeded at 220 with fuzz 2 (offset 2 lines).
1 out of 2 hunks FAILED -- saving rejects to file src/bin/nutch.rej
patching file src/java/org/apache/nutch/indexer/mongodb/MongoDbWriter.java
patching file src/java/org/apache/nutch/indexer/mongodb/MongoDbConstants.java
patching file src/java/org/apache/nutch/indexer/mongodb/MongoDbIndexer.java

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4053146.html Sent from the Nutch - User mailing list archive at Nabble.com.