Re: AW: Null Indexing

2009-09-30 Thread MEHALA N
Hi,
I am getting the following error while running the crawler with:
bin/nutch crawl urls -dir crawl_NEW1 -depth 3 -topN 50


Dedup: adding indexes in: crawl_NEW1/indexes
Exception in thread "main" java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
   at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

Can anyone help me resolve this problem?
-N.Mehala.

On Wed, Sep 23, 2009 at 10:14 AM, Cisek faust...@mailinator.com wrote:

 I had the same little big problem - everything seemed OK:

 - bin/nutch org.apache.nutch.searcher.NutchBean <search query> ... [in my
 case the search query = apache] in Cygwin returns 62 Total hits for the crawl
 of +^http://([a-z0-9]*\.)*apache.org/

 - Nutch in Tomcat webapp after deploy seemed fine (no errors)

 - I had NOT created a new XML file named nutch-0.9.xml containing
 <Context path="/nutch-0.9/" debug="5" privileged="true" docBase="C:\nutch-0.9"/>
 and had NOT put it in C:\Tomcat6.0\conf\Catalina\localhost like Ramadhany had

 - but still got Hits 0-0 (out of about 0 total matching pages): in
 Tomcat-Nutch web interface.

 ... but I have solved it in my case:

 - I forgot to configure the searcher.dir in nutch-site.xml at
 C:\Tomcat6.0\webapps\nutch-0.9\WEB-INF\classes, as described in
 http://wiki.apache.org/nutch/GettingNutchRunningWithWindows (Set Your
 Searcher Directory); the property I added is shown below

 - and now it works fine - Tomcat-Nutch interface returns 62 matching pages
 :)
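 For reference, the property in nutch-site.xml looks roughly like this (the
 value must point at your crawl output directory; the path below is only an
 example):

 <property>
   <name>searcher.dir</name>
   <value>C:/nutch-0.9/crawl</value>
 </property>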


 Imam Nur Ramadhany wrote:

 Hello again everyone,

 My detailed configuration is just like what
 http://wiki.apache.org/nutch/GettingNutchRunningWithWindows says. I'm new to
 Tomcat and Java, so I just followed the instructions.

 I extracted the release at C:\nutch-0.9, made a directory named urls with a
 file also named urls (without extension), then added the URLs to
 crawl-urlfilter.txt (C:\nutch-0.9\conf\crawl-urlfilter.txt). I have also
 crawled a site (http://localhost/). For the web interface search I uploaded
 the Nutch WAR file and created a new XML file named nutch-0.9.xml containing
 <Context path="/nutch-0.9/" debug="5" privileged="true" docBase="C:\nutch-0.9"/>
 and put it in C:\Tomcat6.0\conf\Catalina\localhost. I think that is where my
 problem is. Are the path and docBase correct? When I enter
 http://localhost:8080/nutch-0.9/ there is a welcome page, but when I enter a
 query and click search it doesn't return any hits (Hits 0-0 (out of about 0
 total matching pages):). I have also configured searcher.dir in
 nutch-site.xml at C:\Tomcat6.0\webapps\nutch-0.9\WEB-INF\classes anyway.

 Then, following Koch Martina's suggestion, I tried to search
 directly from the command line in Cygwin with the command:
 bin/nutch org.apache.nutch.searcher.NutchBean <search query>
 It works.
 I'm still working on the nutch-0.9.xml to make the webapp work, trying
 different path and docBase values. But it would be helpful if you
 have any other suggestions.

 Thanks in advance,
 Ramadhany



 
 From: Imam Nur Ramadhany ramadhanyov...@yahoo.com
 To: nutch-user@lucene.apache.org
 Sent: Tuesday, January 13, 2009 7:27:21 AM
 Subject: Re: AW: Null Indexing

 Thanks for your info Martina,
 it works from the command line but it doesn't when using the webapp
 (localhost:8080/nutch-0.9).
 Is it enough to just deploy the WAR file using the Tomcat manager,
 or should we include some other file in CATALINA_HOME?





 
 From: Koch Martina k...@huberverlag.de
 To: nutch-user@lucene.apache.org nutch-user@lucene.apache.org
 Sent: Friday, January 9, 2009 2:57:24 PM
 Subject: AW: Null Indexing

 Hi Ramadhany,

 the warnings and fatal messages you see in the log have nothing to do
 with getting 0 results when searching.
 The fatal message can be eliminated by setting the property
 http.robots.agents in nutch-site.xml to Imam Spider,* (an example is below).
 The urlnormalizer warn messages just inform you that you have not
 specified a dedicated urlnormalizer for a certain scope, so the
 default urlnormalizer is used. If you need more information on this, look
 at URLNormalizers.java (package org.apache.nutch.net).
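 In nutch-site.xml the property would look like this (the first agent name
 should match your http.agent.name, with * kept at the end):

 <property>
   <name>http.robots.agents</name>
   <value>Imam Spider,*</value>
 </property>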

 To narrow down your searching problems, please provide some more details
 on your configuration.
 Did you check the content of your index using Luke
 (http://www.getopt.org/luke/) to make sure that the pages and content you
 are expecting in the index are really in there?
 Did you try a search directly from the command line in Cygwin with the
 command:
 bin/nutch org.apache.nutch.searcher.NutchBean <search query>

 Kind regards,
 Martina

 -----Original Message-----
 From: Imam Nur Ramadhany [mailto:ramadhanyov...@yahoo.com]
 Sent: 09 January 2009 01:39
 To: nutch-user@lucene.apache.org
 Subject: Null Indexing

I'm new to Nutch. I'm trying to deploy nutch-0.9 but I'm still having some problems.
 when I try to 

Re: R: Using Nutch for only retriving HTML

2009-09-30 Thread Magnús Skúlason
Actually it's quite easy to modify the parse-html filter to do this.

That is, save the HTML to a file or to some database; you could then
configure Nutch to skip all unnecessary plugins. Whether Nutch is the right
tool for this task depends a lot on your other requirements. If you can get
by with wget -r, then Nutch is probably overkill. A rough, untested sketch of
such a filter is below.
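Something along these lines, written against the Nutch 1.0 HtmlParseFilter
extension point (the class name, package and output directory are made up,
and in 0.9 the filter method works with Parse rather than ParseResult, if I
remember right):

package org.example.parse; // hypothetical plugin package

import java.io.File;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

/** Sketch: archive the raw HTML of every parsed page to a local directory. */
public class HtmlArchivingFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    try {
      // derive a crude file name from the URL and dump the raw fetched bytes
      String name = content.getUrl().replaceAll("[^A-Za-z0-9._-]", "_");
      File out = new File("/tmp/html-archive", name); // assumed output dir
      out.getParentFile().mkdirs();
      FileOutputStream fos = new FileOutputStream(out);
      fos.write(content.getContent());
      fos.close();
    } catch (Exception e) {
      // archiving is best effort; never fail the parse because of it
    }
    return parseResult;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

You would still need the usual plugin.xml/build.xml plumbing and an entry in
plugin.includes for it to be loaded.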

Best regards,
Magnus

On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal susam@gmail.com wrote:

 On Wed, Sep 30, 2009 at 1:39 AM, O. Olson olson_...@yahoo.it wrote:
  Sorry for pushing this topic, but I would like to know if Nutch would
 help me get the raw HTML in my situation described below.
 
  I am sure it would be a simple answer to those who know Nutch. If not
 then I guess Nutch is the wrong tool for the job.
 
  Thanks,
  O. O.
 
 
  --- Gio 24/9/09, O. Olson olson_...@yahoo.it ha scritto:
 
  Da: O. Olson olson_...@yahoo.it
  Oggetto: Using Nutch for only retriving HTML
  A: nutch-user@lucene.apache.org
  Data: Giovedì 24 settembre 2009, 20:54
  Hi,
  I am new to Nutch. I would like to
  completely crawl through an Internal Website and retrieve
  all the HTML Content. I don’t intend to do further
  processing using Nutch.
  The Website/Content is rather huge. By crawl, I mean that I
  would go to a page, download/archive the HTML, get the links
  from that page, and then download/archive those pages. I
  would keep doing this till I don’t have any new links.

 I don't think it is possible to retrieve pages and store them as
 separate files, one per page, without modifications in Nutch. I am not
 sure though. Someone would correct me if I am wrong here. However, it
 is easy to retrieve the HTML contents from the crawl DB using the
 Nutch API. But from your post, it seems, you don't want to do this.

 
  Is this possible? Is this the right tool for this job, or
  are there other tools out there that would be more suited
  for my purpose?

 I guess 'wget' is the tool you are looking for. You can use it with -r
 option to recursively download pages and store them as separate files
 on the hard disk, which is exactly what you need. You might want to
 use the -np option too. It is available for Windows as well as Linux.
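 Something along these lines should do it (the URL is a placeholder, and -A
 just limits the saved files to HTML):

 wget -r -np -A html,htm http://intranet.example.com/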

 Regards,
 Susam Pal



Re: Multilanguage support in Nutch 1.0

2009-09-30 Thread David Jashi
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

 hi

 try to activate the language-identifier plugin.
 You must add it to the <name>plugin.includes</name> section of the
 nutch-site.xml file.

Shame on me! Thanks a lot.


 It's something like this:



 <property>
   <name>plugin.includes</name>
   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please enable
   protocol-httpclient, but be aware of possible intermittent problems with the
   underlying commons-httpclient library.
   </description>
 </property>


 From: da...@jashi.ge
 Date: Tue, 29 Sep 2009 18:59:52 +0400
 Subject: Multilanguage support in Nutch 1.0
 To: nutch-user@lucene.apache.org

 Hello, all.

 I've got a bit of trouble with Nutch 1.0 and multilanguage support:

 I have a fresh install of Nutch and two analysis plugins I'd like to turn on:
 analysis-de (German) and analysis-ge (Georgian)
 Here are the innards of my seed file:
 ---
 http://212.72.133.54/l/test.html
 http://212.72.133.54/l/de.html
 ---
 The first is Georgian, the other German. When I run

 bin/nutch crawl seed -dir crawl -threads 10 -depth 2

 there is not the slightest sign of any analysis plug-in being called, even
 though it's clearly stated in hadoop.log that they are on and active:
 ---
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - crawl started in: crawl
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - rootUrlDir = seed
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - threads = 10
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - depth = 2
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: starting
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: crawlDb: 
 crawl/crawldb
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: urlDir: seed
 2009-09-29 16:39:13,390 INFO  crawl.Injector - Injector: Converting
 injected urls to crawl db entries.
 2009-09-29 16:39:13,421 WARN  mapred.JobClient - Use
 GenericOptionsParser for parsing the arguments. Applications should
 implement Tool for the same.
 2009-09-29 16:39:15,546 INFO  plugin.PluginRepository - Plugins:
 looking in: C:\cygwin\opt\nutch\plugins
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Plugin
 Auto-activation mode: [true]
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Registered Plugins:
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         the nutch
 core extension points (nutch-extensionpoints)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic Query
 Filter (query-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Lucene
 Analysers (lib-lucene-analyzers)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic URL
 Normalizer (urlnormalizer-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Language
 Identification Parser/Filter (language-identifier)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Html Parse
 Plug-in (parse-html)

 !
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Georgian
 Analysis Plug-in (analysis-ge)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         German
 Analysis Plug-in (analysis-de)
 !

 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic
 Indexing Filter (index-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic
 Summarizer Plug-in (summary-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Site Query
 Filter (query-site)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         HTTP
 Framework (lib-http)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Text Parse
 Plug-in (parse-text)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         More Query
 Filter (query-more)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
 Filter (urlfilter-regex)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Pass-through
 URL Normalizer (urlnormalizer-pass)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Http Protocol
 Plug-in (protocol-http)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
 Normalizer (urlnormalizer-regex)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         OPIC Scoring
 Plug-in (scoring-opic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         CyberNeko
 HTML Parser (lib-nekohtml)
 2009-09-29 

Re: Multilanguage support in Nutch 1.0

2009-09-30 Thread David Jashi
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

 hi

 try to activate the language-identifier plugin
 you must add it in the nutch-site.xml file in the  
 nameplugin.includes/name section.

Ooops. It IS activated.

2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -
Language Identification Parser/Filter (language-identifier)

But fetched pages are not being passed to it, as far as I can tell.


Re: graphical user interface v0.2 for nutch

2009-09-30 Thread Bartosz Gadzimski

Hello,

First - great job, it looks and works very nice.

I have a question about urlfilters. Is it possible to have a
regex-urlfilter per instance (a different one for each instance)?


Also, what is the nutch-gui/conf/regex-urlfilter.txt file for?

Feature request: an option to merge segments, or maybe to remove old ones?

Thanks,
Bartosz


Re: graphical user interface v0.2 for nutch

2009-09-30 Thread Marko Bauhardt


On Sep 30, 2009, at 3:47 PM, Bartosz Gadzimski wrote:


Hello,


Hi Bartosz




First - great job, it looks and works very nice.


:) Thanks!




I have a question about urlfilters. Is this possible to get regex- 
urlfilter per instance (different for each instance) ?


Good idea. I think you could configure the property
urlfilter.regex.file via the configuration tab per instance.
For example, an instance fast-crawl could use a URL filter file named
fast-regex-urlfilter.txt, and another instance another name.
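The property itself would look something like this in the configuration of
the fast-crawl instance (the file name is only an example):

<property>
  <name>urlfilter.regex.file</name>
  <value>fast-regex-urlfilter.txt</value>
</property>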


can you test this?




Feature request - option to merge segments or maybe removing old one ?


Ok. Sure. You can create a feature request in the gui issue tracker 
http://oss.101tec.com/jira/browse/NUTCHGUI


thanks for testing the gui
marko




~~~
101tec GmbH

Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com





Re: graphical user interface v0.2 for nutch

2009-09-30 Thread David Jashi
Any documentation on how to add this GUI to an existing Nutch instance?

პატივისცემით,
დავით ჯაში




2009/9/30 Bartosz Gadzimski bartek...@o2.pl:
 Hello,

 First - great job, it looks and works very nice.

 I have a question about urlfilters. Is this possible to get regex-urlfilter
 per instance (different for each instance) ?

 Also what for is nutch-gui/conf/regex-urlfilter.txt file ?

 Feature request - option to merge segments or maybe removing old one ?

 Thanks,
 Bartosz



Re: graphical user interface v0.2 for nutch

2009-09-30 Thread Marko Bauhardt

Hi David.
Sorry, I don't understand your question. Documentation about the Nutch
GUI can be found here:

http://wiki.github.com/101tec/nutch

marko




On Sep 30, 2009, at 4:02 PM, David Jashi wrote:


Any documentation on how to add this GUI to existing NUtch instance?

პატივისცემით,
დავით ჯაში




2009/9/30 Bartosz Gadzimski bartek...@o2.pl:

Hello,

First - great job, it looks and works very nice.

I have a question about urlfilters. Is this possible to get regex- 
urlfilter

per instance (different for each instance) ?

Also what for is nutch-gui/conf/regex-urlfilter.txt file ?

Feature request - option to merge segments or maybe removing old  
one ?


Thanks,
Bartosz







Re: graphical user interface v0.2 for nutch

2009-09-30 Thread David Jashi
Thanks,

Sorry for my bad English, I'll rephrase:

Can I add this GUI to an existing Nutch installation? I've made some
modifications to mine, so starting from scratch would be quite
time-consuming.

პატივისცემით,
დავით ჯაში




On Wed, Sep 30, 2009 at 18:19, Marko Bauhardt m...@101tec.com wrote:
 Hi David.
 sorry i dont understand your question. documentation about the nutch gui can
 you find here
 http://wiki.github.com/101tec/nutch

 marko




 On Sep 30, 2009, at 4:02 PM, David Jashi wrote:

 Any documentation on how to add this GUI to existing NUtch instance?

 პატივისცემით,
 დავით ჯაში




 2009/9/30 Bartosz Gadzimski bartek...@o2.pl:

 Hello,

 First - great job, it looks and works very nice.

 I have a question about urlfilters. Is this possible to get
 regex-urlfilter
 per instance (different for each instance) ?

 Also what for is nutch-gui/conf/regex-urlfilter.txt file ?

 Feature request - option to merge segments or maybe removing old one ?

 Thanks,
 Bartosz






Re: graphical user interface v0.2 for nutch

2009-09-30 Thread Marko Bauhardt




Sorry for my bad English, I`ll rephrase:


:) No Problem.




Can I add this GUI to existing Nutch installation? I've made some
modifications to mine, so starting from scratch would be quite
time-consuming.


Ah, OK, I understand. Hm. The GUI is forked from the release-1.0 tag. Which
Nutch version have you patched?
You can try to make a diff against release-1.0 to create a patch file.
After that you can check out or download the GUI and try to apply your
patch, roughly like the commands below.
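A minimal sketch, assuming your changes live in an svn working copy of the
release-1.0 tag and that you have a fresh copy of the gui sources:

# in your patched release-1.0 working copy
svn diff > my-changes.patch

# in the fresh gui checkout/download
cd nutch-gui
patch -p0 < ../my-changes.patch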

maybe this could work.



marko






პატივისცემით,
დავით ჯაში




On Wed, Sep 30, 2009 at 18:19, Marko Bauhardt m...@101tec.com wrote:

Hi David.
sorry i dont understand your question. documentation about the  
nutch gui can

you find here
http://wiki.github.com/101tec/nutch

marko




On Sep 30, 2009, at 4:02 PM, David Jashi wrote:


Any documentation on how to add this GUI to existing NUtch instance?

პატივისცემით,
დავით ჯაში




2009/9/30 Bartosz Gadzimski bartek...@o2.pl:


Hello,

First - great job, it looks and works very nice.

I have a question about urlfilters. Is this possible to get
regex-urlfilter
per instance (different for each instance) ?

Also what for is nutch-gui/conf/regex-urlfilter.txt file ?

Feature request - option to merge segments or maybe removing old  
one ?


Thanks,
Bartosz












Re: Specify at least one source--a file or resource collection error

2009-09-30 Thread Jaime Martín
I've solved this problem by using Ant 1.6.5 instead of 1.7.

On 29 September 2009 12:18, Jaime Martín james...@gmail.com wrote:

 Hi again:
 I just want to be able to build Nutch in Eclipse. What version do you use?
 Is the latest official release (1.0) not advisable? Is any plugin or a
 particular SVN revision required?
 Thank you very much.



 On 23 September 2009 15:40, Jaime Martín james...@gmail.com wrote:

 Hi:
 I'm following the steps to run the Nutch 1.0 release with Eclipse and Windows
 described in this link
 http://wiki.apache.org/nutch/RunNutchInEclipse1.0

 I'm trying to build it, but when I launch the war target I get this error:

 C:\ECLIPSE321\workspace\nutch-1.0\build.xml:62: Specify at least one
 source--a file or resource collection.

 Any tips?
 Thank you!





Re: graphical user interface v0.2 for nutch

2009-09-30 Thread David Jashi
That's 1.0

Thanks a lot. I'll give it a try.

პატივისცემით,
დავით ჯაში




On Wed, Sep 30, 2009 at 18:37, Marko Bauhardt m...@101tec.com wrote:


 Sorry for my bad English, I`ll rephrase:

 :) No Problem.



 Can I add this GUI to existing Nutch installation? I've made some
 modifications to mine, so starting from scratch would be quite
 time-consuming.

 Ah ok understand. Hm. The gui is forked from the release-1.0 tag. what for
 nutch version you have patched?
 you can try to make a diff on the release-1.0 to create a patch file. after
 that you can checkout or download the gui and try to apply your patch.
 maybe this could work.



 marko





 პატივისცემით,
 დავით ჯაში




 On Wed, Sep 30, 2009 at 18:19, Marko Bauhardt m...@101tec.com wrote:

 Hi David.
 sorry i dont understand your question. documentation about the nutch gui
 can
 you find here
 http://wiki.github.com/101tec/nutch

 marko




 On Sep 30, 2009, at 4:02 PM, David Jashi wrote:

 Any documentation on how to add this GUI to existing NUtch instance?

 პატივისცემით,
 დავით ჯაში




 2009/9/30 Bartosz Gadzimski bartek...@o2.pl:

 Hello,

 First - great job, it looks and works very nice.

 I have a question about urlfilters. Is this possible to get
 regex-urlfilter
 per instance (different for each instance) ?

 Also what for is nutch-gui/conf/regex-urlfilter.txt file ?

 Feature request - option to merge segments or maybe removing old one ?

 Thanks,
 Bartosz









RE: Multilanguage support in Nutch 1.0

2009-09-30 Thread BELLINI ADAM

hi,
do you have 'lang' metadata on the pages (for example a Content-Language meta
tag)? Because the plugin first tries to get the language from metadata.
If you look at the Java source of the plugin, LanguageIndexingFilter.java:


// check if LANGUAGE found, possibly put there by HTMLLanguageParser
String lang = parse.getData().getParseMeta().get(Metadata.LANGUAGE);

// check if HTTP-header tells us the language
if (lang == null) {
  lang = parse.getData().getContentMeta().get(Response.CONTENT_LANGUAGE);
}

Also try using Luke to check all the metadata in your index.





 From: da...@jashi.ge
 Date: Wed, 30 Sep 2009 17:22:26 +0400
 Subject: Re: Multilanguage support in Nutch 1.0
 To: nutch-user@lucene.apache.org
 
 On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:
 
  hi
 
  try to activate the language-identifier plugin
  you must add it in the nutch-site.xml file in the  
  nameplugin.includes/name section.
 
 Ooops. It IS activated.
 
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -
 Language Identification Parser/Filter (language-identifier)
 
 But fetched pages are not passed to it, as I recon.
  
_
Windows Live helps you keep up with all your friends, in one place.
http://go.microsoft.com/?linkid=9660826

Re: R: Using Nutch for only retriving HTML

2009-09-30 Thread O. Olson
Thanks Magnús and Susam for your responses and for pointing me in the right
direction. I think I will spend some time over the next few weeks trying Nutch
out. I only need the HTML – I don't care whether it is in the database or in
separate files.

Thanks guys,
O.O. 


--- Mer 30/9/09, Magnús Skúlason magg...@gmail.com ha scritto:

 Da: Magnús Skúlason magg...@gmail.com
 Oggetto: Re: R: Using Nutch for only retriving HTML
 A: nutch-user@lucene.apache.org
 Data: Mercoledì 30 settembre 2009, 11:48
 Actually its quite easy to modify the
 parse-html filter to do this.
 
 That is saving the HTML to a file or to some database, you
 could then
 configure it to skip all unnecessary plugins. I think it
 depends a lot on
 the other requirements you have whether using nutch for
 this task is the
 right way to go or not. If you can get by with wget -r then
 its probably an
 overkill to use nutch.
 
 Best regards,
 Magnus
 
 On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal susam@gmail.com
 wrote:
 
  On Wed, Sep 30, 2009 at 1:39 AM, O. Olson olson_...@yahoo.it
 wrote:
   Sorry for pushing this topic, but I would like to
 know if Nutch would
  help me get the raw HTML in my situation described
 below.
  
   I am sure it would be a simple answer to those
 who know Nutch. If not
  then I guess Nutch is the wrong tool for the job.
  
   Thanks,
   O. O.
  
  
   --- Gio 24/9/09, O. Olson olson_...@yahoo.it
 ha scritto:
  
   Da: O. Olson olson_...@yahoo.it
   Oggetto: Using Nutch for only retriving HTML
   A: nutch-user@lucene.apache.org
   Data: Giovedì 24 settembre 2009, 20:54
   Hi,
       I am new to Nutch. I
 would like to
   completely crawl through an Internal Website
 and retrieve
   all the HTML Content. I don’t intend to do
 further
   processing using Nutch.
   The Website/Content is rather huge. By crawl,
 I mean that I
   would go to a page, download/archive the
 HTML, get the links
   from that page, and then download/archive
 those pages. I
   would keep doing this till I don’t have any
 new links.
 
  I don't think it is possible to retrieve pages and
 store them as
  separate files, one per page, without modifications in
 Nutch. I am not
  sure though. Someone would correct me if I am wrong
 here. However, it
  is easy to retrieve the HTML contents from the crawl
 DB using the
  Nutch API. But from your post, it seems, you don't
 want to do this.
 
  
   Is this possible? Is this the right tool for
 this job, or
   are there other tools out there that would be
 more suited
   for my purpose?
 
  I guess 'wget' is the tool you are looking for. You
 can use it with -r
  option to recursively download pages and store them as
 separate files
  on the hard disk, which is exactly what you need. You
 might want to
  use the -np option too. It is available for Windows as
 well as Linux.
 
  Regards,
  Susam Pal
 
 





RE: R: Using Nutch for only retriving HTML

2009-09-30 Thread BELLINI ADAM

hi,
maybe you can run a crawl (don't forget to filter the pages so that you keep
only .html or .htm files; you do that in conf/crawl-urlfilter.txt).
After that, go to the hadoop.log file and grep for the string
'fetcher.Fetcher - fetching http' to get all the fetched URLs.
Don't forget to sort the file and make it unique (sort/uniq), because
sometimes the crawl tries to fetch pages several times if they do not answer
the first time.

When you have all your URLs you can run wget on the file and archive the
downloaded pages; something like the pipeline below.
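Roughly like this (the log path assumes the default local runtime layout, and
the sed step assumes the standard Fetcher log line):

grep 'fetcher.Fetcher - fetching http' logs/hadoop.log \
  | sed 's/.*fetching //' \
  | sort -u > fetched_urls.txt

# -x keeps the directory structure, -i reads the URL list from the file
wget -x -i fetched_urls.txt -P html_archive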

hope it could help.





 Date: Wed, 30 Sep 2009 20:46:50 +
 From: olson_...@yahoo.it
 Subject: Re: R: Using Nutch for only retriving HTML
 To: nutch-user@lucene.apache.org
 
 Thanks Magnús and Susam for your responses and pointing me in the right 
 direction. I think I would spend time over the next few weeks trying out 
 Nutch over. I only needed the HTML – I don’t care if it is in the Database or 
 in separate files. 
 
 Thanks guys,
 O.O. 
 
 
 --- Mer 30/9/09, Magnús Skúlason magg...@gmail.com ha scritto:
 
  Da: Magnús Skúlason magg...@gmail.com
  Oggetto: Re: R: Using Nutch for only retriving HTML
  A: nutch-user@lucene.apache.org
  Data: Mercoledì 30 settembre 2009, 11:48
  Actually its quite easy to modify the
  parse-html filter to do this.
  
  That is saving the HTML to a file or to some database, you
  could then
  configure it to skip all unnecessary plugins. I think it
  depends a lot on
  the other requirements you have whether using nutch for
  this task is the
  right way to go or not. If you can get by with wget -r then
  its probably an
  overkill to use nutch.
  
  Best regards,
  Magnus
  
  On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal susam@gmail.com
  wrote:
  
   On Wed, Sep 30, 2009 at 1:39 AM, O. Olson olson_...@yahoo.it
  wrote:
Sorry for pushing this topic, but I would like to
  know if Nutch would
   help me get the raw HTML in my situation described
  below.
   
I am sure it would be a simple answer to those
  who know Nutch. If not
   then I guess Nutch is the wrong tool for the job.
   
Thanks,
O. O.
   
   
--- Gio 24/9/09, O. Olson olson_...@yahoo.it
  ha scritto:
   
Da: O. Olson olson_...@yahoo.it
Oggetto: Using Nutch for only retriving HTML
A: nutch-user@lucene.apache.org
Data: Giovedì 24 settembre 2009, 20:54
Hi,
I am new to Nutch. I
  would like to
completely crawl through an Internal Website
  and retrieve
all the HTML Content. I don’t intend to do
  further
processing using Nutch.
The Website/Content is rather huge. By crawl,
  I mean that I
would go to a page, download/archive the
  HTML, get the links
from that page, and then download/archive
  those pages. I
would keep doing this till I don’t have any
  new links.
  
   I don't think it is possible to retrieve pages and
  store them as
   separate files, one per page, without modifications in
  Nutch. I am not
   sure though. Someone would correct me if I am wrong
  here. However, it
   is easy to retrieve the HTML contents from the crawl
  DB using the
   Nutch API. But from your post, it seems, you don't
  want to do this.
  
   
Is this possible? Is this the right tool for
  this job, or
are there other tools out there that would be
  more suited
for my purpose?
  
   I guess 'wget' is the tool you are looking for. You
  can use it with -r
   option to recursively download pages and store them as
  separate files
   on the hard disk, which is exactly what you need. You
  might want to
   use the -np option too. It is available for Windows as
  well as Linux.
  
   Regards,
   Susam Pal
  
  
 
 
   
  
_
We are your photos. Share us now with Windows Live Photos.
http://go.microsoft.com/?linkid=9666047

RE: R: Using Nutch for only retriving HTML

2009-09-30 Thread BELLINI ADAM


me again,

I forgot to tell you the easiest way...

once the crawl is finished you can dump the whole crawldb (it contains all the
links to your HTML pages) into a text file:

./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile

and then you can run wget on this dump and archive the files.



 From: mbel...@msn.com
 To: nutch-user@lucene.apache.org
 Subject: RE: R: Using Nutch for only retriving HTML
 Date: Wed, 30 Sep 2009 21:04:03 +
 
 
 hi 
 mabe you can run a crawl (dont forget to filter the pages just to keep html 
 or htm files (you will do it at conf/crawl-urlfilter.txt) )
 after that you will go to the hadoop.log file and grep the sentence 
 'fetcher.Fetcher - fetching http' to get all the fetched urls.
 dont forget to sort the file and to make it uniq (command uniq -c) becoz 
 sometimes the crawl try to fecth the poges several times if they  will not 
 answer the first time.
 
 when you have all your urls you can run wget on your file and archive the 
 dowlowaded pages.
 
 hope it could help.
 
 
 
 
 
  Date: Wed, 30 Sep 2009 20:46:50 +
  From: olson_...@yahoo.it
  Subject: Re: R: Using Nutch for only retriving HTML
  To: nutch-user@lucene.apache.org
  
  Thanks Magnús and Susam for your responses and pointing me in the right 
  direction. I think I would spend time over the next few weeks trying out 
  Nutch over. I only needed the HTML – I don’t care if it is in the Database 
  or in separate files. 
  
  Thanks guys,
  O.O. 
  
  
  --- Mer 30/9/09, Magnús Skúlason magg...@gmail.com ha scritto:
  
   Da: Magnús Skúlason magg...@gmail.com
   Oggetto: Re: R: Using Nutch for only retriving HTML
   A: nutch-user@lucene.apache.org
   Data: Mercoledì 30 settembre 2009, 11:48
   Actually its quite easy to modify the
   parse-html filter to do this.
   
   That is saving the HTML to a file or to some database, you
   could then
   configure it to skip all unnecessary plugins. I think it
   depends a lot on
   the other requirements you have whether using nutch for
   this task is the
   right way to go or not. If you can get by with wget -r then
   its probably an
   overkill to use nutch.
   
   Best regards,
   Magnus
   
   On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal susam@gmail.com
   wrote:
   
On Wed, Sep 30, 2009 at 1:39 AM, O. Olson olson_...@yahoo.it
   wrote:
 Sorry for pushing this topic, but I would like to
   know if Nutch would
help me get the raw HTML in my situation described
   below.

 I am sure it would be a simple answer to those
   who know Nutch. If not
then I guess Nutch is the wrong tool for the job.

 Thanks,
 O. O.


 --- Gio 24/9/09, O. Olson olson_...@yahoo.it
   ha scritto:

 Da: O. Olson olson_...@yahoo.it
 Oggetto: Using Nutch for only retriving HTML
 A: nutch-user@lucene.apache.org
 Data: Giovedì 24 settembre 2009, 20:54
 Hi,
 I am new to Nutch. I
   would like to
 completely crawl through an Internal Website
   and retrieve
 all the HTML Content. I don’t intend to do
   further
 processing using Nutch.
 The Website/Content is rather huge. By crawl,
   I mean that I
 would go to a page, download/archive the
   HTML, get the links
 from that page, and then download/archive
   those pages. I
 would keep doing this till I don’t have any
   new links.
   
I don't think it is possible to retrieve pages and
   store them as
separate files, one per page, without modifications in
   Nutch. I am not
sure though. Someone would correct me if I am wrong
   here. However, it
is easy to retrieve the HTML contents from the crawl
   DB using the
Nutch API. But from your post, it seems, you don't
   want to do this.
   

 Is this possible? Is this the right tool for
   this job, or
 are there other tools out there that would be
   more suited
 for my purpose?
   
I guess 'wget' is the tool you are looking for. You
   can use it with -r
option to recursively download pages and store them as
   separate files
on the hard disk, which is exactly what you need. You
   might want to
use the -np option too. It is available for Windows as
   well as Linux.
   
Regards,
Susam Pal
   
   
  
  

 
 _
 We are your photos. Share us now with Windows Live Photos.
 http://go.microsoft.com/?linkid=9666047
  
_
Attention all humans. We are your photos. Free us.
http://go.microsoft.com/?linkid=9666046

Re: R: Using Nutch for only retriving HTML

2009-09-30 Thread Andrzej Bialecki

BELLINI ADAM wrote:


me again,

i forgot to tell u the easiest way...

once the crawl is finished you can dump the whole db (it contains all the links 
to your html pages) in a text file..

./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile

and you can perfor the wget on this db and archive the files


I'd argue with this advice. The goal here is to obtain the HTML pages. 
If you have crawled them, then why do it again? You already have their 
content locally.


However, page content is NOT stored in crawldb, it's stored in segments. 
So you need to dump the content from segments, and not the content of 
crawldb.


The command 'bin/nutch readseg -dump <segmentName> <outputDir>' should do
the trick.
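If you want to dump every segment in one go, a small loop over the segments
directory works as well (directory names below assume the default crawl
layout):

for seg in crawl/segments/*; do
  bin/nutch readseg -dump "$seg" "dump/$(basename "$seg")"
done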



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: graphical user interface v0.2 for nutch

2009-09-30 Thread Mario Schroeder
There is a nutch developer in my neighborhood. Yes sir.

So let's stay in touch.

Mario

2009/9/24, Marko Bauhardt m...@101tec.com:
 Hi list.
 we have pushed the second nutch gui release version 0.2.

 You can download the binary or the sources on
 http://github.com/101tec/nutch/downloads
 Two main features are implemented in this version

 + Security. You can start the admin GUI with a login feature; usernames
 and passwords can be configured in a separate file (see
 http://wiki.github.com/101tec/nutch/security).
 + If you push a newly crawled index to search, the searcher will
 reload the index automatically.


 marko




 ~~~
 101tec GmbH

 Halle (Saale), Saxony-Anhalt, Germany
 http://www.101tec.com





-- 
Sent from my mobile device

http://www.ironschroedi.com/de/