Re: Large files - nutch failing to fetch

2009-12-22 Thread Sundara Kaku
Hi,

  Thanks for the quick reply. Yes, I thought of using wget or HTTrack, but
Nutch has several features for parsing and removing duplicates while
fetching web pages. Moreover, it is written in Java, and I am using Java for
the current project, so integration is much easier.

  I would really appreciate it if you could share the code that you have
written for storing the files separately.



2009/12/21 Andrzej Bialecki a...@getopt.org

 On 2009-12-21 17:15, Sundara Kaku wrote:

 Hi,

    Nutch is throwing errors while fetching large files (files larger than
 100 MB). I have a website with pages that point to large files (file sizes
 vary from 10 MB to 500 MB), and there are several large files on that
 website. I want to fetch all the files using Nutch, but Nutch throws an
 out-of-memory error for large files (I have set the heap size to 2500m).
 With a 2500m heap, files of around 250 MB are retrieved, but larger ones
 fail, and Nutch takes a lot of time after printing
 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
 -activeThreads=0

 If there are three files of 100 MB each, then it also fails to fetch them
 (at the same depth, with heap size 2500m).

 I have set http.content.limit to -1.

 Is there a way to fetch several large files using Nutch?

 I am using Nutch as a web crawler; I am not using indexing. I want to
 download web resources and scan them for viruses using ClamAV.


 Nutch is probably not the right tool for you - you should use wget instead.
 Nutch was designed to fetch many pages of limited size: as a temporary step
 it caches the downloaded content in memory before flushing it out to disk.

 (I had to solve this limitation once for a specific case - the solution was
 to implement a variant of the protocol and Content that stored data into
 separate HDFS files without buffering in memory - but it was a brittle hack
 that only worked for that particular scenario).
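
For the curious, here is a minimal sketch of the general idea behind that kind of hack - streaming the fetched bytes straight into an HDFS file instead of buffering the whole body in memory. The class and method names below are hypothetical; only the Hadoop FileSystem calls are standard API, and a real fix would also have to make the Content record carry a reference to the spooled file, which is the brittle part:

  import java.io.IOException;
  import java.io.InputStream;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  /**
   * Hypothetical helper: copy an HTTP response body straight to an HDFS file
   * so that large downloads never have to fit into the fetcher's heap.
   */
  public class LargeContentSpooler {

    private final FileSystem fs;
    private final Path baseDir;

    public LargeContentSpooler(Configuration conf, String baseDir) throws IOException {
      this.fs = FileSystem.get(conf);
      this.baseDir = new Path(baseDir);
    }

    /** Streams 'in' to an HDFS file named after the URL's hash and returns its path. */
    public Path spool(String url, InputStream in) throws IOException {
      Path out = new Path(baseDir, Integer.toHexString(url.hashCode()));
      FSDataOutputStream os = fs.create(out, true);
      // copyBytes closes both streams when the last argument is true.
      IOUtils.copyBytes(in, os, 4096, true);
      return out;
    }
  }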

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
Thanks &amp; Regards,
Sundara Kaku

+91 9963990101 Mobile
+91 40 23314848 Work India
iCore Innovations Pvt Ltd
#405 Pavani Plaza,
Khairatabad Main Road
Hyderabad - 54


Re: Accessing crawled data

2009-12-22 Thread Claudio Martella
Yes, I'm aware of that. The problem is that I have some fields of the
SolrDocument that I want to compute by text analysis (basically I want
to do some smart keyword extraction), so I have to get in the middle
between crawling and indexing! My current solution is to dump the content
to a file through the segment reader, parse it, and then use SolrJ to send
the documents. Probably the best solution is to set my own analyzer for the
field on the Solr side and do keyword extraction there.

Thanks for the script, I'll use it!

Just one question about it: your approach will also dump pages outside of
the domain that are referenced by URLs. How far does it go?
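
For reference, a minimal SolrJ sketch of the "send the documents yourself" part of that workaround (SolrJ 1.3/1.4-era API; the server URL and field names are placeholders and must match your own Solr schema):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class SolrPusher {
    public static void main(String[] args) throws Exception {
      // Placeholder URL - point this at your own Solr core.
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "http://example.com/page.html");
      doc.addField("content", "text extracted from the segment dump");
      // Hypothetical field produced by the custom keyword-extraction step.
      doc.addField("keywords", "nutch, solr, crawling");

      server.add(doc);
      server.commit();
    }
  }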


BELLINI ADAM wrote:
 Hi,

 Do you know that you can index your data for Solr? The command is
 solrindex, so you don't need the Nutch index.

 And did you know that you are not obliged to use the crawl command? If you
 want to skip the index step, you can just run your own steps to inject,
 generate, fetch, and update in a loop; at the end of the loop you can index
 your data to Solr with the solrindex command. Here is the code:
   
 # Assumes $NUTCH_HOME, $crawl, $depth and $threads are set beforehand (as in the runbot script).
 steps=10
 echo "- Inject (Step 1 of $steps) -"
 $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls

 echo "- Generate, Fetch, Parse, Update (Step 2 of $steps) -"
 for ((i=0; i < $depth; i++))
 do
   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"

   $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
   if [ $? -ne 0 ]
   then
     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
     break
   fi
   segment=`ls -d $crawl/segments/* | tail -1`

   $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
   if [ $? -ne 0 ]
   then
     echo "runbot: fetch $segment at depth `expr $i + 1` failed."
     echo "runbot: Deleting segment $segment."
     rm -rf $segment
     continue
   fi

   echo "- Updating database (Step 3 of $steps) -"
   $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
 done

 $NUTCH_HOME/bin/nutch solrindex URL_OF_YOUR_SOLR_SERVER $crawl/crawldb $crawl/linkdb $crawl/segments/*


 You can also use the segment reader to extract the content of the pages;
 here is the command:

 ./bin/nutch readseg -dump crawl_dublin/segments/20091001145126/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext

 This command will dump only the content (the page source).

 Hope it will help.



   
 Date: Thu, 17 Dec 2009 15:32:33 +0100
 From: claudio.marte...@tis.bz.it
 To: nutch-user@lucene.apache.org
 Subject: Re: Accessing crawled data

 Hi,

 Actually, I completely mis-explained myself. I'll try to make myself
 clear: I'd like to extract the information in the segments by using the
 parsers.

 This means I could basically use the crawl command, but that would also
 index the data, and that's a waste of resources. So what I will do is copy
 the code in org.apache.nutch.crawl.Crawl up to the point where the data is
 indexed and skip that part. What I'm missing at the moment (and maybe I
 should check the readseg command, i.e. the
 org.apache.nutch.segment.SegmentReader class) is an understanding of how I
 can extract, from the segments, the list of URLs fetched and the text
 associated with each URL. Once I have the list of URLs and the mapping
 url -> text, I can run my text analysis algorithms and create the XML
 messages to send to my Solr server.

 Any pointers on how to handle segments to extract text, or to extract the
 list of all the URLs in the db?

 thanks

 Claudio
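
As one possible starting point, here is a rough sketch of reading (URL, extracted text) pairs straight out of a segment's parse_text data with the plain Hadoop SequenceFile API - roughly what readseg/SegmentReader does internally. The part-00000 path assumes a single-part local segment, so treat this as an illustration rather than the canonical approach:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.parse.ParseText;

  public class SegmentTextDumper {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // args[0] is a segment dir, e.g. crawl/segments/20091222123456
      Path data = new Path(args[0], "parse_text/part-00000/data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);

      Text url = new Text();
      ParseText text = new ParseText();
      while (reader.next(url, text)) {
        // url -> extracted plain text; feed this to the keyword-extraction code.
        System.out.println(url + "\t" + text.getText().length() + " chars");
      }
      reader.close();
    }
  }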


 reinhard schwab wrote:
 
 If you don't want to refetch already fetched pages, I can think of three
 possibilities:

 a/ set a very high fetch interval
 b/ use a customized fetch schedule class instead of DefaultFetchSchedule and
 implement a method
 public boolean shouldFetch(Text url, CrawlDatum datum, long curTime)
 that returns false if a datum has already been fetched (a sketch follows
 after option c/ below). This should theoretically work; I have not done it
 myself. In nutch-site.xml you then have to set the property

 <property>
   <name>db.fetch.schedule.class</name>
   <value>org.apache.nutch.crawl.CustomizedFetchSchedule</value>
   <description>The implementation of fetch schedule. DefaultFetchSchedule
   simply adds the original fetchInterval to the last fetch time, regardless
   of page changes.</description>
 </property>

 c/ modify the class you are now using as the fetch schedule class and adapt
 the shouldFetch method to the behaviour you want
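
A rough sketch of option b/, assuming the Nutch 1.0 FetchSchedule API (the status checks and the fallback to the default behaviour are illustrative choices, not the only possible policy):

  package org.apache.nutch.crawl;   // matches the value used in nutch-site.xml above

  import org.apache.hadoop.io.Text;

  /**
   * Hypothetical fetch schedule for option b/: never re-fetch a page whose
   * CrawlDatum says it has already been fetched (or is gone / not modified).
   */
  public class CustomizedFetchSchedule extends AbstractFetchSchedule {

    @Override
    public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
      switch (datum.getStatus()) {
        case CrawlDatum.STATUS_DB_FETCHED:
        case CrawlDatum.STATUS_DB_NOTMODIFIED:
        case CrawlDatum.STATUS_DB_GONE:
          return false;   // already seen once - skip it
        default:
          return super.shouldFetch(url, datum, curTime);
      }
    }
  }

With this class on the classpath, the db.fetch.schedule.class property shown above points Nutch at it.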

 regards

 Claudio Martella wrote:
   
   
 Hello list,

 I'm using Nutch 1.0 to crawl some intranet sites, and I want to later put
 the crawled data into my Solr server. Though Nutch 1.0 comes with Solr
 support out of the box, I think that solution doesn't fit me. First, I
 need to run my own code on the crawled data (particularly what comes out
 AFTER the parser, as I'm crawling PDF, DOC, etc.), because I want to
 extract keywords with my own code and do some language detection to choose
 which fields to put the text in (each Solr field for me has different
 stopwords and Snowball stemming). What happens with the
 crawl command 

Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki

On 2009-12-22 13:16, Claudio Martella wrote:

Yes, I'm aware of that. The problem is that I have some fields of the
SolrDocument that I want to compute by text analysis (basically I want
to do some smart keyword extraction), so I have to get in the middle
between crawling and indexing! My current solution is to dump the content
to a file through the segment reader, parse it, and then use SolrJ to send
the documents. Probably the best solution is to set my own analyzer for the
field on the Solr side and do keyword extraction there.

Thanks for the script, I'll use it!


Likely the solution that you are looking for is an IndexingFilter - this 
receives a copy of the document with all fields collected just before 
it's sent to the indexing backend - and you can freely modify the 
content of NutchDocument, e.g. do additional analysis, add/remove/modify 
fields, etc.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Accessing crawled data

2009-12-22 Thread Claudio Martella
Andrzej Bialecki wrote:
 On 2009-12-22 13:16, Claudio Martella wrote:
 Yes, I'm aware of that. The problem is that I have some fields of the
 SolrDocument that I want to compute by text analysis (basically I want
 to do some smart keyword extraction), so I have to get in the middle
 between crawling and indexing! My current solution is to dump the content
 to a file through the segment reader, parse it, and then use SolrJ to send
 the documents. Probably the best solution is to set my own analyzer for the
 field on the Solr side and do keyword extraction there.

 Thanks for the script, I'll use it!

 Likely the solution that you are looking for is an IndexingFilter -
 this receives a copy of the document with all fields collected just
 before it's sent to the indexing backend - and you can freely modify
 the content of NutchDocument, e.g. do additional analysis,
 add/remove/modify fields, etc.

This sounds very interesting. So the idea is to take the NutchDocument
as it comes out of the crawl and modify it (inside an IndexingFilter)
before it's sent to indexing (inside Nutch), right? So how does it relate
to the Nutch schema and the Solr schema? Can you give me some pointers?

-- 
Claudio Martella
Digital Technologies
Unit Research  Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.marte...@tis.bz.it http://www.tis.bz.it

Short information regarding use of personal data. According to Section 13 of 
Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we 
process your personal data in order to fulfil contractual and fiscal 
obligations and also to send you information regarding our services and events. 
Your personal data are processed with and without electronic means and by 
respecting data subjects' rights, fundamental freedoms and dignity, 
particularly with regard to confidentiality, personal identity and the right to 
personal data protection. At any time and without formalities you can write an 
e-mail to priv...@tis.bz.it in order to object the processing of your personal 
data for the purpose of sending advertising materials and also to exercise the 
right to access personal data and other rights referred to in Section 7 of 
Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, 
Siemens Street n. 19, Bolzano. You can find the complete information on the web 
site www.tis.bz.it.




Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki

On 2009-12-22 16:07, Claudio Martella wrote:

Andrzej Bialecki wrote:

On 2009-12-22 13:16, Claudio Martella wrote:

Yes, I'm aware of that. The problem is that I have some fields of the
SolrDocument that I want to compute by text analysis (basically I want
to do some smart keyword extraction), so I have to get in the middle
between crawling and indexing! My current solution is to dump the content
to a file through the segment reader, parse it, and then use SolrJ to send
the documents. Probably the best solution is to set my own analyzer for the
field on the Solr side and do keyword extraction there.

Thanks for the script, I'll use it!


Likely the solution that you are looking for is an IndexingFilter -
this receives a copy of the document with all fields collected just
before it's sent to the indexing backend - and you can freely modify
the content of NutchDocument, e.g. do additional analysis,
add/remove/modify fields, etc.


This sounds very interesting. So the idea is to take the NutchDocument
as it comes out of the crawl and modify it (inside an IndexingFilter)
before it's sent to indexing (inside Nutch), right?


Correct - IndexingFilter-s work no matter whether you use Nutch or Solr 
indexing.



So how does it relate to the Nutch schema and the Solr schema? Can you give
me some pointers?



Please take a look at how e.g. the index-more filter is implemented - 
basically you need to copy this filter and make whatever modifications 
you need ;)


Keep in mind that any fields that you create in NutchDocument need to be 
properly declared in schema.xml when using Solr indexing.
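
To make that pointer concrete, here is a stripped-down sketch of an IndexingFilter in the spirit of index-more, assuming the Nutch 1.0 plugin API; the keyword extraction is a trivial placeholder, and the plugin.xml registration that every Nutch plugin needs is omitted:

  package org.apache.nutch.indexer.keywords;   // hypothetical plugin package

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.parse.Parse;

  /** Hypothetical filter that adds a "keywords" field computed from the parsed text. */
  public class KeywordIndexingFilter implements IndexingFilter {

    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) throws IndexingException {
      // Placeholder: the real keyword-extraction code would go here.
      String keywords = extractKeywords(parse.getText());
      // Whatever field name is used here must also exist in Solr's schema.xml.
      doc.add("keywords", keywords);
      return doc;
    }

    private String extractKeywords(String text) {
      // Trivial stand-in so the sketch compiles; replace with real analysis.
      return text.length() > 200 ? text.substring(0, 200) : text;
    }

    // Declares backend-specific field options in the Nutch 1.0 interface;
    // left empty here (and dropped in later Nutch versions).
    public void addIndexBackendOptions(Configuration conf) {
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
    }

    public Configuration getConf() {
      return conf;
    }
  }

The field name passed to doc.add() is exactly what has to be declared as a field in Solr's schema.xml - that is the link between the Nutch document and the Solr schema asked about above.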


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com