Re: How to get page content of crawled pages

peterbarretto Wed, 30 Jan 2013 19:52:57 -0800

I have tried the repo https://github.com/ctjmorgan/nutch-mongodb-indexer and
it does not work.
I guess this is because it says it is for Nutch 1.3 and I am
using 1.6.


I get the output below when I try to rebuild:

Buildfile: C:\nutch-16\build.xml
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-download-unchecked:

ivy-init-antlib:

ivy-init:

init:

clean-lib:
   [delete] Deleting directory C:\nutch-16\build\lib

resolve-default:
[ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = C:\nutch-16\ivy\ivysettings.xml
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

copy-libs:

compile-core:
    [javac] C:\nutch-16\build.xml:96: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to C:\nutch-16\build\classes
    [javac] warning: [path] bad path element "C:\nutch-16\build\lib\activation.jar": no such file or directory
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:7: warning: [deprecation] JobConf in org.apache.hadoop.mapred has been deprecated
    [javac] import org.apache.hadoop.mapred.JobConf;
    [javac]                                ^
    [javac] C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18: error: MongodbWriter is not abstract and does not override abstract method delete(String) in NutchIndexWriter
    [javac] public class MongodbWriter  implements NutchIndexWriter{
    [javac]        ^
    [javac] C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:23: warning: [deprecation] JobConf in org.apache.hadoop.mapred has been deprecated
    [javac]     public void open(JobConf job, String name) throws IOException {
    [javac]                      ^
    [javac] 1 error
    [javac] 4 warnings
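
The only hard error in that log is the one at MongodbWriter.java:18. The NutchIndexWriter being compiled against declares an abstract delete(String) method, and the 1.3-era MongodbWriter does not implement it. Below is a minimal sketch of the shape of the fix. It uses a stand-in interface so that it compiles without Nutch on the classpath. Only the delete(String) name and parameter come from the compiler message; IndexWriterSketch, InMemoryWriter, write(), close() and the map-backed store are hypothetical placeholders.

```java
import java.util.HashMap;
import java.util.Map;

class DeleteMethodSketch {

    // Stand-in for NutchIndexWriter so the sketch compiles on its own.
    // ASSUMPTION: only delete(String) is taken from the compiler message;
    // write() and close() are hypothetical placeholders.
    interface IndexWriterSketch {
        void write(String key, String content);
        void delete(String key);
        void close();
    }

    // Stand-in for MongodbWriter, backed by a map instead of MongoDB.
    static class InMemoryWriter implements IndexWriterSketch {
        final Map<String, String> store = new HashMap<>();

        @Override
        public void write(String key, String content) {
            store.put(key, content);
        }

        // The method the compiler reports as missing: once it is
        // implemented, the class no longer has to be abstract.
        @Override
        public void delete(String key) {
            store.remove(key);
        }

        @Override
        public void close() {
            // Nothing to release in this in-memory stand-in.
        }
    }

    public static void main(String[] args) {
        InMemoryWriter writer = new InMemoryWriter();
        writer.write("http://example.com/", "<html>...</html>");
        writer.delete("http://example.com/");
        System.out.println("documents left: " + writer.store.size());
        writer.close();
    }
}
```

In the real MongodbWriter, the body of delete would presumably remove the matching document from the collection. The real method may also declare a checked exception, so it is worth checking the NutchIndexWriter source in the 1.6 tree before copying the signature.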


I have already crawled some URLs and I need to move them to MongoDB. Is
there easy-to-use code to do that? I am new to Java, so I will need all
the steps for adding the code.
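
One possible route that avoids rebuilding Nutch is to dump the segment with readseg (the command quoted further down in this thread) and parse the resulting text file into URL/content pairs, which any MongoDB client could then insert. The sketch below only does the parsing step. It assumes that each record in the dump starts with a line beginning "Recno:: ", has a "URL:: " line, and that the raw page follows a line that is exactly "Content:". Please check this against an actual dump file, as the layout is an assumption here. The class and method names are hypothetical, and no MongoDB calls are shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DumpParserSketch {

    // Turns the lines of a readseg text dump into a url -> raw content map.
    // ASSUMPTION: records start with "Recno:: ", carry a "URL:: " line, and
    // the page body follows a line that is exactly "Content:".
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> pages = new LinkedHashMap<>();
        String url = null;
        StringBuilder body = null;
        for (String line : lines) {
            if (line.startsWith("Recno:: ")) {
                // A new record begins: store the previous one, if complete.
                if (url != null && body != null) {
                    pages.put(url, body.toString());
                }
                url = null;
                body = null;
            } else if (body == null && line.startsWith("URL:: ")) {
                url = line.substring("URL:: ".length()).trim();
            } else if (body == null && url != null && line.equals("Content:")) {
                body = new StringBuilder();
            } else if (body != null) {
                body.append(line).append('\n');
            }
        }
        // Store the last record, which has no following "Recno:: " line.
        if (url != null && body != null) {
            pages.put(url, body.toString());
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        // With a file argument, parse that dump; otherwise use a tiny sample.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)
                : Arrays.asList(
                        "Recno:: 0",
                        "URL:: http://example.com/",
                        "",
                        "Content::",
                        "Content:",
                        "<html>hello</html>");
        for (Map.Entry<String, String> page : parse(lines).entrySet()) {
            System.out.println(page.getKey() + " -> "
                    + page.getValue().length() + " chars");
        }
    }
}
```

A page whose own text contains a line starting with "Recno:: " would be split incorrectly by this sketch, so it is only suitable as a starting point.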



Jorge Luis Betancourt Gonzalez wrote
> I suppose you can write a custom indexer to store the data in MongoDB
> instead of Solr. I think there is an open repo on GitHub for this.
> 
> ----- Original Message -----
> From: "peterbarretto" <peterbarretto08@>
> To: user@.apache
> Sent: Tuesday, 29 January 2013 8:46:04
> Subject: Re: How to get page content of crawled pages
> 
> Hi
> 
> Is there a way I can dump the URL and its content into MongoDB?
> 
> 
> Klemens Muthmann wrote
>> Hi,
>>
>> Super. That works. Thank you. I thereby also found the class that shows
>> how to achieve this within Java code, which is
>> org.apache.nutch.segment.SegmentReader.
>>
>> Thanks again and bye
>>      Klemens
>>
>> Am 22.11.2010 10:49, schrieb Hannes Carl Meyer:
>>> Hi Klemens,
>>>
>>> you should run ./bin/nutch readseg!
>>>
>>> For example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder
>>> -nofetch -nogenerate -noparse -noparsedata -noparsetext
>>>
>>> Kind Regards from Hannover
>>>
>>> Hannes
>>>
>>> On Mon, Nov 22, 2010 at 9:23 AM, Klemens Muthmann <klemens.muthmann@> wrote:
>>>
>>>> Hi,
>>>>
>>>> I did a small crawl of some pages on the web and now want to get the raw
>>>> HTML
>>>> content of these pages now. Reading the documentation in the wiki I
>>>> guess
>>>> this content might be somewhere under
>>>> crawl/segments/20101122071139/content/part-00000.
>>>>
>>>> I also guess I can access this content using the Hadoop API like
>>>> described
>>>> here: http://wiki.apache.org/nutch/Getting_Started
>>>>
>>>> However I have absolutely no idea how to configure:
>>>>
>>>> MapFile.Reader reader = new MapFile.Reader (fs, seqFile, conf);
>>>>
>>>>
>>>> The Hadoop documentation is not very helpful either. May someone please
>>>> point me in the right direction to get the page content?
>>>>
>>>> Thank you and regards
>>>>     Klemens Muthmann
>>>>
>>
>>
>> --
>> --------------------------------
>> Dipl.-Medieninf., Klemens Muthmann
>> Wissenschaftlicher Mitarbeiter
>>
>> Technische Universität Dresden
>> Fakultät Informatik
>> Institut für Systemarchitektur
>> Lehrstuhl Rechnernetze
>> 01062 Dresden
>> Tel.: +49 (351) 463-38214
>> Fax: +49 (351) 463-38251
>> E-Mail: klemens.muthmann@
>> --------------------------------
> 
> 
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037023.html
> Sent from the Nutch - User mailing list archive at Nabble.com.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037283.html
Sent from the Nutch - User mailing list archive at Nabble.com.
