Hi,

Can anyone help me crawl a JSON website with Nutch 2.3 and an HBase database?

On Tue, Jan 31, 2017 at 8:01 PM, Eyeris Rodriguez Rueda <[email protected]>
wrote:

> Thanks, Markus, for the help.
> I have read the description of this property (below), and it says that
> the crawl datum stores that value, so I thought it was necessary to take
> responseTime from it.
> I will try using only the _rs_ key.
>
> <property>
>   <name>http.store.responsetime</name>
>   <value>true</value>
>   <description>Enables us to record the response time of the
>   host which is the time period between start connection to end
>   connection of a pages host. The response time in milliseconds
>   is stored in CrawlDb in CrawlDatum's meta data under key &quot;_rs_&quot;
>   </description>
> </property>
>
> ----- Original message -----
> From: "Markus Jelsma" <[email protected]>
> To: [email protected]
> Sent: Tuesday, 31 January 2017 9:55:10
> Subject: RE: [MASSMAIL]how to index response time for a url ?
>
> I am not sure what is going on, but those HTML entities &quot; certainly
> do not belong there. _rs_ is good enough. Then you also need
> index-metadata, and have the indexer add _rs_ to your index.
>
> <property>
>   <name>db.parsemeta.to.crawldb</name>
>   <value>&quot;_rs_&quot;</value>
>   <description>Comma-separated list of parse metadata keys to transfer to
>   the crawldb (NUTCH-779).
>   Assuming for instance that the languageidentifier plugin is enabled,
>   setting the value to 'lang'
>   will copy both the key 'lang' and its value to the corresponding entry
>   in the crawldb.
>   </description>
> </property>
>
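> A minimal sketch of the indexing side, assuming Nutch 1.x and the
> index-metadata plugin, whose index.db.md property lists the CrawlDb
> metadata keys to copy into the index. The plugin.includes value here is
> only illustrative: keep your own list and add index-metadata to it.
>
> <property>
>   <name>plugin.includes</name>
>   <!-- illustrative list; the relevant part is "metadata" in index-(...) -->
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
> <property>
>   <name>index.db.md</name>
>   <!-- plain key, no quotes and no HTML entities -->
>   <value>_rs_</value>
> </property>
>
> As far as I know the value is then indexed under the key name, _rs_, so
> the Solr schema would need a field with that name or a mapping from it to
> responseTime.
>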
>
>
> -----Original message-----
> > From:Eyeris Rodriguez Rueda <[email protected]>
> > Sent: Tuesday 31st January 2017 14:32
> > To: [email protected]
> > Subject: Re: [MASSMAIL]how to index response time for a url ?
> >
> > Can anybody please help me?
> > Is this happening only to me?
> >
> > ----- Original message -----
> > From: "Eyeris Rodriguez Rueda" <[email protected]>
> > To: [email protected]
> > Sent: Sunday, 29 January 2017 22:28:01
> > Subject: [MASSMAIL]how to index response time for a url ?
> >
> > Hi all.
> > I need to get and index the response time for each URL that Nutch crawls.
> > I have added a responseTime field in Solr for this value.
> >
> > Is there any way to do this with configuration only, or do I need to write
> > my own plugin to extract the &quot;_rs_&quot; key from the crawl datum?
> > Some help with the steps would be appreciated.
> >
> >
> > I have set the http.store.responsetime property to true. What am I
> > missing?
> >
> >
> >
> > This is my nutch-site.xml property
> >
> > <property>
> >   <name>http.store.responsetime</name>
> >   <value>true</value>
> >   <description>Enables us to record the response time of the
> >   host which is the time period between start connection to end
> >   connection of a pages host. The response time in milliseconds
> >   is stored in CrawlDb in CrawlDatum's meta data under key &quot;_rs_&quot;
> >   </description>
> > </property>
> >
> > After that I added the key, but when I run parsechecker I don't see any
> > data related to responseTime in the output.
> >
> > <property>
> >   <name>db.parsemeta.to.crawldb</name>
> >   <value>&quot;_rs_&quot;</value>
> >   <description>Comma-separated list of parse metadata keys to transfer
> >   to the crawldb (NUTCH-779).
> >   Assuming for instance that the languageidentifier plugin is enabled,
> >   setting the value to 'lang'
> >   will copy both the key 'lang' and its value to the corresponding
> >   entry in the crawldb.
> >   </description>
> > </property>
>
>


-- 
Thanks & Regards
Surendra Babu Katta
8886747555
