RE: Fetch iframe from HTML (if exists)

Markus Jelsma Thu, 27 Jun 2013 12:03:36 -0700

Hi Amit,

It seems Julien found an issue. Could you check
https://issues.apache.org/jira/browse/NUTCH-1592 so we can continue the
discussion there?


Thanks!

 
 
-----Original message-----
> From:Amit Sela <[email protected]>
> Sent: Wednesday 26th June 2013 23:44
> To: [email protected]
> Subject: Re: Fetch iframe from HTML (if exists)
> 
> Well, just for sport, I tried removing parse-tika, but still nothing...
> 
> 
> On Wed, Jun 26, 2013 at 11:25 PM, Julien Nioche <[email protected]> wrote:
> 
> > I noticed recently that my XPath extraction rules did not work on HTML
> > > documents with parse-tika but worked a treat with parse-html. Forgot to
> > > open an issue, my bad. Could be the same problem here.
> >
> >
> > On 26 June 2013 15:26, Amit Sela <[email protected]> wrote:
> >
> > > > I did succeed in parsing using content and iterating over every line, but
> > > > I'd prefer to do it with DocumentFragment.
> > > > My plugin.includes has:
> > > >
> > > >
> > > > protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass)|iframemeta
> > > > So I use parse-html but also tika, text, metatags and js. Maybe it's too
> > > > much? I copied this configuration from an example I saw. I do know that I use
> > > > metatags (I index keywords and description), but I'm not sure about the
> > > > rest...
> > >
> > >
> > > > On Wed, Jun 26, 2013 at 5:21 PM, Markus Jelsma
> > > > <[email protected]> wrote:
> > >
> > > > > Of course, forget it. What parser do you use? Maybe the old parse-html
> > > > > doesn't report it back. You can also try to print every element you loop
> > > > > over and check if it's there or not.
> > > >
> > > >
> > > >
> > > > -----Original message-----
> > > > > From:Amit Sela <[email protected]>
> > > > > Sent: Wednesday 26th June 2013 16:11
> > > > > To: [email protected]
> > > > > Subject: Re: Fetch iframe from HTML (if exists)
> > > > >
> > > > > > How will it affect things? I crawl with no depth (depth 1), so outlinks
> > > > > > don't matter, and it seems that the URLs fetched don't get parsed, or am I
> > > > > > misunderstanding something?
> > > > >
> > > > >
> > > > > > On Wed, Jun 26, 2013 at 5:06 PM, Markus Jelsma
> > > > > > <[email protected]> wrote:
> > > > >
> > > > > > > No, order does not matter. Try adding iframe to the ignore_tags
> > > > > > > configuration directive in your nutch-site.xml:
> > > > > > > parser.html.outlinks.ignore_tags
> > > > > >
> > > > > >
> > > > > >
> > > > > > -----Original message-----
> > > > > > > From:Amit Sela <[email protected]>
> > > > > > > Sent: Wednesday 26th June 2013 16:03
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: Fetch iframe from HTML (if exists)
> > > > > > >
> > > > > > > > In nutch-site.xml plugin.includes my custom filter is last and I have
> > > > > > > > no htmlparsefilter.order, so my filter should be applied last, right?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Wed, Jun 26, 2013 at 5:00 PM, Amit Sela <[email protected]> wrote:
> > > > > > >
> > > > > > > > > So I managed to create and deploy my plugin, which initially used
> > > > > > > > > content.getContent(), and it worked.
> > > > > > > > > Then I wanted to parse the fetched content as a DocumentFragment (by
> > > > > > > > > iterating over the child nodes).
> > > > > > > > > This doesn't work. I logged DocumentFragment.toString() in the filter
> > > > > > > > > method of MyCustomHtmlParseFilter, and in the Parse MapReduce logs I
> > > > > > > > > see: [#document-fragment: null] for all URLs.
> > > > > > > > >
> > > > > > > > > How do I get Nutch to pass the parsed HTML as a DocumentFragment?
> > > > > > > > > Should I state htmlparsefilter.order in nutch-site.xml? If so, in what
> > > > > > > > > order?
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela <[email protected]> wrote:
> > > > > > > >
> > > > > > > >> Thanks for the prompt answer!
> > > > > > > >>
> > > > > > > >>
> > > > > > > > >> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma <[email protected]> wrote:
> > > > > > > >>
> > > > > > > >>> Hi,
> > > > > > > >>>
> > > > > > > > >>> Do I understand you correctly that you want all iframe src attributes on a
> > > > > > > > >>> given page stored in the iframe field?
> > > > > > > > >>>
> > > > > > > > >>> The src attributes are not extracted and there is no facility to do so
> > > > > > > > >>> right now. You should create your own HTMLParseFilter, loop through the
> > > > > > > > >>> document looking for iframe tags and collect the src attribute. Then add
> > > > > > > > >>> those as parse metadata. You can then index them with the index-metadata
> > > > > > > > >>> plugin. I'm not sure it supports multi-valued metafields in Nutch 1.6; it
> > > > > > > > >>> sure will in 1.7.
> > > > > > > > >>>
> > > > > > > > >>> Use the bin/nutch parsechecker and indexchecker tools to check if your
> > > > > > > > >>> plugin works.
> > > > > > > >>>
> > > > > > > >>> Cheers
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> -----Original message-----
> > > > > > > >>> > From:Amit Sela <[email protected]>
> > > > > > > >>> > Sent: Tuesday 25th June 2013 16:26
> > > > > > > >>> > To: [email protected]
> > > > > > > >>> > Subject: Fetch iframe from HTML (if exists)
> > > > > > > >>> >
> > > > > > > >>> > Hi all,
> > > > > > > >>> >
> > > > > > > > >>> > I'm using Nutch 1.6 with Solr 3.6.2 and I would like to index the iframe
> > > > > > > > >>> > src field into Solr.
> > > > > > > > >>> > i.e.,
> > > > > > > > >>> > <iframe src="something" scrolling="" frameborder="".......>
> > > > > > > > >>> > So I want to fetch the iframe and index it as iframe so that I could find
> > > > > > > > >>> > URLs by iframe src.
> > > > > > > > >>> >
> > > > > > > > >>> > I'm crawling with no depth over a seed list, and I don't want to crawl to
> > > > > > > > >>> > the iframe src, just to index and store it.
> > > > > > > > >>> >
> > > > > > > > >>> > I tried adding
> > > > > > > > >>> > <name>urlmeta.tags</name> <value>iframe</value> to nutch-site.xml
> > > > > > > > >>> >
> > > > > > > > >>> > and
> > > > > > > > >>> > <field name="iframe" type="text_general" stored="true" indexed="true"
> > > > > > > > >>> > multiValued="true"/> to schema.xml
> > > > > > > > >>> >
> > > > > > > > >>> > and
> > > > > > > > >>> > <field dest="iframe" source="iframe"/> to solrindex-mapping.xml.
> > > > > > > > >>> >
> > > > > > > > >>> > What am I missing?
> > > > > > > >>> >
> > > > > > > >>> > Thanks,
> > > > > > > >>> >
> > > > > > > >>> > Amit.
> > > > > > > >>> >
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>
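
For reference, the approach suggested in the quoted thread (loop through the
parsed document, find iframe tags, collect their src attributes) can be
sketched with the JDK's DOM classes alone. This is a minimal sketch, not the
plugin discussed above: the class name IframeSrcCollector and the helper
demoFragment() are made up for illustration, the example URLs are
placeholders, and the Nutch wiring (the filter method that receives the
DocumentFragment, and adding the values to parse metadata) is left out so the
code runs without Nutch on the classpath. It also assumes the parser hands
over element names that match "iframe" case-insensitively.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper: collects iframe src attributes from a DOM subtree. */
public class IframeSrcCollector {

    /** Returns the non-empty src values of all iframe elements under root, in document order. */
    public static List<String> collect(Node root) {
        List<String> found = new ArrayList<>();
        walk(root, found);
        return found;
    }

    private static void walk(Node node, List<String> found) {
        // Element names may be upper- or lower-case depending on the parser.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            String src = ((Element) node).getAttribute("src"); // "" when absent
            if (!src.isEmpty()) {
                found.add(src);
            }
        }
        // Depth-first traversal over all children.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, found);
        }
    }

    /** Builds a small fragment by hand so the sketch can be run without a crawl. */
    public static DocumentFragment demoFragment() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        DocumentFragment fragment = doc.createDocumentFragment();
        Element body = doc.createElement("body");
        Element first = doc.createElement("iframe");
        first.setAttribute("src", "http://example.com/a");
        Element div = doc.createElement("div");
        Element second = doc.createElement("IFRAME");
        second.setAttribute("src", "http://example.com/b");
        div.appendChild(second);
        body.appendChild(first);
        body.appendChild(div);
        fragment.appendChild(body);
        return fragment;
    }

    public static void main(String[] args) {
        // prints [http://example.com/a, http://example.com/b]
        System.out.println(collect(demoFragment()));
    }
}
```

A side note on the logged "[#document-fragment: null]": a fragment's node
value is always null per the DOM specification, and as far as I know the
Xerces-style toString() prints just the node name and node value, so that log
line may not by itself mean the fragment is empty. Counting the child nodes,
or printing each element while walking as suggested in the thread, is
probably a more reliable check.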
