Nope. Using Nutch 1.6. I ended up using jsoup:

    org.jsoup.nodes.Document document = Jsoup.parse(content.
    List<org.jsoup.nodes.Node> childNodes = document.childNodes();

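If the DocumentFragment does come through populated, the loop Markus describes (walk the document, look for iframe tags, collect the src attribute) could be sketched roughly as below. This is only an illustration using the JDK's org.w3c.dom API: the class IframeSrcCollector and its methods are made-up names, the sample fragment is built by hand, and the Nutch HtmlParseFilter wiring is left out entirely. Tag names are compared case-insensitively on the assumption that HTML parsers differ in how they report them.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Nutch.
class IframeSrcCollector {

    // Recursively walks the tree below 'node' and appends the src
    // attribute of every iframe element that has a non-empty one.
    static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            String src = ((Element) node).getAttribute("src");
            if (!src.isEmpty()) {
                out.add(src);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    // Entry point: returns all iframe src values found in the fragment.
    static List<String> collectIframeSrcs(DocumentFragment fragment) {
        List<String> srcs = new ArrayList<>();
        collect(fragment, srcs);
        return srcs;
    }

    // Builds a tiny hand-made fragment (body > IFRAME, p) to try the walker on.
    static DocumentFragment sampleFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            DocumentFragment fragment = doc.createDocumentFragment();
            Element body = doc.createElement("body");
            Element iframe = doc.createElement("IFRAME");
            iframe.setAttribute("src", "http://example.com/frame.html");
            body.appendChild(iframe);
            body.appendChild(doc.createElement("p"));
            fragment.appendChild(body);
            return fragment;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collectIframeSrcs(sampleFragment()));
    }
}
```

In a real filter the fragment would be the one handed to the filter method rather than sampleFragment(), and the collected values would then be added to whatever field is being indexed.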
On Wed, Jun 26, 2013 at 7:19 PM, Lewis John Mcgibbney <[email protected]> wrote:

> It looks like you're on a pre 1.3 version of Nutch here.
> It is highly recommended to upgrade.
> Thanks
> Lewis
>
> On Wednesday, June 26, 2013, Amit Sela <[email protected]> wrote:
> > I did succeed in parsing using content and iterating over every line but
> > I'd prefer to do it with DocumentFragment.
> > My plugin.includes has:
> >
> > protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass)|iframemeta
> >
> > So I use parse-html but also tika, text, metatags and js. Maybe it's too
> > much? I copied this configuration from an example I saw. I do know that I
> > use metatags (I index keywords and description) but I'm not sure about the
> > rest...
> >
> > On Wed, Jun 26, 2013 at 5:21 PM, Markus Jelsma
> > <[email protected]> wrote:
> >
> >> Of course, forget it. What parser do you use? Maybe the old parse-html
> >> doesn't report it back. You can also try to print every element you loop
> >> over and check if it's there or not.
> >>
> >> -----Original message-----
> >> > From: Amit Sela <[email protected]>
> >> > Sent: Wednesday 26th June 2013 16:11
> >> > To: [email protected]
> >> > Subject: Re: Fetch iframe from HTML (if exists)
> >> >
> >> > How will it affect things? I crawl with no depth (depth 1) so outlinks
> >> > don't matter and it seems that the URLs fetched don't get parsed, or am
> >> > I misunderstanding something?
> >> >
> >> > On Wed, Jun 26, 2013 at 5:06 PM, Markus Jelsma
> >> > <[email protected]> wrote:
> >> >
> >> > > No, order does not matter. Try adding iframe to the ignore_tags
> >> > > configuration directive in your nutch-site.
> >> > > parser.html.outlinks.ignore_tags
> >> > >
> >> > > -----Original message-----
> >> > > > From: Amit Sela <[email protected]>
> >> > > > Sent: Wednesday 26th June 2013 16:03
> >> > > > To: [email protected]
> >> > > > Subject: Re: Fetch iframe from HTML (if exists)
> >> > > >
> >> > > > In nutch-site.xml plugin.includes my custom filter is last and I
> >> > > > have no htmlparsefilter.order so my filter should be applied last,
> >> > > > right?
> >> > > >
> >> > > > On Wed, Jun 26, 2013 at 5:00 PM, Amit Sela <[email protected]> wrote:
> >> > > >
> >> > > > > So I managed to create and deploy my plugin, which initially used
> >> > > > > content.getContent() and it worked.
> >> > > > > Then, I wanted to parse the fetched content as DocumentFragment
> >> > > > > (by iterating over the child nodes).
> >> > > > > This doesn't work. I logged DocumentFragment.toString() in my
> >> > > > > MyCustomHtmlParseFilter in the filter method, and in the Parse
> >> > > > > MapReduce logs I see: [#document-fragment: null] for all URLs.
> >> > > > >
> >> > > > > How do I get Nutch to pass the parsed HTML as DocumentFragment?
> >> > > > > Should I state htmlparsefilter.order in nutch-site.xml? If so, in
> >> > > > > what order?
> >> > > > >
> >> > > > > Thanks.
> >> > > > >
> >> > > > > On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela <[email protected]> wrote:
> >> > > > >
> >> > > > >> Thanks for the prompt answer!
> >> > > > >>
> >> > > > >> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma
> >> > > > >> <[email protected]> wrote:
> >> > > > >>
> >> > > > >>> Hi,
> >> > > > >>>
> >> > > > >>> Do I understand you correctly if you want all iframe src
> >> > > > >>> attributes on a given page stored in the iframe field?
> >> > > > >>>
> >> > > > >>> The src attributes are not extracted and there is no facility to
> >> > > > >>> do so right now. You should create your own HTMLParseFilter, loop
> >> > > > >>> through the document looking for iframe tags and collect the src
> >> > > > >>> attribute. Then add those as
>
> --
> *Lewis*
>

