I take that back; the only parsing that is done is in the context of determining login pages as part of login sequences. So content is not parsed at all; it's sent to the output connector intact, along with HTTP headers as metadata. You can, of course, write a transformation connector that would pull out the title; the Tika transformation connector may in fact do that for you already, but I don't know for sure.
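To give a rough idea of what the title-extraction step would have to do, here is a minimal sketch in Python using only the standard library. It is not ManifoldCF code and does not use the transformation connector API; the names `TitleExtractor`, `extract_title`, and `title_from_base64` are hypothetical, and the Base64 helper assumes the `_content` value is simply the UTF-8 HTML of the page, Base64-encoded.

```python
import base64
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace; return None when no title text was found.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html):
    """Return the page title from an HTML string, or None if there is none."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title


def title_from_base64(payload):
    """Assumption: payload is Base64-encoded UTF-8 HTML, as seen in _content."""
    html = base64.b64decode(payload).decode("utf-8", errors="replace")
    return extract_title(html)
```

A real transformation connector would do the equivalent in Java and attach the result to the document as a metadata field before it reaches the output connector.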
Thanks,
Karl

On Tue, Dec 1, 2015 at 1:32 PM, Karl Wright <[email protected]> wrote:

> Hi Stephen,
>
> The ManifoldCF web connector captures all html content in the body part of
> an html page, but it does not attempt to separate title content into
> specific title metadata at this time. This is, however, not particularly
> hard to do, if I recall correctly, but I'd have to look into it in more
> detail before I could be certain.
>
> Thanks,
> Karl
>
>
> On Tue, Dec 1, 2015 at 1:09 PM, Corey, Stephen <[email protected]> wrote:
>
>> Thanks Karl!
>>
>> After creating a new mapping in ES, specifying the ‘file’ field as an
>> attachment, I can now search the full text of the web content. That part is
>> working great now.
>>
>> Does MCF capture the page title (in the <title> tag) anywhere?
>>
>>
>> *From:* Karl Wright [mailto:[email protected]]
>> *Sent:* Tuesday, December 1, 2015 11:00 AM
>> *To:* [email protected]
>> *Subject:* Re: ManifoldCF and ElasticSearch
>>
>> Hi Stephen,
>>
>> The integration with ES is supposed to go through the mapper-attachment
>> plugin, which at one point did accept Base64-encoded "attachments" and
>> index them. This is what's currently implemented in the ElasticSearch
>> output connector.
>>
>> Unfortunately, however, with ElasticSearch, the level of backwards
>> compatibility isn't always what we'd like, so I wouldn't be surprised if
>> something changed or if you needed special configuration now to do it that
>> way. I've been unable to keep up with what ES is doing but I'm happy to
>> make changes to the output connector if you have information that the
>> current implementation is incorrect, and have details about how to make it
>> work properly in a standard, modern ES environment. But I'd start by
>> making sure there's actually something broken by looking at the
>> mapper-attachment plugin.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Dec 1, 2015 at 10:17 AM, Corey, Stephen <[email protected]> wrote:
>>
>> I’m putting together a proof-of-concept for crawling our website content
>> with MCF, and indexing it with ES. At a basic level, everything seems to be
>> working. What I’m trying to understand is that when MCF indexes web
>> content, the HTML is stored inside an object called file in a property
>> called _content. When this is added to the ES index, all the HTML is Base64
>> encoded. I believe this is preventing ES from properly searching the field.
>>
>> Is this Base64 encoding to be expected, or do I need to change something?
>>
>> Does anyone have a walkthrough of using MCF to crawl web content, and
>> output to ES? I’ve seen many, many guides for both systems, but never
>> something that combines the two. I’d prefer to avoid using Nutch for
>> crawling, since it lacks any UI for management.
>>
>>
>> Stephen Corey
>>
>> Technology Consultant
>> East Carolina University
>>
>> [email protected]
