What about https://issues.apache.org/jira/browse/CONNECTORS-1234 to avoid the Base64 encoding? If you want to capture the HTML title, you could get it from the Tika transformation connector, since Tika extracts metadata such as the title.
Shinichiro Abe

2015-12-02 3:36 GMT+09:00 Karl Wright <[email protected]>:
> I take that back; the only parsing that is done is in the context of
> determining login pages as part of login sequences. So content is not
> parsed at all; it's sent to the output connector intact, along with HTML
> headers as metadata. You can, of course, write a transformation connector
> that would pull out the title; the Tika transformation connector may in fact
> do that for you already, but I don't know for sure.
>
> Thanks,
> Karl
>
>
> On Tue, Dec 1, 2015 at 1:32 PM, Karl Wright <[email protected]> wrote:
>>
>> Hi Stephen,
>>
>> The ManifoldCF web connector captures all html content in the body part of
>> an html page, but it does not attempt to separate title content into
>> specific title metadata at this time. This is, however, not particularly
>> hard to do, if I recall correctly, but I'd have to look into it in more
>> detail before I could be certain.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Dec 1, 2015 at 1:09 PM, Corey, Stephen <[email protected]> wrote:
>>>
>>> Thanks Karl!
>>>
>>> After creating a new mapping in ES, specifying the ‘file’ field as an
>>> attachment, I can now search the full text of the web content. That part is
>>> working great now.
>>>
>>> Does MCF capture the page title (in the <title> tag) anywhere?
>>>
>>>
>>> From: Karl Wright [mailto:[email protected]]
>>> Sent: Tuesday, December 1, 2015 11:00 AM
>>> To: [email protected]
>>> Subject: Re: ManifoldCF and ElasticSearch
>>>
>>> Hi Stephen,
>>>
>>> The integration with ES is supposed to go through the mapper-attachment
>>> plugin, which at one point did accept Base64-encoded "attachments" and index
>>> them. This is what's currently implemented in the ElasticSearch output
>>> connector.
>>>
>>> Unfortunately, however, with ElasticSearch, the level of backwards
>>> compatibility isn't always what we'd like, so I wouldn't be surprised if
>>> something changed or if you needed special configuration now to do it that
>>> way. I've been unable to keep up with what ES is doing but I'm happy to
>>> make changes to the output connector if you have information that the
>>> current implementation is incorrect, and have details about how to make it
>>> work properly in a standard, modern ES environment. But I'd start by
>>> making sure there's actually something broken by looking at the
>>> mapper-attachment plugin.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Tue, Dec 1, 2015 at 10:17 AM, Corey, Stephen <[email protected]> wrote:
>>>
>>> I’m putting together a proof-of-concept for crawling our website content
>>> with MCF, and indexing it with ES. At a basic level, everything seems to be
>>> working. What I’m trying to understand is that when MCF indexes web content,
>>> the HTML is stored inside an object called file in a property called
>>> _content. When this is added to the ES index, all the HTML is Base64
>>> encoded. I believe this is preventing ES from properly searching the field.
>>>
>>> Is this Base64 encoding to be expected, or do I need to change something?
>>>
>>> Does anyone have a walkthrough of using MCF to crawl web content, and
>>> output to ES? I’ve seen many many guides for both systems, but never
>>> something that combines the two. I’d prefer to avoid using Nutch for
>>> crawling, since it lacks any UI for management.
>>>
>>>
>>> Stephen Corey
>>>
>>> Technology Consultant
>>> East Carolina University
>>>
>>> [email protected]
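
For anyone reading this thread later, here is a minimal Python sketch of the two ideas discussed above: the document shape Stephen describes (HTML Base64-encoded under `file._content`, with `file` mapped as an `attachment`) and pulling the `<title>` text out of the page. It is only an illustration under stated assumptions, not what the ManifoldCF ElasticSearch output connector or the Tika transformation connector actually does. The helper names (`extract_title`, `build_document`), the top-level `title` field, and the exact mapping body are hypothetical; check the mapper-attachment documentation for your ES version before relying on the mapping.

```python
import base64
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Hypothetical helper: return the page title, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts).strip()
    return title or None


def build_document(html):
    """Hypothetical helper: build a document shaped like the one in the thread.

    The raw HTML is Base64-encoded under file._content; the separate
    top-level "title" field is an assumption made for this sketch.
    """
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return {"title": extract_title(html), "file": {"_content": encoded}}


# Assumed mapping body that declares "file" as an attachment field,
# as Stephen reports doing; the surrounding index/type names are omitted.
ATTACHMENT_MAPPING = {"properties": {"file": {"type": "attachment"}}}
```

Calling `build_document(page_html)` on a crawled page gives a dictionary that could be serialized to JSON for indexing. Decoding `file._content` with `base64.b64decode` returns the original HTML, which shows that the encoding itself loses nothing; full-text search depends on the `file` field being mapped as an attachment.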
