Base64 encoding is fine when used with the mapper-attachment plugin. And yes, I recommend the Tika transformer.

Thanks,
Karl
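For readers following the thread, here is a minimal Python sketch of the document shape discussed here: the HTML stored Base64-encoded under `file._content`, plus a mapping that declares `file` as an `attachment` field for the mapper-attachment plugin. The helper names and the `webpage` type name are assumptions made for illustration; this is not the ElasticSearch output connector's actual code.

```python
import base64
import json


def build_attachment_doc(html: str) -> dict:
    """Wrap HTML the way the thread describes: Base64 text in file._content."""
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return {"file": {"_content": encoded}}


def build_attachment_mapping(doc_type: str) -> dict:
    """Mapping that marks 'file' as an attachment field.

    doc_type is a hypothetical type name chosen for this example.
    """
    return {doc_type: {"properties": {"file": {"type": "attachment"}}}}


def decode_content(doc: dict) -> str:
    """Recover the original HTML from a document built above."""
    return base64.b64decode(doc["file"]["_content"]).decode("utf-8")


if __name__ == "__main__":
    page = "<html><head><title>Example</title></head><body>Hello</body></html>"
    print(json.dumps(build_attachment_mapping("webpage"), indent=2))
    print(json.dumps(build_attachment_doc(page), indent=2))
```

Without the attachment mapping, ES would treat the Base64 text as an ordinary string, which matches the search problem reported at the start of the thread.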
On Tue, Dec 1, 2015 at 6:42 PM, Shinichiro Abe <[email protected]> wrote:
> What about https://issues.apache.org/jira/browse/CONNECTORS-1234 to
> avoid Base64 encoding?
> If you want to capture the title of the HTML, you could get it from the
> Tika transformation connector, since Tika will extract metadata such as
> a title.
>
> Shinichiro Abe
>
> 2015-12-02 3:36 GMT+09:00 Karl Wright <[email protected]>:
> > I take that back; the only parsing that is done is in the context of
> > determining login pages as part of login sequences. So content is not
> > parsed at all; it's sent to the output connector intact, along with HTML
> > headers as metadata. You can, of course, write a transformation connector
> > that would pull out the title; the Tika transformation connector may in
> > fact do that for you already, but I don't know for sure.
> >
> > Thanks,
> > Karl
> >
> > On Tue, Dec 1, 2015 at 1:32 PM, Karl Wright <[email protected]> wrote:
> >>
> >> Hi Stephen,
> >>
> >> The ManifoldCF web connector captures all html content in the body part
> >> of an html page, but it does not attempt to separate title content into
> >> specific title metadata at this time. This is, however, not particularly
> >> hard to do, if I recall correctly, but I'd have to look into it in more
> >> detail before I could be certain.
> >>
> >> Thanks,
> >> Karl
> >>
> >> On Tue, Dec 1, 2015 at 1:09 PM, Corey, Stephen <[email protected]> wrote:
> >>>
> >>> Thanks Karl!
> >>>
> >>> After creating a new mapping in ES, specifying the ‘file’ field as an
> >>> attachment, I can now search the full text of the web content. That
> >>> part is working great now.
> >>>
> >>> Does MCF capture the page title (in the <title> tag) anywhere?
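As an illustration of the title question quoted above, this is a small Python sketch of what "pulling out the title" from raw HTML amounts to, using only the standard library. The class and function names are made up for this example. It is not ManifoldCF or Tika code; a real transformation connector would be written in Java against ManifoldCF's own interfaces.

```python
from html.parser import HTMLParser
from typing import Optional


class TitleExtractor(HTMLParser):
    """Collects the text of the first <title> element seen in the input."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        # Collapse runs of whitespace; return None when no title text was found.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html: str) -> Optional[str]:
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title
```

The extracted value could then be sent to the output connector as a separate metadata field alongside the page content.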
> >>>
> >>> From: Karl Wright [mailto:[email protected]]
> >>> Sent: Tuesday, December 1, 2015 11:00 AM
> >>> To: [email protected]
> >>> Subject: Re: ManifoldCF and ElasticSearch
> >>>
> >>> Hi Stephen,
> >>>
> >>> The integration with ES is supposed to go through the mapper-attachment
> >>> plugin, which at one point did accept Base64-encoded "attachments" and
> >>> index them. This is what's currently implemented in the ElasticSearch
> >>> output connector.
> >>>
> >>> Unfortunately, however, with ElasticSearch, the level of backwards
> >>> compatibility isn't always what we'd like, so I wouldn't be surprised if
> >>> something changed or if you needed special configuration now to do it
> >>> that way. I've been unable to keep up with what ES is doing but I'm
> >>> happy to make changes to the output connector if you have information
> >>> that the current implementation is incorrect, and have details about how
> >>> to make it work properly in a standard, modern ES environment. But I'd
> >>> start by making sure there's actually something broken by looking at the
> >>> mapper-attachment plugin.
> >>>
> >>> Thanks,
> >>> Karl
> >>>
> >>> On Tue, Dec 1, 2015 at 10:17 AM, Corey, Stephen <[email protected]> wrote:
> >>>
> >>> I’m putting together a proof-of-concept for crawling our website content
> >>> with MCF, and indexing it with ES. At a basic level, everything seems to
> >>> be working. What I’m trying to understand is that when MCF indexes web
> >>> content, the HTML is stored inside an object called file in a property
> >>> called _content. When this is added to the ES index, all the HTML is
> >>> Base64 encoded. I believe this is preventing ES from properly searching
> >>> the field.
> >>>
> >>> Is this Base64 encoding to be expected, or do I need to change something?
> >>>
> >>> Does anyone have a walkthrough of using MCF to crawl web content, and
> >>> output to ES? I’ve seen many, many guides for both systems, but never
> >>> something that combines the two. I’d prefer to avoid using Nutch for
> >>> crawling, since it lacks any UI for management.
> >>>
> >>> Stephen Corey
> >>>
> >>> Technology Consultant
> >>> East Carolina University
> >>>
> >>> [email protected]
