Thanks Karl!

After creating a new mapping in ES, specifying the ‘file’ field as an 
attachment, I can now search the full text of the web content. That part is 
working great now.
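
For anyone following along, here is a minimal sketch of what such a mapping body might look like. It assumes an ES release from this period with the mapper-attachments plugin installed. The index name "web" and type name "document" are made up for illustration; only the 'file' field and the attachment type come from this thread.

```python
import json

# Hypothetical names: the thread does not state the index or type name.
INDEX_NAME = "web"
TYPE_NAME = "document"


def build_attachment_mapping(type_name, field_name="file"):
    """Return an index-creation body that declares field_name as an attachment."""
    return {
        "mappings": {
            type_name: {
                "properties": {
                    # "attachment" is the field type provided by the
                    # mapper-attachments plugin (assumed to be installed).
                    field_name: {"type": "attachment"}
                }
            }
        }
    }


mapping = build_attachment_mapping(TYPE_NAME)

# The JSON below would be sent as the request body when creating the index.
print("PUT /" + INDEX_NAME)
print(json.dumps(mapping, indent=2))
```

The mapping has to be in place before MCF sends any documents, since ES would otherwise infer its own type for the field.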

Does MCF capture the page title (in the <title> tag) anywhere?



From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, December 1, 2015 11:00 AM
To: [email protected]
Subject: Re: ManifoldCF and ElasticSearch

Hi Stephen,

The integration with ES is supposed to go through the mapper-attachment plugin, 
which at one point did accept Base64-encoded "attachments" and index them.  
This is what's currently implemented in the ElasticSearch output connector.

Unfortunately, however, with ElasticSearch, the level of backwards 
compatibility isn't always what we'd like, so I wouldn't be surprised if 
something changed or if you needed special configuration now to do it that way. 
 I've been unable to keep up with what ES is doing but I'm happy to make 
changes to the output connector if you have information that the current 
implementation is incorrect, and have details about how to make it work 
properly in a standard, modern ES environment.  But I'd start by making sure 
there's actually something broken by looking at the mapper-attachment plugin.

Thanks,
Karl


On Tue, Dec 1, 2015 at 10:17 AM, Corey, Stephen 
<[email protected]> wrote:
I’m putting together a proof-of-concept for crawling our website content with 
MCF, and indexing it with ES. At a basic level, everything seems to be working. 
What I’m trying to understand is that when MCF indexes web content, the HTML is 
stored inside an object called file in a property called _content. When this is 
added to the ES index, all the HTML is Base64 encoded. I believe this is 
preventing ES from properly searching the field.
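
To illustrate the shape described above, here is a rough sketch of a document with the HTML Base64-encoded under file._content. The sample HTML and the helper names are invented for the example; only the 'file' object and the '_content' property come from this message. It shows why the raw field value cannot be matched as text until something decodes it.

```python
import base64

# Invented sample page, standing in for crawled web content.
html = "<html><head><title>Example page</title></head><body>Hello</body></html>"


def encode_content(text):
    """Base64-encode text the way an attachment payload would be carried."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_content(encoded):
    """Recover the original text from its Base64 form."""
    return base64.b64decode(encoded).decode("utf-8")


# Document shape as described in the message: object 'file', property '_content'.
doc = {"file": {"_content": encode_content(html)}}
encoded = doc["file"]["_content"]

# The markup is not visible in the encoded string ('<' is not a Base64 character),
# so a plain string field holding this value cannot match the page's words.
print("<title>" in encoded)

# Decoding restores the searchable text.
print(decode_content(encoded) == html)
```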

Is this Base64 encoding to be expected, or do I need to change something?

Does anyone have a walkthrough of using MCF to crawl web content, and output to 
ES? I’ve seen many many guides for both systems, but never something that 
combines the two. I’d prefer to avoid using Nutch for crawling, since it lacks 
any UI for management.


Stephen Corey
Technology Consultant
East Carolina University
[email protected]

