Re: ManifoldCF and ElasticSearch

Shinichiro Abe Tue, 01 Dec 2015 15:44:51 -0800

What about https://issues.apache.org/jira/browse/CONNECTORS-1234 to
avoid Base64 encoding?
If you want to capture the title of the HTML page, you could get it from
the Tika transformation connector, since Tika extracts metadata such as
the title.
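
To illustrate what "extracting the title" amounts to, here is a minimal
sketch in Python. It is not the Tika transformation connector, only a
standard-library stand-in using html.parser that pulls the text of the
first <title> element out of a page. The names TitleExtractor and
extract_title are hypothetical and are not part of ManifoldCF or Tika.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text inside the first <title> element (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._in_title = False   # True while we are between <title> and </title>
        self._done = False       # True once the first title has been closed
        self._parts = []         # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None


def extract_title(html):
    """Return the page title, or None if the page has no non-empty <title>."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title
```

A transformation step along these lines would attach the returned string
as a separate metadata field (for example "title") next to the body
content, so the index can search and display it independently.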


Shinichiro Abe

2015-12-02 3:36 GMT+09:00 Karl Wright <[email protected]>:
> I take that back; the only parsing that is done is in the context of
> determining login pages as part of login sequences.  So content is not
> parsed at all; it's sent to the output connector intact, along with HTML
> headers as metadata.  You can, of course, write a transformation connector
> that would pull out the title; the Tika transformation connector may in fact
> do that for you already, but I don't know for sure.
>
> Thanks,
> Karl
>
>
> On Tue, Dec 1, 2015 at 1:32 PM, Karl Wright <[email protected]> wrote:
>>
>> Hi Stephen,
>>
>> The ManifoldCF web connector captures all html content in the body part of
>> an html page, but it does not attempt to separate title content into
>> specific title metadata at this time.  This is, however, not particularly
>> hard to do, if I recall correctly, but I'd have to look into it in more
>> detail before I could be certain.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Dec 1, 2015 at 1:09 PM, Corey, Stephen <[email protected]> wrote:
>>>
>>> Thanks Karl!
>>>
>>>
>>>
>>> After creating a new mapping in ES, specifying the ‘file’ field as an
>>> attachment, I can now search the full text of the web content. That part is
>>> working great now.
>>>
>>>
>>>
>>> Does MCF capture the page title (in the <title> tag) anywhere?
>>>
>>> From: Karl Wright [mailto:[email protected]]
>>> Sent: Tuesday, December 1, 2015 11:00 AM
>>> To: [email protected]
>>> Subject: Re: ManifoldCF and ElasticSearch
>>>
>>>
>>>
>>> Hi Stephen,
>>>
>>>
>>>
>>> The integration with ES is supposed to go through the mapper-attachment
>>> plugin, which at one point did accept Base64-encoded "attachments" and index
>>> them.  This is what's currently implemented in the ElasticSearch output
>>> connector.
>>>
>>>
>>>
>>> Unfortunately, however, with ElasticSearch, the level of backwards
>>> compatibility isn't always what we'd like, so I wouldn't be surprised if
>>> something changed or if you needed special configuration now to do it that
>>> way.  I've been unable to keep up with what ES is doing but I'm happy to
>>> make changes to the output connector if you have information that the
>>> current implementation is incorrect, and have details about how to make it
>>> work properly in a standard, modern ES environment.  But I'd start by
>>> making sure there's actually something broken by looking at the
>>> mapper-attachment plugin.
>>>
>>>
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Tue, Dec 1, 2015 at 10:17 AM, Corey, Stephen <[email protected]> wrote:
>>>
>>> I’m putting together a proof-of-concept for crawling our website content
>>> with MCF, and indexing it with ES. At a basic level, everything seems to be
>>> working. What I’m trying to understand is that when MCF indexes web content,
>>> the HTML is stored inside an object called file in a property called
>>> _content. When this is added to the ES index, all the HTML is Base64
>>> encoded. I believe this is preventing ES from properly searching the field.
>>>
>>>
>>>
>>> Is this Base64 encoding to be expected, or do I need to change something?
>>>
>>>
>>>
>>> Does anyone have a walkthrough of using MCF to crawl web content, and
>>> output to ES? I’ve seen many guides for both systems, but never
>>> something that combines the two. I’d prefer to avoid using Nutch for
>>> crawling, since it lacks any UI for management.
>>>
>>> Stephen Corey
>>>
>>> Technology Consultant
>>> East Carolina University
>>>
>>> [email protected]
>>>
>>>
>>>
>>>
>>
>>
>
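
For readers following the quoted thread, the document shape Stephen
describes (an object called file with a Base64-encoded _content property)
can be sketched as follows. This is a rough Python illustration under
stated assumptions, not the code of the ElasticSearch output connector.
The helper names build_document and decode_content and the index type
name "document" are hypothetical. The mapping follows the general form
used by the mapper-attachment plugin, so check it against the plugin
version you actually run.

```python
import base64


def build_document(html, field="file"):
    """Wrap raw HTML as {field: {"_content": <Base64 text>}}, as described in the thread."""
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return {field: {"_content": encoded}}


def decode_content(document, field="file"):
    """Recover the original HTML from the Base64 payload."""
    return base64.b64decode(document[field]["_content"]).decode("utf-8")


# Mapping that declares the field as an attachment, so the plugin decodes and
# indexes the payload instead of treating it as an opaque string.
# "document" is a placeholder type name (assumption, not from the thread).
ATTACHMENT_MAPPING = {
    "document": {
        "properties": {
            "file": {"type": "attachment"}
        }
    }
}

if __name__ == "__main__":
    page = "<html><head><title>Example</title></head><body>Hello</body></html>"
    doc = build_document(page)
    # The stored value is Base64 text; without the attachment mapping a
    # full-text query for "Hello" has nothing readable to match.
    print(doc["file"]["_content"])
    print(decode_content(doc) == page)
```

This matches what Stephen reported: once the file field is mapped as an
attachment, the Base64 payload is decoded at index time and the page text
becomes searchable.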
