Re: Job definition metadata with multiple path attribute names

Karl Wright Sat, 06 Jun 2015 03:08:40 -0700

I attached a patch to CONNECTORS-1209.  I have not tested it yet.
Hopefully there will be time to do that later in the weekend.


Karl


On Fri, Jun 5, 2015 at 10:03 AM, Karl Wright <[email protected]> wrote:

> Created CONNECTORS-1209 for this functionality.
>
> It's not hard to do, technically, but I need to define a language to
> describe the regex and what you would want to extract.  For instance, right
> now you specify a field value in terms of another field value like this:
>
> stringstringstring${otherfieldname}stringstring
>
> I'd be putting additional specification into ${otherfieldname}, something
> like this:
>
> stringstringstring${otherfieldname:([1234567890]*)}stringstring
>
> ... which would extract the first number from the metadata value.  But
> since ":" may well be part of a field name right now, I'd need to do
> something other than that, and I'd want to be able to support more complex
> regexps as well.
>
> Karl
>
>
> On Fri, Jun 5, 2015 at 9:33 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Vigi,
>>
>> I do understand your issue, but I'd propose a general solution of adding
>> new functionality to the Metadata Transformer to achieve your goal.  So the
>> setup would be this:
>>
>> - Use the JCIFS connector Metadata tab to just include the entire path in
>> the metadata
>> - Use the Metadata Transformer to generate two different pieces of
>> metadata, using a new regular expression modification feature that I would
>> write for you, if we can come up with a design for it
>>
>> You can write your own completely new transformation connector, but
>> that's no different than what I propose, and not as useful.
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Fri, Jun 5, 2015 at 9:17 AM, Virgiliu R <[email protected]> wrote:
>>
>>> Dear Karl,
>>>
>>> Maybe I misunderstood the applications for the metadata tab but in my
>>> scenario I need to extract two types of information from a document's path.
>>> Right now I am only able to extract one piece of information and put it in
>>> Solr; it would have been very useful to be able to perform other
>>> transformations to the paths but it's OK, I can probably write a
>>> transformation connector of my own.
>>>
>>> Thanks,
>>> vigi
>>> ------------------------------
>>> Date: Fri, 5 Jun 2015 09:02:59 -0400
>>> Subject: Re: Job definition metadata with multiple path attribute names
>>> From: [email protected]
>>> To: [email protected]
>>>
>>>
>>> Hi Vigi,
>>>
>>> You get, for free, the file name of the document as metadata, from all
>>> repository connectors, including the jcifs connector:
>>>
>>> >>>>>>
>>>                   rd.setFileName(fileNameString);
>>> <<<<<<
>>>
>>> The problem is that this is not something you can manipulate in MCF via
>>> regular expression with the current bevy of supplied transformation
>>> connectors, because (a) it isn't generic metadata but a fixed property of
>>> the document, and (b) the Metadata Transformer connector doesn't allow you
>>> to slice and dice metadata in any case, just compose it into bigger strings.
>>>
>>> So you're stuck with either writing a document transformation connector
>>> of your own, which does what you want, or proposing additional
>>> functionality for the Metadata Transformer.  If it can be done in a
>>> backwards compatible way, this is something I would support.
>>>
>>> I'm not thrilled with the idea of extending the JCIFS connector to build
>>> multiple independent attributes all from the path; the UI for this
>>> connector is already quite complex, and the functionality for generically
>>> manipulating metadata would be useful in general anyway.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Jun 5, 2015 at 8:37 AM, Virgiliu R <[email protected]> wrote:
>>>
>>> Hello guys,
>>>
>>> I have another ManifoldCF 2.0.2 question. Our process consists of
>>> indexing some documents from a Windows Share and sending them to Solr. I
>>> would like to extract some information from the documents and put it into
>>> specific Solr fields. For example, based on the id of the document I am
>>> currently extracting a specific folder name (using regular expressions on
>>> the metadata tab of the job definition) and storing it into Solr; this
>>> works fine.
>>>
>>> However, I also want to extract the file extension (using regex) and
>>> send it to Solr but I am not able to add more than one path attribute name
>>> on the Metadata tab of the job definition. I already have one that extracts
>>> a particular folder name from the file path and I would need a second one
>>> for the file extension.
>>>
>>> How would I be able to achieve this?
>>>
>>> Regards,
>>> vigi
>>>
>>>
>>>
>>
>
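
For anyone following the thread, the ${otherfieldname:regex} idea can be sketched in a few lines of Java. This is only an illustration of the proposal, not the patch attached to CONNECTORS-1209 and not the Metadata Transformer's actual syntax. The class name FieldExpressionSketch, the expand method, and the field names "path" and "docid" are all made up for the example. It also assumes that field names contain neither ":" nor "}", and that the regex contains no "}", which sidesteps the ambiguity with ":" raised above rather than solving it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldExpressionSketch {
    // Matches ${name} or ${name:regex}. Hypothetical syntax taken from the
    // discussion; names may not contain ':' or '}', regexes may not contain '}'.
    private static final Pattern REFERENCE =
        Pattern.compile("\\$\\{([^:}]+)(?::([^}]*))?\\}");

    // Expands every reference in the template against the given field values.
    // A missing field, or a regex that does not match, contributes an empty string.
    static String expand(String template, Map<String, String> fields) {
        Matcher ref = REFERENCE.matcher(template);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (ref.find()) {
            // Copy the literal text between the previous reference and this one.
            out.append(template, last, ref.start());
            String value = fields.getOrDefault(ref.group(1), "");
            String regex = ref.group(2);
            if (regex == null) {
                // Plain ${name}: insert the whole field value.
                out.append(value);
            } else {
                // ${name:regex}: insert the first capture group of the first
                // match, or the whole match if the regex has no groups.
                Matcher m = Pattern.compile(regex).matcher(value);
                if (m.find()) {
                    String piece = m.groupCount() >= 1 ? m.group(1) : m.group();
                    if (piece != null) {
                        out.append(piece);
                    }
                }
            }
            last = ref.end();
        }
        out.append(template, last, template.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new HashMap<>();
        fields.put("path", "smb://server/share/projects/report.pdf");
        fields.put("docid", "abc123def456");

        // Two different values derived from the same path field.
        System.out.println(expand("${path:/share/([^/]+)/}", fields));
        System.out.println(expand("${path:\\.([^./]+)$}", fields));
        // First run of digits, embedded in surrounding literal text.
        System.out.println(expand("id-${docid:([0-9]+)}-x", fields));
    }
}
```

The number example uses "+" rather than "*" because a pattern that can match the empty string would match at position 0 of a value that starts with a non-digit and extract nothing.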
