I attached a patch to CONNECTORS-1209. I have not tested it yet. Hopefully there will be time to do that later in the weekend.
Karl

On Fri, Jun 5, 2015 at 10:03 AM, Karl Wright <[email protected]> wrote:

> Created CONNECTORS-1209 for this functionality.
>
> It's not hard to do, technically, but I need to define a language to
> describe the regex and what you would want to extract. For instance, right
> now you specify a field value in terms of another field value like this:
>
> stringstringstring${otherfieldname}stringstring
>
> I'd be putting additional specification into ${otherfieldname}, something
> like this:
>
> stringstringstring${otherfieldname:([1234567890]*)}stringstring
>
> ... which would extract the first number from the metadata value. But
> since ":" may well be part of a field name right now, I'd need to do
> something other than that, and I'd want to be able to support more complex
> regexps as well.
>
> Karl
>
>
> On Fri, Jun 5, 2015 at 9:33 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Vigi,
>>
>> I do understand your issue, but I'd propose a general solution of adding
>> new functionality to the Metadata Transformer to achieve your goal. So the
>> setup would be this:
>>
>> - Use the JCIFS connector Metadata tab to just include the entire path in
>> the metadata
>> - Use the Metadata Transformer to generate two different pieces of
>> metadata, using a new regular expression modification feature that I would
>> write for you, if we can come up with a design for it
>>
>> You can write your own completely new transformation connector, but
>> that's no different than what I propose, and not as useful.
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Fri, Jun 5, 2015 at 9:17 AM, Virgiliu R <[email protected]> wrote:
>>
>>> Dear Karl,
>>>
>>> Maybe I misunderstood the applications for the metadata tab but in my
>>> scenario I need to extract two types of information from a document's path.
>>> Right now I am only able to extract one piece of information and put it in
>>> Solr; it would have been very useful to be able to perform other
>>> transformations to the paths but it's OK, I can probably write a
>>> transformation connector of my own.
>>>
>>> Thanks,
>>> vigi
>>> ------------------------------
>>> Date: Fri, 5 Jun 2015 09:02:59 -0400
>>> Subject: Re: Job definition metadata with multiple path attribute names
>>> From: [email protected]
>>> To: [email protected]
>>>
>>>
>>> Hi Vigi,
>>>
>>> You get, for free, the file name of the document as metadata, from all
>>> repository connectors, including the jcifs connector:
>>>
>>> >>>>>>
>>> rd.setFileName(fileNameString);
>>> <<<<<<
>>>
>>> The problem is that this is not something you can manipulate in MCF via
>>> regular expression with the current bevy of supplied transformation
>>> connectors, because (a) it isn't generic metadata but a fixed property of
>>> the document, and (b) the Metadata Transformer connector doesn't allow you
>>> to slice and dice metadata in any case, just compose it into bigger strings.
>>>
>>> So you're stuck with either writing a document transformation connector
>>> of your own, which does what you want, or proposing additional
>>> functionality for the Metadata Transformer. If it can be done in a
>>> backwards compatible way, this is something I would support.
>>>
>>> I'm not thrilled with the idea of extending the JCIFS connector to build
>>> multiple independent attributes all from the path; the UI for this
>>> connector is already quite complex, and the functionality for generically
>>> manipulating metadata would be useful in general anyway.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Jun 5, 2015 at 8:37 AM, Virgiliu R <[email protected]> wrote:
>>>
>>> Hello guys,
>>>
>>> I have another ManifoldCF 2.0.2 question. Our process consists of
>>> indexing some documents from a Windows Share and sending them to Solr. I
>>> would like to extract some information from the documents and put it into
>>> specific Solr fields. For example, based on the id of the document I am
>>> currently extracting a specific folder name (using regular expressions on
>>> the metadata tab of the job definition) and storing it into Solr; this
>>> works fine.
>>>
>>> However, I also want to extract the file extension (using regex) and
>>> send it to Solr but I am not able to add more than one path attribute name
>>> on the Metadata tab of the job definition. I already have one that extracts
>>> a particular folder name from the file path and I would need a second one
>>> for the file extension.
>>>
>>> How would I be able to achieve this?
>>>
>>> Regards,
>>> vigi
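
The two extractions asked about in the original question (a folder name and the file extension, both taken from the document path) can be sketched with plain java.util.regex. This is only an illustration under stated assumptions: it is not ManifoldCF code and not the CONNECTORS-1209 patch, and the class name PathMetadataSketch, its methods, the sample path, and the "projects" folder rule are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: not ManifoldCF code and not the CONNECTORS-1209 patch.
// All names below are hypothetical.
final class PathMetadataSketch {

  private PathMetadataSketch() {}

  /**
   * Returns capture group 1 of the first match of regex in value,
   * or the empty string when nothing matches.
   * The regex must contain at least one capture group.
   */
  static String firstGroup(String value, String regex) {
    Matcher m = Pattern.compile(regex).matcher(value);
    return m.find() ? m.group(1) : "";
  }

  /** Hypothetical rule: the folder directly under "projects". */
  static String projectFolder(String path) {
    return firstGroup(path, "/projects/([^/]+)/");
  }

  /** Text after the last dot of the final path segment, if any. */
  static String fileExtension(String path) {
    return firstGroup(path, "\\.([^./]+)$");
  }
}
```

For a path such as smb://fileserver/share/projects/alpha/report.final.pdf, projectFolder returns alpha and fileExtension returns pdf. To pull the first number out of a value with this helper, a pattern like ([0-9]+) is the safer choice: a starred character class can match the empty string at the start of a value that does not begin with a digit, so the first match would be empty.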
