To this requirement I would add the basic requirement that this file (which Fergus calls the manifest, a name I still don't agree with) represents an update-set, and that there should be a delete-set as well.

ChangeSetEntityProcessor: on that one I would jump in with both feet.

paul


On 10 Mar 2009, at 05:40, Noble Paul നോബിള്‍ नोब्ळ् wrote:

Hi Fergus, open a JIRA issue anyway. Put in your thoughts and we can
refine the requirements as part of the discussion.

Basically the requirements are:
1) read a file line by line
2) filter out lines (include or exclude) based on a regex
3) extract parts (named parts) from the line using another regex
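
The three requirements above can be sketched roughly as follows. This is only an illustration in Python, not DIH code: the function name `extract_rows`, its parameters, and the sample lines are all made up for the example, and a real EntityProcessor would be written in Java against the DIH API.

```python
import re

def extract_rows(lines, include=None, exclude=None, extract=None):
    """Yield one dict per line that survives the include/exclude filters.

    All names here are hypothetical; this is not an existing DIH API.
    """
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None
    extract_re = re.compile(extract) if extract else None
    for raw in lines:                      # 1) read line by line
        line = raw.rstrip("\n")
        # 2) keep or drop the line, grep-style
        if include_re and not include_re.search(line):
            continue
        if exclude_re and exclude_re.search(line):
            continue
        # 3) pull named parts out of the line with a second regex
        if extract_re is None:
            yield {"line": line}
            continue
        match = extract_re.search(line)
        if match:
            yield match.groupdict()

# Made-up sample input, loosely shaped like a verbose extraction report.
sample = [
    "x data/a.xml\n",
    "x data/readme.txt\n",
    "x data/sub/b.xml\n",
]
rows = list(extract_rows(sample,
                         include=r"\.xml$",
                         extract=r"^x (?P<fileName>.+)$"))
```

Because the function only iterates over `lines`, an open file handle can be passed in place of the list, which gives the line-by-line reading without loading the whole file.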

Noble


On Tue, Mar 10, 2009 at 1:50 AM, Fergus McMenemie <fer...@twig.me.uk> wrote:
Hi Fergus,
The idea is that we have something generic which can be applicable to a
large set of users. If the manifest is a text file it can be read in
some standard way (say line by line). So we can have an EntityProcessor
which reads a text file line by line and filters it by a regex, the way
'grep' works.
Yes. That is what I have written. It is just an alternate form of the
FileListEntityProcessor except that rather than walking the file system
it reads from a file, line by line, and identifies the portion of the
line containing the filename using a regexp.



On Mon, Mar 9, 2009 at 10:44 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
Manifest processing has a very limited use case. Why can't it be
processed using a PlainTextEntityProcessor, with a Transformer written to
read lines using a regex?

Ehmmm Ok. The PlainTextEntityProcessor docs do not give me enough
insight to see how this could be used to index each of the files
listed by a 'tar xvf' report. Can you explain further?

About the limited use case: Verity thought it was useful enough
to have their own "bulk insert file" or bif file format that
did the same and was far less flexible.

In my experience we generally started off with some kind of
file walker or crawler looking after file repositories. But
these always proved slow and unreliable, and over time they
were always replaced with some kind of manifest-based
control of the indexer. Where we could get a report of changes
we always used it, and only relied on walkers or crawlers
where we had to.

Fergus


--Noble

On Mon, Mar 9, 2009 at 8:30 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
Hello,

I have almost finished a new DIH EntityProcessor which
I am calling the ManifestEntityProcessor. It is designed
around the idea that whatever daemon is used to maintain
your set of a few 100,000 XML documents, it is likely to
drop a report or log file explaining what has been changed
within your content store. This assumes a file-based
content repository.

The ManifestEntityProcessor is used as follows:

      <entity name="jc"
              processor="ManifestEntityProcessor"
              baseDir="/Volumes/Techmore/ts/aaa/schema/data"
              rootEntity="false"
              dataSource="null"

              allowRegex="^.*\.xml$"
              manifestFileName="/Volumes/ts/man-find.txt"
              manifestAddRegex="(.*)$"
              >

The idea is you have a log file or other report, perhaps
from tar or zip, and you wish to use this to control the
indexing of the new content. The new entity fields are as
follows.

manifestFileName is the name of the manifest file. If
                this value is relative, it is assumed to
                be relative to baseDir. Required.

manifestAddRegex is a required regex to identify lines
                which when matched should cause docs to
                be added to the index.

manifestDelRegex is an optional regex to identify lines
                which when matched should cause docs to
                be deleted from the index **PLANNED**

allowRegex       a required regex to identify the portion
                of the ADD/DELete line identified above
                which contains the file or pathname to be
                ADDed or DELeted. If the resulting value
                is relative, it is assumed to be relative to
                baseDir.
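
One possible reading of how these attributes interact is sketched below in Python. It is not the ManifestEntityProcessor code: the function `plan_changes`, the "ADD "/"DEL " line prefixes, the order in which the regexes are tried, and the use of the first capture group are all assumptions made for the illustration.

```python
import posixpath
import re

def plan_changes(lines, base_dir, add_regex, allow_regex, del_regex=None):
    """Split manifest lines into (paths to add, paths to delete).

    Hypothetical helper; parameter names only mirror the attributes
    described above.
    """
    add_re = re.compile(add_regex)
    del_re = re.compile(del_regex) if del_regex else None
    allow_re = re.compile(allow_regex)
    adds, deletes = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        # Assumption: the delete regex is tried first, so a catch-all
        # add regex cannot swallow delete lines.
        if del_re and del_re.search(line):
            target = deletes
        elif add_re.search(line):
            target = adds
        else:
            continue                      # neither an ADD nor a DEL line
        match = allow_re.search(line)
        if not match:
            continue                      # no file name found on this line
        # Assumption: use capture group 1 if the regex has one,
        # otherwise the whole match.
        path = match.group(1) if allow_re.groups else match.group(0)
        if not path:
            continue
        # Relative names are resolved against base_dir.
        if not posixpath.isabs(path):
            path = posixpath.join(base_dir, path)
        target.append(path)
    return adds, deletes

# Made-up manifest lines; the ADD/DEL prefixes are illustrative only.
manifest = [
    "ADD docs/a.xml\n",
    "DEL docs/old.xml\n",
    "ADD /abs/b.xml\n",
    "# comment\n",
]
adds, deletes = plan_changes(manifest, "/data",
                             add_regex=r"^ADD ",
                             del_regex=r"^DEL ",
                             allow_regex=r"(\S+\.xml)$")
```

Returning the two lists separately keeps the update-set and the delete-set apart, so the caller can decide how each one is applied to the index.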

What do I do next?
  Raise a JIRA issue and add the code?
  Is DIH the right place to add this?
  Suggestions for a different name?
  Suggestions on how to do the delete bit from within an entity?

Regards Fergus.
--Noble Paul

--

===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================



