Hi Jukka,
You could start with a prototype that implements one or two parsers. That
would let us test the design before implementing all the parsers.
If you need any help, please don't hesitate to ask.

Best Regards.

On 8/24/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> On 8/24/07, Rida Benjelloun <[EMAIL PROTECTED]> wrote:
> > I agree with your use case.
>
> Thanks!
>
> I've been thinking about this a bit more. My main concern with the
> current Parser classes from Lius Lite is that they always parse the
> entire document into an in-memory data structure. This can easily
> become a scalability issue, and I'd like to avoid that already at the
> design level.
>
> Also, it seems to me that the current regexp and xpath features from
> Lius would work better as a layer on top of the parser code instead of
> as an integral part of it.
>
> As for my design proposal itself, I think I have a more workable
> approach to use cases 1 (extract structured content) and 3 (extract
> metadata). It looks like this:
>
> Extract metadata:
>
>     InputStream stream = ...;
>     Metadata metadata = new Metadata();
>     SomeTikaInterface parser = new SomeTikaClass();
>     parser.extractMetadata(stream, metadata);
>
> Extract structured content (and metadata as a side-effect):
>
>     InputStream stream = ...;
>     ContentHandler handler = ...; // SAX event handler
>     Metadata metadata = new Metadata();
>     SomeTikaInterface parser = new SomeTikaClass();
>     parser.extractContent(stream, handler, metadata);
>
> In both cases it would be possible to feed existing metadata hints
> (like the file name, Content-Type header, or some other similar
> information) to the parser through the metadata argument.
>
> WDYT? I'd like to start going forward with some code along these
> lines, most likely by adapting/refactoring the Lius classes we already
> have.
>
> BR,
>
> Jukka Zitting
>
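
For the prototype, the two calls proposed above could be sketched roughly as
follows. This is only an illustration under assumptions: Metadata,
StreamingParser, PlainTextParser and PrototypeDemo are hypothetical names (the
proposal leaves them open as SomeTikaInterface / SomeTikaClass), and the
"Content-Type" and "resourceName" keys are placeholders for whatever hint
names we end up agreeing on. The point is that content goes out as SAX events
while the stream is read, so nothing is built up in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical metadata holder: a plain name-to-value map. */
class Metadata {
    private final Map<String, String> values = new HashMap<>();
    void set(String name, String value) { values.put(name, value); }
    String get(String name) { return values.get(name); }
}

/** Hypothetical interface with the two operations from the proposal. */
interface StreamingParser {
    void extractMetadata(InputStream stream, Metadata metadata) throws IOException;
    void extractContent(InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException;
}

/** Prototype parser for plain text, assuming UTF-8 input. */
class PlainTextParser implements StreamingParser {
    @Override
    public void extractMetadata(InputStream stream, Metadata metadata) throws IOException {
        // Respect a Content-Type hint supplied by the caller; otherwise use a default.
        if (metadata.get("Content-Type") == null) {
            metadata.set("Content-Type", "text/plain");
        }
    }

    @Override
    public void extractContent(InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException {
        extractMetadata(stream, metadata); // metadata is filled in as a side effect
        Reader reader = new InputStreamReader(stream, StandardCharsets.UTF_8);
        char[] buffer = new char[4096];
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        int n;
        // Forward each chunk as a SAX characters event instead of buffering the document.
        while ((n = reader.read(buffer)) != -1) {
            handler.characters(buffer, 0, n);
        }
        handler.endElement("", "p", "p");
        handler.endDocument();
    }
}

class PrototypeDemo {
    public static void main(String[] args) throws Exception {
        InputStream stream =
                new ByteArrayInputStream("Hello, Tika".getBytes(StandardCharsets.UTF_8));
        StringBuilder text = new StringBuilder();
        // The handler decides what to keep; here it simply collects the text.
        ContentHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        Metadata metadata = new Metadata();
        metadata.set("resourceName", "hello.txt"); // existing hint passed to the parser
        new PlainTextParser().extractContent(stream, handler, metadata);
        System.out.println(metadata.get("Content-Type") + ": " + text);
    }
}
```

The regexp and xpath features could then be written as ContentHandler
implementations layered on top of this, rather than living inside the parsers.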



-- 
---------------------------------------------------------
Rida Benjelloun
Doculibre inc.
[EMAIL PROTECTED]
[EMAIL PROTECTED]
Cel: 418-262-3222
Tel: 418-353-3390
Site Web : http://www.doculibre.com
---------------------------------------------------------
