[ 
https://issues.apache.org/jira/browse/TIKA-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting updated TIKA-26:
------------------------------

    Attachment: TIKA-26.patch

This patch replaces the List<Content> collection in ParserConfig and Parser 
with a Map<String, Content> map as described above.
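
As a rough illustration of the list-to-map step that moves into ParserConfig, 
a minimal sketch follows. NamedContent and ContentMaps are hypothetical 
stand-ins, not classes from the patch, and the choice of LinkedHashMap (which 
keeps configuration order and lets a later duplicate name overwrite an earlier 
one) is an assumption rather than something the patch specifies.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for Tika's Content class; only the name matters here. */
final class NamedContent {
    private final String name;

    NamedContent(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

/** Hypothetical helper showing the conversion that ParserConfig would do once. */
final class ContentMaps {
    private ContentMaps() {
    }

    /** Builds a map keyed by content name, preserving the order of the list. */
    static Map<String, NamedContent> byName(List<NamedContent> contents) {
        Map<String, NamedContent> map = new LinkedHashMap<String, NamedContent>();
        for (NamedContent content : contents) {
            // Assumption: a later entry with the same name replaces an earlier one.
            map.put(content.getName(), content);
        }
        return map;
    }
}
```

With the map built once in the configuration, each parser can look up a 
Content by name directly instead of rebuilding its own index from the list.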

In addition, the patch makes some minor cleanups, such as using class-specific 
logger instances and tracking the state of parser instances more explicitly 
(with a separate "parsed" flag). The patch should not, however, introduce any 
functional changes.

This patch probably conflicts a bit with Keith's recent work on TIKA-17 and 
other issues. I'll give those a look and come up with an updated patch once his 
changes are committed.

After this patch the basic structure of a parser class is:

    public class SomeParser extends Parser {
        private static final Logger logger = Logger.getLogger(SomeParser.class);
        private boolean parsed = false;
        private String contentStr;
        public Map<String,Content> getContents() {
            Map<String,Content> contents = super.getContents();
            if (!parsed) {
                // fill in contents and contentStr with parsed content
                // from getInputStream()
                parsed = true;
            }
            return contents;
        }
        public String getStrContent() {
            getContents();
            return contentStr;
        }
    }

What I'd like to do as a followup step is to pass the InputStream as an 
argument to getContents() and to include the full text content as a part of the 
Content map to make the parser instances stateless.
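
A minimal sketch of what such a stateless parser might look like is shown 
below. It is not part of the patch: Content and StatelessTextParser are 
hypothetical stand-ins, the "fulltext" key is an assumed name, and the input is 
treated as UTF-8 plain text purely to keep the example short.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for Tika's Content class: a name plus an extracted value. */
final class Content {
    private final String name;
    private final String value;

    Content(String name, String value) {
        this.name = name;
        this.value = value;
    }

    String getName() {
        return name;
    }

    String getValue() {
        return value;
    }
}

/** Hypothetical parser with no per-document fields, so one instance can be reused. */
final class StatelessTextParser {
    /** Assumed map key for the full text; the issue does not name one. */
    static final String FULL_TEXT = "fulltext";

    /** Parses the given stream and returns a fresh map on every call. */
    Map<String, Content> getContents(InputStream in) throws IOException {
        // readAllBytes() requires Java 9 or later.
        String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        Map<String, Content> contents = new LinkedHashMap<>();
        // The full text travels in the map, so no contentStr field is needed.
        contents.put(FULL_TEXT, new Content(FULL_TEXT, text));
        return contents;
    }
}
```

Because nothing is cached between calls, the "parsed" flag and the separate 
getStrContent() accessor from the structure above would no longer be needed in 
this variant.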


> Use Map<String, Content> instead of List<Content>
> -------------------------------------------------
>
>                 Key: TIKA-26
>                 URL: https://issues.apache.org/jira/browse/TIKA-26
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 0.1-incubator
>
>         Attachments: TIKA-26.patch
>
>
> The current Parser classes take a List<Content> collection from ParserConfig, 
> and explicitly reformat that collection into an internal Map<String,Content> 
> map keyed by the Content names. I don't see any place where using a list of 
> Content instances is better than a Map keyed by the Content names, so I'd 
> like to simplify things by creating the map already in ParserConfig and using 
> it directly from then on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
