[ 
https://issues.apache.org/jira/browse/TIKA-17?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith R. Bennett updated TIKA-17:
---------------------------------

    Attachment: tika-17.patch

I apologize for the large patch, but it was nearly impossible to avoid.  Here are 
the issues addressed by this patch:

====================

LiusConfig:

1) Changed to use URLs instead of Files.

2) Created a constructor with a Document parameter; this was how the object was 
being created anyway.

3) In getParserConfig(), added a check for a null object in the list.

4) Added to the error message the URL that was being processed when the error 
occurred.

5) a) Changed:
  static void populateConfig(Document doc, LiusConfig tc)
to:
  void populateConfig(Document doc)

... and called it in the LiusConfig(Document) constructor.

5) b) Removed static member 'tc'; it was no longer necessary and, given the 
above change, leaving it in would have been confusing.
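
As a rough illustration of items 2, 4 and 5 (this is not the code in the 
patch), the shape of the change might look like the sketch below.  ConfigSketch 
is a hypothetical stand-in for LiusConfig; the use of org.w3c.dom.Document, the 
"parser" element, and its "mime" and "class" attributes are all assumptions 
made only for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical stand-in for LiusConfig; element and attribute names are assumed.
class ConfigSketch {

    // Maps a mime type to the configured parser class name.
    private final Map<String, String> parserClasses = new LinkedHashMap<>();

    // The instance is populated from the Document it is constructed with,
    // so no static 'tc' member is needed.
    ConfigSketch(Document doc) {
        populateConfig(doc);
    }

    // Reads the configuration from any URL (file:, jar:, http:, ...).
    static ConfigSketch fromUrl(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return fromStream(in, url.toString());
        }
    }

    // 'source' is only used to make error messages say what was being read.
    static ConfigSketch fromStream(InputStream in, String source) throws IOException {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            return new ConfigSketch(doc);
        } catch (ParserConfigurationException | SAXException e) {
            throw new IOException("Error reading configuration from " + source, e);
        }
    }

    // Instance method: fills this object rather than one passed in.
    private void populateConfig(Document doc) {
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            parserClasses.put(parser.getAttribute("mime"), parser.getAttribute("class"));
        }
    }

    // Returns null when no parser is configured for the mime type.
    String getParserClass(String mimeType) {
        return parserClasses.get(mimeType);
    }
}
```

Making populateConfig() an instance method means the constructor is the only 
place that fills the object in.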

==================================

ParserFactory:

1) Changed to use URLs instead of Files.

2) Added:
  public static Parser getParser(URL url, LiusConfig tc)

Removed:
  public static Parser getParser(File file, String tcPath)
  public static Parser getParser(String str, String tcPath)
... since the same thing can easily be accomplished by instantiating the 
LiusConfig object and passing it instead of tcPath.  Or do we really need those?

3) Changed the worker method to throw an exception if a parser configuration 
cannot be found for a mime type.  Without this patch, I think execution would 
continue and a NullPointerException would be thrown when 'parser' is 
dereferenced.

4) Added an error log entry for the case where no parser configuration is found.
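
To make items 3 and 4 concrete, here is a minimal sketch of failing fast when 
no parser configuration exists for a mime type.  It is not the patch code: 
ParserLookupSketch, ParserConfigNotFoundException and findParserClass() are 
hypothetical names, and java.util.logging is used only to keep the example 
self-contained.

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical illustration; none of these names come from the patch.
class ParserLookupSketch {

    // Thrown instead of letting a null configuration reach later code.
    static class ParserConfigNotFoundException extends Exception {
        ParserConfigNotFoundException(String message) {
            super(message);
        }
    }

    private static final Logger LOG =
            Logger.getLogger(ParserLookupSketch.class.getName());

    private final Map<String, String> parserClassByMimeType = new HashMap<>();

    void register(String mimeType, String parserClassName) {
        parserClassByMimeType.put(mimeType, parserClassName);
    }

    // Looks up the parser class for a mime type; logs and throws if none is
    // configured, naming both the mime type and the URL being processed.
    String findParserClass(String mimeType, URL url)
            throws ParserConfigNotFoundException {
        String parserClass = parserClassByMimeType.get(mimeType);
        if (parserClass == null) {
            String message = "No parser configuration found for mime type "
                    + mimeType + " (URL: " + url + ")";
            LOG.severe(message);
            throw new ParserConfigNotFoundException(message);
        }
        return parserClass;
    }
}
```

The caller then sees a message naming the mime type and URL at the point of 
failure, rather than a NullPointerException some lines later.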

==================================

LiusLogger:

1) Changed to use URLs instead of Files.
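
Since this same File-to-URL change recurs in each class, a small sketch may 
help show why it is not limiting.  A File can always be converted to a URL, and 
a URL can also come from a class loader.  ResourceUrlSketch and its method 
names are hypothetical; only standard JDK calls are used.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the patch.
class ResourceUrlSketch {

    // Code written against URL works the same for file:, jar:, http:, ...
    static String firstLine(URL url) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    // Callers that still hold a File can convert it.
    static URL toUrl(File file) throws MalformedURLException {
        return file.toURI().toURL();
    }

    // Looks a resource up via the class loader; returns null if not found.
    static URL fromClasspath(String resourceName) {
        return ResourceUrlSketch.class.getClassLoader().getResource(resourceName);
    }
}
```

A caller holding a File would pass toUrl(file); a caller wanting a bundled 
resource would pass the result of fromClasspath(...), after checking it for 
null.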

==================================

TestParsers:

1) Changed to use URLs instead of Files.

2) Renamed method testWORDxtraction() to testWORDExtraction().

3) Added output that lists on one line all the content objects, such as:

  Structured Content contains the following 12 items: fullText, title,
  author, creator, summary, keywords, producer, subject, trapped,
  creationDate, modificationDate, outLinks

This was because some of the content pieces were many lines long, so it was 
difficult to find out the total set of content pieces found.

4) A message is printed to stdout if either the config.xml or the 
log4j.properties file cannot be found.

5) log4j.properties is in the repository in src/test/resources/log4j.  I 
changed the source code to look for it there.

6) config.xml is in the repository in src/test/resources.  I changed the source 
code to look for it there.

7) When exception stack traces are printed, the URL that caused the error is 
printed immediately afterward:
  "Exception getting parser for URL file://...."
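
For item 3 above, the one-line summary could be produced along these lines.  
This is only a sketch: ContentSummarySketch and summarize() are hypothetical 
names, and the content objects are reduced to a plain list of their names.

```java
import java.util.List;

// Hypothetical helper; the real test works with Tika's content objects.
class ContentSummarySketch {

    // Builds a single line naming every content item, prefixed by the count.
    static String summarize(List<String> contentNames) {
        return "Structured Content contains the following "
                + contentNames.size() + " items: "
                + String.join(", ", contentNames);
    }
}
```

Printing this line before the individual (possibly multi-line) values makes the 
full set of items visible at a glance.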


> Need to support URL's for input resources.
> ------------------------------------------
>
>                 Key: TIKA-17
>                 URL: https://issues.apache.org/jira/browse/TIKA-17
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.1-incubator
>            Reporter: Keith R. Bennett
>             Fix For: 0.1-incubator
>
>         Attachments: tika-17.patch
>
>
> It would be extremely helpful to support URL's instead of just File's for 
> input resources.  This would enable us to use class loaders to find 
> resources, and in general support resources that are not available via the 
> filesystem.
> Patch coming...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
