[ https://issues.apache.org/jira/browse/TIKA-7?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting updated TIKA-7:
-----------------------------
Attachment: liuslite.patch
I took the sources (and associated config and test files) from liusLite.zip and
placed them inside the svn trunk, following the standard Maven 2 layout. See the
attached liuslite.patch file for the result.
Summary of changes:
* I replaced all tabs with four spaces each.
* I moved liusLite/src/junit to src/test/java/org/apache/tika and mapped
liuslite.junit to org.apache.tika
* I moved the other liusLite/src subfolders to src/main/java/org/apache/tika
and mapped liuslite.* to org.apache.tika.*
* I moved liusLite/config to src/main/resources
* I moved liusLite/testFiles to src/test/resources
Finally, I used "svn add" to mark the new files for version control and ran
"svn diff" from trunk to create the attached patch file.
Note that the patch is not ready for inclusion, as it doesn't change pom.xml to
include all the required dependencies. It should still help illustrate what I
think the codebase should look like.
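As a rough illustration of the renaming rules in the summary above, here is a minimal Java sketch. The class LayoutMapping, its helper names, and the test class name TestParsers are hypothetical and only serve the example. This is not the procedure used to produce the patch, just one way to state the mapping precisely.

```java
public class LayoutMapping {
    static final String OLD_ROOT = "liuslite";
    static final String OLD_TEST_ROOT = "liuslite.junit";
    static final String NEW_ROOT = "org.apache.tika";

    // Classes under liuslite.junit are treated as tests.
    static boolean isTestClass(String className) {
        return className.startsWith(OLD_TEST_ROOT + ".");
    }

    // liuslite.junit.X -> org.apache.tika.X, liuslite.a.b.X -> org.apache.tika.a.b.X
    static String mapClassName(String className) {
        if (isTestClass(className)) {
            return NEW_ROOT + className.substring(OLD_TEST_ROOT.length());
        }
        if (className.startsWith(OLD_ROOT + ".")) {
            return NEW_ROOT + className.substring(OLD_ROOT.length());
        }
        return className; // not a Lius Lite class, leave untouched
    }

    // Location of the source file in the standard Maven 2 layout.
    static String sourcePath(String className) {
        String dir = isTestClass(className) ? "src/test/java/" : "src/main/java/";
        return dir + mapClassName(className).replace('.', '/') + ".java";
    }

    // Replace each tab with four spaces.
    static String expandTabs(String line) {
        return line.replace("\t", "    ");
    }

    public static void main(String[] args) {
        // prints src/main/java/org/apache/tika/parser/xml/XMLParser.java
        System.out.println(sourcePath("liuslite.parser.xml.XMLParser"));
        // TestParsers is a made-up test class name
        // prints src/test/java/org/apache/tika/TestParsers.java
        System.out.println(sourcePath("liuslite.junit.TestParsers"));
    }
}
```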
> Lius Lite remove all lucene dependencies from Lius and use Nutch office
> parsers
> --------------------------------------------------------------------------------
>
> Key: TIKA-7
> URL: https://issues.apache.org/jira/browse/TIKA-7
> Project: Tika
> Issue Type: New Feature
> Components: general
> Environment: Java 1.5
> Reporter: Rida Benjelloun
> Attachments: liuslite.patch, liusLite.zip
>
>
> Hi,
> This is a work-in-progress release of Lius. It removes all Lucene
> dependencies and uses the Nutch Office parsers, because they are based on
> Apache POI.
> Lius Lite offers four ways to extract content:
> - Document fulltext extraction
> - XPath extraction
> - Regex extraction
> - Document metadata extraction (not implemented for all parsers)
> Lius Lite uses an XML config file to configure the parsers and the information
> to extract. Please see config.xml in the config folder.
> See also the JUnit tests.
> Here is an example of XML parsing:
> 1- XML Config
> <parser name="text-xml" class="liuslite.parser.xml.XMLParser">
>
>
>     <namespace>http://purl.org/dc/elements/1.1/</namespace>
>     <mime>application/xml</mime>
>     <extract>
>         <content name="title" xpathSelect="//dc:title"/>
>         <content name="subject" xpathSelect="//dc:subject"/>
>         <content name="creator" xpathSelect="//dc:creator"/>
>         <content name="description" xpathSelect="//dc:description"/>
>         <content name="publisher" xpathSelect="//dc:publisher"/>
>         <content name="contributor" xpathSelect="//dc:contributor"/>
>         <content name="type" xpathSelect="//dc:type"/>
>         <content name="format" xpathSelect="//dc:format"/>
>         <content name="identifier" xpathSelect="//dc:identifier"/>
>         <content name="language" xpathSelect="//dc:language"/>
>         <content name="rights" xpathSelect="//dc:rights"/>
>         <content name="outLinks">
>             <regexSelect>
>                 <![CDATA[
>
> ([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)
>                 ]]>
>             </regexSelect>
>         </content>
>     </extract>
> </parser>
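To show what the outLinks expression in the config above matches, here is a small sketch that runs the same pattern through java.util.regex. The class name OutLinksSketch, the helper extractLinks, and the sample sentence are made up for illustration. How Lius Lite itself applies the regex is not shown in this issue, so the sketch assumes it runs against plain text after trimming the whitespace around the CDATA content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutLinksSketch {
    // The regexSelect expression from the config, copied verbatim (whitespace trimmed).
    static final Pattern OUT_LINKS = Pattern.compile(
        "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}"
        + "(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)");

    // Collect every substring matched by the pattern, in document order.
    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<String>();
        Matcher m = OUT_LINKS.matcher(text);
        while (m.find()) {
            links.add(m.group(1)); // group 1 wraps the whole expression
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "See http://www.apache.org and "
            + "https://issues.apache.org/jira/browse/TIKA-7 for details";
        // prints [http://www.apache.org, https://issues.apache.org/jira/browse/TIKA-7]
        System.out.println(extractLinks(text));
    }
}
```

Note that the pattern accepts any "scheme:rest" token, so running it over raw markup would also pick up qualified names such as dc:title. That is why the sketch feeds it plain text.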
> 2- XML Document
> <oaidc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
>           xmlns:oaidc="http://www.openarchives.org/OAI/2.0/oai_dc/">
>     <dc:title>Archimède et Lius</dc:title>
>     <dc:creator>Rida Benjelloun</dc:creator>
>     <dc:subject>Java</dc:subject>
>     <dc:subject>XML</dc:subject>
>     <dc:subject>XSLT</dc:subject>
>     <dc:subject>JDOM</dc:subject>
>     <dc:subject>Indexation</dc:subject>
>     <dc:description>Framework d'indexation des documents XML, HTML, PDF etc.. </dc:description>
>     <dc:publisher>Doculibre</dc:publisher>
>     <dc:identifier>http://www.apache.org</dc:identifier>
>     <dc:date>2000-12</dc:date>
>     <dc:type>test</dc:type>
>     <dc:format>application/msword</dc:format>
>     <dc:language>Fr</dc:language>
>     <dc:rights>Non restreint</dc:rights>
> </oaidc:dc>
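For readers who want to see what the xpathSelect expressions return for this document, the sketch below evaluates them with the standard javax.xml.xpath API from the JDK. It does not use or reproduce the Lius Lite XMLParser. The class name XPathSketch, the helper select, and the shortened copy of the sample document are assumptions made for the example. The "dc" prefix is bound to the namespace URI given in the config.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Shortened copy of the sample document (\u00e8 is the "e grave" in the title).
    static final String SAMPLE =
        "<oaidc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\""
        + " xmlns:oaidc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\">"
        + "<dc:title>Archim\u00e8de et Lius</dc:title>"
        + "<dc:creator>Rida Benjelloun</dc:creator>"
        + "<dc:subject>Java</dc:subject>"
        + "<dc:subject>XML</dc:subject>"
        + "</oaidc:dc>";

    // Evaluate an XPath expression and return the text of every matching node.
    static List<String> select(String xml, String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefixed names such as dc:title
            Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return DC_NS.equals(uri) ? "dc" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    List<String> prefixes = new ArrayList<String>();
                    if (DC_NS.equals(uri)) {
                        prefixes.add("dc");
                    }
                    return prefixes.iterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE, "//dc:subject")); // prints [Java, XML]
        System.out.println(select(SAMPLE, "//dc:creator")); // prints [Rida Benjelloun]
    }
}
```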
> 3- Java Code
> LiusConfig lc = LiusConfig.getInstance(configPathString);
> LiusLogger.setLoggerConfigFile(log4jPathString);
> File testFile = new File("test.xml");
> try {
>     Parser parser = ParserFactory.getParser(testFile, lc);
>
>     // Full text of the document
>     String fullText = parser.getContentStr();
>
>     // A field with a single value
>     Content title = parser.getContent("title");
>     String titleStr = title.getValue();
>
>     // A field with several values
>     Content subject = parser.getContent("subject");
>     String[] subjects = subject.getValues();
>
>     // ... and so on for the other fields. Or fetch all of them at once:
>     List<Content> contents = parser.getContents();
> } catch (MimeInfoException e) {
>     e.printStackTrace();
> } catch (IOException e) {
>     e.printStackTrace();
> } catch (LiusException e) {
>     e.printStackTrace();
> }
> best regards
> Rida Benjelloun