[jira] Commented: (TIKA-7) Lius Lite remove all lucene dependencies from Lius and use Nutch office parsers

Bertrand Delacretaz (JIRA) Wed, 13 Jun 2007 07:38:00 -0700

    [ https://issues.apache.org/jira/browse/TIKA-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12504261 ]


Bertrand Delacretaz commented on TIKA-7:
----------------------------------------

I agree with Jukka that it's better to have some working code in SVN as soon as 
possible.

That code might get completely rewritten along the way, but it will give us 
something concrete to play with and to exchange ideas about.

> Lius Lite remove all lucene dependencies from Lius  and use Nutch office 
> parsers
> --------------------------------------------------------------------------------
>
>                 Key: TIKA-7
>                 URL: https://issues.apache.org/jira/browse/TIKA-7
>             Project: Tika
>          Issue Type: New Feature
>          Components: general
>         Environment: Java 1.5
>            Reporter: Rida Benjelloun
>         Attachments: liuslite.patch, liusLite.zip
>
>
> Hi,
> This is a work-in-progress version of Lius. This release removes all Lucene 
> dependencies and uses the Nutch Office parsers, because they are based on 
> Apache POI.
> Lius Lite offers four ways to extract content:
> - Document full-text extraction
> - XPath extraction
> - Regex extraction
> - Document metadata extraction (not implemented for all parsers)
> Lius Lite uses an XML config file to configure the parsers and the 
> information to extract. Please see config.xml in the config folder.
> See also the JUnit tests.
> Here is an example of XML parsing:
> 1- XML Config
> <parser name="text-xml" class="liuslite.parser.xml.XMLParser">
>   <namespace>http://purl.org/dc/elements/1.1/</namespace>
>   <mime>application/xml</mime>
>   <extract>
>     <content name="title" xpathSelect="//dc:title"/>
>     <content name="subject" xpathSelect="//dc:subject"/>
>     <content name="creator" xpathSelect="//dc:creator"/>
>     <content name="description" xpathSelect="//dc:description"/>
>     <content name="publisher" xpathSelect="//dc:publisher"/>
>     <content name="contributor" xpathSelect="//dc:contributor"/>
>     <content name="type" xpathSelect="//dc:type"/>
>     <content name="format" xpathSelect="//dc:format"/>
>     <content name="identifier" xpathSelect="//dc:identifier"/>
>     <content name="language" xpathSelect="//dc:language"/>
>     <content name="rights" xpathSelect="//dc:rights"/>
>     <content name="outLinks">
>       <regexSelect>
>         <![CDATA[
>           ([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)
>         ]]>
>       </regexSelect>
>     </content>
>   </extract>
> </parser>
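
To make the outLinks rule above concrete, here is a minimal sketch of how that
regexSelect pattern could be applied with plain java.util.regex. This is not
Lius Lite code: the class OutLinkSketch and the method findOutLinks are
hypothetical names, and I am assuming the pattern is run against extracted
text rather than raw markup (on raw XML it would also match prefixed names
such as dc:identifier).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lius Lite.
class OutLinkSketch {

    // Pattern copied verbatim from the <regexSelect> element of the config.
    static final Pattern OUT_LINK = Pattern.compile(
        "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]"
        + "(([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}"
        + "(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)");

    // Returns every substring of the text that matches the pattern, in order.
    static List<String> findOutLinks(String text) {
        List<String> links = new ArrayList<String>();
        Matcher m = OUT_LINK.matcher(text);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the whole URI-like match
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(findOutLinks("see http://www.apache.org for details"));
    }
}
```

The pattern accepts any scheme followed by a colon, so it is better read as
"things that look like a URI" than as a strict URL validator.
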
> 2- XML Document
> <oaidc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
>           xmlns:oaidc="http://www.openarchives.org/OAI/2.0/oai_dc/">
>       <dc:title>Archimède et Lius</dc:title>
>       <dc:creator>Rida Benjelloun</dc:creator>
>       <dc:subject>Java</dc:subject>
>       <dc:subject>XML</dc:subject>
>       <dc:subject>XSLT</dc:subject>
>       <dc:subject>JDOM</dc:subject>
>       <dc:subject>Indexation</dc:subject>
>       <dc:description>Framework d'indexation des documents XML, HTML, PDF etc.. </dc:description>
>       <dc:publisher>Doculibre</dc:publisher>
>       <dc:identifier>http://www.apache.org</dc:identifier>
>       <dc:date>2000-12</dc:date>
>       <dc:type>test</dc:type>
>       <dc:format>application/msword</dc:format>
>       <dc:language>Fr</dc:language>
>       <dc:rights>Non restreint</dc:rights>    
> </oaidc:dc>
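
For the XPath side, the sketch below shows how expressions such as
//dc:subject could be evaluated against a document like the one above using
only the JDK's javax.xml.xpath API. I don't know how XMLParser does this
internally, so treat it as an illustration only: DcXPathSketch, select and
SAMPLE are hypothetical names, and SAMPLE is a shortened copy of the document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Lius Lite.
class DcXPathSketch {

    // Namespace URI taken from the <namespace> element of the config.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Shortened copy of the sample document (assumption: only a few fields kept).
    static final String SAMPLE =
        "<oaidc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\""
        + " xmlns:oaidc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\">"
        + "<dc:creator>Rida Benjelloun</dc:creator>"
        + "<dc:subject>Java</dc:subject>"
        + "<dc:subject>XML</dc:subject>"
        + "</oaidc:dc>";

    // Evaluates the expression and returns the text of every selected node.
    static List<String> select(String xml, String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed so that dc: steps can match
            Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the "dc" prefix used in the xpathSelect attributes.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return DC_NS.equals(uri) ? "dc" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    List<String> prefixes = new ArrayList<String>();
                    if (DC_NS.equals(uri)) {
                        prefixes.add("dc");
                    }
                    return prefixes.iterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(
                expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            // Parsing and XPath errors are wrapped to keep the sketch short.
            throw new IllegalStateException("XPath extraction failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE, "//dc:subject"));
    }
}
```

Returning a list keeps repeated elements such as dc:subject together, which
lines up with the getValues() call in the Java example from the issue.
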
> 3- Java Code 
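> // Load the parser configuration and tell the logger where its log4j config is.
> LiusConfig lc = LiusConfig.getInstance(configPathString);
> LiusLogger.setLoggerConfigFile(log4jPathString);
> File testFile = new File("test.xml");
> try {
>     // Get a parser for the file, based on the configuration.
>     Parser parser = ParserFactory.getParser(testFile, lc);
>     // Full text of the document.
>     String fullText = parser.getContentStr();
>
>     // A single-valued field declared in the <extract> section.
>     Content title = parser.getContent("title");
>     String titleStr = title.getValue();
>
>     // A multi-valued field (the sample document has several dc:subject elements).
>     Content subject = parser.getContent("subject");
>     String[] subjects = subject.getValues();
>     // etc ...
>
>     // Or fetch all configured fields at once:
>     List<Content> contents = parser.getContents();
> } catch (MimeInfoException e) {
>     e.printStackTrace();
> } catch (IOException e) {
>     e.printStackTrace();
> } catch (LiusException e) {
>     e.printStackTrace();
> }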
> Best regards,
> Rida Benjelloun

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
