[ https://issues.apache.org/jira/browse/CONNECTORS-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wright reassigned CONNECTORS-16:
-------------------------------------

    Assignee: Karl Wright

> JCIFS connector's document fingerprinting feature is not general enough
> -----------------------------------------------------------------------
>
>                 Key: CONNECTORS-16
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-16
>             Project: Lucene Connector Framework
>          Issue Type: Improvement
>          Components: Framework agents process, Framework crawler agent, GTS connector, JCIFS connector, LiveLink connector, Lucene/SOLR connector, Meridio connector, RSS connector, SharePoint connector, Web connector
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Minor
>
> The JCIFS connector has a feature, called "fingerprinting", which allows it
> to classify documents according to the ability of the back end to index that
> content. At the moment, this fingerprinter recognizes PDFs, Microsoft Office
> files, and text files as indexable. One could imagine, though, that different
> SOLR plugins, etc. might have more capability than that. Also, other
> connectors could potentially benefit from similar technology, specifically
> any connector that deals with binary documents.
>
> One approach to solving this problem would be to remove the feature entirely,
> and let whatever pipeline exists in SOLR determine indexability after the
> fact. The reason this feature was added at MetaCarta, however, is that it may
> be possible to exclude an unusable document without having to fetch the whole
> thing, and (at least for MetaCarta clients) the number of unindexable files
> of gigantic size was a big concern.
>
> Another approach might be to tie the functionality in with the output
> connector interface, so that an output connector would (somehow) determine
> the applicability of a document. This would require some care to make it
> possible to fingerprint without having to download the entire document, but
> would otherwise have the correct overall structure.
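
To make the second approach concrete, here is a minimal sketch, not taken from the framework, of fingerprinting from a document's leading bytes and letting the output side decide applicability. Everything named here is an assumption for illustration: fingerprint, OutputConnector, should_fetch, the kind labels, and the crude text heuristic are all hypothetical. Only the two byte signatures (the "%PDF-" header and the OLE2 container header used by legacy Microsoft Office files) are well-known facts.

```python
from typing import Iterable, Optional

# Leading-byte signatures. The labels on the right are made up for this sketch.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy MS Office container
]

TEXT_CONTROL_BYTES = {9, 10, 13}  # tab, LF, CR


def fingerprint(prefix: bytes) -> Optional[str]:
    """Classify a document from its first bytes only; None means unknown."""
    for magic, kind in SIGNATURES:
        if prefix.startswith(magic):
            return kind
    # Crude text heuristic (an assumption): non-empty and no control bytes
    # other than tab, LF, and CR.
    if prefix and all(b >= 32 or b in TEXT_CONTROL_BYTES for b in prefix):
        return "text"
    return None


class OutputConnector:
    """Hypothetical output connector that declares which kinds it can index."""

    def __init__(self, accepted_kinds: Iterable[str]) -> None:
        self.accepted_kinds = set(accepted_kinds)

    def accepts(self, kind: str) -> bool:
        return kind in self.accepted_kinds


def should_fetch(prefix: bytes, output: OutputConnector) -> bool:
    """Decide from a short prefix whether the full document is worth fetching."""
    kind = fingerprint(prefix)
    return kind is not None and output.accepts(kind)


if __name__ == "__main__":
    solr_like = OutputConnector({"pdf", "text"})
    print(should_fetch(b"%PDF-1.7\n", solr_like))
    print(should_fetch(b"MZ\x90\x00\x03", solr_like))
```

The point of the split is that the repository connector only has to read a small prefix, while the set of acceptable kinds lives with the output connector and can grow as the back end gains capability.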