Hello All,
Just as I sent this message, I came across the following line:

WARN org.apache.jackrabbit.core.query.lucene.TextExtractorJob 16.10.2009 08:52:31 -- Exception while indexing binary property: java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider

This line appeared after a PDF file was added to the system.
Unfortunately, I don't have the full exception stack trace, as it was truncated.
It looks like I'm missing a jar, probably the http://bouncycastle.org/ Crypto API.

Regards,
Denis

On Fri, Oct 16, 2009 at 8:50 AM, Denis Demichev <[email protected]> wrote:

> Hello All,
>
> Matteo wrote:
> >>Sorry, I missed something, how can you say that STK is related to PDF?
> STK has a bunch of sample files in DMS, and the majority of them are PDFs.
> I still cannot index PDFs even if I delete the Lucene indexes.
>
> However, while indexing an RTF file I get an exception:
>
> java.lang.IllegalArgumentException: The document is really a RTF file
>     at org.apache.poi.hwpf.HWPFDocument.verifyAndBuildPOIFS(HWPFDocument.java:114)
>     at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:49)
>     at org.apache.jackrabbit.extractor.MsWordTextExtractor.extractText(MsWordTextExtractor.java:64)
>     at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
>     at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
>     at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
>
> It looks like org.apache.jackrabbit.extractor.MsWordTextExtractor is chosen
> for text extraction instead of
> org.apache.jackrabbit.extractor.RTFTextExtractor.
> I.e. an invalid file type is detected here, at line 402 of
> org.apache.jackrabbit.core.query.lucene.NodeIndexer:
>
>     InternalValue typeValue = getValue(NameConstants.JCR_MIMETYPE);
>
> Here's the implementation of getValue:
>
>     /**
>      * Utility method that extracts the first value of the named property
>      * of the current node. Returns <code>null</code> if the property does
>      * not exist or contains no values.
>      *
>      * @param name property name
>      * @return value of the named property, or <code>null</code>
>      * @throws ItemStateException if the property can not be accessed
>      */
>     protected InternalValue getValue(Name name) throws ItemStateException {
>         try {
>             PropertyId id = new PropertyId(node.getNodeId(), name);
>             PropertyState property =
>                 (PropertyState) stateProvider.getItemState(id);
>             InternalValue[] values = property.getValues();
>             if (values.length > 0) {
>                 return values[0];
>             } else {
>                 return null;
>             }
>         } catch (NoSuchItemStateException e) {
>             return null;
>         }
>     }
>
> So my assumption is that the JCR node holding the RTF file has the wrong
> MIME type stored with it... I'm not sure how to check this MIME value in
> Magnolia, though.
> It should be "application/rtf" or "text/rtf", not
> "application/vnd.ms-word" or "application/msword".
>
> I would really appreciate any help with PDF - I don't see any exception and
> thus cannot research what exactly went wrong.
>
> Thank you!
>
> Regards,
> Denis
>
> On Fri, Oct 16, 2009 at 2:40 AM, Matteo Pelucco <[email protected]> wrote:
>
>> Denis Demichev wrote:
>>
>>> Hello Matteo,
>>>
>>> Thank you for your quick response.
>>
>> Magnolia gives me one T-shirt for each message I write.
>> I now have a shop :-)
>>
>>> >>You should be able to use query manager and to successfully execute
>>> >>this query:
>>> >>SELECT * FROM nt:base
>>>
>>> I tried to run it against DMS successfully: 244 nodes returned in 734ms
>>
>> OK, this is proof that DMS is indexed.
>> Now try to delete ..workspaces/dms/index/* from the filesystem.
>> At the next startup you should see something saying:
>>
>> 'loading DMS workspace'
>>
>> (if SearchIndexer is configured correctly for that ws in workspace.xml)
>>
>> and PDFs will be indexed (again).
>> I would like to force a re-index to be sure that no exception was
>> thrown in a past index-building phase.
>>
>>> Unfortunately no luck with PDF.
>>> As STK has majority of PDF documents in
>>> DMS that could be the reason why I couldn't search documents.
>>
>> Sorry, I missed something, how can you say that STK is related to PDF?
>> STK, afaik, is a "framework" which helps to build pages, nothing related to
>> JCR / Lucene indexes, isn't it?
>> Or maybe you mean the new asset management shipped with Magnolia?
>>
>>> Still I'm
>>> not sure when exactly Magnolia will index this or that document in DMS.
>>
>> It should be at save time, but I'm not 100% sure.
>>
>> Sorry, I don't have much experience with PDF indexing, but are you sure
>> that your PDFs are indexable? You can try to wrap PDFIndexer and log
>> something, but it is not a quick debugging option...
>>
>> :-(
>>
>> matteo
>>
>> ----------------------------------------------------------------
>> For list details see
>> http://www.magnolia-cms.com/home/community/mailing-lists.html
>> To unsubscribe, E-mail to: <[email protected]>
>> ----------------------------------------------------------------

----------------------------------------------------------------
For list details see
http://www.magnolia-cms.com/home/community/mailing-lists.html
To unsubscribe, E-mail to: <[email protected]>
----------------------------------------------------------------
