Hi Thomas,

thank you very much. I have added the analyzer, and Excel files are OK now. But I still have problems with one PDF file. It seems that PDFBox is not able to handle some condition in this file, so it is probably not a Jackrabbit problem. Here is the error message:
13:45:40,453 WARN PdfTextExtractor:91 - Failed to extract PDF text content
java.lang.NullPointerException
    at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
    at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
    at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.apache.jackrabbit.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:75)
    at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
    at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
    at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
    at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
    at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:619)

I have tested some other PDF files and they seem OK: full-text CJK search works on them. So I suspect that it may be a PDFBox limitation.

The PDF file giving me problems was generated with Distiller 9.0, PDF version 1.5. Nothing special, or at least nothing I am aware of.

rgds,
canal

________________________________
From: Thomas Müller <[email protected]>
To: [email protected]
Sent: Monday, August 10, 2009 4:36:25 PM
Subject: Re: full text search for CJK languages

Hi,

I'm not sure, but I think you need to use class
org.apache.lucene.analysis.cjk.CJKAnalyzer

See http://wiki.apache.org/jackrabbit/Search - parameter analyzer

Can you please verify this is correct? I will then update the documentation.

Regards,
Thomas

On Sun, Aug 9, 2009 at 4:38 PM, go canal<[email protected]> wrote:
> Just tested:
> the default configuration supports full CJK text search for Text, Word and
> PPT file; but can not search PDF/Excel files.
>
> rgds,
> canal
>
> ________________________________
> From: go canal <[email protected]>
> To: [email protected]
> Sent: Sunday, August 9, 2009 10:20:28 PM
> Subject: full text search for CJK languages
>
> Hi,
> could not find detailed info wrt supporting full text search for 2-byte
> languages like CJK (Chinese, Japanese and Korean).
>
> 1) anybody know if there is one such library available ? and
> 2) how to config this in Jackrabbit ? Should I replace all the extractors in
> the current configuration:
> <SearchIndex .....
> <param name="textFilterClasses"
> value="org.apache.jackrabbit.extractor.PlainTextExtractor,
> org.apache.jackrabbit.extractor.MsWordTextExtractor,
> org.apache.jackrabbit.extractor.MsExcelTextExtractor,
> org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
> org.apache.jackrabbit.extractor.PdfTextExtractor,
> org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
> org.apache.jackrabbit.extractor.RTFTextExtractor,
> org.apache.jackrabbit.extractor.HTMLTextExtractor,
> org.apache.jackrabbit.extractor.XMLTextExtractor" />
> </SearchIndex>
> rgds,
> canal
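
PS: for anyone finding this thread later, here is a rough sketch of what the analyzer setting discussed above might look like in the SearchIndex element. The analyzer parameter name and the CJKAnalyzer class come from Thomas's reply, and the extractor list is the one from the original question. The class attribute on SearchIndex and the omission of all other params are my assumptions, so adapt this to your own workspace configuration. I also assume the Lucene jar that contains CJKAnalyzer is on the classpath.

```xml
<!-- Sketch only: other SearchIndex params from your existing config are omitted. -->
<!-- The class attribute below is an assumption; keep whatever your config already uses. -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <!-- Analyzer suggested in this thread for CJK text -->
  <param name="analyzer"
         value="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
  <!-- Extractor list unchanged from the original question -->
  <param name="textFilterClasses"
         value="org.apache.jackrabbit.extractor.PlainTextExtractor,
                org.apache.jackrabbit.extractor.MsWordTextExtractor,
                org.apache.jackrabbit.extractor.MsExcelTextExtractor,
                org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
                org.apache.jackrabbit.extractor.PdfTextExtractor,
                org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
                org.apache.jackrabbit.extractor.RTFTextExtractor,
                org.apache.jackrabbit.extractor.HTMLTextExtractor,
                org.apache.jackrabbit.extractor.XMLTextExtractor"/>
</SearchIndex>
```

With this in place the extractors stay as they were; only the analyzer changes. Documents indexed before the change would presumably need to be re-indexed to be tokenized with the new analyzer.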
