Hi,
I wrote a small Java application on Windows, using Eclipse, that takes a
directory as input, tries to parse every document it finds, and then indexes
the extracted text with Lucene.
The problem is that handler.toString() returns an empty string for every
document.
Here is the code:
Parser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
ParseContext parseContext = new ParseContext();
ContentHandler handler = new BodyContentHandler();
parser.parse(new FileInputStream(file), handler, metadata, parseContext);
System.out.println("-------------------------------------------------------");
System.out.println("File: " + file);
for (String name : metadata.names()) {
    System.out.println("metadata: " + name + " - " + metadata.get(name));
}
System.out.println("Content: " + handler.toString());
document.add(new Field("fulltext", handler.toString(), Store.NO, Index.ANALYZED));
Eclipse Console results:
File: C:\Program Files\cwseidocuments\2012\AgileSoftware.ppt
metadata: Content-Type - application/vnd.ms-powerpoint
metadata: resourceName - AgileSoftware.ppt
Content:
path= C:\Program Files\documents\2012\English.pdf
-------------------------------------------------------
File: C:\Program Files\documents\2012\English.pdf
metadata: Content-Type - application/pdf
metadata: resourceName - English.pdf
Content:
path= C:\Program Files\documents\2012\hotle.doc
-------------------------------------------------------
File: C:\Program Files\cwseidocuments\2012\hotle.doc
metadata: Content-Type - application/msword
metadata: resourceName - hotle.doc
Content:
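In the output above the content type is detected correctly but the text is empty. One thing worth ruling out is whether the format-specific parser classes are on the classpath at all, since AutoDetectParser itself lives in tika-core while the PDF and Office parsers ship in the separate tika-parsers module. The following is a minimal sketch of such a check, not a confirmed diagnosis. The class ParserClasspathCheck and the method isOnClasspath are hypothetical names, and the two parser class names are assumed to match the Tika version in use.

```java
public class ParserClasspathCheck {

    // Returns true if the named class can be located by this class's
    // class loader. The class is looked up without being initialized.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false,
                    ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names: the first is in tika-core, the other two
        // are in the tika-parsers module (verify for your Tika version).
        String[] names = {
            "org.apache.tika.parser.AutoDetectParser",
            "org.apache.tika.parser.pdf.PDFParser",
            "org.apache.tika.parser.microsoft.OfficeParser"
        };
        for (String name : names) {
            System.out.println(name + " -> "
                    + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If the last two classes are reported as missing when this is run with the same build path as the application, then only the core jar is available. In that case AutoDetectParser can still identify the type but has no parser to extract the text.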
What is wrong with my code?
Thanks for your help.
Mass