Re: MSPowerPointExtractor problem
Ryan,

thanks for your reply. I have also seen the posts from Sudhakar on this subject, who seems to be contributing a whole lot of code here, which is a great thing. The problem persists in that code as well, though, so I think we should solve this encoding problem in your code first (it is simpler; the fix could later be integrated into Sudhakar's code if that gets checked in).

I have tested this with a simple PPT file containing just the following text:

Umlaut-Test Ökologie, Mühsal, Größe, Grätsche

I get the following console output with this text:

Umlaut-Test \326kologie, M\374hsal, Gr\374\337e, Gr\344tsche

Here is the output I get in a web browser (through a web app, viewed in HTML source mode):

Umlaut-Test ÷kologie, M¸hsal, Gr¸?e, Gr?tsche

German umlauts and other special characters work fine whenever I extract text from Word documents or Excel spreadsheets using POI and Ryan Ackley's TextMining framework.

Just for the record: I have only tested this on my own configuration (Mac OS X 10.3.4, Java 1.4.2_03), so I have no idea how these classes behave on Linux or Windows. Can anybody confirm this? I have seen some German names on this list ;-)

Thanks for all the work you put into this.

Ralph Scheuer

On 01.08.2004 at 08:07, Ryan Rhodes wrote:

Hi Ralph,

I haven't tested the PPT extractor with any other languages. I remember reading about other people having problems with different character sets, though. Could you send a before-and-after example file here or to Bugzilla?

-Ryan Rhodes

-----Original Message-----
From: Ralph Scheuer [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 28, 2004 10:01 AM
To: slide
Subject: MSPowerPointExtractor problem

Hello everybody,

When I was searching for a Java class to extract text from PowerPoint files, I accidentally discovered Slide. I pulled the MSPowerPointExtractor class and some other stuff it depends on via CVS and tried it for some text extraction. The method I used looks very similar to the provided example main method (see below).

However, when I tried to extract text from a German PowerPoint presentation, I had some problems with the encoding. I did not know which encoding to use; converting the output to ISO Latin 1 with my text editor solved only part of the problem (some German umlauts were displayed correctly, some were not).

Is this a known issue or am I doing something wrong? Any hints for me?

Thanks in advance.

Ralph Scheuer

BTW, I am using Mac OS X 10.3.4 with JDK 1.4.2_03; the native encoding on this platform is MacRoman.

public static String contentStringForData(NSData data){
    StringBuffer buf = new StringBuffer();
    try{
        ByteArrayInputStream input = data.stream();
        MSPowerPointExtractor ex = new MSPowerPointExtractor(null, null);
        Reader reader = ex.extract(input);
        int c;
        // read() returns -1 at end of stream; stop before appending it
        while( (c = reader.read()) != -1 ){
            buf.append((char)c);
        }
    }catch(Exception e){
        // any exception is silently ignored; the text read so far is returned
    }
    return buf.toString();
}

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
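A note on the two outputs reported in this message: the octal escapes in the console output (\326, \374, \337, \344) are the ISO-8859-1 values of Ö, ü, ß and ä, and the same byte values read as MacRoman are ÷ and ¸, which matches the browser output. That is only one plausible reading of the symptoms and is not confirmed anywhere in the thread. The sketch below checks the arithmetic; the class name UmlautBytes and the helper decodeLatin1 are made up for illustration, it uses present-day Java rather than JDK 1.4, and the MacRoman part only runs if the JVM ships the "x-MacRoman" charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UmlautBytes {

    /** Turns int values (0..255) into a byte array. */
    static byte[] toBytes(int... values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return bytes;
    }

    /** Decodes the given single-byte values as ISO-8859-1 text. */
    static String decodeLatin1(int... values) {
        return new String(toBytes(values), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Octal literals copied from the console output in the message above.
        int[] escapes = {0326, 0374, 0337, 0344};
        for (int value : escapes) {
            String ch = decodeLatin1(value);
            System.out.println("\\" + Integer.toOctalString(value)
                    + " as ISO-8859-1 is U+"
                    + Integer.toHexString(ch.charAt(0)).toUpperCase());
        }

        // Optional: the same bytes interpreted as MacRoman, if that charset is available.
        if (Charset.isSupported("x-MacRoman")) {
            Charset macRoman = Charset.forName("x-MacRoman");
            System.out.println(new String(toBytes(0326, 0374), macRoman));
        }
    }
}
```

If this reading is right, the characters coming out of the extractor are Latin-1 values that get displayed with the platform default (MacRoman) somewhere further down the line.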
Re: MSPowerPointExtractor problem
Hm,

Basically, we have concentrated on the English language, so we never faced any problems like this. It has become a new task for our team now :-) Thanks to Ralph for pointing out the problem. We will work on it and let the Jakarta team know :-)

Regards
Sudhakar
RE: MSPowerPointExtractor problem
Hi Ralph,

I haven't tested the PPT extractor with any other languages. I remember reading about other people having problems with different character sets, though. Could you send a before-and-after example file here or to Bugzilla?

-Ryan Rhodes
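A before-and-after comparison of the kind asked for here can also be reproduced without a PPT file. The sketch below is not the Slide extractor's code; the class ReaderDrain and its methods readAll and decode are hypothetical names, and it uses present-day Java (StandardCharsets, try-with-resources) rather than JDK 1.4. It drains a Reader without ever appending the -1 end marker, and it decodes the same bytes once with the charset they were written in and once with a different one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderDrain {

    /** Reads all characters from the reader; the -1 end marker is never appended. */
    static String readAll(Reader reader) {
        StringBuilder buf = new StringBuilder();
        char[] chunk = new char[1024];
        try {
            int n;
            while ((n = reader.read(chunk)) != -1) {
                buf.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toString();
    }

    /** Decodes a byte stream with an explicitly named charset, not the platform default. */
    static String decode(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            return readAll(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String expected = "M\u00FChsal"; // "Mühsal", written with an escape to stay source-encoding neutral
        byte[] latin1 = expected.getBytes(StandardCharsets.ISO_8859_1);

        String after = decode(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        String before = decode(new ByteArrayInputStream(latin1), StandardCharsets.UTF_8);

        System.out.println("matching charset round-trips: " + after.equals(expected));
        System.out.println("mismatched charset round-trips: " + before.equals(expected));
    }
}
```

The point of passing the Charset explicitly is that the result no longer depends on the machine it runs on, which is exactly what differs between a Mac OS X box with MacRoman as default and a Linux or Windows box.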
RE: MSPowerPointExtractor problem
Check this,

http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java
RE: MSPowerPointExtractor problem
Hello All,

This was my first contribution for the Jakarta team:

http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java

It seems another expert (Ryan Rhodes, [EMAIL PROTECTED]) has already started working on it, based on that first contribution. That sounds great to me. So, in order to speed up the development of the PowerPoint extractor, I wanted to contribute our team's work on it.

Authors: Sudhakar Chavali ([EMAIL PROTECTED]) and Hari Shanker Goud ([EMAIL PROTECTED])

Have a look at the source code below.

Regards
Sudhakar

/**
 * Title: DocumentParserException class
 * Description: This is the root exception class for errors raised by the different parsers
 * @author Sudhakar
 * @version 1.0
 */
public class DocumentParserException extends Exception {

    /**
     * Constructs a new exception with null as its detail message.
     */
    public DocumentParserException() {
    }

    /**
     * Constructs a new exception with the specified detail message.
     * @param message
     */
    public DocumentParserException(String message) {
        super(message);
    }

    /**
     * Constructs a new exception with the specified detail message and cause.
     * @param message
     * @param cause
     */
    public DocumentParserException(String message, Throwable cause) {
        super(message, cause);
    }
}

_______________________________________________

import java.io.*;

/**
 * Title: Summary Base
 * Description: A generic interface that reads the document's summary information and returns it through different methods
 * @author Sudhakar Chavali
 * @version 1.0
 */
public interface SummaryBase {

    /**
     * A method that returns the document's author
     * @return String
     */
    public String getDocAuthor();

    /**
     * A method that returns the document's creation date
     * @return String
     */
    public String getDocCreatedDate();

    /**
     * A method that returns the document's keywords
     * @return String
     */
    public String getDocKeywords();

    /**
     * A method that returns the document's comments
     * @return String
     */
    public String getDocComments();

    /**
     * A method that returns the document name
     * @return String
     */
    public String getDocName();

    /**
     * A method that returns the document's subject
     * @return String
     */
    public String getDocSubject();

    /**
     * A method that returns the document's title
     */
    public String getDocTitle();

    /**
     * A method that reads the document's summary information
     * @throws DocumentParserException
     */
    public void read() throws DocumentParserException;

    /**
     * A method that writes the document's summary information as XML into a file
     * @param strXMLFile
     * @throws DocumentParserException
     */
    public void write(String strXMLFile) throws DocumentParserException;

    /**
     * A method that writes the document's summary information as XML into an OutputStream object
     * @param out
     * @throws DocumentParserException
     */
    public void write(OutputStream out) throws DocumentParserException;

    /**
     * A method that returns the document's summary as an XML string
     * @return String
     * @throws DocumentParserException
     */
    public String getSummaryAsXML() throws DocumentParserException;

    /**
     * A method that returns the document's summary information as plain text
     * @return String
     * @throws DocumentParserException
     */
    public String getSummaryAsText() throws DocumentParserException;
}

_______________________________________________

import java.io.*;

/**
 * A generic document that reads the document's text and parses it into plain ASCII text using the different methods.
 */
public interface Document {

    /**
     * A method that returns the document's text after parsing. This method should be called after calling the read method
     * @return String
     * @see #read()
     * @throws DocumentParserException
     */
    public abstract String getText() throws DocumentParserException;

    /**
     * A method that returns the parsed text as a byte array. This method should be called after calling the read method
     * @return byte[]
     * @throws DocumentParserException
     */
    public abstract byte[] getBytes() throws DocumentParserException;

    /**
     * A method that writes the parsed text into the OutputStream object. This method should be called after calling the read method
     * @param out
     * @throws DocumentParserException
     */
    public abstract void write(OutputStream out) throws DocumentParserException, Exception;

    /**
     * A method that reads and parses the document into plain text
     * @throws DocumentParserException
     */
    public abstract void read() throws DocumentParserException, Exception;

    /**