Re: MSPowerPointExtractor problem

2004-08-02 Thread Ralph Scheuer
Ryan,
thanks for your reply.
I have also seen the posts from Sudhakar on this subject, who seems to 
be contributing a whole lot of code here. That is a great thing, but 
the problem persists in that code as well, so I think we should solve 
this encoding problem in your code first (it is simpler, and the fix 
could later be integrated into Sudhakar's code once that is checked in).

I have tested this with a simple PPT file containing just the following 
text:

Umlaut-Test
Ökologie, Mühsal, Größe, Grätsche

I get the following console output with this text:

Umlaut-Test
\326kologie, M\374hsal, Gr\374\337e, Gr\344tsche

Here is the output I get in a web browser (through a web app, view 
HTML source mode):

Umlaut-Test ÷kologie, M¸hsal, Gr¸?e, Gr?tsche

German umlauts and other special characters come out correctly 
whenever I extract text from Word documents or Excel spreadsheets using 
POI and Ryan Ackley's TextMining framework.
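
The octal escapes in the console output above look like single bytes in 
ISO-8859-1 (Latin-1). The following is only a sketch to check that 
reading, not part of Slide or POI: the class and method names are made 
up for illustration, it assumes every backslash is followed by exactly 
three octal digits, and it uses StandardCharsets, which needs a newer 
JDK than the 1.4.2 mentioned in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class OctalEscapeDemo {

    /** Turns "\ooo" octal escapes into single bytes and decodes the result as ISO-8859-1. */
    static String decodeOctalAsLatin1(String escaped) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char ch = escaped.charAt(i);
            if (ch == '\\' && i + 4 <= escaped.length()) {
                // Assumption: three octal digits follow the backslash, e.g. \326 -> 0xD6.
                bytes.write(Integer.parseInt(escaped.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                bytes.write(ch); // plain ASCII character, copied as one byte
                i++;
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The console line as posted in this message.
        String console = "\\326kologie, M\\374hsal, Gr\\374\\337e, Gr\\344tsche";
        System.out.println(decodeOctalAsLatin1(console));
    }
}
```

Decoded this way, \326 is Ö, \374 is ü, \337 is ß and \344 is ä, which 
suggests the extractor hands back Latin-1 byte values and that the 
damage happens when those values are later shown under a different 
encoding. Note that the posted line contains \374 in the third word, so 
it decodes to "Grüße" rather than "Größe".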

Just for the record: I have only tested this on my own configuration 
(Mac OS X 10.3.4, Java 1.4.2_03), so I have no idea how these classes 
might behave on Linux or Windows. Can anybody confirm this? I have seen 
some German names on this list ;-)

Thanks for all the work you put into this.
Ralph Scheuer
On 01.08.2004 at 08:07, Ryan Rhodes wrote:
Hi Ralph,
I haven't tested the PPT extractor with any other languages.  I remember
reading about other people having problems with different character sets
though.

Could you send a before and after example file here or to bugzilla?
-Ryan Rhodes
-----Original Message-----
From: Ralph Scheuer [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 28, 2004 10:01 AM
To: slide
Subject: MSPowerPointExtractor problem
Hello everybody,
When I was searching for a Java class to extract text from PowerPoint
files, I accidentally discovered Slide.
I pulled the MSPowerPointExtractor class and some other stuff it
depends on via CVS and tried it for some text extraction.
The method I used looks very similar to the provided example main
method (see below).
However, when I tried to extract text from a German PowerPoint
presentation, I had some problems with the encoding. I did not know
which encoding to use; converting the output to ISO Latin 1 with my
text editor solved only part of the problem (some German umlauts were
displayed correctly, some were not).
Is this a known issue or am I doing something wrong? Any hints for me?
Thanks in advance.
Ralph Scheuer
BTW, I am using Mac OS X 10.3.4 with JDK 1.4.2_03; the native encoding
on this platform is MacRoman.
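 // Needs java.io.ByteArrayInputStream and java.io.Reader on the import list.
 public static String contentStringForData(NSData data){
     StringBuffer buf = new StringBuffer();
     try{
         ByteArrayInputStream input = data.stream();
         MSPowerPointExtractor ex = new MSPowerPointExtractor(null, null);

         Reader reader = ex.extract(input);

         // Test for end of stream (-1) before appending, so the end
         // marker is never cast to a char and added to the buffer.
         int c;
         while( (c = reader.read()) != -1 )
         {
             buf.append((char)c);
         }
     }catch(Exception e){
         // Report the failure instead of silently swallowing it.
         e.printStackTrace();
     }

     return buf.toString();
 }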
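
A minimal sketch of two points relevant to the method above: reading a 
Reader up to, but not including, the -1 end marker, and decoding bytes 
with an explicitly named charset rather than the platform default 
(MacRoman on this machine). The class and method names are hypothetical 
and the code does not use MSPowerPointExtractor at all. Whether the 
extractor itself relies on the platform default is not confirmed in 
this thread, so treat that as an assumption to check.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    /** Reads every character from the reader; the -1 end marker is never appended. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder buf = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            buf.append((char) c);
        }
        return buf.toString();
    }

    /** Decodes raw bytes with a named charset instead of the platform default. */
    static String decode(InputStream in, Charset charset) throws IOException {
        try (Reader reader = new InputStreamReader(in, charset)) {
            return readAll(reader);
        }
    }

    public static void main(String[] args) throws IOException {
        // 0xD6 is the ISO-8859-1 byte for the capital O with umlaut.
        byte[] raw = { (byte) 0xD6, 'k', 'o' };
        String text = decode(new ByteArrayInputStream(raw), StandardCharsets.ISO_8859_1);
        System.out.println(text.length()); // three characters, no trailing end marker
    }
}
```

Naming the charset at the point where bytes become characters keeps the 
result the same on Mac OS X, Linux and Windows, which is the kind of 
cross-platform check asked for earlier in this thread.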
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: MSPowerPointExtractor problem

2004-08-02 Thread Koundinya (Sudhakar Chavali)
Hm,

Basically we have concentrated on the English language, so we never faced this problem. It
has become a new task for our team now :-)

Thanks to Ralph for pointing out the problem.

We will work on it and let the Jakarta team know :-)

Regards
Sudhakar







RE: MSPowerPointExtractor problem

2004-08-01 Thread Ryan Rhodes
Hi Ralph,

I haven't tested the PPT extractor with any other languages.  I remember
reading about other people having problems with different character sets
though.

Could you send a before and after example file here or to bugzilla?

-Ryan Rhodes





RE: MSPowerPointExtractor problem

2004-08-01 Thread Koundinya (Sudhakar Chavali)
Check this,

http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java




RE: MSPowerPointExtractor problem

2004-08-01 Thread Koundinya (Sudhakar Chavali)
Hello All,

This was my first contribution to the Jakarta team:
http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java

It seems another expert (Ryan Rhodes, [EMAIL PROTECTED]) has already
started working on it, based on that first contribution.

That sounds great to me.

So, in order to speed up development of the PowerPoint extractor, I just
wanted to contribute our team's efforts in developing it.

Authors: Sudhakar Chavali ([EMAIL PROTECTED]) and Hari Shanker Goud
([EMAIL PROTECTED])


Have a look at the source code below.


Regards
Sudhakar



/**
 * Title: DocumentParserException class
 * Description: Root exception class for errors raised by the different parsers.
 * @author Sudhakar
 * @version 1.0
 */

public class DocumentParserException
    extends Exception {

  /**
   * Constructs a new exception with null as its detail message.
   */
  public DocumentParserException() {
  }

  /**
   * Constructs a new exception with the specified detail message.
   * @param message the detail message
   */
  public DocumentParserException(String message) {
    super(message);
  }

  /**
   * Constructs a new exception with the specified detail message and cause.
   * @param message the detail message
   * @param cause the underlying cause
   */
  public DocumentParserException(String message, Throwable cause) {
    super(message, cause);
  }

}
_

import java.io.*;

/**
 *
 * Title: Summary Base
 * Description: A generic interface that reads the document's summary information and
 * returns it through different methods.
 * @author Sudhakar Chavali
 * @version 1.0
 */
public interface SummaryBase {
  /**
   * A method returns the Document's Author
   * @return String
   */
  public String getDocAuthor();

  /**
   * A method that returns the Document Created Date
   * @return String
   */
  public String getDocCreatedDate();

  /**
   * A method that returns the Document's Key words
   * @return String
   */
  public String getDocKeywords();

  /**
   * A method that returns the Document's comments
   * @return String
   */
  public String getDocComments();

  /**
   * A method that returns the Document Name
   * @return String
   */
  public String getDocName();

  /**
   * A method that returns the Document's Subject
   * @return String
   */
  public String getDocSubject();

  /**
   * A method that returns the Document's title
   * @return String
   */
  public String getDocTitle();

  /**
   * A method that reads the document's Summary Information
   * @throws DocumentParserException
   */
  public void read() throws DocumentParserException;

  /**
   * A method that writes the Document's summary information as an XML into the file
   * @param strXMLFile
   * @throws DocumentParserException
   */
  public void write(String strXMLFile) throws DocumentParserException;

  /**
   * A method that writes the document's summary information as an XML into
   * OutputStream Object
   * @param out
   * @throws DocumentParserException
   */
  public void write(OutputStream out) throws DocumentParserException;

  /**
   * A method that returns the Document's summary as an XML String
   * @return String
   * @throws DocumentParserException
   */
  public String getSummaryAsXML() throws DocumentParserException;

  /**
   * A method that returns document's summary information as normal text
   * @return String
   * @throws DocumentParserException
   */
  public String getSummaryAsText() throws DocumentParserException;
}

__

import java.io.*;

/**
 * A generic document that reads the document's text and parses it into normal ASCII
 * text using the different methods.
 */
public interface Document {

  /**
   * A method that returns the document's text after parsing. This method should be
   * called after calling the read method
   * @return String
   * @see #read()
   * @throws DocumentParserException
   */
  public abstract String getText() throws DocumentParserException;

  /**
   * A method that returns the parsed text as byte array. This method should be
   * called after calling the read method
   * @return byte[]
   * @throws DocumentParserException
   */
  public abstract byte[] getBytes() throws DocumentParserException;

  /**
   * A method that writes the parsed text into the OutputStream object. This method
   * should be called after calling the read method
   * @param out
   * @throws DocumentParserException
   */
  public abstract void write(OutputStream out) throws DocumentParserException, Exception;

  /**
   * A method that reads and parses the document into Normal text
   * @throws DocumentParserException
   */
  public abstract void read() throws DocumentParserException, Exception;

  /**