subject:"Index pdf files with your content in lucene."

Re: Index pdf files with your content in lucene.

2003-11-12 Thread Ernesto De Santis

Hello

Well, zipping the files did not work.

I can send the files by personal email if anybody wants them.

If somebody could post them on a web site, that would be very cool.
I cannot post them on a web site myself.

Ernesto.



Index pdf files with your content in lucene.

2003-11-11 Thread Ernesto De Santis

Classes for indexing PDF and Word files in Lucene.
Ernesto.

- Original Message -
From: Ernesto De Santis [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, October 29, 2003 12:04 PM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.

Hello all,

Thanks very much, Stephan, for your valuable help.
Attached you will find the source code of the PDFDocument and WordDocument classes.

Ernesto.

- Original Message -
From: Hartmann, Waehrisch & Feykes GmbH [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, October 28, 2003 11:10 AM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.

 Hi Ernesto,

 the IndexManager retrieves a list of files of a folder by calling the
 method getFilesInFolder of CmsObject. This method returns only empty files,
 i.e. with empty content. To get the content of a pdf file you have to
 reread the file:

 f = cms.readFile(f.getAbsolutePath());

 Bye,
 Stephan
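
The empty quotes in the reported error, "Header is corrupt ''", are consistent with this explanation: a parser handed zero bytes has no header line to read. Below is a minimal sketch of a guard that could run before extraction. It assumes only that a PDF normally begins with the marker %PDF-. The names PdfHeaderCheck and looksLikePdf are hypothetical and are not part of OpenCms or of the extraction library.

```java
import java.nio.charset.StandardCharsets;

final class PdfHeaderCheck {
    // Marker that normally opens a PDF file.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true only if the bytes are non-empty and start with "%PDF-". */
    static boolean looksLikePdf(byte[] contents) {
        if (contents == null || contents.length < PDF_MAGIC.length) {
            return false; // empty or too short: nothing a parser could read
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (contents[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] unread = new byte[0]; // what a file with empty content would hold
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("empty content looks like PDF: " + looksLikePdf(unread));
        System.out.println("real header looks like PDF: " + looksLikePdf(pdf));
    }
}
```

Running a check like this on the array returned by getContents(), after rereading the file, would help tell an unread file apart from a genuinely broken PDF.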

 On Monday, 27 October 2003 19:18 you wrote:

  Hello,

  Thanks for the previous reply.

  Now I use:
  - version 1.4 of the Lucene search module (the version attached on this
    list)
  - the new registry.xml format for the module (as you described to me)
  - PDF files stored with the binary type.

  But I have the following problem:
  I can't create an InputStream from the CmsFile content.
  For this I wrote the following code in the Document method of my
  PDFDocument class:

  // f is the CmsFile parameter of the Document method
  InputStream in = new ByteArrayInputStream(f.getContents());

  // PDFExtractor is the library I use; on the file system it works fine
  PDFExtractor extractor = new PDFExtractor();

  bodyText = extractor.extractText(in);

  Is it correct to use a ByteArrayInputStream to create an InputStream for a
  CmsFile?

  The error occurs in the third line, in the PDFParser.
  The error message in Tomcat is:

  java.io.IOException: Error: Header is corrupt ''
  at PDFParser.parse
  at PDFExtractor.extractText
  at PDFDocument.Document (my class)
  at ...

  Bye, and thanks.
  Ernesto.

  - Original Message -
From: Hartmann, Waehrisch & Feykes GmbH
To: [EMAIL PROTECTED]
Sent: Friday, October 24, 2003 4:45 AM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.

Hello Ernesto,

I assume you are using the unpatched version 1.3 of the search module.
As I mentioned yesterday, the plainDocFactory only indexes cmsFiles of
type plain, not of type binary. PDF files are stored as binary. I
suggest using the version I posted yesterday. Your registry.xml would
then have to look like this:

...
<docFactories>
...
   <docFactory type="plain" enabled="true">
...
   </docFactory>
   <docFactory type="binary" enabled="true">
      <fileType name="pdftext">
         <extension>.pdf</extension>
         <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
      </fileType>
   </docFactory>
...
</docFactories>

Important: The type attribute must match the file types of OpenCms (also
defined in the registry.xml).

Bye,
Stephan
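
To make the structure of this configuration concrete, here is a small sketch that reads a docFactories fragment and lists which document class is registered for each file type and extension. It assumes the markup stripped by the archive was ordinary XML with the element and attribute names shown in the message. It is not the search module's own loading code, and DocFactoryConfig, readMappings and SAMPLE are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class DocFactoryConfig {
    // Assumed reconstruction of the fragment quoted in the message.
    static final String SAMPLE =
        "<docFactories>"
        + "<docFactory type=\"binary\" enabled=\"true\">"
        + "<fileType name=\"pdftext\">"
        + "<extension>.pdf</extension>"
        + "<class>net.grcomputing.opencms.search.lucene.PDFDocument</class>"
        + "</fileType>"
        + "</docFactory>"
        + "</docFactories>";

    /** Maps "type/extension" to the class name of each enabled fileType entry. */
    static Map<String, String> readMappings(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> mappings = new LinkedHashMap<>();
            NodeList factories = doc.getElementsByTagName("docFactory");
            for (int i = 0; i < factories.getLength(); i++) {
                Element factory = (Element) factories.item(i);
                if (!"true".equals(factory.getAttribute("enabled"))) {
                    continue; // skip disabled factories
                }
                NodeList fileTypes = factory.getElementsByTagName("fileType");
                for (int j = 0; j < fileTypes.getLength(); j++) {
                    Element fileType = (Element) fileTypes.item(j);
                    String ext = fileType.getElementsByTagName("extension")
                        .item(0).getTextContent().trim();
                    String cls = fileType.getElementsByTagName("class")
                        .item(0).getTextContent().trim();
                    mappings.put(factory.getAttribute("type") + "/" + ext, cls);
                }
            }
            return mappings;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read docFactories fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readMappings(SAMPLE));
    }
}
```

The key point the sketch reflects is the one made in the message: the PDF entry sits under the factory whose type is binary, because that is the OpenCms type the PDF files are stored with.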


RE: Index pdf files with your content in lucene.

2003-11-11 Thread Wilton, Reece

Some of us have corporate firewalls that are stripping out attachments.  If possible, 
put these on a web site somewhere so we can download them.  Thanks!


Re: Index pdf files with your content in lucene.

2003-11-11 Thread Otis Gospodnetic

Ernesto, it looks like something got stripped.  A ZIP file should make
it to the list.  If not, maybe you can post it somewhere.

Could you also tell us a bit about this code?  Is it better than
existing PDF/Word parsing solutions?  Pure Java?  Uses POI?

Thanks,
Otis



Re: Index pdf files with your content in lucene.

2003-11-11 Thread Ernesto De Santis

I am trying again, this time zipping the files.

Afterwards I will post the files on a web site.

 Could you also tell us a bit about this code?  Is it better than
 existing PDF/Word parsing solutions?  Pure Java?  Uses POI?

This code uses existing parsing solutions.
The intent is to build a Lucene Document for indexing PDF and Word files,
including their content.
It is pure Java.
It uses the TextExtraction library (tm-extractors-0.2.jar).
It uses POI and PDFBox.

Ernesto
Sorry for my bad English.



Index pdf files with your content in lucene.

2003-10-23 Thread Ernesto De Santis

Hello,

I am new to OpenCms and Lucene technology.

I want to index PDF files, including the content of these files.

This is what I have done so far:

I made a PDFDocument class modeled on the JspDocument class.
It uses the org.textmining.text.extraction.PDFExtractor class, which works fine outside the VFS.

I also wrote an entry for PDF documents in my registry.xml, inside the plainDocFactory tag:

<fileType name="pdftext">
<extension>.pdf</extension>
<!-- This will strip tags before processing -->
<class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
</fileType>

My PDFDocument contains the code below.
I think the problem is how to get the content from the CmsFile: which InputStream should I use?
PDFExtractor works with an extractText(InputStream) method.

public class PDFDocument implements I_DocumentConstants, I_DocumentFactory {

    public PDFDocument() {
    }

    public Document Document(CmsObject cmsobject, CmsFile cmsfile)
            throws CmsException {
        return Document(cmsobject, cmsfile, null);
    }

    public Document Document(CmsObject cmsobject, CmsFile cmsfile, HashMap hashmap)
            throws CmsException {
        Document document = (new BodylessDocument()).Document(cmsobject, cmsfile);

        // put the content of the PDF file into the document
        String contenido = new String(cmsfile.getContents());
        StringBufferInputStream in = new StringBufferInputStream(contenido);
        // ByteArrayInputStream in = new ByteArrayInputStream(contenido.getBytes());

        /* try {
            FileInputStream in = new FileInputStream(cmsfile.getPath() + cmsfile.getName());
        */

        PDFExtractor extractor = new PDFExtractor();
        String body = extractor.extractText(in);

        document.add(Field.Text(body, body));

        /* } catch (FileNotFoundException e) {
            e.toString();
            throw new CmsException();
        }
        */

        return (document);
    }
}


Thanks,
Ernesto
PS: Sorry for my poor English.
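
A note on the stream choice in the posted code: new String(bytes) decodes the bytes with the platform's default charset, which is not guaranteed to be reversible for binary data, and StringBufferInputStream is deprecated in the JDK because it does not properly convert characters into bytes. The sketch below assumes UTF-8 as the charset purely for illustration (the result depends on the charset actually in use). It shows a round trip through a String altering PDF-like bytes, while a ByteArrayInputStream over the original array returns them unchanged. BinaryStreamDemo and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BinaryStreamDemo {
    /** Reads the bytes back through a ByteArrayInputStream, with no decoding step. */
    static byte[] readDirect(byte[] contents) {
        ByteArrayInputStream in = new ByteArrayInputStream(contents);
        byte[] out = new byte[contents.length];
        in.read(out, 0, out.length); // a byte array stream delivers all bytes in one call
        return out;
    }

    /** Round-trips the bytes through a String (UTF-8 assumed for illustration). */
    static byte[] throughString(byte[] contents) {
        String decoded = new String(contents, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "%PDF" followed by four bytes with the high bit set, as binary PDFs often contain
        byte[] binary = {'%', 'P', 'D', 'F',
            (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3};
        System.out.println("direct stream preserves bytes: "
            + Arrays.equals(readDirect(binary), binary));
        System.out.println("String round trip preserves bytes: "
            + Arrays.equals(throughString(binary), binary));
    }
}
```

Under this assumption, wrapping the array from getContents() directly, as in the commented-out ByteArrayInputStream line but without the intermediate String, avoids the decoding step altogether.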




- Original Message - 
From: Hartmann, Waehrisch & Feykes GmbH [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, October 22, 2003 3:50 AM
Subject: Re: [opencms-dev] (no subject)


 Hi Ben,
 
 I think this won't work, since the plainDocFactory will only be used for
 files of type plain but not for files of type binary.
 Recently we have made some additions to the module - by order of Lenord,
 Bauer & Co. GmbH - that could meet your needs. They introduce a more flexible
 way of defining docFactories, so that you can add new factories without having
 to recompile the whole module. Other modules (like the news) can therefore bring
 their own docFactory, and all you have to do is edit the registry.xml.
 Here is an example:

 <docFactories>
     <docFactory enabled="true" type="plain">
         <fileType name="plaintext">
             <extension>.txt</extension>
             <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
         </fileType>
     </docFactory>
     <docFactory enabled="true" type="news">
         <class>net.grcomputing.opencms.search.lucene.NewsDocument</class>
     </docFactory>
 </docFactories>

 To index binary files all you need to add is this:

     <docFactory enabled="true" type="binary">
         <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
     </docFactory>

 There should be no need for an extension mapping.
 
 For those interested:
 For ContentDefinitions (like news) I introduced the following:

 <contentDefinitions>
     <contentDefinition type="news"> <!-- must match docFactory type -->
         <class>com.opencms.modules.homepage.news.NewsContentDefinition</class>
         <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initClass>
         <listMethod name="getNewsList">
             <param type="java.lang.Integer">1</param>
             <param type="java.lang.String">-1</param>
         </listMethod>
         <page uri="/news.html?__element=entry">
             <param method="getIntId" name="newsid"/>
         </page>
     </contentDefinition>
 </contentDefinitions>

 In short:
 initClass is optional: for the news, the news classes have to be loaded to
 initialize the db pool.
 listMethod: a method of the content definition class that returns a List of
 elements.
 page: the page that can display an entry, here a JSP that has a template
 element "entry". It also needs the id of the news item.
 getIntId is a method of the content definition class and newsid is the URL
 parameter the page needs. A link like
 news.html?__element=entry&newsid=xy
 will be generated.
 
 Best regards,
 Stephan
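
The link generation described for the page element can be sketched as follows: look up the configured method on the item by name, call it, and append the result to the page URI as the configured URL parameter. This is only an illustration of the idea, not the module's code. EntryLinkBuilder, buildLink and NewsItem are hypothetical names, and joining the parameter with "&" is an assumption about the separator that the archive stripped.

```java
import java.lang.reflect.Method;

final class EntryLinkBuilder {
    /** Appends paramName=value to pageUri, where value comes from calling methodName on item. */
    static String buildLink(String pageUri, String paramName, String methodName, Object item) {
        try {
            Method getter = item.getClass().getMethod(methodName);
            Object value = getter.invoke(item);
            // Use "&" if the URI already carries a query string, otherwise "?".
            String separator = pageUri.contains("?") ? "&" : "?";
            return pageUri + separator + paramName + "=" + value;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot call " + methodName + " on item", e);
        }
    }

    /** Hypothetical stand-in for a news content definition entry. */
    public static final class NewsItem {
        private final int id;

        public NewsItem(int id) {
            this.id = id;
        }

        public int getIntId() {
            return id;
        }
    }

    public static void main(String[] args) {
        String link = buildLink("/news.html?__element=entry", "newsid", "getIntId", new NewsItem(42));
        System.out.println(link);
    }
}
```

Resolving the method by name is what would let the page entry in registry.xml stay purely declarative: the configuration names the method and the URL parameter, and no code needs recompiling when another content definition is added.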
 
 
 - Original Message - 
 From: Ben Rometsch [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Wednesday, October 22, 2003 6:15 AM
 Subject: [opencms-dev] (no subject)
 
 
  Hi Matt,
 
  I am not having any joy! I've updated my registry.xml file, with the
  appropriate section reading:
 
  <luceneSearch>
  <mergeFactor>10</mergeFactor>
  <permCheck>true</permCheck>
  <indexDir>c:\search</indexDir>
