RE: Lucene for Indian Languages

2004-08-22 Thread Karthik N S
Hi

I do not think so, but there was one request on the forum for the
Devanagari script.

Have a look at the forums; you might find something on this.


Karthik

-Original Message-
From: srinivasa raghavan [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 11:35 AM
To: [EMAIL PROTECTED]
Subject: Lucene for Indian Languages


Hi all,

 Is the Lucene API implemented for Indian contexts? I know
that Lucene has stemmers and filters for the German and
Russian languages. I would like to know whether there
are stemmers and filters available or being developed for
Indian languages.

Thanks,
Rahavan.





___
Do you Yahoo!?
Express yourself with Y! Messenger! Free. Download now.
http://messenger.yahoo.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Lucene for Indian Languages

2004-08-22 Thread srinivasa raghavan
Hi all,

 Is the Lucene API implemented for Indian contexts? I know
that Lucene has stemmers and filters for the German and
Russian languages. I would like to know whether there
are stemmers and filters available or being developed for
Indian languages.

Thanks,
Rahavan.
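
To make the question concrete: before any stemmer can run, the tokenizer has to keep dependent vowel signs and other combining marks attached to the letters they modify, which matters for scripts such as Devanagari. The sketch below is plain JDK code, not a Lucene API; the class name IndicTokenSketch and the method tokenize are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class IndicTokenSketch {

    /** Splits text into tokens made of letters plus the combining marks attached to them. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isWordPart(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Any other character (space, punctuation, digit) ends the token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Letters, plus non-spacing and spacing combining marks (vowel signs, anusvara, virama). */
    private static boolean isWordPart(int cp) {
        int type = Character.getType(cp);
        return Character.isLetter(cp)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK;
    }

    public static void main(String[] args) {
        // Two Devanagari words separated by a space, written with Unicode escapes
        // so the source file does not depend on the compiler's default encoding.
        String sample = "\u0939\u093F\u0902\u0926\u0940 \u092D\u093E\u0937\u093E";
        System.out.println(tokenize(sample));
    }
}
```

A tokenizer that split on anything that is not strictly a letter would cut these words apart at every vowel sign, so a language-specific filter or stemmer would have nothing sensible to work on.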








Re: pdfboxhelp

2004-08-22 Thread Santosh
Hi Natarajan,
I put log4j.properties on the classpath. My new classpath is:

.;..;C:\j2sdk1.4.1\lib;C:\j2sdk1.4.1\lib\jndi.jar;
C:\j2sdk1.4.1\lib\webclient.jar;C:\j2sdk1.4.1\lib\mail.jar;
C:\j2sdk1.4.1\lib\activation.jar;C:\j2sdk1.4.1\lib\xml-apis.jar;D:\JAVAPRO;
C:\j2sdk1.4.1\jre\lib\ext\msbase.jar;C:\j2sdk1.4.1\lib\servlet.jar;
E:\Program Files\Apache Tomcat 4.0\common\lib\servlet.jar;
C:\Program Files\Altova\xmlspy\XMLSpyInterface.jar;C:\j2sdk1.4.1\lib\sax.jar;
C:\j2sdk1.4.1\lib\dom.jar;C:\j2sdk1.4.1\lib\xalan.jar;
C:\j2sdk1.4.1\lib\xercesImpl.jar;C:\j2sdk1.4.1\lib\xmlParserAPIs.jar;
C:\j2sdk1.4.1\lib\parser.jar;C:\j2sdk1.4.1\lib\jaxp.jar;C:\j2sdk1.4.1\lib\xml.jar;
C:\j2sdk1.4.1\lib\classes12.zip;C:\struts.jar;F:\apache-ant-1.6.1\lib\ant.jar;
C:\j2sdk1.4.1\lib\PDFBox-0.6.6.jar;C:\j2sdk1.4.1\lib\lucene-20030909.jar;
D:\setups\searchEngine\PDFBox-0.6.6\external\log4j.jar;
C:\j2sdk1.4.1\lib\log4j.properties;

but there is no difference in the output.
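
One way to check whether the JVM can actually see the file: classpath entries are directories or jar/zip archives, so an entry that names the .properties file itself is not searched, and what matters is whether a listed directory (or jar root) contains it. The sketch below assumes log4j 1.2 looks the file up as a classpath resource named log4j.properties; the class name ClasspathCheck and the method locate are hypothetical.

```java
import java.net.URL;

public class ClasspathCheck {

    /** Returns where the system class loader finds the resource, or null if it is not visible. */
    public static URL locate(String resourceName) {
        return ClassLoader.getSystemResource(resourceName);
    }

    public static void main(String[] args) {
        // Default to the file name discussed in this thread; another name can be passed as an argument.
        String name = args.length > 0 ? args[0] : "log4j.properties";
        URL url = locate(name);
        if (url == null) {
            System.out.println(name + " is NOT visible on the classpath");
        } else {
            System.out.println(name + " found at " + url);
        }
    }
}
```

Run it with the same CLASSPATH setting as the failing command. If it reports that the file is not visible, log4j will not find it either.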


- Original Message -
From: "Natarajan.T" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 10:56 AM
Subject: RE: pdfboxhelp


> Hi Santhosh,
>
> The attached file must be in your class path.
>
>
> Natarajan.
>
>
>
> -Original Message-
> From: Santosh [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 23, 2004 10:51 AM
> To: Lucene Users List
> Subject: Fw: pdfboxhelp
>
> hi karthik,
> did u find any solution? should I send the pdf to u?
> - Original Message -
> From: "Santosh" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Monday, August 23, 2004 10:23 AM
> Subject: Re: pdfboxhelp
>
>
> > hi karthik,
> >  I kept log4j in the classpath , I am sending classpath variable
> >
> > CLASSPATH
> >
> >
> > .;..;C:\j2sdk1.4.1\lib;C:\j2sdk1.4.1\lib\jndi.jar;
> > C:\j2sdk1.4.1\lib\webclient.jar;C:\j2sdk1.4.1\lib\mail.jar;
> > C:\j2sdk1.4.1\lib\activation.jar;C:\j2sdk1.4.1\lib\xml-apis.jar;D:\JAVAPRO;
> > C:\j2sdk1.4.1\jre\lib\ext\msbase.jar;C:\j2sdk1.4.1\lib\servlet.jar;
> > E:\Program Files\Apache Tomcat 4.0\common\lib\servlet.jar;
> > C:\Program Files\Altova\xmlspy\XMLSpyInterface.jar;C:\j2sdk1.4.1\lib\sax.jar;
> > C:\j2sdk1.4.1\lib\dom.jar;C:\j2sdk1.4.1\lib\xalan.jar;
> > C:\j2sdk1.4.1\lib\xercesImpl.jar;C:\j2sdk1.4.1\lib\xmlParserAPIs.jar;
> > C:\j2sdk1.4.1\lib\parser.jar;C:\j2sdk1.4.1\lib\jaxp.jar;C:\j2sdk1.4.1\lib\xml.jar;
> > C:\j2sdk1.4.1\lib\classes12.zip;C:\struts.jar;F:\apache-ant-1.6.1\lib\ant.jar;
> > C:\j2sdk1.4.1\lib\PDFBox-0.6.6.jar;C:\j2sdk1.4.1\lib\lucene-20030909.jar;
> > D:\setups\searchEngine\PDFBox-0.6.6\external\log4j.jar
> >
> > please check the error
> >
> >
> >
> > - Original Message -
> > From: "Karthik N S" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Monday, August 23, 2004 10:26 AM
> > Subject: RE: pdfboxhelp
> >
> >
> > > Hi Santosh
> > >
> > >   I think u'r Pdf is using  Log4j package ,Try toe set the classpath
> for
> > > log4j.jar path.
> > >
> > >  [ Is it a just a WARNING  or an ERROR  u are getting.
> > >
> > >   Send me in u'r Configuration management Let me help u with it
> ; [
> > >
> > >
> > > Karthik
> > >
> > > -Original Message-
> > > From: Santosh [mailto:[EMAIL PROTECTED]
> > > Sent: Monday, August 23, 2004 10:11 AM
> > > To: Lucene Users List
> > > Cc: Ben Litchfield
> > > Subject: Re: pdfboxhelp
> > >
> > >
> > > hi karthik,
> > >
> > > I have downloaded pdfbox and kept pdfjar file in the classpath, but
> when
> I
> > > am typing following command in the command prompt I am getting the
> error:
> > >
> > > D:\setups\searchEngine\PDFBox-0.6.6\src>java org.pdfbox.ExtractText
> > > C:\test.pdf
> > > C:\test.txt
> > > log4j:WARN No appenders could be found for logger
> > > (org.pdfbox.pdfparser.PDFParse
> > > r).
> > > log4j:WARN Please initialize the log4j system properly
> > >
> > > why I am getting this error? plz help
> > >
> > >
> > > - Original Message -
> > > From: "Karthik N S" <[EMAIL PROTECTED]>
> > > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > > Sent: Monday, August 23, 2004 9:21 AM
> > > Subject: RE: pdfboxhelp
> > >
> > >
> > > > Hi
> > > >
> > > >
> > > > To Begin with try to build Indexes offline  [ out of Tomcat
> > container]
> > > > and  on completing indxexes, feed u'r search  with the realpath of
> the
> > > offline indexed folder,Start the Tomcat and then use the
> > > > search on As u experiment it out u will be comfortable
> > withrequirment
> > > of Indexing /Search..   ; [
> > > >
> > > > Karthik
> > > >
> > > > -Original Message-
> > > > From: Santosh [mailto:[EMAIL PROTECTED]
> > > > Sent: Saturday, August 21, 2004 4:55 PM
> > > > To: Lucene Users List
> > > > Subject: Re: pdfboxhelp
> > > >
> > > >
> > > > Yes I did the same.
> > > > I copied all the classes into classes folder but
> > > > now when I am building the index using IndexHTML the pdfs are not
> added
> > to
> > > >

RE: pdfboxhelp

2004-08-22 Thread Karthik N S
Hi Santosh

  Hold on, it's Monday and I am running behind schedule with my job.
I will reply to you some time around noon.


 Karthik

-Original Message-
From: Santosh [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 10:51 AM
To: Lucene Users List
Subject: Fw: pdfboxhelp


hi karthik,
did u find any solution? should I send the pdf to u?
- Original Message -
From: "Santosh" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 10:23 AM
Subject: Re: pdfboxhelp


> hi karthik,
>  I kept log4j in the classpath , I am sending classpath variable
>
> CLASSPATH
>
>
> .;..;C:\j2sdk1.4.1\lib;C:\j2sdk1.4.1\lib\jndi.jar;
> C:\j2sdk1.4.1\lib\webclient.jar;C:\j2sdk1.4.1\lib\mail.jar;
> C:\j2sdk1.4.1\lib\activation.jar;C:\j2sdk1.4.1\lib\xml-apis.jar;D:\JAVAPRO;
> C:\j2sdk1.4.1\jre\lib\ext\msbase.jar;C:\j2sdk1.4.1\lib\servlet.jar;
> E:\Program Files\Apache Tomcat 4.0\common\lib\servlet.jar;
> C:\Program Files\Altova\xmlspy\XMLSpyInterface.jar;C:\j2sdk1.4.1\lib\sax.jar;
> C:\j2sdk1.4.1\lib\dom.jar;C:\j2sdk1.4.1\lib\xalan.jar;
> C:\j2sdk1.4.1\lib\xercesImpl.jar;C:\j2sdk1.4.1\lib\xmlParserAPIs.jar;
> C:\j2sdk1.4.1\lib\parser.jar;C:\j2sdk1.4.1\lib\jaxp.jar;C:\j2sdk1.4.1\lib\xml.jar;
> C:\j2sdk1.4.1\lib\classes12.zip;C:\struts.jar;F:\apache-ant-1.6.1\lib\ant.jar;
> C:\j2sdk1.4.1\lib\PDFBox-0.6.6.jar;C:\j2sdk1.4.1\lib\lucene-20030909.jar;
> D:\setups\searchEngine\PDFBox-0.6.6\external\log4j.jar
>
> please check the error
>
>
>
> - Original Message -
> From: "Karthik N S" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Monday, August 23, 2004 10:26 AM
> Subject: RE: pdfboxhelp
>
>
> > Hi Santosh
> >
> >   I think u'r Pdf is using  Log4j package ,Try toe set the classpath for
> > log4j.jar path.
> >
> >  [ Is it a just a WARNING  or an ERROR  u are getting.
> >
> >   Send me in u'r Configuration management Let me help u with it ; [
> >
> >
> > Karthik
> >
> > -Original Message-
> > From: Santosh [mailto:[EMAIL PROTECTED]
> > Sent: Monday, August 23, 2004 10:11 AM
> > To: Lucene Users List
> > Cc: Ben Litchfield
> > Subject: Re: pdfboxhelp
> >
> >
> > hi karthik,
> >
> > I have downloaded pdfbox and kept pdfjar file in the classpath, but when
I
> > am typing following command in the command prompt I am getting the
error:
> >
> > D:\setups\searchEngine\PDFBox-0.6.6\src>java org.pdfbox.ExtractText
> > C:\test.pdf
> > C:\test.txt
> > log4j:WARN No appenders could be found for logger
> > (org.pdfbox.pdfparser.PDFParse
> > r).
> > log4j:WARN Please initialize the log4j system properly
> >
> > why I am getting this error? plz help
> >
> >
> > - Original Message -
> > From: "Karthik N S" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Monday, August 23, 2004 9:21 AM
> > Subject: RE: pdfboxhelp
> >
> >
> > > Hi
> > >
> > >
> > > To Begin with try to build Indexes offline  [ out of Tomcat
> container]
> > > and  on completing indxexes, feed u'r search  with the realpath of the
> > offline indexed folder,Start the Tomcat and then use the
> > > search on As u experiment it out u will be comfortable
> withrequirment
> > of Indexing /Search..   ; [
> > >
> > > Karthik
> > >
> > > -Original Message-
> > > From: Santosh [mailto:[EMAIL PROTECTED]
> > > Sent: Saturday, August 21, 2004 4:55 PM
> > > To: Lucene Users List
> > > Subject: Re: pdfboxhelp
> > >
> > >
> > > Yes I did the same.
> > > I copied all the classes into classes folder but
> > > now when I am building the index using IndexHTML the pdfs are not
added
> to
> > > this index, only text and htmls are added to index.
> > > what changes should I do for IndexHTML.java to build index with pdf
> > > - Original Message -
> > > From: "Karthik N S" <[EMAIL PROTECTED]>
> > > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > > Sent: Saturday, August 21, 2004 4:54 PM
> > > Subject: RE: pdfboxhelp
> > >
> > >
> > > > Hi
> > > >
> > > > If u are using the jar file with Web Interface for jsp/servlet dev,
> > Place
> > > > the jar file in  "webapps///lib"
> > > > and also correct the Classpath for the present modification.
> > > >
> > > > 2)create u'r own package and put all u'r java files  copy the java
> files
> > > to
> > > > /Web-inf/Classes/
> > > >
> > > >
> > > > Then use the same..;{
> > > >
> > > >
> > > > Karthik
> > > >
> > > > -Original Message-
> > > > From: Santosh [mailto:[EMAIL PROTECTED]
> > > > Sent: Saturday, August 21, 2004 4:31 PM
> > > > To: Lucene Users List
> > > > Subject: Re: pdfboxhelp
> > > >
> > > >
> > > > thanks  Natarajan and karthik,
> > > >
> > > > I corrected classpath
> > > >
> > > > but where I should write your code?
> > > > should I write your code in IndexHTML.java  which comes along with
> > lucene
> > > or
> > > > some other place?
> > > > one more thing
> > > > I kept pdfbox jar file in the classpath is this enough or I have to
> > build
> > > > the pdfbox?
> > > >
> > > >

RE: pdfboxhelp

2004-08-22 Thread Natarajan.T
Hi Santhosh,

The attached file must be in your class path.


Natarajan.



-Original Message-
From: Santosh [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 23, 2004 10:51 AM
To: Lucene Users List
Subject: Fw: pdfboxhelp

hi karthik,
did u find any solution? should I send the pdf to u?
- Original Message -
From: "Santosh" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 10:23 AM
Subject: Re: pdfboxhelp


> hi karthik,
>  I kept log4j in the classpath , I am sending classpath variable
>
> CLASSPATH
>
>
> .;..;C:\j2sdk1.4.1\lib;C:\j2sdk1.4.1\lib\jndi.jar;
> C:\j2sdk1.4.1\lib\webclient.jar;C:\j2sdk1.4.1\lib\mail.jar;
> C:\j2sdk1.4.1\lib\activation.jar;C:\j2sdk1.4.1\lib\xml-apis.jar;D:\JAVAPRO;
> C:\j2sdk1.4.1\jre\lib\ext\msbase.jar;C:\j2sdk1.4.1\lib\servlet.jar;
> E:\Program Files\Apache Tomcat 4.0\common\lib\servlet.jar;
> C:\Program Files\Altova\xmlspy\XMLSpyInterface.jar;C:\j2sdk1.4.1\lib\sax.jar;
> C:\j2sdk1.4.1\lib\dom.jar;C:\j2sdk1.4.1\lib\xalan.jar;
> C:\j2sdk1.4.1\lib\xercesImpl.jar;C:\j2sdk1.4.1\lib\xmlParserAPIs.jar;
> C:\j2sdk1.4.1\lib\parser.jar;C:\j2sdk1.4.1\lib\jaxp.jar;C:\j2sdk1.4.1\lib\xml.jar;
> C:\j2sdk1.4.1\lib\classes12.zip;C:\struts.jar;F:\apache-ant-1.6.1\lib\ant.jar;
> C:\j2sdk1.4.1\lib\PDFBox-0.6.6.jar;C:\j2sdk1.4.1\lib\lucene-20030909.jar;
> D:\setups\searchEngine\PDFBox-0.6.6\external\log4j.jar
>
> please check the error
>
>
>
> - Original Message -
> From: "Karthik N S" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Monday, August 23, 2004 10:26 AM
> Subject: RE: pdfboxhelp
>
>
> > Hi Santosh
> >
> >   I think u'r Pdf is using  Log4j package ,Try toe set the classpath
for
> > log4j.jar path.
> >
> >  [ Is it a just a WARNING  or an ERROR  u are getting.
> >
> >   Send me in u'r Configuration management Let me help u with it
; [
> >
> >
> > Karthik
> >
> > -Original Message-
> > From: Santosh [mailto:[EMAIL PROTECTED]
> > Sent: Monday, August 23, 2004 10:11 AM
> > To: Lucene Users List
> > Cc: Ben Litchfield
> > Subject: Re: pdfboxhelp
> >
> >
> > hi karthik,
> >
> > I have downloaded pdfbox and kept pdfjar file in the classpath, but
when
I
> > am typing following command in the command prompt I am getting the
error:
> >
> > D:\setups\searchEngine\PDFBox-0.6.6\src>java org.pdfbox.ExtractText
> > C:\test.pdf
> > C:\test.txt
> > log4j:WARN No appenders could be found for logger
> > (org.pdfbox.pdfparser.PDFParse
> > r).
> > log4j:WARN Please initialize the log4j system properly
> >
> > why I am getting this error? plz help
> >
> >
> > - Original Message -
> > From: "Karthik N S" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Monday, August 23, 2004 9:21 AM
> > Subject: RE: pdfboxhelp
> >
> >
> > > Hi
> > >
> > >
> > > To Begin with try to build Indexes offline  [ out of Tomcat
> container]
> > > and  on completing indxexes, feed u'r search  with the realpath of
the
> > offline indexed folder,Start the Tomcat and then use the
> > > search on As u experiment it out u will be comfortable
> withrequirment
> > of Indexing /Search..   ; [
> > >
> > > Karthik
> > >
> > > -Original Message-
> > > From: Santosh [mailto:[EMAIL PROTECTED]
> > > Sent: Saturday, August 21, 2004 4:55 PM
> > > To: Lucene Users List
> > > Subject: Re: pdfboxhelp
> > >
> > >
> > > Yes I did the same.
> > > I copied all the classes into classes folder but
> > > now when I am building the index using IndexHTML the pdfs are not
added
> to
> > > this index, only text and htmls are added to index.
> > > what changes should I do for IndexHTML.java to build index with
pdf
> > > - Original Message -
> > > From: "Karthik N S" <[EMAIL PROTECTED]>
> > > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > > Sent: Saturday, August 21, 2004 4:54 PM
> > > Subject: RE: pdfboxhelp
> > >
> > >
> > > > Hi
> > > >
> > > > If u are using the jar file with Web Interface for jsp/servlet
dev,
> > Place
> > > > the jar file in  "webapps///lib"
> > > > and also correct the Classpath for the present modification.
> > > >
> > > > 2)create u'r own package and put all u'r java files  copy the
java
> files
> > > to
> > > > /Web-inf/Classes/
> > > >
> > > >
> > > > Then use the same..;{
> > > >
> > > >
> > > > Karthik
> > > >
> > > > -Original Message-
> > > > From: Santosh [mailto:[EMAIL PROTECTED]
> > > > Sent: Saturday, August 21, 2004 4:31 PM
> > > > To: Lucene Users List
> > > > Subject: Re: pdfboxhelp
> > > >
> > > >
> > > > thanks  Natarajan and karthik,
> > > >
> > > > I corrected classpath
> > > >
> > > > but where I should write your code?
> > > > should I write your code in IndexHTML.java  which comes along
with
> > lucene
> > > or
> > > > some other place?
> > > > one more thing
> > > > I kept pdfbox jar file in the classpath is this enough or I have
to
> > build
> > > > the pdfbox?
> > > >
> > > > thankyou
> > > > - Original Message -
> > > 

Fw: pdfboxhelp

2004-08-22 Thread Santosh
Hi Karthik,
Did you find any solution? Should I send the PDF to you?
- Original Message -
From: "Santosh" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 10:23 AM
Subject: Re: pdfboxhelp


> hi karthik,
>  I kept log4j in the classpath , I am sending classpath variable
>
> CLASSPATH
>
>
> .;..;C:\j2sdk1.4.1\lib;C:\j2sdk1.4.1\lib\jndi.jar;
> C:\j2sdk1.4.1\lib\webclient.jar;C:\j2sdk1.4.1\lib\mail.jar;
> C:\j2sdk1.4.1\lib\activation.jar;C:\j2sdk1.4.1\lib\xml-apis.jar;D:\JAVAPRO;
> C:\j2sdk1.4.1\jre\lib\ext\msbase.jar;C:\j2sdk1.4.1\lib\servlet.jar;
> E:\Program Files\Apache Tomcat 4.0\common\lib\servlet.jar;
> C:\Program Files\Altova\xmlspy\XMLSpyInterface.jar;C:\j2sdk1.4.1\lib\sax.jar;
> C:\j2sdk1.4.1\lib\dom.jar;C:\j2sdk1.4.1\lib\xalan.jar;
> C:\j2sdk1.4.1\lib\xercesImpl.jar;C:\j2sdk1.4.1\lib\xmlParserAPIs.jar;
> C:\j2sdk1.4.1\lib\parser.jar;C:\j2sdk1.4.1\lib\jaxp.jar;C:\j2sdk1.4.1\lib\xml.jar;
> C:\j2sdk1.4.1\lib\classes12.zip;C:\struts.jar;F:\apache-ant-1.6.1\lib\ant.jar;
> C:\j2sdk1.4.1\lib\PDFBox-0.6.6.jar;C:\j2sdk1.4.1\lib\lucene-20030909.jar;
> D:\setups\searchEngine\PDFBox-0.6.6\external\log4j.jar
>
> please check the error
>
>
>
> - Original Message -
> From: "Karthik N S" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Monday, August 23, 2004 10:26 AM
> Subject: RE: pdfboxhelp
>
>
> > Hi Santosh
> >
> >   I think u'r Pdf is using  Log4j package ,Try toe set the classpath for
> > log4j.jar path.
> >
> >  [ Is it a just a WARNING  or an ERROR  u are getting.
> >
> >   Send me in u'r Configuration management Let me help u with it ; [
> >
> >
> > Karthik
> >
> > -Original Message-
> > From: Santosh [mailto:[EMAIL PROTECTED]
> > Sent: Monday, August 23, 2004 10:11 AM
> > To: Lucene Users List
> > Cc: Ben Litchfield
> > Subject: Re: pdfboxhelp
> >
> >
> > hi karthik,
> >
> > I have downloaded pdfbox and kept pdfjar file in the classpath, but when
I
> > am typing following command in the command prompt I am getting the
error:
> >
> > D:\setups\searchEngine\PDFBox-0.6.6\src>java org.pdfbox.ExtractText
> > C:\test.pdf
> > C:\test.txt
> > log4j:WARN No appenders could be found for logger
> > (org.pdfbox.pdfparser.PDFParse
> > r).
> > log4j:WARN Please initialize the log4j system properly
> >
> > why I am getting this error? plz help
> >
> >
> > - Original Message -
> > From: "Karthik N S" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Monday, August 23, 2004 9:21 AM
> > Subject: RE: pdfboxhelp
> >
> >
> > > Hi
> > >
> > >
> > > To Begin with try to build Indexes offline  [ out of Tomcat
> container]
> > > and  on completing indxexes, feed u'r search  with the realpath of the
> > offline indexed folder,Start the Tomcat and then use the
> > > search on As u experiment it out u will be comfortable
> withrequirment
> > of Indexing /Search..   ; [
> > >
> > > Karthik
> > >
> > > -Original Message-
> > > From: Santosh [mailto:[EMAIL PROTECTED]
> > > Sent: Saturday, August 21, 2004 4:55 PM
> > > To: Lucene Users List
> > > Subject: Re: pdfboxhelp
> > >
> > >
> > > Yes I did the same.
> > > I copied all the classes into classes folder but
> > > now when I am building the index using IndexHTML the pdfs are not
added
> to
> > > this index, only text and htmls are added to index.
> > > what changes should I do for IndexHTML.java to build index with pdf
> > > - Original Message -
> > > From: "Karthik N S" <[EMAIL PROTECTED]>
> > > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > > Sent: Saturday, August 21, 2004 4:54 PM
> > > Subject: RE: pdfboxhelp
> > >
> > >
> > > > Hi
> > > >
> > > > If u are using the jar file with Web Interface for jsp/servlet dev,
> > Place
> > > > the jar file in  "webapps///lib"
> > > > and also correct the Classpath for the present modification.
> > > >
> > > > 2)create u'r own package and put all u'r java files  copy the java
> files
> > > to
> > > > /Web-inf/Classes/
> > > >
> > > >
> > > > Then use the same..;{
> > > >
> > > >
> > > > Karthik
> > > >
> > > > -Original Message-
> > > > From: Santosh [mailto:[EMAIL PROTECTED]
> > > > Sent: Saturday, August 21, 2004 4:31 PM
> > > > To: Lucene Users List
> > > > Subject: Re: pdfboxhelp
> > > >
> > > >
> > > > thanks  Natarajan and karthik,
> > > >
> > > > I corrected classpath
> > > >
> > > > but where I should write your code?
> > > > should I write your code in IndexHTML.java  which comes along with
> > lucene
> > > or
> > > > some other place?
> > > > one more thing
> > > > I kept pdfbox jar file in the classpath is this enough or I have to
> > build
> > > > the pdfbox?
> > > >
> > > > thankyou
> > > > - Original Message -
> > > > From: "Natarajan.T" <[EMAIL PROTECTED]>
> > > > To: "'Lucene Users List'" <[EMAIL PROTECTED]>
> > > > Sent: Saturday, August 21, 2004 3:20 PM
> > > > Subject: RE: pdfboxhelp
> > > >
> > > >
> > > > > Hi Santhosh,
> > > > >
> > > > > Try 

Re: pdfboxhelp

2004-08-22 Thread Santosh
Hi Karthik,
 I put log4j on the classpath. I am sending my classpath variable:

CLASSPATH

.;..;C:\j2sdk1.4.1\lib;C:\j2sdk1.4.1\lib\jndi.jar;
C:\j2sdk1.4.1\lib\webclient.jar;C:\j2sdk1.4.1\lib\mail.jar;
C:\j2sdk1.4.1\lib\activation.jar;C:\j2sdk1.4.1\lib\xml-apis.jar;D:\JAVAPRO;
C:\j2sdk1.4.1\jre\lib\ext\msbase.jar;C:\j2sdk1.4.1\lib\servlet.jar;
E:\Program Files\Apache Tomcat 4.0\common\lib\servlet.jar;
C:\Program Files\Altova\xmlspy\XMLSpyInterface.jar;C:\j2sdk1.4.1\lib\sax.jar;
C:\j2sdk1.4.1\lib\dom.jar;C:\j2sdk1.4.1\lib\xalan.jar;
C:\j2sdk1.4.1\lib\xercesImpl.jar;C:\j2sdk1.4.1\lib\xmlParserAPIs.jar;
C:\j2sdk1.4.1\lib\parser.jar;C:\j2sdk1.4.1\lib\jaxp.jar;C:\j2sdk1.4.1\lib\xml.jar;
C:\j2sdk1.4.1\lib\classes12.zip;C:\struts.jar;F:\apache-ant-1.6.1\lib\ant.jar;
C:\j2sdk1.4.1\lib\PDFBox-0.6.6.jar;C:\j2sdk1.4.1\lib\lucene-20030909.jar;
D:\setups\searchEngine\PDFBox-0.6.6\external\log4j.jar

Please check the error.



- Original Message -
From: "Karthik N S" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 10:26 AM
Subject: RE: pdfboxhelp


> Hi Santosh
>
>   I think u'r Pdf is using  Log4j package ,Try toe set the classpath for
> log4j.jar path.
>
>  [ Is it a just a WARNING  or an ERROR  u are getting.
>
>   Send me in u'r Configuration management Let me help u with it ; [
>
>
> Karthik
>
> -Original Message-
> From: Santosh [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 23, 2004 10:11 AM
> To: Lucene Users List
> Cc: Ben Litchfield
> Subject: Re: pdfboxhelp
>
>
> hi karthik,
>
> I have downloaded pdfbox and kept pdfjar file in the classpath, but when I
> am typing following command in the command prompt I am getting the error:
>
> D:\setups\searchEngine\PDFBox-0.6.6\src>java org.pdfbox.ExtractText
> C:\test.pdf
> C:\test.txt
> log4j:WARN No appenders could be found for logger
> (org.pdfbox.pdfparser.PDFParse
> r).
> log4j:WARN Please initialize the log4j system properly
>
> why I am getting this error? plz help
>
>
> - Original Message -
> From: "Karthik N S" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Monday, August 23, 2004 9:21 AM
> Subject: RE: pdfboxhelp
>
>
> > Hi
> >
> >
> > To Begin with try to build Indexes offline  [ out of Tomcat
container]
> > and  on completing indxexes, feed u'r search  with the realpath of the
> offline indexed folder,Start the Tomcat and then use the
> > search on As u experiment it out u will be comfortable
withrequirment
> of Indexing /Search..   ; [
> >
> > Karthik
> >
> > -Original Message-
> > From: Santosh [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, August 21, 2004 4:55 PM
> > To: Lucene Users List
> > Subject: Re: pdfboxhelp
> >
> >
> > Yes I did the same.
> > I copied all the classes into classes folder but
> > now when I am building the index using IndexHTML the pdfs are not added
to
> > this index, only text and htmls are added to index.
> > what changes should I do for IndexHTML.java to build index with pdf
> > - Original Message -
> > From: "Karthik N S" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Saturday, August 21, 2004 4:54 PM
> > Subject: RE: pdfboxhelp
> >
> >
> > > Hi
> > >
> > > If u are using the jar file with Web Interface for jsp/servlet dev,
> Place
> > > the jar file in  "webapps///lib"
> > > and also correct the Classpath for the present modification.
> > >
> > > 2)create u'r own package and put all u'r java files  copy the java
files
> > to
> > > /Web-inf/Classes/
> > >
> > >
> > > Then use the same..;{
> > >
> > >
> > > Karthik
> > >
> > > -Original Message-
> > > From: Santosh [mailto:[EMAIL PROTECTED]
> > > Sent: Saturday, August 21, 2004 4:31 PM
> > > To: Lucene Users List
> > > Subject: Re: pdfboxhelp
> > >
> > >
> > > thanks  Natarajan and karthik,
> > >
> > > I corrected classpath
> > >
> > > but where I should write your code?
> > > should I write your code in IndexHTML.java  which comes along with
> lucene
> > or
> > > some other place?
> > > one more thing
> > > I kept pdfbox jar file in the classpath is this enough or I have to
> build
> > > the pdfbox?
> > >
> > > thankyou
> > > - Original Message -
> > > From: "Natarajan.T" <[EMAIL PROTECTED]>
> > > To: "'Lucene Users List'" <[EMAIL PROTECTED]>
> > > Sent: Saturday, August 21, 2004 3:20 PM
> > > Subject: RE: pdfboxhelp
> > >
> > >
> > > > Hi Santhosh,
> > > >
> > > > Try out this below code.(pdfbox.jar file must be in your
> classpath)
> > > >
> > > > public String getContent(InputStream  reader) throws
> > IOException{PDFParser
> > > parser = null;PDDocument pdDoc = null;PDFTextStripper stripper =
> > null;String
> > > pdftext = "";try{parser = new PDFParser(reader);parser.parse();pdDoc =
> > > parser.getPDDocument();if(pdDoc.isEncrypted()){DecryptDocument
decryptor
> =
> > > new
> > > > DecryptDocument(pdDoc);decryptor.decryptDocument("");}stripper = new
> > > PDFTextStripper();pdftext = stripper

RE: pdfboxhelp

2004-08-22 Thread Karthik N S
Hi Santosh,

  I think your PDF library is using the Log4j package. Try adding log4j.jar
to the classpath.

  Is it just a WARNING or an ERROR you are getting?

  Send me your configuration and let me help you with it.


Karthik
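
For reference, the two log4j:WARN lines in the message quoted below are warnings, not errors: they say that log4j found no configuration, not that the text extraction failed. A minimal log4j.properties that should silence them is sketched here, assuming log4j 1.2 and that the file sits in a directory (or at the root of a jar) listed on the classpath. The appender name stdout and the output pattern are arbitrary choices.

```properties
# Send WARN and above from every logger to the console.
log4j.rootLogger=WARN, stdout

# Console appender with a simple timestamp/level/logger/message layout.
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %-5p %c - %m%n
```

log4j 1.2 also honours the log4j.configuration system property, so the file can be named explicitly on the java command line instead of relying on the classpath lookup.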

-Original Message-
From: Santosh [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 10:11 AM
To: Lucene Users List
Cc: Ben Litchfield
Subject: Re: pdfboxhelp


hi karthik,

I have downloaded pdfbox and kept pdfjar file in the classpath, but when I
am typing following command in the command prompt I am getting the error:

D:\setups\searchEngine\PDFBox-0.6.6\src>java org.pdfbox.ExtractText C:\test.pdf C:\test.txt
log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly

why I am getting this error? plz help


- Original Message -
From: "Karthik N S" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 9:21 AM
Subject: RE: pdfboxhelp


> Hi
>
>
> To begin with, try building the indexes offline (outside the Tomcat container).
> Once the indexes are complete, give your search code the real path of the
> offline index folder, start Tomcat, and then run the search.
> As you experiment with it, you will get comfortable with the requirements
> of indexing and search.
>
> Karthik
>
> -Original Message-
> From: Santosh [mailto:[EMAIL PROTECTED]
> Sent: Saturday, August 21, 2004 4:55 PM
> To: Lucene Users List
> Subject: Re: pdfboxhelp
>
>
> Yes I did the same.
> I copied all the classes into classes folder but
> now when I am building the index using IndexHTML the pdfs are not added to
> this index, only text and htmls are added to index.
> what changes should I do for IndexHTML.java to build index with pdf
> - Original Message -
> From: "Karthik N S" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Saturday, August 21, 2004 4:54 PM
> Subject: RE: pdfboxhelp
>
>
> > Hi
> >
> > 1) If you are using the jar file with a web interface for JSP/servlet
> > development, place the jar file in "webapps///lib"
> > and also correct the classpath for this change.
> >
> > 2) Create your own package, put all your Java files in it, and copy them
> > to /WEB-INF/classes/
> >
> >
> > Then use the same.
> >
> >
> > Karthik
> >
> > -Original Message-
> > From: Santosh [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, August 21, 2004 4:31 PM
> > To: Lucene Users List
> > Subject: Re: pdfboxhelp
> >
> >
> > thanks  Natarajan and karthik,
> >
> > I corrected classpath
> >
> > but where I should write your code?
> > should I write your code in IndexHTML.java  which comes along with
lucene
> or
> > some other place?
> > one more thing
> > I kept pdfbox jar file in the classpath is this enough or I have to
build
> > the pdfbox?
> >
> > thankyou
> > - Original Message -
> > From: "Natarajan.T" <[EMAIL PROTECTED]>
> > To: "'Lucene Users List'" <[EMAIL PROTECTED]>
> > Sent: Saturday, August 21, 2004 3:20 PM
> > Subject: RE: pdfboxhelp
> >
> >
> > > Hi Santhosh,
> > >
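> > > Try out the code below. (The pdfbox.jar file must be in your classpath.)
> > >
> > > // Extracts the text of a PDF read from the given stream.
> > > public String getContent(InputStream reader) throws IOException {
> > >     PDDocument pdDoc = null;
> > >     String pdftext = "";
> > >     try {
> > >         PDFParser parser = new PDFParser(reader);
> > >         parser.parse();
> > >         pdDoc = parser.getPDDocument();
> > >         if (pdDoc.isEncrypted()) {
> > >             // Try to decrypt with an empty password.
> > >             DecryptDocument decryptor = new DecryptDocument(pdDoc);
> > >             decryptor.decryptDocument("");
> > >         }
> > >         PDFTextStripper stripper = new PDFTextStripper();
> > >         pdftext = stripper.getText(pdDoc);
> > >         // 'info' is assumed to be a field declared elsewhere in the class.
> > >         info = pdDoc.getDocumentInformation();
> > >     } catch (Exception err) {
> > >         System.out.println(err.getMessage());
> > >     } finally {
> > >         // Close only if parsing got far enough to create the document.
> > >         if (pdDoc != null) {
> > >             pdDoc.close();
> > >         }
> > >     }
> > >     return pdftext;
> > > }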
> > >
> > > Natarajan.
> > >
> > > -Original Message-
> > > From: Santosh [mailto:[EMAIL PROTECTED]
> > > Sent: Saturday, August 21, 2004 3:14 PM
> > > To: Lucene Users List
> > > Subject: Re: pdfboxhelp
> > >
> > > Hi Don,
> > >
> > > your Idea is nice, but whenever I write the  following code in
> > > IndexHTML.java of lucene
> > >
> > >
> > > import org.pdfbox.searchengine.lucene.*;
> > >
> > > File pdfFile = new File("/path/to/the/file.pdf");
> > >
> > > // Below returns a parse PDF file in a Lucene Document object.
> > > Document doc = LucenePDFDocument.getDocument(pdfFile);
> > >
> > > Iam getting the following error
> > >
> > > package org.pdfbox.searchengine.lucene does not exist
> > >
> > > I have downloaded pdfbox source code and kept the jar file in the
> > > classpath, please help me on this
> > > - Original Message -
> > > From: Don Vaillancourt
> > > To: Lucene Users List
> > > Sent: Friday, August 20, 2004 7:37 PM
> > > Subject: Re: pdfboxhelp
> > >
> > >
> > >   Here is the super simple code required.
> > >
> > >   import org.pdfbox.searchengine.lucene.*;
> > >
> > >   File pdfFile = new File("/path/to/the/file.pdf");
> > >
> > >   // Below returns a parse 

Re: pdfboxhelp

2004-08-22 Thread Santosh
Hi Karthik,

I have downloaded PDFBox and put the PDFBox jar file on the classpath, but when I
type the following command at the command prompt I get this error:

D:\setups\searchEngine\PDFBox-0.6.6\src>java org.pdfbox.ExtractText C:\test.pdf C:\test.txt
log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly

Why am I getting this error? Please help.


- Original Message -
From: "Karthik N S" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 9:21 AM
Subject: RE: pdfboxhelp


> Hi
>
>
> To begin with, try building the indexes offline (outside the Tomcat container).
> Once the indexes are complete, give your search code the real path of the
> offline index folder, start Tomcat, and then run the search.
> As you experiment with it, you will get comfortable with the requirements
> of indexing and search.
>
> Karthik
>
> -Original Message-
> From: Santosh [mailto:[EMAIL PROTECTED]
> Sent: Saturday, August 21, 2004 4:55 PM
> To: Lucene Users List
> Subject: Re: pdfboxhelp
>
>
> Yes I did the same.
> I copied all the classes into classes folder but
> now when I am building the index using IndexHTML the pdfs are not added to
> this index, only text and htmls are added to index.
> what changes should I do for IndexHTML.java to build index with pdf
> - Original Message -
> From: "Karthik N S" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Saturday, August 21, 2004 4:54 PM
> Subject: RE: pdfboxhelp
>
>
> > Hi
> >
> > If u are using the jar file with Web Interface for jsp/servlet dev,
Place
> > the jar file in  "webapps///lib"
> > and also correct the Classpath for the present modification.
> >
> > 2)create u'r own package and put all u'r java files  copy the java files
> to
> > /Web-inf/Classes/
> >
> >
> > Then use the same..;{
> >
> >
> > Karthik
> >
> > -Original Message-
> > From: Santosh [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, August 21, 2004 4:31 PM
> > To: Lucene Users List
> > Subject: Re: pdfboxhelp
> >
> >
> > thanks  Natarajan and karthik,
> >
> > I corrected classpath
> >
> > but where I should write your code?
> > should I write your code in IndexHTML.java  which comes along with
lucene
> or
> > some other place?
> > one more thing
> > I kept pdfbox jar file in the classpath is this enough or I have to
build
> > the pdfbox?
> >
> > thankyou
> > - Original Message -
> > From: "Natarajan.T" <[EMAIL PROTECTED]>
> > To: "'Lucene Users List'" <[EMAIL PROTECTED]>
> > Sent: Saturday, August 21, 2004 3:20 PM
> > Subject: RE: pdfboxhelp
> >
> >
> > > Hi Santhosh,
> > >
> > > Try out this below code.(pdfbox.jar file must be in your
classpath)
> > >
> > > public String getContent(InputStream  reader) throws
> IOException{PDFParser
> > parser = null;PDDocument pdDoc = null;PDFTextStripper stripper =
> null;String
> > pdftext = "";try{parser = new PDFParser(reader);parser.parse();pdDoc =
> > parser.getPDDocument();if(pdDoc.isEncrypted()){DecryptDocument decryptor
=
> > new
> > > DecryptDocument(pdDoc);decryptor.decryptDocument("");}stripper = new
> > PDFTextStripper();pdftext = stripper.getText(pdDoc);
> > >
> > >info = pdDoc.getDocumentInformation();}catch(Exception err)
> > {System.out.println(err.getMessage());}pdDoc.close();return pdftext;}
> > >
> > > Natarajan.
> > >
> > > -Original Message-
> > > From: Santosh [mailto:[EMAIL PROTECTED]
> > > Sent: Saturday, August 21, 2004 3:14 PM
> > > To: Lucene Users List
> > > Subject: Re: pdfboxhelp
> > >
> > > Hi Don,
> > >
> > > your Idea is nice, but whenever I write the  following code in
> > > IndexHTML.java of lucene
> > >
> > >
> > > import org.pdfbox.searchengine.lucene.*;
> > >
> > > File pdfFile = new File("/path/to/the/file.pdf");
> > >
> > > // Below returns a parse PDF file in a Lucene Document object.
> > > Document doc = LucenePDFDocument.getDocument(pdfFile);
> > >
> > > Iam getting the following error
> > >
> > > package org.pdfbox.searchengine.lucene does not exist
> > >
> > > I have downloaded pdfbox source code and kept the jar file in the
> > > classpath, please help me on this- Original Message - From:
Don
> > Vaillancourt To: Lucene Users List Sent: Friday, August 20, 2004 7:37
> > PMSubject: Re: pdfboxhelp
> > >
> > >
> > >   Here is the super simple code required.
> > >
> > >   import org.pdfbox.searchengine.lucene.*;
> > >
> > >   File pdfFile = new File("/path/to/the/file.pdf");
> > >
> > >   // Below returns a parse PDF file in a Lucene Document
object.Document
> > doc = LucenePDFDocument.getDocument(pdfFile);
> > >
> > >   Santosh wrote:
> > >
> > > exactly, the same is required to me- Original Message - From:
> Don
> > Vaillancourt To: Lucene Users List Sent: Friday, August 20, 2004 6:39
> > PMSubject: Re: pdfboxhelp
> > >
> > >
> > >   What are your intensions with PDFBox?
> > >
> > >   You want 

RE: pdfboxhelp

2004-08-22 Thread Karthik N S
Hi


To begin with, try building the indexes offline (outside the Tomcat
container). Once indexing is complete, give your search code the real
path of the offline index folder, start Tomcat, and then run your
searches against it. As you experiment, you will become comfortable with
the requirements of indexing and searching.

Karthik

-Original Message-
From: Santosh [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 21, 2004 4:55 PM
To: Lucene Users List
Subject: Re: pdfboxhelp


Yes, I did the same.
I copied all the classes into the classes folder, but
now when I build the index using IndexHTML the PDFs are not added to
the index; only text and HTML files are added.
What changes should I make to IndexHTML.java to build an index that includes PDFs?
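
For readers of the archive, a minimal sketch of the kind of change being asked about, assuming the indexer decides what to index by file extension and so needs an extra branch for ".pdf". Everything named here (IndexDispatchSketch, Kind, classify, walk) is hypothetical; the only call taken from this thread is LucenePDFDocument.getDocument(file), which appears in a comment because it needs the PDFBox and Lucene jars.

```java
import java.io.File;
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;

final class IndexDispatchSketch {

    enum Kind { PDF, HTML, TEXT, SKIP }

    // Decides how a file would be indexed, based only on its extension.
    static Kind classify(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".pdf")) return Kind.PDF;
        if (name.endsWith(".html") || name.endsWith(".htm")) return Kind.HTML;
        if (name.endsWith(".txt")) return Kind.TEXT;
        return Kind.SKIP;
    }

    // Recursively visits a directory tree and counts the files of each kind.
    static void walk(File file, Map<Kind, Integer> counts) {
        if (file.isDirectory()) {
            File[] children = file.listFiles();
            if (children == null) return; // unreadable directory
            for (File child : children) walk(child, counts);
            return;
        }
        Kind kind = classify(file.getName());
        // In a real indexer the PDF branch would build the document with
        // LucenePDFDocument.getDocument(file) and add it to the IndexWriter.
        counts.merge(kind, 1, Integer::sum);
    }

    public static void main(String[] args) {
        Map<Kind, Integer> counts = new EnumMap<>(Kind.class);
        walk(new File(args.length > 0 ? args[0] : "."), counts);
        System.out.println(counts);
    }
}
```

The point of the sketch is only the dispatch: files ending in ".pdf" have to reach a branch of their own, otherwise an extension check written for HTML and text files will skip them silently.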
- Original Message -
From: "Karthik N S" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Saturday, August 21, 2004 4:54 PM
Subject: RE: pdfboxhelp


> Hi
>
> If you are using the jar file with a web interface for JSP/servlet
> development, place the jar file in "webapps///lib"
> and also correct the classpath to reflect this change.
>
> 2) Create your own package and copy all your Java files
> to /WEB-INF/classes/
>
>
> Then use the same.
>
>
> Karthik
>
> -Original Message-
> From: Santosh [mailto:[EMAIL PROTECTED]
> Sent: Saturday, August 21, 2004 4:31 PM
> To: Lucene Users List
> Subject: Re: pdfboxhelp
>
>
> Thanks, Natarajan and Karthik.
>
> I corrected the classpath,
>
> but where should I write your code?
> Should I write it in IndexHTML.java, which comes with Lucene, or
> somewhere else?
> One more thing:
> I put the PDFBox jar file on the classpath. Is this enough, or do I have
> to build PDFBox?
>
> Thank you
> - Original Message -
> From: "Natarajan.T" <[EMAIL PROTECTED]>
> To: "'Lucene Users List'" <[EMAIL PROTECTED]>
> Sent: Saturday, August 21, 2004 3:20 PM
> Subject: RE: pdfboxhelp
>
>
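> > Hi Santosh,
> >
> > Try the code below (the pdfbox.jar file must be on your classpath).
> >
> > // Package names below are those used by PDFBox 0.6.x.
> > import java.io.IOException;
> > import java.io.InputStream;
> > import org.pdfbox.encryption.DecryptDocument;
> > import org.pdfbox.pdfparser.PDFParser;
> > import org.pdfbox.pdmodel.PDDocument;
> > import org.pdfbox.util.PDFTextStripper;
> >
> > // Extracts the text of a PDF read from the given stream.
> > public String getContent(InputStream reader) throws IOException {
> >     PDDocument pdDoc = null;
> >     String pdftext = "";
> >     try {
> >         PDFParser parser = new PDFParser(reader);
> >         parser.parse();
> >         pdDoc = parser.getPDDocument();
> >         if (pdDoc.isEncrypted()) {
> >             // Try the empty password on encrypted documents.
> >             DecryptDocument decryptor = new DecryptDocument(pdDoc);
> >             decryptor.decryptDocument("");
> >         }
> >         PDFTextStripper stripper = new PDFTextStripper();
> >         pdftext = stripper.getText(pdDoc);
> >     } catch (Exception err) {
> >         System.out.println(err.getMessage());
> >     } finally {
> >         // Close only if parsing got far enough to create the document.
> >         if (pdDoc != null) {
> >             pdDoc.close();
> >         }
> >     }
> >     return pdftext;
> > }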
> >
> > Natarajan.
> >
> > -Original Message-
> > From: Santosh [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, August 21, 2004 3:14 PM
> > To: Lucene Users List
> > Subject: Re: pdfboxhelp
> >
> > Hi Don,
> >
> > Your idea is nice, but whenever I add the following code to
> > IndexHTML.java from Lucene:
> >
> >
> > import org.pdfbox.searchengine.lucene.*;
> >
> > File pdfFile = new File("/path/to/the/file.pdf");
> >
> > // Below returns a parsed PDF file in a Lucene Document object.
> > Document doc = LucenePDFDocument.getDocument(pdfFile);
> >
> > I get the following error:
> >
> > package org.pdfbox.searchengine.lucene does not exist
> >
> > I have downloaded the PDFBox source code and put the jar file on the
> > classpath. Please help me with this.
> >
> > - Original Message -
> > From: Don Vaillancourt
> > To: Lucene Users List
> > Sent: Friday, August 20, 2004 7:37 PM
> > Subject: Re: pdfboxhelp
> >
> >
> >   Here is the super simple code required.
> >
> >   import org.pdfbox.searchengine.lucene.*;
> >
> >   File pdfFile = new File("/path/to/the/file.pdf");
> >
> >   // Below returns a parsed PDF file in a Lucene Document object.
> >   Document doc = LucenePDFDocument.getDocument(pdfFile);
> >
> >   Santosh wrote:
> >
> > Exactly, that is what I need.
> >
> > - Original Message -
> > From: Don Vaillancourt
> > To: Lucene Users List
> > Sent: Friday, August 20, 2004 6:39 PM
> > Subject: Re: pdfboxhelp
> >
> >
> >   What are your intentions with PDFBox?
> >
> >   You want to use it to index PDF files?
> >
> >   Santosh wrote:
> >
> > Hi,
> >
> > I have downloaded the PDFBox zip, but I am not sure where to
> > start. How can I try the demo? I don't see any help document in this
> > download. Please help me.
> >
> >
> > regards
> > Santosh kumar
> > SoftPro Systems
> > Hyderabad
> >
> >
> > "The harder you train in peace, the lesser you bleed in war"
> >

Re: speeding up queries (MySQL faster)

2004-08-22 Thread Yonik Seeley

> For example, Nutch automatically translates such
> clauses into QueryFilters.

Thanks for the excellent pointer, Doug!  I will
definitely be implementing this optimization.

If anyone cares, I did a 1 minute hprof test with the
search server in a servlet container.  Here are the
results (sorry about Yahoo's short line length).

-Yonik

resin.hprof.txt: Exclusive Method Times (CPU) (virtual times)
 27390  (37.5%)  java.net.PlainSocketImpl.socketAccept
 14885  (20.4%)  org.apache.lucene.index.SegmentTermDocs.skipTo
  6700   (9.2%)  org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal
  5810   (8.0%)  java.io.UnixFileSystem.list
  4785   (6.5%)  org.apache.lucene.store.InputStream.readByte
  3315   (4.5%)  java.io.RandomAccessFile.readBytes
  1302   (1.8%)  java.net.SocketOutputStream.socketWrite0
  1004   (1.4%)  java.io.RandomAccessFile.seek
   546   (0.7%)  java.lang.String.intern
   336   (0.5%)  com.caucho.vfs.WriteStream.print
   248   (0.3%)  org.apache.lucene.search.TermScorer.next
   236   (0.3%)  org.apache.lucene.queryParser.QueryParser.jj_scan_token
   232   (0.3%)  org.apache.lucene.index.SegmentTermEnum.readTerm
   228   (0.3%)  org.apache.lucene.search.ConjunctionScorer.score
   200   (0.3%)  org.apache.lucene.queryParser.FastCharStream.refill
   196   (0.3%)  org.apache.lucene.store.InputStream.readVInt
   180   (0.2%)  java.security.AccessController.doPrivileged
   172   (0.2%)  org.apache.lucene.search.ConjunctionScorer.doNext
   152   (0.2%)  java.lang.Object.clone
   152   (0.2%)  org.apache.lucene.index.SegmentReader.document
   148   (0.2%)  java.lang.Throwable.fillInStackTrace
   128   (0.2%)  org.apache.lucene.index.SegmentReader.norms
   116   (0.2%)  org.apache.lucene.store.InputStream.readString
   112   (0.2%)  java.lang.StrictMath.log
   108   (0.1%)  java.util.LinkedList.addLast
   100   (0.1%)  java.net.SocketInputStream.socketRead0
    88   (0.1%)  org.apache.lucene.search.ConjunctionScorer.next








Re: speeding up queries (MySQL faster)

2004-08-22 Thread Doug Cutting
Yonik Seeley wrote:
> Setup info & Stats:
> - 4.3M documents, 12 keyword fields per document, 11
>  [ ... ]
> "field1:4 AND field2:188453 AND field3:1"
> field1:4  done alone selects around 4.2M records
> field2:188453 done alone selects around 1.6M records
> field3:1  done alone selects around 1K records
> The whole query normally selects less than 50 records
> Only the first 10 are returned (or whatever range
> the client selects).

The "field1:4" clause is probably dominating the cost of query 
execution.  Clauses which match large portions of the collection are 
slow to evaluate.  If there are not too many different such clauses then 
you can optimize this by re-using a Filter in place of such clauses, 
typically a QueryFilter.

For example, Nutch automatically translates such clauses into 
QueryFilters.  See:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/searcher/LuceneQueryOptimizer.java?view=markup

Note that this only converts clauses whose boost is zero.  Since filters 
do not affect ranking, we can only safely convert clauses which do not 
contribute to the score, i.e., those whose boost is zero.  Scores might 
still be different in the filtered results because of 
Similarity.coord().  But in Nutch, Similarity.coord() is overridden to 
always return 1.0, so that the replacement of clauses with filters does 
not alter the final scores at all.
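
A toy model of the optimization described above, written against java.util.BitSet only and not against the Lucene API: the set of documents matching each broad clause is computed once, cached, and reused as a filter, so each query only walks the posting list of its narrow clause. In Lucene the corresponding class is QueryFilter. The class and method names here (CachedFilterSketch, filterFor, search) and the tiny posting lists are made up for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CachedFilterSketch {
    private final Map<String, int[]> postings;          // clause -> matching doc ids
    private final Map<String, BitSet> cache = new HashMap<>();
    private final int maxDoc;
    int buildCount = 0;                                  // how many filters were built

    CachedFilterSketch(Map<String, int[]> postings, int maxDoc) {
        this.postings = postings;
        this.maxDoc = maxDoc;
    }

    // Returns the cached bit set for a clause, building it on first use only.
    BitSet filterFor(String clause) {
        BitSet bits = cache.get(clause);
        if (bits == null) {
            bits = new BitSet(maxDoc);
            for (int doc : postings.getOrDefault(clause, new int[0])) bits.set(doc);
            cache.put(clause, bits);
            buildCount++;
        }
        return bits;
    }

    // Walks only the narrow clause and keeps docs that pass every broad-clause filter.
    List<Integer> search(String narrowClause, String... broadClauses) {
        List<Integer> hits = new ArrayList<>();
        outer:
        for (int doc : postings.getOrDefault(narrowClause, new int[0])) {
            for (String broad : broadClauses) {
                if (!filterFor(broad).get(doc)) continue outer;
            }
            hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new HashMap<>();
        postings.put("field1:4", new int[] {0, 1, 2, 3, 4, 5, 6, 7});
        postings.put("field2:188453", new int[] {1, 3, 5, 7});
        postings.put("field3:1", new int[] {3, 4, 7});
        CachedFilterSketch index = new CachedFilterSketch(postings, 8);
        System.out.println(index.search("field3:1", "field1:4", "field2:188453")); // [3, 7]
        index.search("field3:1", "field1:4", "field2:188453");
        System.out.println("filters built: " + index.buildCount); // 2, second search reused them
    }
}
```

The model ignores scoring entirely, which matches the point made above: a filter only says which documents are allowed, so replacing a clause with one is safe only when that clause contributes nothing to the score.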

Doug


Re: speeding up queries (MySQL faster)

2004-08-22 Thread Yonik Seeley
Oops, CPU usage is *not* 50%, but closer to 98%.
This is due to a bug in the CPU% figure on RHEL 3 on
multiprocessor machines (I can run multiple threads in
while(1) loops, and it will still only show 50% CPU
usage for that process).  The aggregated (not
per-process) statistics shown by top are correct, and
they show about 73% user time, 25% system time, and
anywhere between 0.5% and 2% idle time.

Unfortunately, this means that I won't be getting any
performance improvements from using a second
IndexSearcher, and I'm stuck at being 3 times slower
than MySQL on the same data/queries.

I guess the next step is some profiling... move the
server out of the servlet container and move the
clients in with the server, and then try some hprof
work.

Does anyone have pointers to lucene caching and how to
tune it?

-Yonik 





--- Bernhard Messer <[EMAIL PROTECTED]>
wrote:
> Yonik,
> 
> there is another "synchronized" block in
> CSInputStream which could block 
> your second cpu out.


