Re: Search PDF ???

2004-10-24 Thread Honey George

 --- Eric Chow [EMAIL PROTECTED] wrote: 
 Hello,
 
 1. Is it possible to use Lucene to search PDF
 contents ?
Yes, you need to use an external tool to extract
the text from the PDF file and then pass it to Lucene
for indexing. If you search this list you will
find a lot of mails related to that.
 
 2. Can it search Chinese contents PDF files ???
I have used a tool called xpdf (on linux) and it works
with both Chinese Traditional and Chinese Simplified.
It provides language support packages for many
languages. Please take a look at the URL below.
http://www.foolabs.com/xpdf/download.html

Note that the tool only helps with extracting the text.
Whether you can search Chinese text or not depends on
the analyzer you use in Lucene. Try CJKAnalyzer for
CJK text search.
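As a sketch of the pipeline above: run the external extractor as a child process and capture its stdout as the text you hand to Lucene. This is a minimal, hypothetical helper, not part of xpdf or Lucene; the pdftotext flags shown in the comment are assumptions to check against your xpdf install.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: run an external text extractor and capture stdout.
// The extracted string can then be put into a Lucene Document for indexing.
class ExternalExtractor {
    static String extractText(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);            // merge stderr into stdout
        Process p = pb.start();
        byte[] out;
        try (InputStream in = p.getInputStream()) {
            out = in.readAllBytes();             // the extracted text
        }
        p.waitFor();
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // With xpdf this might be (assumed flags, check your install):
        //   extractText(List.of("pdftotext", "-enc", "UTF-8", "some.pdf", "-"))
        // Here echo stands in for the extractor to show the mechanism.
        System.out.print(extractText(List.of("echo", "hello")));
    }
}
```

The same capture pattern works for wvWare on .doc files; only the command changes.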

Thanks,
  George





___ALL-NEW Yahoo! Messenger - 
all new features - even more fun!  http://uk.messenger.yahoo.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Corrupted indexes

2004-10-23 Thread Honey George
Andy,
  Please take a look at the following thread; it
should help you.
http://www.mail-archive.com/[EMAIL PROTECTED]/msg08976.html

Thanks,
  George

 --- Andy Goodell [EMAIL PROTECTED] wrote: 
 Recently, I've been getting a lot of corrupted
 Lucene indexes. They appear to return search results
 normally, but there is really no good way to test
 whether information is missing. The main problem is
 that when I try to optimize, I get the following
 exception:
 
 java.io.IOException: read past EOF
         at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:218)
         at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
         at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:356)
         at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:323)
         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:422)
         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:94)
         at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
 
 This is preventing me from optimizing the indexes,
 and also scares me that information might be missing.
 
 Does anybody know what's going on here, and what
 might be wrong?
 
 Thanks for your time,
 - andy g
 




Re: Lucene docs

2004-09-15 Thread Honey George
Try these,
http://jakarta.apache.org/lucene/docs/gettingstarted.html
http://www.darksleep.com/lucene/

Thanks,
  George
 --- Ian McDonnell [EMAIL PROTECTED] wrote: 
 What is the best resource for beginners looking to
 understand Lucene's functionality, i.e. its use of
 fields, documents, the index reader and writer, etc.?
 
 Is there any web resource that goes into detail on
 the exact workings of it?
 
 Ian
 

  








RE: Search PharseQuery

2004-09-14 Thread Honey George
 --- Natarajan.T [EMAIL PROTECTED]
wrote: 
 I am trying to extend the current behavior.
You might have already seen a mail from Cocula Remi on
this. Please provide more details of the problem for
specific comments: basically the problem you are
facing and/or what behavior you are trying to extend.
This was not clear from your email. An example will
make things clearer.

Thanks & Regards,
   George










PorterStemfilter

2004-09-14 Thread Honey George
Hi,
 This might be more of a question related to the
PorterStemmer algorithm than to Lucene, but
if anyone has the knowledge please share.

I am using the PorterStemFilter that ships with Lucene
and it turns out that searching for the word 'printer'
does not return a document containing the text
'print'. To narrow down the problem, I tested the
PorterStemFilter in a standalone program and it turns
out that the stem of 'printer' is 'printer' and not
'print'. That is, 'printer' is not split into 'print' +
'er'; the whole word is the stem. Can somebody
explain this behavior?

Thanks & Regards,
   George








RE: Help for text based indexing

2004-09-14 Thread Honey George
You could receive the group name as an input from the
user and construct a BooleanQuery internally which
will query only the group field based on the user
input. The user then need not append the group name to
the search string.
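A minimal sketch of that idea, assuming a field named "group" as in this thread (the helper name is made up): build the final query string server-side so the user's input stays untouched.

```java
// Sketch: prepend the group restriction to the user's query server-side.
// The combined string would then go to Lucene's QueryParser; alternatively
// a BooleanQuery with a TermQuery on the "group" field can be built
// programmatically instead of via a string.
class GroupQuery {
    static String withGroup(String group, String userQuery) {
        return "group:" + group + " AND (" + userQuery + ")";
    }

    public static void main(String[] args) {
        System.out.println(withGroup("Group1", "Hello"));
        // prints: group:Group1 AND (Hello)
    }
}
```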

Thanks,
   George
 --- mahaveer jain [EMAIL PROTECTED] wrote: 
 If I have rightly understood, you mean to say that
 the query for the search has to be
 
 Group1 AND Hello (if hello is what I want to
 search for?)
  
 Cocula Remi [EMAIL PROTECTED] wrote:
 A keyword is not tokenized; that's why you won't be
 able to search over a part of it. You'd rather use a
 Text field.
 
 About creating a special field:
 
 IndexWriter Ir = ...
 File f = ...
 Document doc = new Document();
 if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group1"))
 {
     doc.add(Field.Text("group", "Group1"));
 }
 if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group2"))
 {
     doc.add(Field.Text("group", "Group2"));
 }
 doc.add(Field.Text("content", getContent(f)));
 Ir.addDocument(doc);
 
 Then you can search in group1 with a query like:
 
 group:Group1 AND rest_of_the_query
 
 
 
 -Original Message-
 From: mahaveer jain [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, September 14, 2004 18:03
 To: Lucene Users List
 Subject: RE: Help for text based indexing
 
 
 Well, in my case the path is a Keyword field. I had tried
 that earlier and it does not seem to work in a
 single index file.
 
 Can you explain a bit more about adding group1 and
 group2 ?
 
 Cocula Remi wrote:
 Well you could add a field to each of your Documents
 whose value would be either group1 or group2.
 Or you could use the path to your files ...
 
 
 
 -Original Message-
 From: mahaveer jain [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, September 14, 2004 17:49
 To: [EMAIL PROTECTED]
 Subject: RE: Help for text based indexing
 
 
 I am clear with looping recursively to index all the
 files under the Root folder.
 But the problem is if I want to search only in
 group1 or group2. Is it possible to search only in
 one of the group folders?
 
 Cocula Remi wrote:
 You just have to loop recursively over the
 C:\tomcat\webapps\Root tree to create your index.
 Yes you can index databases; you will just have to
 write a mechanism that is able to create
 org.apache.lucene.document.Document from database.
 For instance : 
 - connect JDBC
 - run a query for obtaining a ResultSet
 - loop for each row of that ResultSet :
 Create a new org.apache.lucene.document.Document
 from ResultSet data
 and add this document to the Index.
 end loop.
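The "loop recursively over the Root tree" step can be sketched with the standard library alone (collectFiles is a made-up name; turning each path into a Lucene Document is left out):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: recursively collect every regular file under a root directory.
// Each file would then become a Lucene Document, with its group name
// derivable from the first path element under the root.
class TreeWalker {
    static List<Path> collectFiles(Path root) throws IOException {
        try (Stream<Path> s = Files.walk(root)) {
            return s.filter(Files::isRegularFile).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("Root");  // stand-in for webapps/Root
        Files.createDirectories(root.resolve("group1"));
        Files.write(root.resolve("group1").resolve("a.txt"), "hello".getBytes());
        for (Path p : collectFiles(root)) {
            // group name = first path element under the root
            System.out.println(root.relativize(p).getName(0) + " -> " + p.getFileName());
        }
    }
}
```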
 
 For incremental indexing, I suppose you have to
 store some timestamp field in your index; but it's
 up to you.
 Note that Lucene is very fast and I don't think that
 incremental indexing is required for a small or medium
 amount of data.
 
 
 
 -Original Message-
 From: mahaveer jain [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, September 14, 2004 17:22
 To: [EMAIL PROTECTED]
 Subject: Help for text based indexing
 
 
 
 Hi
 
 I have implemented text-based search using lucene. It
 was wonderful playing around with it.
 
 Now I want to enhance the application.
 
 I have a Root folder, and under that many other
 folders that are group specific, say (group1,
 group2, .. so on). The Root folder is
 C:\tomcat\webapps\Root, with the group folders within it.
 
 Now I index these groups separately, i.e. I have
 indexes at C:/index/group1, C:/index/group2,
 C:/index/group3 and so on.
 
 I want to know if I can have only one index for all
 of these, say C:/index/Root (holding the index for all
 the folders), and still be able to search using
 C:\tomcat\webapps\Root\group1 (if I want to search
 group1), and similarly for the other groups.
 
 Let me know if this is possible and whether anybody
 has tried it.
 
 2nd question
 
 Is lucene good for indexing databases? How do we
 support incremental indexing?
 
 (Right now I am using LIKE for searching.)
 
 Thanks in Advance
 
 Mahaveer
 
 
 

Re: PorterStemfilter

2004-09-14 Thread Honey George
 --- Tea Yu [EMAIL PROTECTED] wrote: 
 David,
 
 For me, I don't want a search for "in print" to give
 results from "in printer";
 I'll consider that over-stemmed otherwise.
Here the "in" won't be considered, as it is a stopword
in most of the analyzers. I know it is in
StandardAnalyzer. So searching for 'in print' will not
return the document containing 'in printer', because
stem('printer') is 'printer' and not 'print', so
'printer' is what gets stored in the index.
Enclosing in double quotes does not prevent stemming.

 I'm also not satisfied that "effective" is
 stemmed to "effect" by
 snowball recently


I have tested this with PorterStemFilter and
"effective" is indeed stemmed to "effect" there too.
There are more serious problems: "printable" is
stemmed to "printabl".

Thanks,
  George








Re: Existing Parsers

2004-09-13 Thread Honey George
Hi Chris,
   I do not have stats, but I think the performance
is reasonable. I use xpdf for PDF & wvWare for DOC.
The size of my index is ~2GB (this is not limited to
only pdf & doc). To avoid memory problems, I have
set an upper bound on the size of the documents that
can be indexed. For example, in my case I do not index
documents if the size is more than 4MB. You could try
something like that.
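The upper bound mentioned above can be as simple as a size check before extraction; a sketch under the 4MB figure from this mail (class and method names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: skip documents over a size cap before running text extraction,
// so one huge file cannot exhaust memory during indexing.
class SizeGate {
    static final long MAX_BYTES = 4L * 1024 * 1024;  // 4MB, as in the mail

    static boolean shouldIndex(Path p) throws IOException {
        return Files.size(p) <= MAX_BYTES;
    }

    public static void main(String[] args) throws IOException {
        Path doc = Files.createTempFile("doc", ".txt");
        Files.write(doc, "small file".getBytes());
        System.out.println(shouldIndex(doc));  // prints true
    }
}
```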

Thanks & Regards,
   George

 --- Chris Fraschetti [EMAIL PROTECTED] wrote: 
 Some of the tools listed use cmd line execs to
 output a doc of some
 sort to text and then I grab the text and add it to
 a lucene doc, etc
 etc...
 
 Any stats on the scalability of that? In large scale
 applications, I'm
 assuming this will cause some serious issues...
 anyone have any input
 on this?
 
 -Chris Fraschetti
 
 
 On Thu, 09 Sep 2004 09:54:43 -0700, David Spencer
 [EMAIL PROTECTED] wrote:
  Honey George wrote:
  
   Hi,
 I know some of them.
   1. PDF
+ http://www.pdfbox.org/
+ http://www.foolabs.com/xpdf/download.html
  - I am using this and found good. It even
 supports
  
  My dated experience from 2 years ago was that (the
 evil, native code)
  foolabs pdf parser was the best, but obviously
 things could have changed.
  
 

http://www.mail-archive.com/[EMAIL PROTECTED]/msg02912.html
  
various languages.
   2. word
 + http://sourceforge.net/projects/wvware
   3. excel
 +
 http://www.jguru.com/faq/view.jsp?EID=1074230
  
   -George
--- [EMAIL PROTECTED] wrote:
  
  Anyone know of any reliable parsers out there
 for
  pdf word
  excel or powerpoint?
  
 For powerpoint it's not easy. I've been using this
 and it has worked
 fine until recently, but it seems to sometimes go into
 an infinite loop now
 on some recent PPTs. Native code, and a package
 that seems to be dormant,
 but to some extent it does the job. The file
 ppthtml does the work.
  
  http://chicago.sourceforge.net/xlhtml
  
  
  
  
  
  
  









Case sensitiveness and wildcard searches

2004-09-09 Thread Honey George
Hi,
 I noticed a behavior with wildcard searches and would
like to clarify it.

From the FAQ
http://www.jguru.com/faq/view.jsp?EID=538312
on JGuru, the Analyzer is not used for wildcard queries.
In my case I have a document which contains the word
IMPORTANT. I use PorterStemFilter + StandardAnalyzer
for indexing & searching. I get the document if
I search for the word IM*. But if the analyzer is not
used, then who converts the word to lowercase?

My code will look like this.

---
QueryParser qp = new QueryParser("title",
  new MyAnalyzer());
Query q = qp.parse(text);
---


Though I pass the text in uppercase (IM*), when I
print the Query object I can see it in lowercase,
something like (title:im*).

I am using lucene-1.3-final. Can someone explain this?

Thanks & regards,
   George










Re: Case sensitiveness and wildcard searches

2004-09-09 Thread Honey George
Thanks for the links, René.
 The mail is not exactly talking about my case, because
the StandardAnalyzer which I use does lowercase the
input. So it is the same scenario as the FAQ entry.
-George

 --- René_Hackl [EMAIL PROTECTED] wrote: 
 Hi George,
 
 I'm not sure about v1.3, but you may want to take a
 look
 at
 

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=9342
 
 or
 

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1806371
 
 cheers,
 René
 








Re: Existing Parsers

2004-09-09 Thread Honey George
Hi,
  I know some of them.
1. PDF
 + http://www.pdfbox.org/
 + http://www.foolabs.com/xpdf/download.html
   - I am using this and found it good. It even supports 
 various languages.
2. word
  + http://sourceforge.net/projects/wvware
3. excel
  + http://www.jguru.com/faq/view.jsp?EID=1074230

-George
 --- [EMAIL PROTECTED] wrote: 
 Anyone know of any reliable parsers out there for
 pdf, word, 
 excel or powerpoint?
 









Re: too many open files

2004-09-01 Thread Honey George
Patrick,
  For your second problem, are you seeing a behavior
similar to the one discussed in the following thread?
http://www.mail-archive.com/[EMAIL PROTECTED]/msg08952.html

If yes, you can see the solution there.

Thanks,
  George

 --- Patrick Kates [EMAIL PROTECTED] wrote: 
 I am having two problems with my client's lucene
 indexes.
 
 One, we are getting a FileNotFound exception (too
 many open files).  This
 would seem to indicate that I need to increase the
 number of open files on
 our Suse 9.0 Pro box.  I have our sys admin working
 on this problem for me.
 
 Two, because of this error and subsequent restarting
 of the box, we seem to
 have lost an index segment or two.  My client's tape
 backups do not contain
 the segments we know about.
 
 I am concerned about the missing index segments as
 they seem to be
 preventing any further update of the index.  Does
 anyone have any
 suggestions as to how to fix this besides a full
 re-index of the problem
 indexes?
 
 I was wondering if maybe a merge of the index might
 solve the problem?  I
 could move our nightly merge of the index files to
 sooner, but I am afraid
 that the merge might make matters worse?
 
 Any ideas or helpful speculation would be greatly
 appreciated.
 
 Patrick
 
 
 
 









Re: Index Size

2004-08-19 Thread Honey George
Hi,
 Please check for hidden files in the index folder. If
you are using linux, do something like

ls -al <index folder>

I am also facing a similar problem where the index
size is greater than the data size. In my case there
were some hidden temporary files which Lucene
creates.
That was taking half of the total size.
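A small sketch of the same check in Java, in case you want it inside the indexing code rather than at the shell (hidden() is a made-up helper; on Unix, hidden means dot-prefixed):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: list hidden files directly inside an index directory --
// the same leftovers that `ls -al` would reveal.
class HiddenFiles {
    static List<Path> hidden(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.filter(p -> {
                try {
                    return Files.isHidden(p);
                } catch (IOException e) {
                    return false;  // unreadable entries are skipped
                }
            }).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index");
        Files.createFile(dir.resolve(".leftover"));   // fake temporary file
        Files.createFile(dir.resolve("segments"));
        System.out.println(hidden(dir).size());       // 1 on Unix-like systems
    }
}
```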

My problem is that after deleting the temporary files,
the index size is the same as the data size. That
again seems to be a problem. I am yet to find out the
reason.

Thanks,
   george


 --- Rob Jose [EMAIL PROTECTED] wrote: 
 Hello
 I have indexed several thousand (52 to be exact)
 text files and I keep running out of disk space to
 store the indexes.  The size of the documents I have
 indexed is around 2.5 GB.  The size of the Lucene
 indexes is around 287 GB.  Does this seem correct? 
 I am not storing the contents of the file, just
 indexing and tokenizing.  I am using Lucene 1.3
 final.  Can you guys let me know what you are
 experiencing?  I don't want to go into production
 with something that I should be configuring better. 
 
 
 I am not sure if this helps, but I have a temp index
 and a real index.  I index the file into the temp
 index, and then merge the temp index into the real
 index using the addIndexes method on the
 IndexWriter.  I have also set the production writer
 setUseCompoundFile to true.  I did not set this on
 the temp index.  The last thing that I do before
 closing the production writer is to call the
 optimize method.  
 
 I would really appreciate any ideas to get the index
 size smaller if it is at all possible.
 
 Thanks
 Rob 








RE: Restoring a corrupt index

2004-08-19 Thread Honey George
This is what I did.

There are 2 classes in the lucene source which are not
public and therefore cannot be accessed from outside
the package. The classes are
1. org.apache.lucene.index.SegmentInfos
   - a collection of segments
2. org.apache.lucene.index.SegmentInfo
   - represents a single segment

I took these two files and moved to a separate folder.
Then created a class with the following code fragment.

public void displaySegments(String indexDir)
throws Exception
{
    Directory dir =
        (Directory)FSDirectory.getDirectory(indexDir, false);
    SegmentInfos segments = new SegmentInfos();
    segments.read(dir);

    StringBuffer str = new StringBuffer();
    int size = segments.size();
    str.append("Index Dir = " + indexDir);
    str.append("\nTotal Number of Segments " + size);
    str.append("\n--------------------------------------");
    for (int i = 0; i < size; i++)
    {
        str.append("\n");
        str.append((i+1) + ". ");
        str.append(((SegmentInfo)segments.get(i)).name);
    }
    str.append("\n--------------------------------------");

    System.out.println(str.toString());
}


public void deleteSegment(String indexDir, String segmentName)
throws Exception
{
    Directory dir =
        (Directory)FSDirectory.getDirectory(indexDir, false);
    SegmentInfos segments = new SegmentInfos();
    segments.read(dir);

    int size = segments.size();
    String name = null;
    boolean found = false;
    for (int i = 0; i < size; i++)
    {
        name = ((SegmentInfo)segments.get(i)).name;
        if (segmentName.equals(name))
        {
            found = true;
            segments.remove(i);
            System.out.println("Deleted the segment with name " + name
                + " from the segments file");
            break;
        }
    }
    if (found)
    {
        segments.write(dir);
    }
    else
    {
        System.out.println("Invalid segment name: " + segmentName);
    }
}

Use the displaySegments() method to display the
segments and deleteSegment to delete the corrupt
segment.

Thanks,
  George

 --- Karthik N S [EMAIL PROTECTED] wrote: 
 Hi Guys
 
 
In our situation we would be indexing millions and
 millions of information
 documents,
 
   with huge gigabytes of data indexed, and
 finally it would be put into a
 MERGED INDEX, categorized accordingly.
 
   There may be a possibility of corruption, so
 please do post the code
 referrals.
 
 
  Thx
 Karthik
 
 
 -Original Message-
 From: Honey George [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, August 18, 2004 5:51 PM
 To: Lucene Users List
 Subject: Re: Restoring a corrupt index
 
 
 Thanks Erik, that worked. I was able to remove the
 corrupt index and now it looks like the index is OK.
 I
 was able to view the number of documents in the
 index.
 Before that I was getting the error,
 java.io.IOException: read past EOF
 
 I am yet to find out how my index got corrupted.
 There
 is another thread going on about this topic,

http://www.mail-archive.com/[EMAIL PROTECTED]/msg03165.html
 
 If anybody is facing similar problem and is
 interested
 in the code I can post it here.
 
 Thanks,
   George
 
 
 
  --- Erik Hatcher [EMAIL PROTECTED]
 wrote:
  The details of the segments file (and all the
  others) is freely
  available here:
 
 
 

http://jakarta.apache.org/lucene/docs/fileformats.html
 
  Also, there is Java code in Lucene, of course,
 that
  manipulates the
  segments file which could be leveraged (although
  probably package
  scoped and not easily usable in a standalone
 repair
  tool).
 
  Erik
 
 
  On Aug 18, 2004, at 6:50 AM, Honey George wrote:
 
   Looks like problem is not with the hexeditor,
 even
  in
   the ultraedit(i had access to a windows box) I
 am
   seeing the same display. The problem is I am not
  able
   to identify where a record starts with just 1
  record
   in the file.
  
   Need to try some alternate approach.
  
   Thanks,
 George
 
 
 
 
 
 

 
  








Re: searchhelp

2004-08-19 Thread Honey George
Hi,
  Note that Lucene only provides an API to build a
search engine; you can use it however you want. You
can pass data for indexing in 2 forms:
1. java.lang.String
2. java.io.Reader

What Lucene receives is either of the two objects above.
In the case of non-text documents you need to
extract the text information from the documents and
either create a text file and convert it to a Reader
object or create a String object (for small files). 

For indexing database contents, you need to write your
own APIs to get data from the database (using JDBC/EJB
etc), convert the data to a String object, and pass it
to Lucene for indexing.

Again, Lucene is not responsible for getting the data
from your application. It only indexes the data you
give it.

Also, for extracting contents from pdf & doc
files (generally known as text extraction) I know of
2 more tools:
wvWare - for word documents
pdftotext (xpdf) - for pdf documents

Google around and you will get lots of links.

Hope this helps.

Thanks,
   George

 --- Santosh [EMAIL PROTECTED] wrote: 
 I recently joined the list and haven't gone through
 any previous mails; if
 you have any mails or related code please forward them
 to me.
 - Original Message -
 From: Chandan Tamrakar [EMAIL PROTECTED]
 To: Lucene Users List
 [EMAIL PROTECTED]
 Sent: Thursday, August 19, 2004 3:47 PM
 Subject: Re: searchhelp
 
 
  For PDF you need to extract the text from pdf files
 using the pdfbox library, and
  for word documents you can use the Apache POI APIs.
  There are messages posted on the lucene list related
 to your queries. About databases, I guess
  someone must have done it. :)
 
  - Original Message -
  From: Santosh [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]
  Sent: Thursday, August 19, 2004 3:58 PM
  Subject: searchhelp
 
 
  Hi,
 
  I am using the lucene search engine for my
 application.
 
  I am able to search through text files and
 htmls as specified by lucene.
 
  Can you please clarify my doubts:
 
  1. Can lucene search through pdfs and word
 documents? If yes, then how?
 
  2. Can lucene search through a database? If yes,
 then how?
 
  Thank you,
 
  Santosh
 
 
 


 
 
 
 









RE: Restoring a corrupt index

2004-08-19 Thread Honey George
If I understand correctly, you have a situation where
you have a large main index, you create small
indexes, and finally merge them into the main index.
It can happen that halfway through merging the system
crashes and the index gets corrupted. I do not think
you can use my solution in that case. 

What I am trying to do is remove a corrupt segment
and its associated files from the index folder, not
fix a corrupt segment. This way at least I can add
new documents to the index. Of course, I am sure I
didn't lose anything, because my index file size was
actually 0 bytes.


Thanks,
  George

 --- Karthik N S [EMAIL PROTECTED] wrote: 
 Hi
 
   George
 
   Do you think the same would work for MERGED
 indexes?
   Please can you suggest a solution.
 
 
   Karthik
 
 -Original Message-
 From: Honey George [mailto:[EMAIL PROTECTED]
 Sent: Thursday, August 19, 2004 2:08 PM
 To: Lucene Users List
 Subject: RE: Restoring a corrupt index
 
 
 This is what I did.
 
 There are 2 classes in the lucene source which are
 not
 public and therefore cannot be accessed from outside
 the package. The classes are
 1. org.apache.lucene.index.SegmentInfos
- collection of segments
 2. org.apache.lucene.index.SegmentInfo
-represents a sigle segment
 
 I took these two files and moved to a separate
 folder.
 Then created a class with the following code
 fragment.
 
 public void displaySegments(String indexDir)
 throws Exception
 {
     Directory dir =
         (Directory)FSDirectory.getDirectory(indexDir, false);
     SegmentInfos segments = new SegmentInfos();
     segments.read(dir);
 
     StringBuffer str = new StringBuffer();
     int size = segments.size();
     str.append("Index Dir = " + indexDir);
     str.append("\nTotal Number of Segments " + size);
     str.append("\n--------------------------------------");
     for (int i = 0; i < size; i++)
     {
         str.append("\n");
         str.append((i+1) + ". ");
         str.append(((SegmentInfo)segments.get(i)).name);
     }
     str.append("\n--------------------------------------");
 
     System.out.println(str.toString());
 }
 
 
 public void deleteSegment(String indexDir, String segmentName)
 throws Exception
 {
     Directory dir =
         (Directory)FSDirectory.getDirectory(indexDir, false);
     SegmentInfos segments = new SegmentInfos();
     segments.read(dir);
 
     int size = segments.size();
     String name = null;
     boolean found = false;
     for (int i = 0; i < size; i++)
     {
         name = ((SegmentInfo)segments.get(i)).name;
         if (segmentName.equals(name))
         {
             found = true;
             segments.remove(i);
             System.out.println("Deleted the segment with name " + name
                 + " from the segments file");
             break;
         }
     }
     if (found)
     {
         segments.write(dir);
     }
     else
     {
         System.out.println("Invalid segment name: " + segmentName);
     }
 }
 
 Use the displaySegments() method to display the
 segments and deleteSegment to delete the corrupt
 segment.
 
 Thanks,
   George
 
  --- Karthik N S [EMAIL PROTECTED] wrote:
  Hi Guys
 
 
 In Our Situation we would be indexing  Million
 
  Millions of Information
  documents
 
with  Huge Giga Bytes of Data Indexed  and
  finally would be  put into a
  MERGED INDEX, Categorized accordingly.
 
There may be a possibility of Corruption,  So
  Please do post  the code
  reffrals
 
 
   Thx
  Karthik
 
 
  -Original Message-
  From: Honey George [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, August 18, 2004 5:51 PM
  To: Lucene Users List
  Subject: Re: Restoring a corrupt index
 
 
  Thanks Erik, that worked. I was able to remove the
  corrupt index and now it looks like the index is
 OK.
  I
  was able to view the number of documents in the
  index.
  Before that I was getting the error,
  java.io.IOException: read past EOF
 
  I am yet to find out how my index got corrupted.
  There
  is another thread going on about this topic,
 

http://www.mail-archive.com/[EMAIL PROTECTED]/msg03165.html
 
  If anybody is facing similar problem and is
  interested
  in the code I can post it here.
 
  Thanks,
George
 
 
 
   --- Erik Hatcher [EMAIL PROTECTED]
  wrote:
   The details of the segments file (and all the
   others) is freely
   available here:
  
  
  
 

http://jakarta.apache.org/lucene/docs/fileformats.html
  
   Also, there is Java code in Lucene, of course,
  that
   manipulates the
   segments file which could be leveraged (although
   probably package
   scoped and not easily usable in a standalone
  repair
   tool).
  
 Erik
  
  
   On Aug 18, 2004, at 6:50 AM, Honey George wrote:
  
Looks like problem is not with the hexeditor,
  even
   in
the ultraedit(i had access to a windows box) I
  am

RE: Restoring a corrupt index

2004-08-18 Thread Honey George
Looks like the problem is not with the hex editor; even in
UltraEdit (I had access to a Windows box) I am
seeing the same display. The problem is I am not able
to identify where a record starts with just 1 record
in the file.

Need to try some alternate approach.

Thanks,
  George

 --- [EMAIL PROTECTED] wrote: 
 http://www.ultraedit.com/ is the best!
 
 However, I cannot imagine how another hex editor
 wouldn't work.
 
 -Original Message-
 From: Honey George [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, August 17, 2004 10:35 AM
 To: Lucene Users List
 Subject: RE: Restoring a corrupt index
 
 
 Wallen,
 Which hex editor have you used. I am also facing a
 similar problem. I tried to use KHexEdit and it
 doesn't seem to help. I am attaching with this email
 my segments file. I think only the segment with name
 _ung is a valid one, I wanted to delete the
 remaining..but couldn't. Can you help?
 
 -George
 






___ALL-NEW Yahoo! Messenger - 
all new features - even more fun!  http://uk.messenger.yahoo.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Restoring a corrupt index

2004-08-18 Thread Honey George
Thanks Erik, that worked. I was able to remove the
corrupt index and now it looks like the index is OK. I
was able to view the number of documents in the index.
Before that I was getting the error,
java.io.IOException: read past EOF

I am yet to find out how my index got corrupted. There
is another thread going on about this topic,
http://www.mail-archive.com/[EMAIL PROTECTED]/msg03165.html

If anybody is facing a similar problem and is interested
in the code, I can post it here.

Thanks,
  George



 --- Erik Hatcher [EMAIL PROTECTED] wrote: 
 The details of the segments file (and all the
 others) is freely 
 available here:
 
 

http://jakarta.apache.org/lucene/docs/fileformats.html
 
 Also, there is Java code in Lucene, of course, that
 manipulates the 
 segments file which could be leveraged (although
 probably package 
 scoped and not easily usable in a standalone repair
 tool).
 
   Erik
 
 
 On Aug 18, 2004, at 6:50 AM, Honey George wrote:
 
  Looks like problem is not with the hexeditor, even
 in
  the ultraedit(i had access to a windows box) I am
  seeing the same display. The problem is I am not
 able
  to identify where a record starts with just 1
 record
  in the file.
 
  Need to try some alternate approach.
 
  Thanks,
George









RE: Restoring a corrupt index

2004-08-17 Thread Honey George
Wallen,
Which hex editor have you used? I am also facing a
similar problem. I tried to use KHexEdit and it
doesn't seem to help. I am attaching with this email
my segments file. I think only the segment with name
_ung is a valid one, I wanted to delete the
remaining, but couldn't. Can you help?

-George



 --- [EMAIL PROTECTED] wrote: 
 I fixed my own problem, but hope this might help
 someone else in the future:
 
 I went into my segments file (with a hex editor),
 deleted the record for
 _cu0v and changed the length 0x20 to be 0x1f, and it
 seems I have most of my
 index back!
 
 Maybe a developer could elaborate on this?
 






RE: Restoring a corrupt index

2004-08-17 Thread Honey George
I think attachments are filtered. This is what I see
when I open it in the hex editor.

:0000 00 04 e0 af 00 00 00 02 05 5f 36 75 6e 67 00 04  ..à¯....._6ung..
:0010 1e fb 05 5f 36 75 6e 69 00 00 00 01 00 00 00 00  .û._6uni........
:0020 00 00 c1 b4                                      ..Á´
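For reference, the dump above can be decoded field by field. This sketch hardcodes those 36 bytes and walks them, assuming the segments layout described on the Lucene file formats page (counter, segment count, then a length-prefixed name plus a 4-byte document count per record); the field interpretations are best-effort and the trailing eight bytes are left undecoded.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

/** Decode the 36 bytes shown in the hex dump above. */
public class SegmentsDump {
    static final byte[] RAW = {
        0x00, 0x04, (byte) 0xe0, (byte) 0xaf,  // counter  = 0x0004e0af = 319663
        0x00, 0x00, 0x00, 0x02,                // segCount = 2
        0x05, '_', '6', 'u', 'n', 'g',         // name length 5, "_6ung"
        0x00, 0x04, 0x1e, (byte) 0xfb,         // docCount = 0x00041efb = 270075
        0x05, '_', '6', 'u', 'n', 'i',         // name length 5, "_6uni"
        0x00, 0x00, 0x00, 0x01,                // docCount = 1
        0x00, 0x00, 0x00, 0x00,                // trailing bytes: meaning unclear,
        0x00, 0x00, (byte) 0xc1, (byte) 0xb4   // left undecoded here
    };

    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(RAW));
        System.out.println("counter  = " + in.readInt());
        int segCount = in.readInt();
        System.out.println("segCount = " + segCount);
        for (int i = 0; i < segCount; i++) {
            byte[] name = new byte[in.readUnsignedByte()];
            in.readFully(name);
            System.out.println("segment " + new String(name, "US-ASCII")
                    + ", docCount = " + in.readInt());
        }
    }
}
```

Read this way, the file contains two segment records, _6ung and _6uni, which is why it is hard to tell by eye in a hex editor where one record ends and the next begins.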

-George





 --- Honey George [EMAIL PROTECTED] wrote: 
 Wallen,
 Which hex editor have you used. I am also facing a
 similar problem. I tried to use KHexEdit and it
 doesn't seem to help. I am attaching with this email
 my segments file. I think only the segment with name
 _ung is a valid one, I wanted to delete the
 remaining..but couldn't. Can you help?
 
 -George
 
 
 
  --- [EMAIL PROTECTED] wrote: 
  I fixed my own problem, but hope this might help
  someone else in the future:
  
  I went into my segments file (with a hex editor),
  deleted the record for
  _cu0v and changed the length 0x20 to be 0x1f, and
 it
  seems I have most of my
  index back!
  
  Maybe a developer could elaborate on this?
  
 
 
   
   
   









RE: Not deleting temp files after updating/optimising

2004-07-02 Thread Honey George
Hi,
  I am facing the same problem with temporary index
files. I can see a lot of temporary files (hidden) not
being deleted (lucene-1.3-final + Linux RH7 + JDK
1.3.1).

The size of the temporary files is almost the same as that
of the index. I have deleted all the hidden temporary
files and now my directory contents are as given
below.

_2lok.fdt  _2lok.fnm  _6c2h.fdx  _6hgv.fdt  _6hgv.fnm 
_6hh1.fdx  _7gqr.fdt  _7gqr.fnm  _918i.fdx  deletable
_2lok.fdx  _6c2h.fdt  _6c2h.fnm  _6hgv.fdx  _6hh1.fdt 
_6hh1.fnm  _7gqr.fdx  _918i.fdt  _918i.fnm  segments

Again I see that the index size is bigger than the
data size. The data size is 5.3GB but the size of the
index is 7GB. I have almost 2,00,000 documents in the
index.

Any help in the above 2 problems is much appreciated.
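To quantify a comparison like 5.3 GB of data versus 7 GB of index, the on-disk totals can be computed with plain Java; leftover hidden temporary files show up directly in such a total, which is one way the index can appear larger than the data. The class name and path below are placeholders, not from the original thread:

```java
import java.io.File;

/** Sum on-disk file sizes for an index directory (placeholder paths). */
public class IndexSize {
    /** Total bytes of regular files directly inside dir; with hiddenOnly
     *  set, count only hidden files (e.g. leftover temporary files). */
    public static long dirSize(File dir, boolean hiddenOnly) {
        long total = 0;
        File[] files = dir.listFiles();
        if (files == null) {
            return 0;                         // not a directory or unreadable
        }
        for (File f : files) {
            if (f.isFile() && (!hiddenOnly || f.isHidden())) {
                total += f.length();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        File index = new File(args.length > 0 ? args[0] : "/path/to/index");
        System.out.println("total bytes  = " + dirSize(index, false));
        System.out.println("hidden bytes = " + dirSize(index, true));
    }
}
```

If the hidden-bytes figure is a large fraction of the total, the stale temporary files, rather than Lucene's normal index overhead, account for the difference.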

Thanks & regards,
   George








RE: Not deleting temp files after updating/optimising

2004-07-02 Thread Honey George
 --- Honey George [EMAIL PROTECTED] wrote: 
 Hi,
   I am facing the same problem with temporary index
 files. I can see a lot of temporary files (hidden) not
 being deleted (lucene-1.3-final + Linux RH7 + JDK
 1.3.1).
Sorry for the spam. I use lucene-1.2. The problem was
actually found in lucene-1.2.




