Re: Parsers

2003-05-29 Thread Pete Lewis
Hi Adriano Thanks. Code samples would be nice :) Will come back if I find something for .ppt. Pete - Original Message - From: Adriano Labate [EMAIL PROTECTED] To: 'Lucene Users List' [EMAIL PROTECTED] Sent: Wednesday, May 28, 2003 1:03 PM Subject: RE : Parsers The www.textmining.org

RE : Parsers

2003-05-29 Thread Adriano Labate
Pete, Here are some samples. For Word using Textmining: String textContent = new WordExtractor().extractText(inputStream); For PDF using Textmining: String textContent = new PDFExtractor().extractText(inputStream); For Excel using POI: (From

RE: Sort results by date alone?

2003-05-29 Thread Aviran Mordo
I think I saw a solution for this in the past. Try searching the mailing list. Anyway, you can always use the SearchBean, which is in the Lucene sandbox, to sort by any field. -Original Message- From: news [mailto:[EMAIL PROTECTED] On Behalf Of David Weitzman Sent: Tuesday, May 27, 2003 8:26 PM

Re: Wildcard workaround

2003-05-29 Thread Steve Rowe
Andrei, If this sort of thing is important enough, you could implement a customized analyzer which would reverse all terms and store them in a separate field (in addition to a non-reversed field). This will double your index size, of course. Then, when searching, suffix query terms
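A minimal sketch of the reversed-term idea, in plain Java rather than against Lucene's analyzer API. The class and method names are hypothetical. It only shows the string handling: each term is also stored reversed, and a leading-wildcard pattern such as *.pdf is turned into a trailing-wildcard (prefix) pattern for the reversed field.

```java
/** Hypothetical helper illustrating the reversed-field workaround; not part of Lucene. */
final class ReversedTermSketch {

    /** Value to store in the extra reversed field at indexing time. */
    static String reverse(String term) {
        return new StringBuilder(term).reverse().toString();
    }

    /**
     * Rewrites a pattern with exactly one leading '*' (for example "*.pdf")
     * into a prefix pattern for the reversed field ("fdp.*").
     */
    static String toReversedPrefix(String pattern) {
        boolean singleLeadingStar = pattern.startsWith("*")
                && pattern.indexOf('*', 1) < 0
                && pattern.indexOf('?') < 0;
        if (!singleLeadingStar) {
            throw new IllegalArgumentException("expected a single leading '*': " + pattern);
        }
        // Drop the leading '*', reverse the rest, and put the wildcard at the end.
        return reverse(pattern.substring(1)) + "*";
    }
}
```

The rewritten pattern would be run against the reversed field only, while ordinary queries keep using the non-reversed field.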

Re: too many hits - OutOfMemoryError

2003-05-29 Thread Cory Albright
Thanks for the info, but unfortunately it is still getting an OutOfMemoryError. Here's my code: -- final BitSet bits = new BitSet(); HitCollector hc = new HitCollector() { public void collect(int doc, float score){

Re: Wildcard workaround

2003-05-29 Thread David Warnock
Andrei, I have a file database indexed by content and also by filename. It would be nice if the user could perform a usual search like *.ext. Has anybody tried a workaround for this issue? (This is needed only for the name of the file; for the rest of the terms the rules are fine with me.) If the

Re: too many hits - OutOfMemoryError

2003-05-29 Thread Eric Jain
When I search with a query I know will hit most of the 1.8 million records, the collect print does not even print, it eats up the 700+MB I allocated and then throws an OutOfMemoryError. Are you using wildcard queries? -- Eric Jain

Re: too many hits - OutOfMemoryError

2003-05-29 Thread Cory Albright
Yes. Is that the problem? At 05:13 PM 5/28/2003 +0200, you wrote: When I search with a query I know will hit most of the 1.8 million records, the collect print does not even print, it eats up the 700+MB I allocated and then throws an OutOfMemoryError. Are you using wildcard queries? -- Eric

RE: Wildcard workaround

2003-05-29 Thread Aviran Mordo
You can also index the file names with a leading character. For instance, file1.exe will be indexed as _file1.exe, and you always add the leading character to the search term. So if the user input is *.exe your query should be _*.exe, and if the user input is fi* you'll change it to _fi* Aviran
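The leading-character trick above can be sketched as follows. This is plain Java with hypothetical names, and the '_' marker is simply the one used in the message. The same marker is prepended both when indexing a file name and when rewriting the user's pattern, so the query no longer starts with a wildcard.

```java
/** Hypothetical helper illustrating the leading-marker workaround; not part of Lucene. */
final class LeadingMarkerSketch {

    /** Marker character taken from the example in the message. */
    static final char MARKER = '_';

    /** Form of the file name to put in the index: "file1.exe" becomes "_file1.exe". */
    static String indexedForm(String fileName) {
        return MARKER + fileName;
    }

    /** Form of the user's pattern to search with: "*.exe" becomes "_*.exe". */
    static String rewriteQuery(String userInput) {
        return MARKER + userInput;
    }
}
```

A pattern such as _*.exe still has to be expanded against every term that starts with the marker, so this changes the syntax of the query rather than its cost.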

Re: too many hits - OutOfMemoryError

2003-05-29 Thread Eric Jain
Yes. Is that the problem? I believe a term with a wildcard is expanded into all possible terms in memory before searching for it, so if the term is 'a*', and you have a million different terms starting with 'a' occurring in your documents, it's quite possible to run out of memory. Does anyone

Re: too many hits - OutOfMemoryError

2003-05-29 Thread David_Birthwell
Cory, When performing wildcard queries, the bulk of the memory is used during wildcard term expansion. The memory requirement is proportional to the number of matching terms, not the number of hits. You should make sure you are using the latest Lucene. There was a fix in 1.3 to reduce the

Re: Wildcard workaround

2003-05-29 Thread David Warnock
Aviran, You can also index the file names with a leading character. For instance, file1.exe will be indexed as _file1.exe, and you always add the leading character to the search term. So if the user input is *.exe your query should be _*.exe, and if the user input is fi* you'll change it to _fi* Now

Re: too many hits - OutOfMemoryError

2003-05-29 Thread Cory Albright
Thanks for the help! Yes, it works fine without a wildcard search, which I believe at this point will be ok for our app. Thanks again, Cory At 11:50 AM 5/28/2003 -0400, you wrote: Cory, When performing wildcard queries, the bulk of the memory is used during wildcard term expansion. The

Re: too many hits - OutOfMemoryError

2003-05-29 Thread Eric Jain
We ran into this problem and decided to put a check on the number of expanded terms and abort the query if the number got too high. Is it possible to perform this check without having to modify Lucene's source code? -- Eric Jain
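A rough sketch of what such a check could look like, assuming the index's terms are available as a sorted set. This is plain Java with hypothetical names. It is not Lucene's term enumeration API and not the modification described in this thread. The prefix is expanded term by term, and the expansion is abandoned as soon as it would exceed a caller-supplied limit, so memory use stays bounded by that limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

/** Hypothetical illustration of capping wildcard (prefix) expansion; not part of Lucene. */
final class CappedExpansionSketch {

    /** Thrown when a prefix matches more terms than the caller allows. */
    static final class TooManyTermsException extends RuntimeException {
        TooManyTermsException(String message) {
            super(message);
        }
    }

    /**
     * Collects the terms that start with the given prefix, aborting once
     * more than maxTerms would be collected.
     */
    static List<String> expandPrefix(SortedSet<String> terms, String prefix, int maxTerms) {
        List<String> expanded = new ArrayList<>();
        // In sorted order, all terms sharing the prefix form one contiguous run
        // that starts at the first term >= prefix.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the end of the run
            }
            if (expanded.size() == maxTerms) {
                throw new TooManyTermsException(
                        "prefix '" + prefix + "' matches more than " + maxTerms + " terms");
            }
            expanded.add(term);
        }
        return expanded;
    }
}
```

The caller can catch the exception and ask the user for a more specific query instead of running out of memory.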

Re: too many hits - OutOfMemoryError

2003-05-29 Thread David_Birthwell
Unfortunately, no. The modifications are not very extreme, though. If you're interested in seeing our approach, let me know. DaveB Eric Jain

Re: Wildcard workaround

2003-05-29 Thread Erik Hatcher
The problem with that solution is the same as what the other thread about OutOfMemory is discussing with wildcard queries. Just prefixing something with a fixed query to 'hack' a wildcard query could lead to performance/memory issues. I recommend indexing the file extension (or mime type) as a
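The separate-field recommendation can be sketched as follows, in plain Java. The field name "ext" and the helper names are assumptions for illustration only. The extension is extracted once at indexing time, and a user pattern of the form *.exe is turned into an exact term lookup on that field, so no wildcard expansion is needed for this case.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper illustrating an extension field; not part of Lucene. */
final class ExtensionFieldSketch {

    /** Assumed name of the field that holds the file extension. */
    static final String EXT_FIELD = "ext";

    /** Lower-cased text after the last dot, or "" if there is no usable extension. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return ""; // no dot, a leading dot only (".bashrc"), or a trailing dot
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /**
     * Turns "*.exe" into the exact-match query string "ext:exe".
     * Returns empty for any input that is not a plain "*.<extension>" pattern.
     */
    static Optional<String> toExtensionQuery(String userInput) {
        boolean plainSuffix = userInput.length() > 2
                && userInput.startsWith("*.")
                && userInput.indexOf('*', 1) < 0
                && userInput.indexOf('?') < 0
                && userInput.indexOf('.', 2) < 0;
        if (!plainSuffix) {
            return Optional.empty();
        }
        return Optional.of(EXT_FIELD + ":" + userInput.substring(2).toLowerCase(Locale.ROOT));
    }
}
```

Inputs that are not a plain extension pattern fall through to whatever query handling the application already has.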

Re: not fetching all results that has same file name

2003-05-29 Thread Shoba Ramachandran
Does anyone have this problem??? Thanks Shoba --- Shoba Ramachandran [EMAIL PROTECTED] wrote: Hi, I have indexed these documents using lucene test.doc test.xls test.htm test.txt When I search DocName:test, it returns only the first one test.doc. But if I say DocName:text Type:txt, it

Re: not fetching all results that has same file name

2003-05-29 Thread David_Birthwell
It all depends on how you are indexing these documents. What gets put in the DocName field? What gets put in the Type field? Are you using third party code to perform the indexing? DaveB

Re: RE : Parsers

2003-05-29 Thread Victor Hadianto
The www.textmining.org text extractors work very well for Word and pdf documents. They use both PDFBox and POI. For Excel, using POI directly is very easy. Tell me if you want to see code samples. I'm looking myself for a Powerpoint text extractor, if you know one... Another solution is

Re: too many hits - OutOfMemoryError

2003-05-29 Thread Harpreet S Walia
Hi, I have been following this discussion, and as I am anticipating such a problem when my index size grows, I would like to hear your approach to limiting the query expansion. Regards, Harpreet - Original Message - From: [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent:

Re: RE : Parsers

2003-05-29 Thread Pete Lewis
Hi Victor, Thanks. In the past I have used the Inso OutsideIn filters and found them very good; however, I'd like to come up with a pure Java solution, so if there is a Java equivalent to the Inso filters I'd be grateful for any details. Failing that, I thought that I'd go for individual parsers

Re: RE : Parsers

2003-05-29 Thread Andrzej Bialecki
Victor Hadianto wrote: The www.textmining.org text extractors work very well for Word and pdf documents. They use both PDFBox and POI. For Excel, using POI directly is very easy. Tell me if you want to see code samples. I'm looking myself for a Powerpoint text extractor, if you know one...

Re: RE : Parsers

2003-05-29 Thread Victor Hadianto
I'm successfully using a combination of Office automation via Jawin (free Java/COM bridge) to convert PPT files. You need to learn a bit about the pseudo-object model of PowerPoint to properly convert various objects, but this information can be found at msdn.microsoft.com. Hmm this is really

Re: RE : Parsers

2003-05-29 Thread David Warnock
Andrzej, Another solution for all MS Office formats is to use openoffice.org; the latest betas have a powerful Java SDK. So, for example, you could script a central copy to open MS docs and save them as HTML for parsing in Lucene. Or you could save in Openoffice.org formats (which are zipped XML) and

Re: RE : Parsers

2003-05-29 Thread Andrzej Bialecki
Victor Hadianto wrote: I'm successfully using a combination of Office automation via Jawin (free Java/COM bridge) to convert PPT files. You need to learn a bit about the pseudo-object model of PowerPoint to properly convert various objects, but this information can be found at msdn.microsoft.com.

Re: RE : Parsers

2003-05-29 Thread Andrzej Bialecki
David Warnock wrote: Andrzej, Another solution for all MS Office formats is to use openoffice.org; the latest betas have a powerful Java SDK. So, for example, you could script a central copy to open MS docs and save them as HTML for parsing in Lucene. Or you could save in Openoffice.org formats (which

cost of opening IndexSearcher

2003-05-29 Thread Herman Chen
Hello, Is the cost of opening an IndexSearcher proportional to anything, e.g. physical index size, number of segments? Thanks. -- Herman

Re: RE : Parsers

2003-05-29 Thread David Warnock
Andrzej, Yes, I checked this solution in the past, but (unless something changed drastically) OpenOffice converters and Java integration are coupled tightly with the whole suite, so basically you have to install the whole suite (50MB?) just to be able to use the converters. In my case (a