Hi Adriano
Thanks. Code samples would be nice :)
Will come back if I find something for .ppt.
Pete
- Original Message -
From: Adriano Labate [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Wednesday, May 28, 2003 1:03 PM
Subject: RE : Parsers
The www.textmining.org
Pete,
Here are some samples.
For Word using Textmining:
String textContent = new WordExtractor().extractText(inputStream);
For PDF using Textmining:
String textContent = new PDFExtractor().extractText(inputStream);
For Excel using POI:
(From
I think I saw a solution for this in the past. Try to search the mailing
list.
Anyway, you can always use the SearchBean, which is in the Lucene sandbox,
to sort by any field.
-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of David Weitzman
Sent: Tuesday, May 27, 2003 8:26 PM
Andrei,
If this sort of thing is important enough, you could implement a
customized analyzer which would reverse all terms and store them in a
separate field (in addition to the non-reversed field). This will
double your index size, of course.
Then, when searching, suffix query terms
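
A minimal sketch of the reversal idea above, in plain Java with no Lucene calls. The class and method names are made up for illustration, and it assumes the reversed terms live in their own field, so that a leading-wildcard pattern such as *.pdf can be run as an ordinary trailing-wildcard pattern against that field:

```java
final class ReversedTerms {
    /** Reverses a term, e.g. "report.pdf" becomes "fdp.troper". */
    static String reverse(String term) {
        return new StringBuilder(term).reverse().toString();
    }

    /**
     * Turns a pattern with a single leading "*" (such as "*.pdf") into a
     * trailing-wildcard pattern over reversed terms ("fdp.*").
     * Any other pattern is returned unchanged.
     */
    static String toReversedPrefixPattern(String pattern) {
        boolean singleLeadingStar = pattern.length() > 1
                && pattern.startsWith("*")
                && pattern.indexOf('*', 1) < 0;
        if (singleLeadingStar) {
            return reverse(pattern.substring(1)) + "*";
        }
        return pattern;
    }

    public static void main(String[] args) {
        System.out.println(reverse("report.pdf"));
        System.out.println(toReversedPrefixPattern("*.pdf"));
    }
}
```

The rewritten pattern would be searched against the reversed field only; patterns that come back unchanged go to the normal field.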
Thanks for the info, but unfortunately it is still getting an OutOfMemoryError.
Here's my code:
--
final BitSet bits = new BitSet();
HitCollector hc = new HitCollector() {
public void collect(int doc, float score){
Andrei,
I have a file database indexed by content and also by filename. It would be
nice if the user could perform a usual search like *.ext.
Has anybody tried a workaround for this issue? (This is needed only for the
name of the file; for the rest of the terms the rules are fine with me.)
If the
When I search with a query I know will hit most of the 1.8 million
records, the collect print
does not even print, it eats up the 700+MB I allocated and then
throws an OutOfMemoryError.
Are you using wildcard queries?
--
Eric Jain
Yes. Is that the problem?
At 05:13 PM 5/28/2003 +0200, you wrote:
When I search with a query I know will hit most of the 1.8 million
records, the collect print
does not even print, it eats up the 700+MB I allocated and then
throws an OutOfMemoryError.
Are you using wildcard queries?
--
Eric
You can also index the file names with a leading character. For instance,
file1.exe would be indexed as _file1.exe, and you always add the
leading character to the search term.
So if the user input is *.exe, your query should be _*.exe, and if the
user input is fi*, you'll change it to _fi*
Aviran
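
A small sketch of the leading-character trick described above, in plain Java. The class name, method names, and the choice of "_" as the marker are only illustrative; the point is that the same marker is added at index time and at query time, so the term passed to the query never starts with a wildcard:

```java
final class LeadingMarker {
    /** Marker prepended to every file name; "_" is just an example. */
    static final String MARKER = "_";

    /** Value to store in the file-name field at index time. */
    static String forIndex(String fileName) {
        return MARKER + fileName;
    }

    /** Rewrites the user's input so the wildcard is no longer the first character. */
    static String forQuery(String userInput) {
        return MARKER + userInput;
    }

    public static void main(String[] args) {
        System.out.println(forIndex("file1.exe"));
        System.out.println(forQuery("*.exe"));
        System.out.println(forQuery("fi*"));
    }
}
```

Note that a query like _*.exe still has to walk every term that starts with the marker, so this only gets around the leading-wildcard restriction; it does not make the search cheaper.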
Yes. Is that the problem?
I believe a term with a wildcard is expanded into all possible terms in
memory before searching for it, so if the term is 'a*', and you have a
million different terms starting with 'a' occurring in your documents,
it's quite possible to run out of memory.
Does anyone
Cory,
When performing wildcard queries, the bulk of the memory is used during
wildcard term expansion. The memory requirement is proportional to the
number of matching terms, not the number of hits.
You should make sure you are using the latest Lucene. There was a fix in
1.3 to reduce the
Aviran,
You can also index the file names with a leading character. For instance,
file1.exe would be indexed as _file1.exe, and you always add the
leading character to the search term.
So if the user input is *.exe, your query should be _*.exe, and if the
user input is fi*, you'll change it to _fi*
Now
Thanks for the help! Yes, it works fine without a wildcard search, which
I believe at this point will be ok for our app.
Thanks again,
Cory
At 11:50 AM 5/28/2003 -0400, you wrote:
Cory,
When performing wildcard queries, the bulk of the memory is used during
wildcard term expansion. The
We ran into this problem and decided to put a check
on the number of expanded terms and abort the query
if the number got too high.
Is it possible to perform this check without having to modify Lucene's
source code?
--
Eric Jain
Unfortunately, no.
The modifications are not very extreme, though.
If you're interested in seeing our approach, let me know.
DaveB
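
For readers following along, here is a rough sketch of what a cap on expanded terms could look like. This is not DaveB's code (which is not shown in the thread) and it does not touch Lucene; a sorted set of strings stands in for the index's term dictionary, and the names expandPrefix and maxTerms are made up for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class CappedExpansion {
    /**
     * Collects the terms that start with the given prefix, aborting as soon
     * as more than maxTerms matches are found instead of running out of memory.
     */
    static List<String> expandPrefix(SortedSet<String> terms, String prefix, int maxTerms) {
        List<String> matches = new ArrayList<>();
        // In a sorted term list, all terms sharing a prefix are contiguous,
        // so we can start at the prefix and stop at the first non-match.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (matches.size() >= maxTerms) {
                throw new IllegalStateException(
                        "Prefix '" + prefix + "' expands to more than " + maxTerms + " terms");
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(Arrays.asList("apple", "apricot", "banana"));
        System.out.println(expandPrefix(terms, "ap", 10));
    }
}
```

The caller can catch the exception and ask the user to narrow the query, which matches the "abort the query if the number got too high" behaviour described above.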
Eric Jain
The problem with that solution is the same as what the other thread
about OutOfMemory is discussing with wildcard queries.
Just prefixing terms with a fixed character to 'hack' a wildcard query
could lead to performance/memory issues.
I recommend indexing the file extension (or mime type) as a
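
The message is cut off here, but assuming the suggestion is to store the extension as its own indexed value, pulling it out of the file name could look roughly like this in plain Java. The class and method names are made up, and lower-casing the result is an assumption so that EXE and exe match the same term:

```java
import java.util.Locale;

final class FileExtension {
    /**
     * Returns the lower-cased extension of a file name ("file1.EXE" gives "exe"),
     * or an empty string when there is no extension (no dot, a leading dot only,
     * or a trailing dot).
     */
    static String of(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(of("file1.EXE"));
        System.out.println(of("archive.tar.gz"));
    }
}
```

A search for *.exe then becomes an exact-term lookup on the extension value, with no wildcard expansion at all.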
Does anyone have this problem???
Thanks
Shoba
--- Shoba Ramachandran [EMAIL PROTECTED]
wrote:
Hi,
I have indexed these documents using lucene
test.doc
test.xls
test.htm
test.txt
When I search DocName:test, it returns only the
first one test.doc.
But if I say DocName:text Type:txt, it
It all depends on how you are indexing these documents.
What gets put in the DocName field? What gets put in the Type field?
Are you using third party code to perform the indexing?
DaveB
The www.textmining.org text extractors work very well for Word and pdf
documents.
They use both PDFBox and POI.
For Excel, using POI directly is very easy. Tell me if you want to see
code samples.
I'm looking myself for a Powerpoint text extractor, if you know one...
Another solution is
Hi,
I have been following this discussion, and as I am anticipating such a
problem when my index size grows, I would like to hear your approach on
limiting the query expansion.
Regards,
Harpreet
- Original Message -
From: [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent:
Hi Victor
Thanks.
In the past I have used the Inso OutsideIn filters and found them very good;
however I'd like to come up with a pure Java solution, so if there is a Java
equivalent to the Inso filters I'd be grateful for any details. Failing that,
I thought that I'd go for individual parsers
Victor Hadianto wrote:
The www.textmining.org text extractors work very well for Word and pdf
documents.
They use both PDFBox and POI.
For Excel, using POI directly is very easy. Tell me if you want to see
code samples.
I'm looking myself for a Powerpoint text extractor, if you know one...
I'm successfully using a combination of Office automation via Jawin
(free Java/COM bridge) to convert PPT files. You need to learn a bit
about the pseudo-object model of PowerPoint to properly convert various
objects, but this information can be found at msdn.microsoft.com.
Hmm this is really
Andrzej,
Another solution for all MS Office formats is to use openoffice.org; the
latest betas have a powerful Java SDK. So, for example, you could script a
central copy to open MS docs and save as HTML for parsing in Lucene. Or
you could save in OpenOffice.org formats (which are zipped XML) and
Victor Hadianto wrote:
I'm successfully using a combination of Office automation via Jawin
(free Java/COM bridge) to convert PPT files. You need to learn a bit
about the pseudo-object model of PowerPoint to properly convert various
objects, but this information can be found at msdn.microsoft.com.
David Warnock wrote:
Andrzej,
Another solution for all MS Office formats is to use openoffice.org; the
latest betas have a powerful Java SDK. So, for example, you could script a
central copy to open MS docs and save as HTML for parsing in Lucene. Or
you could save in OpenOffice.org formats (which
Hello,
Is the cost of opening an IndexSearcher proportional to anything, e.g.
physical index size, number of segments? Thanks.
--
Herman
Andrzej,
Yes, I checked this solution in the past, but (unless something changed
drastically) OpenOffice converters and Java integration are coupled
tightly with the whole suite, so basically you have to install the whole
suite (50MB?) just to be able to use the converters. In my case (a