date:20050208

Good example scientific lucene sites?

2005-02-08 Thread Fred Toth

Hi,
I'm going to be demonstrating some of our work with lucene to
a prospective customer this week, and I'm wondering if
any of you have suggestions for other relevant sites that
use lucene.
In particular, I'm interested in scientific or technical sites,
perhaps with use of the highlighter, and perhaps extra work
on relevancy ranking.
I'm happy to poke around on the "powered by lucene" page,
but if anyone can save me some time, I'd appreciate it.
Many thanks,
Fred
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: HELP! JIT error when searching... Lucene 1.3 on Java 1.1

2005-02-08 Thread Karl Koch

I have a colleague who uses Lucene 1.3 on PersonalJava (equivalent to Java
1.1.8). I can't find a significant difference from his code (still searching),
but he did not make any changes. He also did not recompile Lucene 1.3 on
1.1.8 etc.

It must be something simple. I will look for that switch...

In the meantime, I am thankful for any other help.

Cheers,
Karl

> On Tuesday 08 February 2005 18:49, sergiu gordea wrote:
> > Karl Koch wrote:
> ...
> > >>A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has
> > >>occurred in : 
> > >>  'org/apache/lucene/store/FSDirectory.getDirectory
> > >>(Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting
> > >>method.
> > >>  Please report this error in detail to
> > >>http://java.sun.com/cgi-bin/bugreport.cgi
> 
> Iirc java 1.1 had a switch to turn off JIT compilation. It did slow things
> down when I was using 1.1 (1.1.8?), but it might help you now...
> 
> Regards,
> Paul Elschot

Re: Problem searching Field.Keyword field

2005-02-08 Thread Erik Hatcher

On Feb 8, 2005, at 12:19 PM, Steven Rowe wrote:
> Why is there no KeywordAnalyzer?  That is, an analyzer which doesn't
> mess with its input in any way, but just returns it as-is?
>
> I realize that under most circumstances, it would probably be more
> code to use it than just constructing a TermQuery, but having it would
> regularize query handling, and simplify new users' experience.  And
> for the purposes of the PerFieldAnalyzerWrapper, it could be helpful.

It's long been on my TODO list.  I just adapted (changed the package
names) the Lucene in Action KeywordAnalyzer and added it to the new
contrib area:

	http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/KeywordAnalyzer.java

In the next official release of Lucene, the contrib (formerly known as  
the Sandbox) components will be packaged along with the Lucene core.   
I'm still working on this packaging build process as I migrate the  
Sandbox over to contrib.

Erik

Re: HELP! JIT error when searching... Lucene 1.3 on Java 1.1

2005-02-08 Thread Paul Elschot

On Tuesday 08 February 2005 18:49, sergiu gordea wrote:
> Karl Koch wrote:
...
> >>A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has
> >>occurred in : 
> >>  'org/apache/lucene/store/FSDirectory.getDirectory
> >>(Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting
> >>method.
> >>  Please report this error in detail to
> >>http://java.sun.com/cgi-bin/bugreport.cgi

Iirc java 1.1 had a switch to turn off JIT compilation. It did slow things
down when I was using 1.1 (1.1.8?), but it might help you now...
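
A hedged illustration of the suggestion above. The thread never names the switch; from memory of the JDK 1.1 era tools (an assumption, please verify against your VM's documentation) the launcher accepted `-nojit`, and setting the `java.compiler` system property to `NONE` had a similar effect. The class below only reports what the running VM says about itself, so that runs with and without the switch can be compared. `JitReport` is a made-up name and is not part of Lucene.

```java
// Hypothetical helper, not part of Lucene: prints what the VM reports about
// itself so that runs with and without a JIT can be compared.
//
// Assumed invocations (recalled from JDK 1.1 era documentation, unverified):
//   java -nojit JitReport
//   java -Djava.compiler=NONE JitReport
class JitReport {

    // Builds a one-line summary. "java.compiler" is often unset, in which
    // case the literal text "null" is appended.
    static String describe() {
        StringBuffer sb = new StringBuffer();
        sb.append("java.version=").append(System.getProperty("java.version"));
        sb.append(" java.vendor=").append(System.getProperty("java.vendor"));
        sb.append(" java.compiler=").append(System.getProperty("java.compiler"));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The sketch sticks to `StringBuffer` and `System.getProperty` so that it should also compile on very old class libraries.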

Regards,
Paul Elschot



Re: HELP! JIT error when searching... Lucene 1.3 on Java 1.1

2005-02-08 Thread sergiu gordea

Karl Koch wrote:
> When I switch to Java 1.2, I also cannot run it. I also cannot index
> anything. I have no idea why...
> Can somebody help me?

I think you are a pioneer in this domain :). I'm not very familiar with
the Lucene source code, but I think it takes advantage of features in
Java 1.3 and 1.4.
Probably the best thing you can do is to get the sources of the old
versions of Lucene and try to compile them with the Java 1.2 compiler.

Best,
Sergiu

> Karl
 

Hello all,
I have heard that Lucene 1.3 Final should run under Java 1.1. (I need that
because I want to run a search with a PDA using Java 1.1).
However, when I run my code. I get the following error:
--
A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has
occurred in : 
 'org/apache/lucene/store/FSDirectory.getDirectory
(Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting
method.
 Please report this error in detail to
http://java.sun.com/cgi-bin/bugreport.cgi

Exception occured in StandardSearch:search(String, String[], String)!
java.lang.IllegalMonitorStateException: current thread not owner
at org.apache.lucene.store.FSDirectory.makeLock(FSDirectory.java:312)
at org.apache.lucene.index.IndexReader.open(IndexReader.java, Compiled
Code)
--
The error does not occur when I run it under Java 1.4.
What do I do wrong and what do I need to change in order to make it work.
It
must be my code. Here the code relevant to this error (the search method).
public static Result search(String queryString, String[] searchFields, 
 String indexDirectory) {
 // create access to index
 StandardAnalyzer analyser = new StandardAnalyzer();
 Hits hits = null;
 Result result = null;
 try {
 fsDirectory = 
FSDirectory.getDirectory(StandardSearcher.indexDirectory, false);
 IndexSearcher searcher = new IndexSearcher(fsDirectory);
 ...
}

What is wrong here?
Best Regards,
Karl

HELP! JIT error when searching... Lucene 1.3 on Java 1.1

2005-02-08 Thread Karl Koch

When I switch to Java 1.2, I also cannot run it. I also cannot index
anything. I have no idea why...

Can somebody help me?

Karl

> Hello all,
> 
> I have heard that Lucene 1.3 Final should run under Java 1.1. (I need that
> because I want to run a search with a PDA using Java 1.1).
> 
> However, when I run my code. I get the following error:
> 
> --
> 
> A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has
> occurred in : 
>   'org/apache/lucene/store/FSDirectory.getDirectory
> (Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting
> method.
>   Please report this error in detail to
> http://java.sun.com/cgi-bin/bugreport.cgi
> 
> Exception occured in StandardSearch:search(String, String[], String)!
> java.lang.IllegalMonitorStateException: current thread not owner
>   at org.apache.lucene.store.FSDirectory.makeLock(FSDirectory.java:312)
>   at org.apache.lucene.index.IndexReader.open(IndexReader.java, Compiled
> Code)
> 
> --
> 
> The error does not occur when I run it under Java 1.4.
> 
> What do I do wrong and what do I need to change in order to make it work.
> It
> must be my code. Here the code relevant to this error (the search method).
> 
> 
> public static Result search(String queryString, String[] searchFields, 
>   String indexDirectory) {
>   // create access to index
>   StandardAnalyzer analyser = new StandardAnalyzer();
>   Hits hits = null;
>   Result result = null;
>   try {
>   fsDirectory = 
> FSDirectory.getDirectory(StandardSearcher.indexDirectory, false);
>   IndexSearcher searcher = new IndexSearcher(fsDirectory);
>   ...
> }
> 
> 
> What is wrong here?
> 
> Best Regards,
> Karl

Re: Problem searching Field.Keyword field

2005-02-08 Thread Kelvin Tan

Erik, I was thinking about the case where

category:"document management"
category:"document publishing"

and the user wants to search category:document and have both turn up. But 
that's obviously not the use-case in the situation of a drop-down, so you're 
right about this, Field.Keyword is correct here. Sorry for misleading you, Mike.

k

On Tue, 8 Feb 2005 12:02:15 -0500, Erik Hatcher wrote:
> Kelvin - I respectfully disagree - could you elaborate on why this
> is not an appropriate use of Field.Keyword?
>
> If the category is "How To", Field.Text would split this (depending
> on the Analyzer) into "how" and "to".
>
> If the user is selecting a category from a drop-down, though, you
> shouldn't be using QueryParser on it, but instead aggregating a
> TermQuery("category", "How To") into a BooleanQuery with the rest
> of it.  The rest may be other API created clauses and likely a
> piece from QueryParser.
>
> Erik
>
>
> On Feb 8, 2005, at 11:28 AM, Kelvin Tan wrote:
>
>> As I posted previously, Field.Keyword is appropriate in only
>> certain situations. For your use-case, I believe Field.Text is
>> more suitable.
>>
>> k
>>
>> On Tue, 8 Feb 2005 10:02:19 -0600, Mike Miller wrote:
>>
>>> This may or may not be correct, but I am indexing it as a
>>> keyword  because I provide a (required) radio button on the add
>>> screen for  the user to determine which category the document
>>> should be  assigned.  Then in the search, provide a dropdown
>>> that can be used  in the advanced search so that they can
>>> search only for a specific  category of documents (like HowTo,
>>> Troubleshooting, etc).
>>>
>>> -Original Message-
>>> From: Kelvin Tan [mailto:[EMAIL PROTECTED] Sent:
>>> Tuesday,  February 08, 2005 9:32 AM To: Lucene Users List
>>> Subject: RE: Problem searching Field.Keyword field
>>>
>>> Mike, is there a reason why you're indexing "category" as
>>> keyword  not text?
>>>
>>> k
>>>
>>> On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:
>>>
>>>> Thanks for the quick response.
>>>>
>>>> Sorry for my lack of understanding, but I am learning!  Won't
>>>> the query parser still handle this query?  My limited
>>>> understanding was that the search call provides the 'all'
>>>> field as default field for query terms in the case where
>>>> fields aren't specified.  Using the current code, searches
>>>> like author:Mike" and title:Lucene work fine.
>>>>
>>>> -Original Message-
>>>> From: Miles Barr [mailto:[EMAIL PROTECTED] Sent:
>>>> Tuesday, February 08, 2005 8:08 AM To: Lucene Users List
>>>> Subject: Re: Problem searching Field.Keyword field
>>>>
>>>> You're using the query parser with the standard analyser. You
>>>> should construct a term query manually instead.
>>>>
>>>>
>>>> --
>>>> Miles Barr <[EMAIL PROTECTED]> Runtime Collective
>>>> Ltd.


Re: Problem searching Field.Keyword field

2005-02-08 Thread Steven Rowe

Why is there no KeywordAnalyzer?  That is, an analyzer which doesn't 
mess with its input in any way, but just returns it as-is?

I realize that under most circumstances, it would probably be more code 
to use it than just constructing a TermQuery, but having it would 
regularize query handling, and simplify new users' experience.  And for 
the purposes of the PerFieldAnalyzerWrapper, it could be helpful.
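
To make the idea concrete, here is a toy model of per-field analysis written against the plain JDK only. It does not use the Lucene classes named above; `PerFieldAnalysisSketch`, `keywordTokens` and `simpleTokens` are invented for illustration, and the "text" analysis is a deliberately crude lower-case-and-split stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Toy model only; none of these are the real Lucene classes.
class PerFieldAnalysisSketch {

    // Per-field overrides; fields without an entry fall back to simpleTokens.
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    // "Keyword" analysis: the whole input becomes exactly one token, unchanged.
    static List<String> keywordTokens(String text) {
        List<String> tokens = new ArrayList<>();
        tokens.add(text);
        return tokens;
    }

    // Simplified "text" analysis: lower-case, then split on whitespace.
    static List<String> simpleTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Registers an analysis function for one field.
    void register(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Analyzes text with the function registered for the field, if any.
    List<String> analyze(String field, String text) {
        Function<String, List<String>> analyzer = perField.get(field);
        return analyzer == null ? simpleTokens(text) : analyzer.apply(text);
    }

    public static void main(String[] args) {
        PerFieldAnalysisSketch wrapper = new PerFieldAnalysisSketch();
        wrapper.register("category", PerFieldAnalysisSketch::keywordTokens);
        System.out.println(wrapper.analyze("category", "How To"));
        System.out.println(wrapper.analyze("title", "How To"));
    }
}
```

With the keyword function registered for `category`, the same query-handling path can serve every field: the category value stays one untouched token while other fields are still split.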

Steve
Erik Hatcher wrote:
Kelvin - I respectfully disagree - could you elaborate on why this is 
not an appropriate use of Field.Keyword?

If the category is "How To", Field.Text would split this (depending on 
the Analyzer) into "how" and "to".

If the user is selecting a category from a drop-down, though, you 
shouldn't be using QueryParser on it, but instead aggregating a 
TermQuery("category", "How To") into a BooleanQuery with the rest of 
it.  The rest may be other API created clauses and likely a piece from 
QueryParser.

Erik
On Feb 8, 2005, at 11:28 AM, Kelvin Tan wrote:
As I posted previously, Field.Keyword is appropriate in only certain 
situations. For your use-case, I believe Field.Text is more suitable.

k
On Tue, 8 Feb 2005 10:02:19 -0600, Mike Miller wrote:
 This may or may not be correct, but I am indexing it as a keyword
 because I provide a (required) radio button on the add screen for
 the user to determine which category the document should be
 assigned.  Then in the search, provide a dropdown that can be used
 in the advanced search so that they can search only for a specific
 category of documents (like HowTo, Troubleshooting, etc).
 -Original Message-
 From: Kelvin Tan [mailto:[EMAIL PROTECTED] Sent: Tuesday,
 February 08, 2005 9:32 AM To: Lucene Users List
 Subject: RE: Problem searching Field.Keyword field
 Mike, is there a reason why you're indexing "category" as keyword
 not text?
 k
 On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:
 Thanks for the quick response.
 Sorry for my lack of understanding, but I am learning!  Won't the
  query parser still handle this query?  My limited understanding
 was  that the search call provides the 'all' field as default
 field for  query terms in the case where fields aren't specified.
   Using the  current code, searches like author:Mike" and
 title:Lucene work fine.
 -Original Message-
 From: Miles Barr [mailto:[EMAIL PROTECTED] Sent:  
 Tuesday, February 08, 2005 8:08 AM To: Lucene Users List Subject:
  Re: Problem searching Field.Keyword field

 You're using the query parser with the standard analyser. You  
 should construct a term query manually instead.

 --
 Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.


Re: Problem searching Field.Keyword field

2005-02-08 Thread Erik Hatcher

Kelvin - I respectfully disagree - could you elaborate on why this is 
not an appropriate use of Field.Keyword?

If the category is "How To", Field.Text would split this (depending on 
the Analyzer) into "how" and "to".

If the user is selecting a category from a drop-down, though, you 
shouldn't be using QueryParser on it, but instead aggregating a 
TermQuery("category", "How To") into a BooleanQuery with the rest of 
it.  The rest may be other API created clauses and likely a piece from 
QueryParser.
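
The point can be shown with a small JDK-only toy index. This is a sketch under stated assumptions, not Lucene code: `KeywordFieldLookupSketch` and its methods are invented names, the "analyzer" is a crude lower-case-and-split stand-in, and the required-AND combination only mimics the idea of adding an exact category term to the rest of a query.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index; not the Lucene API.
class KeywordFieldLookupSketch {

    // category value (stored verbatim, like a keyword field) -> document ids
    private final Map<String, Set<Integer>> categoryTerms = new HashMap<>();
    // analyzed body token -> document ids
    private final Map<String, Set<Integer>> bodyTerms = new HashMap<>();

    // Crude stand-in for an analyzer: lower-case and split on whitespace.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    void addDocument(int docId, String category, String body) {
        categoryTerms.computeIfAbsent(category, k -> new TreeSet<>()).add(docId);
        for (String token : analyze(body)) {
            bodyTerms.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Exact lookup: the analogue of building the term query by hand.
    Set<Integer> categoryTerm(String exactValue) {
        return categoryTerms.getOrDefault(exactValue, Collections.<Integer>emptySet());
    }

    // What happens if the drop-down value is analyzed first: each token is
    // looked up in the keyword field, where no such token was ever indexed.
    Set<Integer> categoryViaAnalyzer(String userValue) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : analyze(userValue)) {
            hits.addAll(categoryTerm(token));
        }
        return hits;
    }

    // Exact category clause AND every analyzed free-text token.
    Set<Integer> search(String category, String freeText) {
        Set<Integer> result = new TreeSet<>(categoryTerm(category));
        for (String token : analyze(freeText)) {
            result.retainAll(bodyTerms.getOrDefault(token, Collections.<Integer>emptySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        KeywordFieldLookupSketch index = new KeywordFieldLookupSketch();
        index.addDocument(1, "How To", "installing lucene on linux");
        index.addDocument(2, "Troubleshooting", "lucene index locked");
        index.addDocument(3, "How To", "tuning tomcat");
        System.out.println(index.categoryViaAnalyzer("How To"));
        System.out.println(index.categoryTerm("How To"));
        System.out.println(index.search("How To", "lucene"));
    }
}
```

The analyzed lookup finds nothing because the index holds the single term `How To`, not `how` or `to`; the exact lookup finds both "How To" documents, and combining it with the free-text part narrows the result further.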

Erik
On Feb 8, 2005, at 11:28 AM, Kelvin Tan wrote:
As I posted previously, Field.Keyword is appropriate in only certain 
situations. For your use-case, I believe Field.Text is more suitable.

k
On Tue, 8 Feb 2005 10:02:19 -0600, Mike Miller wrote:
 This may or may not be correct, but I am indexing it as a keyword
 because I provide a (required) radio button on the add screen for
 the user to determine which category the document should be
 assigned.  Then in the search, provide a dropdown that can be used
 in the advanced search so that they can search only for a specific
 category of documents (like HowTo, Troubleshooting, etc).
 -Original Message-
 From: Kelvin Tan [mailto:[EMAIL PROTECTED] Sent: Tuesday,
 February 08, 2005 9:32 AM To: Lucene Users List
 Subject: RE: Problem searching Field.Keyword field
 Mike, is there a reason why you're indexing "category" as keyword
 not text?
 k
 On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:
 Thanks for the quick response.
 Sorry for my lack of understanding, but I am learning!  Won't the
  query parser still handle this query?  My limited understanding
 was  that the search call provides the 'all' field as default
 field for  query terms in the case where fields aren't specified.
   Using the  current code, searches like author:Mike" and
 title:Lucene work fine.
 -Original Message-
 From: Miles Barr [mailto:[EMAIL PROTECTED] Sent:  
 Tuesday, February 08, 2005 8:08 AM To: Lucene Users List Subject:
  Re: Problem searching Field.Keyword field
 You're using the query parser with the standard analyser. You  
 should construct a term query manually instead.
 --
 Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.

RE: Problem searching Field.Keyword field

2005-02-08 Thread Kelvin Tan

As I posted previously, Field.Keyword is appropriate in only certain 
situations. For your use-case, I believe Field.Text is more suitable.

k

On Tue, 8 Feb 2005 10:02:19 -0600, Mike Miller wrote:
> This may or may not be correct, but I am indexing it as a keyword
> because I provide a (required) radio button on the add screen for
> the user to determine which category the document should be
> assigned.  Then in the search, provide a dropdown that can be used
> in the advanced search so that they can search only for a specific
> category of documents (like HowTo, Troubleshooting, etc).
>
> -Original Message-
> From: Kelvin Tan [mailto:[EMAIL PROTECTED] Sent: Tuesday,
> February 08, 2005 9:32 AM To: Lucene Users List
> Subject: RE: Problem searching Field.Keyword field
>
> Mike, is there a reason why you're indexing "category" as keyword
> not text?
>
> k
>
> On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:
>
>> Thanks for the quick response.
>>
>> Sorry for my lack of understanding, but I am learning!  Won't the
>>  query parser still handle this query?  My limited understanding
>> was  that the search call provides the 'all' field as default
>> field for  query terms in the case where fields aren't specified.
>>   Using the  current code, searches like author:Mike" and
>> title:Lucene work fine.
>>
>> -Original Message-
>> From: Miles Barr [mailto:[EMAIL PROTECTED] Sent:  
>> Tuesday, February 08, 2005 8:08 AM To: Lucene Users List Subject:
>>  Re: Problem searching Field.Keyword field
>>
>> You're using the query parser with the standard analyser. You  
>> should construct a term query manually instead.
>>
>>
>> --
>> Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.

RE: Starts With x and Ends With x Queries

2005-02-08 Thread Chong, Herb

i would say that matching root words in German compounds is a text
analysis application.

Herb... 

-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 08, 2005 11:08 AM
To: Lucene Users List
Subject: Re: Starts With x and Ends With x Queries

That might be true ... but our application is not a text analysis
application, and it is also not intended to be a search engine. We use
Lucene just to index our pages.


Re: Starts With x and Ends With x Queries

2005-02-08 Thread sergiu gordea

Erik Hatcher wrote:
> On Feb 8, 2005, at 10:37 AM, sergiu gordea wrote:
>> Hi Erik,
>>> I'm not changing any functionality.  WildcardQuery will still
>>> support leading wildcard characters, QueryParser will still disallow
>>> them.  All I'm going to change is the javadoc that makes it sound
>>> like WildcardQuery does not support leading wildcard characters.
>>>
>>> Erik
>>
>> From what I have read on the mailing list, there are other Lucene
>> users who would like to be able to construct suffix queries.
>> They are very useful for the German language, because it has many long
>> compound words created by concatenating simpler words.
>> This is one of the requirements of our system. Therefore I needed to
>> patch Lucene to make QueryParser allow suffix queries.
>>
>> Now I will need to update the Lucene library to the latest version, and
>> I will need to patch it again.
>> Do you think it will be possible in the future to have a field in
>> QueryParser, boolean ALLOW_SUFFIX_QUERIES?
>
> I have no objections to that type of switch.  Please submit a patch to
> QueryParser.jj that implements this as an option with the default to
> disallow suffix queries, along with a test case and I'd be happy to
> apply it.

I'm pleased to hear that. I'm not very skilled in writing .jj files, but
I will try to do it in the next few days.

Sergiu

Re: Starts With x and Ends With x Queries

2005-02-08 Thread sergiu gordea

Chong, Herb wrote:
> commercial text analytics tools including search engines usually
> tokenize with splitting of compound words for German.
>
> Herb

That might be true ... but our application is not a text analysis
application, and it is also not intended to be a search engine. We use
Lucene just to index our pages.

 Best,
 Sergiu

> -Original Message-
> From: sergiu gordea [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 08, 2005 10:38 AM
> To: Lucene Users List
> Subject: Re: Starts With x and Ends With x Queries
>
> From what I have read on the mailing list, there are other Lucene users
> who would like to be able to construct suffix queries.
> They are very useful for the German language, because it has many long
> compound words created by concatenating simpler words.
> This is one of the requirements of our system. Therefore I needed to
> patch Lucene to make QueryParser allow suffix queries.


RE: Problem searching Field.Keyword field

2005-02-08 Thread Mike Miller

This may or may not be correct, but I am indexing it as a keyword because I 
provide a (required) radio button on the add screen for the user to determine 
which category the document should be assigned.  Then in the search, provide a 
dropdown that can be used in the advanced search so that they can search only 
for a specific category of documents (like HowTo, Troubleshooting, etc).

-Original Message-
From: Kelvin Tan [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 08, 2005 9:32 AM
To: Lucene Users List
Subject: RE: Problem searching Field.Keyword field

Mike, is there a reason why you're indexing "category" as keyword not text?

k

On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:
> Thanks for the quick response.
>
> Sorry for my lack of understanding, but I am learning!  Won't the
> query parser still handle this query?  My limited understanding was
> that the search call provides the 'all' field as default field for
> query terms in the case where fields aren't specified.   Using the
> current code, searches like author:Mike" and title:Lucene work fine.
>
> -Original Message-
> From: Miles Barr [mailto:[EMAIL PROTECTED] Sent:
> Tuesday, February 08, 2005 8:08 AM To: Lucene Users List Subject:
> Re: Problem searching Field.Keyword field
>
> You're using the query parser with the standard analyser. You
> should construct a term query manually instead.
>
>
> --
> Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.

RE: Starts With x and Ends With x Queries

2005-02-08 Thread Chong, Herb

commercial text analytics tools including search engines usually
tokenize with splitting of compound words for German.

Herb 

-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 08, 2005 10:38 AM
To: Lucene Users List
Subject: Re: Starts With x and Ends With x Queries

From what I have read on the mailing list, there are other Lucene users
who would like to be able to construct suffix queries.
They are very useful for the German language, because it has many long
compound words created by concatenating simpler words.
This is one of the requirements of our system. Therefore I needed to
patch Lucene to make QueryParser allow suffix queries.


Re: Starts With x and Ends With x Queries

2005-02-08 Thread Erik Hatcher

On Feb 8, 2005, at 10:37 AM, sergiu gordea wrote:
> Hi Erik,
>> I'm not changing any functionality.  WildcardQuery will still support
>> leading wildcard characters, QueryParser will still disallow them.
>> All I'm going to change is the javadoc that makes it sound like
>> WildcardQuery does not support leading wildcard characters.
>>
>> Erik
>
> From what I have read on the mailing list, there are other Lucene
> users who would like to be able to construct suffix queries.
> They are very useful for the German language, because it has many long
> compound words created by concatenating simpler words.
> This is one of the requirements of our system. Therefore I needed to
> patch Lucene to make QueryParser allow suffix queries.
>
> Now I will need to update the Lucene library to the latest version, and
> I will need to patch it again.
> Do you think it will be possible in the future to have a field in
> QueryParser, boolean ALLOW_SUFFIX_QUERIES?

I have no objections to that type of switch.  Please submit a patch to
QueryParser.jj that implements this as an option with the default to
disallow suffix queries, along with a test case and I'd be happy to
apply it.
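
As a rough picture of what such a switch could look like, here is a JDK-only sketch. It is not the QueryParser.jj grammar and not a patch: `WildcardTermCheckSketch`, `setAllowSuffixQueries` and `checkWildcardTerm` are hypothetical names, and the matcher is a plain `*` / `?` implementation written for this example.

```java
// Hypothetical sketch of an opt-in "allow suffix queries" switch.
class WildcardTermCheckSketch {

    // Off by default, mirroring the "default to disallow" request.
    private boolean allowSuffixQueries = false;

    void setAllowSuffixQueries(boolean allow) {
        this.allowSuffixQueries = allow;
    }

    // Returns the term unchanged if it is acceptable, otherwise throws.
    String checkWildcardTerm(String term) {
        boolean leadingWildcard = term.startsWith("*") || term.startsWith("?");
        if (leadingWildcard && !allowSuffixQueries) {
            throw new IllegalArgumentException(
                "leading wildcard not allowed in: " + term);
        }
        return term;
    }

    // '*' matches zero or more characters, '?' matches exactly one.
    static boolean matches(String pattern, String term) {
        int p = 0;          // position in pattern
        int t = 0;          // position in term
        int star = -1;      // position of the last '*' seen in pattern
        int mark = 0;       // term position to retry from after backtracking
        while (t < term.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                star = p++;
                mark = t;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == term.charAt(t))) {
                p++;
                t++;
            } else if (star >= 0) {
                // Let the last '*' absorb one more character and retry.
                p = star + 1;
                t = ++mark;
            } else {
                return false;
            }
        }
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;
        }
        return p == pattern.length();
    }

    public static void main(String[] args) {
        WildcardTermCheckSketch check = new WildcardTermCheckSketch();
        check.setAllowSuffixQueries(true);
        String pattern = check.checkWildcardTerm("*versicherung");
        System.out.println(matches(pattern, "krankenversicherung"));
        System.out.println(matches(pattern, "versicherungsnehmer"));
    }
}
```

A test case along the lines requested would assert both behaviours: the term is rejected while the flag is off and accepted once it is switched on.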

Erik

Re: Starts With x and Ends With x Queries

2005-02-08 Thread sergiu gordea

Hi Erik,

> I'm not changing any functionality.  WildcardQuery will still support
> leading wildcard characters, QueryParser will still disallow them.
> All I'm going to change is the javadoc that makes it sound like
> WildcardQuery does not support leading wildcard characters.
>
> Erik

From what I have read on the mailing list, there are other Lucene users
who would like to be able to construct suffix queries.
They are very useful for the German language, because it has many long
compound words created by concatenating simpler words.
This is one of the requirements of our system. Therefore I needed to
patch Lucene to make QueryParser allow suffix queries.

Now I will need to update the Lucene library to the latest version, and
I will need to patch it again.
Do you think it will be possible in the future to have a field in
QueryParser, boolean ALLOW_SUFFIX_QUERIES?

Thanks for understanding,
 Sergiu


RE: Problem searching Field.Keyword field

2005-02-08 Thread Kelvin Tan

Mike, is there a reason why you're indexing "category" as keyword not text?

k

On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:
> Thanks for the quick response.
>
> Sorry for my lack of understanding, but I am learning!  Won't the
> query parser still handle this query?  My limited understanding was
> that the search call provides the 'all' field as default field for
> query terms in the case where fields aren't specified.   Using the
> current code, searches like author:Mike" and title:Lucene work fine.
>
> -Original Message-
> From: Miles Barr [mailto:[EMAIL PROTECTED] Sent:
> Tuesday, February 08, 2005 8:08 AM To: Lucene Users List Subject:
> Re: Problem searching Field.Keyword field
>
> You're using the query parser with the standard analyser. You
> should construct a term query manually instead.
>
>
> --
> Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.

JIT error when searching... Lucene 1.3 on Java 1.1

2005-02-08 Thread Karl Koch

Hello all,

I have heard that Lucene 1.3 Final should run under Java 1.1. (I need that
because I want to run a search with a PDA using Java 1.1).

However, when I run my code, I get the following error:

--

A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has occurred in :
  'org/apache/lucene/store/FSDirectory.getDirectory (Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting method.
  Please report this error in detail to http://java.sun.com/cgi-bin/bugreport.cgi

Exception occured in StandardSearch:search(String, String[], String)!
java.lang.IllegalMonitorStateException: current thread not owner
	at org.apache.lucene.store.FSDirectory.makeLock(FSDirectory.java:312)
	at org.apache.lucene.index.IndexReader.open(IndexReader.java, Compiled Code)

--

The error does not occur when I run it under Java 1.4.

What am I doing wrong, and what do I need to change to make it work? It
must be my code. Here is the code relevant to this error (the search method).


public static Result search(String queryString, String[] searchFields,
    String indexDirectory) {
  // create access to index
  StandardAnalyzer analyser = new StandardAnalyzer();
  Hits hits = null;
  Result result = null;
  try {
    fsDirectory =
        FSDirectory.getDirectory(StandardSearcher.indexDirectory, false);
    IndexSearcher searcher = new IndexSearcher(fsDirectory);
    ...
}


What is wrong here?

Best Regards,
Karl


Re: Does anyone have a copy of the highlighter code?

2005-02-08 Thread Erik Hatcher

On Feb 8, 2005, at 9:50 AM, Jim Lynch wrote:
> Our firewall prevents me from using cvs to check out anything.  Does
> anyone have a jar file or a set of class files publicly available?

The "Lucene in Action" source code - http://www.lucenebook.com -
contains JAR files, including the Highlighter, for lots of Lucene
add-on goodies.

Also, Lucene just converted to using Subversion, which is much more 
firewall friendly.  Try this after you have installed the svn client:

svn co http://svn.apache.org/repos/asf/lucene/java/trunk

Erik

Re: Highlighter: how to specify text from external source?

2005-02-08 Thread Erik Hatcher

On Feb 8, 2005, at 6:29 AM, Yura Smolsky wrote:
> Hello, lucene-user.
> If I do not store text fields in the index, is there a way to specify
> values for Highlighter from external source and how?

One of the parameters passed to the highlighting method is a String to
highlight.
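
In other words, the text does not have to come from the index. A minimal sketch of that idea follows, assuming the index stores only a document id and the full text lives elsewhere. The highlight method here is a naive stand-in written for this illustration, not the Lucene Highlighter class, and the Map is a hypothetical substitute for a database or file store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: a naive stand-in for a highlighter, not the Lucene Highlighter.
class ExternalTextHighlight {
    // Wraps each case-insensitive occurrence of term in <B>...</B>.
    static String highlight(String text, String term) {
        Pattern p = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out,
                Matcher.quoteReplacement("<B>" + m.group() + "</B>"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical external store, keyed by an id that the index would hold.
        Map<String, String> store = new HashMap<String, String>();
        store.put("42", "Lucene is a search library.");

        String text = store.get("42"); // fetched outside the index
        System.out.println(highlight(text, "lucene"));
    }
}
```

The point is only the data flow: the search hit supplies the key, the external source supplies the String, and the highlighter never needs a stored field.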

Erik

Re: Highlighter: how to specify text from external source?

2005-02-08 Thread mark harwood

Here's a rough example using a database:

 Hits hits = searcher.search(q);
 int numDocs = Math.min(10, hits.length());
 Analyzer analyzer = new WhitespaceAnalyzer();

 PreparedStatement ps = conn.prepareStatement(
     "select docText from myTable where pk=?");
 for (int i = 0; i
 [the rest of this example was lost when the message was archived]


Does anyone have a copy of the highlighter code?

2005-02-08 Thread Jim Lynch

Our firewall prevents me from using cvs to check out anything.  Does 
anyone have a jar file or a set of class files publicly available?

Thanks,
Jim.

Re: Problem searching Field.Keyword field

2005-02-08 Thread Erik Hatcher

The problem is that QueryParser analyzes all pieces of a query 
expression regardless of whether you indexed them as a Field.Keyword or 
not.  If you need to use QueryParser and still support keyword fields, 
you'll want to plug in an analyzer specific to that field using 
PerFieldAnalyzerWrapper.  You'll see this demonstrated in the "Lucene 
in Action" source code.  Here's a quick pointer to where we cover it in 
the book:

http://www.lucenebook.com/search?query=KeywordAnalyzer
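
The per-field idea can be sketched without any Lucene classes. In the hedged illustration below, each "analyzer" is reduced to a String-to-String function and a wrapper picks one by field name, falling back to a default. PerFieldAnalysisSketch and its methods are hypothetical names for this sketch only. The real PerFieldAnalyzerWrapper works on Lucene Analyzer objects.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Sketch only: mimics per-field analysis with plain Java; no Lucene classes.
class PerFieldAnalysisSketch {
    private final UnaryOperator<String> fallback;
    private final Map<String, UnaryOperator<String>> byField =
        new HashMap<String, UnaryOperator<String>>();

    PerFieldAnalysisSketch(UnaryOperator<String> fallback) {
        this.fallback = fallback;
    }

    // Registers a field-specific analysis step.
    void addField(String field, UnaryOperator<String> analysis) {
        byField.put(field, analysis);
    }

    // Applies the field's own analysis if one is registered, else the fallback.
    String analyze(String field, String text) {
        UnaryOperator<String> analysis = byField.get(field);
        return (analysis != null ? analysis : fallback).apply(text);
    }

    public static void main(String[] args) {
        // Default: lowercase everything (a rough stand-in for StandardAnalyzer).
        PerFieldAnalysisSketch analyzers =
            new PerFieldAnalysisSketch(s -> s.toLowerCase(Locale.ROOT));
        // Keyword-style field: leave the term untouched.
        analyzers.addField("category", UnaryOperator.identity());

        System.out.println(analyzers.analyze("title", "Lucene"));
        System.out.println(analyzers.analyze("category", "Documentation"));
    }
}
```

With this arrangement a query on category keeps its original case and can match a term that was indexed as-is, while every other field still goes through the default analysis.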

On Feb 8, 2005, at 9:26 AM, Mike Miller wrote:
> Thanks for the quick response.
>
> Sorry for my lack of understanding, but I am learning!  Won't the query
> parser still handle this query?  My limited understanding was that the
> search call provides the 'all' field as default field for query terms in
> the case where fields aren't specified.   Using the current code,
> searches like author:Mike and title:Lucene work fine.
>
> -Original Message-
> From: Miles Barr [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 08, 2005 8:08 AM
> To: Lucene Users List
> Subject: Re: Problem searching Field.Keyword field
>
> You're using the query parser with the standard analyser. You should
> construct a term query manually instead.
> --
> Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.

Highlighter: how to specify text from external source?

2005-02-08 Thread Yura Smolsky

Hello, lucene-user.

If I do not store text fields in the index, is there a way to specify
values for Highlighter from external source and how?

Thanks in advance.

Yura Smolsky




RE: Problem searching Field.Keyword field

2005-02-08 Thread Miles Barr

Hi Mike,

If you use a different analyzer, say a custom one that didn't do anything
to the original search query, then you could use the query parser to
search on the keyword field. The standard analyzer does things like
making everything lowercase, removing stop words etc. Since the value
held in the keyword field didn't go through the same process during
indexing it won't come up as a match.
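
The mismatch can be shown with a toy model. The sketch below treats the index as a plain set of terms and reduces the analyzer to lowercasing. Both are simplifying assumptions for illustration, and no Lucene classes are involved.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Sketch only: models an index as a set of terms; no Lucene classes are used.
class KeywordMismatchDemo {
    // Stand-in for what an analyzer does to a query term (lowercasing only).
    static String analyze(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Set<String> categoryTerms = new HashSet<String>();
        categoryTerms.add("Documentation"); // keyword field: indexed as-is

        String userInput = "Documentation";
        // The analyzed query term is "documentation", which is not in the set.
        System.out.println("analyzed query matches: "
            + categoryTerms.contains(analyze(userInput)));
        // The untouched term is identical to what was indexed.
        System.out.println("raw term matches: "
            + categoryTerms.contains(userInput));
    }
}
```

The first lookup fails and the second succeeds, which is the behaviour Mike is seeing: the term is in the index, but the analyzed query never asks for it in the form that was stored.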

So basically you want to do:

...

IndexSearcher is = new IndexSearcher(fsDir);

...

Term t = new Term("keyword-field-name", keyword);
Query query = new TermQuery(t);
Hits hits = is.search(query);


You will typically only do this when you're trying to retrieve a
specific document, rather than doing a search.


-- 
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.


Re: Problem searching Field.Keyword field

2005-02-08 Thread Kelvin Tan

The Javadoc for Field.Keyword says:

Constructs a Date-valued Field that is not tokenized and is indexed, and stored 
in the index, for return with hits.

For most purposes dealing with Strings, use Field.Text, unless you have a date, 
a GUID or some other string you don't want tokenized or processed in any way. 
This basically means that Field.Keyword indexes the field as-is.

k

On Tue, 8 Feb 2005 07:54:57 -0600, Mike Miller wrote:
> First let me say - Awesome tool!  Almost too easy to be true, but
> with that being said
>
> Hi,  I have read several articles and postings that indicate that
> the Field.Keyword field should be searchable but it's not working
> for me, until I change to Field.Text.  Parts of the index and
> search code are included below - mostly lifted from articles, etc.,
> including Erik Hatcher's article on java.net.   I created a small
> KnowledgeBase web application that contains a category field, which
> I want to be searchable. Searching using a query string of
> category:Doc* or
> category:Documentation does not find a hit unless I change the code
> to add the category to the index as a Field.Text instead of
> Field.Keyword. The field value is out there:   I have verified this
> using the TermEnum to list the term values for field category and
> Documentation is in the list of values.
>
> The intention is to provide an 'Advanced Search' page that allows
> the user to search specific fields, like category, title and author
> instead of always using the 'all' field.
>
> What am I doing wrong???     Thanks in advance.
>
> Index code:
>
> public boolean index(ArticleFormBean article) throws IOException {
> IndexWriter writer = new IndexWriter(indexDir, new
> StandardAnalyzer(), false);
>
> Document doc = new Document();
> doc.add(Field.UnStored("content", article.getContent()));
> doc.add(Field.Text("title", article.getTitle()));
> doc.add(Field.Text("author", article.getAuthor()));
> doc.add(Field.UnIndexed("articleId",
> String.valueOf(article.getArticleId())));
> doc.add(Field.Keyword("createdDate", article.getCreateDate()));
> doc.add(Field.Keyword("modDate", article.getModDate()));
> doc.add(Field.Keyword("category", article.getCategory()));
>
> // create an 'all' field
> StringBuffer sb = new StringBuffer(4000);
> sb.append(article.getTitle()).append(" ").append(article.getAuthor()).append(" ");
> sb.append(article.getContent()).append(" ").append(article.getCategory());
> doc.add(Field.UnStored("all", sb.toString()));
>
> writer.addDocument(doc);
> writer.optimize();
> writer.close();
>
> return false;
> }
>
> Search code:
> File indexDir = new File("c:/dev/java/kb/index");
> Directory fsDir = FSDirectory.getDirectory(indexDir, false);
> IndexSearcher is = new IndexSearcher(fsDir);
> Query query = QueryParser.parse(q, "all", new StandardAnalyzer());
> Hits hits = is.search(query);
>
>
> Mike Miller
> JDA Software Group, Inc.
> 7501 Ester's Blvd, Suite 100
> Irving, Texas 75063




RE: Problem searching Field.Keyword field

2005-02-08 Thread Mike Miller

Thanks for the quick response.  

Sorry for my lack of understanding, but I am learning!  Won't the query
parser still handle this query?  My limited understanding was that the
search call provides the 'all' field as default field for query terms in
the case where fields aren't specified.   Using the current code,
searches like author:Mike and title:Lucene work fine.

-Original Message-
From: Miles Barr [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 08, 2005 8:08 AM
To: Lucene Users List
Subject: Re: Problem searching Field.Keyword field

You're using the query parser with the standard analyser. You should
construct a term query manually instead.

 
--
Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.


Re: Use an executable from java ...

2005-02-08 Thread Ben Litchfield


Kristian,

I assume all of your comments are with the 0.7.0 version of PDFBox.  There
were some great improvements in that version in terms of speed and
accuracy.

> That's curious, because we found that pdftotext was able to
> convert 33% more PDF documents than PDFBox.

Depending on the set of PDF documents you will notice different results.
I welcome any bug reports (if they don't already exist) on that 33% that
are not working for you.  In particular, PDFBox needs some work on
non-English languages.


> That's good. Our application supports alternative conversion pipelines
> that provide fallback mechanisms. If the first converter cannot convert a
> document, a second converter is called. So PDFBox is our fallback
> converter.


Well, at least PDFBox made it as the "fallback".  :)

Ben
http://www.pdfbox.org


Re: Problem searching Field.Keyword field

2005-02-08 Thread Miles Barr

You're using the query parser with the standard analyser. You should
construct a term query manually instead.

 
-- 
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.


Problem searching Field.Keyword field

2005-02-08 Thread Mike Miller

First let me say - Awesome tool!  Almost too easy to be true, but with
that being said
 
Hi,  I have read several articles and postings that indicate that the
Field.Keyword field should be searchable but it's not working for me,
until I change to Field.Text.  Parts of the index and search code are
included below - mostly lifted from articles, etc., including Erik Hatcher's
article on java.net.   I created a small KnowledgeBase web application
that contains a category field, which I want to be searchable.
Searching using a query string of category:Doc* or
category:Documentation does not find a hit unless I change the code to
add the category to the index as a Field.Text instead of Field.Keyword.
The field value is out there:   I have verified this using the TermEnum
to list the term values for field category and Documentation is in the
list of values.  
 
The intention is to provide an 'Advanced Search' page that allows the
user to search specific fields, like category, title and author instead
of always using the 'all' field.  
 
What am I doing wrong??? Thanks in advance.
 
Index code:
 
public boolean index(ArticleFormBean article) throws IOException {
  IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);

  Document doc = new Document();
  doc.add(Field.UnStored("content", article.getContent()));
  doc.add(Field.Text("title", article.getTitle()));
  doc.add(Field.Text("author", article.getAuthor()));
  doc.add(Field.UnIndexed("articleId",
      String.valueOf(article.getArticleId())));
  doc.add(Field.Keyword("createdDate", article.getCreateDate()));
  doc.add(Field.Keyword("modDate", article.getModDate()));
  doc.add(Field.Keyword("category", article.getCategory()));

  // create an 'all' field
  StringBuffer sb = new StringBuffer(4000);
  sb.append(article.getTitle()).append(" ").append(article.getAuthor()).append(" ");
  sb.append(article.getContent()).append(" ").append(article.getCategory());
  doc.add(Field.UnStored("all", sb.toString()));

  writer.addDocument(doc);
  writer.optimize();
  writer.close();

  return false;
}
 
Search code:
File indexDir = new File("c:/dev/java/kb/index");
Directory fsDir = FSDirectory.getDirectory(indexDir, false);
IndexSearcher is = new IndexSearcher(fsDir);
Query query = QueryParser.parse(q, "all", new StandardAnalyzer());
Hits hits = is.search(query);


 
Mike Miller
JDA Software Group, Inc.
7501 Ester's Blvd, Suite 100
Irving, Texas 75063

Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-08 Thread Erik Hatcher

I agree that it is a worthwhile contribution.

Some suggestions... allow the configuration to specify field boost
values, and analyzer(s).  If analyzers are specified per-field, then
wrap them automatically with a PerFieldAnalyzerWrapper.  Also, having a
facility to aggregate fields into a "contents"-like field would be nice
- though maybe this would be covered implicitly as part of the SQL
mapping with one of the columns being an aggregate column.

Perhaps the configuration aspect of it (XML mapping of expressions to 
field details) could be generalized to work with an object graph as 
well as SQL result sets.  OGNL (www.ognl.org) makes expression language 
glue and I can see it being used for mappings - for example the "name" 
field could be mapped to "company.president.name", where "company" is 
an object (or Map) with a "president" property, and so on.
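
As a rough picture of what such a mapping does, the sketch below resolves a dotted path against nested Maps. It is a hand-written stand-in for an expression language, not OGNL and not its API. PathResolver is a hypothetical name, and the sample data is invented for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a tiny dotted-path lookup over nested Maps (not OGNL).
class PathResolver {
    // Walks nested Maps along a dotted path; returns null if a step is missing.
    static Object resolve(Map<String, Object> root, String path) {
        Object current = root;
        for (String key : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null; // cannot descend into a non-Map value
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        // Invented sample data: company -> president -> name.
        Map<String, Object> president = new HashMap<String, Object>();
        president.put("name", "Jane Doe");
        Map<String, Object> company = new HashMap<String, Object>();
        company.put("president", president);
        Map<String, Object> root = new HashMap<String, Object>();
        root.put("company", company);

        // The value found here would become the "name" field of the document.
        System.out.println(resolve(root, "company.president.name"));
    }
}
```

A configuration entry would then pair a field name with a path, and the indexer would call the resolver once per field, whether the root is a SQL row wrapped in a Map or an object graph.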

Erik

On Feb 8, 2005, at 2:42 AM, Aad Nales wrote:
> If that is a general thought then I will plan for some time to put
> this in action.
>
> Cheers,
> Aad
>
> David Spencer wrote:
>> Nice, very similar to what I was thinking of, where the most
>> significant difference is probably just that I was thinking of a
>> batch indexer, not one embedded in a web container. Probably a
>> worthwhile contribution to the sandbox.




Re: Document Clustering

2005-02-08 Thread Dawid Weiss

Hi Owen,

> Last year it was suggested Carrot2 could help, and it would even produce
> good labels for the clusters.  Has this proven to be true?

Yes, Carrot2 should help you with this. The labels it creates highly
depend on the quality of the input snippets, but the so-called KWIC
snippets (keyword in context) should suffice (see David Spencer's
example with Wikipedia).

There is one thing, though: what is employed in Carrot2 is an on-line
unsupervised clusterer that is designed to work with a small number of
documents and incomplete descriptions (snippets versus full text
documents). It will _not_ work for large document collections (thousands
of documents) simply because it was not designed to do that. I guess
you could try with up to 500 snippets -- beyond that, you'll be waiting
for the result forever.

There is a great number of algorithms that can cluster large document 
collections -- see proceedings from information retrieval conferences 
for example.

As for David's hints:
> I'm not sure what the complexity of the algorithm is, but for me ~100 
> docs works ok, maybe 200, but beyond 200 you need lots more CPU and RAM.

Yes, 100 to 200 snippets is optimal with the open source clustering 
algorithm. We have a refactored and optimized version of the Lingo 
clusterer that is commercial (it also provides hierarchical clustering 
capability as an add-on to the open source component). But even the 
commercial version will only cluster up to 500 -- 1000 snippets. As I 
said, it was not our goal to cluster document collections, rather to 
retrieve useful information from preprocessed snippets.

Dawid

Re: Use an executable from java ...

2005-02-08 Thread Kristian Hermsdorf

Hi Christiaan,

> Just to defend PDFBox: we actually recently decided to move in the
> opposite direction.

I didn't want to offend PDFBox *g*

> We just removed pdftotext from our application and are now using PDFBox
> 0.7.0 for all our PDF processing. Before we were using them both in
> parallel: pdftotext for fast text extraction and PDFBox for all metadata
> such as titles, authors, etc.

pdftotext is able to produce HTML output that contains this metadata as
well. Converting from PDF to HTML and parsing the HTML is (in our tests)
still twice as fast as PDFBox.

> Upon closer inspection of the output, we also saw that pdftotext was not
> able to extract text from a significant amount of PDFs (9 out of 113
> documents, all perfectly readable PDF documents) while PDFBox performed
> flawlessly. For us, quality is of greater concern than speed.

That's curious, because we found that pdftotext was able to convert
33% more PDF documents than PDFBox.

> Finally, I must say that the speed and quality of Ben's replies to bug
> reports and suggestions is very impressive, giving us confidence in that
> future problems will be handled satisfactorily.

That's good. Our application supports alternative conversion pipelines that
provide fallback mechanisms. If the first converter cannot convert a
document, a second converter is called. So PDFBox is our fallback converter.
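
A fallback arrangement of this kind might look roughly like the sketch below. It is an illustration under assumptions, not the application's actual code. FallbackPipeline and Converter are hypothetical names, and the two converters in main are dummies standing in for pdftotext and PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: tries converters in order until one succeeds.
class FallbackPipeline {
    // Hypothetical converter contract: returns extracted text or throws.
    interface Converter {
        String convert(byte[] document) throws Exception;
    }

    private final List<Converter> converters = new ArrayList<Converter>();

    // Adds a converter; earlier additions are tried first.
    FallbackPipeline add(Converter converter) {
        converters.add(converter);
        return this;
    }

    // Returns the first successful result, or null if every converter fails.
    String convert(byte[] document) {
        for (Converter converter : converters) {
            try {
                String text = converter.convert(document);
                if (text != null) {
                    return text;
                }
            } catch (Exception e) {
                // This converter failed; fall through to the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FallbackPipeline pipeline = new FallbackPipeline()
            .add(doc -> { throw new IllegalStateException("primary failed"); })
            .add(doc -> "text from fallback");
        System.out.println(pipeline.convert(new byte[0]));
    }
}
```

Swallowing the exception keeps the sketch short. A real pipeline would probably log which converter failed, since that is exactly the information needed for a bug report.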
Greetings
Kristian
--
ACRONYM: Acronym Causing Recursion, Obviously Numbing Your Mind  
Kristian Hermsdorf
Interface Projects GmbH
Tolkewitzer Straße  49  
01277 Dresden   
tel.: ++49-351-3 18 09 39
mail: [EMAIL PROTECTED]
priv: [EMAIL PROTECTED]

Re: Use an executable from java ...

2005-02-08 Thread Christiaan Fluit

Kristian Hermsdorf wrote:
> We're using pdftotext as well, because PDFBox is really slow. If your
> application has to run under Windows, you will probably experience some
> mysterious Java VM crashes while executing external processes in batch
> mode. (This is because of a bug in the Windows VM... we implemented our
> own Process with JNI to work around this bug.)

Just to defend PDFBox: we actually recently decided to move in the
opposite direction.

We just removed pdftotext from our application and are now using PDFBox 
0.7.0 for all our PDF processing. Before we were using them both in 
parallel: pdftotext for fast text extraction and PDFBox for all metadata 
such as titles, authors, etc.

One reason for this is that with version 0.7.0 the difference in 
performance was only marginal on our testset of 113 PDF documents from 
various sources. Of course the difference will be bigger when you are 
only extracting text, because in the old situation we had to let two 
tools process the same file.

Upon closer inspection of the output, we also saw that pdftotext was not 
able to extract text from a significant amount of PDFs (9 out of 113 
documents, all perfectly readable PDF documents) while PDFBox performed 
flawlessly. For us, quality is of greater concern than speed.

Finally, I must say that the speed and quality of Ben's replies to bug 
reports and suggestions is very impressive, giving us confidence in that 
future problems will be handled satisfactorily.

Regards,
Chris
--

Re: PHP-Lucene Integration

2005-02-08 Thread Sanyi

Thanx a lot!

Sanyi

--- "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> Howdy,
> [...]


Re: Document Clustering

2005-02-08 Thread David Spencer

Owen Densmore wrote:
> I would like to be able to analyze my document collection (~1200
> documents) and discover good "buckets" of categories for them.  I'm
> pretty sure this is termed Document Clustering .. finding the emergent
> clumps the documents fall naturally into judging from their term vectors.
>
> Looking at the discussion that flared roughly a year ago (last message
> 2003-11-12) with the subject Document Clustering, it seems Lucene should
> be able to help with this.  Has anyone had success with this recently?
>
> Last year it was suggested Carrot2 could help, and it would even produce
> good labels for the clusters.  Has this proven to be true?  Our goal is
> to use clustering to build a nifty graphic interface, probably using Flash.

Carrot2 seems to work nicely.

Demo here...

Search for something like "artificial intelligence" in my Wikipedia
Search engine:

http://www.searchmorph.com/kat/wikipedia.jsp?s=artificial+intelligence

Then click on the "see clustered results.." link to go here:

http://www.searchmorph.com/kat/wikipedia-cluster.jsp?s=artificial%20intelligence

And voilà, what seems like decent clusters.

I'm not sure what the complexity of the algorithm is, but for me ~100
docs works ok, maybe 200, but beyond 200 you need lots more CPU and RAM.

I suggest: try it w/ ~100 docs, and if you like what you see, keep
increasing the # of docs you give it. You might have to wait a while w/
all 1,200 docs...

- Dave

> Thanks for any pointers.
> Owen

Re: Use an executable from java ...

2005-02-08 Thread Kristian Hermsdorf

Hi

> I have a problem executing a conversion tool that turns a PDF into
> HTML under Linux. I have an executable, "pdftohtml", which works
> correctly in batch mode, and it also works when I call it from Java under
> Windows 2000, BUT it does not work at all on the server under
> Linux. I'm using the following code

You have to read the process's stdout and stderr while the process is
running. If you don't read those streams, the process will block after it
has written some (about 8k) bytes to its stdout/stderr.
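
One common way to do this is to give each stream its own reader thread, as in the sketch below. StreamDrainer and runAndDrain are hypothetical names for this illustration, the command passed in is whatever the caller wants to run, and the 4096-byte chunk size is arbitrary.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: reads a stream to the end on its own thread.
class StreamDrainer extends Thread {
    private final InputStream in;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    StreamDrainer(InputStream in) {
        this.in = in;
    }

    public void run() {
        byte[] chunk = new byte[4096];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            // The stream closed early; keep whatever was read so far.
        }
    }

    // Call only after the thread has finished (for example after join()).
    String text() {
        return buffer.toString();
    }

    // Runs a command and drains stdout and stderr concurrently, so that
    // neither pipe can fill up and block the child process.
    static int runAndDrain(String[] command, StringBuffer out, StringBuffer err)
            throws IOException, InterruptedException {
        Process process = Runtime.getRuntime().exec(command);
        StreamDrainer stdout = new StreamDrainer(process.getInputStream());
        StreamDrainer stderr = new StreamDrainer(process.getErrorStream());
        stdout.start();
        stderr.start();
        int exitCode = process.waitFor();
        stdout.join();
        stderr.join();
        out.append(stdout.text());
        err.append(stderr.text());
        return exitCode;
    }
}
```

Both drainers are started before waitFor() is called. Waiting first and reading afterwards is the ordering that produces the hang described above.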

We're using pdftotext as well, because PDFBox is really slow. If your
application has to run under Windows, you will probably experience some
mysterious Java VM crashes while executing external processes in batch mode.
(This is because of a bug in the Windows VM... we implemented our own Process
with JNI to work around this bug.)
Greetings,
Kristian
--
ACRONYM: Acronym Causing Recursion, Obviously Numbing Your Mind  
Kristian Hermsdorf
Interface Projects GmbH
Tolkewitzer Straße  49  
01277 Dresden   
tel.: ++49-351-3 18 09 39
mail: [EMAIL PROTECTED]
priv: [EMAIL PROTECTED]