Beginner: Best way to index and display original text of PDFs in search results

2008-12-12 Thread maxmil

Hi,

This is the first time I am using Lucene.

I need to index PDFs with very few fields: title, date and body (a long
field) for a web-based search.

The results I need to display have to show not only the documents found but,
for each document, a snapshot of the text where the search term has been
found. This is analogous to the way Google displays search results. That is
to say:

 "... some words and first instance of search term and some more words ...
some more words second instance of search term and some more words ..."

etc.

To do this I would need the original text of the document for each hit. As
far as I understand, Lucene does not save the original text of the document
in the index.

I am not using a database and would prefer not to have to store the original
document text elsewhere.

One way I could do this would be to take the hits from Lucene and reopen
each PDF to extract the original text at run time; however, I fear that with
many results this would be very slow.

What would you recommend me to do?

Thanks

max
-- 
View this message in context: 
http://www.nabble.com/Beginner%3A-Best-way-to-index-and-display-orginal-text-of-pdfs-in-search-results-tp20971377p20971377.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene SpellChecker returns no suggestions after changing Server

2008-12-12 Thread Matthias W.

Yes, I'm passing the same index for the SpellChecker and the IndexReader.
I'm going to test whether this is the reason for my problem.

But I still don't understand why the same code works on the test server.

I think this could be because of Tomcat's permissions.
Is there any tutorial on configuring Tomcat for Lucene on Debian?
Or can anyone tell me what's really important? I also don't know why there
are two webapps folders (/var/lib/tomcat5.5/webapps and
/usr/share/tomcat5.5-webapps). I put my JSPs into
/var/lib/tomcat5.5/webapps.
I copied the files from my test server including the WEB-INF; could this be
the reason?

The changes:
Ubuntu 8.10 -> Debian Etch
Java5 -> Java6
Tomcat6 -> Tomcat 5.5
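For reference, a minimal sketch of keeping the spelling index separate from the main index, against the Lucene 2.x contrib SpellChecker API. The two directory paths and the "content" field name are hypothetical placeholders, not values from this thread:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;

public class BuildSpellIndex {
    public static void main(String[] args) throws Exception {
        // Two distinct directories: the main index and the n-gram spelling index.
        IndexReader mainReader = IndexReader.open("/var/lucene/main-index"); // hypothetical path
        SpellChecker spell = new SpellChecker(
                FSDirectory.getDirectory("/var/lucene/spell-index"));        // separate directory!

        // Populate the spelling index from the terms of the main index's "content" field.
        spell.indexDictionary(new LuceneDictionary(mainReader, "content"));

        // The simple two-argument form consults only the spelling index.
        String[] suggestions = spell.suggestSimilar("lucen", 5);
        for (String s : suggestions) {
            System.out.println(s);
        }
        mainReader.close();
    }
}
```

The point is that the SpellChecker's constructor takes the directory of its own n-gram index, which must first be built with indexDictionary(); passing the main index there gives no suggestions.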



Grant Ingersoll-6 wrote:
 
 So, what changed with the server?
 
  From the looks of your code, you're passing the same index into both  
 the Spellchecker and the IndexReader.   The spelling index is separate  
 from the main index.
 
 See the example at:
 http://lucene.apache.org/java/2_4_0/api/contrib-spellchecker/org/apache/lucene/search/spell/SpellChecker.html
 
 See also my Boot Camp examples at:
 http://www.lucenebootcamp.com/LuceneBootCamp/training/src/test/java/com/lucenebootcamp/training/basic/ContribExamplesTest.java
  
Have a look at the testSpelling code there
 
 HTH,
 Grant
 
 
 On Dec 9, 2008, at 2:50 AM, Matthias W. wrote:
 

 Hi,
 I'm using Lucene's SpellChecker (Lucene 2.1.0) class to get  
 suggestions.
 Till now my test server was a VMware image from http://es.cohesiveft.com
 (Ubuntu 8.10, Tomcat6, Java5).
 Now I'm using a Debian Etch server with Tomcat5.5 and Java6.

 Code sample:
 String indexName = indexLocation;
 String queryString = URLDecoder.decode(request.getParameter("q"), "UTF-8");
 SpellChecker spellchecker = new
 SpellChecker(FSDirectory.getDirectory(indexName));
 String[] suggestions = spellchecker.suggestSimilar(queryString, 5,
 IndexReader.open(indexName), "content", false);
 for (int i = 0; i < suggestions.length; i++) {
  out.println(suggestions[i]);
 }

 This worked fine on the old server, but on my new server it returns
 nothing.
 The index is generated by the Nutch crawler, but this shouldn't be the
 problem.

 I've got the lucene-spellchecker-2.1.0.jar in WEB-INF/lib/. (If I
 remove it, I get the expected error message.)

 So I don't know why I get neither results nor an error message.
 -- 
 View this message in context:
 http://www.nabble.com/Lucene-SpellChecker-returns-no-suggetions-after-changing-Server-tp20910159p20910159.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.



 
 --
 Grant Ingersoll
 
 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ
 

-- 
View this message in context: 
http://www.nabble.com/Lucene-SpellChecker-returns-no-suggetions-after-changing-Server-tp20910159p20971594.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Beginner: Best way to index and display original text of PDFs in search results

2008-12-12 Thread Ian Lea
Hi


Lucene can store the original text of the document.  You make the
Lucene fields do what you need.  Have a look at the API docs for
Field.Store and you'll see that you've got three choices: YES, NO or
COMPRESS.

For your display snapshots, have a look at the Lucene highlighter package.

And all newcomers to Lucene could do a lot worse than getting hold of
a copy of Lucene in Action.  Somewhat out of date but the principles
are still valid.
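A minimal sketch of the above, against the Lucene 2.4-era API. The field names, the analyzer choice and the two-fragment snippet format are illustrative assumptions, not prescriptions:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class StoreAndHighlight {

    // Index time: store the body compressed so it can be retrieved later.
    static Document makeDoc(String title, String body) {
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("body", body, Field.Store.COMPRESS, Field.Index.TOKENIZED));
        return doc;
    }

    // Search time: build Google-style fragments from the stored text.
    static String snippet(Query query, Document hit) throws Exception {
        String text = hit.get("body"); // decompression is transparent on retrieval
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        return highlighter.getBestFragments(
                new StandardAnalyzer().tokenStream("body", new StringReader(text)),
                text, 2, " ... ");
    }
}
```

Storing the body avoids reopening each PDF per hit; the trade-off is a larger index, which COMPRESS mitigates at some CPU cost.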


--
Ian.

On Fri, Dec 12, 2008 at 8:34 AM, maxmil m...@alwayssunny.com wrote:






Re: Taxonomy in Lucene

2008-12-12 Thread Michael McCandless


John, can you describe some of these changes?  They sound cool!

Mike

John Wang wrote:

We are doing lotsa internal changes for performance, and also upgrading
the API to support more features. So my suggestion is to wait for 2.0 (it
should release this month, mid-January at the latest). We can take this
offline if you want to have a deeper discussion on the browse engine.

Thanks

-John

On Thu, Dec 11, 2008 at 1:23 AM, Karsten F.
karsten-luc...@fiz-technik.dewrote:



Hi Glen,

possibly you will find this thread interesting:

http://groups.google.com/group/xtf-user/browse_thread/thread/beb62f5ff9a16a3a/16044d1009511cda
It was about a taxonomy like the one in your example.
Also take a look at the faceted browsing on date in

http://www.marktwainproject.org/xtf/search?category=letters;style=mtp;facet-written=

In Solr 1.3, faceted browsing was implemented with a filter for each
possible value.
The implementation in XTF is quite a bit more sophisticated (
http://xtf.wiki.sourceforge.net/programming_Faceted_Browsing ).
I am not familiar with the current version of Solr.
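The filter-per-value approach can be sketched roughly like this against the Lucene 2.x API. This is an illustration of the idea, not Solr's actual implementation, and Filter.bits was later replaced by DocIdSet:

```java
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class FacetCounts {
    // Count, for one facet value, how many of the current hits carry it:
    // intersect the filter's bit set with the bit set of the result docs.
    // In practice the per-value filter bits would be computed once and cached.
    static int countFacet(IndexReader reader, BitSet hits, String field, String value)
            throws Exception {
        BitSet facetBits = (BitSet) new QueryFilter(
                new TermQuery(new Term(field, value))).bits(reader).clone();
        facetBits.and(hits);
        return facetBits.cardinality();
    }
}
```

With one cached BitSet per facet value, each facet count is a bit-set AND plus a cardinality, which is why this scales with the number of values rather than the number of hits.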

Best regards
Karsten



hossman wrote:


the simple faceting support provided out of the box by Solr can
easily be used for taxonomy-based faceting if you encode your taxonomy
breadcrumbs in the docs (a Google search for "solr hierarchical facets"
will give you lots of discussion on this).


-Hoss



--
View this message in context:
http://www.nabble.com/Taxonomy-in-Lucene-tp20929487p20952134.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Beginner: Best way to index and display original text of PDFs in search results

2008-12-12 Thread maxmil

Thanks very much. Looks like Field.Store.COMPRESS is what I want.

I'll also have a look at the search highlight stuff and getting Lucene in
Action.




-- 
View this message in context: 
http://www.nabble.com/Beginner%3A-Best-way-to-index-and-display-orginal-text-of-pdfs-in-search-results-tp20971377p20973618.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Spell check of a large text

2008-12-12 Thread Lucene User no 1981

Grant,

It's definitely a dictionary-based spell checker. To flesh it out a bit:
currently the document gets indexed and then analysed (bad words,
repetitions etc.); the spell check (no corrections) would be yet another
step in the process. It's all read-only stuff; the document content is
not modified, it's just tagged accordingly.
That said, I kind of like your idea; a token filter looks like a
good candidate. As for Jazzy, is it any different from the Lucene
SpellChecker (n-gram based)? What really matters here is not
accuracy (decent but not exceptional; there is a manual double-check
of tagged docs anyway); what matters most is performance and ease of
integration. Any grammar check is absolutely immaterial.
About that payload idea: I can only work with a token in a filter. I
could attach something and spit it out, but what would that
something be? It would have to be searchable, I assume; otherwise I could
perform the check without a filter, outside the index. If it's searchable
then, apart from querying, I could perhaps make the highlighter work with
it nicely.
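Grant's token-filter idea might look roughly like this against the Lucene 2.4-era TokenStream API. The Dictionary interface here is a hypothetical stand-in for Jazzy or any other dictionary lookup:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

public class MisspellingMarkerFilter extends TokenFilter {

    // Hypothetical stand-in for a real spell-check dictionary such as Jazzy.
    public interface Dictionary {
        boolean isCorrect(String word);
    }

    // A shared, immutable 1-byte payload marking "misspelled".
    private static final Payload MISSPELLED = new Payload(new byte[] { 1 });

    private final Dictionary dict;

    public MisspellingMarkerFilter(TokenStream in, Dictionary dict) {
        super(in);
        this.dict = dict;
    }

    public Token next(final Token reusableToken) throws IOException {
        Token token = input.next(reusableToken);
        if (token != null && !dict.isCorrect(token.term())) {
            token.setPayload(MISSPELLED); // tag the token; don't modify its text
        }
        return token;
    }
}
```

Because the payload travels into the postings, the "misspelled" tag is available later at search time, which is what would let a payload-aware scorer or highlighter find only the flagged regions.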

Thx,
Mac


Grant Ingersoll-6 wrote:
 
 I think I'm missing something here...
 
 Spell checked in what sense?  Sounds to me like you need dictionary-based
 spell checking during indexing, not index-based spell checking during
 search, right?
 
 How about hooking up something like the Jazzy spell checker in a
 TokenFilter?  Then, as the tokens stream by, you look up the spelling
 and add a 1-byte payload to all words that are misspelled.
 
 As for the Highlighter, hmmm...  Not sure if there is a way to make a
 Fragmenter/Scorer that is payload-aware, such that it would only
 produce fragments (and scores) for sections of the file that have
 these payloads.  Definitely pushing my area of expertise, but maybe
 one of the Highlighter experts can chime in.
 
 HTH,
 Grant
 
 On Dec 11, 2008, at 6:18 AM, Lucene User no 1981 wrote:
 

 Hi,

 the problem is as follows: there is a text, ca. 30 KB, that has to be
 spellchecked automatically; there is no manual intervention and no
 suggestions are needed. All I would like to achieve is a simple check of
 whether there are any problems with the spelling or not. It has to be
 rather fast because there are tons of docs a minute going through the
 system. Solutions like SpellChecker.exists() don't really apply.
 Additionally, spelling errors could be highlighted; I haven't really
 found any reasonable way of leveraging the Highlighter for that task.

 Does anyone have any idea how this problem can be addressed with  
 Lucene?

 Regards,
 Mac
 -- 
 View this message in context:
 http://www.nabble.com/Spell-check-of-a-large-text-tp20953625p20953625.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.



 
 --
 Grant Ingersoll
 
 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Spell-check-of-a-large-text-tp20953625p20973238.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Beginner: Best way to index and display original text of PDFs in search results

2008-12-12 Thread Paul Libbrecht


I also encountered these options of the Field constructor, but I never
found a way to be sure that the field is really not loaded into RAM and
only returned with Field.reader(). There seems to be no contract in the
javadoc.
Moreover, the reader access methods went away between 1.9 and 2.2 if I'm
not mistaken... so I had the impression that storing blobs in the index
was not wanted.

Also, a reader is not enough to do a decent job of storing PDFs.
It should be a binary format (so getBinaryValue() should be used), and
it should be an input stream and not an in-memory array!

Echoes of a long-frustrated user who implemented his own mass storage
because of that.

thanks for hints and even contradictions!

paul








Re: Taxonomy in Lucene

2008-12-12 Thread Karsten F.

Hi John,

I will take a look at the bobo-browse source code at the weekend.

Do you know the XTF implementation of faceted browsing?
The starting point is
org.cdlib.xtf.textEngine.facet.GroupCounts#addDoc
(It works with millions of facet values on millions of hits.)

What is the starting point in browseengine?

How is the connection between Solr and browseengine?

Thanks for mentioning browseengine. I really like the car demo!

Best regards 
  Karsten



-- 
View this message in context: 
http://www.nabble.com/Taxonomy-in-Lucene-tp20929487p20974217.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: How to search for "-2" in field?

2008-12-12 Thread Darren Govoni
Tried them all, with quotes, without. Doesn't work. At least in Luke it
doesn't.

On Fri, 2008-12-12 at 07:03 +0530, prabin meitei wrote:
 whitespace analyzer will tokenize on white space irrespective of quotes. Use
 standard analyzer or keyword analyzer.
 Prabin meitei
 toostep.com
 
 On Thu, Dec 11, 2008 at 11:28 PM, Darren Govoni dar...@ontrenet.com wrote:
 
  I'm using Luke to find the right combination of quotes,\'s and
  analyzers.
 
  No combination can produce a positive result for -2 String for the
  field 'type'. (any -number String)
 
  type: 0 -2 Word
 
   analyzer:
   query -> rewritten => result
  
   default field is 'type'.
  
   WhitespaceAnalyzer:
   \-2 ConfigurationFile\  -> type:-2 type:ConfigurationFile => NO
   -2 ConfigurationFile -> -type:2 type:ConfigurationFile => NO
   \-2 ConfigurationFile -> type:-2 type:ConfigurationFile => NO
   \-2 ConfigurationFile -> type:"-2 ConfigurationFile" => NO (thought
   this one would work).
 
  Same results for the other analyzers more or less.
 
  Weird.
 
  Darren
 
 
 
  On Thu, 2008-12-11 at 23:02 +0530, prabin meitei wrote:
    Hi,  While constructing the query, give the query string in quotes.
    e.g.: query = queryparser.parse("\-2 word");
  
   Prabin meitei
   toostep.com
  
   On Thu, Dec 11, 2008 at 10:37 PM, Darren Govoni dar...@ontrenet.com
  wrote:
  
I'm hoping to do this with a simple query string, but not sure if its
possible. I'll try your suggestion though as a workaround.
   
Thanks!!
   
On Thu, 2008-12-11 at 16:48 +, Robert Young wrote:
  You could do it with a TermQuery but I'm not quite sure if that's the
  answer you're looking for.

 Cheers
 Rob

 On Thu, Dec 11, 2008 at 3:59 PM, Darren Govoni dar...@ontrenet.com
wrote:

  Hi,
   This might be a dumb question, but I have a simple field like this
 
  field: 0 -2 Word
 
   that is indexed, tokenized and stored. I've tried various ways in
   Lucene (using Luke) to search for "-2 Word" and none of them work; the
   query is rewritten improperly. I escaped the -2 to \-2 Word and it
   still doesn't work. I've used all the analyzers.
 
 
  What's the trick here?
 
  Thanks,
  Darren
 



Re: How to search for "-2" in field?

2008-12-12 Thread prabin meitei
One more thing: a few times I have encountered that I get different results
in Luke than in my actual code. Try it in your code directly, using the
standard analyzer and a quoted query string. Print your query to check that
the query formed is correct (i.e. formed with the quoted string).

  Can you tell me what text you are indexing? Let me also check at
my end.

Prabin meitei
toostep.com

On Fri, Dec 12, 2008 at 6:14 PM, Darren Govoni dar...@ontrenet.com wrote:




Re: How to search for "-2" in field?

2008-12-12 Thread Matthew Hall
Are you absolutely, 100% sure that the "-2" token has actually made it
into your index?


As a VERY basic way to check this try something like this:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;


public class IndexTerms {

    public static void main(String[] args) {

        try {
            IndexReader ir = IndexReader.open("C:/Search/index/index");

            TermEnum te = ir.terms();

            while (te.next()) {
                System.out.println(te.term().text());
            }
        }
        catch (Exception e) { e.printStackTrace(); }
    }
}

Then look through the output, verifying that the tokens you expect
to exist in your index actually do.


I have a feeling that whatever analyzer you are using is dropping the
"-" from the front of your "-2" at indexing time, and if so it can
sometimes be pretty hard to tell via Luke.


Hope this helps,

-Matt




Re: How to search for "-2" in field?

2008-12-12 Thread Greg Shackles
I admit I only read through this thread quickly, so maybe I missed something,
but it sounds like you're trying different analyzers for searching, when
what you really need is to use the right analyzer during indexing.
Generally you want to use the same analyzer for both indexing and searching
so that you get the results you expect.  That's where I would start in
trying to figure out the problem, since switching analyzers on the search
side probably won't help you.
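A self-contained way to check this: index and search with the same analyzer, and use a TermQuery to sidestep query-parser escaping entirely. This is a sketch against the Lucene 2.4-era API, using WhitespaceAnalyzer only because it keeps "-2" as one token:

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class MinusTwoDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        // WhitespaceAnalyzer keeps "-2" as a single token; StandardAnalyzer would strip it.
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("type", "0 -2 Word", Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // A TermQuery bypasses the query parser, so no escaping issues at all.
        IndexSearcher searcher = new IndexSearcher(dir);
        int hits = searcher.search(new TermQuery(new Term("type", "-2")), 10).totalHits;
        System.out.println(hits); // 1 expected here, if "-2" survived analysis
        searcher.close();
    }
}
```

If the TermQuery finds the document but the parsed query does not, the problem is parsing/escaping; if neither finds it, the "-2" token never made it into the index.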


Greg


How to add an Arabic and Farsi language analyzer to Lucene

2008-12-12 Thread Ian Vink
Anyone heard of one for Lucene.NET ?

Ian


.NET list?

2008-12-12 Thread Ian Vink
I am using java-user@lucene.apache.org  for help, but sometimes I'd like
Lucene.net specific help. Is there a mailing list for Lucene.NET on apache?

Ian


Re: .NET list?

2008-12-12 Thread Erik Hatcher


On Dec 12, 2008, at 9:43 AM, Ian Vink wrote:
I am using java-user@lucene.apache.org  for help, but sometimes I'd  
like
Lucene.net specific help. Is there a mailing list for Lucene.NET on  
apache?


Yes, see the mail list section here: http://incubator.apache.org/lucene.net/ 



Erik






RE: Beginner: Best way to index and display original text of PDFs in search results

2008-12-12 Thread Sudarsan, Sithu D.
You can use PDFBox:

http://kalanir.blogspot.com/2008/08/indexing-pdf-documents-with-lucene.html
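A rough sketch of that approach against the PDFBox 0.7-era API (the package names moved to org.apache.pdfbox in later releases; the field names here are illustrative):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PdfToDocument {

    // Extract the text of one PDF and wrap it in a Lucene Document,
    // storing the body compressed so snippets can be built later.
    static Document toDocument(File pdf) throws Exception {
        InputStream in = new FileInputStream(pdf);
        PDDocument pd = PDDocument.load(in);
        try {
            String body = new PDFTextStripper().getText(pd);
            Document doc = new Document();
            doc.add(new Field("path", pdf.getPath(),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("body", body,
                    Field.Store.COMPRESS, Field.Index.TOKENIZED));
            return doc;
        } finally {
            pd.close();
            in.close();
        }
    }
}
```

Extraction happens once at index time; after that the stored body is retrieved from the index rather than by reopening the PDF per hit.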


Sincerely,
Sithu D Sudarsan

sithu.sudar...@fda.hhs.gov
sdsudar...@ualr.edu

-Original Message-
From: maxmil [mailto:m...@alwayssunny.com] 
Sent: Friday, December 12, 2008 3:34 AM
To: java-user@lucene.apache.org
Subject: Beginner: Best way to index and display original text of PDFs in
search results






Re: Taxonomy in Lucene

2008-12-12 Thread John Wang
Wiki: http://bobo-browse.wiki.sourceforge.net/

This describes the upcoming 2.0 release, which is in the ill-named
branch BR_DEV_1_5_0.
We are still doing some development work on that; feel free to check out the
branch, and we will be doing a release shortly.

Some features we aimed for in 2.0, and also reasons for the API changes:

1) Support for selection expansion: the ability to select a value in a
field and still get the sibling facets back, e.g. intersect with other
fields but keep the current field not intersected with the selected value.
This is rather tricky to make fast, e.g. it amounts to doing 2 searches.

2) Allowing the framework to handle derived data, e.g. building facets
from data not necessarily in the index. Some examples: in LinkedIn's case,
being able to facet on different distances in the social graph, etc.

3) Being able to handle multi-valued facets, e.g. 1 docid mapping to
multiple values.

4) Being able to do 1) on range facets.

etc.

-John

On Fri, Dec 12, 2008 at 3:52 AM, Karsten F.
karsten-luc...@fiz-technik.de wrote:


 Hi John,

 I will take a look in the bobo-browse source code at week end.

 Do you know the XTF implementation of faceted browsing? The starting point
 is org.cdlib.xtf.textEngine.facet.GroupCounts#addDoc.
 (It works with millions of facet values on millions of hits.)

 What is the starting point in browseengine?

 How is the connection between solr and browseengine ?

 Thanks for mentioning browseengine. I really like the car demo!

 Best regards
  Karsten


 John Wang wrote:
 
  We are doing lots of internal changes for performance, and also upgrading
  the API to support new features. So my suggestion is to wait for 2.0
  (should release this month, mid-January at the latest). We can take this
  offline if you want to have a deeper discussion on the browse engine.
 

 --
 View this message in context:
 http://www.nabble.com/Taxonomy-in-Lucene-tp20929487p20974217.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.






Re: Taxonomy in Lucene

2008-12-12 Thread John Wang
Hi Karsten,
 I will check out the XTF library.

 There is no connection between Solr and browseengine other than Lucene
and Java.

Thanks

-John

On Fri, Dec 12, 2008 at 3:52 AM, Karsten F.
karsten-luc...@fiz-technik.de wrote:






Re: Spell check of a large text

2008-12-12 Thread Grant Ingersoll


On Dec 12, 2008, at 5:36 AM, Lucene User no 1981 wrote:



Grant,

It's definitely a dictionary-based spell checker. To flesh it out a bit:
currently the document gets indexed and then analysed (bad words,
repetitions, etc.); spell check (no corrections) would be yet another step
in the process. It's all read-only stuff; the document content is not
modified, it's just tagged accordingly.
That said, I kind of like your idea; I mean, a token filter looks like a
good candidate. As for Jazzy, is it any different from the Lucene
SpellChecker (ngram based)?


Yes, Jazzy is actually a dictionary of correctly spelled words.
Lucene's approach (at least the index-based one) is merely a
dictionary of words that occur in your corpus, misspellings and all.
So, if your goal is to tag words that are really, truly spelled
incorrectly, then I'd say Jazzy or some other dictionary tool is the
way to go.




What really matters here is not accuracy (decent but not exceptional;
there is a manual double-check of tagged docs anyway); what matters most
is performance and ease of integration. Any grammar check is absolutely
immaterial.
About that payload idea: I can only work with a token in a filter. I
could attach something and spit it out, but what would that something
be? It would have to be searchable, I assume; otherwise I could perform
the check without a filter, outside the index. If it is searchable,
then, apart from querying, I could perhaps make the highlighter work
with it nicely.


Payloads live on Tokens. See the Token.setPayload() method. It would
then be searchable by using the BoostingTermQuery (BTQ), but you may
need to write some other type of query.
For instance, the BTQ will allow you to say, I believe, "give me all
documents where a particular term is misspelled", and you can specify
that term. However, you may also want "give me all documents that
have misspellings", and that is not something the BTQ can do. You
probably could hack up the MatchAllDocsQuery to do it, though. Or you
could maybe write a QueryFilter that turns on all docs that have a
payload present. This is totally out there at this point, so take it
with a grain of salt. I think you can achieve what you want, but it
will take some lifting.
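To make the filter idea concrete, here is a completely untested sketch of what I mean (the SpellDictionary interface here is a stand-in for whatever checker you plug in, e.g. Jazzy; it is not a Lucene class):

```java
import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

// Hypothetical stand-in for a dictionary checker; not part of Lucene.
interface SpellDictionary {
    boolean isCorrect(String word);
}

// Marks misspelled tokens with a one-byte payload at index time, so
// they can later be found with payload-aware queries such as the BTQ.
public class MisspellingMarkerFilter extends TokenFilter {

    private final SpellDictionary dict;

    public MisspellingMarkerFilter(TokenStream input, SpellDictionary dict) {
        super(input);
        this.dict = dict;
    }

    public Token next(Token reusableToken) throws IOException {
        Token t = input.next(reusableToken);
        if (t != null && !dict.isCorrect(t.term())) {
            // payload value 1 = "misspelled"
            t.setPayload(new Payload(new byte[] { 1 }));
        }
        return t;
    }
}
```

The document content is untouched; only the index carries the tagging.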


I have no clue about the performance, but I think the indexing approach
could be pretty fast, especially if you keep a cache of commonly
misspelled terms; but I would test that first.
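The cache part needs nothing fancy; plain Java will do. A sketch (the size bound is an arbitrary choice; tune it to your term distribution):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Memoizes spell-check verdicts for frequently seen terms so the
// dictionary is only consulted on cache misses.
public class SpellCheckCache {

    private final Map<String, Boolean> cache;

    public SpellCheckCache(final int maxSize) {
        // accessOrder=true turns LinkedHashMap into an LRU structure;
        // removeEldestEntry evicts once the bound is exceeded
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxSize;
            }
        };
    }

    // null means "not cached"; FALSE means "known misspelling"
    public Boolean get(String term) {
        return cache.get(term);
    }

    public void put(String term, boolean correct) {
        cache.put(term, correct);
    }

    public static void main(String[] args) {
        SpellCheckCache c = new SpellCheckCache(1000);
        c.put("teh", false);
        System.out.println(c.get("teh"));  // cached verdict for "teh"
    }
}
```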


Cheers,
Grant 
 





Re: How to add an Arabic and Farsi language analyzer to Lucene

2008-12-12 Thread Grant Ingersoll
I just added an Arabic analyzer to contrib/analysis. No clue as to
when that will percolate to the .NET version. I believe you can search
the archives for help with Persian; as I recall, someone offered
something in the past.



On Dec 12, 2008, at 9:40 AM, Ian Vink wrote:


Anyone heard of one for Lucene.NET ?

Ian







Re: How to search for -2 in field?

2008-12-12 Thread Darren Govoni
Hi Matt,
   Thanks for the thought. Yeah, I see it there in Luke, but the other
gentleman's idea that maybe Luke is producing something different than the
code might be a clue. It would be odd, if true, but nothing else works, so
I will see if that is it.

Darren

On Fri, 2008-12-12 at 08:03 -0500, Matthew Hall wrote:
 Are you absolutely, 100% sure that the -2 token has actually made it 
 into your index?
 
 As a VERY basic way to check this try something like this:
 
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.TermEnum;
 
 
 public class IndexTerms {
 
     public static void main(String[] args) {
         try {
             IndexReader ir = IndexReader.open("C:/Search/index/index");
 
             TermEnum te = ir.terms();
 
             // dump every indexed term so you can check whether "-2"
             // actually survived analysis
             while (te.next()) {
                 System.out.println(te.term().text());
             }
 
             te.close();
             ir.close();
         }
         catch (Exception e) {
             e.printStackTrace();
         }
     }
 }
 
 Then look through the output, verifying that the tokens you are 
 expecting to exist in your index, actually do.
 
 I have a feeling that whatever analyzer you are using is dropping the
 '-' from the front of your "-2" at indexing time, and if so it can
 sometimes be pretty hard to tell via Luke.
 
 Hope this helps,
 
 -Matt
 
 Darren Govoni wrote:
  Tried them all, with quotes, without. Doesn't work. At least in Luke it
  doesn't.
 
  On Fri, 2008-12-12 at 07:03 +0530, prabin meitei wrote:

  The whitespace analyzer will tokenize on whitespace irrespective of
  quotes. Use the standard analyzer or the keyword analyzer.
  Prabin meitei
  toostep.com
 
  On Thu, Dec 11, 2008 at 11:28 PM, Darren Govoni dar...@ontrenet.com 
  wrote:
 
  
   I'm using Luke to find the right combination of quotes, \'s and
   analyzers.
  
   No combination can produce a positive result for -2 String for the
   field 'type' (any -number String).
 
  type: 0 -2 Word
 
  analyzer:
  query - rewritten = result
 
  default field is 'type'.
 
  WhitespaceAnalyzer:
  \-2 ConfigurationFile\  - type:-2 type:ConfigurationFile = NO
  -2 ConfigurationFile - -type:2 type:ConfigurationFile = NO
  \-2 ConfigurationFile - type:-2 type:ConfigurationFile = NO
  \-2 ConfigurationFile - type:-2 ConfigurationFile = NO (thought this
  one would work).
 
  Same results for the other analyzers more or less.
 
  Weird.
 
  Darren
 
 
 
  On Thu, 2008-12-11 at 23:02 +0530, prabin meitei wrote:

   Hi, while constructing the query, give the query string in quotes,
   eg: query = queryparser.parse("\"-2 word\"");
 
  Prabin meitei
  toostep.com
 
  On Thu, Dec 11, 2008 at 10:37 PM, Darren Govoni dar...@ontrenet.com
  
  wrote:

  I'm hoping to do this with a simple query string, but not sure if its
  possible. I'll try your suggestion though as a workaround.
 
  Thanks!!
 
  On Thu, 2008-12-11 at 16:48 +, Robert Young wrote:

  You could do it with a TermQuery but I'm not quite sure if that's the
  
  answer

  you're looking for.
 
  Cheers
  Rob
 
  On Thu, Dec 11, 2008 at 3:59 PM, Darren Govoni dar...@ontrenet.com
  
  wrote:

   Hi,
    This might be a dumb question, but I have a simple field like this
  
   field: 0 -2 Word
  
   that is indexed, tokenized and stored. I've tried various ways in
   Lucene (using Luke) to search for "-2 Word" and none of them work; the
   query is re-written improperly. I escaped the -2 to \-2 Word and it
   still doesn't work. I've used all the analyzers.
  
   What's the trick here?
 
  Thanks,
  Darren
 
 
 

 
 

 
 
 

 
 
 

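PS: per Rob's TermQuery suggestion, this is the kind of thing I'll try next. Programmatic queries bypass the query parser (and its escaping rules) entirely, so the '-' should survive. Untested sketch:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

// TermQuery/PhraseQuery match tokens exactly as they were indexed;
// no parsing, no escaping, no analysis at query time.
public class MinusTwoQueries {

    public static void main(String[] args) {
        // matches docs whose "type" field contains the token "-2"
        TermQuery tq = new TermQuery(new Term("type", "-2"));

        // matches "-2" immediately followed by "Word", as the
        // WhitespaceAnalyzer would have tokenized them at index time
        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term("type", "-2"));
        pq.add(new Term("type", "Word"));

        System.out.println(tq);
        System.out.println(pq);
    }
}
```

Of course this only works if the '-' made it into the index in the first place.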




Lucene - Authentication

2008-12-12 Thread Aaron Schon
Hi, if I have a Lucene index (or Solr) that is installed on client
premises, how would you go about securing the index from being queried in
an unauthorized fashion, for example by malicious users or hackers, or for
that matter by internal users trying to re-engineer the system and use it
for purposes other than the way it was licensed?

Any suggestions?
as


  




Re: Lucene - Authentication

2008-12-12 Thread Chris Hostetter
: X-Mailer: YahooMailRC/1155.45 YahooMailWebService/0.7.260.1
: References: 1229011161.7448.10.ca...@nuraku 
: 32a1c320812110848u302dd645h4143205068fe3...@mail.gmail.com 
: 1229015253.7448.12.ca...@nuraku 
: 295da8fe0812110932x3b31380dla64b09f1b09be...@mail.gmail.com 
: 1229018304.7448.24.ca...@nuraku 
: 295da8fe0812111733n529163a7r6fb51fec4db16...@mail.gmail.com 
: 1229085896.26037.0.ca...@nuraku  49426127.9060...@informatics.jax.org
: 1229130748.24089.15.ca...@nuraku
: Date: Fri, 12 Dec 2008 21:05:29 -0800 (PST)
: Subject: Lucene - Authentication

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message; instead, start a fresh email. Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to, so your question is hidden in that thread and gets less 
attention. It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking




-Hoss

