Beginner: Best way to index and display original text of PDFs in search results
Hi,

This is the first time I am using Lucene. I need to index PDFs with very few fields: title, date and body (a long field), for a web-based search. The results I display have to show not only the documents found but, for each document, a snippet of the text where the search term was found. This is analogous to the way Google displays search results, that is to say: "... some words and first instance of search term and some more words ... some more words, second instance of search term, and some more words ...", etc.

To do this I would need the original text of the document for each hit. As far as I understand, Lucene does not save the original text of the document in the index. I am not using a database and would prefer not to have to store the original document text elsewhere. One way I could do this would be to take the hits from Lucene and reopen each PDF to extract the original text at run time; however, I fear that with many results this would be very slow.

What would you recommend? Thanks, max

--
View this message in context: http://www.nabble.com/Beginner%3A-Best-way-to-index-and-display-orginal-text-of-pdfs-in-search-results-tp20971377p20971377.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Lucene SpellChecker returns no suggestions after changing Server
Yes, I'm passing the same index for the SpellChecker and the IndexReader. I'm going to test whether this is the reason for my problem. But I still don't understand why the same code works on the test server. I think this could be because of Tomcat's permissions. Is there any tutorial on configuring Tomcat for Lucene on Debian? Or can anyone tell me what's really important? I also don't know why there are two webapps folders (/var/lib/tomcat5.5/webapps and /usr/share/tomcat5.5-webapps). I put my JSPs into /var/lib/tomcat5.5/webapps. I copied the files from my test server including WEB-INF; could this be the reason?

The changes: Ubuntu 8.10 -> Debian Etch; Java5 -> Java6; Tomcat6 -> Tomcat 5.5

Grant Ingersoll-6 wrote: So, what changed with the server? From the looks of your code, you're passing the same index into both the SpellChecker and the IndexReader. The spelling index is separate from the main index. See the example at: http://lucene.apache.org/java/2_4_0/api/contrib-spellchecker/org/apache/lucene/search/spell/SpellChecker.html See also my Boot Camp examples at: http://www.lucenebootcamp.com/LuceneBootCamp/training/src/test/java/com/lucenebootcamp/training/basic/ContribExamplesTest.java Have a look at the testSpelling code there. HTH, Grant

On Dec 9, 2008, at 2:50 AM, Matthias W. wrote: Hi, I'm using Lucene's SpellChecker class (Lucene 2.1.0) to get suggestions. Until now my test server was a VMware image from http://es.cohesiveft.com (Ubuntu 8.10, Tomcat6, Java5). Now I'm using a Debian Etch server with Tomcat5.5 and Java6.
Code sample:

    String indexName = indexLocation;
    String queryString = URLDecoder.decode(request.getParameter("q"), "UTF-8");
    SpellChecker spellchecker = new SpellChecker(FSDirectory.getDirectory(indexName));
    String[] suggestions = spellchecker.suggestSimilar(queryString, 5, IndexReader.open(indexName), "content", false);
    for (int i = 0; i < suggestions.length; i++) {
        out.println(suggestions[i]);
    }

This worked fine on the old server, but on my new server it returns nothing. The index is generated by the Nutch crawler, but this shouldn't be the problem. I've got lucene-spellchecker-2.1.0.jar in WEB-INF/lib/ (if I remove it, I get the expected error message). So I don't know why I get neither results nor an error message.

--
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Beginner: Best way to index and display original text of PDFs in search results
Hi

Lucene can store the original text of the document. You make the Lucene fields do what you need. Have a look at the API docs for Field.Store and you'll see that you've got three choices: Yes, No or Compress. For your display snippets, have a look at the Lucene highlighter package. And all newcomers to Lucene could do a lot worse than getting hold of a copy of Lucene in Action. Somewhat out of date, but the principles are still valid.

--
Ian.

On Fri, Dec 12, 2008 at 8:34 AM, maxmil m...@alwayssunny.com wrote: [...]
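To make Ian's suggestion concrete, here is a minimal sketch against the Lucene 2.4-era API. The field names and the StandardAnalyzer choice are illustrative, not from the thread: the body is stored with Field.Store.COMPRESS so it can be retrieved at search time, and the contrib Highlighter builds Google-style "... term ..." fragments from it:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class SnippetSketch {

    // index time: keep the full text in the index, compressed
    public static Document makeDoc(String title, String bodyText) {
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("body", bodyText, Field.Store.COMPRESS, Field.Index.ANALYZED));
        return doc;
    }

    // search time: build up to 3 fragments around the query terms, joined like a web result
    public static String snippet(Query query, String storedBody) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        return highlighter.getBestFragments(
            analyzer.tokenStream("body", new StringReader(storedBody)),
            storedBody, 3, " ... ");
    }
}
```

The stored body is fetched with doc.get("body") from each hit, so no PDF has to be reopened at query time.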
Re: Taxonomy in Lucene
John, can you describe some of these changes? They sound cool!

Mike

John Wang wrote: We are doing lotsa internal changes for performance, and also upgrading the API to support more features. So my suggestion is to wait for 2.0 (should release this month, mid-January at the latest). We can take this offline if you want to have a deeper discussion on the browse engine. Thanks -John

On Thu, Dec 11, 2008 at 1:23 AM, Karsten F. karsten-luc...@fiz-technik.de wrote: hi glen, possibly you will find this thread interesting: http://groups.google.com/group/xtf-user/browse_thread/thread/beb62f5ff9a16a3a/16044d1009511cda It was about a taxonomy like the one in your example. Also take a look at the faceted browsing on date in http://www.marktwainproject.org/xtf/search?category=letters;style=mtp;facet-written= In Solr 1.3 the faceted browsing was implemented with a filter for each possible value. The implementation in XTF is rather more sophisticated (http://xtf.wiki.sourceforge.net/programming_Faceted_Browsing). I am not familiar with the current version of Solr. Best regards Karsten

hossman wrote: The simple faceting support provided out of the box by Solr can easily be used for taxonomy-based faceting if you encode your taxonomy breadcrumbs in the docs (a Google search for "solr hierarchical facets" will give you lots of discussion on this). -Hoss
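The "filter for each possible value" approach Karsten mentions can be sketched roughly as follows, assuming Lucene 2.x's TermDocs API and an illustrative field name: for each facet value, walk its posting list and count the documents that are also in the current hit set:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class FacetCountSketch {

    // hits: doc ids matched by the main query; returns facet value -> count
    public static Map<String, Integer> count(IndexReader reader, BitSet hits,
                                             String field, String[] values) throws Exception {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String value : values) {
            int n = 0;
            TermDocs td = reader.termDocs(new Term(field, value));
            while (td.next()) {
                // doc carries this facet value AND matched the main query
                if (hits.get(td.doc())) n++;
            }
            td.close();
            counts.put(value, n);
        }
        return counts;
    }
}
```

This is the naive per-value scheme; the XTF and bobo-browse implementations discussed here exist precisely because this gets slow with millions of distinct values.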
Re: Beginner: Best way to index and display original text of PDFs in search results
Thanks very much. Looks like Field.Store.COMPRESS is what I want. I'll also have a look at the search highlighting stuff and get Lucene in Action.

Ian Lea wrote: Hi Lucene can store the original text of the document. You make the Lucene fields do what you need. Have a look at the API docs for Field.Store and you'll see that you've got three choices: Yes, No or Compress. For your display snippets, have a look at the Lucene highlighter package. And all newcomers to Lucene could do a lot worse than getting hold of a copy of Lucene in Action. Somewhat out of date, but the principles are still valid. -- Ian.

On Fri, Dec 12, 2008 at 8:34 AM, maxmil m...@alwayssunny.com wrote: [...]
Re: Spell check of a large text
Grant, it's definitely a dictionary-based spell check. A bit of fleshing out: currently the document gets indexed and then analysed (bad words, repetitions, etc.); the spell check, with no corrections, would be yet another step in the process. It's all read-only stuff; the document content is not modified, it's just tagged accordingly. That said, I kind of like your idea; a token filter looks like a good candidate. As for Jazzy, is it any different from the Lucene SpellChecker (ngram-based)? What really matters here is not accuracy (decent but not exceptional; there is a manual double-check of tagged docs anyway); what matters most is performance and ease of integration. Any grammar check is absolutely immaterial. About that payload idea: I can only work with a token in a filter. I could attach something and spit it out, but what would that something be? It would have to be searchable, I assume, otherwise I could perform the check without the filter, outside the index. If it's searchable then, apart from querying, I could perhaps make the highlighter work with it nicely. Thx, Mac

Grant Ingersoll-6 wrote: I think I'm missing something here... Spell checked in what sense? Sounds to me like you need dictionary-based spell checking during indexing, not index-based spelling during search, right? How about hooking up something like the Jazzy spell checker in a TokenFilter? Then, as the tokens stream by, you look up the spelling and add a 1-byte payload to all words that are misspelled. As for the Highlighter, hmmm... Not sure if there is a way to make a Fragmenter/Scorer that is payload-aware, such that it would only produce fragments (and scores) for sections of the file that have these payloads. Definitely pushing my area of expertise, but maybe one of the Highlighter experts can chime in. HTH, Grant

On Dec 11, 2008, at 6:18 AM, Lucene User no 1981 wrote: Hi, the problem is as follows: there is a text, ca.
30 KB, and it has to be spellchecked automatically; there is no manual intervention and no suggestions are needed. All I would like to achieve is a simple check of whether there are any problems with the spelling or not. It has to be rather fast, because there are tons of docs a minute going through the system, so solutions like SpellChecker.exists() don't really apply. Additionally, spelling errors could be highlighted; I haven't really found any reasonable way of leveraging the Highlighter for that task. Does anyone have any idea how this problem can be addressed with Lucene? Regards, Mac
Re: Beginner: Best way to index and display original text of PDFs in search results
I also encountered these options of the Field constructor, but I never found a way to be sure that the field is really not loaded into RAM and only returned with Field.reader(). There seems to be no contract in the javadoc. Moreover, the reader access methods went away between 1.9 and 2.2, if I'm not mistaken... so I had the impression that storing blobs in the index was not intended. Also, a reader is not enough to do a decent job of storing PDFs: it should be a binary format (so getBinaryValue() should be used) and it should be an input stream, not an in-memory array!

Echoes of a long-frustrated user who implemented his own mass storage because of that. Thanks for hints and even contradictions! paul

On 12 Dec 2008, at 10:49, Ian Lea wrote: [...]
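For reference, here is what storing raw bytes looks like in Lucene 2.x, which also illustrates Paul's complaint: the value is an in-memory byte[], not a stream (the "pdf" field name is illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BinaryFieldSketch {

    public static Document store(byte[] pdfBytes) {
        Document doc = new Document();
        // binary fields are stored only, never indexed;
        // the whole array is materialized in RAM on retrieval
        doc.add(new Field("pdf", pdfBytes, Field.Store.YES));
        return doc;
    }

    public static byte[] load(Document doc) {
        return doc.getBinaryValue("pdf");  // null if the field is absent
    }
}
```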
Re: Taxonomy in Lucene
Hi John, I will take a look at the bobo-browse source code at the weekend. Do you know the XTF implementation of faceted browsing? The starting point is org.cdlib.xtf.textEngine.facet.GroupCounts#addDoc. (It works with millions of facet values on millions of hits.) What is the starting point in browseengine? What is the connection between Solr and browseengine? Thanks for mentioning browseengine. I really like the car demo! Best regards Karsten

John Wang wrote: We are doing lotsa internal changes for performance, and also upgrading the API to support more features. So my suggestion is to wait for 2.0 (should release this month, mid-January at the latest). We can take this offline if you want to have a deeper discussion on the browse engine.
Re: How to search for -2 in field?
Tried them all, with quotes, without. Doesn't work. At least in Luke it doesn't.

On Fri, 2008-12-12 at 07:03 +0530, prabin meitei wrote: The whitespace analyzer will tokenize on whitespace irrespective of quotes. Use the standard analyzer or keyword analyzer. Prabin meitei toostep.com

On Thu, Dec 11, 2008 at 11:28 PM, Darren Govoni dar...@ontrenet.com wrote: I'm using Luke to find the right combination of quotes, backslashes and analyzers. No combination can produce a positive result for "-2 String" in the field 'type' (any -number String). The field is: type: 0 -2 Word. Results below are query -> rewritten = result, with 'type' as the default field.

WhitespaceAnalyzer:
  \-2 ConfigurationFile\ -> type:-2 type:ConfigurationFile = NO
  -2 ConfigurationFile -> -type:2 type:ConfigurationFile = NO
  \-2 ConfigurationFile -> type:-2 type:ConfigurationFile = NO
  \-2 ConfigurationFile -> type:-2 ConfigurationFile = NO (thought this one would work)

Same results for the other analyzers, more or less. Weird. Darren

On Thu, 2008-12-11 at 23:02 +0530, prabin meitei wrote: Hi, while constructing the query, give the query string in quotes, e.g. query = queryparser.parse("\-2 word"); Prabin meitei toostep.com

On Thu, Dec 11, 2008 at 10:37 PM, Darren Govoni dar...@ontrenet.com wrote: I'm hoping to do this with a simple query string, but not sure if it's possible. I'll try your suggestion as a workaround though. Thanks!!

On Thu, 2008-12-11 at 16:48, Robert Young wrote: You could do it with a TermQuery but I'm not quite sure if that's the answer you're looking for. Cheers Rob

On Thu, Dec 11, 2008 at 3:59 PM, Darren Govoni dar...@ontrenet.com wrote: Hi, this might be a dumb question, but I have a simple field like this: field: 0 -2 Word, which is indexed, tokenized and stored. I've tried various ways in Lucene (using Luke) to search for "-2 Word" and none of them work; the query is rewritten improperly. I escaped the -2 to \-2 Word and it still doesn't work. I've used all the analyzers. What's the trick here?
Thanks, Darren
Re: How to search for -2 in field?
One more thing: a few times I have encountered getting different results in Luke than in my actual code. Try it in your code directly, using the standard analyzer and a quoted query string. Print your query to check that the query formed is correct (i.e. that it was formed with the quoted string). Can you tell us what text you are indexing? Let me also just check at my end. Prabin meitei toostep.com

On Fri, Dec 12, 2008 at 6:14 PM, Darren Govoni dar...@ontrenet.com wrote: Tried them all, with quotes, without. Doesn't work. At least in Luke it doesn't. [...]
Re: How to search for -2 in field?
Are you absolutely, 100% sure that the -2 token has actually made it into your index? As a very basic way to check this, try something like:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    public class IndexTerms {
        public static void main(String[] args) {
            try {
                IndexReader ir = IndexReader.open("C:/Search/index/index");
                TermEnum te = ir.terms();
                while (te.next()) {
                    System.out.println(te.term().text());
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Then look through the output, verifying that the tokens you expect to exist in your index actually do. I have a feeling that whatever analyzer you are using is dropping the '-' from the front of your -2 at indexing time, and if so, it can sometimes be pretty hard to tell via Luke. Hope this helps, -Matt

Darren Govoni wrote: Tried them all, with quotes, without. Doesn't work. At least in Luke it doesn't. [...]
Re: How to search for -2 in field?
I admit I only read through this thread quickly so maybe I missed something, but it sounds like you're trying different Analyzers for searching, when what you really need is to use the right analyzer during indexing. Generally you want to use the same analyzer for both indexing and searching so that you get the results you would expect. That's where I would start in trying to figure out the problem, since switching analyzers on the search side probably won't help you. Greg
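Greg's point can be sketched as follows; this is an illustrative example, not code from the thread. With a WhitespaceAnalyzer on both the indexing and the query side, the token "-2" survives intact; the '-' only needs a backslash escape so the query parser does not treat it as the NOT operator (this is against the Lucene 2.4-era API):

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class SameAnalyzerSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();  // keeps "-2" as one token

        // index with the whitespace analyzer
        IndexWriter w = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("type", "0 -2 Word", Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
        w.close();

        // search with the SAME analyzer; escape '-' so the parser doesn't read it as NOT
        QueryParser qp = new QueryParser("type", analyzer);
        Query q = qp.parse("\\-2");
        IndexSearcher s = new IndexSearcher(dir);
        System.out.println(s.search(q, 10).totalHits);  // expect a hit here
    }
}
```

With StandardAnalyzer on the indexing side, the same search finds nothing, because the stored token is "2", not "-2", which matches Matt's suspicion above.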
How to add an Arabic and Farsi language analyzer to Lucene
Anyone heard of one for Lucene.NET? Ian
.NET list?
I am using java-user@lucene.apache.org for help, but sometimes I'd like Lucene.NET-specific help. Is there a mailing list for Lucene.NET at Apache? Ian
Re: .NET list?
On Dec 12, 2008, at 9:43 AM, Ian Vink wrote: I am using java-user@lucene.apache.org for help, but sometimes I'd like Lucene.NET-specific help. Is there a mailing list for Lucene.NET at Apache?

Yes, see the mailing list section here: http://incubator.apache.org/lucene.net/ Erik
RE: Beginner: Best way to index and display original text of PDFs in search results
You can use PDFBox: http://kalanir.blogspot.com/2008/08/indexing-pdf-documents-with-lucene.html

Sincerely, Sithu D Sudarsan sithu.sudar...@fda.hhs.gov sdsudar...@ualr.edu

-----Original Message-----
From: maxmil [mailto:m...@alwayssunny.com]
Sent: Friday, December 12, 2008 3:34 AM
To: java-user@lucene.apache.org
Subject: Beginner: Best way to index and display original text of PDFs in search results

[...]
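A rough sketch of the PDFBox route (the 0.7-era PDFBox packages were org.pdfbox.*; the field names are illustrative, and the Field.Store choices follow the earlier advice in this thread of keeping the extracted text in the index so snippets can be built without reopening the PDF):

```java
import java.io.File;

import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PdfToDocumentSketch {

    public static Document convert(File pdfFile) throws Exception {
        PDDocument pdf = PDDocument.load(pdfFile);
        try {
            // pull the plain text out of the PDF
            String text = new PDFTextStripper().getText(pdf);

            Document doc = new Document();
            doc.add(new Field("title", pdfFile.getName(), Field.Store.YES, Field.Index.ANALYZED));
            // store the extracted text (compressed) alongside the inverted index
            doc.add(new Field("body", text, Field.Store.COMPRESS, Field.Index.ANALYZED));
            return doc;
        } finally {
            pdf.close();
        }
    }
}
```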
Re: Taxonomy in Lucene
wiki: http://bobo-browse.wiki.sourceforge.net/

This describes the upcoming 2.0 release, which is in the (ill-named) branch BR_DEV_1_5_0. We are still doing some development work on that; feel free to check out the branch, and we will be doing a release shortly. Some features we aimed at for 2.0, which are also the reasons for the API changes:

1) Support for selection expansion: the ability to select a value in a field and still get the sibling facets back, i.e. intersect with the other fields but not intersect the current field with the selected value. This is rather tricky to do fast; naively it means doing two searches.
2) Allowing the framework to handle derived data, e.g. building facets from data not necessarily in the index. For example, in LinkedIn's case, being able to facet on different distances in the social graph, etc.
3) Being able to handle multi-valued facets, i.e. one docid maps to multiple values.
4) Being able to do 1) on range facets.

etc. -John

On Fri, Dec 12, 2008 at 3:52 AM, Karsten F. karsten-luc...@fiz-technik.de wrote: [...]
Re: Taxonomy in Lucene
Hi Karsten: I will check out the XTF library. There is no connection between Solr and browseengine other than Lucene and Java. Thanks -John

On Fri, Dec 12, 2008 at 3:52 AM, Karsten F. karsten-luc...@fiz-technik.de wrote: [...]
Re: Spell check of a large text
On Dec 12, 2008, at 5:36 AM, Lucene User no 1981 wrote:

> Grant,
> It's definitely a dictionary-based spell checker. To flesh it out a bit:
> currently the document gets indexed and then it's analysed (bad words,
> repetitions, etc.); spell check - no corrections - would be yet another step
> in the process. It's all read-only stuff; the document content is not
> modified, it's just tagged accordingly. That said, I kind of like your idea -
> a token filter looks like a good candidate. As for Jazzy, is it any different
> from Lucene's SpellChecker (ngram based)?

Yes, Jazzy is actually a dictionary of correctly spelled words. Lucene's approach (at least the index-based one) is merely a dictionary of words that occur in your corpus, misspellings and all. So, if your goal is to tag words that are really, truly spelled incorrectly, then I'd say Jazzy or some other dictionary tool is the way to go.

> What really matters here is not the accuracy (decent but not exceptional -
> there is a manual double-check of tagged docs anyway); what matters most is
> performance and ease of integration. Any grammar check is absolutely
> immaterial. About that payload idea: I can only work with a token in a
> filter. I could attach something and spit it out, but what would that
> something be? It would have to be searchable, I assume; otherwise I could
> perform the check without a filter, outside the index. If it's searchable
> then, apart from querying, I could perhaps make the highlighter work with it
> nicely.

Payloads live on Tokens. See the Token.setPayload() method. It would then be searchable by using the BoostingTermQuery (BTQ), but you may need to write some other type of query. For instance, the BTQ will allow you to say, I believe, "give me all documents where a particular term is misspelled," and you can specify that term. However, you may also want "give me all documents that have misspellings," and that is not something the BTQ can do. You probably could hack up the MatchAllDocsQuery to do it, though.
Or you could maybe write a QueryFilter that turns on all docs that have a payload present. This is totally out there at this point, so take it with a grain of salt. I think you can achieve what you want, but it will take some lifting. I have no clue on the performance, but I think the indexing approach could be pretty fast, especially if you keep a cache of commonly misspelled terms - but I would test that first.

Cheers,
Grant
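The Jazzy-vs-SpellChecker distinction above is about where the word list comes from; the ranking trick behind Lucene's index-based SpellChecker is character n-gram overlap. A self-contained sketch of that idea (illustrative only — the real SpellChecker indexes the grams in a Lucene index and rescores candidates with an edit-distance measure):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of n-gram similarity: two words are "close" when they share many
// character n-grams. This is the core notion the index-based SpellChecker
// builds on, reduced to a toy in-memory comparison.
public class NGramSimilarity {

    // Collect the set of character n-grams of a word, e.g. bigrams of
    // "word" are {wo, or, rd}.
    static Set<String> grams(String word, int n) {
        Set<String> out = new HashSet<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Dice coefficient over bigram sets: 2*|A intersect B| / (|A| + |B|).
    // 1.0 means identical gram sets, 0.0 means nothing in common.
    static double similarity(String a, String b) {
        Set<String> ga = grams(a, 2);
        Set<String> gb = grams(b, 2);
        int total = ga.size() + gb.size();
        if (total == 0) return 0.0;
        ga.retainAll(gb);  // ga now holds the intersection
        return 2.0 * ga.size() / total;
    }

    public static void main(String[] args) {
        System.out.println(similarity("lucene", "lucen"));  // high: one dropped letter
        System.out.println(similarity("lucene", "jazzy"));  // 0.0: no shared bigrams
    }
}
```

This also makes Grant's point concrete: n-gram closeness to corpus terms says nothing about whether a word is *correctly* spelled, only whether it resembles something already indexed - hence the suggestion to prefer a curated dictionary (Jazzy) for true misspelling detection.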
Re: How to add an Arabic and Farsi language analyzer to Lucene
I just added an Arabic Analyzer to contrib/analysis. No clue as to when that will percolate to the .NET version. I believe you can search the archives for help with Persian; as I recall, someone offered something in the past.

On Dec 12, 2008, at 9:40 AM, Ian Vink wrote:

> Anyone heard of one for Lucene.NET?
> Ian
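For anyone wondering what an Arabic analyzer has to do beyond plain tokenization: a big part is orthographic normalization, e.g. dropping tatweel (kashida) and short-vowel marks (harakat) so surface variants collapse to one index token. A toy sketch of that normalization pass — this is an illustration of the idea, not the contrib ArabicAnalyzer implementation:

```java
// Toy Arabic normalization: strip tatweel (U+0640) and the harakat range
// (U+064B fathatan through U+0652 sukun) so that decorated and plain spellings
// of a word produce the same token. A real analyzer also handles alef and
// yeh variants, plus light stemming.
public class ArabicNormalize {
    static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\u0640') continue;                  // tatweel (kashida)
            if (c >= '\u064B' && c <= '\u0652') continue; // harakat diacritics
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "kitab" (book) written with a decorative tatweel and a fatha mark
        String decorated = "\u0643\u0640\u062A\u064E\u0627\u0628";
        String plain = "\u0643\u062A\u0627\u0628";
        System.out.println(normalize(decorated).equals(plain));
    }
}
```

The payoff at search time: a query typed without diacritics still matches indexed text that carries them, because both sides pass through the same normalization.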
Re: How to search for -2 in field?
Hi Matt,

Thanks for the thought. Yeah, I see it there in Luke, but the other gentleman's idea that maybe Luke is producing something different than the code does might be a clue. It would be odd, if true, but nothing else works, so I will see if that is it.

Darren

On Fri, 2008-12-12 at 08:03 -0500, Matthew Hall wrote:

Are you absolutely, 100% sure that the -2 token has actually made it into your index? As a VERY basic way to check this, try something like this:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    public class IndexTerms {
        public static void main(String[] args) {
            try {
                IndexReader ir = IndexReader.open("C:/Search/index/index");
                TermEnum te = ir.terms();
                while (te.next()) {
                    System.out.println(te.term().text());
                }
                te.close();
                ir.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Then look through the output, verifying that the tokens you expect to exist in your index actually do. I have a feeling that whatever analyzer you are using is dropping the - from the front of your -2 at indexing time, and if so it can sometimes be pretty hard to tell via Luke.

Hope this helps,
-Matt

Darren Govoni wrote:

Tried them all, with quotes, without. Doesn't work. At least in Luke it doesn't.

On Fri, 2008-12-12 at 07:03 +0530, prabin meitei wrote:

The whitespace analyzer will tokenize on whitespace irrespective of quotes. Use the standard analyzer or keyword analyzer.

Prabin meitei
toostep.com

On Thu, Dec 11, 2008 at 11:28 PM, Darren Govoni dar...@ontrenet.com wrote:

I'm using Luke to find the right combination of quotes, \'s and analyzers. No combination can produce a positive result for "-2 String" on the field 'type' (any -number String).

    type: 0 -2 Word

Notation below: analyzer: query -> rewritten = result (the default field is 'type').
WhitespaceAnalyzer:

    \-2 ConfigurationFile\  ->  type:-2 type:ConfigurationFile  = NO
    -2 ConfigurationFile    ->  -type:2 type:ConfigurationFile  = NO
    \-2 ConfigurationFile   ->  type:-2 type:ConfigurationFile  = NO
    \-2 ConfigurationFile   ->  type:-2 ConfigurationFile       = NO (thought this one would work)

Same results for the other analyzers, more or less. Weird.

Darren

On Thu, 2008-12-11 at 23:02 +0530, prabin meitei wrote:

Hi,

While constructing the query, give the query string in quotes, e.g.:

    query = queryparser.parse("\-2 word");

Prabin meitei
toostep.com

On Thu, Dec 11, 2008 at 10:37 PM, Darren Govoni dar...@ontrenet.com wrote:

I'm hoping to do this with a simple query string, but I'm not sure if it's possible. I'll try your suggestion as a workaround, though. Thanks!!

On Thu, 2008-12-11 at 16:48 +0000, Robert Young wrote:

You could do it with a TermQuery, but I'm not quite sure if that's the answer you're looking for.

Cheers
Rob

On Thu, Dec 11, 2008 at 3:59 PM, Darren Govoni dar...@ontrenet.com wrote:

Hi,

This might be a dumb question, but I have a simple field like this:

    field: 0 -2 Word

that is indexed, tokenized, and stored. I've tried various ways in Lucene (using Luke) to search for "-2 Word" and none of them work; the query is rewritten improperly. I escaped the -2 to \-2 Word and it still doesn't work. I've used all the analyzers. What's the trick here?
Thanks,
Darren
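For the record, the parser behavior behind this thread: to the query parser, a leading - on a clause is the prohibit (NOT) operator, which is why -2 gets rewritten to -type:2. QueryParser ships a static escape(String) helper for exactly this; the sketch below is a hypothetical re-implementation of the idea, backslash-escaping the parser's special characters:

```java
// Sketch of query-string escaping in the spirit of QueryParser.escape():
// prefix every parser-special character with a backslash so it is read as
// term text rather than syntax. Illustrative re-implementation, not the
// Lucene source.
public class EscapeSketch {
    // Characters the classic query parser treats as syntax.
    static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(EscapeSketch.escape("-2 Word")); // the '-' gets a backslash
    }
}
```

Note that escaping only fixes the parse; per Matt's point above, it does nothing if the analyzer already stripped the - at indexing time, so the term -2 must actually exist in the index for the escaped query to match.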
Lucene - Authentication
Hi,

If I have a Lucene index (or Solr) that is installed on client premises, how would you go about securing the index from being queried in an unauthorized fashion? For example, from malicious users or hackers, or for that matter internal users trying to reverse-engineer the system and use it for purposes other than the way it was licensed.

Any suggestions?

as
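One common pattern for this: never expose the index files or the Solr port directly; front them with a thin query service and require every request to carry a signature computed from a per-client secret, so only licensed callers can issue queries. A minimal HMAC sketch of that handshake — all names here are illustrative, this is one possible approach rather than a Lucene feature:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Toy request-signing scheme: the client signs the query string with a shared
// secret, the server recomputes the signature and rejects mismatches. A real
// deployment would also sign a timestamp/nonce to block replay attacks.
public class SignedQuery {

    static String sign(String query, String secret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(secret.getBytes("UTF-8"), "HmacSHA1"));
        byte[] raw = mac.doFinal(query.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : raw) hex.append(String.format("%02x", b & 0xff));
        return hex.toString();
    }

    // Server side: recompute and compare before running the search.
    static boolean verify(String query, String signature, String secret) throws Exception {
        return sign(query, secret).equals(signature);
    }

    public static void main(String[] args) throws Exception {
        String sig = sign("title:lucene", "licensed-client-key");
        System.out.println(verify("title:lucene", sig, "licensed-client-key")); // accepted
        System.out.println(verify("title:lucene", sig, "wrong-key"));           // rejected
    }
}
```

This addresses network callers; it does nothing against someone with filesystem access to the index, which needs OS-level permissions or encryption on top.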
Re: Lucene - Authentication
http://people.apache.org/~hossman/#threadhijack

Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to an existing message; instead, start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, so your question is hidden in that thread and gets less attention. It also makes following discussions in the mailing list archives particularly difficult.

See also: http://en.wikipedia.org/wiki/Thread_hijacking

-Hoss