Re: PDF Text extraction
To get the string value of a java.io.InputStream, read it into a ByteArrayOutputStream and build the String from the collected bytes:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
int b;
while ((b = inputstream.read()) != -1) {
    baos.write(b);
}
System.out.println(new String(baos.toByteArray()));

mvh karl øie

On Friday, Dec 27, 2002, at 07:34 Europe/Oslo, Suhas Indra wrote:

Hello List. I am using PDFBox to index some of the PDF documents. The parser works fine and I can read the summary. But the contents are displayed as java.io.InputStream. When I try the following: System.out.println(doc.getField("contents")) (where doc is the Document object), the result is: Textcontents:java.io.InputStreamReader@127dc0. I want to print the extracted data. Can anyone please let me know how to extract the contents?

Regards, Suhas
-- Robosoft Technologies - Partners in Product Development

--
To unsubscribe, e-mail: mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]
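The InputStream-to-String conversion discussed above can be written as a small self-contained helper. This is a sketch (the class and method names are mine, not from the thread); note it decodes with the platform default charset, as the original snippet did — pass an explicit charset in real code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamToString {
    // Drain the stream into a growable byte buffer, then decode it.
    static String read(InputStream in) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            baos.write(buf, 0, n);
        }
        return new String(baos.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("hello pdf text".getBytes());
        System.out.println(read(in)); // hello pdf text
    }
}
```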
Re: problems with search on Russian content
Hi, i took a look at Andrey Grishin's russian character problem and found something strange while we tried to debug it. He seems to have avoided the usual problem of querying with a different encoding than was indexed, as he can dump out correctly encoded russian at all points in his application. Are the strings for terms treated differently than the text stored in text fields? The reason i ask is that his russian words are correct in the stored text fields, but show up faulty in a terms() dump. If he had a character encoding problem in his application, the fields should show up faulty as well, i think. Even stranger is that i use Lucene 1.2 successfully for utf-8, iso-8859-1, iso-8859-5 and iso-8859-7. Why is this problem showing up with russian (Cp1251) and not the other encodings? Strangeness number two is this theory: if the russian word ,!,_,U was skewed to, say, 0d66539qw upon indexing, and the problem was just a consistent encoding problem, wouldn't a query with ,!,_,U be skewed to 0d66539qw as well and be found anyway?

mvh karl øie

Begin forwarded message: From: Andrey Grishin [EMAIL PROTECTED] Date: Thu Nov 21, 2002 15:13:33 Europe/Oslo To: Karl Øie [EMAIL PROTECTED] Subject: Re: How to include strange characters??

yes, you are right - there are no russian words in the returned terms :((( I've just executed the following:

IndexReader r = IndexReader.open("C:\\j\\jakarta-tomcat-4.1.12\\index\\ukrenergo");
TermEnum e = r.terms();
while (e.next()) {
    Term term = (Term) e.term();
    System.out.println("term : " + term.text());
}

and got no russian words in the result. There are some strange terms returned instead of russian:

term : 0d4xvp70w
term : 0d66539qw
term : 0d67les2o
term : 0d6eqgic0
etc.

So, I think we got a problem. This is great :)), thank you... but how to fix it?

- Original Message - From: Karl Øie [EMAIL PROTECTED] To: Andrey Grishin [EMAIL PROTECTED] Sent: Thursday, November 21, 2002 3:56 PM Subject: Re: How to include strange characters??
another thing to check is whether the IndexReader.terms() enumeration actually contains your term.

mvh karl oie

On Thursday, Nov 21, 2002, at 14:31 Europe/Oslo, Andrey Grishin wrote:

Karl, I have the same problem with lucene search within russian content. I tried all your advice, but lucene still can't find anything. I indexed the content using the Cp1251 charset:

text = new String(text.getBytes("Cp1251"));
doc.add(Field.Text(CONTENT_FIELD, text));

and I am searching using the same charset:

String txt = ",!,_,U";
txt = new String(txt.getBytes("Cp1251"));
PrefixQuery query = new PrefixQuery(new Term(PortalHTMLDocument.CONTENT_FIELD, txt));
hits = searcher.search(query);

and lucene can't find anything. Also I checked for the DecodeInterceptor in my server.xml - there isn't any. I tried UTF-8/16 - and got the same result. If I list all the index's content by iterating an IndexReader, I can see that my russian content is stored in the index... Can you please help me? Do you have any more ideas about what else can be done here to fix this problem? I will appreciate any help. Thanks, Andrey.

P.S. I am using lucene 1.2, tomcat 4.1.12, jdk 1.4.1 on Win2000 AS
Re: problems with search on Russian content
Sorry, my bad! Didn't read this informative post :-)

mvh karl øie

On Thursday, Nov 21, 2002, at 16:35 Europe/Oslo, Otis Gospodnetic wrote:

Look at the CHANGES.txt document in CVS - there is some new stuff in the org.apache.lucene.analysis.ru package that you will want to use. Get Lucene from the nightly build... Otis

--- Andrey Grishin [EMAIL PROTECTED] wrote:

Hi All, I have problems with searching on Russian content using lucene 1.2. I indexed the content using the Cp1251 charset:

text = new String(text.getBytes("Cp1251"));
doc.add(Field.Text(CONTENT_FIELD, text));

and I am searching using the same charset:

String txt = "·Œƒ";
txt = new String(txt.getBytes("Cp1251"));
PrefixQuery query = new PrefixQuery(new Term(PortalHTMLDocument.CONTENT_FIELD, txt));
hits = searcher.search(query);

or

Analyzer analyzer = new StandardAnalyzer();
String txt = "·Œƒ“≈";
txt = new String(txt.getBytes("Cp1251"));
Query query = QueryParser.parse(txt, PortalHTMLDocument.CONTENT_FIELD, analyzer);
hits = searcher.search(query);

and lucene can't find anything. Also I checked for the DecodeInterceptor in my server.xml - there isn't any. I tried UTF-8/16 - and got the same result. Also, if I list all the index's content by iterating an IndexReader, I can see that my russian content is stored in the index... Can you please help me? Do you have any more ideas about what else can be done here to fix this problem? I will appreciate any help. Thanks, Andrey.

P.S. I am using lucene 1.2, tomcat 4.1.12, jdk 1.4.1 on Win2000 AS
Re: Help on creating and maintaining an index that changes
I want to do something similar with Lucene, but I don't know how to approach it. I thought maybe keeping the first hashmap as is, and building a Directory in lucene that replaces the master Hashmap. When I get hits back from lucene I look them up in the first hashmap, and return those.

If your index is big it's probably best to do it this way. I have indexes that take up to 12 hours to build and about 1gb of harddrive space, but searching is still fast. If you put the client ids into keyword fields you can use Lucene to filter out hits from the clients you know are offline by using a boolean NOT, either manually or through the queryparser.

How do I put the needed information into the Directory so I can look them up in the first hashmap? I would need the unique id identifying the client, and a key that identifies the document that the client has.

You add a keyword field to each document that contains the unique id identifying the client. This way you can search for documents from a client, and also filter out documents from that client.

Then how do I clean up the Directory when a client is not available? How do I remove a document from Lucene's Directory?

The org.apache.lucene.index.IndexReader class contains a delete() function to delete documents from lucene. But as said before, if your index is big it's best not to delete the documents just because a client goes offline; it's better to filter out the hits.

mvh karl øie
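The boolean-NOT filtering suggested above can be done at the query-string level before the string ever reaches the QueryParser. A hypothetical helper (the field name client_id and the method name are mine, not from the thread; depending on the QueryParser version you may need the "-" prefix instead of the NOT keyword):

```java
import java.util.Arrays;
import java.util.List;

public class ClientFilter {
    // Wrap the user's query and append a NOT clause per offline client,
    // using Lucene query syntax: (user query) NOT client_id:c42 ...
    static String excludeOffline(String userQuery, List<String> offlineIds) {
        StringBuilder sb = new StringBuilder("(").append(userQuery).append(")");
        for (String id : offlineIds) {
            sb.append(" NOT client_id:").append(id);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(excludeOffline("lucene index", Arrays.asList("c42", "c17")));
        // (lucene index) NOT client_id:c42 NOT client_id:c17
    }
}
```

The resulting string would then be handed to the QueryParser as usual, so the keyword field does the filtering at search time instead of deleting documents.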
Re: Indexing of documents in memory
The org.apache.lucene.store.InputStream is not a _stream_ per se, as it requires a seek() function; it is therefore not compatible with the java.io.InputStream concept. However, you can quite easily create a java.io.InputStream by grabbing hold of the byte content of an org.apache.lucene.store.InputStream and stuffing it into a java.io.ByteArrayInputStream. This doesn't make any sense here anyhow, because the raw byte stream from a RAMDirectory will not mean anything to an HTML parser: the content of the RAMDir is a binary index. If you want to store the input HTML documents you should store them as a byte or char array in a file or database.

mvh karl øie

On Monday, Nov 18, 2002, at 03:24 Europe/Oslo, Vinay Kakade wrote:

Hi, I am trying to use RAMDirectory to store the input HTML documents which are used to create the index by the IndexHTML demo program, but I am facing problems. I tried to get individual InputStream objects for individual files from the RAMDirectory and pass them to the HTMLParser class to parse the file, but the HTMLParser class accepts a java.io.InputStream object while RAMDirectory returns a lucene.store.InputStream object. Is there any way to perform a conversion between these two objects? Or do I have to modify the HTMLParser class and all the other classes it uses to achieve this?? Please let me know. regards Vinay.

--- Otis Gospodnetic [EMAIL PROTECTED] wrote: Look at RAMDirectory. Otis

--- Vinay Kakade [EMAIL PROTECTED] wrote: Hi, I want to use Lucene for indexing some documents which are in memory. I do not want to store them in a separate directory. The IndexWriter class accepts a directory name, where all documents to be indexed are stored. Is there any way by which we can specify a memory buffer in which documents are stored while creating the index? Thanks Vinay.
Re: Indexing distant web sites
oh, sorry.. i was perhaps not making myself clear here... you will have to use the crawler to retrieve the content and store it locally for indexing. So you will have to set up your crawler to fetch a site and store every html page's content to disk, then run Lucene on the locally stored html pages, and afterwards delete the html pages... You will also need a way to get the original url from the crawler and store that in Lucene as a keyword field as well.

A much more efficient way is to get the crawler to fetch one page, store it in memory, run Lucene on it, then discard the buffer and move on to the next page.

If you want to take a look at a real Lucene + crawler implementation you can check out the Cocoon project at http://xml.apache.org/cocoon/index.html :

Lucene integration: http://cvs.apache.org/viewcvs.cgi/xml-cocoon2/src/java/org/apache/cocoon/components/search/
Crawler implementation: http://cvs.apache.org/viewcvs.cgi/xml-cocoon2/src/java/org/apache/cocoon/components/crawler/

This impl is indexing XML, but the principle is the same...
mvh karl øie

On Monday, Nov 4, 2002, at 14:29 Europe/Oslo, Friaa Nafaa wrote:

Thank you, I installed this crawler and ran it, but I would like to index the web site and not just list the links visited by the crawler. Is there a way to search a web page with lucene which uses this crawler for visiting the pages? Thanks.

--- On Mon, 4 Nov 2002 12:31:50 +0100, Karl Marx [EMAIL PROTECTED] wrote: Subject: Re: Indexing distant web sites

As stated in the official FAQ, Lucene doesn't implement a web-crawler; you can however use a self-made crawler or customize a crawler framework like websphinx (http://www-2.cs.cmu.edu/~rcm/websphinx/) to retrieve html documents from a site and then feed them to Lucene. mvh karl øie

On Monday, Nov 4, 2002, at 11:49 Europe/Oslo, Friaa Nafaa wrote:

Hello, is there any way to index web sites with lucene, assuming we know only the url of the site? In local use we pass to lucene the full directory tree of our site (containing all the documents) and begin the indexing operation, but when I would like to index a distant site on the web... what do I do? For example I installed Lucene on my computer and I would like to index the site: http://www.excite.com ... Thanks

--
To unsubscribe, e-mail: mailto:lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: mailto:lucene-user-help@jakarta.apache.org
Re: Multithread searching problem on Linux
if you still have problems, take a look at this note found in the newest tomcat release... it might help.

mvh karl øie

--- Linux and Sun JDK 1.2.x - 1.3.x: ---
Virtual machine crashes can be experienced when using certain combinations of kernel / glibc under Linux with Sun Hotspot 1.2 to 1.3. The crashes were reported to occur mostly on startup. Sun JDK 1.4 does not exhibit the problems, and neither does the IBM JDK for Linux. The problems can be fixed by reducing the default stack size: at a bash shell, do ulimit -s 2048; use limit stacksize 2048 for tcsh. GLIBC 2.2 / Linux 2.4 users should also define an environment variable: export LD_ASSUME_KERNEL=2.2.5

On onsdag, okt 2, 2002, at 15:34 Europe/Oslo, Stas Chetvertkov wrote:

Yes, it works without errors with the classic JVM, if only it were not so painfully slow :( Anyway, I'll check what is faster - the classic JVM with multiple search threads or Hotspot with 1 searching thread (as we have now). Thanks, Stas.

Try to run your vm in classic mode (java -classic) to disable the hotspot features... mvh karl øie
Re: How to include strange characters??
Also note that both apache and tomcat have a default setting that forcibly re-encodes all pages: in tomcat it is the DecodeInterceptor in server.xml; in apache it is a line that says AddDefaultCharset on in httpd.conf. These are applied _after_ any servlet output, so they might lead to strange results; be sure to turn off both directives when you test different encoding problems.

Last but not least there is the encoding the SQL database was created with. On DB2 i have to use the right database constructor to get norwegian character support (db2 CREATE DATABASE mydb USING CODESET ISO-8859-1 TERRITORY NO COLLATE USING SYSTEM;). Without the correct encoding on the database constructor the database behaves strangely in sorting and insert/update scenarios.

To be sure to get everything, make sure that all steps use the same encoding, just like you use the same analyzer (perhaps encoding should be part of an analyzer?!?):

1: create the database with ISO-8859-1 encoding (my favorite)...
CREATE DATABASE mydb USING CODESET ISO-8859-1 TERRITORY NO COLLATE USING SYSTEM;

2: in the indexer, force-feed lucene with ISO-8859-1 strings:
String value = resultset.getString(fieldname);
document.add(Field.UnStored(fieldname, new String(value.getBytes("ISO-8859-1"))));
...

3: force-encode all queries to lucene in the same manner:
String querystring = httprequest.getParameter("query");
querystring = new String(querystring.getBytes("ISO-8859-1"));
...

mvh karl øie

On søndag, okt 13, 2002, at 14:15 Europe/Oslo, Chris Davis wrote:

To Dominator, were you able to solve the display problem as well? I am having a similar problem with documents that contain the open double quote (&#8220;). I am not concerned with searching on the character, but when I attempt to display a stored field with this character, it does not display correctly. Even stranger, the closing quote (&#8221;) does display.
To All, I have browsed through the majority of messages related to Unicode in the archive, and my reading tells me that Lucene does not normally change the data that is stored for a field. Can someone give me some pointers on how to troubleshoot this problem? Note: I am indexing data that is being pulled from a SQL Server 2000 DB on Windows 2000.

--- In an earlier message Dominator wrote:

When I print out a result string it shows a very strange result; for example a search for civilingenircaron;r gives the string civilingeniAbreve;¸r. I'm sure it's a unicode problem, but where can I change it??

Dominator wrote: thx, with your help I could solve the problem

karl øie [EMAIL PROTECTED] wrote in message news:...

i had such problems with norwegian characters and it resolved into making sure the querystring has the same encoding as the index. Since this is again a java.lang.String encoding question, i had these problems with querystrings coming from java Servlets and the CLI. For both, the quickfix was to re-encode the query in UTF-8/16:

String querystring = argv[0];
// or: String querystring = httprequest.getParameter("query");
querystring = new String(querystring.getBytes("UTF-8"));
...

this fixed my norwegian/samii problems... mvh karl øie

On mandag, okt 7, 2002, at 13:04 Europe/Oslo, Dominator wrote:

I use the czech language with more bizarre characters and there is no problem at all. Are you sure that your XML contains character set information?

yes, I tried <?xml version="1.0" encoding="ISO-8859-2"?> and <?xml version="1.0" encoding="UTF-8"?> but I get the same strange characters.
Re: How to include strange characters??
i had such problems with norwegian characters and it resolved into making sure the querystring has the same encoding as the index. Since this is again a java.lang.String encoding question, i had these problems with querystrings coming from java Servlets and the CLI. For both, the quickfix was to re-encode the query in UTF-8/16:

String querystring = argv[0];
// or: String querystring = httprequest.getParameter("query");
querystring = new String(querystring.getBytes("UTF-8"));
...

this fixed my norwegian/samii problems...

mvh karl øie

On mandag, okt 7, 2002, at 13:04 Europe/Oslo, Dominator wrote:

I use the czech language with more bizarre characters and there is no problem at all. Are you sure that your XML contains character set information?

yes, I tried <?xml version="1.0" encoding="ISO-8859-2"?> and <?xml version="1.0" encoding="UTF-8"?> but I get the same strange characters.
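The re-encoding quickfix above only works when every step agrees on the charset. A small JDK-only demonstration of why a charset mismatch garbles text (the Norwegian word blåbær is just a stand-in example; the helper name is mine):

```java
import java.io.UnsupportedEncodingException;

public class EncodingRoundTrip {
    // Encode with one charset and decode with another.
    static String recode(String s, String writeCharset, String readCharset)
            throws UnsupportedEncodingException {
        return new String(s.getBytes(writeCharset), readCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String word = "bl\u00e5b\u00e6r"; // "blåbær"
        // Same charset on both sides: the text survives.
        System.out.println(recode(word, "ISO-8859-1", "ISO-8859-1").equals(word)); // true
        // Mismatched charsets: the text comes back garbled.
        System.out.println(recode(word, "UTF-8", "ISO-8859-1").equals(word)); // false
    }
}
```

The same mismatch can happen at any boundary in the chain — servlet parameter decoding, the database codeset, or the XML declaration — which is why the advice in this thread is to pin one encoding everywhere.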
Re: Multithread searching problem on Linux
there have been numerous problems/bad features with the hot-spot mode in sun's linux vm; the reason is that hot-spotting optimizes your code by doing very weird stuff to it :-) anyway i'm glad -classic works well for you.

the bad performance is a known problem with the linux-kernel java threading system. When it comes to threads, windows jvms outperform linux jvms because of kernel-internal things i don't even try to understand... i have used the 1.3.1 linux jvm from ibm with great stability; how does the ibm jvm perform against the sun jvm when it comes to thread performance? there is also a free 1.3 jvm from a group called blackdown that is optimized for linux. there was some talk in the news about it being very good at threading... you could try it.. ( http://www.blackdown.org/ )

mvh karl øie

On onsdag, okt 2, 2002, at 15:34 Europe/Oslo, Stas Chetvertkov wrote:

Yes, it works without errors with the classic JVM, if only it were not so painfully slow :( Anyway, I'll check what is faster - the classic JVM with multiple search threads or Hotspot with 1 searching thread (as we have now). Thanks, Stas.

Try to run your vm in classic mode (java -classic) to disable the hotspot features... mvh karl øie
Re: Multithread searching problem on Linux
Try to run your vm in classic mode (java -classic) to disable the hotspot features...

mvh karl øie

On tirsdag, okt 1, 2002, at 18:16 Europe/Oslo, Stas Chetvertkov wrote:

Hi All, I am building a search engine based on Lucene. Recently I created a test simulating multiple users searching in the same index simultaneously and found out that quite often the JVM crashes with 'Hotspot Virtual Machine Error : 11'. I could not reproduce this bug on a Windows box, but observed it a lot on Red Hat Linux 7.3 with different versions of Sun's 1.3 JVM, including the most recent one (1.3.1_04 at the moment). I am attaching a simple test that generates the hotspot error in 90% of cases. In our code we have to create a new IndexSearcher for every search because the indices are updated in real time. The only workaround I have found for this problem so far is reducing the number of searching threads, which does not seem to be a good solution. Had anyone encountered problems like this one? Regards, Stas. [attachment: SearchTest.java]
RE: Problems understanding RangeQuery...
thank you, that works! :-) and saves my day!

mvh karl øie

-Original Message- From: Terry Steichen [mailto:[EMAIL PROTECTED]] Sent: 10. august 2002 18:29 To: Lucene Users List; [EMAIL PROTECTED] Subject: Re: Problems understanding RangeQuery...

Hi Karl, I have discovered that with range queries you *must* ensure there is a space on either side of the dash. That is, [1971 - 1979] rather than [1971-1979]. If you don't, Lucene will interpret it as [1979 - null]. To illustrate a bit more, here are some result totals that I get on my index:

pub_mo:[07 - 08] -- 8370 (note the spaces around the dash)
pub_mo:[07-08] -- 2133 (note the absence of spaces)
pub_mo:[08 - null] -- 2133
pub_mo:(07 08) -- 8370 (note the use of parentheses, not brackets)

Just put the spaces in and all should be OK. Regards, Terry

- Original Message - From: Karl Øie [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Saturday, August 10, 2002 11:47 AM Subject: Problems understanding RangeQuery...

Hi, i have a problem with understanding RangeQueries in Lucene-1.2: I have created an index with posts that have the field W_PUBLISHING_YEAR, which contains the year of publishing. After indexing i loop through the terms and find the following terms present in the index:

1923,1925,1926,1930,1933,1935,1936,1938,1942,1943,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2010,2018,2097

in 232290 documents.
Then i run these queries on the index: W_PUBLISHING_YEAR:[1971-1979] and W_PUBLISHING_YEAR:[2000-2002], and both queries give me some strange results:

W_PUBLISHING_YEAR:[1971-1979] found={1975, 1974, 1973, 1972, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 2018, 1992, 1991, 1990, 2010, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 2004, 2003, 2002, 2001, 2097, 2000, 1979, 1978, 1977, 1976} in 150793 matching documents.

W_PUBLISHING_YEAR:[2000-2002] found={2002, 2001, 2097, 2010, 2018, 2004, 2003} in 10756 matching documents.

Is there something i do wrong here? How is the RangeQuery supposed to work?
Problems understanding RangeQuery...
Hi, i have a problem with understanding RangeQueries in Lucene-1.2: I have created an index with posts that have the field W_PUBLISHING_YEAR, which contains the year of publishing. After indexing i loop through the terms and find the following terms present in the index:

1923,1925,1926,1930,1933,1935,1936,1938,1942,1943,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2010,2018,2097

in 232290 documents.

Then i run these queries on the index: W_PUBLISHING_YEAR:[1971-1979] and W_PUBLISHING_YEAR:[2000-2002], and both queries give me some strange results:

W_PUBLISHING_YEAR:[1971-1979] found={1975, 1974, 1973, 1972, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 2018, 1992, 1991, 1990, 2010, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 2004, 2003, 2002, 2001, 2097, 2000, 1979, 1978, 1977, 1976} in 150793 matching documents.

W_PUBLISHING_YEAR:[2000-2002] found={2002, 2001, 2097, 2010, 2018, 2004, 2003} in 10756 matching documents.

Is there something i do wrong here? How is the RangeQuery supposed to work?
Re: Crash / Recovery Scenario
only deletes the old one while it's working on the new one, so is there a way of checking for the .lock files in case of a crash and rolling back to the old index image? Nader Henein

i have some thoughts about crash/recovery/rollback that i haven't found any good solutions for. If a crash happens during writing there is no good way to know if the index is intact; removing lock files doesn't help this fact, as we really don't know. So providing rollback functionality is a good but expensive way of compensating for the lack of recovery.

To provide rollback i have used a RAMDirectory and serialized it to a SQL table. By doing this i can catch any exceptions and ask the database to roll back if required. This works great for small indexes, but as the index grows you will have problems with performance, because the whole RAMDir has to be serialized/deserialized into the BLOB all the time.

A better solution would be to hack the FSDirectory to store each file it would normally write to a file directory as a serialized byte array in a blob of a sql table. This would increase performance because the whole Directory doesn't have to change each time, and it doesn't have to read the whole directory into memory. I also suspect lucene sorts its records into these different files for increased performance (like: i KNOW that record will be in segment xxx if it is there at all).

I have looked at the source for the RAMDirectory and the FSDirectory and they could both be altered to store their internal buffers in a BLOB, but i haven't managed to do this successfully. The problem i have been pounding on is the lucene InputStream's seek() function. This really requires the underlying impl to be either a file or an array in memory. For a BLOB this would mean that the blob has to be fetched, then read/seek-ed/written, then stored back again. (is this correct?!?, and if so, is there a way to know WHEN it is required to fetch/store the array?)
I would really appreciate any tips on this, as i think crash/recovery/rollback functionality would benefit lucene greatly. I have indexes that take 5 days to build, and it's really bad to receive exceptions during a long index run with no recovery/rollback functionality.

Mvh Karl Øie
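The serialize-to-a-BLOB rollback described above boils down to turning an object graph into a byte[] and back. A JDK-only sketch, with a HashMap standing in for the RAMDirectory (class and method names are mine):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

public class BlobSnapshot {
    // What would be written into the SQL BLOB column at a commit point.
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(baos);
        oos.writeObject(obj);
        oos.close();
        return baos.toByteArray();
    }

    // The rollback path: restore the last committed snapshot.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes));
        Object obj = ois.readObject();
        ois.close();
        return obj;
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, String> index = new HashMap<String, String>();
        index.put("doc1", "some indexed content");
        byte[] blob = toBytes(index);             // commit point
        index.put("doc2", "work after the commit, lost on rollback");
        HashMap<?, ?> restored = (HashMap<?, ?>) fromBytes(blob); // rollback
        System.out.println(restored.size()); // 1
    }
}
```

As the post notes, this whole-snapshot approach is the expensive part: every commit re-serializes everything, which is why a per-file BLOB scheme would scale better.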
Re: SearchBean Persistence
if the array is of a serializable sort, just store it in a sql table !?!

mvh karl øie

On Wednesday 03 July 2002 16:22, Terry Steichen wrote:

I'm using Peter's SearchBean code to sort search results. It works fine, but it creates the sorting field array from scratch with every invocation (which takes on the order of a second or so to complete - each search itself takes about one tenth of that or less). While I can conduct several searches in the same module, I can't figure out how to persist the sorting field array between invocations of the search module. Any advice on how to do this would be much appreciated. Regards, Terry
Re: SearchBean Persistence
if it is a Stateful SessionBean you will have to create an EntityBean implementation with the same functionality, and then in the EJB's load() and store() you will have to serialize the array. Or if it is a CMP EJB, just declare the array as a persistent field.

mvh karl

On Wednesday 03 July 2002 16:39, Terry Steichen wrote:

Karl, just to clarify. I have an application that runs searches as requested by users. The application is persistent across multiple requests, so there's no problem creating it at startup. And, given the application's persistence, there should be no problem storing it in memory to serve subsequent requests. I just can't figure out how to modify the SearchBean code to do this. It seemed like it would be simple, but try as I might, nothing has so far worked. Regards, Terry

- Original Message - From: Karl Øie [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, July 03, 2002 10:34 AM Subject: Re: SearchBean Persistence

if the array is of a serializable sort, just store it in a sql table !?! mvh karl øie

On Wednesday 03 July 2002 16:22, Terry Steichen wrote: I'm using Peter's SearchBean code to sort search results. It works fine, but it creates the sorting field array from scratch with every invocation (which takes on the order of a second or so to complete - each search itself takes about one tenth of that or less). While I can conduct several searches in the same module, I can't figure out how to persist the sorting field array between invocations of the search module. Any advice on how to do this would be much appreciated. Regards, Terry
Re: SearchBean Persistence
oh, i see. i was misled by the Bean part of SearchBean... i'm sorry! :-) Anyhow, if it is not a Stateful SessionBean you are not restricted by EJB rules and can thus serialize everything you want to disk or db...

mvh karl øie

On Wednesday 03 July 2002 17:20, Otis Gospodnetic wrote:

I think you guys are not understanding each other. Terry is talking about the code in the Lucene Sandbox, not about EJBs. I don't use that code (yet?), so I don't know the answer. Otis

--- Karl Øie [EMAIL PROTECTED] wrote:

if it is a Stateful SessionBean you will have to create an EntityBean implementation with the same functionality, and then in the EJB's load() and store() you will have to serialize the array. Or if it is a CMP EJB, just declare the array as a persistent field. mvh karl

On Wednesday 03 July 2002 16:39, Terry Steichen wrote: Karl, just to clarify. I have an application that runs searches as requested by users. The application is persistent across multiple requests, so there's no problem creating it at startup. And, given the application's persistence, there should be no problem storing it in memory to serve subsequent requests. I just can't figure out how to modify the SearchBean code to do this. It seemed like it would be simple, but try as I might, nothing has so far worked. Regards, Terry

- Original Message - From: Karl Øie [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, July 03, 2002 10:34 AM Subject: Re: SearchBean Persistence

if the array is of a serializable sort, just store it in a sql table !?! mvh karl øie

On Wednesday 03 July 2002 16:22, Terry Steichen wrote: I'm using Peter's SearchBean code to sort search results. It works fine, but it creates the sorting field array from scratch with every invocation (which takes on the order of a second or so to complete - each search itself takes about one tenth of that or less).
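Since the sorting field array in this thread only needs to be serializable, a minimal sketch of caching it on disk with plain Java serialization might look like the following (the class, the field type, and the cache location are all made up for illustration; SearchBean itself may hold the array differently):

```java
import java.io.*;

public class SortFieldCache {
    // Hypothetical cache file location; adjust for your deployment.
    private static final File CACHE =
            new File(System.getProperty("java.io.tmpdir"), "sortfields.ser");

    // Store the array once it has been built (the expensive step per the thread).
    public static void store(String[] sortFields) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(CACHE))) {
            out.writeObject(sortFields);
        }
    }

    // Load the cached array on later invocations; returns null if no cache exists yet.
    public static String[] load() throws IOException, ClassNotFoundException {
        if (!CACHE.exists()) return null;
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(CACHE))) {
            return (String[]) in.readObject();
        }
    }
}
```

The same round trip works against a database blob column instead of a file, which is what the "store it in a sql table" suggestion amounts to.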
Wildcard searching
Hi, i have experimented with prefixing all Field values with the letter A to allow the wildcards * and ? to be positioned first in a query term. What i would like to do next is to prefix all the terms produced by the QueryParser with the letter A so the hack is transparent to the user. Is there a simple way to do this, as the Query's subclasses don't allow you to modify the term they hold? Secondly, i can not find any way to get all sub-queries of a query. Does anyone here know something really smart i can do short of learning to program JavaCC ?!? And in the end: is there a reason why lucene doesn't use java interfaces for, eh, interfaces like the Query class? mvh karl øie -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
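For what it's worth, the user-side half of the hack described above can be sketched as a naive query-string rewrite (this is an assumption about how the query arrives, ignores phrases, field prefixes and boolean operators, and is not a real QueryParser integration):

```java
public class PrefixHack {
    // Naively prefix every whitespace-separated term in a query with "A",
    // mirroring the "A" prefix added to all field values at index time so
    // that leading * and ? wildcards become legal. A proper solution would
    // hook into the QueryParser's term construction instead.
    public static String prefixTerms(String query) {
        StringBuilder sb = new StringBuilder();
        for (String term : query.trim().split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append('A').append(term);
        }
        return sb.toString();
    }
}
```

So a user query like `foo *bar` would be rewritten to `Afoo A*bar` before parsing, matching the prefixed index terms.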
Re: MS Word Search ??
to search MS office documents you must first be able to either a: access the documents through java with APIs like POI etc, or b: convert the documents to something that is accessible through java, like xml, etc... the best way is to convert, as the java APIs for MS Office documents are still under development mvh karl øie On Wednesday 29 May 2002 11:48, Rama Krishna wrote: Hi, I am trying to build a search engine which searches in MS Word, excel, ppt and adobe pdf. I am not sure whether i can use Lucene for this or not. pl. help me out in this regard. Regards, Ramakrishna _ Chat with friends online, try MSN Messenger: http://messenger.msn.com -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: Searching UNICODE
what language are you trying to use lucene with? mh karl øie On Tuesday 30 April 2002 18:57, Hyong Ko wrote: Hello, I think there's something wrong with the QueryParser.jj file. I downloaded lucene-1.2-rc4-src and compiled successfully with JAVA_UNICODE_ESCAPE=true and DEBUG_TOKEN_MANAGER = true. My output debug info for Indexing looked okay. It showed the correct byte arrays in UTF8. However, when I ran SearchFiles, the output debug showed the byte arrays in default byte! I tried calling QueryParser.parse after converting the search string to UTF-8, but still got non-UTF8 bytes. I think that's why my search's been failing. Any ideas?? Thank you very much. Hyong Ko [EMAIL PROTECTED] _ Send and receive Hotmail on your mobile device: http://mobile.msn.com -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: Lucene index integrity... or lack of :-(
there are some strange problems with FSDirectory. i have found that building chunks in a RAMDirectory and then merging these into a FSDirectory is more stable than indexing directly into the FSDirectory. i ran into your problem and the dreaded too many open files problem when indexing large documents with many fields; using a RAMDir as a middle man solved my problems... mvh karl øie On Friday 26 April 2002 13:54, petite_abeille wrote: Hello, I'm starting to wonder how bullet proof Lucene indexes are. Do they get corrupted easily? If so is there a way to rebuild them? I've started to get the following exception left and right... 04/25 18:34:39 (Warning) Indexer.indexObjectWithValues: java.io.IOException: _91.fnm already exists I built a little app (http://homepage.mac.com/zoe_info/) that uses Lucene quite extensively, and I would like to keep it that way. However, I'm starting to have second thoughts about Lucene's reliability... :-( I'm sure I'm doing something wrong somewhere, but I really cannot see what... Any help or insight greatly appreciated. Thanks. PA. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: Lucene index integrity... or lack of :-(
ah, now i see. what i have is a server with 512mb of ram, so i have used two different approaches and both work ok; 1 - i index a fixed number of documents into a RAMDir, like 10 (each of the docs are xml docs about 1,5-2mb), then i optimize the RAMDir, merge it into the FSDir and then optimize the FSDir... 2 - i use Runtime.freeMemory() and Runtime.totalMemory() to see if i have reached more than 80% of the available memory; if so i optimize the RAMDir, merge it and optimize the FSDir... if not i just add more documents to the RAMDir. as far as i have tested i have never experienced a failure while merging a RAMDir into a FSDir regardless of size, so it's my system's memory that is the problem mvh karl øie On Friday 26 April 2002 15:33, petite_abeille wrote: Thanks. What's your heuristic to flush the RAMDirectory? please explain this because i don't understand english that good :-( That's ok, I don't really understand English either :-) Simply put, when do you flush the RAMDirectory into the FSDirectory? Every five documents? Ten? A thousand? What is a good balance between RAM and FS? Thanks. PA. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: Lucene index integrity... or lack of :-(
forgot this: it's a bit hard to determine a good balance while indexing XML documents because the internal relations of a DOM can make an XML document become nearly 21 times as big in memory compared to disk (i am not lying, i have seen it myself)... also the RAMDir must be kept in memory while indexing and merging, so checking the system's free memory is easier than trying to calculate memory usage mvh karl øie -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
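The 80% memory heuristic described above can be sketched with the same Runtime calls the thread mentions (the class name and threshold parameter are illustrative; the merge/optimize calls themselves are left out):

```java
public class MemoryGuard {
    // Returns true when more than the given fraction of the current heap is
    // in use -- the heuristic described above for deciding when to flush the
    // RAMDirectory into the FSDirectory (threshold 0.8 for the 80% rule).
    public static boolean shouldFlush(double threshold) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return used > threshold * rt.totalMemory();
    }
}
```

The indexing loop would then call something like `if (MemoryGuard.shouldFlush(0.8)) { /* optimize RAMDir, merge into FSDir, optimize FSDir */ }` after each document, instead of counting documents.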
Re: Italian web sites
hm... this looks very interesting! if it is a perl exe you can just copy the text into a temp file, run the perl exe on that file and redirect the output to another tmp file, then read the file and use the result in a lucene keyword. mvh karl øie On Wednesday 24 April 2002 13:46, [EMAIL PROTECTED] wrote: Hi all, I have found a very interesting library which is written in perl. The problem is now how I can use this library. Anyway the library is Textcat and you can find it at: http://odur.let.rug.nl/~vannoord/TextCat/ Bye Laura combined with that you could use an italian stop-word list to run statistics on a page :-) ?!? On Wednesday 24 April 2002 11:02, [EMAIL PROTECTED] wrote: Hi all, I'm using Jobo for spidering web sites and lucene for indexing. The problem is that I'd like to spider only Italian web sites. How can I discover the country of a web site? Do you know some method that you can suggest me? Thanks Laura -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED] -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
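The temp-file-plus-external-tool shape suggested above can be sketched like this (a generic sketch, not the actual TextCat invocation: `cat` stands in for the real perl script, whose command line and output format would differ):

```java
import java.io.*;
import java.nio.file.*;

public class ExternalTool {
    // Write the text to a temp file, run an external command on it, and
    // capture its stdout -- the pattern described above for feeding page
    // text to a perl language guesser and using the result as a keyword.
    public static String runOn(String text, String... command)
            throws IOException, InterruptedException {
        Path tmp = Files.createTempFile("lucene-", ".txt");
        try {
            Files.write(tmp, text.getBytes("UTF-8"));
            // Append the temp file path as the tool's last argument.
            String[] cmd = new String[command.length + 1];
            System.arraycopy(command, 0, cmd, 0, command.length);
            cmd[command.length] = tmp.toString();
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            InputStream in = p.getInputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            p.waitFor();
            return out.toString("UTF-8");
        } finally {
            Files.delete(tmp);
        }
    }
}
```

The returned string (the guessed language, say) could then be stored as a Keyword field on the document.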
Re: delete document
it's actually the IndexReader, not the IndexWriter... happy hacking! On Wednesday 24 April 2002 15:27, Tim Tschampel wrote: How do you delete a document from the index? I see in the FAQ to use IndexWriter.delete(Term), however I don't see this in the current API JavaDocs, and don't have this method present in the lucene-1.2-rc4.jar that I downloaded from this site. Tim Tschampel -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: Some questions
Well, I saw that lucene creates the index on the filesystem: I think that this is a problem for a production environment. I usually use a database, for example Oracle. Is it possible to integrate Lucene with Oracle or some other db (Mysql)? you can store the index in blob-fields, but that's about it so far I think that there isn't any Italian Analyzer, is there? How can I write one? the implementation for lucene is pretty straightforward, take a look at the contributed GermanAnalyzer. Inside the implementing class you implement stopwords, language dependent case switching etc... When it comes to the english and german analyzers they also perform stemming (making computers match computer and histories match history etc). This requires a program that can understand the plurals/singulars of Italian. A good start might be to look at http://snowball.sourceforge.net as they have an italian stemmer already. The last question is: I suppose that my search engine is able to spider web sites. Is it possible to spider urls? For example is it possible that with a page I spider this page, then I extract the links of the page and at last spider also these links? How can I do this? As lucene works with only the text content of a doc you will have to create a spider that retrieves a url, extracts the text and feeds it to lucene, then extracts the links and processes each of these links in the same manner. for this you will need a html parser.. happy hacking! mvh karl øie -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
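The stop-word part of an analyzer, mentioned above, is easy to illustrate outside Lucene (a toy sketch: the tokenization and the handful of Italian stop words here are illustrative, not a complete list, and a real Analyzer would do this inside a TokenStream):

```java
import java.util.*;

public class StopWordFilterSketch {
    // Lower-case each token and drop it if it is a stop word -- the core of
    // what the stop-word stage of an analyzer does. Only a few Italian
    // examples are listed here for illustration.
    private static final Set<String> STOP =
            new HashSet<>(Arrays.asList("il", "la", "di", "e", "che"));

    public static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            String lower = t.toLowerCase();
            if (!STOP.contains(lower)) kept.add(lower);
        }
        return kept;
    }
}
```

A stemming stage (as in the German analyzer) would then run on the kept tokens.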
Re: Read only filesystem
thank you! i actually ran into this today when i built an index with crond as root and found that even though my own user could read the index, lucene couldn't. :-D mvh karl øie On Friday 05 April 2002 15:15, you wrote: Hi, after some trials with Lucene, I discovered it doesn't work with an index on CD-ROM. So, I wrote a replacement for the FSDirectory class that works on a Read Only filesystem. It works for me. If you think it can be useful, you can download it from http://www.csita.unige.it/software/free/lucene/ Bye. -- Marco Ferrante ([EMAIL PROTECTED]) CSITA (Centro Servizi Informatici e Telematici d'Ateneo) Università degli Studi di Genova - Italy Via Brigata Salerno, ponte - 16147 Genova tel (+39) 0103532621 (interno tel. 2621) -- -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: storing index in third party database.
without having investigated the problem much i would think that a SQL database would be a very bad match for lucene, as most of lucene's work is creating keys for words and documents and then creating indexes of these keys. for these purposes a SQL database is an unnecessary overhead, not even talking about the overhead represented by the SQL language parser. for these kinds of indexes a lower-level database would be better suited. I have good experiences with BerkeleyDB (http://www.sleepycat.com) and a friend of mine uses gdbm successfully for such key-pair indexing tasks. the advantage of these low-level database systems is that they are really more or less persistent b-tree/hashtable implementations, and thus created for key-pairing. they have no SQL layer, and you will have to program against them as they are more subroutines than applications. but for key-pair indexes i have experienced that BerkeleyDB runs circles around any SQL database (including db2 and oracle!!!). Berkeley has a java-api and a b-tree record type that could be a very good match for a key-based search tree, and it's free. take a look at it! mvh karl øie (ps: i am not paid by the sleepy cat to write this :-) On Wednesday 03 April 2002 16:12, you wrote: If you want to store indices in a database search the mailing list archives for SqlDirectory. Once I considered using it for one application at work, so I asked its author about performance. The answer was that it doesn't perform all that well when the index grows, if I recall correctly. Consequently, we chose to use file-based indices instead. Otis --- [EMAIL PROTECTED] wrote: Hi all I want to index the data which I have already stored in a third-party database table and develop a search facility using lucene. I am thinking of storing these indexes back to the database in another table.
I know for this we have to create a 'directory' which does all the indexing operations, for example IndexWriter indwriter = new IndexWriter(dirStore, null, create); where dirStore is the directory and create is a boolean. but I don't know the format to be followed for the directory (dirStore). Please help me if anybody has done a similar thing. TIA Amith -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
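The b-tree key-pair access pattern praised above is easy to illustrate without BerkeleyDB itself (TreeMap is only an in-memory stand-in here; BerkeleyDB's Java API and on-disk persistence are quite different, but the access pattern is the same):

```java
import java.util.*;

public class KeyPairSketch {
    // Ordered range scan over a sorted key space -- what a b-tree store
    // like BerkeleyDB gives you directly, with no SQL layer in between.
    // Keys in [prefix, prefix + '\uffff') are exactly those starting with
    // the prefix, so this doubles as a prefix/wildcard term lookup.
    public static SortedMap<String, String> prefixScan(TreeMap<String, String> index,
                                                       String prefix) {
        return index.subMap(prefix, prefix + "\uffff");
    }
}
```

This exact-key-plus-range-scan shape is why key-pair stores suit inverted indexes: a term lookup is a get, and a prefix query is one contiguous scan.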
RE: optimizing index - too many open files
I have to index 1650mb of documents, and eventually i will get out of memory with a RAMDir and get too many open files with a FSdir, so to get around this i am indexing 100 documents at a time in a RAMDir, then merge this RAMDir into a FSDir before i index the next set of 100 files. This made me work around both out of memory and too many files exceptions... mvh karl øie -Original Message- From: Paul Friedman [mailto:[EMAIL PROTECTED]] Sent: 28. februar 2002 21:38 To: Lucene Users List Subject: Re: optimizing index - too many open files Sorry to bother y'all again. Found an answer in the archives under the Thread Indexing problem. About to try using RAMDirectory first. pax et bonum. p. - Original Message - From: Paul Friedman [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, February 28, 2002 1:23 PM Subject: optimizing index - too many open files Hello all, I am running into an error: java.io.FileNotFoundException: /lucene/index/_2vx.tii ( too many open files ) after my class calls IndexWriter.optimize(). Does anybody know what causes this error? Any help is appreciated. ( By the way, the site that I am indexing is huge. I have a crawler run through the site calling many .jsps, .pdfs, and .html docs. It ran fine two days ago after indexing 3700+ pages. ) Could the index be too large for Lucene to handle? The error: java.io.FileNotFoundException: /lucene/index/-2vx.tii ( too many open files ) at java.io.RandomAccessFile.open( Native Method ) at java.io.RandomAccessFile.init at java.io.RandomAccessFile.init at org.apache.lucene.store.FSInputStream$Descriptor.init at org.apache.lucene.store.FSInputStream.init at org.apache.lucene.store.FSDirectory.openFile at org.apache.lucene.index.TermInfosReader.readIndex at org.apache.lucene.index.TermInfosReader.init at org.apache.lucene.index.SegmentReader.init at org.apache.lucene.index.IndexWriter.mergeSegments at org.apache.lucene.index.IndexWriter.optimize _ Do You Yahoo!? 
Get your free @yahoo.com address at http://mail.yahoo.com -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
RE: How to do web searching
if you want to create a web search you must create a servlet or a jsp page that can create an IndexSearcher class and read an index created by an IndexWriter class. To make a long story short: try to create a servlet that does the same as the demo searcher: http://cvs.apache.org/viewcvs/jakarta-lucene/src/demo/org/apache/lucene/demo/SearchFiles.java?rev=1.1&content-type=text/vnd.viewcvs-markup mvh karl øie -Original Message- From: Parag Dharmadhikari [mailto:[EMAIL PROTECTED]] Sent: 19. februar 2002 10:12 To: lucene-user Subject: How to do web searching Hi all, Pls can anybody tell me if I want to provide web searching as a feature then how exactly I should go about it? Can lucene help me in this matter? regards parag -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
RE: Filter and stop-words
to remove the plural form you have to create a stemmer for your language. i have been working on porting a stemmer for norwegian to lucene; to get a head start i have ported the norwegian snowball stemmer, and there is one for portuguese as well, check it out! http://snowball.sourceforge.net/portuguese/stemmer.html mvh karl øie -Original Message- From: Bizu de Anúncio [mailto:[EMAIL PROTECTED]] Sent: 3. desember 2001 13:22 To: [EMAIL PROTECTED] Subject: Filter and stop-words I'm new to Lucene. First of all I would like to know if there is a search archive like the sun servlets list. My first problem is that I want to index a Portuguese database and I need to remove the s (plural) and accents (à é ...) from the words. Is there a way of passing a filter class to the Lucene indexer? And about the stop-words, where should I configure Lucene to ignore them? Any help would be appreciated, thanks a lot, jk -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
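To show the shape of what a stemmer does, here is a deliberately tiny sketch (this is NOT a correct Portuguese stemmer; the single strip-trailing-s rule is a toy, and the Snowball stemmer linked above is the real answer):

```java
public class ToyStemmer {
    // One toy suffix rule: strip a trailing "s" from words of four letters
    // or more. Real stemmers like the Snowball Portuguese stemmer apply many
    // ordered suffix rules, each guarded by conditions on the remaining stem.
    public static String stem(String word) {
        if (word.length() >= 4 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }
}
```

Applied at both index and query time, even a rule this crude makes `casas` and `casa` land on the same term.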
scandinavian characters.
Hi, i got a problem with scandinavian characters (æåø): when i insert text with scand-chars it passes the analyzer correctly, but the QueryParser chokes when i try to search for the same characters. does anyone know anything about how i can fix this? karl øie/gan meida -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
RE: scandinavian characters.
no it's even stranger than that, i have decoded the querystring; the problem is that it seems like something is changed on the way in. if i search for fjøs (fj&oslash;s) i get the swedish fjä (fj&Auml;), where &oslash; is changed to &Auml; and 's' is removed. is the querystring translated somewhere? mvh karl øie -Original Message- From: David Bonilla [mailto:[EMAIL PROTECTED]] Sent: 27. november 2001 10:43 To: Lucene Users List; [EMAIL PROTECTED] Subject: Re: scandinavian characters. Hi Karl !!! I'm Spanish and I have a lot of problems programming with our non-english characters. I use LUCENE with spanish accents and it works fine... Have you tried to use the java.net.URLEncoder and java.net.URLDecoder with your fields to index? Best Regards from Spain ! __ David Bonilla Fuertes THE BIT BANG NETWORK http://www.bit-bang.com Profesor Waksman, 8, 6º B 28036 Madrid SPAIN Tel.: (+34) 914 577 747 Móvil: 656 62 83 92 Fax: (+34) 914 586 176 __
RE: scandinavian characters.
there must be something seriously broken with the queryparser code. if a query starts with ø/æ/å (&oslash;, &aelig;, &aring;) then an exception in the queryparser occurs: org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column 1. Encountered: \u00c3 (195), after : at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(Unknown Source) at org.apache.lucene.queryParser.QueryParser.jj_ntk(Unknown Source) at org.apache.lucene.queryParser.QueryParser.Modifiers(Unknown Source) at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source) at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source) at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source) but if the query merely contains ø/æ/å (&oslash;, &aelig;, &aring;) then it is wrongly translated into the swedish/german &auml; regardless of what character it was. if someone could point me to where to start I could try to find the problem, because I guess it is erroneous unicode translation... mvh karl -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
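The \u00c3 (Ã, 195) in the stack trace above is the classic signature of UTF-8 bytes being re-decoded as ISO-8859-1. A small sketch reproduces exactly that corruption (a demonstration of the failure mode, not the fix):

```java
import java.io.UnsupportedEncodingException;

public class MojibakeDemo {
    // "ø" encoded as UTF-8 is the two bytes 0xC3 0xB8; decoding those bytes
    // as ISO-8859-1 yields "Ã¸" -- the first char being \u00c3 (195), which
    // is precisely what the QueryParser stack trace above chokes on.
    public static String misdecode(String s) throws UnsupportedEncodingException {
        return new String(s.getBytes("UTF-8"), "ISO-8859-1");
    }
}
```

So a query for `fjøs` that goes through one wrong decode arrives at the parser as `fjÃ¸s`, which is why fixing the servlet's request-parameter encoding (rather than the parser) is the real cure.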
RE: scandinavian characters.
after i had replaced QueryParser.jj with the newest version from cvs the queryparser accepts my query, and i can now perform ø/æ/å searches from the commandline, so i guess there is something wrong with my search servlet's unicode handling :-) thank you very much! karl øie/gan media -Original Message- From: Jonas Bechlund [mailto:[EMAIL PROTECTED]] Sent: 27. november 2001 13:52 To: 'Lucene Users List' Subject: RE: scandinavian characters. Hi Karl, It is a little bit tricky - but when you get the idea it is not that bad... I had the same problem with the danish characters. I made changes to the TOKEN definition in the Token Definitions section of the file QueryParser.jj and that actually solved the problem. One minor detail is that you have to rebuild the jar file with ANT. (See build.txt for instructions) I guess that solves your problem, Regards, / Jonas -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]