Re: PDF Text extraction

2003-01-02 Thread Karl Øie
to get the string value of an inputstream you can copy it into a 
ByteArrayOutputStream and get the content from that;

ByteArrayOutputStream baos = new ByteArrayOutputStream();
int b;
while ((b = inputstream.read()) != -1) baos.write(b);
System.out.println(new String(baos.toByteArray()));
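For the record, the one-argument String constructor decodes with the platform default charset; a self-contained, charset-explicit sketch of the same idea (UTF-8 here is an assumption — use whatever encoding the stream actually holds):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamToString {
    // drain any java.io.InputStream into a String using an explicit charset
    static String toString(InputStream in, String charset) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) baos.write(buf, 0, n);
        return new String(baos.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("hello".getBytes("UTF-8"));
        System.out.println(toString(in, "UTF-8")); // prints: hello
    }
}
```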

mvh karl øie

On Friday, Dec 27, 2002, at 07:34 Europe/Oslo, Suhas Indra wrote:

Hello List

I am using PDFBox to index some of the PDF documents. The parser works 
fine
and I can read the summary. But the contents are displayed as
java.io.InputStream.

When I try the following:
System.out.println(doc.getField("contents")) (where doc is the Document
object)

The result will be:

Textcontents:java.io.InputStreamReader@127dc0

I want to print the extracted data.

Can anyone please let me know how to extract the contents?

Regards

Suhas



--
Robosoft Technologies - Partners in Product Development









--
To unsubscribe, e-mail:   
mailto:[EMAIL PROTECTED]
For additional commands, e-mail: 
mailto:[EMAIL PROTECTED]







Re: problems with search on Russian content

2002-11-22 Thread Karl Øie
Hi, I took a look at Andrey Grishin's russian character problem and found 
something strange happening while we tried to debug it. It seems that 
he has avoided the usual "querying with a different encoding than 
indexed" problem, as he can dump out correctly encoded russian at all 
points in his application.

Are the strings for terms treated differently than the text stored in 
text fields? The reason I ask is that his russian words are correct in 
the stored text fields, but show up faulty in a terms() dump. If he 
had a character encoding problem in his application, the fields should 
show up faulty as well, I think. Even stranger is that I use Lucene 1.2 
successfully with utf-8, iso-8859-1, iso-8859-5 and iso-8859-7. Why does 
this problem show up with russian (Cp1251) and not the other encodings?

Strangeness number two is the theory that if the russian word ,!,_,U was 
skewed to, say, 0d66539qw upon indexing, and the problem was just a 
consistent encoding problem, wouldn't a query for ,!,_,U be skewed to 
0d66539qw as well and be found anyway?

mvh karl øie


Begin forwarded message:

From: Andrey Grishin [EMAIL PROTECTED]
Date: Thu Nov 21, 2002  15:13:33 Europe/Oslo
To: Karl Oie [EMAIL PROTECTED]
Subject: Re: How to include strange characters??

yes, you are right - there are no russian words in returned terms :(((
I've just executed the following
--
IndexReader r =
IndexReader.open("C:\\j\\jakarta-tomcat-4.1.12\\index\\ukrenergo");
TermEnum e = r.terms();
while (e.next()) {
  Term term = e.term();
  System.out.println("term : " + term.text());
}
--
and got no russian words in the result;
there are some strange terms returned instead of russian:
term : 0d4xvp70w
term : 0d66539qw
term : 0d67les2o
term : 0d6eqgic0
etc.

So, I think we got a problem. This is great :)), thank you...
but how to fix it?




- Original Message -
From: Karl Øie [EMAIL PROTECTED]
To: Andrey Grishin [EMAIL PROTECTED]
Sent: Thursday, November 21, 2002 3:56 PM
Subject: Re: How to include strange characters??


another thing to check is whether the IndexReader.terms() actually
contains your term.

mvh karl oie

On Thursday, Nov 21, 2002, at 14:31 Europe/Oslo, Andrey Grishin wrote:


Karl,
I have the same problem with lucene search within russian content.
I tried all your advice, but lucene still can't find anything :
I indexed the content using Cp1251 charset

text = new String(text.getBytes("Cp1251"));
doc.add(Field.Text(CONTENT_FIELD, text));

and I am searching using the same charset
String txt = ",!,_,U";
txt = new String(txt.getBytes("Cp1251"));
PrefixQuery query = new PrefixQuery(new
Term(PortalHTMLDocument.CONTENT_FIELD, txt));
hits = searcher.search(query);

and lucene can't find anything.
Also I checked for the DecodeInterceptor in my server.xml - there
isn't any.
I tried UTF-8/16 - and got the same result.
If I list all the index's content by iterating with an IndexReader, I can
see that my russian content is stored in the index...
Can you please help me? Do you have any more ideas about what else can
be done here to fix this problem?

I will appreciate any help.
Thanks, Andrey.

P.S.
I am using lucene 1.2, tomcat 4.1.12, jdk 1.4.1 on Win2000 AS









Re: problems with search on Russian content

2002-11-22 Thread Karl Øie
Sorry, my bad! Didn't read this informative post :-)

mvh karl øie


On Thursday, Nov 21, 2002, at 16:35 Europe/Oslo, Otis Gospodnetic wrote:


Look at the CHANGES.txt document in CVS - there is some new stuff in the
org.apache.lucene.analysis.ru package that you will want to use.
Get Lucene from the nightly build...

Otis

--- Andrey Grishin [EMAIL PROTECTED] wrote:

Hi All,
I have problems with searching on Russian content using lucene 1.2

I indexed the content using Cp1251 charset

text = new String(text.getBytes("Cp1251"));
doc.add(Field.Text(CONTENT_FIELD, text));


and I am searching using the same charset

String txt = "·Œƒ";
txt = new String(txt.getBytes("Cp1251"));
PrefixQuery query = new PrefixQuery(new
Term(PortalHTMLDocument.CONTENT_FIELD, txt));
hits = searcher.search(query);

or

Analyzer analyzer = new StandardAnalyzer();
String txt = "·Œƒ“≈ ";
txt = new String(txt.getBytes("Cp1251"));
Query query = QueryParser.parse(txt,
PortalHTMLDocument.CONTENT_FIELD, analyzer);

hits = searcher.search(query);


and lucene can't find anything.
Also I checked for the DecodeInterceptor in my server.xml - there
isn't any

I tried UTF-8/16 - and got the same result.

Also, if I list all the index's content by iterating with an IndexReader,
I can see that my russian content is stored in the index...
Can you please help me? Do you have any more ideas about what else
can be done here to fix this problem?

I will appreciate any help.
Thanks, Andrey.

P.S.
I am using lucene 1.2, tomcat 4.1.12, jdk 1.4.1 on Win2000 AS











Re: Help on creating and maintaining an index that changes

2002-11-21 Thread Karl Øie
I want to do something similar with Lucene, but I
don't know how to approach it.  I thought maybe
keeping the first hashmap as is, and building a
Directory in lucene that replaces the master Hashmap.
 When I get hits back from lucene I look them up in
the first hashmap, and return those.


If your index is big, it's probably best to do it this way. I have indexes 
that take up to 12 hours to build and about 1gb of harddrive space, but 
searching is still fast. If you put the client ids into keyword fields you 
can use Lucene to filter out hits from the clients you know are offline by 
using a boolean NOT, either manually or through the queryparser.

How do I put the needed information into Directory so
I can look them up in the first hashmap.  I would need
the unique id identifying the client, and a key that
identifies the document that the client has.


you add a keyword field to each document that contains the unique id 
identifying the client. This way you can search for documents from a 
client, and also filter out documents from that client.

Then how do I clean up the Directory when a client is
not available?  How do I remove a document from
Lucene's Directory?


the org.apache.lucene.index.IndexReader class contains a delete() 
function to delete documents from lucene. But as said before, if your 
index is big it's best not to delete documents just because a client 
goes offline; it's better to filter out the hits.

mvh karl øie





Re: Indexing of documents in memory

2002-11-17 Thread Karl Øie
The org.apache.lucene.store.InputStream is not a _stream_ per se, as it 
requires a seek() function, and it is therefore not compatible with the 
java.io.InputStream concept. However, you can quite easily create a 
java.io.InputStream by grabbing hold of the byte content of an 
org.apache.lucene.store.InputStream and stuffing it into a 
java.io.ByteArrayInputStream.

This doesn't make sense here anyhow, because the raw byte stream from a 
RAMDirectory will not mean anything to an HTML parser: the content of the 
RAMDir is a binary index. If you want to store the input HTML documents you 
should store them as a byte or char array in a file or database.
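The bridging step described above can be sketched as follows; the part that reads the bytes out of the Lucene store stream is only indicated in a comment, since its exact API (readBytes()/length()) should be checked against your Lucene version:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class StoreStreamBridge {
    // wrap raw byte content in a java.io.InputStream an HTML parser can accept
    static InputStream toJavaStream(byte[] contents) {
        return new ByteArrayInputStream(contents);
    }

    public static void main(String[] args) throws Exception {
        // in real code the bytes would first be read out of the
        // org.apache.lucene.store.InputStream; here we fake them
        // so the sketch stays runnable:
        byte[] fake = "<html><body>hi</body></html>".getBytes("ISO-8859-1");
        InputStream in = toJavaStream(fake);
        System.out.println((char) in.read()); // prints: <
    }
}
```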

mvh karl øie



On Monday, Nov 18, 2002, at 03:24 Europe/Oslo, Vinay Kakade wrote:

Hi
I am trying to use RAMDirectory to store the input
HTML documents which are used to create the index by the
IndexHTML demo program, but I am facing problems.
I tried to get individual InputStream objects for
individual files from RAMDirectory and pass them to the
HTMLParser class to parse each file, but the HTMLParser
class accepts a java.io.InputStream object while
RAMDirectory returns a lucene.store.InputStream object.
Is there any way to convert between these two objects?
Or do I have to modify the HTMLParser class and all other
classes it uses to achieve this?
Please let me know
regards
Vinay.


--- Otis Gospodnetic [EMAIL PROTECTED]
wrote:

Look at RAMDirectory.

Otis

--- Vinay Kakade [EMAIL PROTECTED] wrote:

Hi,

I want to use Lucene for indexing some documents which
are in memory. I do not want to store them in a
separate directory.
The IndexWriter class accepts a directory name, where
all documents to be indexed are stored. Is there any
way by which we can specify a memory buffer in which
documents are stored while creating the index?
Thanks
Vinay.




















Re: Indexing distant web sites

2002-11-04 Thread Karl Øie
oh, sorry.. i was perhaps not making myself clear here...

you will have to use the crawler to retrieve the content and store it  
locally for indexing: set up your crawler to fetch a site and store  
every html page's content to disk, then run Lucene on the locally  
stored html pages and afterwards delete them... you will also need a  
way to get the original url from the crawler and store that in Lucene  
as a keyword field.

a much more efficient way is to have the crawler fetch one page, store  
it in memory, run Lucene on it, then discard the buffer and move on to  
the next page.

if you want to take a look at a real lucene+ crawler implementation you  
can check out the Cocoon project at  
http://xml.apache.org/cocoon/index.html :

Lucene integration:

http://cvs.apache.org/viewcvs.cgi/xml-cocoon2/src/java/org/apache/cocoon/components/search/

Crawler implementation:

http://cvs.apache.org/viewcvs.cgi/xml-cocoon2/src/java/org/apache/cocoon/components/crawler/

This impl is indexing XML, but the principle is the same...


mvh karl øie



On Monday, Nov 4, 2002, at 14:29 Europe/Oslo, Friaa Nafaa wrote:


Thank you, I installed this crawler and I ran it, but I would like  
to index the web site and not just list the links visited by the  
crawler. Is there a way to search a web page with lucene which uses  
this crawler for visiting the pages? thanks

--- On Mon 11/04, Karl Øie [EMAIL PROTECTED] wrote:
From: Karl Øie [mailto: [EMAIL PROTECTED]]
To: [EMAIL PROTECTED]
Date: Mon, 4 Nov 2002 12:31:50 +0100
Subject: Re: Indexing distant web sites

As stated in the official FAQ, Lucene doesn't implement a web-crawler;  
you can however use a self-made crawler or customize a crawler framework  
like websphinx (http://www-2.cs.cmu.edu/~rcm/websphinx/) to retrieve  
html documents from a site and then feed them to Lucene.

mvh karl øie

On Monday, Nov 4, 2002, at 11:49 Europe/Oslo, Friaa Nafaa wrote:

Hello, is there any way to index web sites with lucene, assuming we  
know only the url of the site? In local use we pass lucene the full  
arborescence or directory of our site (containing all the documents)  
and begin the indexing operation, but when I would like to index a  
distant site on the web... what do I do? For example I installed  
Lucene on my computer and I would like to index the site  
http://www.excite.com ... Thanks







Re: Multithread searching problem on Linux

2002-10-14 Thread karl øie

if you still have problems, take a look at this note found in the 
newest tomcat release... it might help.

mvh karl øie


  ---
  Linux and Sun JDK 1.2.x - 1.3.x:
  ---
 
  Virtual machine crashes can be experienced when using certain 
combinations of
  kernel / glibc under Linux with Sun Hotspot 1.2 to 1.3. The crashes 
were
  reported to occur mostly on startup. Sun JDK 1.4 does not exhibit the 
problems,
  and neither does IBM JDK for Linux.
 
  The problems can be fixed by reducing the default stack size. At a bash
  shell, do "ulimit -s 2048"; use "limit stacksize 2048" for tcsh.
 
  GLIBC 2.2 / Linux 2.4 users should also define an environment variable:
  export LD_ASSUME_KERNEL=2.2.5





On onsdag, okt 2, 2002, at 15:34 Europe/Oslo, Stas Chetvertkov wrote:

 Yes, it works without errors with classic JVM, but if it was not so
 painfully slow :(

 Anyway, I'll check what is faster - classic JVM with multiple thread 
 search
 or Hotspot
 with 1 searching thread (as we have now).

 Thanks,
 Stas.

 Try to run your vm in classic mode ("java -classic") to disable the
 hotspot features...

 mvh karl øie









Re: How to include strange characters??

2002-10-14 Thread karl øie

Also note that both apache and tomcat have a default setting that 
force-re-encodes all pages. in tomcat it is the DecodeInterceptor in 
server.xml; in apache it is a line that says AddDefaultCharset on in 
httpd.conf. These are applied _after_ any servlet output, so they can 
lead to strange results; be sure to turn off both directives when you 
test different encoding problems.

Last but not least is the encoding the SQL database was created with. On 
DB2 i have to use the right database constructor to get norwegian 
character support (db2 "CREATE DATABASE mydb USING CODESET ISO-8859-1 
TERRITORY NO COLLATE USING SYSTEM;"). Without the correct encoding on 
the database constructor the database behaves strangely in sorting and 
insert/update scenarios.

To catch everything, make sure that all steps use the same encoding, 
just like you use the same analyzer (perhaps encoding should be a part 
of an analyzer?!?)

1: create the database with ISO-8859-1 encoding (my favorite)...

CREATE DATABASE mydb USING CODESET ISO-8859-1 TERRITORY NO COLLATE 
USING SYSTEM;

2: in the indexer, force-feed lucene with ISO-8859-1 strings:

String value = resultset.getString(fieldname);
document.add(Field.UnStored(fieldname, new String(value.getBytes("ISO-8859-1"))));
...

3: force-encode all queries to lucene in the same manner:

String querystring = httprequest.getParameter("query");
querystring = new String(querystring.getBytes("ISO-8859-1"));
...
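Worth noting: the one-argument new String(bytes) decodes with the platform default charset, so the re-encoding trick above only behaves when that default matches. A self-contained round trip with the charset explicit on both sides:

```java
import java.io.UnsupportedEncodingException;

public class EncodingRoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "sm\u00f8rbr\u00f8d"; // "smørbrød", norwegian sample text
        // encode and decode with the SAME, explicitly named charset:
        byte[] latin1 = original.getBytes("ISO-8859-1");
        String roundTrip = new String(latin1, "ISO-8859-1");
        System.out.println(roundTrip.equals(original)); // prints: true
    }
}
```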


mvh karl øie


On søndag, okt 13, 2002, at 14:15 Europe/Oslo, Chris Davis wrote:

 To Dominator,

 Where you able to solve the display problem as well?  I am having a 
 similiar problem with documents that contain the  (open double quote 
 #8220).  I am not concerned with searching on the character, but when 
 I attempt to dsiplay a stored field with this character, it does not 
 display correctly.  Even stranger, the closing quote #8221 does 
 display.

 To All,

 I have browsed through the majority of messages related to Unicode in 
 the archive, and my reading tells me that Lucene does not normally 
 change the data that is stored for a field.  Can someone give me 
 some pointers on how to troubleshoot this problem.

 Note:  I am indexing data that is being pulled from a SQL Server 2000 
 DB on Windows 2000.

 ---


 In an earlier message Dominator wrote:

 I print out a result string and it shows a very strange result; for 
 example a search for "civilingeniør" shows the string "civilingeniĂ¸r"... 
 I'm sure it's a unicode problem, but where can I change it??



 Dominator wrote:

 thx, with your help I could solve the problem

 karl øie [EMAIL PROTECTED] wrote in message
 news:[EMAIL PROTECTED]...
 i had such problems with norwegian characters and it resolved into
 making sure the querystring has the same encoding as the index has.

 since this is again a java.lang.String encoding question i had these
 problems with querystrings coming from java Servlets and CLI. For both
 the quickfix was to re-encode the query in UTF-8/16:

 String querystring = argv[0]; // or: String querystring =
 httprequest.getParameter("query");
 querystring = new String(querystring.getBytes("UTF-8"));
 ...

 this fixed my norwegian/samii problems...


 mvh karl øie

 On mandag, okt 7, 2002, at 13:04 Europe/Oslo, Dominator wrote:

 I use czech language with more bizzare characters and there is no
 problem at all. Are you sure, that your XML contains character set
 information?

 yes, I tried <?xml version="1.0" encoding="ISO-8859-2"?> and <?xml
 version="1.0" encoding="UTF-8"?> but I get the same strange 
 characters.












Re: How to include strange characters??

2002-10-07 Thread karl øie

i had such problems with norwegian characters, and it came down to 
making sure the querystring has the same encoding as the index.

since this is again a java.lang.String encoding question, i had these 
problems with querystrings coming from java Servlets and the CLI. For 
both, the quickfix was to re-encode the query in UTF-8/16:

String querystring = argv[0]; // or: String querystring = 
httprequest.getParameter("query");
querystring = new String(querystring.getBytes("UTF-8"));
...

this fixed my norwegian/sami problems...


mvh karl øie

On mandag, okt 7, 2002, at 13:04 Europe/Oslo, Dominator wrote:

 I use czech language with more bizzare characters and there is no
 problem at all. Are you sure, that your XML contains character set
 information?

 yes, I tried <?xml version="1.0" encoding="ISO-8859-2"?> and <?xml
 version="1.0" encoding="UTF-8"?> but I get the same strange characters.













Re: Multithread searching problem on Linux

2002-10-02 Thread karl øie

there have been numerous problems/bad features with the hot-spot mode 
on sun's linux vm; the reason is that hot-spotting optimizes your code 
by doing very weird stuff with it :-)

anyway i'm glad -classic works well for you. the bad performance is a 
known problem with the linux-kernel java threading system. When it 
comes to threads, windows jvms outperform linux jvms because of 
kernel-internal things i don't even try to understand...

i have used the 1.3.1 linux jvm from ibm with great stability; how does 
the ibm jvm perform against the sun jvm when it comes to thread 
performance?

there is also a 1.3 jvm from a group called blackdown that is free 
and optimized for linux. there was some talk in the news about it 
being very good at threading... you could try it..  ( 
http://www.blackdown.org/ )

mvh karl øie



On onsdag, okt 2, 2002, at 15:34 Europe/Oslo, Stas Chetvertkov wrote:

 Yes, it works without errors with classic JVM, but if it was not so
 painfully slow :(

 Anyway, I'll check what is faster - classic JVM with multiple thread 
 search
 or Hotspot
 with 1 searching thread (as we have now).

 Thanks,
 Stas.

 Try to run your vm in classic mode ("java -classic") to disable the
 hotspot features...

 mvh karl øie









Re: Multithread searching problem on Linux

2002-10-01 Thread karl øie

Try to run your vm in classic mode ("java -classic") to disable the 
hotspot features...

mvh karl øie


On tirsdag, okt 1, 2002, at 18:16 Europe/Oslo, Stas Chetvertkov wrote:

 Hi All,

 I am building a search engine based on Lucene. Recently I created a test
 simulating multiple users searching in the same index simultaneously and
 found out that quite often the JVM crashes with 'Hotspot Virtual Machine
 Error : 11'. I could not reproduce this bug on a Windows box, but observed
 it a lot on Red Hat Linux 7.3 with different versions of Sun's 1.3 JVM,
 including the most recent one (1.3.1_04 at the moment).

 I am attaching a simple test that generates the hotspot error in 90% of
 cases. In our code we have to create a new IndexSearcher for every search
 because indices are updated in real time.

 The only workaround I found for this problem so far is reducing the number
 of searching threads, which does not seem to be a good solution.

 Has anyone encountered problems like this one?

 Regards,
 Stas.
 SearchTest.java





RE: Problems understanding RangeQuery...

2002-08-12 Thread Karl Øie

thank you, that works! :-) and saves my day!

mvh karl øie



-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED]]
Sent: 10. august 2002 18:29
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Problems understanding RangeQuery...


Hi Karl,

I have discovered that with range queries you *must* ensure there is a space
on either side of the dash.

That is, [1971 - 1979] rather than [1971-1979].  If you don't, Lucene will
interpret it as [1979 - null].

To illustrate a bit more, here are some result totals that I get on my
index:
pub_mo:[07 - 08]   -- 8370 (note the spaces around the dash)
pub_mo:[07-08]     -- 2133 (note the absence of spaces)
pub_mo:[08 - null] -- 2133
pub_mo:(07 08)     -- 8370 (note the use of parentheses, not brackets)

Just put the spaces in and all should be OK.

Regards,

Terry
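Terry's [08 - null] totals presumably follow from index terms being compared as plain strings: an open upper bound keeps every term that sorts at or after the lower one, which is how years like 2010 and 2097 sneak into the result. A quick sketch of that ordering:

```java
public class RangeOrder {
    public static void main(String[] args) {
        // index terms are compared as plain strings, so an open-ended
        // range such as [1979 - null] keeps every term sorting at or
        // after "1979" -- including "2010" and "2097"
        String low = "1979";
        String[] terms = { "1975", "1980", "2010", "2097" };
        for (int i = 0; i < terms.length; i++) {
            boolean inRange = terms[i].compareTo(low) >= 0;
            System.out.println(terms[i] + " in range: " + inRange);
        }
    }
}
```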




- Original Message -
From: Karl Øie [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Saturday, August 10, 2002 11:47 AM
Subject: Problems understanding RangeQuery...


 Hi, i have a problem with understanding RangeQueries in Lucene-1.2:

 I have created an index with posts that has the field W_PUBLISHING_YEAR
 which contains the  year of publishing. After indexing i loop through
 the terms and finds that i have the following terms present in the index:


1923,1925,1926,1930,1933,1935,1936,1938,1942,1943,1945,1946,1947,1948,1949,
1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,
1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,
1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,
1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2010,2018,2097
in 232290 documents.

 Then i run the queries W_PUBLISHING_YEAR:[1971-1979] and
 W_PUBLISHING_YEAR:[2000-2002] on the index, and both queries give me some
 strange results:


 W_PUBLISHING_YEAR:[1971-1979]

 found={1975, 1974, 1973, 1972, 1999, 1998, 1997, 1996, 1995, 1994, 1993,
 2018, 1992, 1991, 1990, 2010, 1989, 1988, 1987, 1986, 1985, 1984, 1983,
 1982, 1981, 1980, 2004, 2003, 2002, 2001, 2097, 2000, 1979, 1978, 1977,
 1976} in 150793 matching documents.


 W_PUBLISHING_YEAR:[2000-2002]

 found={2002, 2001, 2097, 2010, 2018, 2004, 2003} in 10756 matching
 documents.



 Is there something i am doing wrong here? How is the RangeQuery supposed
 to work?


 --
 To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
 For additional commands, e-mail:
mailto:[EMAIL PROTECTED]








Problems understanding RangeQuery...

2002-08-10 Thread Karl Øie

Hi, i have a problem with understanding RangeQueries in Lucene-1.2:

I have created an index with posts that have the field W_PUBLISHING_YEAR,
which contains the year of publishing. After indexing i loop through
the terms and find that the following terms are present in the index:

1923,1925,1926,1930,1933,1935,1936,1938,1942,1943,1945,1946,1947,1948,1949,
1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,
1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,
1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,
1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2010,2018,2097
in 232290 documents.

Then i run the queries W_PUBLISHING_YEAR:[1971-1979] and
W_PUBLISHING_YEAR:[2000-2002] on the index, and both queries give me some
strange results:


W_PUBLISHING_YEAR:[1971-1979]

found={1975, 1974, 1973, 1972, 1999, 1998, 1997, 1996, 1995, 1994, 1993,
2018, 1992, 1991, 1990, 2010, 1989, 1988, 1987, 1986, 1985, 1984, 1983,
1982, 1981, 1980, 2004, 2003, 2002, 2001, 2097, 2000, 1979, 1978, 1977,
1976} in 150793 matching documents.


W_PUBLISHING_YEAR:[2000-2002]

found={2002, 2001, 2097, 2010, 2018, 2004, 2003} in 10756 matching
documents.



Is there something i am doing wrong here? How is the RangeQuery supposed to work?






Re: Crash / Recovery Scenario

2002-07-09 Thread Karl Øie

 only deletes the old one while it's working on the new one, so is there a
 way of checking for the .lock files in case
 of a crash a rolling back to the old index image?

 Nader Henein

i have some thoughts about crash/recovery/rollback that i haven't found any 
good solutions for.

If a crash happens during writing there is no good way to know if the 
index is intact; removing lock files doesn't help this fact, as we really 
don't know. So providing rollback functionality is a good but expensive way 
of compensating for the lack of recovery.

To provide rollback i have used a RAMDirectory and serialized it to a SQL 
table. By doing this i can catch any exceptions and ask the database to 
roll back if required. This works great for small indexes, but as the index 
grows you will have performance problems, because the whole RAMDir has to be 
serialized/deserialized into the BLOB all the time.
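The serialize-to-BLOB round trip can be sketched with plain JDK serialization; a String stands in for the directory object here, since whether RAMDirectory itself is directly Serializable in Lucene 1.2 should be checked:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class BlobRoundTrip {
    // serialize any Serializable object into the byte[] that would go
    // into the BLOB column
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(baos);
        oos.writeObject(obj);
        oos.close();
        return baos.toByteArray();
    }

    // the reverse step, restoring the object from the BLOB bytes
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes));
        return ois.readObject();
    }

    public static void main(String[] args) throws Exception {
        Object restored = fromBytes(toBytes("index-image"));
        System.out.println(restored); // prints: index-image
    }
}
```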

A better solution would be to hack the FSDirectory to store each file it 
would normally write to a file-directory as a serialized byte array in a 
blob of a sql table. This would increase performance, because the whole 
Directory doesn't have to change each time, and it doesn't have to read the 
whole directory into memory. I also suspect lucene sorts its records into 
these different files for increased performance (like: i KNOW that record 
will be in segment xxx if it is there at all).

I have looked at the source for the RAMDirectory and the FSDirectory and they 
could both be altered to store their internal buffers in a BLOB, but i 
haven't managed to do this successfully. The problem i have been pondering is 
the lucene.InputStream's seek() function. This really requires the underlying 
impl to be either a file or an array in memory. For a BLOB it would mean 
that the blob has to be fetched, then read/seek-ed/written, then stored back 
again. (is this correct?!?, and if so, is there a way to know WHEN it is 
required to fetch/store the array?)

I would really appreciate any tips on this, as i think 
crash/recovery/rollback functionality would benefit lucene greatly.

I have indexes that take 5 days to build, and it's really bad to receive 
exceptions during a long index run with no recovery/rollback functionality.

Mvh Karl Øie





Re: SearchBean Persistence

2002-07-03 Thread Karl Øie

if the array is of a serializable sort, just store it in a sql table !?!

mvh karl øie

On Wednesday 03 July 2002 16:22, Terry Steichen wrote:
 I'm using Peter's SearchBean code to sort search results.  It works fine,
 but it creates the sorting field array from scratch with every invocation
 (which takes on the order of a second or so to complete - each search
 itself takes about one tenth of that or less).  While I can conduct several
 searches in the same module, I can't figure out how to persist the sorting
 field array between invocations of the search module.

 Any advice on how to do this would be much appreciated.

 Regards,

 Terry
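One simple in-memory approach, if the application object itself is long-lived, is to build the array once and reuse it on later calls; a minimal lazy-cache sketch (the class and method names are hypothetical, not from the SearchBean source):

```java
public class SortFieldCache {
    private static String[] cached; // the expensive-to-build sorting field array

    // build once, then return the same array on every later invocation
    static synchronized String[] get() {
        if (cached == null) {
            cached = buildArray(); // the slow step runs only on first use
        }
        return cached;
    }

    private static String[] buildArray() {
        // stand-in for the real work of scanning the index
        return new String[] { "alpha", "beta" };
    }

    public static void main(String[] args) {
        boolean same = get() == get(); // second call reuses the cached array
        System.out.println(same); // prints: true
    }
}
```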






Re: SearchBean Persistence

2002-07-03 Thread Karl Øie

if it is a Stateful SessionBean you will have to create an EntityBean 
implementation with the same functionality, and then in the EJB's load() and 
store() you will have to serialize the array. Or if it is a CMP EJB, just 
declare the array as a persistent field.


mvh karl

On Wednesday 03 July 2002 16:39, Terry Steichen wrote:
 Karl,

 Just to clarify.  I have an application that runs searches as requested by
 users.  The application is persistent across multiple requests, so there's
 no problem creating it at startup.  And, given the application's
 persistence, there should be no problem storing it in memory to serve
 subsequent requests.  I just can't figure out how to modify the SearchBean
 code to do this.  It seemed like it would be simple, but try as I might,
 nothing has so far worked.

 Regards,

 Terry


 - Original Message -
 From: Karl Øie [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Wednesday, July 03, 2002 10:34 AM
 Subject: Re: SearchBean Persistence

  if the array is of a serializable sort, just store it in a sql table !?!
 
  mvh karl øie
 
  On Wednesday 03 July 2002 16:22, Terry Steichen wrote:
   I'm using Peter's SearchBean code to sort search results.  It works

 fine,

   but it creates the sorting field array from scratch with every

 invocation

   (which takes on the order of a second or so to complete - each search
   itself takes about one tenth of that or less).  While I can conduct

 several

   searches in the same module, I can't figure out how to persist the

 sorting

   field array between invocations of the search module.
  
   Any advice on how to do this would be much appreciated.
  
   Regards,
  
   Terry
 


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: SearchBean Persistence

2002-07-03 Thread Karl Øie

oh, i see. i was misled by the Bean part of the SearchBean... i'm sorry! :-)

Anyhow, if it is not a Stateful SessionBean you are not restricted by EJB 
rules and can thus serialize anything you want to disk or db...

mvh karl øie


On Wednesday 03 July 2002 17:20, Otis Gospodnetic wrote:
 I think you guys are not understanding each other.
 Terry is talking about the code in Lucene Sandbox, not about EJBs.

 I don't use that code (yet?), so I don't know the answer.

 Otis

 --- Karl Øie [EMAIL PROTECTED] wrote:
  if it is a Stateful SessionBean you will have to create an EntityBean
  implementation with the same functionality. And then in the EJB's
  load() and
  store() you will have to serialize the array. Or if it is a CMP EJB,
  just
  declare the array as a persistent field.
 
 
  mvh karl
 
  On Wednesday 03 July 2002 16:39, Terry Steichen wrote:
   Karl,
  
   Just to clarify.  I have an application that runs searches as
 
  requested by
 
   users.  The application is persistent across multiple requests, so
 
  there's
 
   no problem creating it at startup.  And, given the application's
   persistence, there should be no problem storing it in memory to
 
  serve
 
   subsequent requests.  I just can't figure out how to modify the
 
  SearchBean
 
   code to do this.  I seemed like it would be simple, but try as I
 
  might,
 
   nothing has so far worked.
  
   Regards,
  
   Terry
  
  
   - Original Message -
   From: Karl Øie [EMAIL PROTECTED]
   To: Lucene Users List [EMAIL PROTECTED]
   Sent: Wednesday, July 03, 2002 10:34 AM
   Subject: Re: SearchBean Persistence
  
if the array is of a serializable sort, just store it in a sql
 
  table !?!
 
mvh karl øie
   
On Wednesday 03 July 2002 16:22, Terry Steichen wrote:
 I'm using Peter's SearchBean code to sort search results.  It
 
  works
 
   fine,
  
 but it creates the sorting field array from scratch with every
  
   invocation
  
 (which takes on the order of a second or so to complete - each
 
  search
 
 itself takes about one tenth of that or less).  While I can
 
  conduct
 
   several
  
 searches in the same module, I can't figure out how to persist
 
  the
 
   sorting
  
 field array between invocations of the search module.

 Any advice on how to do this would be much appreciated.

 Regards,

 Terry
   

 __
 Do You Yahoo!?
 Sign up for SBC Yahoo! Dial - First Month Free
 http://sbc.yahoo.com


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Wildcard searching

2002-07-02 Thread Karl Øie

Hi, i have experimented with prefixing all Field values with the letter A to 
allow the wildcards * and ? to be positioned first in a query term.

What i would like to do next is to prefix all the terms produced by the 
QueryParser with the letter A so the hack is transparent to the user. Is 
there a simple way to do this, as the Query subclasses don't allow you to 
modify the term they hold? Secondly, i cannot find any way to get all sub 
queries of a query. Does anyone here know something really smart i can do 
short of learning to program JavaCC ?!?
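
A minimal sketch of the term-prefixing idea, applied to the raw query string before it reaches the QueryParser (a hypothetical helper that ignores operators, phrases and field prefixes, which a real solution would have to handle inside the parser itself):

```java
public class PrefixHack {
    // Prefix every whitespace-separated term with "A", mirroring what was done
    // to the field values at index time, so "*ing" becomes "A*ing" and no
    // longer starts with a wildcard from Lucene's point of view.
    static String prefixTerms(String query) {
        StringBuilder out = new StringBuilder();
        for (String term : query.trim().split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            out.append('A').append(term);
        }
        return out.toString();
    }
}
```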

And finally: is there a reason why lucene doesn't use java interfaces for, 
eh, interfaces like the Query class?

mvh karl øie


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: MS Word Search ??

2002-05-29 Thread Karl Øie

to search MS office documents you must first be able to

a: access the documents through java with apis like POI etc

b: convert the documents to something that is accessible through java, like 
xml etc...

the best way is to convert, as the java APIs for MSOffice documents are still 
under development

mvh karl øie



On Wednesday 29 May 2002 11:48, Rama Krishna wrote:
 Hi,

 I am trying to build a search engine which searches MS Word, excel, ppt
 and adobe pdf. I am not sure whether i can use Lucene for this or not.  pl.
 help me out in this regard.


 Regards,
 Ramakrishna


 _
 Chat with friends online, try MSN Messenger: http://messenger.msn.com


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Searching UNICODE

2002-05-02 Thread Karl Øie

what language are you trying to use lucene with?

mvh karl øie

On Tuesday 30 April 2002 18:57, Hyong Ko wrote:
 Hello,

 I think there's something wrong with the QueryParser.jj file. I downloaded
 lucene-1.2-rc4-src and compiled successfully with JAVA_UNICODE_ESCAPE=true
 and DEBUG_TOKEN_MANAGER = true. My output debug info for Indexing looked
 okay. It showed the correct byte arrays in UTF8. However, when I ran
 SearchFiles, the output debug showed the byte arrays in default byte! I
 tried calling QueryParser.parse after converting the search string to
 UTF-8, but still got non-UTF8 bytes. I think that's why my search's been
 failing. Any ideas?? Thank you very much.

 Hyong Ko
 [EMAIL PROTECTED]


 _
 Send and receive Hotmail on your mobile device: http://mobile.msn.com


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Lucene index integrity... or lack of :-(

2002-04-26 Thread Karl Øie

there are some strange problems with FSDirectory. i have found that building 
chunks in a RAMDirectory and then merging these into a FSDirectory is more 
stable than indexing directly into the FSDirectory. i ran into your problem 
and the dreaded too many open files problem when indexing large documents 
with many fields

using a RAMDir as a middle man solved my problems...
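
a sketch of that middle-man pattern against the Lucene 1.2-era API (the index path, analyzer and batch handling are placeholders; check addIndexes against your Lucene version):

```java
// Sketch: index one batch of documents in memory, then merge it into the
// on-disk index. 'chunk' and 'indexPath' are placeholders for your own code.
void indexChunk(Document[] chunk, String indexPath) throws IOException {
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
    for (int i = 0; i < chunk.length; i++) {
        ramWriter.addDocument(chunk[i]);
    }
    ramWriter.optimize();
    ramWriter.close();

    // merge the finished RAM chunk into the existing FSDirectory index
    IndexWriter fsWriter = new IndexWriter(FSDirectory.getDirectory(indexPath, false),
                                           new StandardAnalyzer(), false);
    fsWriter.addIndexes(new Directory[] { ramDir });
    fsWriter.optimize();
    fsWriter.close();
}
```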

mvh karl øie

On Friday 26 April 2002 13:54, petite_abeille wrote:
 Hello,

 I'm starting to wonder how bulletproof Lucene indexes are. Do they
 get corrupted easily? If so, is there a way to rebuild them?

 I'm started to get the following exception left and right...

 04/25 18:34:39 (Warning) Indexer.indexObjectWithValues:
 java.io.IOException: _91.fnm already exists

 I built a little app (http://homepage.mac.com/zoe_info/) that uses
 Lucene quite extensively, and I would like to keep it that way. However,
 I'm starting to have second thoughts about Lucene's reliability... :-(

 I'm sure I'm doing something wrong somewhere, but I really cannot see
 what...

 Any help or insight greatly appreciated.

 Thanks.

 PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Lucene index integrity... or lack of :-(

2002-04-26 Thread Karl Øie

ah, now i see. what i have is a server with 512mb of ram, so i have used two 
different approaches and both work ok;

1 - i index a fixed number of documents into a RAMDir, like 10 (each of the 
docs is an xml doc of about 1.5-2mb), then i optimize the RAMDir, merge it 
into the FSDir and then optimize the FSDir...

2 - i use Runtime.freeMemory() and Runtime.totalMemory() to see if i have 
reached more than 80% of the available memory; if so i optimize the RAMDir, 
merge it and optimize the FSDir..., if not i just add more documents to the 
RAMDir

as far as i have tested i have never experienced a failure while merging a 
RAMDir into a FSDir regardless of size, so it's my system's memory that is the 
problem
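
the 80% check from approach 2 boils down to something like this (the threshold and naming are illustrative):

```java
public class MemoryGate {
    // true when used heap exceeds the given fraction of the current heap size;
    // at that point, optimize the RAMDir and merge it into the FSDir
    static boolean shouldFlush(double threshold) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return used > (long) (threshold * rt.totalMemory());
    }
}
```

in the indexing loop you would call shouldFlush(0.80) after each document and merge when it returns true.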

mvh karl øie


On Friday 26 April 2002 15:33, petite_abeille wrote:
  Thanks. What's is your heuristic to flush the RAMDirectory?
 
  please explain this because i don't understand english that good :-(

 That's ok, I don't really understand English either :-)

 Simply put, when do you flush the RAMDirectory into the FSDirectory?
 Every five documents? Ten? A thousand? What is a good balance between
 RAM and FS?

 Thanks.

 PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Lucene index integrity... or lack of :-(

2002-04-26 Thread Karl Øie

forgot this:

it's a bit hard to determine a good balance point while indexing XML 
documents, because the internal relations of a DOM can make an XML document 
become nearly 21 times as big in memory compared to disk (i am not lying, i 
have seen it myself)...

also the RAMDir must be kept in memory while indexing and merging, so checking 
the system's free memory is easier than trying to calculate memory usage

mvh karl øie



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Italian web sites

2002-04-24 Thread Karl Øie

hm... this looks very interesting! if it is a perl exe you can just copy the 
text into a temp file, run the perl exe on that file, redirect the 
output to another tmp file, then read that file and use the result in a lucene 
keyword.
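
one way to sketch that temp-file round trip in modern java (the tool path and its command line are assumptions; TextCat's actual invocation may differ):

```java
import java.io.*;
import java.nio.file.*;

public class ExternalClassifier {
    // write the text to a temp file, run the external tool on it,
    // and return whatever the tool printed (e.g. a language guess)
    static String run(String toolPath, String text) {
        try {
            Path tmp = Files.createTempFile("textcat", ".txt");
            Files.write(tmp, text.getBytes("UTF-8"));
            Process p = new ProcessBuilder(toolPath, tmp.toString())
                    .redirectErrorStream(true)
                    .start();
            String out = new String(p.getInputStream().readAllBytes(), "UTF-8");
            p.waitFor();
            Files.deleteIfExists(tmp);
            return out.trim();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

the trimmed output could then be stored as a Keyword field on the Lucene Document.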

mvh karl øie

On Wednesday 24 April 2002 13:46, [EMAIL PROTECTED] wrote:
 Hi all,
 
 I have found a very interesting library which is written in perl.
 The problem is now how I can use this library.
 
 Anyway the library is Textcat and you can find it:
 
 http://odur.let.rug.nl/~vannoord/TextCat/
 
 Bye
 
 Laura
 

  combined with that you could use an italian stop-

 word list to run statistics 

  on a page :-) ?!?
  
  On Wednesday 24 April 2002 11:02, [EMAIL PROTECTED] wrote:
 
   Hi all,
   
   I'm using Jobo for spidering web sites and lucene for indexing. The 
   problem is that I'd like to spider only Italian web sites. 
   How can I discover the country of a web site?
   
   Dou you know some method that tou can suggest me?
   
   Thanks
   
   
   Laura
   
 
  
  


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: delete document

2002-04-24 Thread Karl Øie

it's actually the IndexReader, not the IndexWriter...
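
a sketch of deletion through IndexReader as it works in the 1.2-era API (the "uid" field used to identify the document is an assumption about your schema):

```java
// delete every document whose "uid" term matches, then release the index
IndexReader reader = IndexReader.open("/path/to/index");
reader.delete(new Term("uid", "doc42"));
reader.close();
```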

happy hacking!




On Wednesday 24 April 2002 15:27, Tim Tschampel wrote:
 How do you delete a document from the index?
 I see in the FAQ to use IndexWriter.delete(Term), however I don't see
 this in the current API JavaDocs, and don't have this method present in
 the lucene-1.2-rc4.jar that I downloaded from this site.


 Tim Tschampel


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Some questions

2002-04-19 Thread Karl Øie

 Well, I saw that lucene create the index on the filesystem: I think 
 that this is a problem for producion enviroment. I usually use 
 Database, for example Oracle. 
 Is it possible integrate Lucene with Oracle or some other db (Mysql)?

you can store the index in blob-fields, but that's about it so far

 I think that there isn't any Italian Anylizer, is it?
 How can I write one?

the implementation for lucene is pretty straightforward; take a look at the 
contributed GermanAnalyzer. Inside the implementing class you implement 
stopwords, language-dependent case switching etc...

When it comes to the english and german analyzers, they also perform stemming 
(making computers match computer and histories match history etc). 
This requires creating a program that understands the plurals/singulars 
of Italian. A good start might be to look at http://snowball.sourceforge.net 
as they have an italian stemmer already.
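
following the GermanAnalyzer pattern, a skeleton Italian analyzer could look like this (the stop-word list is a tiny illustrative sample, and the stem filter is hypothetical, something you would still have to write or port from Snowball):

```java
import java.io.Reader;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;

public class ItalianAnalyzer extends Analyzer {
    // a few sample stop words only; extend with a proper Italian list
    private static final String[] STOP_WORDS = { "e", "il", "la", "di", "che", "un" };

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, STOP_WORDS);
        // result = new ItalianStemFilter(result);  // hypothetical, e.g. ported from Snowball
        return result;
    }
}
```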

 The last question is: I suppose that my search engine is able to spider 
 web sites. Is it possible spidering urls?
 For example is it possible that with a page I spider this page, then I 
 extract the links of the page and at least I'd like spidering also 
 these links?
 How can I do this?

As lucene only works with the text content of a doc, you will have to create a 
spider that retrieves a url, extracts the text and feeds it to lucene, then 
extracts the links and processes each of those links in the same manner. for 
this you will need an html parser..
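
the fetch/extract/recurse loop can be sketched as a breadth-first crawl (Fetcher is a hypothetical interface standing in for the html parser and the lucene feeding step):

```java
import java.util.*;

public class SpiderSketch {
    interface Fetcher {
        String fetchText(String url);          // download + strip markup
        List<String> extractLinks(String url); // links found on the page
    }

    // visit pages breadth-first, never twice, up to a page limit
    static List<String> crawl(String start, int maxPages, Fetcher fetcher) {
        List<String> visited = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            fetcher.fetchText(url);   // here: build a Document and hand it to IndexWriter
            visited.add(url);
            for (String link : fetcher.extractLinks(url)) {
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }
}
```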


happy hacking!


mvh karl øie

--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Read only filesystem

2002-04-05 Thread Karl Øie

thank you! i actually ran into this today when i built an index with crond as 
root and found that even though my own user could read the index, lucene couldn't.

:-D

mvh karl øie

On Friday 05 April 2002 15:15, you wrote:
 Hi,
   after some trial with Lucene, I discovered it doesn't work with
 index on CD-ROM. So, I write a replacement for FSDirectory class that
 work on Read Only filesystem. It works for me.
   If you think that can be useful, you can download it from
 http://www.csita.unige.it/software/free/lucene/

 Bye.
 --
 Marco Ferrante ([EMAIL PROTECTED])
 CSITA (Centro Servizi Informatici e Telematici d'Ateneo)
 Università degli Studi di Genova - Italy
 Via Brigata Salerno, ponte - 16147 Genova
 tel (+39) 0103532621 (interno tel. 2621)
 --

--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: storing index in third party database.

2002-04-03 Thread Karl Øie

without having investigated the problem much, i would think that a SQL 
database would be a very bad match for lucene, as most of lucene's work is 
creating keys for words and documents and then creating indexes of these 
keys. for that purpose a SQL database is an unnecessary overhead, not even 
counting the overhead represented by the SQL language parser.

for these kinds of indexes a lower-level database would be better suited. I 
have good experience with BerkeleyDB (http://www.sleepycat.com), and a friend 
of mine uses gdbm successfully for such key-pair indexing tasks. the advantage 
of these low-level database systems is that they are really more or less 
persistent b-tree/hashtable implementations, and thus built for key-pairing.

they have no SQL layer, so you will have to program against them; they are 
more subroutines than applications. but for key-pair indexes i have 
experienced that BerkeleyDB runs circles around any SQL database (including 
db2 and oracle!!!).

Berkeley has a java-api and a b-tree record type that could be a very good 
match for a key-based searchtree, and it's free. take a look at it!

mvh karl øie

(ps: i am not paid by the sleepy cat to write this :-)



On Wednesday 03 April 2002 16:12, you wrote:
 If you want to store indices in a database search the mailing list
 archives for SqlDirectory.

 Once I considered using it for one application at work, so I asked its
 author about performance.  The answer was that it doesn't perform all
 that well when the index grows, if I recall correctly.  Consequently,
 we chose to use file-based indices instead.

 Otis

 --- [EMAIL PROTECTED] wrote:
  Hi all
 
  I want to index the datas which I already stored in a thirdparty
  database table and develop a search facility using lucene. I am
  thinking of storing this indexes back to the database in another
  table. I know for this we have to create a 'directory' which do all
  the indexing operations,
 
  for example
 
  Indexwriter indwriter = new Indexwriter(dirStore,null,create);
 
  where dirStore is the directory, create is boolean.
 
  but I don't know the format to be followed for the
  directory(dirStore).Please help  me if anybody has done similar
  thing.
  TIA
  Amith
 
 
  __
  Your favorite stores, helpful shopping tools and great gift ideas.
  Experience the convenience of buying online with Shop@Netscape!
  http://shopnow.netscape.com/
 
  Get your own FREE, personal Netscape Mail account today at
  http://webmail.netscape.com/
 
 

 __
 Do You Yahoo!?
 Yahoo! Tax Center - online filing with TurboTax
 http://taxes.yahoo.com/

--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: optimizing index - too many open files

2002-03-01 Thread Karl Øie

I have to index 1650mb of documents, and eventually i will run out of
memory with a RAMDir and get too many open files with a FSDir, so to get
around this i am indexing 100 documents at a time in a RAMDir, then merging
this RAMDir into a FSDir before i index the next set of 100 files. This let
me work around both out of memory and too many files exceptions...

mvh karl øie



-Original Message-
From: Paul Friedman [mailto:[EMAIL PROTECTED]]
Sent: 28. februar 2002 21:38
To: Lucene Users List
Subject: Re: optimizing index - too many open files


Sorry to bother y'all again.
Found an answer in the archives under the Thread Indexing problem.

About to try using RAMDirectory first.

pax et bonum. p.

- Original Message -
From: Paul Friedman [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, February 28, 2002 1:23 PM
Subject: optimizing index - too many open files


 Hello all,

 I am running into an error:
 java.io.FileNotFoundException: /lucene/index/_2vx.tii ( too many open
 files )
 after my class calls IndexWriter.optimize().

 Does anybody know what causes this error?
 Any help is appreciated.

 ( By the way, the site that I am indexing is huge.
   I have a crawler run through the site calling many .jsps, .pdfs, and
.html
 docs.
   It ran fine two days ago after indexing 3700+ pages. )

 Could the index be too large for Lucene to handle?

 The error:
 java.io.FileNotFoundException: /lucene/index/-2vx.tii ( too many open
 files )
 at java.io.RandomAccessFile.open( Native Method )
 at java.io.RandomAccessFile.init
 at java.io.RandomAccessFile.init
 at org.apache.lucene.store.FSInputStream$Descriptor.init
 at org.apache.lucene.store.FSInputStream.init
 at org.apache.lucene.store.FSDirectory.openFile
 at org.apache.lucene.index.TermInfosReader.readIndex
 at org.apache.lucene.index.TermInfosReader.init
 at org.apache.lucene.index.SegmentReader.init
 at org.apache.lucene.index.IndexWriter.mergeSegments
 at org.apache.lucene.index.IndexWriter.optimize


 _
 Do You Yahoo!?
 Get your free @yahoo.com address at http://mail.yahoo.com




_
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: How to do web searching

2002-02-19 Thread Karl Øie

if you want to create a web search you must create a servlet or a jsp page that
can create an IndexSearcher class and read an index created by an
IndexWriter class.

To make a long story short: try to create a servlet that does the same as
the demo searcher:

http://cvs.apache.org/viewcvs/jakarta-lucene/src/demo/org/apache/lucene/demo
/SearchFiles.java?rev=1.1&content-type=text/vnd.viewcvs-markup
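
the core of such a servlet, following the 1.2-era demo API (the field name, index path and servlet wiring are placeholders):

```java
// inside doGet(HttpServletRequest request, HttpServletResponse response):
String queryString = request.getParameter("query");

Searcher searcher = new IndexSearcher(IndexReader.open("/path/to/index"));
Query query = QueryParser.parse(queryString, "contents", new StandardAnalyzer());
Hits hits = searcher.search(query);

PrintWriter out = response.getWriter();
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    out.println(doc.get("path"));   // assumes a stored "path" field, as in the demo
}
searcher.close();
```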


mvh karl øie



-Original Message-
From: Parag Dharmadhikari [mailto:[EMAIL PROTECTED]]
Sent: 19. februar 2002 10:12
To: lucene-user
Subject: How to do web searching


Hi all,
Pls can anybody tell me: if I want to provide web searching as a feature, then
how exactly should I go about it? Can lucene help me in this matter?

regards
parag




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Filter and stop-words

2001-12-03 Thread Karl Øie

to remove plural forms you have to create a stemmer for your language. i have
been working on porting a stemmer for norwegian to lucene; to get a head
start i have ported the norwegian snowball stemmer, and there is one for
portuguese as well, check it out!

http://snowball.sourceforge.net/portuguese/stemmer.html

mvh karl øie


-Original Message-
From: Bizu de Anúncio [mailto:[EMAIL PROTECTED]]
Sent: 3. desember 2001 13:22
To: [EMAIL PROTECTED]
Subject: Filter and stop-words


I'm new to Lucene. First of all I would like to know if there is a searchable
archive like the sun servlets list.

My first problem is that I want to index a Portuguese database and I need
to remove the s (plural) and accents (à é ...) from the words. Is there a
way of passing a filter class to the Lucene indexer? And about the
stop-words, where should I configure Lucene to ignore them?

Any help would be appreciated,

thanks a lot,

jk




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




scandinavian characters.

2001-11-27 Thread Karl Øie

Hi, i got a problem with scandinavian characters (æåø): when i insert text
with scand-chars it passes the analyzer correctly, but the QueryParser
chokes when i try to search for the same characters.

anyone know anything about how i can fix this?

karl øie/gan meida


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: scandinavian characters.

2001-11-27 Thread Karl Øie

no it's even stranger than that, i have decoded the querystring; the problem
is that it seems like something is changed on the way in. if i search for
fjøs (fj&oslash;s) i get the swedish fjä (fj&Auml;), where &oslash; is
changed to &Auml; and 's' is removed.

is the querystring translated somewhere?

mvh karl øie
  -Original Message-
  From: David Bonilla [mailto:[EMAIL PROTECTED]]
  Sent: 27. november 2001 10:43
  To: Lucene Users List; [EMAIL PROTECTED]
  Subject: Re: scandinavian characters.


  Hi Karl !!!

  I'm spanish and I have a lot of problems programming with our non-english
characters. I use LUCENE with spanish accents and it works fine...

  Have you tried to use the java.net.URLEncoder and java.net.URLDecoder with
your fields to index ?

  Best Regards from Spain !
  __
  David Bonilla Fuertes
  THE BIT BANG NETWORK
  http://www.bit-bang.com
  Profesor Waksman, 8, 6º B
  28036 Madrid
  SPAIN
  Tel.: (+34) 914 577 747
  Móvil: 656 62 83 92
  Fax: (+34) 914 586 176
  __





RE: scandinavian characters.

2001-11-27 Thread Karl Øie

there must be something seriously broken with the queryparser code.

if a query starts with ø/æ/å (&oslash;, &aelig;, &aring;) then an exception
in the queryparser occurs.

org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column
1.  Encountered: \u00c3 (195), after : 
at
org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(Unknown
Source)
at org.apache.lucene.queryParser.QueryParser.jj_ntk(Unknown Source)
at org.apache.lucene.queryParser.QueryParser.Modifiers(Unknown Source)
at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source)
at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)

but if the query contains ø/æ/å (&oslash;, &aelig;, &aring;) then it is
translated wrongly into the swedish/german &auml; regardless of what
character it was.

if someone could point me to where to start, I could try to find the problem,
because I guess it is erroneous unicode translation...


mvh karl



no it's even stranger than that, i have decoded the querystring; the problem
is that it seems like something is changed on the way in. if i search for
fjøs (fj&oslash;s) i get the swedish fjä (fj&Auml;), where &oslash; is
changed to &Auml; and 's' is removed.

is the querystring translated somewhere?

mvh karl øie
  -Original Message-
  From: David Bonilla [mailto:[EMAIL PROTECTED]]
  Sent: 27. november 2001 10:43
  To: Lucene Users List; [EMAIL PROTECTED]
  Subject: Re: scandinavian characters.


  Hi Karl !!!

  I´m spanish and I have a lot of problems programming with our not english
characters. I use LUCENE with spanish accents and it works fine...

  Have you tried to use the java.net.URLEncoder and java.net.URLDecoder
with
your fields to index ?

  Best Regards from Spain !
  __
  David Bonilla Fuertes
  THE BIT BANG NETWORK
  http://www.bit-bang.com
  Profesor Waksman, 8, 6º B
  28036 Madrid
  SPAIN
  Tel.: (+34) 914 577 747
  Móvil: 656 62 83 92
  Fax: (+34) 914 586 176
  __




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: scandinavian characters.

2001-11-27 Thread Karl Øie

after i had replaced QueryParser.jj with the newest version from cvs, the
queryparser accepts my query, and i can now perform ø/æ/å searches from the
commandline, so i guess there is something wrong with my search servlet's
unicode handling :-)


thank you very much!


karl øie/gan media



-Original Message-
From: Jonas Bechlund [mailto:[EMAIL PROTECTED]]
Sent: 27. november 2001 13:52
To: 'Lucene Users List'
Subject: RE: scandinavian characters.


Hi Karl,

It is a little bit tricky - but when you get the idea it is not that bad...

I had the same problem with the danish characters. I made changes to the TOKEN
definition in the Token Definitions section of the file QueryParser.jj,
and that actually solved the problem. One minor detail is that you have to
rebuild the jar file with ANT. (See build.txt for instructions.)

I guess that solves your problem,
Regards,
/ Jonas

-Original Message-
From: Karl Øie [mailto:[EMAIL PROTECTED]]
Sent: 27 November 2001 13:01
To: Lucene Users List
Subject: RE: scandinavian characters.


there must be something seriously broken with the queryparser code.

if a query starts with ø/æ/å (&oslash;, &aelig;, &aring;) then an exception
in the queryparser occurs.

org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column
1.  Encountered: \u00c3 (195), after : 
at
org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(Unknown
Source)
at org.apache.lucene.queryParser.QueryParser.jj_ntk(Unknown Source)
at org.apache.lucene.queryParser.QueryParser.Modifiers(Unknown
Source)
at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source)
at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)

but if the query contains ø/æ/å (&oslash;, &aelig;, &aring;) then it is
translated wrongly into the swedish/german &auml; regardless of what
character it was.

if someone could point me to where to start, I could try to find the problem,
because I guess it is erroneous unicode translation...


mvh karl



no it's even stranger than that, i have decoded the querystring; the problem
is that it seems like something is changed on the way in. if i search for
fjøs (fj&oslash;s) i get the swedish fjä (fj&Auml;), where &oslash; is
changed to &Auml; and 's' is removed.

is the querystring translated somewhere?

mvh karl øie
  -Original Message-
  From: David Bonilla [mailto:[EMAIL PROTECTED]]
  Sent: 27. november 2001 10:43
  To: Lucene Users List; [EMAIL PROTECTED]
  Subject: Re: scandinavian characters.


  Hi Karl !!!

  I´m spanish and I have a lot of problems programming with our not english
characters. I use LUCENE with spanish accents and it works fine...

  Have you tried to use the java.net.URLEncoder and java.net.URLDecoder
with
your fields to index ?

  Best Regards from Spain !
  __
  David Bonilla Fuertes
  THE BIT BANG NETWORK
  http://www.bit-bang.com
  Profesor Waksman, 8, 6º B
  28036 Madrid
  SPAIN
  Tel.: (+34) 914 577 747
  Móvil: 656 62 83 92
  Fax: (+34) 914 586 176
  __






--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]