Re: Not entire document being indexed?

2005-02-25 Thread [EMAIL PROTECTED]
Thanks Andrzej and Pasha for your prompt replies and suggestions.
I will try everything you have suggested and report back on the findings!
regards
-pedja

Pasha Bizhan said the following on 2/25/2005 6:32 PM:
Hi, 

Luke can help you check whether the whole document was indexed or not, and answer the question: does my index contain the correct data?
Do the following:
- run Luke
- open the index
- find the specified document (Documents tab)
- click the reconstruct-and-edit button
- select the field and look at the original stored content of this field reconstructed from the index
Does this reconstructed content contain your last 2-3 paragraphs?
Also, 230 KB of text is far more than 20,000 terms. Try setting writer.maxFieldLength to 250,000.
Pasha Bizhan
http://lucenedotnet.com
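To put numbers on Pasha's point (an editorial sketch, not from the thread; the six-characters-per-term average is an assumption):

```java
// Rough estimate of how many terms a 230 KB plain-text file produces.
// Assumes ~6 characters per term (average English word plus a space).
public class TermEstimate {
    public static long estimateTerms(long bytes, int avgCharsPerTerm) {
        return bytes / avgCharsPerTerm;
    }

    public static void main(String[] args) {
        // 230 KB / 6 chars per term -> about 39,000 terms,
        // well past both the 10,000 default and the 20,000 pedja tried.
        System.out.println(estimateTerms(230L * 1024, 6));
    }
}
```

At that estimate the last few paragraphs fall beyond a 20,000-term cutoff, which matches the symptom: phrases near the end are missing while everything earlier is searchable.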


Re: Not entire document being indexed?

2005-02-24 Thread [EMAIL PROTECTED]
Hi Otis
Thanks for the reply, what exactly should I be looking for with Luke?
What would setting the max value to maxInteger do? Is this some 
arbitrary value or...?

-pedja
Otis Gospodnetic said the following on 2/24/2005 2:24 PM:
Use Luke to peek in your index and find out what really got indexed.
You could also try the extreme case and set that max value to the max
Integer.
Otis
--- [EMAIL PROTECTED] wrote:
Hi everyone
I'm having a bizarre problem with a few of the documents here that do not seem to get indexed entirely.
I use the textmining WordExtractor to convert M$ Word to plain text and then index that text.
For example, one document which is about 230KB in size when converted to plain text, when indexed and later searched for a phrase from the last 2-3 paragraphs, returns no hits, yet searching for anything above those paragraphs works just fine. WordExtractor does convert the entire document to text, I've checked that.

I've tried increasing the number of terms per field from the default 10,000 to 20,000 with writer.maxFieldLength, but that didn't make any difference; I still can't find phrases from the last 2-3 paragraphs.

Any ideas as to why this could be happening and how I could rectify it?
thanks,
-pedja

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   



 



Re: PHP-Lucene Integration

2005-02-07 Thread [EMAIL PROTECTED]
Howdy,
For starters, compile and install the java bridge (and if necessary 
recompile PHP and Apache2) and make sure it works (there's a test php 
file supplied).

Then, here's a simplified part of my code, just to give you an example 
how it works. This is the part that does the searching, indexing is done 
in a similar way.

PHP:
...some code here for HTML page setup etc...
$lucene_dir = $GLOBALS['lucene_dir'];
java_set_library_path('/path/to/your/custom/lucene-classes.jar');
$obj = new Java('searcher'); // searcher is the custom-written class that does the actual searching and data output
$writer = new Java('java.io.StringWriter');
$obj->setWriter($writer);
$obj->initSearch($lucene_dir);
$obj->getQuery($query); // $query is the user-supplied query from the HTML form, not visible here

// get the last exception
$e = java_last_exception_get();
if ($e) {
    // print error
    echo $e->toString();
} else {
    echo $writer->toString();
    $writer->flush();
    $writer->close();
}
// clear the exception
java_last_exception_clear();
-
JAVA (custom written class located in the 
/path/to/your/custom/lucene-classes.jar):

import ...whatever is needed here for the class...

public class searcher {
  IndexReader reader = null;
  IndexSearcher s    = null;   // the searcher used to open/search the index
  Query q            = null;   // the Query created by the QueryParser
  BooleanQuery query = new BooleanQuery();
  Hits hits          = null;   // the search results

  public Writer out;

  public void setWriter(Writer out) {
    this.out = out;
  }

  public void initSearch(String indexName) throws Exception {
    try {
      File indexFile      = new File(indexName);
      Directory activeDir = FSDirectory.getDirectory(indexFile, false);
      if (IndexReader.isLocked(activeDir)) {
        //out.write("Lucene index is locked, waiting 5 sec.");
        Thread.sleep(5000);
      }
      reader = IndexReader.open(indexName);
      s = new IndexSearcher(reader);
      //out.write("Index opened");
    } catch (Exception e) {
      throw new Exception(e.getMessage());
    }
  }

  public void getQuery(String queryString) throws Exception {
    int totalhits     = 0;
    Analyzer analyzer = new StandardAnalyzer();

    String[] queryFields = {"field1", "field2", "field3", "field4", "field5"};
    float[] boostFields  = {10, 6, 2, 1, 1};

    try {
      for (int i = 0; i < queryFields.length; i++) {
        q = QueryParser.parse(queryString, queryFields[i], analyzer);
        if (boostFields[i] > 1)
          q.setBoost(boostFields[i]);
        query.add(q, false, false);
      }
    } catch (ParseException e) {
      throw new Exception(e.getMessage());
    }

    try {
      hits = s.search(query);
    } catch (Exception e) {
      throw new Exception(e.getMessage());
    }

    totalhits = hits.length();

    if (totalhits == 0) { // if we find no hits, tell the user
      out.write("<br>I'm sorry, I couldn't find your query: " + queryString);
    } else {
      for (int i = 0; i < totalhits; i++) {
        Document doc  = hits.doc(i);
        String field1 = doc.get("field1");
        String field2 = doc.get("field2");
        String field3 = doc.get("field3");
        String field4 = doc.get("field4");
        String field5 = doc.get("field5");
        out.write("Field1: " + field1 + ", Field2: " + field2 +
                  ", Field3: " + field3 + ", Field4: " + field4 +
                  ", Field5: " + field5 + "<br>");
      }
    }
  }
}


Sanyi said the following on 2/7/2005 3:54 AM:
Hi!
Can you please explain how you implemented the Java and PHP parts to let them communicate through this bridge?
The bridge's project summary talks about a Java application server or a dedicated Java process, and I'm not into Java that much.
Currently I'm using a self-written command-line search program that outputs its results to the standard output.
I guess your solution must be better ;)
If the communication parts of your code aren't top secret, can you please share them with me/us?
Regards,
Sanyi

		


Re: PHP-Lucene Integration

2005-02-06 Thread [EMAIL PROTECTED]
Hi Owen
I am using Lucene with PHP. In previous replies it was suggested to run Tomcat on an alternate port, but for me that was not a solution: I did not want to run too many tasks or too many servers, for various reasons (maintenance, security, etc.), and I also needed to have control over PHP sessions and what not.

The original PHP extension for Java is broken and far from usable in production. Instead, I have been using PHP and Lucene with the PHP-Java-Bridge for the past 6 months or so.
It does the job very well and I can call classes and methods right out 
of PHP just like you would expect with a PHP extension.

The bridge is available here: 
http://sourceforge.net/projects/php-java-bridge

Hope this helps,
-pedja


Owen Densmore said the following on 2/6/2005 12:10 PM:
I'm building a lucene project for a client who uses php for their 
dynamic web pages.  It would be possible to add servlets to their 
environment easily enough (they use apache) but I'd like to have 
minimal impact on their IT group.

There appears to be a PHP Java extension that lets PHP call back and forth to Java classes, but I thought I'd ask here if anyone has had success using Lucene from PHP.

Note: I looked in the Lucene In Action search page, and yup, I bought 
the book and love it!  No examples there tho.  The list archives 
mention that using java lucene from php is the way to go, without 
saying how.  There's mention of a lucene server and a php interface to 
that.  And some similar comments.  But I'm a bit surprised there's not 
a bit more in terms of use of the official java extension to php.

Thanks for the great package!
Owen


Re: English and French documents together / analysis, indexing, searching

2005-01-23 Thread [EMAIL PROTECTED]
Morus Walter said the following on 1/21/2005 2:14 AM:
No. You could do a ((french-query) OR (english-query)) construct using one query. So query construction would be a bit more complex, but querying itself wouldn't change.
The first thing I'd do in your case would be to look at the differences
in the output of english and french snowball stemmer.
I don't speak any french, but probably you might even use both stemmers
on all texts.
Morus
I've done some thinking afterwards, and instead of messing with complex queries, would it make sense to replace all special characters such as é and è with e during indexing (I suppose by writing a custom analyzer), and then during searching parse the query and replace all occurrences of special characters (if any) with their normal Latin equivalents?

This should produce the required results, no? Since the index would not contain any French characters, searching for French words would still return them, because they were indexed as normal words.

-pedja
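For what it's worth, the folding step described above can be sketched in plain Java (an editorial sketch: java.text.Normalizer, available in later JDKs, stands in for the hand-written character map a custom analyzer would carry; in a real Lucene analyzer this mapping would sit inside a TokenFilter):

```java
import java.text.Normalizer;

// Fold accented characters to their plain Latin equivalents, so that
// "é" and "è" both index and search as "e". Applying the same fold at
// index time and query time keeps the two sides consistent.
public class AccentFolder {
    public static String fold(String s) {
        // NFD splits "é" into 'e' plus a combining accent;
        // the regex then strips all combining marks (\p{M}).
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("déjà élève")); // deja eleve
    }
}
```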



English and French documents together / analysis, indexing, searching

2005-01-20 Thread [EMAIL PROTECTED]
Greetings everyone
I wonder if there is a solution for analyzing both English and French documents using the same analyzer.
The reason is that we have predominantly English documents but also some French, yet it all has to go into the same index and be searchable from the same location during any particular search. Is there a way to analyze both types of documents with the same analyzer (and which one)?

I've looked around and I see there's a Snowball analyzer, but you have to specify the language of analysis, and I do not know that ahead of time during indexing, nor do I know it most of the time during searching (users would like to search both document types).

There's also the issue of letter accents in French words and searching for them (how are they indexed in the first place, even)?
Has anyone dealt with this before, and how did you solve the problem?

thanks
-pedja



Re: English and French documents together / analysis, indexing, searching

2005-01-20 Thread [EMAIL PROTECTED]
Right now I am using StandardAnalyzer, but the results are not what I'd hoped for. Also, since my understanding is that we should use the same analyzer for searching that was used for indexing, even if I could manage to guess the language during indexing and apply the Snowball analyzer, I wouldn't be able to use Snowball for searching, because users want to search through both English and French; and I suppose I would not get the same results if I searched with StandardAnalyzer?

Another problem with StandardAnalyzer is that it breaks up some words 
that should not be broken (in our case document identifiers such as 
ABC-1234 etc) but that's a secondary issue...

thanks
-pedja

Bernhard Messer said the following on 1/20/2005 1:05 PM:
I think the easiest way is to use Lucene's StandardAnalyzer. If you want to use the Snowball stemmers, you have to add a language guesser to get the language of the particular document before creating the analyzer.

regards
Bernhard


Re: English and French documents together / analysis, indexing, searching

2005-01-20 Thread [EMAIL PROTECTED]

you could try to create a more complex query and expand it into both languages using different analyzers. Would this solve your problem?

Would that mean I would have to actually conduct two searches (one in English and one in French), then merge the results and display them to the user? It sounds like a long way around, so would actually writing an analyzer that has a language guesser be the better solution in the long run?

This behaviour is implemented in the StandardTokenizer used by StandardAnalyzer. Look at the documentation of StandardTokenizer:

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

Hmm, I feel this is beyond my abilities at the moment, writing my own tokenizer without more in-depth knowledge of everything else.
Perhaps I'll try taking the StandardTokenizer and expanding or changing it based on other tokenizers available in Lucene, such as WhitespaceTokenizer.

thanks
-pedja


Re: Lucene Book in UK

2005-01-06 Thread [EMAIL PROTECTED]
Have you checked Manning's site (http://www.manning.com)? You can order the book directly from them (the publisher), and they will also provide you with a copy of the eBook in the meantime, until your paperback arrives in the mail.

-pedja
P.S. two cubes of sugar with that tea, please :)
David Townsend said the following on 1/6/2005 1:23 PM:
Sorry if this is the wrong forum but I wondered what's happened to 'Lucene In Action' in the UK.  Looking forward to reading it but amazon.co.uk report it as a 'hard to find' item and are now quoting a 4-6 week delivery time and  tacking on a rare book charge.  Amazon.com are quoting shipping in 24hrs.  Is this a new 'Boston Tea Party'? 

cheers
David



Re: Lucene working with a DB

2004-12-21 Thread [EMAIL PROTECTED]
Hello
I'll just paste the relevant MySQL code; you add the calls to it per your needs. It has no checking of anything, so better add that as well...
It's possible I didn't copy/paste everything, but you should get the idea of where this is going...

-pedja
--

import java.sql.*;
import lucene stuff...

public class sqlTest {
  public static void main(String[] args) throws Exception {
    String sTable   = args[0];
    String sThing   = args[1];
    String indexDir = "/path/to/lucene/index";
    try {
      Analyzer analyzer    = new StandardAnalyzer();
      IndexWriter fsWriter = new IndexWriter(indexDir, analyzer, false);
      addSQLDoc(fsWriter, sTable, sThing);
      fsWriter.close();
    } catch (Exception e) {
      throw new Exception("caught a " + e.getClass() + "\n with message: " + e.getMessage());
    }
  }

  private static void addSQLDoc(IndexWriter writer, String sqlTable, String somethingElse) throws Exception {

    String cs  = "jdbc:mysql://HOST/DATABASE?user=SQLUSER&password=SQLPASSWORD";
    String sql = "SELECT * FROM " + sqlTable + " WHERE something=\"" + somethingElse + "\"";

    // establish a connection to the MySQL database
    try {
      Class.forName("com.mysql.jdbc.Driver").newInstance();
    } catch (Exception e) {
      System.out.println("Lucene: ERROR: Unable to load driver");
      e.printStackTrace();
    }

    // get the record data...
    try {
      Connection conn = DriverManager.getConnection(cs);
      Statement stmt  = conn.createStatement();
      ResultSet rs    = stmt.executeQuery(sql);
      while (rs.next()) {
        // make a new, empty document
        Document doc = new Document();
        // get the database fields
        String field1 = rs.getString(1);
        String field2 = rs.getString(2);
        String field3 = rs.getString(3);
        String field4 = rs.getString(4);
        String field5 = rs.getString(5);
        // add the first group of fields
        doc.add(Field.Keyword("FIELD1", field1));
        doc.add(Field.Keyword("FIELD2", field2));
        doc.add(Field.Keyword("FIELD3", field3));
        doc.add(Field.Keyword("FIELD4", field4));
        doc.add(Field.Text("FIELD5", field5));
        // add the document
        writer.addDocument(doc);
      }
      rs.close();
      stmt.close();
      conn.close();
    } catch (SQLException e) {
      e.printStackTrace();
      throw new Exception();
    }
  }
}
--
Daniel Cortes said the following on 12/21/2004 10:39 AM:
I read in a lot of messages that Lucene can index a DB because it uses the InputStream type, but I don't understand how to do this. For example, if I have a forum with MySQL and a lot of files on my web site, for every search I have to select the index that I want to use in my search, true? But I don't know how to make Lucene write an index from the information in the forum's DB (for example MySQL).



Re: How to index Windows' Compiled HTML Help (CHM) Format

2004-12-11 Thread [EMAIL PROTECTED]
I suggest you look at the chmlib at: 
http://66.93.236.84/~jedwin/projects/chmlib/

-pedja
Tom said the following on 12/11/2004 11:20 AM:
Hi,
Does anybody know how to index chm-files? 
A possible solution I know is to convert chm-files to pdf-files (there are
converters available for this job) and then use the known tools (e.g.
PDFBox) to index the content of the pdf files (which contain the content of
the chm-files). Are there any tools which can directly grab the textual
content out of the (binary) chm-files?

I think chm-file indexing-support is really a big missing piece in the
currently supported indexable filetype-collection (XML, HTML, PDF,
MSWord-DOC, RTF, Plaintext). 

WBR,
Tom.  



Re: No of docs using IndexSearcher

2004-12-10 Thread [EMAIL PROTECTED]
numDocs()
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#numDocs()

Ravi said the following on 12/10/2004 2:42 PM:
How do I get the number of docs in an index If I just have access to a
searcher on that index?
Thanks in advance
Ravi.


Re: No of docs using IndexSearcher

2004-12-10 Thread [EMAIL PROTECTED]
If your index is open, shouldn't there be an instance of IndexReader already there?

Ravi said the following on 12/10/2004 3:13 PM:
I already have a field with a constant value in my index. How about
using IndexSearcher.docFreq(new Term(field,value))? Then I don't have to
instantiate IndexReader. 


Re: Empty/non-empty field indexing question

2004-12-08 Thread [EMAIL PROTECTED]
Hi Otis
What kind of implications does that have for search?
If I understand correctly, that record would not be found if the field is not there, correct?
But then is there any point in putting an empty value in it, if an application will never search for empty values?

thanks
-pedja
Otis Gospodnetic said the following on 12/8/2004 1:31 AM:
Empty fields won't add any value, you can skip them.  Documents in an
index don't have to be uniform.  Each Document could have a different
set of fields.  Of course, that has some obvious implications for
search, but is perfectly fine technically.
Otis


Empty/non-empty field indexing question

2004-12-07 Thread [EMAIL PROTECTED]
Here's probably a silly question, very newbish, but I had to ask.
Since I have MySQL documents that contain over 30 fields each, and most of them are added to the index, is it common practice to add fields to the index with empty values for that particular record, or should the field be omitted entirely?

What I mean is: if, let's say, a Title field is empty for a specific record (in MySQL), should I still add that field to the Lucene index with an empty value, or just skip it and only add the fields that contain non-empty values?
thanks
-pedja



Problem with indexing/merging indices - documents not indexed.

2004-12-06 Thread [EMAIL PROTECTED]
(file.separator) + "index";

try {

  Date start        = new Date();
  Analyzer analyzer = new StandardAnalyzer();
  String doctype    = args[0];
  String crofileno  = args[1];

/*
  IndexReader reader = IndexReader.open(indexDir);
  int deleted = reader.delete(new Term("crofileno", crofileno));
  System.out.println("Lucene deleted records: " + deleted + "<br>");
  reader.close();
*/

  // let's make two writers, RAM and FS, so that we index to RAM first and then merge at the end..
  //
  RAMDirectory ramDir   = new RAMDirectory();
  IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
  addDoc(ramWriter, doctype, crofileno);
  System.out.println("Docs in the RAM index: " + ramWriter.docCount());

  IndexWriter fsWriter = new IndexWriter(indexDir, analyzer, true);
  //fsWriter.setUseCompoundFile(false);
  //fsWriter.mergeFactor  = 1000;
  //fsWriter.maxMergeDocs = 10;
  fsWriter.addIndexes(new Directory[] { ramDir });
  //fsWriter.optimize();
  System.out.println("Docs in the FS index: " + fsWriter.docCount());
  ramWriter.close();
  fsWriter.close();

  Date end = new Date();
  System.out.println("Lucene Added OK: " + Long.toString(end.getTime() - start.getTime()) + " total milliseconds<br>");

} catch (IOException e) {
  throw new Exception("Something bad happened: " + e.getClass() + " with message: " + e.getMessage());
} catch (Exception e) {
  throw new Exception("caught a " + e.getClass() + "\n with message: " + e.getMessage());
}
  }
}


Re: Problem with indexing/merging indices - documents not indexed.

2004-12-06 Thread [EMAIL PROTECTED]
Hi Chris
actually for merging indices that's how Otis did it in the article I quoted:
   // if -r argument was specified, use RAMDirectory
   RAMDirectory ramDir    = new RAMDirectory();
   IndexWriter  ramWriter = new IndexWriter(ramDir, analyzer, true);
   addDocs(ramWriter, docsInIndex);
   IndexWriter fsWriter   = new IndexWriter(indexDir, analyzer, true);
   fsWriter.addIndexes(new Directory[] { ramDir });
   ramWriter.close();
   fsWriter.close();

..which works great, and all I've done is replace addDocs with my MySQL version of the function.

I do know about having to close and re-open to find the document count, which I've also tried, but it didn't yield any difference; a simple look in the index directory shows no files there except segments, even though it should've merged the RAM index into the FS index.

thanks
-pedja

Chris Hostetter said the following on 12/6/2004 6:09 PM:
: I would appreciate any feedback on my code and whether I'm doing
: something in a wrong way, because I'm at a total loss right now
: as to why documents are not being indexed at all.
I didn't try running your code (because I don't have a DB to test it with), but a quick read gives me a good guess as to your problem:
I believe you need to call...
ramWriter.close();
...before you call...
fsWriter.addIndexes(new Directory[] { ramDir });
(I've never played with merging indexes, so I could be completely wrong.)
Everything I've ever read/seen/tried has indicated that until you close your IndexWriter, nothing you do will be visible to anybody else who opens that Directory.
I'm also guessing that when you were trying to add the docs to fsWriter directly, you were using an IndexReader you had opened prior to calling fsWriter.close() to check the number of docs... that won't work for the same reason.

-Hoss


Is this a bug or a feature with addIndexes?

2004-12-06 Thread [EMAIL PROTECTED]
Greetings,
Ok, so maybe this is common knowledge to most of you, but I'm a layman when it comes to Lucene and I couldn't find any details about this after some searching.

When you merge two indexes via addIndexes, does it only work in batches (10 or more documents)?

Because I've been banging my head off the wall wondering why my code does not want to index 1 (one) document, and then I went to run Otis's MemoryVsDisk class from http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html?page=last, but I didn't use 10,000 documents as suggested; I used 5 and 15 instead. And what do you know: with fewer than 10 it doesn't merge at all, while with more than 10 it merges only the first 10 documents and gently forgets about the other 5.

My project requires me to index/update one single document as required and make it immediately available for searching.

How do I accomplish this if index merging will not merge fewer than 10 documents (and only in increments of 10), and single-document indexing doesn't seem to do it at all? (Please see my other post: http://marc.theaimsgroup.com/?l=lucene-user&m=110237364203877&w=2)

thanks
-pedja


Re: Is this a bug or a feature with addIndexes?

2004-12-06 Thread [EMAIL PROTECTED]
Hi Otis
I did try, here's what I get:
[EMAIL PROTECTED] tmp]# time java MemoryVsDisk 1 1 10 -r  
Docs in the RAM index: 1
Docs in the FS index: 0
Total time: 142 ms

real0m0.322s
user0m0.268s
sys 0m0.033s
I tried other combinations, but they don't seem to affect the outcome either :(

thanks
-pedja
Otis Gospodnetic said the following on 12/6/2004 8:11 PM:
Hello,
Try changing IndexWriter's mergeFactor variable.  It's 10 by default. 
Change it to 1, for instance.

Otis

problems search number range

2004-11-18 Thread [EMAIL PROTECTED]
(excuse me for my English)

hi people:
I am trying to do a search between two numbers.
At the very beginning it was all right; for example, when I had the number 20 and I searched between 10 and 30:

query = 'number:[10 TO 30]'

then Lucene found it. But if I change the range numbers to 5 and 130, I start to have problems: Lucene doesn't find the number 20 any more!

I solved this by changing the format of the numbers, like this:
number to look for: 020
range: 005, 130
query = 'number:[005 TO 130]'

Up to this point all correct.

But then another problem starts: I need to use negative numbers, and then everything becomes crazy for me...

I need to solve this search:
number: -10
range: -50 TO 5

I need help; I can't find anything using Google.

thanks
d2clon








Re: problems search number range

2004-11-18 Thread [EMAIL PROTECTED]
hi morus & company,

On Thursday 18 November 2004 12:49, Morus Walter wrote:
 [EMAIL PROTECTED] writes:
  i need to solve this search:
  number: -10
  range: -50 TO 5
 
  i need help..
  i dont find anything using google..

 If your numbers are in the interval MIN..MAX and MIN < 0, you can shift
 that to a positive interval 0 ... (MAX-MIN) by subtracting MIN from
 each number.

thx, this is just what i have done.. 




 Alternatively you have to find a string representation providing the
 correct order for signed integers.
 E.g.
 -0010
 -0001
 0
 1
 00020
 should work (in the range -..9), since '0' has a higher ASCII
 (Unicode) code than '-'.
 Of course the analyzer has to preserve the '-', and the '-' should not
 be eaten by the query parser in case you use it. I don't know if there
 are problems with that, but I suspect there are, at least for the
 query parser.


this solution was the first that I tried, but it does not work correctly, because when we try to sort these numbers in alphanumeric order, we find that -0010 sorts higher than -0001.

So the final solution is what you suggested at the beginning of your post.

thx a lot
d2clon
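For readers hitting the same issue, the two tricks discussed in this thread (shift by MIN, then zero-pad to a fixed width) can be combined in one small helper. This is only an illustrative sketch; the class and method names are mine, not part of Lucene:

```java
public class NumberPadder {
    private final int min;
    private final int width;

    // min and max delimit the full interval your field values can take.
    public NumberPadder(int min, int max) {
        this.min = min;
        this.width = String.valueOf(max - min).length();
    }

    // Shift n into [0, max-min] and zero-pad it to a fixed width so that
    // lexicographic (term) order equals numeric order.
    public String encode(int n) {
        String s = String.valueOf(n - min);
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    // Both range endpoints must be encoded the same way at query time.
    public String rangeQuery(String field, int lo, int hi) {
        return field + ":[" + encode(lo) + " TO " + encode(hi) + "]";
    }

    public static void main(String[] args) {
        NumberPadder p = new NumberPadder(-50, 130);
        System.out.println(p.encode(-10));                   // prints 040
        System.out.println(p.rangeQuery("number", -50, 5));  // prints number:[000 TO 055]
    }
}
```

With MIN = -50 and MAX = 130, the number -10 is indexed as 040 and the range -50 TO 5 becomes number:[000 TO 055]; lexicographic order then matches numeric order, negatives included.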


 Morus




hibernate and the algorithm for assigning scores to hits

2004-10-05 Thread [EMAIL PROTECTED]
(first, excuse my English)

hi people..

we are programming a small battery of tests to see how Lucene computes the scores
for searches over very simple and controllable documents.

but we don't understand why the results look wrong to us..

I am going to try to explain; the details of the tests are described at the bottom..

if you look at the tests you will see:

in test 1)
query: info:house
lucene returns a 50.00 score for this document
id  info
-   --
house2  house noise noise noise

and the same score for this document:
id  info
-   --
house3  house noise noise noise noise


in test 2)
query: info:house

lucene returns a higher score for this document:
id  info
-   --
house3  noise noiseso noise noiseso noise noiseso house house house house

than for this one:
id  info
-   --
house5  noise noiseso noise noiseso noise noiseso house house house house house house



I have looked at Lucene's algorithm for computing these scores; I don't understand
all of it, but the results of these searches still look wrong to me..

I need comments, or URLs to study, to improve my search method


thanks for all
d2clon





---
TESTS
---




we always execute this query, using the StandardAnalyzer:
query: info:house


 test 1
---

document list:

id  info
---
house0  house noise
house1  house noise noise
house2  house noise noise noise
house3  house noise noise noise noise
house4  house noise noise noise noise noise
nohouse noise noiseso noise noiseso noise noiseso 


scores:

id       score
-------  -----
house0   62.5
house1   50.0
house2   50.0
house3   43.75
house4   37.5







 test 2:


document list:

id  info
---
house0  noise noiseso noise noiseso noise noiseso house
house1  noise noiseso noise noiseso noise noiseso house house
house2  noise noiseso noise noiseso noise noiseso house house house
house3  noise noiseso noise noiseso noise noiseso house house house house
house4  noise noiseso noise noiseso noise noiseso house house house house house
house5  noise noiseso noise noiseso noise noiseso house house house house house house
house6  noise noiseso noise noiseso noise noiseso house house house house house house 
house
house7  noise noiseso noise noiseso noise noiseso house house house house house house 
house house
house8  noise noiseso noise noiseso noise noiseso house house house house house house 
house house house
house9  noise noiseso noise noiseso noise noiseso house house house house house house 
house house house house
nohouse noise noiseso noise noiseso noise noiseso noise noiseso noise noiseso noise 
noiseso nohouse

scores:

id       score
-------  -------
house9   79.0569
house8   75.0
house7   70.7107
house6   66.1438
house3   62.5
house5   61.2372
house4   55.9017
house2   54.1266
house1   44.1942
house0   37.5
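For what it's worth, these numbers can be sanity-checked by hand. Assuming Lucene's classic defaults (tf = sqrt(freq), lengthNorm = 1/sqrt(numTerms)), the sketch below (my own arithmetic, not Lucene code) computes the per-document factor for the test 2 documents. The raw values put house5 above house3, as one would expect; the inversion in the reported scores (62.5 vs 61.2372) is consistent with Lucene storing each length norm quantized into a single byte, which can round documents of slightly different lengths down to the same coarse norm. house9's reported 79.0569 matches raw sqrt(10)/sqrt(16) exactly because 1/sqrt(16) = 0.25 survives the byte encoding without loss.

```java
public class RawScore {
    // The per-document part of Lucene's classic score for a single-term
    // query: tf = sqrt(freq), lengthNorm = 1/sqrt(numTerms). idf and
    // queryNorm are the same for every document in one search, so they
    // cannot change the ordering.
    public static double raw(int freq, int numTerms) {
        return Math.sqrt(freq) / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        // test 2: every document has 6 "noise"-type tokens plus N "house" tokens
        for (int n = 1; n <= 10; n++) {
            System.out.println("house" + (n - 1) + "  " + raw(n, 6 + n));
        }
    }
}
```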






Re: LUCENE and algorithm for assigning score to hits

2004-10-05 Thread [EMAIL PROTECTED]

I'm sorry friends.. I put the title incorrectly two times





Implement custom score

2004-09-22 Thread [EMAIL PROTECTED]
Hi,
I know this is probably a common question and I've found a couple of posts
about it in the archive but none with a complete answer. If there is one
please point me to it! 

The question is that I want to discard the default scoring and implement my
own. I want all the hits to be sorted by popularity (a field) and
not by anything else. How can I do this? What classes and methods do I have
to change?

thanks,
William  





Re: Implement custom score

2004-09-22 Thread [EMAIL PROTECTED]
Yes thanks,
I implemented my own Similarity class that returns 1.0f from lengthNorm()
and idf(), and then I use setBoost when writing the document. However I get some
small rounding errors: when I boost with 0.7, that document gets the score
0.625. I've found that this has to do with the encode/decode norm in
Similarity. Should I do anything about it? Or doesn't it matter?

/William
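The flat-scoring Similarity described above looks roughly like this. This is an untested sketch against the Lucene 1.4-era API (signatures may differ in other versions), and the class name is mine:

```java
import org.apache.lucene.search.DefaultSimilarity;

public class FlatSimilarity extends DefaultSimilarity {
    // Every field gets the same norm, regardless of its length.
    public float lengthNorm(String fieldName, int numTokens) {
        return 1.0f;
    }
    // Every term counts the same, regardless of how rare it is.
    public float idf(int docFreq, int numDocs) {
        return 1.0f;
    }
}
```

It has to be installed in both places, e.g. writer.setSimilarity(new FlatSimilarity()) and searcher.setSimilarity(new FlatSimilarity()), because norms (including boosts) are baked in at index time. Note the boost itself still goes through the one-byte norm encoding, which is where a 0.7 boost coming back as 0.625 comes from; sorting by a stored field avoids that entirely.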

 You need your own Similarity implementation and you need to set it as
 shown in this javadoc:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarit
y.html
 
 Otis
 
 --- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 
  Hi,
  I know this is probably a common question and I've found a couple of
  posts
  about it in the archive but none with a complete answer. If there is
  one
  please point me to it! 
  
  The question is that I want to discard the default scoring and
  implement my
  own. I want all the the hits to be sorted after popularity (a field)
  and
  not by anything else. How can I do this? What classes and methods do
  I have
  to change?
  
  thanks,
  William  
  
  
 
 





Re: Implement custom score

2004-09-22 Thread [EMAIL PROTECTED]
Thanks for the reply,
I've looked into the search method that takes a Sort object as an argument.
As I understand it, the sorting is only done on the best matches (100 by
default)? I don't want the default score to have any impact at all; I want
to sort all hits on popularity, not just the best matches.

/William  
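The Sort-based approach suggested in the quoted reply below would look something like this (an untested sketch against the 1.4-era API; the field name is mine, and the popularity field must be indexed as a single un-tokenized term):

```java
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// Highest popularity first; the relevance score is ignored for ordering.
Sort byPopularity = new Sort(new SortField("popularity", SortField.INT, true));
Hits hits = searcher.search(query, byPopularity);
```

As far as I can tell, the sort is applied to every matching document, not only the top 100; the 100 is just the batch of results that Hits caches at a time.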
  
 Actually what William should use is the new Sort facility to order  
 results by a field.  Doing this with a Similarity would be much  
 trickier.  Look at the IndexSearcher.sort() methods which take a Sort  
 and follow the Javadocs from there.  Let us know if you have any  
 questions on sorting.
 
 It would be best if you represent your 'popularity' field as an integer  
 (or at least numeric) since sorting by String uses more memory.
 
   Erik
 
 
 On Sep 22, 2004, at 4:52 AM, Otis Gospodnetic wrote:
 
  You need your own Similarity implementation and you need to set it as
  shown in this javadoc:
  http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/ 
  Similarity.html
 
  Otis
 
  --- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 
  Hi,
  I know this is probably a common question and I've found a couple of
  posts
  about it in the archive but none with a complete answer. If there is
  one
  please point me to it!
 
  The question is that I want to discard the default scoring and
  implement my
  own. I want all the the hits to be sorted after popularity (a field)
  and
  not by anything else. How can I do this? What classes and methods do
  I have
  to change?
 
  thanks,
  William
 
 
 
 





Re: pdf in Chinese

2004-09-08 Thread [EMAIL PROTECTED]
It is not about the analyzer; I need to read the text from the PDF file first.

- Original Message - 
From: Chandan Tamrakar [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, September 08, 2004 4:15 PM
Subject: Re: pdf in Chinese


 Which analyzer are you using to index Chinese pdf documents?
 I think you should use the CJKAnalyzer
 - Original Message - 
 From: [EMAIL PROTECTED] [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Wednesday, September 08, 2004 11:27 AM
 Subject: pdf in Chinese
 
 
  Hi all,
  I use pdfbox to parse PDF files into Lucene documents. When I parse Chinese
  PDF files, pdfbox does not always succeed.
  Does anyone have some advice?
 
 
 
 




pdf in Chinese

2004-09-07 Thread [EMAIL PROTECTED]
Hi all,
I use pdfbox to parse PDF files into Lucene documents. When I parse Chinese
PDF files, pdfbox does not always succeed.
Does anyone have some advice?





problem in IndexSearcher

2004-09-06 Thread [EMAIL PROTECTED]
java.io.IOException: Lock obtain timed out

I was trying to create two instances of IndexSearcher with different index files.

Is there something I've missed?

tia, 
buics




Re: memory leek in lucene?

2004-09-03 Thread [EMAIL PROTECTED]
I also have problems regarding my application:
what would be the ideal memory allocation for Lucene,
considering my application will serve at least 20 transactions per second?

tia
--buics


On Fri, 3 Sep 2004 15:20:45 +0200, [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
 Terence,
 
 still had no time to prepare a test case, but... I worked around it:
 
 The idea is to replace the score with a timestamp when populating hits (in
 case you are not interested too much in the real score), where the field_sort
 is in MMddHHmm etc. format.
 Works fine; at least no OutOfMemory crash until now.
 
  public TopDocs search(Query query, Filter filter, final int nDocs,
                        final String field_sort) throws IOException {
    Scorer scorer = query.weight(this).scorer(reader);
    if (scorer == null)
      return new TopDocs(0, new ScoreDoc[0]);

    final BitSet bits = filter != null ? filter.bits(reader) : null;
    final HitQueue hq = new HitQueue(nDocs);
    final int[] totalHits = new int[1];
    scorer.score(new HitCollector() {
      public final void collect(int doc, float score) {
        // this bloody piece of code fakes the scorer into delivering results
        // sorted by date, because the valid way runs into the OutOfMemory
        // problem :( JGO 2004/08/31
        // note: modules touched -
        //   Searcher, Searchable, Hits, ParallelMultiSearcher, MultiSearcher,
        //   RemoteSearchable (these for the new field_sort argument),
        //   ScoreDoc, FieldSortedHitQueue, Hits, FieldScore (these for float -> double)
        double new_score = score;
        if (field_sort != null) {  // if null, just sort as usual by real score
          try {
            new_score = new Double(
                "0." + reader.document(doc).get(field_sort) + "d").doubleValue();
          } catch (IOException e) {
            e.printStackTrace();
          }
        }
        if (score > 0.0f &&                     // ignore zeroed buckets
            (bits == null || bits.get(doc))) {  // skip docs not in bits
          totalHits[0]++;
          hq.insert(new ScoreDoc(doc, new_score));
        }
      }
    });

    ScoreDoc[] scoreDocs = new ScoreDoc[hq.size()];
    for (int i = hq.size() - 1; i >= 0; i--)    // put docs in array
      scoreDocs[i] = (ScoreDoc) hq.pop();

    return new TopDocs(totalHits[0], scoreDocs);
  }
 
 Hi Iouli,
 
 Sorry, I am having a very tight schedule at work right now. I don't have
 time to come up with the test case. The problem is really related to the
 search(query, sort) method call. If you can come up with the test case, that
 would be great.
 
 Thanks,
 Terence
 
  back to biz.
 
  Terence,
 
  probably u prepared it already or should I do it?
 
  Otis,
 
  actually it's just a common way to execute a query.
 
  If  the code is like
 
  hits = ms.search(query);
 
  or
 
sort = new Sort(SortField.FIELD_DOC);
hits = ms.search(query,sort);
 
  or even
 
  filter = new DateFilter("published", stamp_from, stamp_to);
  sort = new Sort(SortField.FIELD_DOC);
  hits = ms.search(query, filter, sort);
 
  everything is ok, memory gets freed (you can see it with top -p pid)
 
  The problem starts only in case:
 
  sort = new Sort(new SortField("published_short", SortField.FLOAT, true));
  hits = ms.search(query, sort);
 
  The memory never comes back and grows with every iteration, even if you
  start the garbage collector explicitly and the code somehow runs into finalize()
 
  Regards
  J.
 
 
 
 
 
 
 
   Iouli &amp; Terence,
 
  Could you create a self-sufficient test case that demonstrates the
  memory leak?  If you can do that, please open a new bug entry in
  Bugzilla (the link to it is on Lucene's home page), and then attach
  your test case to it.
 
  Thanks!
  Otis
 
  --- [EMAIL PROTECTED] wrote:
 
   Yes Terence, it's exactly what I do
  
  
  
  
  
  
   Terence Lai [EMAIL PROTECTED]
   21.08.2004 01:50
   Please respond to Lucene Users List
  
  
   To: Lucene Users List [EMAIL PROTECTED]
   cc:
   Subject:RE: memory leek in lucene?
   Category:
  
  
  
   Are you calling ParallelMultiSearcher.search(Query query, Sort sort)
   to do
   your search? If so, I am currently having a similar problem.
  
   Terence
  
   
 Doing a query against Lucene I run into a memory problem, i.e. it looks
 like it's not giving memory back after the query has been executed.

 I use ParallelMultiSearcher and call the close method after results are
 displayed.

 hits = null;                // Hits class
 if (ms != null) ms.close(); // ParallelMultiSearcher

 Doesn't help. The memory is not getting freed. On queries like No* I get
 an incremental memory consumption of c. 20-70 MB per query.
 Imagine what happens with my web server...

 I tried also from the command line and got a similar result.

 Am I doing something wrong or missing something?

 Please help, I use 1.4.1 on a linux box

thanks for your mail

2004-02-17 Thread [EMAIL PROTECTED]
Received your mail we will get back to you shortly





Re: Italian web sites

2002-04-29 Thread [EMAIL PROTECTED]

The first one.

Bye Laura


 What does it mean? An Italian website can be:
   - a site that uses the Italian language
   - a site owned by an Italian organization
   - a site hosted at an Italian geographic location
 Every definition has a different solution.
 
 Date sent:Wed, 24 Apr 2002 11:02:32 +0200
 From: [EMAIL PROTECTED] [EMAIL PROTECTED]
 Subject:  Italian web sites
 To:   [EMAIL PROTECTED]
 Send reply to:Lucene Users List lucene-
[EMAIL PROTECTED]
 
  Hi all,
 
  I'm using JoBo for spidering web sites and Lucene for indexing. The
  problem is that I'd like to spider only Italian web sites.
  How can I discover the country of a web site?

  Do you know of any method you can suggest?
 
  Thanks
 
 
  Laura
 
 
 
 --
 Marco Ferrante ([EMAIL PROTECTED])
 CSITA (Centro Servizi Informatici e Telematici d'Ateneo)
 Università degli Studi di Genova - Italy
 Via Brigata Salerno, ponte - 16147 Genova
 tel (+39) 0103532621 (interno tel. 2621)
 --
 
 
 


Re: Italian web sites

2002-04-24 Thread [EMAIL PROTECTED]

Hi all,

I have found a very interesting library, which is written in Perl.
The problem now is how I can use this library.

Anyway, the library is TextCat and you can find it here:

http://odur.let.rug.nl/~vannoord/TextCat/

Bye

Laura

 combined with that you could use an Italian stop-word list to run statistics
 on a page :-) ?!?
 
 On Wednesday 24 April 2002 11:02, [EMAIL PROTECTED] wrote:
  Hi all,
  
   I'm using JoBo for spidering web sites and Lucene for indexing. The
   problem is that I'd like to spider only Italian web sites.
   How can I discover the country of a web site?
   
   Do you know of any method you can suggest?
  
  Thanks
  
  
  Laura
  
 
 
 
 


Re: Lucene in action at www.mil.fi

2002-04-23 Thread [EMAIL PROTECTED]

Hi Jari 

Where do you build your index? On the filesystem? Do you use a database?

Laura


 Hello,
 
 I'm glad to inform you that I've built a complete Lucene-based web
 search solution for the Finnish Defence Forces web site and that it's
 online as of this moment.
 
   You can see it in action at:
   http://www2.mil.fi:8080/haku/haku?q=hornet
 
 The user interface is in Finnish, but I hope you can get a general grasp
 of what's going on there anyway.
 
 As for the technical facts, I basically built a web crawler for indexing
 the www.mil.fi sites, and a servlet/xml/xsl-based frontend that
 delivers the results to your screen.
   The crawler is capable of indexing HTML (I used the Swing
 parser), PDF (I used xpdf, which is kinda bubble-gum-ish, but it works
 ;) and images (they're searched for by filename only).
   And for the front end, I have a servlet that does the searching,
 prints out XML (raw XML output:
 http://www2.mil.fi:8080/haku/raw?q=hornet) which is then transformed to
 HTML via XSL (I wrote a neat little servlet filter for this).
   The search servlet also has a simple query parser: the incoming
 query is parsed so that the default operand is AND instead of OR.
   So basically, if you type 'hornet picture', the actual search
 sent to Lucene will be '+hornet +picture' - I wanted it to be
 Google-like.
 
 Anyway, check it out and feel free to ask me if you'd like to know
 something more about the implementation.
 
 Also, feel free to mention the Finnish Defence Forces in the Powered by
 section of the Lucene web site.
 
 Thanks go to all the Lucene developers - it's great stuff :D
 
   Jari Aarniala
 
 --
 Jari Aarniala
 [EMAIL PROTECTED]  death is the
 Vantaa, .fi  last dance eternal
 
 
 
 
 
 


Re: HTML parser

2002-04-22 Thread [EMAIL PROTECTED]

Hi all,

did someone try JoBo?

It seems to be good software which can be extended.

Does anyone have any experience with it?

Laura


 Laura,
 
 http://marc.theaimsgroup.com/?l=lucene-user&w=2&r=1&s=Spindle&q=b
 
 Oops, it's JoBo, not MoJo :)
 http://www.matuschek.net/software/jobo/
 
 Otis
 
 --- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
  Hi Otis,
  
  thanks for your reply. I have been looking for Spindle and Mojo for 2
  hours but I haven't found anything.
  
  Can you help me? Wher can I find something?
  
  Thanks for your help and time
  
  
  Laura
  
  

  
   Laura,
   
   Search the lucene-user and lucene-dev archives for things like:
   crawler
   spider
   spindle
   lucene sandbox
   
    Spindle is something you may want to look at, as is MoJo (not mentioned
    on lucene lists, use Google).
   
   Otis
   
Did someone solve the problem to spider recursively a web pages?
   
  While trying to research the same thing, I found the following... here's a
  good example of link extraction.
 
 Try http://www.quiotix.com/opensource/html-parser
 
  It's easy to write a Visitor which extracts the links; it should take about ten
  lines of code.
   
   
   
 
 
 
 


Re: HTML parser

2002-04-20 Thread [EMAIL PROTECTED]

Hi all,

I'm very interested in this thread. I also have to solve the problem
of spidering web sites, creating an index (well, about this there is the
BIG problem that Lucene can't be integrated easily with a DB),
extracting the links from each page and repeating the whole process.

For extracting links from a page I'm thinking of using JTidy. I think
that with this library you can also parse a non-well-formed page (which
you can fetch from the web with URLConnection) by setting the property to
clean the page. The Tidy class returns an org.w3c.dom.Document that
you can use for analyzing the whole document: for example you can use
doc.getElementsByTagName("a") to get all the a elements. You can
parse it as XML.

Did someone solve the problem of spidering web pages recursively?

Laura
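Since this thread is about pulling links out of real-world, often badly formed HTML, here is a small self-contained sketch of the extraction step using the HTML parser that ships with the JDK (the Swing parser; JTidy or the Quiotix parser would slot in the same way). The class name is mine:

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkExtractor {

    // Collect the href of every <a> tag; the Swing parser tolerates
    // badly formed HTML (unclosed tags, missing quotes, ...).
    public static List<String> extractLinks(Reader html) throws Exception {
        final List<String> links = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"http://example.com/a\">one</a>"
                    + "<p>unclosed<a href=\"/b\">two</body>";
        System.out.println(extractLinks(new StringReader(page)));
    }
}
```

For a recursive spider you would resolve each extracted href against the page's base URL, keep a set of already-visited URLs, and feed new pages back into the same extractor.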




 
 While trying to research the same thing, I found the following... here's a
 good example of link extraction.
 
 Try http://www.quiotix.com/opensource/html-parser
 
 It's easy to write a Visitor which extracts the links; it should take about ten
 lines of code.
 
 
 
 --
 Brian Goetz
 Quiotix Corporation
 [EMAIL PROTECTED]   Tel: 650-843-1300Fax: 650-324-
8032
 
 http://www.quiotix.com
 
 
 
 


Some questions

2002-04-19 Thread [EMAIL PROTECTED]

Hi all,

my name is Laura and I'm a new member of this list. I'm a long-time
user of Tomcat and I'm also a member of the Tomcat user list.
Yesterday, looking at the Jakarta menu, I saw Lucene and said: "What is
this?"
Reading the Lucene home page I understood that Lucene is a very interesting
and big project. I'm looking for exactly a product like Lucene!!!
Because I'm a new member of this list and a new user of Lucene, I have some
questions that you can answer easily, I think.
Well, I saw that Lucene creates the index on the filesystem: I think
that this is a problem for a production environment. I usually use a
database, for example Oracle.
Is it possible to integrate Lucene with Oracle or some other DB (MySQL)?

I think that there isn't any Italian Analyzer, is there?
How can I write one?

The last question is: I suppose that my search engine should be able to spider
web sites. Is it possible to spider URLs?
For example, is it possible that starting from a page I spider that page, then
extract the links from it, and finally spider those links as well?
How can I do this?

Well, I hope to be able to use Lucene.

Thanks for your help


Laura