RE: HTML text extraction

2006-06-21 Thread Rob Staveley (Tom)
I found that CyberNeko left style and script in the text and JTidy produced
better output, but both of them use DOM and were therefore subject to
OutOfMemory errors (JTidy being worse than CyberNeko). I've since moved
over to TagSoup, which I needed to customise to strip style and script (a
simple tweak along the lines of the sketch below), but it "kept on trucking"
with documents of any size.
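
The tweak itself isn't shown here; the following is just a minimal sketch of
the same idea, assuming TagSoup's SAX interface
(org.ccil.cowan.tagsoup.Parser) and simply dropping character data inside
style/script elements:

  import java.io.StringReader;
  import org.xml.sax.Attributes;
  import org.xml.sax.InputSource;
  import org.xml.sax.XMLReader;
  import org.xml.sax.helpers.DefaultHandler;

  public class TextExtractor extends DefaultHandler {
      private final StringBuffer text = new StringBuffer();
      private int skipDepth = 0;  // > 0 while inside <style> or <script>

      public void startElement(String uri, String local, String qName, Attributes atts) {
          if ("style".equalsIgnoreCase(local) || "script".equalsIgnoreCase(local))
              skipDepth++;
      }

      public void endElement(String uri, String local, String qName) {
          if ("style".equalsIgnoreCase(local) || "script".equalsIgnoreCase(local))
              skipDepth--;
      }

      public void characters(char[] ch, int start, int length) {
          if (skipDepth == 0) text.append(ch, start, length);  // keep only real text
      }

      public static String extract(String html) throws Exception {
          XMLReader parser = new org.ccil.cowan.tagsoup.Parser();
          TextExtractor handler = new TextExtractor();
          parser.setContentHandler(handler);
          parser.parse(new InputSource(new StringReader(html)));
          return handler.text.toString();
      }
  }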

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: 21 June 2006 07:37
To: java-user@lucene.apache.org
Subject: Re: HTML text extraction

John,

I also wrote about using NekoHTML, I think.  I prefer that to JTidy.  That
also tells you what Simpy.com uses.

Otis

- Original Message 
From: John Wang <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, June 21, 2006 1:39:41 AM
Subject: HTML text extraction

Can someone please suggest a HTML text extraction library? In the Lucene
book, it recommends Tidy. Seems jtidy is not really being maintained.

Otis, what do you guys use at Simpy?

Thanks

-john





Re: HTML text extraction

2006-06-21 Thread Simon Courtenage
I also use htmlparser, which is rather good.  I've had to customize it,
though, to parse strings containing HTML source rather than accepting URLs
of resources to fetch.  Also, it crashes on meta tags that don't have name
attributes (something I discovered only a couple of days ago).

Simon

Daniel Noll wrote:

John Wang wrote:

Can someone please suggest a HTML text extraction library? In the Lucene
book, it recommends Tidy. Seems jtidy is not really being maintained.


We use this library to do our HTML parsing work:

http://htmlparser.sourceforge.net/

It's fairly resilient to bad code, including things like false
assumptions about nesting HTML inside script (e.g.
document.write("");).


Daniel




--
Dr. Simon Courtenage
Software Systems Engineering Research Group
Dept. of Software Engineering, Cavendish School of Computer Science
University of Westminster, London, UK
Email: [EMAIL PROTECTED]   Web: http://users.cscs.wmin.ac.uk/~courtes | 
http://www.sse.wmin.ac.uk





Re: HTML text extraction

2006-06-21 Thread Chris Hostetter

If you just want something to extract the text from HTML, without trying
to extract structure (ie: you don't care about title vs h1 vs bold vs meta
keywords) then the HTMLStripReader (or
HTMLStripWhitespaceTokenizerFactory) Yonik wrote for Solr might be
useful.  It wasn't intended to deal with full HTML documents (hence it
doesn't have any mechanism for inferring structure) but it was intended to
do the best job possible when dealing with dirty data that might be plain
text, or it might be a chunk of HTML, or it might be mostly plain text
with a little bit of HTML sprinkled in.
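
A minimal usage sketch, assuming the class name and location as they were in
Solr at the time (org.apache.solr.analysis.HTMLStripReader), wrapping any
Reader:

  Reader stripped = new HTMLStripReader(new StringReader(dirtyInput));
  StringBuffer sb = new StringBuffer();
  int ch;
  while ((ch = stripped.read()) != -1) {
      sb.append((char) ch);  // markup has been stripped from the stream
  }
  String text = sb.toString();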



-Hoss





Re: faceting and categorizing on color?

2006-06-21 Thread Chris Hostetter

: I thought that having: F0 FF FF 00...
: in one field and then searching for FF in it would
: match all documents that contain that "word" so I
...
: the counts were equal. I guess I am still not clear on
: what the differences/advantages/disadvantages are
: between the above an document and one that like this:
...
: color in its own field? So perhaps a more general
: question is when is it better to collapse a bunch of
: words into a single field vs. spread them out over
: many fields, all of which have the same field name?

My point was that (depending on what exactly you were trying to express)
when you write out a document's structure in email like that, both
examples have the exact same structure in the index.  For any given field
name, there is only one Field in the index, and it has a stream of values
regardless of how you constructed your Document object -- technically
there isn't even really a "field" as much as there is an ordered list of
Terms.  Specifically, this...

  Directory d = new RAMDirectory();
  IndexWriter w = new IndexWriter(d, new WhitespaceAnalyzer(), true);
  Document doc = new Document();
  doc.add(new Field("color", "red blue green", Field.Store.NO, Field.Index.TOKENIZED));
  w.addDocument(doc);

...is going to produce an index with the exact same terms mapped to that
document as this...

  Directory d = new RAMDirectory();
  IndexWriter w = new IndexWriter(d, new WhitespaceAnalyzer(), true);
  Document doc = new Document();
  doc.add(new Field("color", "red", Field.Store.NO, Field.Index.TOKENIZED));
  doc.add(new Field("color", "blue", Field.Store.NO, Field.Index.TOKENIZED));
  doc.add(new Field("color", "green", Field.Store.NO, Field.Index.TOKENIZED));
  w.addDocument(doc);

...the only difference that might pop up is if you configure your
analyzer to have a positionIncrementGap greater than 0, in which case
there will be a bigger "gap" between the red and blue and green in the
second case than in the first (which will affect any sloppy searches that
you might do)
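
For example, a sketch of such an analyzer (assuming Lucene 1.9+, where
Analyzer.getPositionIncrementGap can be overridden; "d" is a Directory as in
the snippets above):

  Analyzer analyzer = new WhitespaceAnalyzer() {
      // leave 100 empty positions between successive values of a field, so
      // sloppy phrase queries can't match across a value boundary
      public int getPositionIncrementGap(String fieldName) {
          return 100;
      }
  };
  IndexWriter w = new IndexWriter(d, analyzer, true);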

LIA has a lot of great info on exactly how this works, and can really help
you "think" in terms of Terms.

: That's a great idea and seems way more straightforward
: than the RGB/HSV etc. distance calculation algorithms
: I've been reading about :o) I'll have to run some
: tests to see how accurate that reduction appears to
: people.

that's just an easy one if you are already representing your colors in RGB
hex codes -- the bottom line is that any method you have for simplifying your
palette will make it easier to do facet counts (because you are reducing the
number of facets)

: Huh, perhaps I don't understand the HitCollector
: fully. Are you saying that if I have an index with 100
: documents, each of which have a color field (let's say
: 25 of the documents have FF in the color field)
: and I do a search for FF...using a HitCollector
: I'm iterating over all 100 documents while extracting
: values whereas with the regular Hits based search I
: would only be iterating over 25?

First off, forget the Hits class exists -- it has no place in a discussion
of facet counts where you need to look at every document that matches a
given query.

Second: my point is that when you are starting with an arbitrary query,
and then you want to provide counts for a bunch of facet values (ie:
colors) you have no idea which facet values are the "best" -- you have no
way of knowing that you should start with FF, you have to try them
all...

Conceptually facets are about set intersections and the cardinality of
those intersections.  You have a main result set of documents which match
your user's query, and you have many hypothetical sets of documents that
each represent all docs that contain some single color that defines that
set.  You want to intersect each of those sets with your main result set,
and find out which colors produce the greatest intersection cardinality.

With the BitSet approach, you can directly implement this conceptual
model -- making real sets out of your hypothetical sets.  But if that's
not feasible from a memory usage perspective, then you can traverse the
two dimensions of the problem space (colors and documents) in either
order:
 1) you can use the FieldCache (or some other data structure you build
from TermEnums) to have fast access to the list of colors in each doc, and
then in your HitCollector, as you collect each docId matching your primary
query, increment a counter for the corresponding colors.
 2) you can start with a HitCollector that just records all of the docIds
that match your primary query -- a BitSet is convenient for this -- and
then use a TermEnum with a TermDocs to iterate over every color in your
index, and count up how many of the documents those colors map to are in
your BitSet
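
A minimal sketch of option (2), assuming Lucene 1.9/2.0 APIs, a field named
"color", and an IndexReader/IndexSearcher already open:

  final BitSet matches = new BitSet(reader.maxDoc());
  searcher.search(query, new HitCollector() {
      public void collect(int doc, float score) {
          matches.set(doc);  // just remember which docs matched
      }
  });

  Map counts = new HashMap();  // color -> intersection cardinality
  TermEnum terms = reader.terms(new Term("color", ""));
  TermDocs termDocs = reader.termDocs();
  try {
      while (terms.term() != null && "color".equals(terms.term().field())) {
          termDocs.seek(terms);
          int n = 0;
          while (termDocs.next()) {
              if (matches.get(termDocs.doc())) n++;
          }
          counts.put(terms.term().text(), new Integer(n));
          if (!terms.next()) break;
      }
  } finally {
      terms.close();
      termDocs.close();
  }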

(NOTE: using a FieldCache for this is non-trivial since FieldCache only
works if each doc has at most one value; I wasn't really thinking about
the specifics of your color example when I mentioned it before -- but the
concept is still the same: an array that allows fast lookup by docId.)

Re: Modifying the stored norm type

2006-06-21 Thread karl wettin
On Tue, 2006-06-20 at 18:01 +0200, Paul Elschot wrote:
> On Tuesday 20 June 2006 12:02, Marcus Falck wrote:

> encodeNorm method of the Similarity class will encode my boost value
> into a single byte decimal number. And I will loose a lot of
> resolution and will get severe rounding errors.

> Are 256 different values enough for your case?

Marcus is trying to use the norms to enforce results in chronological
order when matching a TB-sized corpus. He can't get any speed by sorting
on a date field.

Here is an idea:

Never delete documents. Use unsafe document number as system clock. Make
sure TermDocs always return references in reversed chronological order
and write a HitCollector that does not re-order.

That should work, right? 
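
A sketch of what that might look like, assuming docIds really are assigned
in chronological order; a HitCollector sees them oldest-first, so record
them and walk backwards for newest-first:

  final BitSet seen = new BitSet(reader.maxDoc());
  searcher.search(query, new HitCollector() {
      public void collect(int doc, float score) {
          seen.set(doc);
      }
  });
  for (int doc = reader.maxDoc() - 1; doc >= 0; doc--) {
      if (seen.get(doc)) {
          // newest first; stop once you have as many results as you need
      }
  }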







RE: Custom ScoreDocComparator and normalized Scores

2006-06-21 Thread Gustavo Comba
Thanks Chris, I didn't know about the "solr" package; it is not in the release
distribution, is it? I'm going to read about it to see if it matches our
needs.

The need for normalization comes from combining a list of values into a
"polynomial"-like ranking function. We define our "ranking" like this:

Our Score = x * LuceneScore + y * SomeField + z * SomeOtherField + w *
YetAnotherField

x, y, z and w being our coefficients or "scaling factors" in our
ranking function.

For this to make sense, all the other values (LuceneScore,
SomeField, SomeOtherField and YetAnotherField) must be normalized:
positive (because we want them to be) values linearly scaled to fit some
fixed segment, let's say, 0 to 1.

To achieve pre-ordering normalization I'm using an "all collector" like:

public class AllCollector extends HitCollector {

    private ArrayList scoreDocs;
    private float maxScore = 0.0f;   // declarations implied by the code below
    private int totalHits = 0;

    public AllCollector() {
        scoreDocs = new ArrayList(1);
    }

    public void collect(int doc, float score) {
        if (score > 0.0f) {
            maxScore = Math.max(maxScore, score);
            scoreDocs.add(new ScoreDoc(doc, score));
            totalHits++;
        }
    }
}

And to get the "best-n" we rewrite topDocs() to:

    public TopDocs topDocs(IndexReader reader, Sort sort, int numHits)
            throws IOException {
        TopFieldDocCollector collector =
            new TopFieldDocCollector(reader, sort, numHits);
        if (maxScore > 0.0f) {
            for (Iterator it = scoreDocs.iterator(); it.hasNext();) {
                ScoreDoc scoreDoc = (ScoreDoc) it.next();
                scoreDoc.score /= maxScore;
                collector.collect(scoreDoc.doc, scoreDoc.score);
            }
        }
        collector.totalHits = totalHits;
        return collector.topDocs();
    }

This workaround has some evident "cons", like:

* It builds a big list with all the results
* It duplicates the work: first a List, then a PriorityQueue
* It could cause problems with "multi indexes".

But it works for us for now. I'm going to look at the FunctionQuery to see
if it can do the job.

Thanks a Lot for your help!

Gustavo


-Mensaje original-
De: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Enviado el: martes, 20 de junio de 2006 21:55
Para: java-user@lucene.apache.org
Asunto: Re: Custom ScoreDocComparator and normalized Scores



First off: why do you need the normalized scores in your equation?  For
the purposes of comparing the calculated values in order to sort them,
it shouldn't matter whether they are normalized or not.

Second: I strongly suggest you take a look at FunctionQuery ... it was
created for the express purpose of letting you define functions that are
applied to indexed field values of each document to affect the score

http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/package-summary.html


: Date: Tue, 20 Jun 2006 11:31:42 +0200
: From: Gustavo Comba <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Custom ScoreDocComparator and normalized Scores
:
: Hi,
:
: I'm trying to sort the search results by a "combination" of the
: "lucene score" and the value of a document field. The "combination" is
: something like that:
:
: scoreWeight * i.score + fieldWeight * getFieldValue(i.doc)
:
: I expect results between 0 and scoreWeight + fieldWeight
:
: Until version 1.9 this used to work OK, but now Lucene doesn't
: normalize the document scores before calling
: ScoreDocComparator#compare(ScoreDoc i, ScoreDoc j). I know this is
: necessary when combining several indexes, but it's not our case (we
: have only one index).
:
: I'm digging into Lucene's source code to find a way to normalize
: values before sorting the results. The solution I found requires a lot
: of "custom" code, and doing 2 passes over the results: one to calculate
: all the document's scores, and then a sort using a comparator "who
: knows" the maximum score value (in order to normalize values on the
: fly), so I think there should be a more efficient and elegant way to
: do this.
:
: Any ideas? Any help will be appreciated! Thanks in advance,
:
: Gustavo Comba
: Emagister.com
:



-Hoss





RE: HTML text extraction

2006-06-21 Thread Liao Xuefeng
hi,
i wrote my own html parser to do html2text and it works well. i can send you
my code if it matches your requirements.

-Original Message-
From: John Wang [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 21, 2006 1:40 PM
To: java-user@lucene.apache.org
Subject: HTML text extraction

Can someone please suggest a HTML text extraction library? In the Lucene
book, it recommends Tidy. Seems jtidy is not really being maintained.

Otis, what do you guys use at Simpy?

Thanks

-john





Re: Search within multiple different subfolders

2006-06-21 Thread Erick Erickson

Shagheyegh:

I'm hardly the lucene expert, but I don't think you can search just a
portion of the index. But that's effectively what you're doing if you
restrict the search to "son" and "daughter".

However, depending on your problem space, you could build separate indexes.
To continue the example, you could build 4 indexes, "mother", "father",
"son", "daughter" and only search the relevant ones.

You should still be able to aggregate the results. That is, you could search
over all 4 indexes when you needed to and combine the results into a single
response, and search only a subset of them at other times. I admit that I
haven't yet had to use a MultiSearcher, but that sure looks like what you
want if you adopt this approach (see the sketch below) ...
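
A minimal sketch, assuming Lucene 1.9/2.0 APIs and made-up index locations:

  Searchable[] searchables = new Searchable[] {
      new IndexSearcher("indexes/son"),       // hypothetical paths
      new IndexSearcher("indexes/daughter"),
  };
  Searcher searcher = new MultiSearcher(searchables);
  Hits hits = searcher.search(query);         // results merged across indexes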

Note: I'm looking at the 2.0 documentation

But I also have to ask: why are you trying to "search only a portion of the
index"? If you haven't encountered a bottleneck that's forcing you into this
option (or don't have a *very* high expectation that you will encounter such
a bottleneck), this strikes me as work you shouldn't be doing until there's
a demonstrated need. The eXtreme Programming folks put it this way: "make
it work, make it right, make it fast".


From Tony Hoare and Donald Knuth... "We should forget about small
efficiencies, say about 97% of the time: premature optimization is the root
of all evil."

Best
Erick


BooleanQuery

2006-06-21 Thread WATHELET Thomas
Why do I retrieve hits with this query:

+doccontent:avian +doctype:AM +docdate:[2005033122000 TO
2006062022000]

and not with this one?

+doccontent:avian influenza +doctype:AM +docdate:[2005033122000 TO
2006062022000]





RE: BooleanQuery

2006-06-21 Thread Mile Rosu
You should specify the field name for influenza as well.

Like this:

+doccontent:avian +doccontent:influenza +doctype:AM
+docdate:[2005033122000 TO
2006062022000]
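
A sketch of why the field prefix matters, assuming Lucene's QueryParser
with "doccontent" as the default field:

  QueryParser qp = new QueryParser("doccontent", new StandardAnalyzer());
  System.out.println(qp.parse("+doccontent:avian influenza +doctype:AM"));
  // prints: +doccontent:avian doccontent:influenza +doctype:AM
  // the prefix binds only to the term right after it, so "influenza" falls
  // back to the default field and is merely optional, not required
  System.out.println(qp.parse("+doccontent:avian +doccontent:influenza +doctype:AM"));
  // prints: +doccontent:avian +doccontent:influenza +doctype:AM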

Mile

-Original Message-
From: WATHELET Thomas [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 21, 2006 4:40 PM
To: java-user@lucene.apache.org
Subject: BooleanQuery

Why do I retrieve hits with this query:

+doccontent:avian +doctype:AM +docdate:[2005033122000 TO
2006062022000]

and not with this one?

+doccontent:avian influenza +doctype:AM +docdate:[2005033122000 TO
2006062022000]







RE: BooleanQuery

2006-06-21 Thread Gustavo Comba
Hello,

I don't know how you are parsing your query, but maybe the query you
are looking for is something like:

 +(doccontent:avian doccontent:influenza) +doctype:AM 
+docdate:[2005033122000 TO 2006062022000]

Regards,

Gustavo

-Mensaje original-
De: WATHELET Thomas [mailto:[EMAIL PROTECTED] 
Enviado el: miércoles, 21 de junio de 2006 15:40
Para: java-user@lucene.apache.org
Asunto: BooleanQuery

Why do I retrieve hits with this query:

+doccontent:avian +doctype:AM +docdate:[2005033122000 TO
2006062022000]

and not with this one?

+doccontent:avian influenza +doctype:AM +docdate:[2005033122000 TO
2006062022000]







RE: BooleanQuery

2006-06-21 Thread WATHELET Thomas
Ok, thanks a lot.
Before, I used a TermQuery for the field doccontent; now I build a Query with
QueryParser.parse and it works perfectly.

-Original Message-
From: Gustavo Comba [mailto:[EMAIL PROTECTED] 
Sent: 21 June 2006 16:00
To: java-user@lucene.apache.org
Subject: RE: BooleanQuery

Hello,

I don't know how you are parsing your query, but maybe the query you
are looking for is something like:

 +(doccontent:avian doccontent:influenza) +doctype:AM 
+docdate:[2005033122000 TO 2006062022000]

Regards,

Gustavo

-Mensaje original-
De: WATHELET Thomas [mailto:[EMAIL PROTECTED] 
Enviado el: miércoles, 21 de junio de 2006 15:40
Para: java-user@lucene.apache.org
Asunto: BooleanQuery

Why do I retrieve hits with this query:

+doccontent:avian +doctype:AM +docdate:[2005033122000 TO
2006062022000]

and not with this one?

+doccontent:avian influenza +doctype:AM +docdate:[2005033122000 TO
2006062022000]







Re: using lucene Lock inter-jvm

2006-06-21 Thread jm

ok, in case somebody has the same problem:
The problem is the true value in
FSDirectory directory = FSDirectory.getDirectory("C:\\temp\\a", true);
it deletes the previous lock file, which belongs to the lock acquired
by the first process. Changing it to false prevents the lock from being
deleted, and locking works ok

On 6/20/06, jm <[EMAIL PROTECTED]> wrote:

Hi,

I am trying to peruse lucene's Lock for my own purposes, I need to
lock several java processes and I thought I could reuse the Lock
stuff. I understand lucene locks work across jvm.

But I cannot make it work. I tried to reproduce my problem in a small class:

public class SysLock {
    private static final Logger logger = Logger.getLogger(SysLock.class);

    // not shown in the original post; some timeout value is assumed here
    private static final long COMMIT_LOCK_TIMEOUT = 10000;

    private int id;

    public SysLock(int i) {
        id = i;
    }

    //public static void main(String[] args) throws Exception {
    //    System.setProperty("org.apache.lucene.lockDir", "C:\\temp\\todel");
    //    SysLock l1 = new SysLock(1);
    //    SysLock l2 = new SysLock(2);
    //
    //    TransferThread t = l1.new TransferThread(l1);
    //    t.start();
    //    TransferThread t2 = l2.new TransferThread(l2);
    //    t2.start();
    //
    //    logger.info("Finished.");
    //}

    public static void main(String[] args) throws Exception {
        System.setProperty("org.apache.lucene.lockDir", "C:\\temp\\todel");
        SysLock l1 = new SysLock(new Date().getSeconds());

        TransferThread t = l1.new TransferThread(l1);
        t.start();

        logger.info("Finished.");
    }

    private void forever() throws IOException {
        FSDirectory directory = FSDirectory.getDirectory("C:\\temp\\a", true);
        try {
            new Lock.With(directory.makeLock("COMMIT_LOCK_NAME"),
                          COMMIT_LOCK_TIMEOUT) {
                public Object doBody() throws IOException {
                    while (true) {
                        System.out.println("i'm " + id);
                        try {
                            Thread.sleep(2000);
                        }
                        catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }.run();
        }
        catch (Exception e) {
            System.out.println(id + " could not get lock");
        }
    }

    class TransferThread extends Thread {
        public TransferThread(SysLock sl) {
            this.sl = sl;
        }

        public void run() {
            try {
                sl.forever();
            }
            catch (IOException e) {
                e.printStackTrace();
            }
        }

        private SysLock sl;
    }
}

When I run the main() that is commented out (that is, the lock is used by
two threads in the same jvm) it works ok: the second TransferThread
cannot get the lock.

But when I run the uncommented main() twice, both processes acquire a
lock, even though only one lock file exists in the lockdir. Something I am
missing probably

Many thanks
javi






Giving weight to partial matches

2006-06-21 Thread Chun Wei Ho

I am performing searches on an index that includes a title field and a
content field, and return results only if either title or content
matches ALL the words searched. So searching for "miracle cure for
cancer" might yield:

(+title:miracle +title:cure +title:for +title:cancer)^5.0
(+content:miracle +content:cure +content:for +content:cancer)

What I'd like to do now is to give additional weight to a result if the
title field contains some of the words being searched, for example the
document:

Title: Miracle Cure
Content: A miracle cure for cancer has been found!

would have a higher weight/score because the title contains words that
were searched for (although not fully matched), even though the content
field is the one that results in the match.

I've seen a few discussions on weighting on the list but they all seem
to revolve around FunctionQuery from Solr. My current application does
not use Solr and is based on Lucene 1.9.

Any suggestions would be great :)




RE: Giving weight to partial matches

2006-06-21 Thread Gustavo Comba
Hi,

You can use something like:

(title:miracle title:cure title:for title:cancer)^5.0 +((+title:miracle 
+title:cure +title:for +title:cancer) (+content:miracle +content:cure 
+content:for +content:cancer))

Should do the work.

Regards,

Gustavo

-Mensaje original-
De: Chun Wei Ho [mailto:[EMAIL PROTECTED] 
Enviado el: miércoles, 21 de junio de 2006 16:49
Para: java-user@lucene.apache.org
Asunto: Giving weight to partial matches

I am performing searches on an index that includes a title field and a content 
field, and return results only if either title or content matches ALL the words 
searched. So searching for "miracle cure for cancer" might yield:

(+title:miracle +title:cure +title:for +title:cancer)^5.0 (+content:miracle 
+content:cure +content:for +content:cancer)

What I'd like to do now is to give additional weight to a result if the title
field contains some of the words being searched, for example the
document:

Title: Miracle Cure
Content: A miracle cure for cancer has been found!

would have a higher weight/score because the title contains words that were
searched for (although not fully matched), even though the content field is the
one that results in the match.

I've seen a few discussions on weighting on the list but they all seem to 
revolve around FunctionQuery from Solr. My current application does not use 
Solr and is based on Lucene 1.9.

Any suggestions would be great :)




Re: using lucene Lock inter-jvm

2006-06-21 Thread Michael McCandless
CC'ing java-dev to talk about details of locking.

I can reproduce this on Windows XP, Java 1.4.2: two separate JVMs are able 
to get the Lock at the same time.

The code looks correct to me.

Strangely, if I make a separate standalone test that just uses 
java.io.File.createNewFile directly, it works correctly (this is what 
FSDirectory.makeLock uses).  If I hardwire the lock filename inside 
FSDirectory.makeLock, it also fails, unless I use a very short filename, 
then it seems to work.  Something intermittent is going on.

Then in looking at the docs for java.io.File.createNewFile():

 http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html#createNewFile()

There is this spooky comment:

   Note: this method should not be used for file-locking, as the 
resulting protocol cannot be made to work reliably. The FileLock facility 
should be used instead. 

Finally, it looks like Hadoop's LocalFileSystem class is already using
FileLocks.  One benefit of FileLocks is that if the JVM crashes, the OS should
handle removing the lock correctly.  I know this has been an issue in the
past with "commit lock timeout" errors due to the lock file remaining in
the filesystem, with the current approach.

Does anyone know of any reasons not to switch Lucene's FSDirectory locking 
to the java.nio.channels.FileLock?  EG, are there any performance issues 
that people are aware of?  It's available since Java 1.4.
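
For reference, a minimal sketch of inter-process locking with
java.nio.channels.FileLock (the file name here is arbitrary):

  RandomAccessFile raf = new RandomAccessFile(new File(lockDir, "my.lock"), "rw");
  FileLock lock = raf.getChannel().tryLock();  // null if another process holds it
  if (lock != null) {
      try {
          // ... critical section ...
      } finally {
          lock.release();  // the OS also releases it if the JVM dies
          raf.close();
      }
  }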

Mike







Re: HTML text extraction

2006-06-21 Thread John Wang

Thanks everyone for your responses!
I will try them out.

-John





Possible improvement in BooleanQuery

2006-06-21 Thread Satuluri, Venu_Madhav
The method BooleanQuery.add( Query q, BooleanClause.Occur o) accepts
Query objects that are null for its first parameter i.e. it doesn't
throw any exception. However, when we try to get the string form of the
same BooleanQuery object, it throws a NullPointerException from within
the toString() code of BooleanQuery.

Try this code for example: 
    BooleanQuery bq = new BooleanQuery();
    bq.add(null, BooleanClause.Occur.SHOULD);
    System.out.println(bq.toString());

It throws the following exception:
Exception in thread "main" java.lang.NullPointerException
at
org.apache.lucene.search.BooleanQuery.toString(BooleanQuery.java:421)
at org.apache.lucene.search.Query.toString(Query.java:79)
at
deshaw.reccms.search.LuceneGenerator.main(LuceneGenerator.java:751)

It seems to me that an IllegalArgumentException or some other
RuntimeException should be thrown at line no. 2, from the
BooleanQuery.add() method, because clearly BooleanQuery objects don't
handle null Queries. 
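
A hedged sketch of the suggested guard (not the current Lucene code, just
what the check might look like inside BooleanQuery):

    public void add(Query query, BooleanClause.Occur occur) {
        if (query == null)
            throw new IllegalArgumentException("query must not be null");
        add(new BooleanClause(query, occur));
    }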

Thoughts?

Thanks,
Venu




Re: using lucene Lock inter-jvm

2006-06-21 Thread Yonik Seeley

On 6/21/06, Michael McCandless <[EMAIL PROTECTED]> wrote:

Does anyone know of any reasons not to switch Lucene's FSDirectory locking
to the java.nio.channels.FileLock?  EG, are there any performance issues
that people are aware of?  It's available since Java 1.4.


Good question Michael, no reason that I know of... I think it's
probably just that no one has revisited the issue since Lucene moved
to 1.4.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




hi - testing

2006-06-21 Thread bruce
hi..

can someone please respond to this so i can see if i'm getting through..

thanks

-bruce






Re: hi - testing

2006-06-21 Thread Yonik Seeley

Normally the way I see if I'm correctly sending something to a list is
to send the first post I really want to send, and go check an archive
of the list a little later.

-Yonik

On 6/21/06, bruce <[EMAIL PROTECTED]> wrote:

hi..

can someone please respond to this so i can see if i'm getting through..

thanks

-bruce



-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: Modifying the stored norm type

2006-06-21 Thread Yonik Seeley

On 6/21/06, karl wettin <[EMAIL PROTECTED]> wrote:

Marcus is trying to use the norms to enforce results in chronological
order when matching a TB-sized corpus. He can't get any speed by sorting
on a date field.


Once a FieldCache entry is populated, sorting on a DateField should be
about the same speed as sorting by score.
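
For example, a sketch assuming the date was indexed as a lexicographically
sortable string (e.g. DateField/DateTools format) in a field named "date":

  Sort byDateDesc = new Sort(new SortField("date", SortField.STRING, true));
  Hits hits = searcher.search(query, byDateDesc);
  // the first search pays the cost of populating the FieldCache entry;
  // later searches on the same reader reuse it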

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




RE: hi - testing

2006-06-21 Thread bruce
thanks to all who replied!!!



-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 21, 2006 8:59 AM
To: java-user@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: hi - testing


Normally the way I see if I'm correctly sending something to a list is
to send the first post I really want to send, and go check an archive
of the list a little later.

-Yonik

On 6/21/06, bruce <[EMAIL PROTECTED]> wrote:
> hi..
>
> can someone please respond to this so i can see if i'm getting through..
>
> thanks
>
> -bruce


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




lucene...

2006-06-21 Thread bruce
hi...

after reading through the docs for lucene/nutch, i'm trying to straighten
out how it all works...

if i want to crawl through a portion of a web site for the purpose of
extracting information, it appears that this would work. however, i'm not
sure if i need lucene/nutch or both.. i don't need to do indexing, as i'm
not going to be doing any query searching, at least not initially...

i'm also trying to understand just what gets returned when i 'crawl' a
portion of a site.. do i get information back in a series of html files.. do
i get a db of information, just what do i get..??

i'm looking at being able to take a given url www.foo.com, and to be able to
crawl through a portion of the site.. need to figure out how to accomplish
this... and once i have the returned information (if it's in a file/txt
format) i'd like to be able to extract certain information based upon the
DOM of the page... if the returned information from the 'crawler' is of a
textfile format, i can easily create a parsing function to go through the
files and generate the information...

can someone provide me with insight as to whether lucene/nutch is the way to
go with this project..

thanks

-bruce
=





Re: lucene...

2006-06-21 Thread Otis Gospodnetic
Hi Bruce,
You want to use Nutch.  Nutch uses Lucene under the hood, and provides all the 
crawling stuff.
Otis




Re: lucene Index search

2006-06-21 Thread Otis Gospodnetic
Hi,

You may want to look at the Lucene highlighter, if you are after "keyword in
context".

Otis
P.S.
Please use java-user list for questions.

- Original Message 
From: "Ngo, Anh (ISS Southfield)" <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Wednesday, June 21, 2006 12:17:38 PM
Subject: lucene Index search


Hello,

I am new to the lucene index API.  Please help me with the following:

* Lucene search returns documents.  Is there an API I can use to
search the content of a document and return only the lines that match the
search query?


Sincerely,


Anh Ngo


-Original Message-
From: Chuck Williams (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 19, 2006 10:24 PM
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-398) ParallelReader crashes when
trying to merge into a new index

[
http://issues.apache.org/jira/browse/LUCENE-398?page=comments#action_124
16837 ] 

Chuck Williams commented on LUCENE-398:
---

Christian,

I'm going to open a new issue on this in order to rename it, post a
revised patch, and hopefully get the attention of a committer.

Chuck


> ParallelReader crashes when trying to merge into a new index
> 
>
>  Key: LUCENE-398
>  URL: http://issues.apache.org/jira/browse/LUCENE-398
>  Project: Lucene - Java
> Type: Bug

>   Components: Index
> Versions: unspecified
>  Environment: Operating System: All
> Platform: All
> Reporter: Sebastian Kirsch
> Assignee: Lucene Developers
>  Attachments: ParallelReader.diff, ParallelReaderTest1.java,
parallelreader.diff, patch-next.diff
>
> ParallelReader causes a NullPointerException in
>
org.apache.lucene.index.ParallelReader$ParallelTermPositions.seek(Parall
elReader.java:318)
> when trying to merge into a new index.
> See test case and sample output:
> $ svn diff
> Index: src/test/org/apache/lucene/index/TestParallelReader.java
> ===
> --- src/test/org/apache/lucene/index/TestParallelReader.java
(revision 179785)
> +++ src/test/org/apache/lucene/index/TestParallelReader.java
(working copy)
> @@ -57,6 +57,13 @@
>  
>}
>   
> +  public void testMerge() throws Exception {
> +Directory dir = new RAMDirectory();
> +IndexWriter w = new IndexWriter(dir, new StandardAnalyzer(),
true);
> +w.addIndexes(new IndexReader[] { ((IndexSearcher)
> parallel).getIndexReader() });
> +w.close();
> +  }
> +
>private void queryTest(Query query) throws IOException {
>  Hits parallelHits = parallel.search(query);
>  Hits singleHits = single.search(query);
> $ ant -Dtestcase=TestParallelReader test
> Buildfile: build.xml
> [...]
> test:
> [mkdir] Created dir:
> /Users/skirsch/text/lectures/da/thirdparty/lucene-trunk/build/test
> [junit] Testsuite: org.apache.lucene.index.TestParallelReader
> [junit] Tests run: 2, Failures: 0, Errors: 1, Time elapsed: 1.993
sec
> [junit] Testcase:
testMerge(org.apache.lucene.index.TestParallelReader):  
> Caused an ERROR
> [junit] null
> [junit] java.lang.NullPointerException
> [junit] at
>
org.apache.lucene.index.ParallelReader$ParallelTermPositions.seek(Parall
elReader.java:318)
> [junit] at
>
org.apache.lucene.index.ParallelReader$ParallelTermDocs.seek(ParallelRea
der.java:294)
> [junit] at
>
org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:
325)
> [junit] at
>
org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:2
96)
> [junit] at
>
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:
270)
> [junit] at
>
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
> [junit] at
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
> [junit] at
> org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:596)
> [junit] at
>
org.apache.lucene.index.TestParallelReader.testMerge(TestParallelReader.
java:63)
> [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
> [junit] at
>
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.jav
a:39)
> [junit] at
>
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor
Impl.java:25)
> [junit] Test org.apache.lucene.index.TestParallelReader FAILED
> BUILD FAILED
>
/Users/skirsch/text/lectures/da/thirdparty/lucene-trunk/common-build.xml
:188:
> Tests failed!
> Total time: 16 seconds
> $


Re: Modifying the stored norm type

2006-06-21 Thread Paul Elschot
On Wednesday 21 June 2006 12:13, karl wettin wrote:
> On Tue, 2006-06-20 at 18:01 +0200, Paul Elschot wrote:
> > On Tuesday 20 June 2006 12:02, Marcus Falck wrote:
> 
> > encodeNorm method of the Similarity class will encode my boost value
> > into a single byte decimal number. And I will loose a lot of
> > resolution and will get severe rounding errors.
> 
> > Are 256 different values enough for your case?
> 
> Marcus is trying to use the norms to enforce results in chronological
> order when matching a TB-sized corpus. He can't get any speed by sorting
> on a date field.
> 
> Here is an idea:
> 
> Never delete documents. Use unsafe document number as system clock. Make

Deleting documents does not change the order of the remaining ones.

> sure TermDocs always return references in reversed chronological order

There is no need to write extra code for that, the documents would be
collected oldest first, newest last.

> and write a HitCollector that does not re-order.
> 
> That should work, right? 

In case you need oldest first, yes.

Regards,
Paul Elschot




Re: Modifying the stored norm type

2006-06-21 Thread Paul Elschot
On Tuesday 20 June 2006 18:42, Dan Climan wrote:
> >Paul Elschot <[EMAIL PROTECTED]> 
> >>On Tuesday 20 June 2006 12:02, Marcus Falck wrote:
> >> After a lot of debugging and some API doc reading I have come to the
> > conclusion that the static encodeNorm method of the Similarity class
> > will encode my boost value into a single byte decimal number.
> >> 
> >> And I will loose a lot of resolution and will get severe rounding
> >> errors. 
> >> 
> >> Since I need the exact float value as boost representation this isn't
> >> good enough in my case.
> 
> >An exact float value is a bit of an oxymoron.
> >How exact do you need it to be? 
> 
> >The range of values that can be encoded by the existing encodeNorm()
> >and decodeNorm() is quite big (about 10e-10 to 10e+10 iirc),
> >and since there are only 255 possible values in there (excluding 0),
> >the rounding errors can be severe indeed.
> >However, with a smaller range, the rounding errors would also be smaller.
> 
> >Are 256 different values enough for your case?
> 
> >Regards,
> >Paul Elschot
> 
> I, too, have found that 256 values were not enough. I tried changing the
> encodeNorm function to use a narrower range of values, but the 256 limit
> makes it degrade quickly if I get any results outside the expected range.
> This was true when we tried various algorithms for boosting results based on
> external factors. 
> 
> FunctionQuery (not currently in core lucene) from the SOLR project may be
> an alternative. Can it replace all uses of the norm?
> 
> Now that omitNorms is part of the core, the impact of allowing a 2 byte (or
> even 4 byte norm) is not nearly as severe on memory. Any suggestions for how
> to create a multi-byte norm as an option, but enable the same code to
> reading existing 1 byte norms?

No more than to add your needs here: 
http://wiki.apache.org/jakarta-lucene/FlexibleIndexing,
and to subscribe to java-dev.

Regards,
Paul Elschot




lucene consulting/support/help

2006-06-21 Thread bruce
hi...

anybody on the list provide consulting/support for lucene/nutch...

get back to me with your contact info if you do...

thanks

-bruce





RE: Custom ScoreDocComparator and normalized Scores

2006-06-21 Thread Chris Hostetter

: Thanks Chris, I didn't know the "solr" package, it is not in the release
: distribution, isn't? I'm going to read about it to see if it matchs our
: needs.

Solr is a seperate (incubation) project, that builds on top of Lucene, but
the FunctionQuery classes have no dependencies outside of the Lucene core.

: Our Score = x . LuceneScore + y . SomeField + z . SomeOtherField + w .
: YetAnotherField
:
: Being x, y, z and w our coeffiecients or "scaling factors" in our
: ranking function.

The thing to watch out for is that normalizing Lucene scores isn't very
"clean" in a mathematical sense: you can't assume that a score value of X
when searching for "foo" means the match is as good as a score of X when
searching for "bar":

http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03

The FunctionQuery stuff may make more sense for you; it lets you inject
your function right into the lucene score.
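
A rough sketch of that combination, assuming Solr's function classes
(FunctionQuery, FloatFieldSource) are on the classpath and "someField" is a
made-up field name:

  BooleanQuery combined = new BooleanQuery();
  combined.add(userQuery, BooleanClause.Occur.MUST);
  combined.add(new FunctionQuery(new FloatFieldSource("someField")),
               BooleanClause.Occur.SHOULD);  // adds the field value into the score
  Hits hits = searcher.search(combined);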



-Hoss





Re: Giving weight to partial matches

2006-06-21 Thread Chris Hostetter
: content field, and return results only if either title or content
: matches ALL the words searched. So searching for "miracle cure for
: cancer" might yield:
:
: (+title:miracle +title:cure +title:for +title:cancer)^5.0
: (+content:miracle +content:cure +content:for +content:cancer)

first off, a really sloppy phrase query may serve you better here ... it
will have the added benefit of scoring docs where these words appear closer
together higher...

   (title:"miracle cure for cancer" content:"miracle cure for cancer")

: What I like to do now is to give additional weight to a result if the
: title field contains some of the words being search, for example the
: document:

just treat your current query as a required clause of a new boolean query,
and add each of the individual words as optional clauses...

   (+(title:"miracle cure for cancer" content:"miracle cure for cancer")
title:miracle title:cure title:for title:cancer)
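
One way to build that programmatically (a sketch, assuming Lucene 1.9
APIs):

  String[] words = { "miracle", "cure", "for", "cancer" };

  PhraseQuery title = new PhraseQuery();
  PhraseQuery content = new PhraseQuery();
  title.setSlop(50);    // "really sloppy"
  content.setSlop(50);
  for (int i = 0; i < words.length; i++) {
      title.add(new Term("title", words[i]));
      content.add(new Term("content", words[i]));
  }

  BooleanQuery either = new BooleanQuery();
  either.add(title, BooleanClause.Occur.SHOULD);
  either.add(content, BooleanClause.Occur.SHOULD);

  BooleanQuery top = new BooleanQuery();
  top.add(either, BooleanClause.Occur.MUST);    // the required clause
  for (int i = 0; i < words.length; i++) {      // optional title-word boosts
      top.add(new TermQuery(new Term("title", words[i])),
              BooleanClause.Occur.SHOULD);
  }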



-Hoss





RE: lucene consulting/support/help

2006-06-21 Thread Larry Ogrodnek
http://wiki.apache.org/jakarta-lucene/Support

-Original Message-
From: bruce [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 21, 2006 2:11 PM
To: java-user@lucene.apache.org
Subject: lucene consulting/support/help

hi...

anybody on the list provide consulting/support for lucene/nutch...

get back to me with your contact info if you do...

thanks

-bruce





Re: HTML text extraction

2006-06-21 Thread Daniel Noll

Simon Courtenage wrote:
I also use htmlparser, which is rather good.  I've had to customize it,
though, to parse strings containing HTML source rather than accepting URLs
of resources to fetch.  Also, it crashes on meta tags that don't have name
attributes (something I discovered only a couple of days ago).


Actually, it already accepts strings without modifying the library:

   String htmlSource = "...";
   Parser parser = new Parser(new Lexer(htmlSource));

I will have to watch out for those meta tags though.  Time to go test it.

Daniel


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/    Fax: +61 2 9212 6902





Phrase Frequency For Analysis

2006-06-21 Thread Nader Akhnoukh

Hi, I've looked through the archives and it looks like this question has
been asked in one form or another a few times, but without a satisfactory
solution.

I am trying to get the most frequently occurring phrases in a document and
in the index as a whole.  The goal is compare the two to get something like
Amazon's SIPs.

This is straightforward for individual words.  Get the term frequency of
each term in a doc and compare it to the frequency of that term in the
index.  A high ratio indicates that the term appears in this doc much more
than the other docs on average.
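
A sketch of the single-word case, assuming the field was indexed with term
vectors (Field.TermVector.YES) and is named "content":

  TermFreqVector tfv = reader.getTermFreqVector(docId, "content");
  String[] terms = tfv.getTerms();
  int[] freqs = tfv.getTermFrequencies();
  for (int i = 0; i < terms.length; i++) {
      // docFreq is a rough proxy for how common the term is index-wide
      int df = reader.docFreq(new Term("content", terms[i]));
      double ratio = (double) freqs[i] / df;
      // a high ratio means the term is unusually frequent in this doc
  }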

Does anyone have an idea of how to do this with phrases of say 1 to 3 words?

Just to be clear, in this case I am only using Lucene for its built-in
frequency analysis; I'm not actually using it to search for anything that is
indexed.

Thanks,
NSA


Creating initial index using FSDirectory

2006-06-21 Thread Leandro Saad

Hi all. I'm writing an Avalon component that wraps lucene. My problem is
that I can't start the component using FSDirectory unless the index files
are already in place (segments, etc.), or I set the rewrite flag to true.

In my case, I'd like to create the index file structure only the first time I
initialize the component, then reuse the same index for each application
run.

Any help?
--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net


Re: Creating initial index using FSDirectory

2006-06-21 Thread Erick Erickson

From an e-mail from Kent Fitch in the thread "*Detecting index existance* "


Try IndexReader static method indexExists:
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#indexExists(java.lang.String)

Can you condition how you set the rewrite (create!?) flag based upon the
return from this call?
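
A sketch of that (assuming Lucene 1.9/2.0 APIs):

  boolean exists = IndexReader.indexExists(indexPath);
  // create a new index only if one isn't already there
  IndexWriter writer = new IndexWriter(indexPath, analyzer, !exists);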

Hope this helps
Erick


Re: addIndexes() is taking infinite time ...

2006-06-21 Thread heritrix . lucene

hi Otis,
Now this time it took 10 Hr 34 Min. to merge the indexes. During merging i
noticed it was not completey using the CPU. I have 512MB RAM. and here i
found it used upto the 256 MB.

Are there some more possibilities to make it more fast ...

With Regards,



On 6/21/06, heritrix. lucene <[EMAIL PROTECTED]> wrote:


hi,
thanks for your reply.
Now i restarted my application with maxBufferedDocs=10,000.
And i am sorry to say that i was adding those indexes one by one. :-)

Anyway, can you please explain addIndexes()? I want to know what
exactly happens while adding these..

With Regards,



On 6/20/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> If you can tell how many indices you've merged, you must be merging them
> one at a time, and the pre and post merge optimize() calls are costing you.
>
> Also, that maxBufferedDocs looks pretty low.  Unless you are working
> with very large documents and small heap, you should be able to bump that up
> much higher.  I've used 10,000+ in some cases.
>
> Otis
>
> - Original Message 
> From: heritrix.lucene <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Tuesday, June 20, 2006 8:07:19 AM
> Subject: addIndexes() is taking infinite time ...
>
> Hi all,
> I had five different indexes:
> 1 having 15469008 documents
> 2 having 7734504 documents
> 3 having 7734504 documents
> 4 having 7734504 documents
> 5 having 7734504 documents
> Which sums to 46407024.
> The constant values are
> maxMergeFactor = 1000
> maxBufferedDocs = 1000
>
> I wrote a simple program which uses the addIndex method for adding
> indexes.
> It has been more then 32 hours adding the indexes. My logs say upto now
> it
> has finished only first two indexes. It is adding the third one.
> I want to know what exactly happens while merging the indexes?? Why this
> time grows exponentially 
> Can anybody explain this in brief.
>
> Thanks in advance..
> With Regards
>
>
>
>
>
>



Re: addIndexes() is taking infinite time ...

2006-06-21 Thread Daniel Noll

heritrix.lucene wrote:

hi Otis,
Now this time it took 10 Hr 34 Min. to merge the indexes. During merging i
noticed it was not completey using the CPU. I have 512MB RAM. and here i
found it used upto the 256 MB.

Are there some more possibilities to make it more fast ...


Have you tested how fast the same process runs on different storage 
devices?  It sounds like you're bound by IO, and perhaps finding ways to 
speed up the storage (RAID striping?) would help speed things up.


Daniel

--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/    Fax: +61 2 9212 6902





What is a "Lazy Field"...

2006-06-21 Thread heritrix . lucene

Hi,
Can anybody please tell me what a "Lazy Field" is?
I've noticed this term come up several times in discussion...
With Regards,


Re: addIndexes() is taking infinite time ...

2006-06-21 Thread heritrix . lucene

No, I haven't tried. I can try it today. One thing that I am wondering is
what role the file system plays here. I mean, is there any difference
between indexing on FAT32 and indexing on EXT3?
I'll have to find out...
Can anybody shed some light on this?

With regards

On 6/22/06, Daniel Noll <[EMAIL PROTECTED]> wrote:


heritrix.lucene wrote:
> hi Otis,
> Now this time it took 10 Hr 34 Min. to merge the indexes. During merging
i
> noticed it was not completey using the CPU. I have 512MB RAM. and here i
> found it used upto the 256 MB.
>
> Are there some more possibilities to make it more fast ...

Have you tested how fast the same process runs on different storage
devices?  It sounds like you're bound by IO, and perhaps finding ways to
speed up the storage (RAID striping?) would help speed things up.

Daniel

--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/    Fax: +61 2 9212 6902






Re: HTML text extraction

2006-06-21 Thread 张瑾

Please send it to me, thanks very much!

2006/6/21, Liao Xuefeng <[EMAIL PROTECTED]>:


hi,
i wrote my own html parser to do html2text and it works well. i can send
you
my code if it matches your require.

-Original Message-
From: John Wang [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 21, 2006 1:40 PM
To: java-user@lucene.apache.org
Subject: HTML text extraction

Can someone please suggest a HTML text extraction library? In the Lucene
book, it recommends Tidy. Seems jtidy is not really being maintained.

Otis, what do you guys use at Simpy?

Thanks

-john






RE: addIndexes() is taking infinite time ...

2006-06-21 Thread Mike Streeton
From memory, addIndexes() also does an optimization beforehand; this
might be what is taking the time.
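
A sketch of a single merge pass (assuming Lucene 1.9/2.0 APIs; the tuning
values are just the ones mentioned in this thread):

  IndexWriter writer = new IndexWriter(destDir, new StandardAnalyzer(), true);
  writer.setMergeFactor(1000);
  writer.setMaxBufferedDocs(10000);
  // one call with all sources: optimize() then runs once before and once
  // after the merge, instead of per addIndexes() call
  writer.addIndexes(new Directory[] { dir1, dir2, dir3, dir4, dir5 });
  writer.close();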

Mike

www.ardentia.com the home of NetSearch
-Original Message-
From: heritrix.lucene [mailto:[EMAIL PROTECTED] 
Sent: 22 June 2006 05:05
To: java-user@lucene.apache.org
Subject: Re: addIndexes() is taking infinite time ...

No, I haven't tried. I can try it today. One thing that I am wondering is
what role the file system plays here. I mean, is there any difference
between indexing on FAT32 and indexing on EXT3?
I'll have to find out...
Can anybody shed some light on this?

With regards

On 6/22/06, Daniel Noll <[EMAIL PROTECTED]> wrote:
>
> heritrix.lucene wrote:
> > hi Otis,
> > Now this time it took 10 Hr 34 Min. to merge the indexes. During
merging
> i
> > noticed it was not completey using the CPU. I have 512MB RAM. and
here i
> > found it used upto the 256 MB.
> >
> > Are there some more possibilities to make it more fast ...
>
> Have you tested how fast the same process runs on different storage
> devices?  It sounds like you're bound by IO, and perhaps finding ways
to
> speed up the storage (RAID striping?) would help speed things up.
>
> Daniel
>
> --
> Daniel Noll
>
> Nuix Pty Ltd
> Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
> Web: http://www.nuix.com.au/    Fax: +61 2 9212 6902
>
>
>
>
