Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> Regarding real-time search and Solr, my feeling is the focus should be on
> first adding real-time search to Lucene, and then we'll figure out how to
> incorporate that into Solr later.


Otis, what exactly do you mean by "adding real-time search to Lucene"?  Note
that Lucene, being an indexing/search library (and not a full-blown search
engine), is by definition "real-time": once you add/write a document to the
index it becomes immediately searchable, and once a document is logically
deleted it is no longer returned in a search, though physical deletion happens
during an index optimization.

Now, the problem of adding/deleting documents in bulk, as part of a
transaction, and making these documents available for search immediately
after the transaction is committed sounds more like a search engine problem
(i.e. SOLR, Nutch, Ocean), especially if these transactions are known to be
I/O expensive and thus are usually implemented as batched processes with some
kind of sync mechanism, which makes them non-real-time.

For example, in my previous life, I designed and helped implement a
quasi-realtime enterprise search engine using Lucene, having a set of
multi-threaded indexers hitting a set of multiple indexes allocated across
different search services, which powered a broker-based distributed search
interface. The most recent documents provided to the indexers were always
added to the smaller in-memory (RAM) indexes, which usually could absorb the
load of a bulk "add" transaction; these would later be merged into larger
disk-based indexes and then flushed to make them ready to absorb fresh docs.
We even had further partitioning of the indexes that reflected time periods,
with caps on size, so that they could be merged into older, more archival
indexes which were used less (yes, the search engine's default search was on
data no more than 1 month old, though users could open the time window by
including archives).
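
To make the RAM-then-disk flow above concrete, here is a minimal sketch. It is
not the engine described and uses no Lucene classes: TieredIndexSketch, its
methods and the capacity-based merge trigger are all assumptions made for
illustration, and a plain list stands in for the disk-based index.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical two-tier index: a small RAM tier absorbs adds, a second list stands in for disk. */
class TieredIndexSketch {
    private final int ramCapacity;
    private final List<String> ramTier = new ArrayList<String>();
    private final List<String> diskTier = new ArrayList<String>();

    TieredIndexSketch(int ramCapacity) {
        this.ramCapacity = ramCapacity;
    }

    /** New documents always land in the RAM tier, so they are searchable at once. */
    synchronized void add(String doc) {
        ramTier.add(doc);
        if (ramTier.size() >= ramCapacity) {
            mergeToDisk();
        }
    }

    /** Merge the RAM tier into the larger tier, then clear it so it can absorb fresh docs. */
    synchronized void mergeToDisk() {
        diskTier.addAll(ramTier);
        ramTier.clear();
    }

    /** A search consults both tiers, newest documents first. */
    synchronized List<String> search(String term) {
        List<String> hits = new ArrayList<String>();
        for (int i = ramTier.size() - 1; i >= 0; i--) {
            if (ramTier.get(i).contains(term)) hits.add(ramTier.get(i));
        }
        for (int i = diskTier.size() - 1; i >= 0; i--) {
            if (diskTier.get(i).contains(term)) hits.add(diskTier.get(i));
        }
        return hits;
    }

    synchronized int ramSize() { return ramTier.size(); }
    synchronized int diskSize() { return diskTier.size(); }
}
```

A time-partitioned archive tier would be a third list fed from the second one
in the same way; it is left out to keep the sketch short.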

As for SOLR and OCEAN, I would argue that these semi-structured search
engines are becoming more and more like relational databases with full-text
search capabilities (without the benefit of full relational algebra -- for
example joins are not possible using SOLR). Notice that "real-time" CRUD
operations and transactionality are core DB concepts and have been studied
and developed by database communities for quite a long time. There have been
recent efforts on how to efficiently integrate Lucene into relational
databases (see the Lucene/Oracle JVM integration:
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
)

I think we should seriously look at joining efforts with open-source
database engine projects written in Java (see
http://java-source.net/open-source/database-engines) in order to blend IR
and ORM once and for all.

-- Joaquin



>
>
> I've read Jason's Wiki as well.  Actually, I had to read it a number of
> times to understand bits and pieces of it.  I have to admit there is still
> some fuzziness about the whole thing in my head - is "Ocean" something that
> already works, a separate project on googlecode.com?  I think so.  If so,
> and if you are working on getting it integrated into Lucene, would it make
> it less confusing to just refer to it as "real-time search", so there is no
> confusion?
>
> If this is to be initially integrated into Lucene, why are things like
> replication, crowding/field collapsing, locallucene, name service, tag
> index, etc. all mentioned there on the Wiki and bundled with description of
> how real-time search works and is to be implemented?  I suppose mentioning
> replication kind-of makes sense because the replication approach is closely
> tied to real-time search - all query nodes need to see index changes fast.
>  But Lucene itself offers no replication mechanism, so maybe the replication
> is something to figure out separately, say on the Solr level, later on "once
> we get there".  I think even just the essential real-time search requires
> substantial changes to Lucene (I remember seeing large patches in JIRA),
> which makes it hard to digest, understand, comment on, and ultimately commit
> (hence the lukewarm response, I think).  Bringing other non-essential
> elements into discussion at the same time makes it more difficult to
> process all this new stuff, at least for me.  Am I the only one who finds
> this hard?
>
> That said, it sounds like we have some discussion going (Karl...), so I
> look forward to understanding more! :)
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> - Original Message 
> > From: Yonik Seeley <[EMAIL PROTECTED]>
> > To: java-dev@lucene.apache.org
> > Sent: Thursday, September 4, 2008 10:13:32 AM
> > Subject: Re: Realtime Search for Social Networks Collaboration
> >
> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
> > wrote:
> > > I also think it's got a
> > > lot of things now which m

Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread mark harwood



Interesting discussion.

>>I think we should seriously look at joining efforts with open-source Database 
>>engine projects

I posted some initial dabblings here with a couple of the databases on your 
list: http://markmail.org/message/3bu5klzzc5i6uhl7 but this is not really a 
scalable solution (which is what Jason and others need).

>>for example joins are not possible using SOLR). 

It's largely *because* Lucene doesn't do joins that it can be made to scale 
out. I've replaced two large-scale database systems this year with distributed 
Lucene solutions because this scale-out architecture provided significantly 
better performance. These were "semi-structured" systems too. Lucene's 
comparatively simplistic data model/query model is both a weakness and a 
strength in this regard.


Cheers,
Mark.


  

[jira] Commented: (LUCENE-1131) Add numDeletedDocs to IndexReader

2008-09-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628953#action_12628953
 ] 

Michael McCandless commented on LUCENE-1131:


Hmm -- this breaks back compat (adds new abstract method to IndexReader).

Why don't we fall back to a default impl, in IndexReader, of maxDoc() - numDocs()? 
That patch would be much less invasive, and we wouldn't break back compat.  
maxDoc() is indeed cheap.

> Add numDeletedDocs to IndexReader
> -
>
> Key: LUCENE-1131
> URL: https://issues.apache.org/jira/browse/LUCENE-1131
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Shai Erera
>Assignee: Otis Gospodnetic
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1131.patch
>
>
> Add numDeletedDocs to IndexReader. Basically, the implementation is as simple 
> as doing:
> public int numDeletedDocs() {
>   return deletedDocs == null ? 0 : deletedDocs.count();
> }
> in SegmentReader.
> Patch to follow to include in all IndexReader extensions.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1354) Provide Programmatic Access to CheckIndex

2008-09-07 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1354:
---

Fix Version/s: 2.4

> Provide Programmatic Access to CheckIndex
> -
>
> Key: LUCENE-1354
> URL: https://issues.apache.org/jira/browse/LUCENE-1354
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1354.patch, LUCENE-1354.patch
>
>
> Would be nice to have programmatic access to the CheckIndex tool, so that it 
> can be used in applications like Solr.  
> See SOLR-566




Re: Make auto fix delay configurable in CheckIndex.checkIndex?

2008-09-07 Thread Michael McCandless


OK -- I like that suggestion Andrew, so I incorporated it into new  
patch on LUCENE-1354.  Now, it's CheckIndex's static main() that does  
that sleep, and then calls fix.  This way you can call fix directly  
from your code.
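
A minimal sketch of that split, with the delay living only in the command-line
path. This is not the LUCENE-1354 patch: IndexCheckerSketch and its methods are
hypothetical names (loosely following the IndexChecker idea quoted below), and
the "index" is just a boolean flag.

```java
/** Hypothetical checker; the names here are not the real CheckIndex API. */
class IndexCheckerSketch {
    private boolean corrupt;

    IndexCheckerSketch(boolean corrupt) {
        this.corrupt = corrupt;
    }

    /** Returns true when the index is clean. */
    boolean check() {
        return !corrupt;
    }

    /** Repairs immediately, with no delay, so applications can call it directly. */
    void fix() {
        corrupt = false;
    }

    /** Command-line style entry point: warn, wait, then fix. Only this path sleeps. */
    static boolean checkAndFixWithDelay(IndexCheckerSketch checker, long delayMillis) {
        if (checker.check()) {
            return true;
        }
        System.out.println("Index is corrupt; fixing in " + delayMillis + " ms");
        try {
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            // Preserve the interrupt and leave the index untouched.
            Thread.currentThread().interrupt();
            return false;
        }
        checker.fix();
        return checker.check();
    }
}
```

An application that wants the fix as soon as possible calls check() and fix()
itself and never goes through the delayed path.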


Mike

Andrew Zhang wrote:




On Sat, Sep 6, 2008 at 12:01 AM, Michael McCandless <[EMAIL PROTECTED] 
> wrote:


This definitely makes sense -- there is an issue opened, with  
initial patch, to make programmatic access to CheckIndex possible,  
that may already cover this?


Hi,

Thanks for the information!  It's 
https://issues.apache.org/jira/browse/LUCENE-1354

I took a look at the initial patch, but it still sleeps 5 seconds  
before doing auto fix.


We may make it configurable, or provide a method fix() for end user?  
i.e.


IndexChecker checker = new IndexChecker();
boolean ok = checker.check();
if(!ok) {
  checker.fix(); // or do some other thing?
}


Mike


Andrew Zhang wrote:

Hi,

Currently, CheckIndex.checkIndex sleeps 5 seconds before fixing  
corrupted index. Does it make sense to make it configurable? Some  
applications just want to fix it asap.


--
Best regards,
Andrew Zhang

db4o - database for Android: www.db4o.com
http://zhanghuangzhu.blogspot.com/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




--
Best regards,
Andrew Zhang

db4o - database for Android: www.db4o.com
http://zhanghuangzhu.blogspot.com/






[jira] Updated: (LUCENE-1354) Provide Programmatic Access to CheckIndex

2008-09-07 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1354:
---

Attachment: LUCENE-1354.patch

Hi Grant, the patch looks good!  I tweaked it a bit, to pass all tests, and 
also pulled out a separate fix() method as suggested here:


http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200809.mbox/%3C4d0b24970809061944n5c617b36xc2951d74d989dc42%40mail.gmail.com%3E

If this looks good can you commit for 2.4?

> Provide Programmatic Access to CheckIndex
> -
>
> Key: LUCENE-1354
> URL: https://issues.apache.org/jira/browse/LUCENE-1354
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1354.patch, LUCENE-1354.patch
>
>
> Would be nice to have programmatic access to the CheckIndex tool, so that it 
> can be used in applications like Solr.  
> See SOLR-566




[jira] Commented: (LUCENE-1344) Make the Lucene jar an OSGi bundle

2008-09-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628965#action_12628965
 ] 

Michael McCandless commented on LUCENE-1344:



Thanks Nicolas.  I understand a bit more now :)

One problem: even though I was able to successfully run the above command, the 
resulting MANIFEST.MF in the Lucene core JAR 
(dist/maven/org/apache/lucene/lucene-core/2.3.0/lucene-core-2.3.0.jar) does not 
have any of your added lines (eg Export-Package) -- do you see this too?

{quote}
About the different version schemes, yep, this is yet another one to maintain. 
The version number taken into account in an OSGi environment is 
"Bundle-Version"; I don't know what the header "Specification-Version" is used 
for. I tried to refactor the build system a little to generate the 
version numbers, but I failed; a bigger patch would be needed (I am 
willing to do some if needed).
{quote}
I think it's OK for now if we have to update the versions in 
META-INF/MANIFEST.MF manually as part of the release process?  (It sounds hard 
to get the build to autogen the versions).

> Make the Lucene jar an OSGi bundle
> --
>
> Key: LUCENE-1344
> URL: https://issues.apache.org/jira/browse/LUCENE-1344
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicolas Lalevée
> Fix For: 2.4
>
> Attachments: LUCENE-1344-r679133.patch, LUCENE-1344-r690675.patch, 
> LUCENE-1344-r690691.patch, MANIFEST.MF.diff
>
>
> In order to use Lucene in an OSGi environment, some additional headers are 
> needed in the manifest of the jar. As Lucene has no dependency, it is pretty 
> straightforward and it will be easy to maintain I think.




[jira] Commented: (LUCENE-1131) Add numDeletedDocs to IndexReader

2008-09-07 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628967#action_12628967
 ] 

Shai Erera commented on LUCENE-1131:


What if we implement numDeletedDocs() in IndexReader, instead of defining it 
abstract?
Those that extend IndexReader (outside the scope of the attached patch) can 
then choose to override the implementation or not.

The purpose of the patch is to add an explicit method which developers can use, 
rather than having to understand the logic of maxDoc() - numDocs(). Not all extended 
classes implement it this way BTW. SegmentReader just calls 
deletedDocs.count(), rather than calling the two separate methods.

> Add numDeletedDocs to IndexReader
> -
>
> Key: LUCENE-1131
> URL: https://issues.apache.org/jira/browse/LUCENE-1131
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Shai Erera
>Assignee: Otis Gospodnetic
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1131.patch
>
>
> Add numDeletedDocs to IndexReader. Basically, the implementation is as simple 
> as doing:
> public int numDeletedDocs() {
>   return deletedDocs == null ? 0 : deletedDocs.count();
> }
> in SegmentReader.
> Patch to follow to include in all IndexReader extensions.




[jira] Commented: (LUCENE-1354) Provide Programmatic Access to CheckIndex

2008-09-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628966#action_12628966
 ] 

Grant Ingersoll commented on LUCENE-1354:
-

will do.



--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









> Provide Programmatic Access to CheckIndex
> -
>
> Key: LUCENE-1354
> URL: https://issues.apache.org/jira/browse/LUCENE-1354
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1354.patch, LUCENE-1354.patch
>
>
> Would be nice to have programmatic access to the CheckIndex tool, so that it 
> can be used in applications like Solr.  
> See SOLR-566




Food for Thought: Why Search Engines Choke

2008-09-07 Thread Grant Ingersoll

http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/

Some interesting ideas here on speeding up Lucene. (Thanks to Erik for  
passing me the link)


Note, the paper is comparing against 2.2.  It would be good to put up  
numbers for 2.3, and it might be interesting to look into the ideas  
presented to see if we can learn anything from it 





Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
On Sun, Sep 7, 2008 at 2:41 AM, mark harwood <[EMAIL PROTECTED]>wrote:

>>for example joins are not possible using SOLR).
>
> It's largely *because* Lucene doesn't do joins that it can be made to scale
> out. I've replaced two large-scale database systems this year with
> distributed Lucene solutions because this scale-out architecture provided
> significantly better performance. These were "semi-structured" systems too.
> Lucene's comparatively simplistic data model/query model is both a weakness
> and a strength in this regard.
>

 Hey, maybe the right way to go for a truly scalable and high performance
semi-structured database is to marry HBase (Bigtable-like data storage)
with SOLR/Lucene. I concur with you in the sense that simplistic data models
coupled with high performance are the killer.

Let me quote this from the original Bigtable paper from Google:

" Bigtable does not support a full relational data model; instead, it
provides clients with a simple data model that supports dynamic control over
data layout and format, and allows clients to reason about the locality
properties of the data represented in the underlying storage. Data is
indexed using row and column names that can be arbitrary strings. Bigtable
also treats data as uninterpreted strings, although clients often serialize
various forms of structured and semi-structured data into these strings.
Clients can control the locality of their data through careful choices in
their schemas. Finally, Bigtable schema parameters let clients dynamically
control whether to serve data out of memory or from disk."


Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
BTW, quoting Marcelo Ochoa (the developer behind the Oracle/Lucene
implementation) the three minimal features a transactional DB should support
for Lucene integration are:

  1) The ability to define new functions (e.g. lcontains(), lscore()) which
would allow binding queries to Lucene and obtaining document/row scores
  2) An API that would allow DML intercepts, like  Oracle's ODCI.
  3) The ability to extend and/or implement new types of "domain" indexes
that the engine's query evaluation and execution/optimization planner can
use efficiently.

Thanks Marcelo.

-- Joaquin

On Sun, Sep 7, 2008 at 8:16 AM, J. Delgado <[EMAIL PROTECTED]>wrote:

> On Sun, Sep 7, 2008 at 2:41 AM, mark harwood <[EMAIL PROTECTED]>wrote:
>
>  >>for example joins are not possible using SOLR).
>>
>> It's largely *because* Lucene doesn't do joins that it can be made to
>> scale out. I've replaced two large-scale database systems this year with
>> distributed Lucene solutions because this scale-out architecture provided
>> significantly better performance. These were "semi-structured" systems too.
>> Lucene's comparatively simplistic data model/query model is both a weakness
>> and a strength in this regard.
>>
>
>  Hey, maybe the right way to go for a truly scalable and high performance
> semi-structured database is to marry HBase (Bigtable-like data storage)
> with SOLR/Lucene. I concur with you in the sense that simplistic data models
> coupled with high performance are the killer.
>
> Let me quote this from the original Bigtable paper from Google:
>
> " Bigtable does not support a full relational data model; instead, it
> provides clients with a simple data model that supports dynamic control over
> data layout and format, and allows clients to reason about the locality
> properties of the data represented in the underlying storage. Data is
> indexed using row and column names that can be arbitrary strings. Bigtable
> also treats data as uninterpreted strings, although clients often serialize
> various forms of structured and semi-structured data into these strings.
> Clients can control the locality of their data through careful choices in
> their schemas. Finally, Bigtable schema parameters let clients dynamically
> control whether to serve data out of memory or from disk."
>
>


[jira] Updated: (LUCENE-1366) Rename Field.Index.UN_TOKENIZED/TOKENIZED/NO_NORMS

2008-09-07 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1366:
---

Attachment: LUCENE-1366.patch

OK, this patch switches over all uses of the old names to the new ones.

I plan to commit in a day or two.

> Rename Field.Index.UN_TOKENIZED/TOKENIZED/NO_NORMS
> --
>
> Key: LUCENE-1366
> URL: https://issues.apache.org/jira/browse/LUCENE-1366
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1366.patch, LUCENE-1366.patch
>
>
> There is confusion about these current Field options and I think we
> should rename them, deprecating the old names in 2.4/2.9 and removing
> them in 3.0.  How about this:
> {code}
> TOKENIZED --> ANALYZED
> UN_TOKENIZED --> NOT_ANALYZED
> NO_NORMS --> NOT_ANALYZED_NO_NORMS
> {code}
> Should we also add ANALYZED_NO_NORMS?
> Spinoff from here:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200808.mbox/%3C48a3076a.2679420a.1c53.a5c4%40mx.google.com%3E
> 
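
One way the rename-with-deprecation described above could look, as a sketch
only. IndexOptionSketch is a hypothetical stand-in written as pre-enum constants;
it is not the real Field.Index class and not the attached patch. The old names
are kept as deprecated aliases of the new constants, so existing callers still
compile and both names refer to the same object.

```java
/** Hypothetical stand-in for Field.Index, written as pre-enum style constants. */
final class IndexOptionSketch {
    private final String name;

    private IndexOptionSketch(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return name;
    }

    // New names.
    static final IndexOptionSketch ANALYZED = new IndexOptionSketch("ANALYZED");
    static final IndexOptionSketch NOT_ANALYZED = new IndexOptionSketch("NOT_ANALYZED");
    static final IndexOptionSketch NOT_ANALYZED_NO_NORMS =
            new IndexOptionSketch("NOT_ANALYZED_NO_NORMS");

    // Old names, kept as aliases until they can be removed.
    /** @deprecated use {@link #ANALYZED} */
    @Deprecated
    static final IndexOptionSketch TOKENIZED = ANALYZED;

    /** @deprecated use {@link #NOT_ANALYZED} */
    @Deprecated
    static final IndexOptionSketch UN_TOKENIZED = NOT_ANALYZED;

    /** @deprecated use {@link #NOT_ANALYZED_NO_NORMS} */
    @Deprecated
    static final IndexOptionSketch NO_NORMS = NOT_ANALYZED_NO_NORMS;
}
```

Because the aliases share identity with the new constants, code comparing with
== behaves the same whichever name it was compiled against.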




[jira] Commented: (LUCENE-1131) Add numDeletedDocs to IndexReader

2008-09-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628991#action_12628991
 ] 

Michael McCandless commented on LUCENE-1131:


bq. What if we implement numDeletedDocs() in IndexReader, instead of defining 
it abstract?

Right, that's exactly what I'm thinking, with this body:

{code}
public int numDeletedDocs() {
  return maxDoc() - numDocs();
}
{code}

Then I think no classes need to override it (perf cost of calling 2 methods is 
tiny)?
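
For illustration, a self-contained sketch of why a concrete default keeps back
compat. ReaderSketch and its subclasses are hypothetical stand-ins, not the real
IndexReader hierarchy: a subclass written before the method existed inherits the
default, while a subclass that tracks deletions directly can still override it.

```java
/** Hypothetical stand-in for IndexReader; not the real Lucene class. */
abstract class ReaderSketch {
    abstract int maxDoc();
    abstract int numDocs();

    /** Concrete default, so existing subclasses keep compiling unchanged. */
    int numDeletedDocs() {
        return maxDoc() - numDocs();
    }
}

/** A subclass written before the new method existed: it inherits the default. */
class LegacyReaderSketch extends ReaderSketch {
    int maxDoc() { return 10; }
    int numDocs() { return 7; }
}

/** A subclass that tracks deletions itself may still override the default. */
class CountingReaderSketch extends ReaderSketch {
    private final boolean[] deleted = new boolean[10];

    CountingReaderSketch() {
        deleted[2] = true;
        deleted[5] = true;
    }

    int maxDoc() { return deleted.length; }

    int numDocs() { return maxDoc() - numDeletedDocs(); }

    @Override
    int numDeletedDocs() {
        int count = 0;
        for (boolean d : deleted) {
            if (d) count++;
        }
        return count;
    }
}
```

Had numDeletedDocs() been declared abstract instead, LegacyReaderSketch would no
longer compile, which is the back-compat break being avoided.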

> Add numDeletedDocs to IndexReader
> -
>
> Key: LUCENE-1131
> URL: https://issues.apache.org/jira/browse/LUCENE-1131
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Shai Erera
>Assignee: Otis Gospodnetic
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1131.patch
>
>
> Add numDeletedDocs to IndexReader. Basically, the implementation is as simple 
> as doing:
> public int numDeletedDocs() {
>   return deletedDocs == null ? 0 : deletedDocs.count();
> }
> in SegmentReader.
> Patch to follow to include in all IndexReader extensions.




[jira] Resolved: (LUCENE-1369) Eliminate unnecessary uses of Hashtable and Vector

2008-09-07 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1369.


   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Thanks DM!

> Eliminate unnecessary uses of Hashtable and Vector
> --
>
> Key: LUCENE-1369
> URL: https://issues.apache.org/jira/browse/LUCENE-1369
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.3.2
>Reporter: DM Smith
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1369.patch
>
>
> Lucene uses Vector, Hashtable and Enumeration when it doesn't need to. 
> Changing to ArrayList and HashMap may provide better performance.
> There are a few places Vector shows up in the API. IMHO, List should have 
> been used for parameters and return values.
> There are a few distinct usages of these classes:
> # internal but with ArrayList or HashMap would do as well. These can simply 
> be replaced.
> # internal and synchronization is required. Either leave as is or use a 
> collections synchronization wrapper.
> # As a parameter to a method where List or Map would do as well. For contrib, 
> just replace. For core, deprecate current and add new method signature.
> # Generated by JavaCC. (All *.jj files.) Nothing to be done here.
> # As a base class. Not sure what to do here. (Only applies to SegmentInfos 
> extends Vector, but it is not used in a safe manner in all places. Perhaps, 
> implements List would be better.)
> # As a return value from a package protected method, but synchronization is 
> not used. Change return type.
> # As a return value to a final method. Change to List or Map.
> In using a Vector the following iteration pattern is frequently used.
> for (int i = 0; i < v.size(); i++) {
>   Object o = v.elementAt(i);
> }
> This is an indication that synchronization is unimportant. The list could 
> change during iteration.
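
A short sketch of the replacement patterns listed in the issue above, under the
assumption that plain java.util collections are enough to show them.
CollectionSwapSketch and its methods are hypothetical and are not taken from the
attached patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Vector;

/** Hypothetical helper illustrating the replacement patterns; not Lucene code. */
class CollectionSwapSketch {

    /** Old pattern: each elementAt() call locks the Vector, but the loop as a whole is unguarded. */
    static int totalLengthOld(Vector<String> v) {
        int total = 0;
        for (int i = 0; i < v.size(); i++) {
            total += v.elementAt(i).length();
        }
        return total;
    }

    /** Replacement: accept the List interface, so ArrayList and Vector callers both work. */
    static int totalLength(List<String> items) {
        int total = 0;
        for (String s : items) {
            total += s.length();
        }
        return total;
    }

    /** Where synchronization really is required, hold the lock across the whole iteration. */
    static int totalLengthShared(List<String> shared) {
        synchronized (shared) {
            return totalLength(shared);
        }
    }

    /** A synchronized wrapper around ArrayList, for the cases that need one. */
    static List<String> newSharedList() {
        return Collections.synchronizedList(new ArrayList<String>());
    }
}
```

Since Vector implements List, widening a parameter type to List does not break
callers that still pass a Vector.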




Re: Make auto fix delay configurable in CheckIndex.checkIndex?

2008-09-07 Thread Andrew Zhang
On Sun, Sep 7, 2008 at 6:54 PM, Michael McCandless <
[EMAIL PROTECTED]> wrote:

>
> OK -- I like that suggestion Andrew, so I incorporated it into new patch on
> LUCENE-1354.  Now, it's CheckIndex's static main() that does that sleep, and
> then calls fix.  This way you can call fix directly from your code.


Great! I see the fix in the patch. Thanks a lot, Mike!

>
>
> Mike
>
>
> Andrew Zhang wrote:
>
>
>>
>> On Sat, Sep 6, 2008 at 12:01 AM, Michael McCandless <
>> [EMAIL PROTECTED]> wrote:
>>
>> This definitely makes sense -- there is an issue opened, with initial
>> patch, to make programmatic access to CheckIndex possible, that may already
>> cover this?
>>
>> Hi,
>>
>> Thanks for the information!  It's
>> https://issues.apache.org/jira/browse/LUCENE-1354
>>
>> I took a look at the initial patch, but it still sleeps 5 seconds before
>> doing auto fix.
>>
>> We may make it configurable, or provide a method fix() for end user? i.e.
>>
>> IndexChecker checker = new IndexChecker();
>> boolean ok = checker.check();
>> if(!ok) {
>>  checker.fix(); // or do some other thing?
>> }
>>
>>
>> Mike
>>
>>
>> Andrew Zhang wrote:
>>
>> Hi,
>>
>> Currently, CheckIndex.checkIndex sleeps 5 seconds before fixing corrupted
>> index. Does it make sense to make it configurable? Some applications just
>> want to fix it asap.
>>
>> --
>> Best regards,
>> Andrew Zhang
>>
>> db4o - database for Android: www.db4o.com
>> http://zhanghuangzhu.blogspot.com/
>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>>
>>
>> --
>> Best regards,
>> Andrew Zhang
>>
>> db4o - database for Android: www.db4o.com
>> http://zhanghuangzhu.blogspot.com/
>>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-- 
Best regards,
Andrew Zhang

db4o - database for Android: www.db4o.com
http://zhanghuangzhu.blogspot.com/


Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread Otis Gospodnetic
Hi,


- Original Message 
From: J. Delgado <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Sunday, September 7, 2008 4:04:58 AM
Subject: Re: Realtime Search for Social Networks Collaboration


On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

Regarding real-time search and Solr, my feeling is the focus should be on first 
adding real-time search to Lucene, and then we'll figure out how to incorporate 
that into Solr later.
 
Otis, what exactly do you mean by "adding real-time search to Lucene"?  Note 
that Lucene, being an indexing/search library (and not a full-blown search 
engine), is by definition "real-time": once you add/write a document to the 
index it becomes immediately searchable, and once a document is logically deleted 
it is no longer returned in a search, though physical deletion happens during an 
index optimization.

OG: When I think about real-time search I see it as: "Make the newly added 
document show up in search results without closing and reopening the whole 
index with IndexWriter.  In other words, minimize re-reading of the 
old/unchanged data just to be able to see the newly added data."

I believe this is similar to what IndexReader.reopen does and Jason does 
make use of it.

Otis


Now, the problem of adding/deleting documents in bulk, as part of a transaction, 
and making these documents available for search immediately after the 
transaction is committed sounds more like a search engine problem (i.e. SOLR, 
Nutch, Ocean), especially if these transactions are known to be I/O expensive 
and thus are usually implemented as batched processes with some kind of sync 
mechanism, which makes them non-real-time.

For example, in my previous life, I designed and helped implement a 
quasi-realtime enterprise search engine using Lucene, having a set of 
multi-threaded indexers hitting a set of multiple indexes allocated across 
different search services, which powered a broker-based distributed search 
interface. The most recent documents provided to the indexers were always added 
to the smaller in-memory (RAM) indexes, which usually could absorb the load of 
a bulk "add" transaction; these would later be merged into larger disk-based 
indexes and then flushed to make them ready to absorb fresh docs. We even 
had further partitioning of the indexes that reflected time periods, with caps 
on size, so that they could be merged into older, more archival indexes which were 
used less (yes, the search engine's default search was on data no more than 1 
month old, though users could open the time window by including archives).

As for SOLR and OCEAN, I would argue that these semi-structured search engines 
are becoming more and more like relational databases with full-text search 
capabilities (without the benefit of full relational algebra -- for example 
joins are not possible using SOLR). Notice that "real-time" CRUD operations and 
transactionality are core DB concepts and have been studied and developed by 
database communities for quite a long time. There have been recent efforts on how 
to efficiently integrate Lucene into relational databases (see the Lucene/Oracle 
JVM integration: 
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html)

I think we should seriously look at joining efforts with open-source database 
engine projects written in Java (see 
http://java-source.net/open-source/database-engines) in order to blend IR and 
ORM once and for all.

-- Joaquin 
 
 


I've read Jason's Wiki as well.  Actually, I had to read it a number of times 
to understand bits and pieces of it.  I have to admit there is still some 
fuzziness about the whole thing in my head - is "Ocean" something that already 
works, a separate project on googlecode.com?  I think so.  If so, and if you 
are working on getting it integrated into Lucene, would it make it less 
confusing to just refer to it as "real-time search", so there is no confusion?

If this is to be initially integrated into Lucene, why are things like 
replication, crowding/field collapsing, locallucene, name service, tag index, 
etc. all mentioned there on the Wiki and bundled with description of how 
real-time search works and is to be implemented?  I suppose mentioning 
replication kind-of makes sense because the replication approach is closely 
tied to real-time search - all query nodes need to see index changes fast.  But 
Lucene itself offers no replication mechanism, so maybe the replication is 
something to figure out separately, say on the Solr level, later on "once we 
get there".  I think even just the essential real-time search requires 
substantial changes to Lucene (I remember seeing large patches in JIRA), which 
makes it hard to digest, understand, comment on, and ultimately commit (hence 
the lukewarm response, I think).  Bringing other non-essential elements into 
the discussion at the same time makes it more difficult to 
process all this new stuff, at least for

[jira] Created: (LUCENE-1378) Remove remaining @author references

2008-09-07 Thread Otis Gospodnetic (JIRA)
Remove remaining @author references
---

        Key: LUCENE-1378
        URL: https://issues.apache.org/jira/browse/LUCENE-1378
    Project: Lucene - Java
 Issue Type: Task
   Reporter: Otis Gospodnetic
   Priority: Trivial
    Fix For: 2.4
Attachments: LUCENE-1378.patch

$ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl -pi -e 's/ [EMAIL PROTECTED]//'


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1378) Remove remaining @author references

2008-09-07 Thread Otis Gospodnetic (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated LUCENE-1378:
-

Attachment: LUCENE-1378.patch




[jira] Commented: (LUCENE-1378) Remove remaining @author references

2008-09-07 Thread Paul Elschot (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629088#action_12629088 ]

Paul Elschot commented on LUCENE-1378:
--

The patch of 20080907 has some commented code added in SweetSpotSimilarityTest, 
probably unwanted.
Also, author lines are replaced by empty comment lines; perhaps it's better to 
remove these lines completely. I didn't see any place where that could go wrong 
by changing the perl substitute command to do so, and the compiler would find 
such possible comment errors anyway.
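
For illustration only, the "remove the line completely" variant could look 
roughly like the Java sketch below. This is not the patch and not the perl 
command from the issue: the class name AuthorLineStripper and the regular 
expression are assumptions. It drops a line only when the @author tag is the 
sole content after an optional leading "*", and leaves every other line 
untouched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: drops Javadoc lines that contain nothing but an @author tag. */
public class AuthorLineStripper {
    // Assumed pattern: optional whitespace, optional '*', optional whitespace, then "@author ...".
    private static final Pattern AUTHOR_LINE =
            Pattern.compile("^\\s*\\*?\\s*@author\\b.*$");

    /** Returns a copy of the input without the pure @author lines. */
    public static List<String> strip(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!AUTHOR_LINE.matcher(line).matches()) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

A one-line comment such as "/** @author X */" does not match this pattern and 
is kept as is, which avoids deleting the comment's opening or closing 
delimiter along with the tag.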




