Re: [jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Thomas Peuss
Hello Steven! Steven Rowe (JIRA) wrote: > > Also, I don't see Swedish among the hyphenation data licenses - is it covered > in some other way? > I have a Swedish grammar file now. If you are interested, drop me a note. It is not that hard to generate them from the TeX files. CU Thomas

[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-06 Thread Steven Rowe (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566447#action_12566447 ] Steven Rowe commented on LUCENE-1157: - Okay - it's available now at: [http://hudson.z

Re: detected corrupted index / performance improvement

2008-02-06 Thread robert engels
That is the problem, waiting for the full sync (of all of the segment files) takes quite a while... syncing a single log file is much more efficient. On Feb 6, 2008, at 9:41 PM, Andrew Zhang wrote: On Feb 7, 2008 7:22 AM, robert engels <[EMAIL PROTECTED]> wrote: That doesn't help, with la

Re: detected corrupted index / performance improvement

2008-02-06 Thread Andrew Zhang
On Feb 7, 2008 7:22 AM, robert engels <[EMAIL PROTECTED]> wrote: > That doesn't help, with lazy writing/buffering by the OS, there is no > guarantee that if the last written block is ok, that earlier blocks > in the file are > > The OS/drive is going to physically write them in the most effici

Re: detected corrupted index / performance improvement

2008-02-06 Thread DM Smith
On Feb 6, 2008, at 6:42 PM, Mark Miller wrote: Hey DM, Just to recap an earlier thread, you need the sync and you need hardware that doesn't lie to you about the result of the sync. Here is an excerpt about Digg running into that issue: "They had problems with their storage system telling

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread J. Delgado
I'm pretty sure that what you describe is the case, especially taking into consideration that PageRank (what drives their search results) is a per-document value that is probably recomputed after some long time interval. I did see a MapReduce algorithm to compute PageRank as well. However I do think

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Andrzej Bialecki
(trimming excessive cc-s) Ning Li wrote: No. I'm curious too. :) On Feb 6, 2008 11:44 AM, J. Delgado <[EMAIL PROTECTED]> wrote: I assume that Google also has distributed index over their GFS/MapReduce implementation. Any idea how they achieve this? I'm pretty sure that MapReduce/GFS/BigTabl

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
One main focus is to provide fault-tolerance in this distributed index system. Correct me if I'm wrong, but I think SOLR-303 is focusing on merging results from multiple shards right now. We'd like to start an open source project for a fault-tolerant distributed index system (or join if one already exi

[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-06 Thread Nigel Daley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566406#action_12566406 ] Nigel Daley commented on LUCENE-1157: - {quote} job/Lucene-trunk/ws/ sounds like a temp

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
No. I'm curious too. :) On Feb 6, 2008 11:44 AM, J. Delgado <[EMAIL PROTECTED]> wrote: > I assume that Google also has distributed index over their > GFS/MapReduce implementation. Any idea how they achieve this? > > J.D. >

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
I work for IBM Research. I read the Rackspace article. Rackspace's Mailtrust has a similar design. Happy to see an existing application on such a system. Do they plan to open-source it? Is the AOL project an open source project? On Feb 6, 2008 11:33 AM, Clay Webster <[EMAIL PROTECTED]> wrote: > >

Re: detected corrupted index / performance improvement

2008-02-06 Thread Mark Miller
Hey DM, Just to recap an earlier thread, you need the sync and you need hardware that doesn't lie to you about the result of the sync. Here is an excerpt about Digg running into that issue: "They had problems with their storage system telling them writes were on disk when they really weren't

Re: detected corrupted index / performance improvement

2008-02-06 Thread robert engels
That doesn't help: with lazy writing/buffering by the OS, there is no guarantee that if the last written block is ok, earlier blocks in the file are too. The OS/drive is going to physically write them in the most efficient manner. Only after a sync would this hold true (which is what we

Re: detected corrupted index / performance improvement

2008-02-06 Thread robert engels
Yes, but this pruning could be more efficient. On a background thread, get the current segment from the segments file, call the system-wide sync (e.g. System.exec("fsync")), then you can purge the transaction logs for all segments up to that one. Since it is a background operation, you are not bloc
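The pruning idea in the message above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not Lucene code: the class name `TxLogPruner` and the file naming scheme (`seg_N` for segment files, `txlog_N` for transaction logs) are hypothetical, and the system-wide sync from the message is swapped for a per-file `FileChannel.force(true)`, which is the standard Java call for forcing a file to storage. As noted elsewhere in this thread, any such sync is only as trustworthy as the hardware reporting it.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; names and file layout are assumptions for illustration.
final class TxLogPruner {

    // Extracts N from a file named "<prefix>N", e.g. "seg_12" -> 12.
    static long numberOf(Path file, String prefix) {
        String name = file.getFileName().toString();
        return Long.parseLong(name.substring(prefix.length()));
    }

    // Lists files in dir whose name matches glob and whose number is <= max.
    static List<Path> upTo(Path dir, String prefix, long max) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, prefix + "*")) {
            for (Path p : stream) {
                if (numberOf(p, prefix) <= max) {
                    result.add(p);
                }
            }
        }
        return result;
    }

    // Syncs all segment files up to currentSegment, then deletes the
    // transaction logs for those segments. Returns the number of logs purged.
    static int syncAndPurge(Path dir, long currentSegment) throws IOException {
        // Step 1: force segment data (and metadata) to storage.
        for (Path seg : upTo(dir, "seg_", currentSegment)) {
            try (FileChannel ch = FileChannel.open(seg, StandardOpenOption.WRITE)) {
                ch.force(true);
            }
        }
        // Step 2: only after the sync has returned are the logs redundant.
        int purged = 0;
        for (Path log : upTo(dir, "txlog_", currentSegment)) {
            Files.delete(log);
            purged++;
        }
        return purged;
    }
}
```

To keep this off the indexing path, as the message suggests, `syncAndPurge` could be submitted to a single-threaded executor (for example one created with `Executors.newSingleThreadExecutor()`), so writers only ever wait on the small log file and never on the full segment sync.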

Re: detected corrupted index / performance improvement

2008-02-06 Thread DM Smith
On Feb 6, 2008, at 5:42 PM, Michael McCandless wrote: robert engels wrote: Do we have any way of determining if a segment is definitely OK/VALID? The only way I know is the CheckIndex tool, and it's rather slow (and it's not clear that it always catches all corruption). Just a thought.

Re: detected corrupted index / performance improvement

2008-02-06 Thread Michael McCandless
robert engels wrote: Do we have any way of determining if a segment is definitely OK/VALID? The only way I know is the CheckIndex tool, and it's rather slow (and it's not clear that it always catches all corruption). If so, a much more efficient transactional system could be developed. S

[jira] Created: (LUCENE-1167) add compatibility statement to README.txt for all contribs

2008-02-06 Thread Hoss Man (JIRA)
add compatibility statement to README.txt for all contribs -- Key: LUCENE-1167 URL: https://issues.apache.org/jira/browse/LUCENE-1167 Project: Lucene - Java Issue Type: Task C

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ian Holsman
Clay Webster wrote: There seem to be a few other players in this space too. Are you from Rackspace? (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data) AOL also has a Hadoop/Solr project going on. CNET does not have much brewing there. Although Yo

Re: [Lucene-java Wiki] Update of "TREC 2007 Million Queries Track - IBM Haifa Team" by PaulElschot

2008-02-06 Thread Paul Elschot
Oh well, I ticked the "remove trailing white space" box. The only real addition is at the end: >* Easier and more efficient ways to add proximity scoring? > +For example specialize Span-Near-Query for the case when all subqueries > are terms. Regards, Paul Elschot

Re: [Lucene-java Wiki] Update of "TREC 2007 Million Queries Track - IBM Haifa Team" by DoronCohen

2008-02-06 Thread Doron Cohen
On Thu, Jan 31, 2008 at 11:09 AM, Doron Cohen <[EMAIL PROTECTED]> wrote: > Hi Otis, > > On Thu, Jan 31, 2008 at 7:21 AM, Otis Gospodnetic < > [EMAIL PROTECTED]> wrote: > > > Doron - this looks super useful! > > Can you give an example for the lexical affinities you mention here? > > ("Juru creates

Re: [Lucene-java Wiki] Update of "TREC 2007 Million Queries Track - IBM Haifa Team" by DoronCohen

2008-02-06 Thread Doron Cohen
Hi Grant, yes I have these combinations - I just updated the wiki page with these numbers. I still have the index as described, allowing us to try other ideas that may come up, or to run more tests (on GOV2 data) if we need them to make better decisions ... Cheers, Doron On Wed, Feb 6, 2008 at 2:15 PM, Grant In

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread J. Delgado
I assume that Google also has distributed index over their GFS/MapReduce implementation. Any idea how they achieve this? J.D. On Feb 6, 2008 11:33 AM, Clay Webster <[EMAIL PROTECTED]> wrote: > > There seem to be a few other players in this space too. > > Are you from Rackspace? > (http://highsc

Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
There have been several proposals for a Lucene-based distributed index architecture. 1) Doug Cutting's "Index Server Project Proposal" at http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html 2) Solr's "Distributed Search" at http://wiki.apache.org/solr/DistributedSearch 3) Mark Bu

[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-06 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566257#action_12566257 ] Doron Cohen commented on LUCENE-1157: - Nice spying work Steven :) I am not familiar w

[jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Thomas Peuss (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566220#action_12566220 ] Thomas Peuss commented on LUCENE-1166: -- bq. Looking at http://offo.sourceforge.net/hy

[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-06 Thread Nigel Daley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566206#action_12566206 ] Nigel Daley commented on LUCENE-1157: - I suggest you save the Changes.html as one of t

[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-06 Thread Steven Rowe (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566198#action_12566198 ] Steven Rowe commented on LUCENE-1157: - Wait! I found it: [http://hudson.zones.apache

[jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Steven Rowe (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566188#action_12566188 ] Steven Rowe commented on LUCENE-1166: - Hi Thomas, Looking at [http://offo.sourceforge

[jira] Updated: (LUCENE-997) Add search timeout support to Lucene

2008-02-06 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-997: --- Attachment: timeout.patch > Add search timeout support to Lucene > --

[jira] Updated: (LUCENE-997) Add search timeout support to Lucene

2008-02-06 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-997: --- Attachment: timeout.patch Attached patch corrects default resolution comment. > Add search timeout s

[jira] Commented: (LUCENE-997) Add search timeout support to Lucene

2008-02-06 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566175#action_12566175 ] Doron Cohen commented on LUCENE-997: Oh, I wrote that comment before I decided to chan

[jira] Commented: (LUCENE-997) Add search timeout support to Lucene

2008-02-06 Thread Sean Timm (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566171#action_12566171 ] Sean Timm commented on LUCENE-997: -- Doron, your comment for setResolution(long) says "The

[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-06 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566167#action_12566167 ] Doron Cohen commented on LUCENE-1157: - I suspected something like this but wasn't sure

[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-06 Thread Steven Rowe (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566163#action_12566163 ] Steven Rowe commented on LUCENE-1157: - If I browse to [http://hudson.zones.apache.org

[jira] Updated: (LUCENE-997) Add search timeout support to Lucene

2008-02-06 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-997: --- Attachment: timeout.patch Sean thanks for adding the test. In the attached I tightened the check of

Re: [Lucene-java Wiki] Update of "TREC 2007 Million Queries Track - IBM Haifa Team" by DoronCohen

2008-02-06 Thread Grant Ingersoll
Hey Doron, I see you recommend that we think about making SweetSpot the default similarity. Do you have numbers for running that alone? Or for that matter, any of the other combinations of #3 individually? Thanks, Grant On Jan 31, 2008, at 4:09 AM, Doron Cohen wrote: Hi Otis,

[jira] Updated: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Thomas Peuss (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Peuss updated LUCENE-1166: - Attachment: hyphenation.dtd The DTD describing the hyphenation grammar XML files. > A tokenfilt

[jira] Updated: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Thomas Peuss (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Peuss updated LUCENE-1166: - Attachment: de.xml A hyphenation grammar. You can download them from: http://downloads.sourcefo

[jira] Created: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Thomas Peuss (JIRA)
A tokenfilter to decompose compound words - Key: LUCENE-1166 URL: https://issues.apache.org/jira/browse/LUCENE-1166 Project: Lucene - Java Issue Type: New Feature Components: Analysis

[jira] Updated: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Thomas Peuss (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Peuss updated LUCENE-1166: - Attachment: CompoundTokenFilter.patch A preliminary version of the token filter. > A tokenfilte