ex
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20pa
rse the OS hasn't swapped your RAM out to your SSD ;).
> Column-stride fields (aka per-document Payloads)
>
>
> Key: LUCENE-1231
> URL: https://issues.apache.org/jira/browse/LUCENE-1231
>
that significant for systems
where the index fits into RAM, or when the persistent storage device is an
SSD. And of course a different caching strategy altogether (popular document
caching) is best for dedicated doc servers.
n it. Loading a mix of cached/uncached fields is
massive win, it becomes even more massive if all required fields happen to be
cached.
ient
to call document() somewhere and get all fields back.
ents: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.0
>
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-docu
ense.
becomes). Ie, when possible, that method should maybe pull
from CSFs for values.
and
throw exceptions if you don't consume the number of bytes you should
consume.
{quote}
I generally prefer liberal use of asserts to trip bugs like this,
instead of explicit strongly divorced code paths / classes / modes
etc., containing real if stateme
e
DataInput/Output patch.
or 3.0 to overhaul the
document/field/fieldinfos APIs. I have some ideas which I started
hacking during a long flight. I'll try to summarize the ideas/goals
I'd have for such a new API and send it to java-dev.
shouldn't block the 2.4 release.
these "combinations". But I haven't wrapped my brain
around what all this will entail... it's a biggie!
{quote}
BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a
reasonable API for updating sparse fields.
{quote}
I like that!
e.g., no-freqs, no-positions and
(perhaps) updateable.
BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a
reasonable API for updating sparse fields.
tc.
ring of position and freq optional for
a field? Then one could have an indexed field with a payload or boost but with
no freq (or positions, since freq is required for positions). Would that be
equivalent?
<[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Friday, 14 March, 2008 7:57:24 AM
Subject: [jira] Created: (LUCENE-1231) Column-stride fields (aka per-document
Payloads)
Column-stride fields (aka per-document Payloads)
Key: LUC
Column-stride fields (aka per-document Payloads)
Key: LUCENE-1231
URL: https://issues.apache.org/jira/browse/LUCENE-1231
Project: Lucene - Java
Issue Type: New Feature
Components
> That may be a little too seamless. We want the user to have specific
> control over which fields are efficiently stored separately since they
> will know how that field will be used.
Maybe let users decide field families, like the column families in BigTable?
--
On Monday 29 October 2007, Michael McCandless wrote:
> "Michael Busch" <[EMAIL PROTECTED]> wrote:
> > Michael McCandless wrote:
> > > Michael, are you thinking that the storage would/could be non-sparse
> > > (like norms), and loaded/cached once in memory, especially for fixed
> > > size fields?
"Michael Busch" <[EMAIL PROTECTED]> wrote:
> Michael McCandless wrote:
> >
> > Michael, are you thinking that the storage would/could be non-sparse
> > (like norms), and loaded/cached once in memory, especially for fixed
> > size fields? EG a big array of ints of length maxDocID? In John's
> >
Michael McCandless wrote:
>
> Michael, are you thinking that the storage would/could be non-sparse
> (like norms), and loaded/cached once in memory, especially for fixed
> size fields? EG a big array of ints of length maxDocID? In John's
> original case, every doc has this UID int field; I think
> Michael Busch wrote:
>
> > Doug Cutting wrote:
> >
> > If this is really required, perhaps it ought to appear as an
> > attribute for stored fields, indicating that the field should be
> > stored in a separate "column store". This would permit efficient
> > enumeration of values of just that f
Doug Cutting wrote:
>
> If this is really required, perhaps it ought to appear as an attribute
> for stored fields, indicating that the field should be stored in a
> separate "column store". This would permit efficient enumeration of
> values of just that field.
>
Yes I was thinking about this
Hi Michael:
After removing isDelete(), the index loads in 430 ms.
Thanks
-john
On 10/21/07, Michael Busch <[EMAIL PROTECTED]> wrote:
>
> John Wang wrote:
>
> >
> > Since all three methods load docids into an int[], the lookup time is the
> > same for all three methods, what's
> > differen
next term,
seek to the current position in a file, etc. Profiling should show if
we've missed obvious optimizations for this case.
I was therefore thinking about adding per-document payloads to Lucene
If this is really required, perhaps it ought to appear as an attribute
for stored fields,
John Wang wrote:
>
> Since all three methods load docids into an int[], the lookup time is the
> same for all three methods, what's
> different are the load times:
>
> 1) 16.5 seconds, 43 MB
> 2) 590 milliseconds 32.5 MB
> 3) 186 milliseconds 26MB
Good analysis! Thanks for sharing th
On 10/20/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> I would think the typical use case would be you want all the
> "small" fields to be returned w/ the document and the large fields to
> be lazily loaded. I think it should be seamless to the user.
That may be a little too seamless. We want
>> I/O seeks (one for term lookup + one to open the posting list).
> >>
> >> In my app it took for a big index several minutes to fill the cache like
> >> that.
> >>
> >> To speed things up I did essentially what Ning suggested. Now I store
On Oct 19, 2007, at 3:53 PM, Michael Busch wrote:
The next question would be how to store the per-doc payloads (PDP). If
all values have the same length (as the unique docIds), then we should
store them as efficiently as possible, like the norms. However, we still
want to offer the flexibilit
On Oct 20, 2007, at 12:49 PM, Michael Busch wrote:
In fact, what I'm proposing is a new kind of posting list.
http://www.rectangular.com/pipermail/kinosearch/2007-July/001096.html
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Grant Ingersoll wrote:
>
> Some randomly pieced together thoughts (I may not even be fully awake
> yet :-) so feel free to tell me I'm not understanding this correctly)
>
> My first thought was how is this different from just having a binary
> field, but if I understand correctly it is to be sto
On Oct 20, 2007, at 10:51 AM, Yonik Seeley wrote:
On 10/20/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
I think one of the questions that will come up from users is when
should I use addMetadata and when should I use addField? Why make
the distinction to the user? Fields have always repres
https://issues.apache.org/jira/browse/LUCENE-510 is related, then, I presume
On Oct 20, 2007, at 11:09 AM, Yonik Seeley wrote:
On 10/20/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
What about switching from char
counts to byte counts for indexed (String) fields that are stored
separately?
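
To make the char-count vs. byte-count question concrete, here is a rough Java sketch. It assumes a simple layout in which every stored value is written as a 4-byte length followed by its UTF-8 bytes; the class name ByteCountedStrings and this layout are made up for illustration and are not Lucene's actual stored-fields format. The point it tries to show: when the prefix is a byte count, a reader can step over a value it does not need by advancing its position, whereas with a char count it would have to decode every character to find where the value ends.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: not Lucene's stored-fields format.
final class ByteCountedStrings {

    // Writes each value as [4-byte UTF-8 byte count][UTF-8 bytes].
    static byte[] write(String... values) {
        byte[][] encoded = new byte[values.length][];
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
            total += Integer.BYTES + encoded[i].length;
        }
        ByteBuffer out = ByteBuffer.allocate(total);
        for (byte[] utf8 : encoded) {
            out.putInt(utf8.length); // byte count, not char count
            out.put(utf8);
        }
        return out.array();
    }

    // Returns the value at position index. Earlier values are skipped by
    // moving the buffer position; their bytes are never decoded.
    static String read(byte[] data, int index) {
        ByteBuffer in = ByteBuffer.wrap(data);
        for (int i = 0; i < index; i++) {
            int len = in.getInt();
            in.position(in.position() + len);
        }
        byte[] utf8 = new byte[in.getInt()];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = write("h\u00e9llo", "w\u00f6rld", "plain");
        System.out.println(read(data, 2));
    }
}
```

The same skip would not work with a char-count prefix, because one char may take one, two or three bytes in UTF-8, so the end of a value is only known after decoding it.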
On 10/20/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> What about switching from char
> counts to byte counts for indexed (String) fields that are stored
> separately?
In fact, what about switching to byte counts for all stored fields?
It should be much easier than the full-blown byte-counts for t
On 10/20/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> I think one of the questions that will come up from users is when
> should I use addMetadata and when should I use addField? Why make
> the distinction to the user? Fields have always represented
> metadata, all you're doing is optimizing th
e stored with a fixed size,
which means both random access and sequential scan are optimal. Norms
are also cached in memory, and filling that cache is much faster
compared to the current FieldCache approach.
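
A small Java sketch of why a fixed size makes both access patterns cheap. It assumes one 4-byte int per document and uses an in-memory buffer as a stand-in for the index file; the class name FixedWidthColumn and its methods are invented for illustration and are not Lucene's norms code or API.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only: not Lucene's norms format or API.
final class FixedWidthColumn {
    private static final int WIDTH = Integer.BYTES; // assumed: one 4-byte int per document
    private final ByteBuffer data;                  // stands in for a file in the index
    private final int maxDoc;

    FixedWidthColumn(int[] valuesByDocId) {
        maxDoc = valuesByDocId.length;
        data = ByteBuffer.allocate(maxDoc * WIDTH);
        for (int value : valuesByDocId) {
            data.putInt(value);
        }
    }

    // Random access: the offset is computed from the docId, nothing is scanned.
    int get(int docId) {
        if (docId < 0 || docId >= maxDoc) {
            throw new IndexOutOfBoundsException("docId " + docId);
        }
        return data.getInt(docId * WIDTH);
    }

    // Filling a cache: a single front-to-back pass, with no term lookups or seeks.
    int[] loadAll() {
        int[] cache = new int[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            cache[doc] = data.getInt(doc * WIDTH);
        }
        return cache;
    }

    public static void main(String[] args) {
        FixedWidthColumn uids = new FixedWidthColumn(new int[] {1001, 1002, 1003});
        System.out.println(uids.get(1));
        System.out.println(uids.loadAll().length);
    }
}
```

With variable-length values the offset of a document could no longer be computed from its docId alone, so some extra lookup structure would be needed; that trade-off is what the fixed-size case avoids.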
I was therefore thinking about adding per-document payloads to Lucene
(we can also c
plementation, but it still can be improved. In fact, we already have a
> mechanism for doing that: the norms. Norms are stored with a fixed size,
> which means both random access and sequential scan are optimal. Norms
> are also cached in memory, and filling that cache is much faster
posting list of the specific term, then it is just a
>> sequential scan to load all values. With this approach the time for
>> filling the cache went down from minutes to seconds!
>>
>> Now this approach is already much better than the current field cache
>> implementat
that: the norms. Norms are stored with a fixed size,
> which means both random access and sequential scan are optimal. Norms
> are also cached in memory, and filling that cache is much faster
> compared to the current FieldCache approach.
>
> I was therefore thinking about adding per-
and sequential scan are optimal. Norms
are also cached in memory, and filling that cache is much faster
compared to the current FieldCache approach.
I was therefore thinking about adding per-document payloads to Lucene
(we can also call it document-metadata). The API could look like this:
D