Re: problem found with DiskDocValuesFormat

2013-10-21 Thread Duke DAI
Hi guys,

It seems I have the same problem with Lucene45DocValuesFormat, but no problem with
MemoryDocValuesFormat. The problem I encountered with Lucene 4.4 was with
DiskDocValuesFormat, not with Lucene42DocValuesFormat.

I dug into it a little and found the superficial cause. In SegmentCoreReaders
there is a ThreadLocal variable, docValuesLocal. Its purpose is to avoid
building the data structure repeatedly for each query thread. But what if the
query thread comes from a thread pool and is reused for different queries?
I removed docValuesLocal and built a lucene-core.jar, and it works with my
multi-threaded (thread pool) test cases.

Do you have any idea about this? Is this enough information?


Thanks,
Duke


Best regards,
Duke
If not now, when? If not me, who?


On Tue, Aug 13, 2013 at 4:54 PM, Duke DAI  wrote:

> Hi experts,
>
> I'm upgrading to Lucene 4.4 and trying to use DocValues instead of stored
> fields for performance reasons. Because the index size is unknown (it depends
> on the customer), I will use DiskDocValuesFormat, especially for some binary
> fields. So I wrote my own custom Codec:
>
>   final Codec codec = new Lucene42Codec() {
>
>     private final Lucene42DocValuesFormat memoryDVFormat =
>         new Lucene42DocValuesFormat();
>     private final DiskDocValuesFormat diskDVFormat =
>         new DiskDocValuesFormat();
>
>     @Override
>     public DocValuesFormat getDocValuesFormatForField(String field) {
>       // The binary fields and the numeric node ID field use the disk-based format.
>       if (LucenePluginConstants.INDEX_STORED_RETURNABLE_FIELD.equals(field)
>           || LucenePluginConstants.PAYLOAD_FIELD_NAME.equals(field)
>           || LucenePluginConstants.INDEX_NODE_ID_DOC_VALUE.equals(field)) {
>         return diskDVFormat;
>       } else {
>         return memoryDVFormat;
>       }
>     }
>   };
>   iwc.setCodec(codec);
>
> Here the field LucenePluginConstants.INDEX_NODE_ID_DOC_VALUE is a numeric
> field of type long. The others are binary.
>
> Then I consume the doc values as in the pseudo-code below:
>
>   NumericDocValues nodeIDDocValuesSource =
>       MultiDocValues.getNumericValues(searcher.getIndexReader(),
>           LucenePluginConstants.INDEX_NODE_ID_DOC_VALUE);
>   ...
>   long nodeId = nodeIDDocValuesSource.get(scoreDoc.doc);
>
> I'm then sure that I get a wrong nodeId, which is detected by the upper-level
> logic and treated as data corruption.
>
>
> But if I switch to memoryDVFormat for the long field, everything
> is OK.
>
> Also, to support upgrading legacy data, I keep two index formats, doc values
> or stored fields, selected by version. If I use stored fields, everything is OK.
> So I guess there is a bug in DiskDocValuesFormat with the numeric data type.
> Is it related to byte-aligned numeric compression?
> Or am I using DiskDocValuesFormat incorrectly? It seems to have no other
> parameters.
>
> Sorry that I have no pure Lucene test case yet. I hope someone can shed some
> light on this.
>
>
>
>
> Best regards,
> Duke
> If not now, when? If not me, who?
>


Re: problem found with DiskDocValuesFormat

2013-10-21 Thread Michael McCandless
Can you describe what problem you are actually hitting?

The purpose of docValuesLocal is to hold the per-Thread instance of
each doc values, and re-use it when that thread comes back again
asking for the same doc values.
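
As a rough illustration of that per-thread reuse pattern (a minimal sketch,
not Lucene's actual SegmentCoreReaders code; the class name PerThreadCache,
the loader function, and the field name "nodeId" are all made up for the
example), a ThreadLocal map keyed by field name could look like this:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a per-thread doc values cache; not Lucene's code. */
public class PerThreadCache<V> {
    // Each thread sees its own map, created lazily on that thread's first access.
    private final ThreadLocal<Map<String, V>> local = ThreadLocal.withInitial(HashMap::new);
    // Assumed factory that builds the per-thread instance for a field.
    private final Function<String, V> loader;

    public PerThreadCache(Function<String, V> loader) {
        this.loader = loader;
    }

    /** Returns the calling thread's instance for the field, creating it on first request. */
    public V get(String field) {
        return local.get().computeIfAbsent(field, loader);
    }

    public static void main(String[] args) throws InterruptedException {
        PerThreadCache<Object> cache = new PerThreadCache<>(field -> new Object());

        Object first = cache.get("nodeId");
        Object second = cache.get("nodeId");
        System.out.println("same thread reuses its instance: " + (first == second));

        Object[] fromOtherThread = new Object[1];
        Thread other = new Thread(() -> fromOtherThread[0] = cache.get("nodeId"));
        other.start();
        other.join();
        System.out.println("other thread gets its own instance: " + (first != fromOtherThread[0]));
    }
}
```

The point of the pattern is that a thread coming back for the same field pays
the construction cost only once, while no instance is ever shared between two
threads.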

Mike McCandless

http://blog.mikemccandless.com






Re: problem found with DiskDocValuesFormat

2013-10-21 Thread Duke DAI
Hi Mike,

In my scenario, query threads from a thread pool are used to execute queries,
so each thread is necessarily reused to handle various queries. Since
SegmentCoreReaders uses a ThreadLocal to hold a per-thread instance, I assume
some private state must belong to the given thread (a file offset? I didn't
find any other thread-dependent state); otherwise an object-level instance
would be enough. Thread pools are very common for handling heavy query loads,
so does the ThreadLocal mechanism support reusing a thread for different
queries? After all, the alternatives are costly: creating a new thread per
query is heavy, and cleaning up a ThreadLocal from outside is complicated.
My test shows that NumericDocValues returns a wrong value. It is certainly a
long value, and the upper-level logic can verify whether the value is valid or
not.

As I described in my earlier mail, in Lucene 4.4 Lucene42DocValuesFormat
(in-memory) has no problem, but DiskDocValuesFormat (on-disk) does. Now in
Lucene 4.5, MemoryDocValuesFormat (in-memory) has no problem, but
Lucene45DocValuesFormat (on-disk) does. Coincidence? My test is far more
complex than I described: there are two thread pools, one used to handle the
main query and one used to query sub-collections in parallel, with a proper
RejectedExecutionHandler (currently, if one sub-query is rejected, all
sub-queries are cancelled and failed).

To put it simply: what is the private state of a per-thread NumericDocValues
instance? Can that private state be reused across different queries?


Best regards,
Duke
If not now, when? If not me, who?



Re: problem found with DiskDocValuesFormat

2013-10-21 Thread Michael McCandless
It's perfectly fine, and recommended, to reuse a thread across
different queries (ie, use a thread pool in your app, up above
Lucene).

The ThreadLocals used in SegmentCoreReaders should not interfere or
cause problems with that: they can easily be re-used across queries.
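
To make that concrete, here is a minimal sketch (plain JDK only, not Lucene
code; the Cursor class, the pool size, and the data values are assumptions
made up for the example) of pooled threads reusing a per-thread reader that
carries private state. Each lookup seeks to an absolute position first, so
whatever position an earlier query left behind cannot affect the result:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical illustration of pooled threads reusing per-thread readers. */
public class PooledReaderDemo {

    /** A reader with private mutable state (a position), similar in spirit to a cloned file input. */
    static final class Cursor {
        private final long[] data;
        private int position; // thread-private state that survives between queries

        Cursor(long[] data) {
            this.data = data;
        }

        long get(int docId) {
            position = docId; // absolute seek before every read
            return data[position];
        }
    }

    /** Runs one lookup per document on a small pool and counts wrong answers. */
    static int countMismatches(int docs, int threads) {
        long[] data = new long[docs];
        for (int i = 0; i < docs; i++) {
            data[i] = i * 31L;
        }
        // One Cursor per pool thread, created lazily and reused across tasks.
        ThreadLocal<Cursor> local = ThreadLocal.withInitial(() -> new Cursor(data));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int i = 0; i < docs; i++) {
                final int docId = i;
                Callable<Long> query = () -> local.get().get(docId);
                results.add(pool.submit(query));
            }
            int mismatches = 0;
            for (int i = 0; i < docs; i++) {
                if (results.get(i).get() != data[i]) {
                    mismatches++;
                }
            }
            return mismatches;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(1000, 4));
    }
}
```

If a reduced test like this shows no mismatches while the real application
does, the difference between the two is a good place to start looking.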

Maybe you can boil down the issue you are seeing into a small test case?

Mike McCandless

http://blog.mikemccandless.com



use MMapDirectory with tmpfs?

2013-10-21 Thread Reg
Hi there,

If I put Lucene segments on tmpfs and use MMapDirectory to access them,
would the kernel be so dumb as to load the files from tmpfs into another copy
in the file system cache before mapping them into the virtual address space?
Or does it just map tmpfs into the virtual address space directly? I tend to
believe it's the latter, but I want to confirm with the experts.
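
For reference, the kind of mapping in question can be sketched with the plain
JDK call FileChannel.map (the class name MmapSketch and the sample bytes are
made up for the example). This only shows the mapping call; it does not by
itself answer what the kernel does with tmpfs pages. To try it on tmpfs, pass
a tmpfs directory as the first argument (for example /dev/shm, assuming your
system mounts tmpfs there):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: map a file read-only and read one byte through the mapping. */
public class MmapSketch {

    /** Maps the whole file read-only and returns the byte at the given offset. */
    static byte readMapped(Path file, int offset) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return buffer.get(offset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the bytes to a temporary file in dir, then reads one back through a mapping. */
    static byte writeAndReadBack(Path dir, byte[] bytes, int offset) {
        try {
            Path file = Files.createTempFile(dir, "segment", ".bin");
            try {
                Files.write(file, bytes);
                return readMapped(file, offset);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Defaults to the JVM temp directory; pass a tmpfs directory to test there.
        Path dir = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        System.out.println(writeAndReadBack(dir, new byte[] {10, 20, 30, 40}, 2));
    }
}
```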