Re: Handling Indian regional languages

2023-01-23 Thread Kumaran Ramasubramanian
Hi Robert Muir, we will check on this. Thanks a lot for the pointers.

--
Kumaran R



On Mon, Jan 16, 2023 at 11:16 PM Robert Muir  wrote:

> On Tue, Jan 10, 2023 at 2:04 AM Kumaran Ramasubramanian
>  wrote:
> >
> > For handling Indian regional languages, what is the advisable approach?
> >
> > 1. Indexing each language data(Tamil, Hindi etc) in specific fields like
> > content_tamil, content_hindi with specific per field Analyzer like Tamil
> > for content_tamil, HindiAnalyzer for content_hindi?
>
> You don't need to do this just to tokenize. You only need to do this
> if you want to do something fancier on top (e.g. stemming and so on).
> If you look at newer lucene versions, there are more analyzers for
> more languages.
>
> >
> > 2. Indexing all language data in the same field but handling tokenization
> > with specific unicode range(similar to THAI) in tokenizer like mentioned
> > below..
> >
> > THAI   = [\u0E00-\u0E59]
> > > TAMIL  = [\u0B80-\u0BFF]
> > > // basic word: a sequence of digits & letters (includes Thai to enable
> > > ThaiAnalyzer to function)
> > > ALPHANUM   = ({LETTER}|{THAI}|{TAMIL}|[:digit:])+
>
> Don't do this: Just use StandardTokenizer instead of ClassicTokenizer.
> StandardTokenizer can tokenize all the Indian writing systems
> out-of-box.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
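The reason StandardTokenizer handles these scripts without custom ranges is that it implements UAX#29 word-boundary rules, and the JDK likewise already classifies Indic and Thai codepoints as letters. A stdlib-only sketch (no Lucene required; the sample codepoints are chosen for illustration):

```java
public class IndicLetters {
    public static void main(String[] args) {
        // Tamil, Devanagari, and Thai base letters are plain Unicode letters,
        // so a UAX#29 tokenizer segments them without hand-maintained
        // \u0B80-\u0BFF style ranges.
        int[] samples = {0x0BA4 /* TAMIL LETTER TA */,
                         0x0915 /* DEVANAGARI LETTER KA */,
                         0x0E01 /* THAI CHARACTER KO KAI */};
        for (int cp : samples) {
            System.out.printf("U+%04X %s isLetter=%b%n",
                    cp, Character.UnicodeBlock.of(cp), Character.isLetter(cp));
        }
    }
}
```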


Handling Indian regional languages

2023-01-09 Thread Kumaran Ramasubramanian
For handling Indian regional languages, what is the advisable approach?

1. Indexing each language's data (Tamil, Hindi, etc.) in language-specific
fields such as content_tamil and content_hindi, with a matching per-field
analyzer (TamilAnalyzer for content_tamil, HindiAnalyzer for content_hindi)?

2. Indexing all languages' data in the same field, but handling tokenization
with script-specific Unicode ranges (similar to THAI) in the tokenizer, as
below:

THAI   = [\u0E00-\u0E59]
TAMIL  = [\u0B80-\u0BFF]
// basic word: a sequence of digits & letters (includes Thai to enable
// ThaiAnalyzer to function)
ALPHANUM   = ({LETTER}|{THAI}|{TAMIL}|[:digit:])+


Note: I am using Lucene 4.10.4, but I am also open to suggestions based on
the latest Lucene versions as well as Lucene 4.


--
Kumaran R


Re: currency based search using query time calculated field match with expression

2021-09-05 Thread Kumaran Ramasubramanian
Thanks a lot for your input, Michael. I will read up on FunctionQuery.
Thanks again :-)

--
Kumaran R
Chennai, India




On Fri, Sep 3, 2021 at 9:22 PM Michael Sokolov  wrote:

> Sorry I'm not sure I understand what you're trying to do. Maybe you
> want to match a document having a computed value? This is going to be
> potentially costly, potentially requiring post-filtering of all hits
> matching for other reasons. I think there is a
> FunctionQuery/FunctionRangeQuery that might help, but I don't have
> much experience with this API, so I'm not sure. If you want useful
> suggestions, you need to be much more explicit about your use case,
> what you've tried, why it didn't work, etc.
>
> On Fri, Sep 3, 2021 at 6:08 AM Kumaran Ramasubramanian
>  wrote:
> >
> > Hi Michael, Thanks for the response.
> >
> > Based on my understanding, we can use the expressions module in lucene to
> > reorder search results using custom score calculations based on
> expression
> > using stored fields.
> >
> > But i am not sure how to do the same for lucene document hits(doc hits
> > matching 2 USD with 150 INR records). Any pointers to know about this in
> > detail?
> >
> >
> > Kumaran R
> > Chennai, India
> >
> >
> >
> > On Fri, Sep 3, 2021 at 12:08 AM Michael Sokolov 
> wrote:
> >
> > > Have you looked at the expressions module? It provides support for
> > > user-defined computation using values from the index based on a simple
> > > expression language. It might prove useful to you if the exchange rate
> > > needs to be tracked very dynamically.
> > >
> > > On Thu, Sep 2, 2021 at 2:15 PM Kumaran Ramasubramanian
> > >  wrote:
> > > >
> > > > I am having one use case regarding currency based search. I want to
> get
> > > any
> > > > suggestions or pointers..
> > > >
> > > > For example,
> > > > Assume,
> > > > 1USD = 75 INR
> > > > 1USD = 42190 IRR
> > > > similarly, we have support for 100 currencies as of now.
> > > >
> > > > Record1 created with PRICE 150 INR & EXCHANGE_RATE 75 for USD
> > > > Record2 created with PRICE 84380 IRR & EXCHANGE_RATE 42190 for USD
> > > >
> > > > If i search 2 ( USD ), I would like to get both Record1 & Record2 as
> > > search
> > > > results
> > > >
> > > > PRICE & EXCHANGE_RATE are indexed & stored as separate fields in the
> > > search
> > > > index
> > > > We can have 50 number of currency fields like PRICE. so we may need
> to
> > > > index additional 50 fields holding USD values.
> > > >
> > > > To avoid additional fields, Is it possible to match records in the
> search
> > > > index by applying an expression like (PRICE / EXCHANGE_RATE )
> > > >
> > > > I am not sure if this is the right use case for Lucene index. But I
> would
> > > > like to know the possibilities. Thanks in advance
> > > >
> > > >
> > > > --
> > > > Kumaran R
> > > > Chennai, India
> > >


Re: currency based search using query time calculated field match with expression

2021-09-03 Thread Kumaran Ramasubramanian
Hi Michael, thanks for the response.

Based on my understanding, we can use the expressions module in Lucene to
reorder search results with custom score calculations, based on an
expression over stored fields.

But I am not sure how to do the same for matching Lucene document hits
(i.e., making a query for 2 USD match the 150 INR records). Any pointers
on this?
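The reordering half is straightforward to picture. Below is a stdlib stand-in for what the expressions module does; in Lucene this would be roughly JavascriptCompiler.compile("price/rate") bound to doc-value fields and wrapped in a SortField (the record fields here are hypothetical):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class ExpressionSortSketch {
    // Hypothetical stand-in for a Lucene document with two numeric fields.
    record Doc(String id, double price, double rate) {}

    public static void main(String[] args) {
        // The "expression" price/rate becomes a per-document sort key,
        // which is what expression-based reordering amounts to.
        ToDoubleFunction<Doc> usd = d -> d.price() / d.rate();
        List<Doc> hits = new ArrayList<>(List.of(
                new Doc("a", 300, 75),        // 4 USD
                new Doc("b", 84380, 42190))); // 2 USD
        hits.sort(Comparator.comparingDouble(usd));
        hits.forEach(d ->
                System.out.println(d.id() + " = " + usd.applyAsDouble(d) + " USD"));
    }
}
```

Note that sorting only reorders documents that already matched; making the computed value decide *whether* a document matches is the separate, costlier problem discussed in this thread.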


Kumaran R
Chennai, India



On Fri, Sep 3, 2021 at 12:08 AM Michael Sokolov  wrote:

> Have you looked at the expressions module? It provides support for
> user-defined computation using values from the index based on a simple
> expression language. It might prove useful to you if the exchange rate
> needs to be tracked very dynamically.
>
> On Thu, Sep 2, 2021 at 2:15 PM Kumaran Ramasubramanian
>  wrote:
> >
> > I am having one use case regarding currency based search. I want to get
> any
> > suggestions or pointers..
> >
> > For example,
> > Assume,
> > 1USD = 75 INR
> > 1USD = 42190 IRR
> > similarly, we have support for 100 currencies as of now.
> >
> > Record1 created with PRICE 150 INR & EXCHANGE_RATE 75 for USD
> > Record2 created with PRICE 84380 IRR & EXCHANGE_RATE 42190 for USD
> >
> > If i search 2 ( USD ), I would like to get both Record1 & Record2 as
> search
> > results
> >
> > PRICE & EXCHANGE_RATE are indexed & stored as separate fields in the
> search
> > index
> > We can have 50 number of currency fields like PRICE. so we may need to
> > index additional 50 fields holding USD values.
> >
> > To avoid additional fields, Is it possible to match records in the search
> > index by applying an expression like (PRICE / EXCHANGE_RATE )
> >
> > I am not sure if this is the right use case for Lucene index. But I would
> > like to know the possibilities. Thanks in advance
> >
> >
> > --
> > Kumaran R
> > Chennai, India
>


currency based search using query time calculated field match with expression

2021-09-02 Thread Kumaran Ramasubramanian
I have a use case regarding currency-based search and would like any
suggestions or pointers.

For example, assume:
1 USD = 75 INR
1 USD = 42190 IRR
and similarly for the roughly 100 currencies we support as of now.

Record1 is created with PRICE 150 INR and EXCHANGE_RATE 75 (to USD).
Record2 is created with PRICE 84380 IRR and EXCHANGE_RATE 42190 (to USD).

If I search for 2 (USD), I would like to get both Record1 and Record2 as
search results.

PRICE and EXCHANGE_RATE are indexed and stored as separate fields in the
search index. We can have up to 50 currency fields like PRICE, so we might
need to index an additional 50 fields holding USD values.

To avoid the additional fields, is it possible to match records in the
search index by applying an expression like (PRICE / EXCHANGE_RATE)?

I am not sure whether this is the right use case for a Lucene index, but I
would like to know the possibilities. Thanks in advance.
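The matching side of this, stripped of Lucene, is a computed-value filter. A minimal sketch of the intended semantics (a real implementation would push this into something like FunctionRangeQuery or a post-filter over candidate hits, which is costlier than a precomputed normalized field):

```java
import java.util.ArrayList;
import java.util.List;

public class CurrencyMatchSketch {
    // Hypothetical stand-in for an indexed record with the two fields.
    record Rec(String id, double price, double exchangeRate) {}

    // Query-time computed match: keep records whose USD-normalized price
    // equals the queried amount, within a tolerance for double arithmetic.
    static List<String> matchUsd(List<Rec> recs, double usd) {
        List<String> hits = new ArrayList<>();
        for (Rec r : recs) {
            if (Math.abs(r.price() / r.exchangeRate() - usd) < 1e-9) {
                hits.add(r.id());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Rec> recs = List.of(
                new Rec("Record1", 150, 75),       // 150 / 75     = 2 USD
                new Rec("Record2", 84380, 42190)); // 84380 / 42190 = 2 USD
        System.out.println(matchUsd(recs, 2.0)); // [Record1, Record2]
    }
}
```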


--
Kumaran R
Chennai, India


Re: Autocompletion based on one field in index

2020-03-11 Thread Kumaran Ramasubramanian
Hi All,

Any input would be appreciated. Thanks in advance.


Kumaran R




On Tue, Mar 10, 2020, 11:44 PM Kumaran Ramasubramanian 
wrote:

>
>
> Hi Mikhail
>
> Thanks for the input. But i would like to suggest title of the available
> documents (for the query typed in search box ) from an index & when user
> clicked on the suggestion, i would like to take to exact document. Thanks
> in advance.
>
> my requirement is like this ( like google did in its help widget )
> [image: Screenshot from 2020-03-10 23-40-14.png]
>
>
> --
> *K*umaran
> *R*
>
>
>
> On Wed, Mar 4, 2020 at 2:44 AM Mikhail Khludnev  wrote:
>
>> Hi,
>>
>> org.apache.lucene.search.spell.DirectSpellChecker
>>
>>
>> On Tue, Mar 3, 2020 at 8:14 AM Kumaran Ramasubramanian <
>> kums@gmail.com>
>> wrote:
>>
>> > Hi All,
>> >
>> > I would like to compute autocompletion based on one field's data. For
>> > example, title field of a list of webpages. Is there anyway to achieve
>> > this?
>> >
>> >
>> > Regards
>> > Kumaran R
>> >
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>


Re: Autocompletion based on one field in index

2020-03-10 Thread Kumaran Ramasubramanian
Hi Mikhail,

Thanks for the input, but I would like to suggest titles of the available
documents (for the query typed into the search box) from an index, and when
the user clicks a suggestion, take them to the exact document. Thanks in
advance.

My requirement is like this (like Google does in its help widget):
[image: Screenshot from 2020-03-10 23-40-14.png]


--
Kumaran R



On Wed, Mar 4, 2020 at 2:44 AM Mikhail Khludnev  wrote:

> Hi,
>
> org.apache.lucene.search.spell.DirectSpellChecker
>
>
> On Tue, Mar 3, 2020 at 8:14 AM Kumaran Ramasubramanian  >
> wrote:
>
> > Hi All,
> >
> > I would like to compute autocompletion based on one field's data. For
> > example, title field of a list of webpages. Is there anyway to achieve
> > this?
> >
> >
> > Regards
> > Kumaran R
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Autocompletion based on one field in index

2020-03-03 Thread Kumaran Ramasubramanian
Hi All,

I would like to compute autocompletion based on one field's data, for
example the title field of a list of webpages. Is there any way to achieve
this?
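One way to picture a single-field suggester: the field's values become sorted keys, and a typed prefix selects a key range. Lucene's suggest module (e.g. AnalyzingInfixSuggester) builds a dedicated structure for this; the stdlib sketch below only illustrates the prefix-range idea with hypothetical titles and document ids:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

public class TitleSuggester {
    // Stdlib stand-in for an index-backed suggester: map lowercased title
    // to a document id, and answer typed prefixes with a key-range scan so
    // a click on a suggestion can jump straight to the document.
    private final TreeMap<String, String> titleToDoc = new TreeMap<>();

    void add(String title, String docId) {
        titleToDoc.put(title.toLowerCase(Locale.ROOT), docId);
    }

    List<String> suggest(String prefix, int max) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (String t : titleToDoc.navigableKeySet().tailSet(p)) {
            if (!t.startsWith(p) || out.size() == max) break;
            out.add(t + " -> " + titleToDoc.get(t));
        }
        return out;
    }

    public static void main(String[] args) {
        TitleSuggester s = new TitleSuggester();
        s.add("Indexing basics", "doc1");
        s.add("Index encryption", "doc2");
        s.add("Query parsing", "doc3");
        System.out.println(s.suggest("index", 10));
        // [index encryption -> doc2, indexing basics -> doc1]
    }
}
```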


Regards
Kumaran R


Index fields configuration - suggestions

2018-03-05 Thread Kumaran Ramasubramanian
Hi all,

Regarding the configuration of every field (stored? analyzed? sort needed?
numeric?): Elasticsearch holds these configurations per index in its
cluster state, and Solr keeps them in XML. If we have data centers in
multiple locations, is there a better way of maintaining index field
configurations? I am looking for suggestions on storing these
configurations and keeping them synced across multiple data centers.
Please share your experience or any related pointers. Thanks in advance.


solr
https://lucene.apache.org/solr/guide/6_6/defining-fields.html

elasticsearch
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state.html
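A common shape for such a store, independent of Solr or Elasticsearch, is a versioned per-field schema document that each data center can compare against its local copy. A minimal hypothetical sketch (the field names, flags, and version scheme are all assumptions, not any product's API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldConfigSketch {
    // Hypothetical per-field settings, mirroring what Solr keeps in schema
    // XML and Elasticsearch keeps in cluster-state mappings.
    record FieldConfig(boolean stored, boolean analyzed,
                       boolean sortable, boolean numeric) {}

    public static void main(String[] args) {
        // A monotonically increasing version lets a replica in another data
        // center detect a stale copy and pull the newer configuration.
        long schemaVersion = 7;
        Map<String, FieldConfig> schema = new LinkedHashMap<>();
        schema.put("title", new FieldConfig(true, true, false, false));
        schema.put("price", new FieldConfig(true, false, true, true));
        System.out.println("schema v" + schemaVersion + ": " + schema.keySet());
        // schema v7: [title, price]
    }
}
```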


--
Kumaran R


Re: Encryption at lucene index

2017-08-11 Thread Kumaran Ramasubramanian
I got it, Erick. Thank you.

--
Kumaran R

On Fri, Aug 11, 2017 at 10:35 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Encrypting the _tokens_ inevitably leads to reduced capabilities BTW.
> Trivial example:
> I have these tokens in my index
> run
> runner
> running
> runs
>
> Any non-trivial encryption algorithm will not encrypt the first three
> letters "run" identically in all three so searching for run* simply
> won't work.
>
> As you can see, there's quite a bit of back-and-forth with that JIRA
> and it is pretty much been abandoned.
>
> Best,
> Erick
>
> On Thu, Aug 10, 2017 at 11:17 PM, Kumaran Ramasubramanian
> <kums@gmail.com> wrote:
> > Hi Ishan, thank you :-)
> >
> > -
> > -
> > Kumaran R
> >
> >
> >
> > On Mon, Aug 7, 2017 at 10:53 PM, Ishan Chattopadhyaya <
> > ichattopadhy...@gmail.com> wrote:
> >
> >> Harry Ochiai (Hitachi) has some index encryption solution,
> >> https://www.slideshare.net/maggon/securing-solr-search-
> data-in-the-cloud
> >> I think it is proprietary, but I'm not sure. Maybe more googling might
> help
> >> find the exact page where his solution is described.
> >>
> >> On Mon, Aug 7, 2017 at 9:59 PM, Kumaran Ramasubramanian <
> >> kums@gmail.com>
> >> wrote:
> >>
> >> > Hi Erick, i want to encrypt some fields of an document which has
> personal
> >> > identifiable information ( both indexed and stored data)... for eg:
> >> email,
> >> > mobilenumber etc.. i am able to find LUCENE-6966 alone while googling
> >> it..
> >> > any related pointers in solr or latest lucene version?
> >> >
> >> >
> >> > -
> >> > -
> >> > Kumaran R
> >> >
> >> > On Mon, Aug 7, 2017 at 9:52 PM, Erick Erickson <
> erickerick...@gmail.com>
> >> > wrote:
> >> >
> >> > > No, since you haven't defined what you want to encrypt, what your
> >> > > requirements are, what you hope to get out of "encryption" etc.
> >> > >
> >> > > Put the index on an encrypting filesystem and forget about it if you
> >> > > possibly can, because anything else is a significant amount of work.
> >> > > To encrypt the searchable tokens on a per-user basis in memory is a
> >> > > _lot_ of work. It depends on your security needs.
> >> > >
> >> > > Otherwise, as I said, please ask specific questions as the topic is
> >> > > quite large, much too large to conduct a seminar through the user's
> >> > > list.
> >> > >
> >> > > Best,
> >> > > Erick
> >> > >
> >> > > On Mon, Aug 7, 2017 at 9:07 AM, Kumaran Ramasubramanian
> >> > > <kums@gmail.com> wrote:
> >> > > > Hi Erick,
> >> > > >
> >> > > > Thanks for the information. Any pointers about encryption
> options
> >> > in
> >> > > > solr?
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Kumaran R
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Mon, Aug 7, 2017 at 9:17 PM, Erick Erickson <
> >> > erickerick...@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > >> Encryption in Solr has a bunch of ramifications. Do you care
> about
> >> > > >>
> >> > > >> - encryption at rest or in memory?
> >> > > >> - encrypting the _searchable_ tokens?
> >> > > >> - encrypting the searchable tokens per-user?
> >> > > >> - encrypting the stored data (which a filter won't do BTW).
> >> > > >>
> >> > > >> It's actually a fairly complex topic the discussion at
> LUCENE-6966
> >> > > >> outlines much of it. Please ask specific questions as you
> research
> >> the
> >> > > >> topic. One  per-user encryption package that I know of is by
> Hitachi
> >> > > >> Solutions (commercial) and it explicitly does _not_ support, for
> >> > > >> instance, wildcards (there are other limitations too). See:
> >> > > >> http://www.hitachi-solutions.com/securesearch/
> >> > > >>
> >> > > >> Most of the 

Re: Encryption at lucene index

2017-08-11 Thread Kumaran Ramasubramanian
Hi Ishan, thank you :-)

--
Kumaran R

On Mon, Aug 7, 2017 at 10:53 PM, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Harry Ochiai (Hitachi) has some index encryption solution,
> https://www.slideshare.net/maggon/securing-solr-search-data-in-the-cloud
> I think it is proprietary, but I'm not sure. Maybe more googling might help
> find the exact page where his solution is described.
>
> On Mon, Aug 7, 2017 at 9:59 PM, Kumaran Ramasubramanian <
> kums@gmail.com>
> wrote:
>
> > Hi Erick, i want to encrypt some fields of an document which has personal
> > identifiable information ( both indexed and stored data)... for eg:
> email,
> > mobilenumber etc.. i am able to find LUCENE-6966 alone while googling
> it..
> > any related pointers in solr or latest lucene version?
> >
> >
> > -
> > ​-
> > Kumaran R​
> >
> > On Mon, Aug 7, 2017 at 9:52 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> > > No, since you haven't defined what you want to encrypt, what your
> > > requirements are, what you hope to get out of "encryption" etc.
> > >
> > > Put the index on an encrypting filesystem and forget about it if you
> > > possibly can, because anything else is a significant amount of work.
> > > To encrypt the searchable tokens on a per-user basis in memory is a
> > > _lot_ of work. It depends on your security needs.
> > >
> > > Otherwise, as I said, please ask specific questions as the topic is
> > > quite large, much too large to conduct a seminar through the user's
> > > list.
> > >
> > > Best,
> > > Erick
> > >
> > > On Mon, Aug 7, 2017 at 9:07 AM, Kumaran Ramasubramanian
> > > <kums@gmail.com> wrote:
> > > > Hi Erick,
> > > >
> > > > Thanks for the information. Any pointers about encryption options
> > in
> > > > solr?
> > > >
> > > >
> > > > --
> > > > Kumaran R
> > > >
> > > >
> > > >
> > > > On Mon, Aug 7, 2017 at 9:17 PM, Erick Erickson <
> > erickerick...@gmail.com>
> > > > wrote:
> > > >
> > > >> Encryption in Solr has a bunch of ramifications. Do you care about
> > > >>
> > > >> - encryption at rest or in memory?
> > > >> - encrypting the _searchable_ tokens?
> > > >> - encrypting the searchable tokens per-user?
> > > >> - encrypting the stored data (which a filter won't do BTW).
> > > >>
> > > >> It's actually a fairly complex topic the discussion at LUCENE-6966
> > > >> outlines much of it. Please ask specific questions as you research
> the
> > > >> topic. One  per-user encryption package that I know of is by Hitachi
> > > >> Solutions (commercial) and it explicitly does _not_ support, for
> > > >> instance, wildcards (there are other limitations too). See:
> > > >> http://www.hitachi-solutions.com/securesearch/
> > > >>
> > > >> Most of the time when people ask for encryption they soon discover
> > > >> it's much more difficult than they imagine and settle for just
> putting
> > > >> the indexes on an encrypting file system. When they move beyond that
> > > >> it gets complex and you'd be well advised to consult with Solr
> > > >> security experts.
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > >> On Sun, Aug 6, 2017 at 11:30 PM, Kumaran Ramasubramanian
> > > >> <kums@gmail.com> wrote:
> > > >> > Hi All,
> > > >> >
> > > >> >
> > > >> > After looking at all below discussions, i have one doubt which may
> > be
> > > >> silly
> > > >> > or novice but i want to throw this to lucene user list.
> > > >> >
> > > >> > if we have encryption layer included in our analyzer's flow of
> > filters
> > > >> like
> > > >> > EncryptionFilter to control field-level encryption. what are the
> > > >> > consequences ? am i missing anything basic?
> > > >> >
> > > >> > Thanks in advance..
> > > >> >
> > > >> >
> > > >> > Related links:
> > > >> >
> > > >> > https://issu

Re: Encryption at lucene index

2017-08-07 Thread Kumaran Ramasubramanian
Hi Erick, I want to encrypt some fields of a document that hold personally
identifiable information (both indexed and stored data), e.g., email and
mobile number. I was able to find only LUCENE-6966 while googling. Are
there any related pointers in Solr or the latest Lucene version?

--
Kumaran R

On Mon, Aug 7, 2017 at 9:52 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> No, since you haven't defined what you want to encrypt, what your
> requirements are, what you hope to get out of "encryption" etc.
>
> Put the index on an encrypting filesystem and forget about it if you
> possibly can, because anything else is a significant amount of work.
> To encrypt the searchable tokens on a per-user basis in memory is a
> _lot_ of work. It depends on your security needs.
>
> Otherwise, as I said, please ask specific questions as the topic is
> quite large, much too large to conduct a seminar through the user's
> list.
>
> Best,
> Erick
>
> On Mon, Aug 7, 2017 at 9:07 AM, Kumaran Ramasubramanian
> <kums@gmail.com> wrote:
> > Hi Erick,
> >
> > Thanks for the information. Any pointers about encryption options in
> > solr?
> >
> >
> > --
> > Kumaran R
> >
> >
> >
> > On Mon, Aug 7, 2017 at 9:17 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Encryption in Solr has a bunch of ramifications. Do you care about
> >>
> >> - encryption at rest or in memory?
> >> - encrypting the _searchable_ tokens?
> >> - encrypting the searchable tokens per-user?
> >> - encrypting the stored data (which a filter won't do BTW).
> >>
> >> It's actually a fairly complex topic the discussion at LUCENE-6966
> >> outlines much of it. Please ask specific questions as you research the
> >> topic. One  per-user encryption package that I know of is by Hitachi
> >> Solutions (commercial) and it explicitly does _not_ support, for
> >> instance, wildcards (there are other limitations too). See:
> >> http://www.hitachi-solutions.com/securesearch/
> >>
> >> Most of the time when people ask for encryption they soon discover
> >> it's much more difficult than they imagine and settle for just putting
> >> the indexes on an encrypting file system. When they move beyond that
> >> it gets complex and you'd be well advised to consult with Solr
> >> security experts.
> >>
> >> Best,
> >> Erick
> >>
> >> On Sun, Aug 6, 2017 at 11:30 PM, Kumaran Ramasubramanian
> >> <kums@gmail.com> wrote:
> >> > Hi All,
> >> >
> >> >
> >> > After looking at all below discussions, i have one doubt which may be
> >> silly
> >> > or novice but i want to throw this to lucene user list.
> >> >
> >> > if we have encryption layer included in our analyzer's flow of filters
> >> like
> >> > EncryptionFilter to control field-level encryption. what are the
> >> > consequences ? am i missing anything basic?
> >> >
> >> > Thanks in advance..
> >> >
> >> >
> >> > Related links:
> >> >
> >> > https://issues.apache.org/jira/browse/LUCENE-2228 : AES Encrypted
> >> Directory
> >> > - in lucene 3.x
> >> >
> >> > https://issues.apache.org/jira/browse/LUCENE-6966 :  Codec for
> >> index-level
> >> > encryption - at codec level, to have control on which column / field
> have
> >> >  personal identifiable information
> >> >
> >> > https://security.stackexchange.com/questions/
> 53/is-a-lucene-search-
> >> index-effectively-a-backdoor-for-field-level-encryption
> >> >
> >> >
> >> > A decent encrypting algorithm will not produce, say, the same first
> >> portion
> >> >> for two tokens that start with the same letters. So wildcard searches
> >> won't
> >> >> work. Consider "runs", "running", "runner". A search on "run*" would
> be
> >> >> expected to match all three, but wouldn't unless the encryption were
> so
> >> >> trivial as to be useless. Similar issues arise with sorting. "More
> Like
> >> >> This" would be unreliable. There are many other features of a robust
> >> search
> >> >> engine that would be impacted, and an index with encrypted terms
> would
> >> be
> >> >> useful for only exact matches, which usually results in a poor search
> >> >> experience.
> >> >
> >> >
> >> > https://stackoverflow.com/questions/36604551/adding-
> >> encryption-to-solr-lucene-indexes
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Kumaran R
> >>


Re: Encryption at lucene index

2017-08-07 Thread Kumaran Ramasubramanian
Hi Erick,

Thanks for the information. Are there any pointers about encryption options
in Solr?


--
Kumaran R



On Mon, Aug 7, 2017 at 9:17 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Encryption in Solr has a bunch of ramifications. Do you care about
>
> - encryption at rest or in memory?
> - encrypting the _searchable_ tokens?
> - encrypting the searchable tokens per-user?
> - encrypting the stored data (which a filter won't do BTW).
>
> It's actually a fairly complex topic the discussion at LUCENE-6966
> outlines much of it. Please ask specific questions as you research the
> topic. One  per-user encryption package that I know of is by Hitachi
> Solutions (commercial) and it explicitly does _not_ support, for
> instance, wildcards (there are other limitations too). See:
> http://www.hitachi-solutions.com/securesearch/
>
> Most of the time when people ask for encryption they soon discover
> it's much more difficult than they imagine and settle for just putting
> the indexes on an encrypting file system. When they move beyond that
> it gets complex and you'd be well advised to consult with Solr
> security experts.
>
> Best,
> Erick
>
> On Sun, Aug 6, 2017 at 11:30 PM, Kumaran Ramasubramanian
> <kums@gmail.com> wrote:
> > Hi All,
> >
> >
> > After looking at all below discussions, i have one doubt which may be
> silly
> > or novice but i want to throw this to lucene user list.
> >
> > if we have encryption layer included in our analyzer's flow of filters
> like
> > EncryptionFilter to control field-level encryption. what are the
> > consequences ? am i missing anything basic?
> >
> > Thanks in advance..
> >
> >
> > Related links:
> >
> > https://issues.apache.org/jira/browse/LUCENE-2228 : AES Encrypted
> Directory
> > - in lucene 3.x
> >
> > https://issues.apache.org/jira/browse/LUCENE-6966 :  Codec for
> index-level
> > encryption - at codec level, to have control on which column / field have
> >  personal identifiable information
> >
> > https://security.stackexchange.com/questions/53/is-a-lucene-search-
> index-effectively-a-backdoor-for-field-level-encryption
> >
> >
> > A decent encrypting algorithm will not produce, say, the same first
> portion
> >> for two tokens that start with the same letters. So wildcard searches
> won't
> >> work. Consider "runs", "running", "runner". A search on "run*" would be
> >> expected to match all three, but wouldn't unless the encryption were so
> >> trivial as to be useless. Similar issues arise with sorting. "More Like
> >> This" would be unreliable. There are many other features of a robust
> search
> >> engine that would be impacted, and an index with encrypted terms would
> be
> >> useful for only exact matches, which usually results in a poor search
> >> experience.
> >
> >
> > https://stackoverflow.com/questions/36604551/adding-
> encryption-to-solr-lucene-indexes
> >
> >
> >
> >
> >
> >
> > --
> > Kumaran R
>


Encryption at lucene index

2017-08-07 Thread Kumaran Ramasubramanian
Hi All,


After looking at all the discussions below, I have one doubt, which may be
silly or novice, but I want to put it to the Lucene user list.

If we include an encryption layer in our analyzer's chain of filters, e.g.,
an EncryptionFilter controlling field-level encryption, what are the
consequences? Am I missing anything basic?

Thanks in advance.


Related links:

https://issues.apache.org/jira/browse/LUCENE-2228 : AES Encrypted Directory
- in lucene 3.x

https://issues.apache.org/jira/browse/LUCENE-6966 : Codec for index-level
encryption, at the codec level, giving control over which columns/fields
hold personally identifiable information

https://security.stackexchange.com/questions/53/is-a-lucene-search-index-effectively-a-backdoor-for-field-level-encryption


A decent encrypting algorithm will not produce, say, the same first portion
> for two tokens that start with the same letters. So wildcard searches won't
> work. Consider "runs", "running", "runner". A search on "run*" would be
> expected to match all three, but wouldn't unless the encryption were so
> trivial as to be useless. Similar issues arise with sorting. "More Like
> This" would be unreliable. There are many other features of a robust search
> engine that would be impacted, and an index with encrypted terms would be
> useful for only exact matches, which usually results in a poor search
> experience.


https://stackoverflow.com/questions/36604551/adding-encryption-to-solr-lucene-indexes
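The quoted point about wildcards can be made concrete with any deterministic transformation: tokens sharing the prefix "run" map to unrelated ciphertexts, so run* can no longer select them in the term dictionary. The sketch below uses a SHA-256 digest purely as a stand-in for real encryption:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class EncryptedPrefixDemo {
    // Stand-in for token "encryption": any decent cipher, like this digest,
    // destroys the shared "run" prefix, which is what breaks run* wildcard
    // (and prefix, and range) queries over encrypted terms.
    static String tok(String term) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] d = md.digest(term.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(d).substring(0, 12);
    }

    public static void main(String[] args) throws Exception {
        for (String w : new String[] {"run", "runner", "running", "runs"}) {
            System.out.println(w + " -> " + tok(w));
        }
    }
}
```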






--
Kumaran R


Re: Filters Vs queries - for terms more than 1024

2017-07-19 Thread Kumaran Ramasubramanian
Hi Adrien,

I have tried BooleanQuery wrapped with ConstantScoreQuery, based on the
suggestion from this link:
http://lucene.472066.n3.nabble.com/BooleanFilter-vs-BooleanQuery-performance-td4106920.html

> If you want it fast, use BooleanQuery and wrap it with ConstantScoreQuery.
> Then there is also no scoring done (in most cases; older BooleanQuery
> sometimes still calculated the score).

> 3. If I disable the scoring process using ConstantScoreQuery, is it
> possible to give more than 1024 query clauses? I tried this, but I am
> still getting java.lang.OutOfMemoryError. Why?


> java.lang.OutOfMemoryError: Java heap space
> at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.<init>(Lucene41PostingsReader.java:345)
> at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.docs(Lucene41PostingsReader.java:254)
> at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.docs(SegmentTermsEnum.java:999)
> at org.apache.lucene.index.TermsEnum.docs(TermsEnum.java:149)
> at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:84)
> at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:356)
> at org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:164)
> at org.apache.lucene.search.FilteredQuery$RandomAccessFilterStrategy.filteredScorer(FilteredQuery.java:542)
> at org.apache.lucene.search.FilteredQuery$FilterStrategy.filteredBulkScorer(FilteredQuery.java:504)
> at org.apache.lucene.search.FilteredQuery$1.bulkScorer(FilteredQuery.java:150)




If I use BooleanQuery wrapped with ConstantScoreQuery, can I use 1 lakh
(100,000) boolean clauses in the BooleanQuery?
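As the stack trace above suggests, each boolean clause gets its own weight and scorer, so 1 lakh (100,000) clauses means 100,000 postings enumerators alive at once. A terms-filter approach (TermsFilter in Lucene 4, TermInSetQuery later) is instead one set-membership test over terms; a stdlib sketch of that idea:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TermSetFilterSketch {
    public static void main(String[] args) {
        // One set with 100000 terms replaces 100000 OR clauses: membership
        // is a single hash probe per candidate, with no per-clause scorer.
        Set<String> wanted = new HashSet<>();
        for (int i = 0; i < 100000; i++) {
            wanted.add("value" + i);
        }
        List<String> docs = List.of("value42", "other", "value99999");
        List<String> hits = new ArrayList<>();
        for (String d : docs) {
            if (wanted.contains(d)) {
                hits.add(d);
            }
        }
        System.out.println(hits); // [value42, value99999]
    }
}
```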





--
Kumaran R

On Wed, Jul 19, 2017 at 8:26 AM, Kumaran Ramasubramanian <kums@gmail.com
> wrote:

>
>
> Thank you Adrien :-)
>
>
>
> On 18-Jul-2017 3:21 PM, "Adrien Grand" <jpou...@gmail.com> wrote:
>
> Sorry for the confusion, I keep saying query in all cases because queries
> and filters got merged in Lucene 5.0. If you are using BooleanFilter rather
> than BooleanQuery with Lucene 4 then things should be mostly ok if you have
> many clauses. But like TermsQuery, BooleanFilter always consume all
> matching documents from all its clauses. So if you intersect it with a
> selective query, it is wasteful.
>
> Le mar. 18 juil. 2017 à 11:42, Kumaran Ramasubramanian <kums@gmail.com
> >
> a écrit :
>
> > ​Hi Adrien,
> >
> > Thanks for your input...
> >
> > 1. using boolean filters is working for even 1lakh Filter Clauses in
> > > booleanFilter... is there any consequence using filters in this case?
> > shall
> > > i proceed with this?
> >
> >
> > ​code snippet i used for this statement 1.. ​
> >
> > for (int i = 0; i < 100000; i++)
> > {
> >     Term term = new Term("key" + i, "value" + i);
> >     TermsFilter filter = new TermsFilter(term);
> >     FilterClause filterClause = new FilterClause(filter,
> >         BooleanClause.Occur.SHOULD);
> >     boolFilter.add(filterClause);
> > }
> >
> >
> >
> > Do you see any problem in using TermsFilter over TermsQuery?
> >
> > btw, i will test with TermsQuery and let you know.
> >
> >
> >
> > ​--
> > Kumaran ​R
> >
> >
> >
> >
> > On Tue, Jul 18, 2017 at 1:59 AM, Adrien Grand <jpou...@gmail.com> wrote:
> >
> > > Could you use TermInSetQuery (TermsQuery in older Lucene versions)? It
> is
> > > worse at skipping over matches than a BooleanQuery but keeps memory
> > > usage low and disk access sequential, on the contrary to large boolean
> > > queries.
> > >
> > > Otherwise you would probably need to rethink how you design your
> > documents
> > > in order to be able to run simpler queries.
> > >
> > > Le lun. 17 juil. 2017 à 16:28, Kumaran Ramasubramanian <
> > kums@gmail.com
> > > >
> > > a écrit :
> > >
> > > > Hi All,
> > > >
> > > > i am using lucene 4.10.4
> > > >
> > > > In lucene search, i know we have 1024 limitation in number of boolean
> > > query
> > > > clauses. i know we can increase this limit.. but i want to understand
> &

Re: Filters Vs queries - for terms more than 1024

2017-07-18 Thread Kumaran Ramasubramanian
Thank you Adrien :-)



On 18-Jul-2017 3:21 PM, "Adrien Grand" <jpou...@gmail.com> wrote:

Sorry for the confusion, I keep saying "query" in all cases because queries
and filters got merged in Lucene 5.0. If you are using BooleanFilter rather
than BooleanQuery with Lucene 4, then things should be mostly OK if you have
many clauses. But like TermsQuery, BooleanFilter always consumes all
matching documents from all its clauses, so if you intersect it with a
selective query, it is wasteful.

Le mar. 18 juil. 2017 à 11:42, Kumaran Ramasubramanian <kums@gmail.com>
a écrit :

> ​Hi Adrien,
>
> Thanks for your input...
>
> 1. using boolean filters is working for even 1lakh Filter Clauses in
> > booleanFilter... is there any consequence using filters in this case?
> shall
> > i proceed with this?
>
>
> ​code snippet i used for this statement 1.. ​
>
> for (int i = 0; i < 100000; i++)
> > {
> >     Term term = new Term("key" + i, "value" + i);
> >     TermsFilter filter = new TermsFilter(term);
> >     FilterClause filterClause = new FilterClause(filter, BooleanClause.Occur.SHOULD);
> >     boolFilter.add(filterClause);
> > }
>
>
>
> Do you see any problem in using
> ​
> TermsFilter over TermsQuery?
>
> btw, i will test with TermsQuery and let you know.
>
>
>
> ​--
> Kumaran ​R
>
>
>
>
> On Tue, Jul 18, 2017 at 1:59 AM, Adrien Grand <jpou...@gmail.com> wrote:
>
> > Could you use TermInSetQuery (TermsQuery in older Lucene versions)? It
is
> > worse at skipping over matches than a BooleanQuery but keeps memory
> > usage low and disk access sequential, on the contrary to large boolean
> > queries.
> >
> > Otherwise you would probably need to rethink how you design your
> documents
> > in order to be able to run simpler queries.
> >
> > Le lun. 17 juil. 2017 à 16:28, Kumaran Ramasubramanian <
> kums@gmail.com
> > >
> > a écrit :
> >
> > > Hi All,
> > >
> > > i am using lucene 4.10.4
> > >
> > > In lucene search, i know we have 1024 limitation in number of boolean
> > query
> > > clauses. i know we can increase this limit.. but i want to understand
> > > queries vs filter in lucene 4.10.4...
> > >
> > > i want to make queries larger than 1024.. Relevance is not needed for
> > > me. What are the best possible options?
> > >
> > > 1. using boolean filters is working for even 1lakh Filter Clauses in
> > > booleanFilter... is there any consequence using filters in this case?
> > shall
> > > i proceed with this?
> > >
> > > 2. if i am giving very less memory for filters, it is managed to
> > complete a
> > > search after so much GC cycles.. Why cannot we do the same for query
> > > clauses too? What is the actual technical reason for 1024 limitation
in
> > > boolean query?
> > >
> > > 3. if i disable scoring process using ConstantScoreQuery, is it
> possible
> > > give more than 1024 query clauses?
> > >i tried this.. But still getting java.lang.OutOfMemoryError..
> Why
> > ?
> > >
> > > java.lang.OutOfMemoryError: Java heap space
> > > >   at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.<init>(Lucene41PostingsReader.java:345)
> > > >   at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.docs(Lucene41PostingsReader.java:254)
> > > >   at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.docs(SegmentTermsEnum.java:999)
> > > >   at org.apache.lucene.index.TermsEnum.docs(TermsEnum.java:149)
> > > >   at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:84)
> > > >   at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:356)
> > > >   at org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:164)
> > > >   at org.apache.lucene.search.FilteredQuery$RandomAccessFilterStrategy.filteredScorer(FilteredQuery.java:542)
> > > >   at org.apache.lucene.search.FilteredQuery$FilterStrategy.filteredBulkScorer(FilteredQuery.java:504)
> > > >   at org.apache.lucene.search.FilteredQuery$1.bulkScorer(FilteredQuery.java:150)
> > >
> > >
> > >
> > > Any pointers are much appreciated... Thank you..
> > >
> > >
> > >
> > > --
> > > Kumaran R
> > >
> >
>


Re: Filters Vs queries - for terms more than 1024

2017-07-18 Thread Kumaran Ramasubramanian
Hi Adrien,

Thanks for your input...

1. using boolean filters is working for even 1lakh Filter Clauses in
> booleanFilter... is there any consequence using filters in this case? shall
> i proceed with this?


The code snippet I used for statement 1:

for (int i = 0; i < 100000; i++)
> {
>     Term term = new Term("key" + i, "value" + i);
>     TermsFilter filter = new TermsFilter(term);
>     FilterClause filterClause = new FilterClause(filter, BooleanClause.Occur.SHOULD);
>     boolFilter.add(filterClause);
> }



Do you see any problem in using TermsFilter over TermsQuery?

BTW, I will test with TermsQuery and let you know.



--
Kumaran R




On Tue, Jul 18, 2017 at 1:59 AM, Adrien Grand <jpou...@gmail.com> wrote:

> Could you use TermInSetQuery (TermsQuery in older Lucene versions)? It is
> worse at skipping over matches than a BooleanQuery but keeps memory
> usage low and disk access sequential, on the contrary to large boolean
> queries.
>
> Otherwise you would probably need to rethink how you design your documents
> in order to be able to run simpler queries.
>
> Le lun. 17 juil. 2017 à 16:28, Kumaran Ramasubramanian <kums@gmail.com
> >
> a écrit :
>
> > Hi All,
> >
> > i am using lucene 4.10.4
> >
> > In lucene search, i know we have 1024 limitation in number of boolean
> query
> > clauses. i know we can increase this limit.. but i want to understand
> > queries vs filter in lucene 4.10.4...
> >
> > i want to make queries larger than 1024.. Relevance is not needed for
> > me. What are the best possible options?
> >
> > 1. using boolean filters is working for even 1lakh Filter Clauses in
> > booleanFilter... is there any consequence using filters in this case?
> shall
> > i proceed with this?
> >
> > 2. if i am giving very less memory for filters, it is managed to
> complete a
> > search after so much GC cycles.. Why cannot we do the same for query
> > clauses too? What is the actual technical reason for 1024 limitation in
> > boolean query?
> >
> > 3. if i disable scoring process using ConstantScoreQuery, is it possible
> > give more than 1024 query clauses?
> >i tried this.. But still getting java.lang.OutOfMemoryError.. Why
> ?
> >
> > java.lang.OutOfMemoryError: Java heap space
> > >   at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.<init>(Lucene41PostingsReader.java:345)
> > >   at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.docs(Lucene41PostingsReader.java:254)
> > >   at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.docs(SegmentTermsEnum.java:999)
> > >   at org.apache.lucene.index.TermsEnum.docs(TermsEnum.java:149)
> > >   at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:84)
> > >   at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:356)
> > >   at org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:164)
> > >   at org.apache.lucene.search.FilteredQuery$RandomAccessFilterStrategy.filteredScorer(FilteredQuery.java:542)
> > >   at org.apache.lucene.search.FilteredQuery$FilterStrategy.filteredBulkScorer(FilteredQuery.java:504)
> > >   at org.apache.lucene.search.FilteredQuery$1.bulkScorer(FilteredQuery.java:150)
> >
> >
> >
> > Any pointers are much appreciated... Thank you..
> >
> >
> >
> > --
> > Kumaran R
> >
>


Filters Vs queries - for terms more than 1024

2017-07-17 Thread Kumaran Ramasubramanian
Hi All,

I am using Lucene 4.10.4.

In Lucene search, I know there is a 1024 limit on the number of boolean query
clauses, and I know this limit can be increased, but I want to understand
queries vs. filters in Lucene 4.10.4.

I want to build queries with more than 1024 clauses; relevance is not needed
for me. What are the best possible options?

1. Using BooleanFilter works even for 1 lakh (100,000) FilterClauses in a
BooleanFilter. Are there any consequences to using filters in this case?
Shall I proceed with this?

2. If I give filters very little memory, the search still manages to complete
after many GC cycles. Why can't we do the same for query clauses too? What is
the actual technical reason for the 1024 limit in BooleanQuery?

3. If I disable the scoring process using ConstantScoreQuery, is it possible
to give more than 1024 query clauses?
   I tried this, but am still getting java.lang.OutOfMemoryError. Why?

java.lang.OutOfMemoryError: Java heap space
>   at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.<init>(Lucene41PostingsReader.java:345)
>   at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.docs(Lucene41PostingsReader.java:254)
>   at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.docs(SegmentTermsEnum.java:999)
>   at org.apache.lucene.index.TermsEnum.docs(TermsEnum.java:149)
>   at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:84)
>   at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:356)
>   at org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:164)
>   at org.apache.lucene.search.FilteredQuery$RandomAccessFilterStrategy.filteredScorer(FilteredQuery.java:542)
>   at org.apache.lucene.search.FilteredQuery$FilterStrategy.filteredBulkScorer(FilteredQuery.java:504)
>   at org.apache.lucene.search.FilteredQuery$1.bulkScorer(FilteredQuery.java:150)



Any pointers are much appreciated... Thank you..



--
Kumaran R
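[Editor's note] Adrien's suggestion from this thread can be sketched as follows. This is a minimal sketch, assuming Lucene 4.10.x, where the class is org.apache.lucene.queries.TermsQuery (renamed TermInSetQuery in 6.x); it accepts an arbitrary number of terms without hitting BooleanQuery's maxClauseCount. Field and value names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermsQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;

public class ManyTermsQueryExample {
    public static Query buildManyTermsQuery(int n) {
        // Collect all terms up front; TermsQuery has no 1024-clause limit.
        List<Term> terms = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            terms.add(new Term("key" + i, "value" + i));
        }
        // Wrap in ConstantScoreQuery since relevance is not needed here.
        return new ConstantScoreQuery(new TermsQuery(terms));
    }
}
```

Note the trade-off Adrien describes: TermsQuery consumes all matching documents of all its terms, so it is worse than a BooleanQuery at skipping, but it keeps memory usage low and disk access sequential.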


Re: email field - analyzed and not analyzed in single field using custom analyzer

2017-06-19 Thread Kumaran Ramasubramanian
Hi Steve

Thanks for the input. How do I apply WordDelimiterGraphFilter /
WordDelimiterFilter to email tokens alone, using an email regex? I want to
have analyzed tokens only for the other tokens with other kinds of special
characters...


--
Kumaran R






On Thu, Jun 15, 2017 at 7:43 PM, Steve Rowe <sar...@gmail.com> wrote:

> Hi Kumaran,
>
> WordDelimiterGraphFilter with PRESERVE_ORIGINAL should do what you want: <
> http://lucene.apache.org/core/6_6_0/analyzers-common/
> org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html>.
>
> Here’s a test I added to TestWordDelimiterGraphFilter.java that passed
> for me:
>
> -
> public void testEmail() throws Exception {
>   final int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS |
> SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | PRESERVE_ORIGINAL;
>   Analyzer a = new Analyzer() {
> @Override public TokenStreamComponents createComponents(String field) {
>   Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE,
> false);
>   return new TokenStreamComponents(tokenizer, new
> WordDelimiterGraphFilter(tokenizer, flags, null));
> }
>   };
>   assertAnalyzesTo(a, "will.sm...@yahoo.com",
>   new String[] { "will.sm...@yahoo.com", "will", "smith", "yahoo",
> "com" },
>   null, null, null,
>   new int[] { 1, 0, 1, 1, 1 },
>   null, false);
>   a.close();
> }
> -
>
> --
> Steve
> www.lucidworks.com
>
> > On Jun 15, 2017, at 8:53 AM, Kumaran Ramasubramanian <kums@gmail.com>
> wrote:
> >
> > Hi All,
> >
> > i want to index email fields as both analyzed and not analyzed using
> custom
> > analyzer.
> >
> > for example,
> > sm...@yahoo.com
> > will.sm...@yahoo.com
> >
> > that is,  indexing sm...@yahoo.com as single token as well as analyzed
> > tokens in same email field...
> >
> >
> > My existing custom analyzer,
> >
> > public class CustomSearchAnalyzer extends StopwordAnalyzerBase
> > {
> >
> >public CustomSearchAnalyzer(Version matchVersion, Reader stopwords)
> > throws Exception
> >{
> >super(matchVersion, loadStopwordSet(stopwords, matchVersion));
> >}
> >
> >@Override
> >protected Analyzer.TokenStreamComponents createComponents(final String
> > fieldName, final Reader reader)
> >{
> >final ClassicTokenizer src = new ClassicTokenizer(getVersion(),
> > reader);
> >src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >TokenStream tok = new ClassicFilter(src);
> >tok = new LowerCaseFilter(getVersion(), tok);
> >tok = new StopFilter(getVersion(), tok, stopwords);
> >tok = new ASCIIFoldingFilter(tok); // to enable AccentInsensitive
> > search
> >
> >return new Analyzer.TokenStreamComponents(src, tok)
> >{
> >@Override
> >protected void setReader(final Reader reader) throws
> IOException
> >{
> >
> > src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >super.setReader(reader);
> >}
> >};
> >}
> > }
> >
> >
> > And so i want to achieve like,
> >
> > 1.if i search using query "sm...@yahoo.com", records with
> > will.sm...@yahoo.com should not come...
> > 2.Also i should be able to search using query "smith" in that field
> > 3.if possible, should be able to detect email values in all other fields
> > and apply the same type of tokenization
> >
> > How to achieve point 1 and 2 using UAX29URLEmailTokenizer? how to add
> > UAX29URLEmailTokenizer in my existing custom analyzer without using email
> > analyzer ( perfieldanalyzer )  for email field.. And so i can apply this
> > tokenizer for email terms of all fields..
> >
> >
> >
> > -
> > Kumaran R
>
>
>
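[Editor's note] One hedged sketch of getting the "email tokens alone" behaviour without a regex: tokenize with UAX29URLEmailTokenizer (which keeps e-mail addresses together as single tokens) and then run WordDelimiterGraphFilter with PRESERVE_ORIGINAL, so only tokens that contain delimiters (such as e-mail addresses) additionally get their sub-parts emitted, while plain words pass through untouched. Class names and flags are from Lucene 6.6+ analyzers-common; treat the exact chain as an assumption, not a tested recipe.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;

public class EmailAwareAnalyzer extends Analyzer {
    private static final int FLAGS =
            WordDelimiterGraphFilter.GENERATE_WORD_PARTS
          | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Emits e-mail addresses (and URLs) as single tokens.
        Tokenizer src = new UAX29URLEmailTokenizer();
        // Splits tokens on '.', '@', etc. while keeping the original token;
        // tokens without delimiters (plain words) are left as-is.
        TokenStream tok = new WordDelimiterGraphFilter(src, FLAGS, null);
        tok = new LowerCaseFilter(tok);
        return new TokenStreamComponents(src, tok);
    }
}
```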


email field - analyzed and not analyzed in single field using custom analyzer

2017-06-15 Thread Kumaran Ramasubramanian
Hi All,

I want to index email fields as both analyzed and not analyzed using a custom
analyzer.

For example:
sm...@yahoo.com
will.sm...@yahoo.com

That is, indexing sm...@yahoo.com as a single token as well as analyzed
tokens in the same email field...


My existing custom analyzer,

public class CustomSearchAnalyzer extends StopwordAnalyzerBase
{
    public CustomSearchAnalyzer(Version matchVersion, Reader stopwords) throws Exception
    {
        super(matchVersion, loadStopwordSet(stopwords, matchVersion));
    }

    @Override
    protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader)
    {
        final ClassicTokenizer src = new ClassicTokenizer(getVersion(), reader);
        src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
        TokenStream tok = new ClassicFilter(src);
        tok = new LowerCaseFilter(getVersion(), tok);
        tok = new StopFilter(getVersion(), tok, stopwords);
        tok = new ASCIIFoldingFilter(tok); // to enable accent-insensitive search

        return new Analyzer.TokenStreamComponents(src, tok)
        {
            @Override
            protected void setReader(final Reader reader) throws IOException
            {
                src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
                super.setReader(reader);
            }
        };
    }
}


And so I want to achieve the following:

1. If I search using the query "sm...@yahoo.com", records with
will.sm...@yahoo.com should not match.
2. I should also be able to search using the query "smith" in that field.
3. If possible, email values should be detected in all other fields and
given the same kind of tokenization.

How do I achieve points 1 and 2 using UAX29URLEmailTokenizer? How do I add
UAX29URLEmailTokenizer to my existing custom analyzer without using a
per-field email analyzer (PerFieldAnalyzerWrapper) for the email field, so
that I can apply this tokenizer to the email terms of all fields?



-
Kumaran R
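[Editor's note] For point 1, one hedged sketch against the same 4.x-era APIs as the analyzer above is to swap the ClassicTokenizer for UAX29URLEmailTokenizer, which keeps e-mail addresses together as single tokens of type <EMAIL>. The class name below is hypothetical; stop-word handling is omitted for brevity.

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical variant of CustomSearchAnalyzer: only the tokenizer changes.
public class EmailKeepingAnalyzer extends Analyzer {
    private final Version matchVersion = Version.LUCENE_4_10_4;

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Unlike ClassicTokenizer, this keeps e-mails as single <EMAIL> tokens.
        UAX29URLEmailTokenizer src = new UAX29URLEmailTokenizer(matchVersion, reader);
        TokenStream tok = new LowerCaseFilter(matchVersion, src);
        tok = new ASCIIFoldingFilter(tok);
        return new TokenStreamComponents(src, tok);
    }
}
```

With this, a TermQuery for the full address matches only the exact address (point 1); getting "smith" to match as well (point 2) would additionally need something like the WordDelimiterFilter-with-PRESERVE_ORIGINAL approach Steve describes above.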


Re: Recommended number of fields in one lucene index

2017-02-15 Thread Kumaran Ramasubramanian
Hi Adrien Grand,

Thanks for the response.

a binary blob that
> stores all the data so that you can perform updates.


Could you elaborate on this? Do you mean having a StoredField, as mentioned
below, to hold all the other fields that are needed only for updates? Is
there any way to use the updateDocuments API for this kind of update, instead
of reading back the stored fields and doing a delete-and-re-add of the
updated documents?


Use a StoredField. You can pass in either the BytesRef, or the byte array
>> itself into the field:
>
>
>> byte[] myByteArray = new byte[10];
>
> document.add(new StoredField("bin1", myByteArray));
>
> As far as retrieving the value, you are on about the right track there
>> already. Something like:
>
>
>> Document resultDoc = searcher.doc(docno);
>
> BytesRef bin1ref = resultDoc.getBinaryValue("bin1");
>
> byte[] bin1bytes = bin1ref.bytes;
>
>
Snippet from: http://stackoverflow.com/a/34324561/1382168


--
Kumaran R




On Thu, Feb 16, 2017 at 12:38 AM, Adrien Grand <jpou...@gmail.com> wrote:

> I think it is hard to come up with a general rule, but there is certainly a
> per-field overhead. There are some things that we need to store per field
> per segment in memory, so if you multiply the number of fields you have,
> you could run out of memory. In most cases I have seen where the index had
> so many fields, it was due to the fact that the application wanted to index
> arbitrary documents and provide search for them, which cannot scale, or to
> the fact that the index contained many unrelated documents that should have
> been put into different indices. This limit has been very useful to catch
> such design problems early instead of waiting for the production server to
> go out of memory due to the multiplication of fields.
>
> Le mer. 15 févr. 2017 à 19:44, Kumaran Ramasubramanian <kums@gmail.com
> >
> a écrit :
>
> > While searching, i use _all_ blob field to search in texts of all fields
> > data.
> >
>
> This is interesting: if all your searches go to a catch-all field, then it
> means that you do not need those thousands of fields but could just have a
> single indexed field that is used for searching, and a binary blob that
> stores all the data so that you can perform updates. So this only requires
> two fields from a Lucene perspective.
>


Recommended number of fields in one lucene index

2017-02-15 Thread Kumaran Ramasubramanian
Hi All,

Elasticsearch allows 1000 fields by default. In Lucene, what are the indexing
and searching performance impacts of having 10 fields vs. 3000 fields in a
Lucene index?

In my case:
While indexing, I index and store all fields, so that I can update a single
field by reading back all the stored fields (except the field to be updated)
and indexing everything again (removing and re-adding the remaining fields).

While searching, I use an _all_ blob field to search across the text of all
fields.


--
Kumaran R
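[Editor's note] Adrien's two-field design from the reply above can be sketched roughly like this. The field names are hypothetical, and the blob is whatever serialization of the per-field data the application already uses.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

public class TwoFieldDoc {
    public static Document build(String allText, byte[] blob) {
        Document doc = new Document();
        // Single indexed catch-all field used for every search.
        doc.add(new TextField("_all_", allText, Field.Store.NO));
        // Opaque stored blob holding all per-field data, read back for updates.
        doc.add(new StoredField("blob", blob));
        return doc;
    }
}
```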


Re: Replacement for Filter-as-abstract-class in Lucene 5.4?

2017-01-11 Thread Kumaran Ramasubramanian
I always use a filter when I need to add more than 1024 clauses (for
no-scoring cases). If Filter is removed in Lucene 6, what will happen to the
maxClauseCount limit? Am I missing anything?


-
Kumaran R


On Jan 12, 2017 5:01 AM, "Trejkaz"  wrote:

On Thu, Jan 21, 2016 at 4:25 AM, Adrien Grand  wrote:
> Uwe, maybe we could promote ConstantScoreWeight to an experimental API and
> document how to build simple queries based on it?

In the future now, looking at Lucene 6.3 Javadocs, where Filter is now
gone, and it seems that ConstantScoreWeight is still @lucene.internal
(and awfully hard to understand how it can do much at all...). Did we
ever get a replacement class for this use case for Filter? I read
something about solr taking a copy of the class over in its code,
which might be what we have to do here, but I wanted to check first.

TX
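[Editor's note] For the no-scoring many-clause case raised above, a sketch of what replaces a BooleanFilter in Lucene 6.x, assuming the clause limit is raised explicitly (when all clauses are term lookups, TermInSetQuery avoids the limit entirely):

```java
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FilterReplacement {
    public static Query disjunction(List<Term> terms) {
        // The 1024 default still applies to BooleanQuery in 6.x,
        // so raise it before building a large disjunction.
        BooleanQuery.setMaxClauseCount(
                Math.max(BooleanQuery.getMaxClauseCount(), terms.size()));
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        for (Term t : terms) {
            b.add(new TermQuery(t), BooleanClause.Occur.SHOULD);
        }
        // Constant score = the old "filter" semantics (no relevance).
        return new ConstantScoreQuery(b.build());
    }
}
```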



Re: Heavy usage of final in Lucene classes

2017-01-11 Thread Kumaran Ramasubramanian
Hi,

I want to know the purpose of making the analyzers final.

For example, ClassicAnalyzer: it would be easy to add an ASCIIFoldingFilter
on top of ClassicAnalyzer if it were not final.


-
Kumaran R

On Jan 12, 2017 5:41 AM, "Michael McCandless" 
wrote:

I don't think it's about efficiency but rather about not exposing
possibly trappy APIs / usage ...

Do you have a particular class/method that you'd want to remove final from?

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jan 11, 2017 at 4:15 PM, Michael Wilkowski 
wrote:
> Hi,
> I sometimes wonder what is the purpose of so heavy "final" methods and
> classes usage in Lucene. It makes it my life much harder to override
> standard classes with some custom implementation.
>
> What comes first to my mind is runtime efficiency (compiler "knows" that
> this class/method will not be overridden and may create more efficient
code
> without jump lookup tables and with method inlining). Is my assumption
> correct or there are other benefits that were behind this decision?
>
> Regards,
> Michael W.
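[Editor's note] As a concrete (hedged) illustration of the trade-off discussed here: since e.g. ClassicAnalyzer is final, the supported route is to compose the same tokenizer/filter chain yourself rather than subclass. API shapes below assume Lucene 6.x; stop-word filtering is omitted.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.ClassicFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;

// Rebuilds ClassicAnalyzer's chain (minus stop words) and appends a filter,
// instead of extending the final class.
public class FoldingClassicAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer src = new ClassicTokenizer();
        TokenStream tok = new ClassicFilter(src);
        tok = new LowerCaseFilter(tok);
        tok = new ASCIIFoldingFilter(tok); // the extra filter we wanted
        return new TokenStreamComponents(src, tok);
    }
}
```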



Re: Sorting, Range Query, faceting - NumericDocValuesField Vs LongField

2016-12-23 Thread Kumaran Ramasubramanian
Thanks Erick and Mike. I am using Lucene 4.10.4 directly.

I have observed better performance with LongField compared to lexicographic
sorting; I understand this is due to the trie structure of LongField.

But one more doubt: will the uninversion process happen for IntField /
LongField too?

Thanks for the link, Mike. I will look into LongPoint in recent versions.

--
Kumaran R










On Fri, Dec 23, 2016 at 4:51 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Note that Erick is giving you the Solr syntax below, but if you are
> using Lucene directly, that obviously doesn't apply (though the same
> general concepts do).
>
> I would strongly recommend not using uninversion: it's an archaic and
> costly option that Lucene only offered long ago because it didn't have
> doc values, but that changed many years ago now.
>
> Also the new dimensional points (IntPoint, LongPoint) give better
> performance than the legacy postings based ("trie") numerics.
>
> See https://www.elastic.co/blog/apache-lucene-numeric-filters for some
> of the history here ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Dec 22, 2016 at 10:37 PM, Erick Erickson
> <erickerick...@gmail.com> wrote:
> > bq: Does this mean LongField/IntField just supports lexicographic
> > order in sorting?
> >
> > no on several counts.
> >
> > No numeric type (long, int, float, double or trie values) support
> > lexicographic sorting. That's the whole _point_ of having numeric
> > types in the first place. Well, and efficient range queries in the
> > Trie variants.
> >
> > docValues are an additional _attribute_ on the field so it's perfectly
> > reasonable to have a long field that's both
> > indexed="true"  and docValues="true". Or
> > indexed="true"  and docValues="false". Or
> > indexed="false" and docValues="true". Or
> > indexed="false" and docValues="false"
> >
> > Do not think of them as separate field types.
> >
> > indexed="true" is _required_ for searching. A field with
> > indexed="true" and docValues="false" also supports faceting, grouping
> > and sorting (numeric).
> >
> > A field with docValues="true" just supports faceting, grouping and
> > sorting without having to "uninvert" the field in the Java heap, the
> > data is out in OS cache. See Uwe's excellent blog here:
> > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >
> > Best,
> > Erick
> >
> > On Thu, Dec 22, 2016 at 6:57 PM, Kumaran Ramasubramanian
> > <kums@gmail.com> wrote:
> >> Thank you Adrien.
> >>
> >> "NumericDocValuesField is the one that supports sorting."
> >>
> >> Does this mean LongField/IntField just supports lexicographic order in
> >> sorting?
> >>
> >>
> >> -
> >> Kumaran R
> >>
> >>
> >>
> >> On Dec 22, 2016 11:28 PM, "Adrien Grand" <jpou...@gmail.com> wrote:
> >>
> >> Le jeu. 22 déc. 2016 à 18:50, Kumaran Ramasubramanian <
> kums@gmail.com>
> >> a écrit :
> >>
> >>> I want to provide sorting, range search and faceting in numeric fields.
> >>>
> >>> AFAIK, Purpose of different numeric field types are,
> >>>
> >>> NumericDocValuesField supports sorting and faceting
> >>> LongField/IntField supports range query and sorting
> >>>
> >>
> >> LongField/IntField only support querying, NumericDocValuesField is the
> one
> >> that supports sorting.
> >>
> >> Also note that as of 6.0 LongField and IntField have been replaced with
> >> LongPoint and IntPoint.
> >>
> >>
> >>> 1. Should i duplicate one field in above mentioned types to achieve all
> >> the
> >>> three features in numeric?
> >>>
> >>
> >> Yes. By the way it is perfectly fine to use the same field name for the
> >> point field and the doc values field.
> >>
> >>
> >>> 2. If i am ready to sacrifice faceting, is it advisable to use
> LongField
> >>> for sorting and range query?
> >>>
> >>
> >> Like said above you need doc values for sorting.
> >>
> >>
> >>> 3. During sorting, Will NumericDocValuesField( column stride storage)
> >>> perform better than LongField(trie structure)? If so , should i
> duplicate
> >>> field in both 1 and 2 cases?
> >>>
> >>
> >> Same note here.
> >>
> >> Adrien
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
>


Re: Sorting, Range Query, faceting - NumericDocValuesField Vs LongField

2016-12-22 Thread Kumaran Ramasubramanian
Thank you Adrien.

"NumericDocValuesField is the one that supports sorting."

Does this mean LongField/IntField only support lexicographic order when
sorting?


-
Kumaran R



On Dec 22, 2016 11:28 PM, "Adrien Grand" <jpou...@gmail.com> wrote:

Le jeu. 22 déc. 2016 à 18:50, Kumaran Ramasubramanian <kums@gmail.com>
a écrit :

> I want to provide sorting, range search and faceting in numeric fields.
>
> AFAIK, Purpose of different numeric field types are,
>
> NumericDocValuesField supports sorting and faceting
> LongField/IntField supports range query and sorting
>

LongField/IntField only support querying, NumericDocValuesField is the one
that supports sorting.

Also note that as of 6.0 LongField and IntField have been replaced with
LongPoint and IntPoint.


> 1. Should i duplicate one field in above mentioned types to achieve all
the
> three features in numeric?
>

Yes. By the way it is perfectly fine to use the same field name for the
point field and the doc values field.


> 2. If i am ready to sacrifice faceting, is it advisable to use LongField
> for sorting and range query?
>

Like said above you need doc values for sorting.


> 3. During sorting, Will NumericDocValuesField( column stride storage)
> perform better than LongField(trie structure)? If so , should i duplicate
> field in both 1 and 2 cases?
>

Same note here.

Adrien


Sorting, Range Query, faceting - NumericDocValuesField Vs LongField

2016-12-22 Thread Kumaran Ramasubramanian
Hi All,


I want to provide sorting, range search and faceting on numeric fields.

AFAIK, the purposes of the different numeric field types are:

NumericDocValuesField supports sorting and faceting
LongField/IntField supports range queries and sorting

1. Should I duplicate one field across the above-mentioned types to achieve
all three features for numerics?
2. If I am ready to sacrifice faceting, is it advisable to use LongField for
sorting and range queries?
3. During sorting, will NumericDocValuesField (column-stride storage) perform
better than LongField (trie structure)? If so, should I duplicate the field
in both cases 1 and 2?







-
Kumaran R
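[Editor's note] The advice in the replies above amounts to indexing the same value under several field types, reusing one field name. In post-6.0 terms (LongPoint instead of LongField) a sketch looks like this; the field name is made up.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;

public class NumericFieldExample {
    public static Document build(long price) {
        Document doc = new Document();
        doc.add(new LongPoint("price", price));             // range queries
        doc.add(new NumericDocValuesField("price", price)); // sorting / faceting
        doc.add(new StoredField("price", price));           // retrieving the value
        return doc;
    }
}
```

On 4.10.x the first line would be a LongField instead of a LongPoint; the same-field-name trick works there too.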


Re: how do lucene read large index files?

2016-11-29 Thread Kumaran Ramasubramanian
Thanks Mike. We are planning to move to MMapDirectory for both indexing and
searching. Regarding the ulimit change and reads during merging, I just
wanted to understand the impact of MMapDirectory during indexing.

-
Kumaran R


On Nov 30, 2016 4:18 AM, "Michael McCandless" <luc...@mikemccandless.com>
wrote:
>
> It's OK to use NIOFSDirectory for indexing only in that nothing will
break.
>
> But, MMapDirectory already uses normal IO for writing
> (java.io.FileOutputStream), and indexing does sometimes need to to
> read (for merging segments) though that's largely sequential reading
> so perhaps NIOFSDirectory won't be much slower.
>
> Why not use MMapDirectory for both indexing and searching?
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Nov 28, 2016 at 7:20 AM, Kumaran Ramasubramanian
> <kums@gmail.com> wrote:
> > Thanks a lot Uwe!!! Do we get any benefit on using MMapDirectory over
> > NIOFSDir during indexing? During merging? Is it ok to change to
> > MMapDirectory during search alone?
> >
> > --
> > Kumaran R
> >
> >
> > On Nov 24, 2016 11:27 PM, "Erick Erickson" <erickerick...@gmail.com>
wrote:
> >>
> >> Thanks Uwe!
> >>
> >>
> >>
> >>
> >> On Thu, Nov 24, 2016 at 9:41 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> >> > Hi Kumaran, hi Erick,
> >> >
> >> >> Not really, as I don't know that code well, Uwe and company
> >> >> are the masters of that realm ;)
> >> >>
> >> >> Sorry I can't be more help there
> >> >
> >> > I can help!
> >> >
> >> >> On Thu, Nov 24, 2016 at 7:29 AM, Kumaran Ramasubramanian
> >> >> <kums@gmail.com> wrote:
> >> >> > Erick, Thanks a lot for sharing an excellent post...
> >> >> >
> >> >> > Btw, am using NIOFSDirectory, could you please elaborate on below
> >> >> mentioned
> >> >> > lines? or any further pointers?
> >> >> > NIOFSDirectory or SimpleFSDirectory, we have to pay another price:
> > Our
> >> >> code
> >> >> >> has to do a lot of syscalls to the O/S kernel to copy blocks of
data
> >> >> >> between the disk or filesystem cache and our buffers residing in
> > Java
> >> >> heap.
> >> >> >> This needs to be done on every search request, over and over
again.
> >> >
> >> > the blog post just says it simple: You should use MMapDirectory and
> > avoid SimpleFSDir or NIOFSDir! The blog post explains why: SimpleFSDir
SimpleFSDir
> > and NIOFSDir extend BufferedIndexInput. This class uses an on-heap
buffer
> > for reading index files (which is 16 KB). For some parts of the index
(like
> > doc values), this is not ideal. E.g. if you sort against a doc values
field
> > and it needs to access a sort value (e.g. a short, integer or byte,
which
> > is very small), it will ask the buffer for the like 4 bytes. In most
cases
> > when sorting the buffer will not contain those byte, as sorting requires
> > random access over a huge file (so it is unlikely that the buffer will
> > help). Then BufferedIndexInput will seek the NIO/Simple file pointer and
> > read 16 KiB into the buffer. This requires a syscall to the OS kernel,
> > which is expensive. During sorting search results this can be millions
or
> > billions of times. In addition it will copy chunks of memory between
Java
> > heap and operating system cache over and over.
> >> >
> >> > With MMapDirectory no buffering is done, the Lucene code directly
> > accesses the file system cache and this is much more optimized.
> >> >
> >> > So for fast index access:
> >> > - avoid SimpleFSDir or NIOFSDir (those are only there for legacy 32
bit
> > operating systems and JVMs)
> >> > - configure your operating system kernel as described in the blog
post
> > and use MMapDirectory
> >> > - tell the sysadmin to inform himself about the output of linux
> > commands free/top/... (or Windows complements).
> >> >
> >> > Uwe
> >> >
> >> >> > --
> >> >> > Kumaran R
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Wed, Nov 23, 2016 at 9:17 PM, Erick Erickson
> >> >> <erickerick...@gmail.com>
> >> >> > wrote:
> >> >> >
> >> >> >> see Uwe's blog:
> >> &

Re: how do lucene read large index files?

2016-11-28 Thread Kumaran Ramasubramanian
Thanks a lot, Uwe! Do we get any benefit from using MMapDirectory over
NIOFSDir during indexing? During merging? Is it OK to switch to
MMapDirectory for search alone?

--
Kumaran R


On Nov 24, 2016 11:27 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
> Thanks Uwe!
>
>
>
>
> On Thu, Nov 24, 2016 at 9:41 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> > Hi Kumaran, hi Erick,
> >
> >> Not really, as I don't know that code well, Uwe and company
> >> are the masters of that realm ;)
> >>
> >> Sorry I can't be more help there
> >
> > I can help!
> >
> >> On Thu, Nov 24, 2016 at 7:29 AM, Kumaran Ramasubramanian
> >> <kums@gmail.com> wrote:
> >> > Erick, Thanks a lot for sharing an excellent post...
> >> >
> >> > Btw, am using NIOFSDirectory, could you please elaborate on below
> >> mentioned
> >> > lines? or any further pointers?
> >> > NIOFSDirectory or SimpleFSDirectory, we have to pay another price:
Our
> >> code
> >> >> has to do a lot of syscalls to the O/S kernel to copy blocks of data
> >> >> between the disk or filesystem cache and our buffers residing in
Java
> >> heap.
> >> >> This needs to be done on every search request, over and over again.
> >
> > The blog post says it simply: you should use MMapDirectory and avoid
> > SimpleFSDir or NIOFSDir! It also explains why: SimpleFSDir and NIOFSDir
> > extend BufferedIndexInput. This class uses an on-heap buffer for reading
> > index files (which is 16 KB). For some parts of the index (like doc
> > values), this is not ideal. E.g. if you sort against a doc values field
> > and it needs to access a sort value (e.g. a short, integer or byte, which
> > is very small), it will ask the buffer for just those few bytes. In most
> > cases when sorting, the buffer will not contain those bytes, as sorting
> > requires random access over a huge file (so it is unlikely that the
> > buffer will help). Then BufferedIndexInput will seek the NIO/Simple file
> > pointer and read 16 KiB into the buffer. This requires a syscall to the
> > OS kernel, which is expensive. During sorting this can happen millions or
> > billions of times. In addition it will copy chunks of memory between the
> > Java heap and the operating system cache over and over.
> >
> > With MMapDirectory no buffering is done; the Lucene code directly
> > accesses the file system cache, and this is much more optimized.
> >
> > So for fast index access:
> > - avoid SimpleFSDir or NIOFSDir (those are only there for legacy 32-bit
> >   operating systems and JVMs)
> > - configure your operating system kernel as described in the blog post
> >   and use MMapDirectory
> > - tell the sysadmin to inform himself about the output of Linux
> >   commands free/top/... (or their Windows counterparts)
> >
> > Uwe
> >
> >> > --
> >> > Kumaran R
> >> >
> >> >
> >> >
> >> > On Wed, Nov 23, 2016 at 9:17 PM, Erick Erickson
> >> <erickerick...@gmail.com>
> >> > wrote:
> >> >
> >> >> see Uwe's blog:
> >> >> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-
> >> 64bit.html
> >> >>
> >> >> Short form: files are read into the OS's memory as needed. the whole
> >> >> file isn't read at once.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Wed, Nov 23, 2016 at 12:04 AM, Kumaran Ramasubramanian
> >> >> <kums@gmail.com> wrote:
> >> >> > Hi All,
> >> >> >
> >> >> > how do lucene read large index files?
> >> >> > for example, if one file (for eg: .dat file) is 4GB.
> >> >> > lucene read only part of file to RAM? or
> >> >> > is it different approach for different lucene file formats?
> >> >> >
> >> >> >
> >> >> > Related Link:
> >> >> > How do applications (and OS) handle very big files?
> >> >> > http://superuser.com/a/361201
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Kumaran R
> >> >>
> >> >>
-
> >> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >> >>
> >> >>
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
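The cost difference Uwe describes can be seen with plain java.nio, outside Lucene: a buffered/NIO-style access pays a read syscall per random lookup, while a memory-mapped file is read like an array straight out of the OS page cache. A minimal self-contained sketch (the class name, file layout, and offsets are made up for illustration):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ReadPathsDemo {

    // NIOFSDirectory-style access: each call issues a positional read,
    // i.e. a syscall that copies bytes from the OS cache into a heap buffer.
    static int readIntBuffered(FileChannel ch, long pos) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(4);
        ch.read(buf, pos);
        buf.flip();
        return buf.getInt();
    }

    // MMapDirectory-style access: the file is mapped once; each access is a
    // plain memory read served from the page cache, with no syscall.
    static int readIntMapped(MappedByteBuffer map, int pos) {
        return map.getInt(pos);
    }

    // Returns true when both access paths see identical data at scattered offsets.
    public static boolean agree() throws IOException {
        Path tmp = Files.createTempFile("readpaths", ".bin");
        ByteBuffer data = ByteBuffer.allocate(1024 * 4);
        for (int i = 0; i < 1024; i++) {
            data.putInt(i * 7);  // deterministic content: value 7*i at byte offset 4*i
        }
        Files.write(tmp, data.array());
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // Scattered offsets, like a sort comparator touching doc values:
            for (int pos : new int[] {4000, 40, 2048, 0}) {
                if (readIntBuffered(ch, pos) != readIntMapped(map, pos)) {
                    return false;
                }
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(agree() ? "buffered and mapped reads agree" : "mismatch");
    }
}
```

Lucene's BufferedIndexInput amortizes the syscall with its 16 KB buffer, but as Uwe notes, the random access pattern of sorting defeats that buffer, so the buffered path degenerates to roughly one syscall per lookup.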


Re: how do lucene read large index files?

2016-11-24 Thread Kumaran Ramasubramanian
Erick, thanks a lot for sharing an excellent post.

Btw, I am using NIOFSDirectory; could you please elaborate on the lines
quoted below, or share any further pointers?

NIOFSDirectory or SimpleFSDirectory, we have to pay another price: Our code
> has to do a lot of syscalls to the O/S kernel to copy blocks of data
> between the disk or filesystem cache and our buffers residing in Java heap.
> This needs to be done on every search request, over and over again.




--
Kumaran R



On Wed, Nov 23, 2016 at 9:17 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> see Uwe's blog:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Short form: files are read into the OS's memory as needed. the whole
> file isn't read at once.
>
> Best,
> Erick
>
> On Wed, Nov 23, 2016 at 12:04 AM, Kumaran Ramasubramanian
> <kums@gmail.com> wrote:
> > Hi All,
> >
> > how do lucene read large index files?
> > for example, if one file (for eg: .dat file) is 4GB.
> > lucene read only part of file to RAM? or
> > is it different approach for different lucene file formats?
> >
> >
> > Related Link:
> > How do applications (and OS) handle very big files?
> > http://superuser.com/a/361201
> >
> >
> > --
> > Kumaran R
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
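In code, switching the search-side Directory is a one-line change. A hedged sketch against the Lucene 5.x API (on 4.x, MMapDirectory takes a java.io.File instead of a Path; the index path here is made up):

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;

public class DirectoryChoice {
    static DirectoryReader openReader() throws IOException {
        // Explicitly request memory-mapped access:
        Directory dir = new MMapDirectory(Paths.get("/path/to/index"));

        // Or let Lucene choose: on a 64-bit JVM, FSDirectory.open
        // already returns an MMapDirectory (auto is unused here, shown
        // only for comparison).
        Directory auto = FSDirectory.open(Paths.get("/path/to/index"));

        return DirectoryReader.open(dir);
    }
}
```

Because the blog's advice is about read access, an index written through NIOFSDirectory can simply be reopened with MMapDirectory for searching; the on-disk format is identical.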


how do lucene read large index files?

2016-11-23 Thread Kumaran Ramasubramanian
Hi All,

How does Lucene read large index files?
For example, if one file (e.g. a .dat file) is 4 GB,
does Lucene read only part of the file into RAM, or
is the approach different for different Lucene file formats?


Related Link:
How do applications (and OS) handle very big files?
http://superuser.com/a/361201


--
Kumaran R


Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-10 Thread Kumaran Ramasubramanian
Hi All,
We all know that Lucene supports faceting via a taxonomy (separate index,
hierarchical facets) and via SortedSetDocValuesFacetField (flat facets, no
sidecar index).

  Then why did Solr and Elasticsearch go for their own implementations?
(That is, Solr uses block join & Elasticsearch uses aggregations.) Are
there any limitations in Lucene's implementation?


--
Kumaran R


Re: Indexing values of different datatype under same field

2016-11-04 Thread Kumaran Ramasubramanian
Hi Rajnish

It is not advisable to index values of two data types in the same field.
Features like phrase queries and sorting may break in such indexes.

related previous discussion :
http://www.gossamer-threads.com/lists/lucene/java-user/289159?do=post_view_flat#289159


-
Kumaran R








On Fri, Nov 4, 2016 at 8:30 AM, Rajnish kamboj 
wrote:

> Hi
>
> Is it advisable to store and index values of different datatype under same
> field as follows
>
> Field field = new LongField("*region*", 10L, Field.Store.YES);
> doc.add(field);
> Field field1 = new StringField("*region*", "NORTH", Field.Store.YES);
> doc.add(field1);
>
> Our field "region" can have numeric and string data types.
>
> Our query can have two patterns as under:
> #1 region="NORTH"
> #2 region in range 5 TO 20
>
> Though we are able to index and retrieve desired results but,
>
> *We could not find Lucene (5.3.1) documentation around this behavior.*
> Please comment on,
> 1. If we can go with this behavior and what would be the performance
> implication of indexing and querying different datatype under same field?
> 2. How the two are stored internally (i.e. different datatype under same
> field)?
> 3. If we upgrade to the newer Lucene version 5.4 or to major version 6.0.0,
> will the above behavior keep working, or may it break?
>
>
> Regards
> Rajnish
>


indexing analyzed and not_analyzed values in same field

2016-10-25 Thread Kumaran Ramasubramanian
Hi All,

I have indexed 4 documents in an index where the BANKNAME field is analyzed
in two documents and not_analyzed in the other two. I have listed search
cases below where I am able to search using both analyzed (classic
analyzer) and not_analyzed (keyword analyzer) terms. But is it right to
have an index with both analyzed and not_analyzed values in one field?




output:


BANKNAME field of these two documents is analyzed

using classic analyzer
 query : BANKNAME:"swiss bank"
total hits:2

DocId:0  DocScore:1.6096026
[stored,indexed,tokenized,
stored,indexed,tokenized,
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY]

DocId:2  DocScore:1.6096026
[stored,indexed,tokenized,
stored,indexed,tokenized,
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY]





BANKNAME field of these two documents is not analyzed

using keyword analyzer
query : BANKNAME:swiss bank
total hits:2

DocId:1  DocScore:1.287682
[stored,indexed,tokenized,
stored,indexed,tokenized,
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY]

DocId:3  DocScore:1.287682
[stored,indexed,tokenized,
stored,indexed,tokenized,
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY]






--
Kumaran R
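Instead of mixing analyzed and not-analyzed values inside one field, the usual pattern is two fields driven by a per-field analyzer. A sketch assuming Lucene 5.x; the `BANKNAME_RAW` field name is invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;

public class BankAnalyzers {
    static Analyzer build() {
        Map<String, Analyzer> perField = new HashMap<>();
        // BANKNAME_RAW keeps the whole value as a single token (exact match);
        // every other field, including BANKNAME, falls back to ClassicAnalyzer.
        perField.put("BANKNAME_RAW", new KeywordAnalyzer());
        return new PerFieldAnalyzerWrapper(new ClassicAnalyzer(), perField);
    }
}
```

Queries then pick the field: BANKNAME:"swiss bank" for the tokenized form, and a single-term query on BANKNAME_RAW for the exact form, so no document mixes the two conventions in one field.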


Re: How to add ASCIIFoldingFilter in ClassicAnalyzer

2016-10-20 Thread Kumaran Ramasubramanian
Hi Adrien

Thanks a lot for the pointer.


--
Kumaran R


On Wed, Oct 19, 2016 at 8:07 PM, Adrien Grand <jpou...@gmail.com> wrote:

> You would need to override the wrapComponents method in order to wrap the
> tokenstream. See for instance Lucene's LimitTokenCountAnalyzer.
>
> Le mar. 18 oct. 2016 à 18:46, Kumaran Ramasubramanian <kums@gmail.com>
> a écrit :
>
> > Hi Adrien
> >
> > How to do this? Any Pointers?
> >
> > ​
> > > If it is fine to add the ascii folding filter at the end of the
> analysis
> >
> > chain, then you could use AnalyzerWrapper. ​
> > >
> >
> >
> >
> >
> > ​-
> > Kumaran R​
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Oct 11, 2016 at 9:59 PM, Kumaran Ramasubramanian <
> > kums@gmail.com
> > > wrote:
> >
> > >
> > >
> > > @Ahmet, Uwe: Thanks a lot for your suggestion. Already i have written
> > > custom analyzer as you said. But just trying to avoid new component in
> my
> > > search flow.
> > >
> > > @Adrien: how to add filter using AnalyzerWrapper. Any pointers?
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Tue, Oct 11, 2016 at 8:16 PM, Uwe Schindler <u...@thetaphi.de>
> wrote:
> > >
> > >> I'd suggest to use CustomAnalyzer for defining your own analyzer. This
> > >> allows to build your own analyzer with the components (tokenizers and
> > >> filters) you like to have.
> > >>
> > >> Uwe
> > >>
> > >> -
> > >> Uwe Schindler
> > >> H.-H.-Meier-Allee 63, D-28213 Bremen
> > >> http://www.thetaphi.de
> > >> eMail: u...@thetaphi.de
> > >>
> > >> > -Original Message-
> > >> > From: Adrien Grand [mailto:jpou...@gmail.com]
> > >> > Sent: Tuesday, October 11, 2016 4:37 PM
> > >> > To: java-user@lucene.apache.org
> > >> > Subject: Re: How to add ASCIIFoldingFilter in ClassicAnalyzer
> > >> >
> > >> > Hi Kumaran,
> > >> >
> > >> > If it is fine to add the ascii folding filter at the end of the
> > analysis
> > >> > chain, then you could use AnalyzerWrapper. Otherwise, you need to
> > >> create a
> > >> > new analyzer that has the same analysis chain as ClassicAnalyzer,
> plus
> > >> an
> > >> > ASCIIFoldingFilter.
> > >> >
> > >> > Le mar. 11 oct. 2016 à 16:22, Kumaran Ramasubramanian
> > >> > <kums@gmail.com>
> > >> > a écrit :
> > >> >
> > >> > > Hi All,
> > >> > >
> > >> > >   Is there any way to add ASCIIFoldingFilter over ClassicAnalyzer
> > >> without
> > >> > > writing a new custom analyzer ? should i extend
> StopwordAnalyzerBase
> > >> > again?
> > >> > >
> > >> > >
> > >> > > I know that ClassicAnalyzer is final. any special purpose for
> making
> > >> it as
> > >> > > final? Because, StandardAnalyzer was not final before ?
> > >> > >
> > >> > > public final class ClassicAnalyzer extends StopwordAnalyzerBase
> > >> > > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Kumaran R
> > >> > >
> > >>
> > >>
> > >> -
> > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >>
> > >>
> > >
> >
>


Re: How to add ASCIIFoldingFilter in ClassicAnalyzer

2016-10-18 Thread Kumaran Ramasubramanian
Hi Adrien

How to do this? Any Pointers?

​
> If it is fine to add the ascii folding filter at the end of the analysis

chain, then you could use AnalyzerWrapper. ​
>




​-
Kumaran R​









On Tue, Oct 11, 2016 at 9:59 PM, Kumaran Ramasubramanian <kums@gmail.com
> wrote:

>
>
> @Ahmet, Uwe: Thanks a lot for your suggestion. Already i have written
> custom analyzer as you said. But just trying to avoid new component in my
> search flow.
>
> @Adrien: how to add filter using AnalyzerWrapper. Any pointers?
>
>
>
>
>
>
>
>
>
> On Tue, Oct 11, 2016 at 8:16 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>
>> I'd suggest to use CustomAnalyzer for defining your own analyzer. This
>> allows to build your own analyzer with the components (tokenizers and
>> filters) you like to have.
>>
>> Uwe
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> > -Original Message-
>> > From: Adrien Grand [mailto:jpou...@gmail.com]
>> > Sent: Tuesday, October 11, 2016 4:37 PM
>> > To: java-user@lucene.apache.org
>> > Subject: Re: How to add ASCIIFoldingFilter in ClassicAnalyzer
>> >
>> > Hi Kumaran,
>> >
>> > If it is fine to add the ascii folding filter at the end of the analysis
>> > chain, then you could use AnalyzerWrapper. Otherwise, you need to
>> create a
>> > new analyzer that has the same analysis chain as ClassicAnalyzer, plus
>> an
>> > ASCIIFoldingFilter.
>> >
>> > Le mar. 11 oct. 2016 à 16:22, Kumaran Ramasubramanian
>> > <kums@gmail.com>
>> > a écrit :
>> >
>> > > Hi All,
>> > >
>> > >   Is there any way to add ASCIIFoldingFilter over ClassicAnalyzer
>> without
>> > > writing a new custom analyzer ? should i extend StopwordAnalyzerBase
>> > again?
>> > >
>> > >
>> > > I know that ClassicAnalyzer is final. any special purpose for making
>> it as
>> > > final? Because, StandardAnalyzer was not final before ?
>> > >
>> > > public final class ClassicAnalyzer extends StopwordAnalyzerBase
>> > > >
>> > >
>> > >
>> > > --
>> > > Kumaran R
>> > >
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>


Re: How to add ASCIIFoldingFilter in ClassicAnalyzer

2016-10-11 Thread Kumaran Ramasubramanian
@Ahmet, Uwe: Thanks a lot for your suggestion. I have already written a
custom analyzer as you said, but I am trying to avoid a new component in my
search flow.

@Adrien: How do I add the filter using AnalyzerWrapper? Any pointers?









On Tue, Oct 11, 2016 at 8:16 PM, Uwe Schindler <u...@thetaphi.de> wrote:

> I'd suggest to use CustomAnalyzer for defining your own analyzer. This
> allows to build your own analyzer with the components (tokenizers and
> filters) you like to have.
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -Original Message-
> > From: Adrien Grand [mailto:jpou...@gmail.com]
> > Sent: Tuesday, October 11, 2016 4:37 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: How to add ASCIIFoldingFilter in ClassicAnalyzer
> >
> > Hi Kumaran,
> >
> > If it is fine to add the ascii folding filter at the end of the analysis
> > chain, then you could use AnalyzerWrapper. Otherwise, you need to create
> a
> > new analyzer that has the same analysis chain as ClassicAnalyzer, plus an
> > ASCIIFoldingFilter.
> >
> > Le mar. 11 oct. 2016 à 16:22, Kumaran Ramasubramanian
> > <kums@gmail.com>
> > a écrit :
> >
> > > Hi All,
> > >
> > >   Is there any way to add ASCIIFoldingFilter over ClassicAnalyzer
> without
> > > writing a new custom analyzer ? should i extend StopwordAnalyzerBase
> > again?
> > >
> > >
> > > I know that ClassicAnalyzer is final. any special purpose for making
> it as
> > > final? Because, StandardAnalyzer was not final before ?
> > >
> > > public final class ClassicAnalyzer extends StopwordAnalyzerBase
> > > >
> > >
> > >
> > > --
> > > Kumaran R
> > >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


How to add ASCIIFoldingFilter in ClassicAnalyzer

2016-10-11 Thread Kumaran Ramasubramanian
Hi All,

  Is there any way to add ASCIIFoldingFilter on top of ClassicAnalyzer
without writing a new custom analyzer? Should I extend StopwordAnalyzerBase
again?

I know that ClassicAnalyzer is final. Is there any special purpose for
making it final? Because StandardAnalyzer was not final before.

public final class ClassicAnalyzer extends StopwordAnalyzerBase
>


--
Kumaran R
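Following Uwe's CustomAnalyzer suggestion from this thread, a ClassicAnalyzer-like chain with ASCII folding appended can be declared without subclassing anything. A sketch against Lucene 5.x; the SPI factory names ("classic", "lowercase", "stop", "asciiFolding") are my assumption of the registered names, so verify them against your version:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class FoldingClassicAnalyzer {
    static Analyzer build() throws IOException {
        return CustomAnalyzer.builder()
            .withTokenizer("classic")        // ClassicTokenizerFactory
            .addTokenFilter("classic")       // ClassicFilterFactory (acronyms, possessives)
            .addTokenFilter("lowercase")
            .addTokenFilter("stop")          // default English stop words
            .addTokenFilter("asciiFolding")  // folds e.g. é -> e, at the end of the chain
            .build();
    }
}
```

This matches Adrien's note that folding at the end of the chain is the easy case; only if the filter must sit mid-chain do you need AnalyzerWrapper's wrapComponents.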


Re: Clarification on LUCENE 4795 discussions ( Add FacetsCollector based on SortedSetDocValues )

2016-09-26 Thread Kumaran Ramasubramanian
Thank you shai. Will check them and let you know for clarifications.

-
Kumaran R

On Sep 27, 2016 10:05 AM, "Shai Erera" <ser...@gmail.com> wrote:
>
> Hey,
>
> Here's a blog I wrote a couple years ago about using facet associations:
> http://shaierera.blogspot.com/2013/01/facet-associations.html. Note that
> the examples in the blog were written against a very old Lucene version
> (4.7 maybe). We have a couple of demo files that are maintained with the
> code changes here
>
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=tree;f=lucene/demo/src/java/org/apache/lucene/demo/facet;h=41085e3aaa1d4d0697a5ef5d9853a093c1600ca6;hb=HEAD
.
> Check them out, especially this one:
>
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=blob;f=lucene/demo/src/java/org/apache/lucene/demo/facet/AssociationsFacetsExample.java;h=3e2737d0c8f02d12e4fdb76f97891c8593ef5fbc;hb=HEAD
>
> Hope this helps!
>
> Shai
>
> On Tue, Sep 27, 2016 at 7:20 AM Kumaran Ramasubramanian <
kums@gmail.com>
> wrote:
>
> > Hi mike,
> >
> > Thanks for the clarification. Any example about difference in using
flat vs
> > hierarchical facets? Any demo or sample page?
> >
> > In a previous thread yesterday ( Faceting: Taxonomy index Vs
> > SortedSetDocValues ), there is a point like
> >
> > "tried to achieve multilevel (hierarchical) categorization using
> > SortedSetDocValues and got it simply by changing the query  and opening
the
> > IndexReader for each level of query using
SortedSetDocValuesReaderState. "
> >
> > Is it possible easily?
> >
> > -
> > Kumaran R
> >
> > On Sep 27, 2016 9:38 AM, "Michael McCandless" <luc...@mikemccandless.com
>
> > wrote:
> > >
> > > Weighted facets is the ability to associate a float value with each
> > > facet label you index, and at search time to aggregate those floats.
> > > See e.g. FloatAssociationFacetField.
> > >
> > > "other features" refers to hierarchical facets, which
> > > SortedSetDocValuesFacetField does not support (just flat facets)
> > > though this is possible to fix, I think (patches welcome!).
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > > On Mon, Sep 26, 2016 at 5:24 PM, Kumaran Ramasubramanian
> > > <kums@gmail.com> wrote:
> > > >
> > > >
> > > > Hi All,
> > > >
> > > > i want to know the list of features which can be used by
applications
> > > > using facet module of lucene.
> > > >
> > > >
> >
https://issues.apache.org/jira/browse/LUCENE-4795?focusedCommentId=13599687
> > > >
> > > > I ask because it seems that the only thing that we get from this
> > SortedSet
> > > >> approach is not having to maintain a sidecar index (which for some
> > reason
> > > >> freaks everybody), and we even lose performance. Plus, I don't see
how
> > we
> > > >> can support other facet features with it.
> > > >
> > > >
> > > > on the other hand SortedSet doesn't have these problems. maybe it
> > doesnt
> > > >> support weighted facets or other features, but its a nice option. I
> > > >> personally don't think its the end of the world if Mike's patch
doesnt
> > > >> support all the features of the faceting module initially or even
> > ever.
> > > >
> > > >
> > > >
> > > >
> > > > What is meant by weighted facets? What are the other facet features?
> > > >
> > > >
> > > > --
> > > > Kumaran R
> > > >
> >


Clarification on LUCENE 4795 discussions ( Add FacetsCollector based on SortedSetDocValues )

2016-09-26 Thread Kumaran Ramasubramanian
Hi All,

I want to know the list of features of Lucene's facet module that can be
used by applications.

https://issues.apache.org/jira/browse/LUCENE-4795?focusedCommentId=13599687

I ask because it seems that the only thing that we get from this SortedSet
> approach is not having to maintain a sidecar index (which for some reason
> freaks everybody), and we even lose performance. Plus, I don't see how we
> can support other facet features with it.


on the other hand SortedSet doesn't have these problems. maybe it doesnt
> support weighted facets or other features, but its a nice option. I
> personally don't think its the end of the world if Mike's patch doesnt
> support all the features of the faceting module initially or even ever.




What is meant by weighted facets? What are the other facet features?


--
Kumaran R


Re: parent-child relationship in lucene - to avoid reindexing if parent information changes

2016-08-30 Thread Kumaran Ramasubramanian
Hi Ralph,

Thank you for the response. Yes, that is one workaround.
For searching, though, what you suggested is costly, and it also takes more
time if the number of groups is large (could we use a query-time join?).

Also, my second problem remains the same (adding a member to a group),
because I want to make all existing messages in a group visible to any new
member, so I would need to reindex all messages with the newly added
member id.

Would an index-time join (for the second case) or a query-time join (for
the first case) be the best fit?

--
Kumaran R






On Tue, Aug 30, 2016 at 1:55 PM, Ralph Soika <ralph.so...@imixs.com> wrote:

> Hi,
>
> I think this is more a problem of the data model.
> You should not link a message to a group by the group name. Instead use a
> GroupID (which is unique) to refer to the group. The GroupID is a
> 'non-analyzed' and 'not-stored' field in your lucene document.
>
> Then, when you want to search for all messages assigned to groups the user
> is member of, first search for the groups where the user is member to get
> the id, and next search all messages with that ids.
>
> So there should no need to reindex.
>
> ===
> Ralph
>
>
>
> On 30.08.2016 08:38, Kumaran Ramasubramanian wrote:
>
>> Hi All,
>>
>>
>> Am building a sample application, where a group of members can interact as
>> a chat room. i am trying to enable search for message level search...
>>
>>
>> If i denormalize group_name & group_members in every lucene document, then
>> below cases will reindex more number of lucene documents...
>>
>> 1. editing group name
>> 2. adding / deleting a member
>>
>>
>> So am trying to index group_name, group_members(member ids as csv)  as
>> parent and every text message & message_id as child.
>> By using parent & child, i am trying to solve 1 * m cases...
>> If there are 1 lakh messages under one parent, how to delete a member id
>> or
>> edit a group name without reindexing of its children??
>>
>>
>> is it possible to avoid reindexing? Which lucene class is best fit for
>> this?
>>
>> Related Article:
>> http://blog.mikemccandless.com/2012/01/searching-relational-
>> content-with.html
>>
>>
>>
>>
>> --
>> Kumaran R
>>
>>
>
> --
> *Imixs*...extends the way people work together
> We are an open source company, read more at: www.imixs.org <
> http://www.imixs.org>
> 
> Imixs Software Solutions GmbH
> Agnes-Pockels-Bogen 1, 80992 München
> *Web:* www.imixs.com <http://www.imixs.com>
> *Office:* +49 (0)89-452136 16 *Mobil:* +49-177-4128245
> Registergericht: Amtsgericht Muenchen, HRB 136045
> Geschaeftsfuehrer: Gaby Heinle u. Ralph Soika
>
>


parent-child relationship in lucene - to avoid reindexing if parent information changes

2016-08-30 Thread Kumaran Ramasubramanian
Hi All,


I am building a sample application where a group of members can interact in
a chat room, and I am trying to enable message-level search.

If I denormalize group_name & group_members into every Lucene document,
then the cases below will reindex a large number of Lucene documents:

1. editing the group name
2. adding / deleting a member

So I am trying to index group_name and group_members (member ids as CSV) as
the parent, and every text message & message_id as a child.
By using parent & child, I am trying to solve the 1 * m case.
If there are 100,000 (1 lakh) messages under one parent, how can I delete a
member id or edit a group name without reindexing its children?

Is it possible to avoid reindexing? Which Lucene class is the best fit for
this?

Related Article:
http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html




--
Kumaran R
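For the first case (message search restricted to the user's groups), Lucene's query-time join can resolve the user's groups and the matching messages in one query, without denormalizing members into every message. A hedged sketch assuming the Lucene join module API and invented field names (`group_id` on both document kinds, `member_id` on group docs):

```java
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;

public class GroupMessageJoin {
    // Message docs carry only group_id; group docs carry group_id + member_id.
    static Query messagesVisibleTo(String memberId, IndexSearcher searcher)
            throws IOException {
        Query groupsOfMember = new TermQuery(new Term("member_id", memberId));
        return JoinUtil.createJoinQuery(
            "group_id",      // fromField, read from the matched group docs
            false,           // each group doc holds a single group_id value
            "group_id",      // toField, matched on the message docs
            groupsOfMember,  // selects the user's groups
            searcher,
            ScoreMode.None);
    }
}
```

With this layout, adding or removing a member touches only the one group document, not its messages; the trade-off is the extra join work at query time.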


In lucene faceting, advantages of taxonomy index over ​SortedSetDocValuesFacetField

2016-08-06 Thread Kumaran Ramasubramanian
Hi All,


As per documentation of SortedSetDocValuesFacetCounts,


Compute facets counts from previously indexed
> SortedSetDocValuesFacetField, without requiring a separate taxonomy index.
> Faceting is a bit slower (~25%), and there is added cost on every
> IndexReader open to create a new SortedSetDocValuesReaderState.
> Furthermore, this does not support hierarchical facets; only flat
> (dimension + label) facets, but it uses quite a bit less RAM to do so.
>


What is meant by "does not support hierarchical facets; only flat" here?
What functionality will we miss because of this?


Other than faster faceting, is there any benefit to using a taxonomy index
over a doc values field for faceting?



--
Kumaran R
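For reference, the SortedSetDocValues flavor looks roughly like this end to end. A sketch assuming the facet module API around Lucene 4.10/5.x; the "Author"/"Lisa" values are invented, and the writer/reader/searcher/query arguments are assumed to exist in the caller:

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.sortedset.DefaultSortedSetDocValuesReaderState;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SsdvFacetSketch {
    static void indexOne(IndexWriter writer, FacetsConfig config)
            throws IOException {
        // Index side: flat dim/label pairs, no sidecar taxonomy index or writer.
        Document doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Lisa"));
        writer.addDocument(config.build(doc));  // note: no TaxonomyWriter argument
    }

    static Facets count(DirectoryReader reader, IndexSearcher searcher,
                        Query query) throws IOException {
        // Search side: the state must be rebuilt for each newly opened reader --
        // this is the "added cost on every IndexReader open" from the javadoc.
        DefaultSortedSetDocValuesReaderState state =
            new DefaultSortedSetDocValuesReaderState(reader);
        FacetsCollector fc = new FacetsCollector();
        FacetsCollector.search(searcher, query, 10, fc);
        return new SortedSetDocValuesFacetCounts(state, fc);
    }
}
```

The taxonomy flavor swaps the reader-state step for a persistent sidecar index, which is what buys it hierarchical paths and the faceting speedup mentioned above.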


Re: How to store and retrieve latest utf8mb4 emoji / smiley characters in lucene index

2016-08-01 Thread Kumaran Ramasubramanian
Hi All,

i tried to index some emoji / smiley characters in lucene index. In
fetched search results, every smiley character is returned as 4 "?"
characters as shown in attached image ( 4 byte smiley characters they are
).

The sample text i tried to index is

> bcdeewillifindfgh​


Any further pointers on the same ??

--
Kumaran R







On Mon, Aug 1, 2016 at 12:07 AM, Kumaran Ramasubramanian <kums@gmail.com>
wrote:

>

> Hi All,
>
> Is there any pointers on storing smileys in lucene index?? Any help is
much appreciated.
>
> Thank you.
>
> --
> Kumaran R
>
>
> On Jul 30, 2016 12:24 PM, "Kumaran Ramasubramanian" <kums@gmail.com>
wrote:

>>

>>
>> Hi All,
>>
>> Am using lucene 4.10.4. Using lucene index, Is there any way to store
and retrieve latest utf8 and utf8mb4 emoji / smiley characters?? In any
latest lucene version??
>>
>> Thanks in advance.
>>
>> --
>> Kumaran R


Re: How to store and retrieve latest utf8mb4 emoji / smiley characters in lucene index

2016-07-31 Thread Kumaran Ramasubramanian
Hi All,

Are there any pointers on storing smileys in a Lucene index? Any help is
much appreciated.

Thank you.

--
Kumaran R

On Jul 30, 2016 12:24 PM, "Kumaran Ramasubramanian" <kums@gmail.com>
wrote:

>
> Hi All,
>
> Am using lucene 4.10.4. Using lucene index, Is there any way to store and
> retrieve latest utf8 and utf8mb4 emoji / smiley characters?? In any latest
> lucene version??
>
> Thanks in advance.
>
> --
> Kumaran R
>


How to store and retrieve latest utf8mb4 emoji / smiley characters in lucene index

2016-07-30 Thread Kumaran Ramasubramanian
Hi All,

I am using Lucene 4.10.4. Using a Lucene index, is there any way to store
and retrieve the latest utf8 and utf8mb4 emoji / smiley characters, in any
recent Lucene version?

Thanks in advance.

--
Kumaran R


Re: Indexing and storing Long fields

2016-07-28 Thread Kumaran Ramasubramanian
OK Mike, thanks for the explanation. I have another doubt:

I read in an article that we can have one stored field & one doc values
field under the same field name. Is that so?


--
Kumaran R






On Thu, Jul 28, 2016 at 9:29 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> OK, sorry, you cannot change how the field is indexed for the same field
> name across different field indices.
>
> Lucene will "downgrade" that field to the lowest settings, e.g. "docs, no
> positions" in your case.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Jul 28, 2016 at 9:31 AM, Kumaran Ramasubramanian <
> kums@gmail.com
> > wrote:
>
> > Hi Mike,
> >
> >  For your information, am using lucene 4.10.4.. am i missing anything?
> >
> >
> >
> > ​--
> > Kumaran R​
> >
> >
> >
> >
> > On Wed, Jul 27, 2016 at 1:52 AM, Kumaran Ramasubramanian <
> > kums@gmail.com
> > > wrote:
> >
> > >
> > > Hi Mike,
> > >
> > > 1.if we index one field as analyzed and not analyzed using same name,
> > > phrase queries are not working (field "comp" was indexed without
> position
> > > data, cannot run phrasequery) for analyzed terms also... because
> indexed
> > > document ( term properties are not proper, even if tokenized, not able
> to
> > > search "bank" or "swiss" or "world") looks like
> > >
> > > *while we index*
> > >
> > > Document<*stored,indexed**,tokenized**
> > > stored,indexed,tokenized
> > > stored,indexed,tokenized
> stored,indexed,tokenized
> > > stored,indexed,tokenized>
> > > Document<*stored,indexed
> > > stored,indexed,tokenized
> > > stored,indexed,tokenized
> stored,indexed,tokenized
> > > stored,indexed,tokenized>
> > >
> > >
> > > *in index*
> > >
> > > Document<*stored,indexed**,tokenized**
> > > stored,indexed,tokenized
> > > stored,indexed,tokenized
> stored,indexed,tokenized
> > > stored,indexed,tokenized>
> > > Document<*stored,indexed,tokenized
> > > stored,indexed,tokenized
> > > stored,indexed,tokenized
> stored,indexed,tokenized
> > > stored,indexed,tokenized>
> > >
> > > *impact:*
> > >
> > > *stored,indexed is changed to **stored,indexed**,tokenized*
> > >
> > > *Related links:*
> > >
> > > *https://github.com/elastic/elasticsearch/issues/12079
> > > <https://github.com/elastic/elasticsearch/issues/12079>*
> > >
> > > *https://github.com/elastic/elasticsearch/issues/4475
> > > <https://github.com/elastic/elasticsearch/issues/4475>*
> > >
> > > *
> >
> http://stackoverflow.com/questions/19302887/elasticsearch-field-title-was-indexed-without-position-data-cannot-run-phras
> > > <
> >
> http://stackoverflow.com/questions/19302887/elasticsearch-field-title-was-indexed-without-position-data-cannot-run-phras
> > >*
> > >
> > >
> > >
> > > *2.similarly, for numeric field & string field using same field*
> > >
> > > Also, if we index numeric & stringfield using same field name in single
> > > index, we do lose position data of indexed string terms and so phrase
> > > queries not working ( field  "fieldname" was indexed without position
> > > data, cannot run phrasequery)
> > >
> > >
> > >
> > >
> >
> https://mail-archives.apache.org/mod_mbox/lucene-java-user/201510.mbox/%3CCAHTScUgTYgSLP9OmoMe2ebVBHw8=trih5b++u7v050vnrqz...@mail.gmail.com%3E
> > >
> > >
> > >
> > > > I would be pretty skeptical of this approach You're
> > >
> > > > mixing numeric data with textual data and I expect
> > >
> > > > the results to be unpredictable. You already said
> > >
> > > > "it is working for most of the
> > >
> > > > documents except one or two documents." I predict
> > >
> > > > you'll find more and more of these as time passes.
> > >
> > > >
> > >
> > > > Expect many more anomalies. At best you need to
> > >
> > > > index both forms as text rather than mixing numeric
> > >
> > > > and text data.
> > >
> > >
> > >
> > > Thanks in advance...
> > >
> > >
> > >
> > > --
> > > Kumaran R
> > >
> > >
> > >
> > >
> > >
> > > On Sun, Jul 24, 2016 at 1:54 AM, Michael McCandless <
> > > luc...@mikemccandless.com> wrote:
> > >
> > >> On Sat, Jul 23, 2016 at 4:48 AM, Kumaran Ramasubramanian <
> > >> kums@gmail.com
> > >> > wrote:
> > >>
> > >> > Hi Mike,
> > >> >
> > >> > *Two different fields can be the same name*
> > >> >
> > >> > Is it so? You mean we can index one field as docvaluefield and also
> > >> stored
> > >> > field, Using same name?
> > >> >
> > >>
> > >> This should be fine, yes.
> > >>
> > >>
> > >> > And AFAIK, We cannot index one field as analyzed and not analyzed
> > using
> > >> the
> > >> > same name. Am i right?
> > >> >
> > >>
> > >> Hmm, I think you can do this?  The first one will be tokenized, and
> the
> > >> second indexed as a single token.
> > >>
> > >> Or do you see otherwise?
> > >>
> > >> Mike McCandless
> > >>
> > >> http://blog.mikemccandless.com
> > >>
> > >
> > >
> >
>


Re: Indexing and storing Long fields

2016-07-28 Thread Kumaran Ramasubramanian
Hi Mike,

 For your information, I am using Lucene 4.10.4. Am I missing anything?



​--
Kumaran R​




On Wed, Jul 27, 2016 at 1:52 AM, Kumaran Ramasubramanian <kums@gmail.com
> wrote:

>
> Hi Mike,
>
> 1.if we index one field as analyzed and not analyzed using same name,
> phrase queries are not working (field "comp" was indexed without position
> data, cannot run phrasequery) for analyzed terms also... because indexed
> document ( term properties are not proper, even if tokenized, not able to
> search "bank" or "swiss" or "world") looks like
>
> *while we index*
>
> Document<*stored,indexed**,tokenized**
> stored,indexed,tokenized
> stored,indexed,tokenized stored,indexed,tokenized
> stored,indexed,tokenized>
> Document<*stored,indexed
> stored,indexed,tokenized
> stored,indexed,tokenized stored,indexed,tokenized
> stored,indexed,tokenized>
>
>
> *in index*
>
> Document<*stored,indexed**,tokenized**
> stored,indexed,tokenized
> stored,indexed,tokenized stored,indexed,tokenized
> stored,indexed,tokenized>
> Document<*stored,indexed,tokenized
> stored,indexed,tokenized
> stored,indexed,tokenized stored,indexed,tokenized
> stored,indexed,tokenized>
>
> *impact:*
>
> *stored,indexed is changed to **stored,indexed**,tokenized*
>
> *Related links:*
>
> *https://github.com/elastic/elasticsearch/issues/12079
> <https://github.com/elastic/elasticsearch/issues/12079>*
>
> *https://github.com/elastic/elasticsearch/issues/4475
> <https://github.com/elastic/elasticsearch/issues/4475>*
>
> *http://stackoverflow.com/questions/19302887/elasticsearch-field-title-was-indexed-without-position-data-cannot-run-phras
> <http://stackoverflow.com/questions/19302887/elasticsearch-field-title-was-indexed-without-position-data-cannot-run-phras>*
>
>
>
> *2.similarly, for numeric field & string field using same field*
>
> Also, if we index numeric & stringfield using same field name in single
> index, we do lose position data of indexed string terms and so phrase
> queries not working ( field  "fieldname" was indexed without position
> data, cannot run phrasequery)
>
>
>
> https://mail-archives.apache.org/mod_mbox/lucene-java-user/201510.mbox/%3CCAHTScUgTYgSLP9OmoMe2ebVBHw8=trih5b++u7v050vnrqz...@mail.gmail.com%3E
>
>
>
> > I would be pretty skeptical of this approach You're
>
> > mixing numeric data with textual data and I expect
>
> > the results to be unpredictable. You already said
>
> > "it is working for most of the
>
> > documents except one or two documents." I predict
>
> > you'll find more and more of these as time passes.
>
> >
>
> > Expect many more anomalies. At best you need to
>
> > index both forms as text rather than mixing numeric
>
> > and text data.
>
>
>
> Thanks in advance...
>
>
>
> --
> Kumaran R
>
>
>
>
>
> On Sun, Jul 24, 2016 at 1:54 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> On Sat, Jul 23, 2016 at 4:48 AM, Kumaran Ramasubramanian <
>> kums@gmail.com
>> > wrote:
>>
>> > Hi Mike,
>> >
>> > *Two different fields can be the same name*
>> >
>> > Is it so? You mean we can index one field as docvaluefield and also
>> stored
>> > field, Using same name?
>> >
>>
>> This should be fine, yes.
>>
>>
>> > And AFAIK, We cannot index one field as analyzed and not analyzed using
>> the
>> > same name. Am i right?
>> >
>>
>> Hmm, I think you can do this?  The first one will be tokenized, and the
>> second indexed as a single token.
>>
>> Or do you see otherwise?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>
>


Re: Indexing and storing Long fields

2016-07-26 Thread Kumaran Ramasubramanian
Hi Mike,

1. If we index one field as analyzed and not analyzed using the same name,
phrase queries stop working (field "comp" was indexed without position
data, cannot run PhraseQuery) even for the analyzed terms, because the
term properties in the indexed document are not correct (even though the
field is tokenized, we are not able to search "bank", "swiss" or "world").
It looks like:

*while we index*

Document<*stored,indexed**,tokenized**
stored,indexed,tokenized
stored,indexed,tokenized stored,indexed,tokenized
stored,indexed,tokenized>
Document<*stored,indexed
stored,indexed,tokenized
stored,indexed,tokenized stored,indexed,tokenized
stored,indexed,tokenized>


*in index*

Document<*stored,indexed**,tokenized**
stored,indexed,tokenized
stored,indexed,tokenized stored,indexed,tokenized
stored,indexed,tokenized>
Document<*stored,indexed,tokenized
stored,indexed,tokenized
stored,indexed,tokenized stored,indexed,tokenized
stored,indexed,tokenized>

*impact:*

*stored,indexed is changed to **stored,indexed**,tokenized*

*Related links:*

*https://github.com/elastic/elasticsearch/issues/12079
<https://github.com/elastic/elasticsearch/issues/12079>*

*https://github.com/elastic/elasticsearch/issues/4475
<https://github.com/elastic/elasticsearch/issues/4475>*

*http://stackoverflow.com/questions/19302887/elasticsearch-field-title-was-indexed-without-position-data-cannot-run-phras
<http://stackoverflow.com/questions/19302887/elasticsearch-field-title-was-indexed-without-position-data-cannot-run-phras>*



*2. Similarly, for a numeric field & a string field using the same field name*

Also, if we index a numeric field & a string field using the same field name
in a single index, we lose the position data of the indexed string terms, and
so phrase queries do not work (field "fieldname" was indexed without position
data, cannot run PhraseQuery).


https://mail-archives.apache.org/mod_mbox/lucene-java-user/201510.mbox/%3CCAHTScUgTYgSLP9OmoMe2ebVBHw8=trih5b++u7v050vnrqz...@mail.gmail.com%3E



> I would be pretty skeptical of this approach You're

> mixing numeric data with textual data and I expect

> the results to be unpredictable. You already said

> "it is working for most of the

> documents except one or two documents." I predict

> you'll find more and more of these as time passes.

>

> Expect many more anomalies. At best you need to

> index both forms as text rather than mixing numeric

> and text data.



Thanks in advance...



--
Kumaran R





On Sun, Jul 24, 2016 at 1:54 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Sat, Jul 23, 2016 at 4:48 AM, Kumaran Ramasubramanian <
> kums@gmail.com
> > wrote:
>
> > Hi Mike,
> >
> > *Two different fields can be the same name*
> >
> > Is it so? You mean we can index one field as docvaluefield and also
> stored
> > field, Using same name?
> >
>
> This should be fine, yes.
>
>
> > And AFAIK, We cannot index one field as analyzed and not analyzed using
> the
> > same name. Am i right?
> >
>
> Hmm, I think you can do this?  The first one will be tokenized, and the
> second indexed as a single token.
>
> Or do you see otherwise?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
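The position loss described in point 1 above can be reproduced with a small, self-contained sketch, assuming Lucene 4.10.4 on the classpath (the class name, RAMDirectory usage and field values here are illustrative, not from the original mails). A StringField is indexed with IndexOptions.DOCS_ONLY and omitNorms, so — as this thread reports — once one document indexes "comp" that way, the field's positions are gone and PhraseQuery refuses to run:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MixedFieldTypesDemo {
    public static String run() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_4_10_4, new StandardAnalyzer(Version.LUCENE_4_10_4)));

        Document doc1 = new Document();
        // analyzed field, indexed with positions
        doc1.add(new TextField("comp", "world bank of swiss", Field.Store.YES));
        writer.addDocument(doc1);

        Document doc2 = new Document();
        // same field name, not analyzed: StringField is DOCS_ONLY + omitNorms,
        // which demotes the index options of "comp" as a whole
        doc2.add(new StringField("comp", "world bank of swiss", Field.Store.YES));
        writer.addDocument(doc2);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term("comp", "world"));
        pq.add(new Term("comp", "bank"));
        try {
            searcher.search(pq, 10);
            return "phrase query ran";
        } catch (IllegalStateException e) {
            // "field \"comp\" was indexed without position data; cannot run PhraseQuery"
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

The safe fix is the one discussed in the thread: keep the analyzed and not-analyzed variants under two different field names (e.g. `comp` and `comp_raw`).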


Re: Indexing and storing Long fields

2016-07-23 Thread Kumaran Ramasubramanian
Hi Mike,

*Two different fields can be the same name*

Is it so? You mean we can index one field as a doc values field and also as a
stored field, using the same name?

And AFAIK, we cannot index one field as analyzed and not analyzed using the
same name. Am I right?

Kumaran R

On Jul 21, 2016 11:50 PM, "Michael McCandless" 
wrote:
>
> Two different fields can be the same name.
>
> I think the problem is that you are indexing it as doc values, which is
not
> searchable.
>
> To make your numeric fields searchable, use e.g. LongPoint (as of Lucene
> 6.0) or LongField (before 6.0).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Jul 21, 2016 at 11:50 AM, Siraj Haider  wrote:
>
> > Hi,
> > We want to store and index a long value. Currently when we are adding
two
> > fields with same name, one is NumericDocValuesField and the other is
> > StoredField. But it seems like the search is not returning any document
> > when we query on that field, however we are able to retrieve the field
> > values of that field from a search. So, my question is, can we index two
> > different field types with same name or do we have to use different
names
> > for these fields?
> >
> > --
> > Regards
> > -Siraj Haider
> > (212) 306-0154
> >
> >
> > 
> >
> > This electronic mail message and any attachments may contain information
> > which is privileged, sensitive and/or otherwise exempt from disclosure
> > under applicable law. The information is intended only for the use of
the
> > individual or entity named as the addressee above. If you are not the
> > intended recipient, you are hereby notified that any disclosure,
copying,
> > distribution (electronic or otherwise) or forwarding of, or the taking
of
> > any action in reliance on, the contents of this transmission is strictly
> > prohibited. If you have received this electronic transmission in error,
> > please notify us by telephone, facsimile, or e-mail as noted above to
> > arrange for the return of any electronic mail or attachments. Thank You.
> >
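Mike's suggestion quoted above can be sketched as follows — a minimal example assuming Lucene 4.10.4 (which this thread is about); the field name and values are made up. The LongField writes the searchable trie terms, the NumericDocValuesField makes the same value sortable, and `Field.Store.YES` makes it retrievable:

```java
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SearchableLongDemo {
    public static int run() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_4_10_4, new KeywordAnalyzer()));
        Document doc = new Document();
        doc.add(new LongField("price", 42L, Field.Store.YES)); // searchable + stored
        doc.add(new NumericDocValuesField("price", 42L));      // sortable
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        // NumericRangeQuery matches the trie terms written by LongField;
        // a doc-values-only field would not be found by this query.
        Query q = NumericRangeQuery.newLongRange("price", 42L, 42L, true, true);
        return searcher.search(q, 10).totalHits;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```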


query norm understanding when we use term boosting during search time

2016-06-17 Thread Kumaran Ramasubramanian
Hi All,

i have read some discussions on the impact of queryNorm values on the Lucene
score when we specify a larger boost.

The impact is: as the boost increases, queryNorm decreases a lot, so a larger
boost does not result in a higher final Lucene score.

Consider an example: I need to order results based on the given boosting of
the terms (in the query), irrespective of Lucene's relevance calculation.
Setting queryNorm to 1 achieves this, though it results in very big numbers in
the score value. My use case is solved.

But I need to understand whether what I am doing is recommended or not. Is
there any possibility of inconsistency? Any pointers are much appreciated.


Thank you :-)




--
Kumaran R
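The cancellation effect described above can be seen with plain arithmetic. In the classic similarity, queryNorm is 1/sqrt(sum of squared term weights), so multiplying every boost by the same factor is normalized away entirely; forcing queryNorm to 1 lets the raw boost magnitudes flow into the score, which is also why the score values become very large. A small pure-Java sketch (the weight numbers are made up):

```java
public class QueryNormDemo {
    // Classic-similarity queryNorm: 1 / sqrt(sum of squared term weights),
    // where each term weight is roughly idf * termBoost.
    static double queryNorm(double[] termWeights) {
        double sumSquares = 0;
        for (double w : termWeights) sumSquares += w * w;
        return 1.0 / Math.sqrt(sumSquares);
    }

    // multiply every term weight by the same factor (e.g. the queryNorm)
    static double[] scale(double[] weights, double factor) {
        double[] out = new double[weights.length];
        for (int i = 0; i < weights.length; i++) out[i] = weights[i] * factor;
        return out;
    }

    public static void main(String[] args) {
        double[] small = {1.0, 2.0};   // boost*idf for two query terms
        double[] big   = {10.0, 20.0}; // same query, every boost times 10
        double[] a = scale(small, queryNorm(small));
        double[] b = scale(big, queryNorm(big));
        // With the real queryNorm, scaling all boosts changes nothing:
        System.out.println(a[0] + " " + a[1]); // ~0.447 ~0.894
        System.out.println(b[0] + " " + b[1]); // same values again
        // With queryNorm forced to 1, the raw boost magnitudes survive
        // into the score -- hence the very large score values:
        double[] c = scale(big, 1.0);
        System.out.println(c[0] + " " + c[1]);
    }
}
```

Note that only the *ratios* between term weights matter for ranking within one query, so dropping queryNorm does not change the result order of a single query; it only makes scores from different queries incomparable in magnitude.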


Re: Lucene term modifiers during search time - importance based on listed order

2016-06-09 Thread Kumaran Ramasubramanian
Dear All,

   Are there any pointers on how to take my first step regarding the
requirement mentioned below? Any help is much appreciated. Thanks.

when I do not get matched records for a query, I want to try term
> modifiers (as per the order) to retrieve related results (at least)
>
>
--
Kumaran R


On Tue, Jun 7, 2016 at 10:16 PM, Kumaran Ramasubramanian <kums@gmail.com
> wrote:

>
>
> Hi All,
>
>   i am trying to retrieve most related results using user given search
> queries. If there is no results for exact queries, i want to use term
> modifiers during search time. And so, i want to give more importance to
> term modifiers in the defined listed order given below
>
>
>1. exact query
>2. wild card
>3. typo terms
>4. fuzzy search queries
>5. synonymous terms
>6. Proximity Searches ( if more than one term specified in user given
>query )
>
>
> Please share me any articles if it is already solved / raised / documented
> to lucene community before. Thanks in advance :-)
>
>
> --
> *​*​
> Kumaran R
>
>
>


Lucene term modifiers during search time - importance based on listed order

2016-06-07 Thread Kumaran Ramasubramanian
Hi All,

  i am trying to retrieve the most related results for user-given search
queries. If there are no results for the exact query, i want to use term
modifiers during search time, and so i want to give more importance to the
term modifiers in the order listed below:


   1. exact query
   2. wild card
   3. typo terms
   4. fuzzy search queries
   5. synonymous terms
   6. Proximity Searches ( if more than one term specified in user given
   query )


Please share me any articles if it is already solved / raised / documented
to lucene community before. Thanks in advance :-)


--
Kumaran R
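One straightforward way to implement this ordered fallback is to try each strategy in priority order and stop at the first one that returns hits. The sketch below is pure-Java plumbing with stubbed strategies; in a real implementation each function would build and run the corresponding Lucene query (TermQuery, WildcardQuery, FuzzyQuery, a PhraseQuery with slop for proximity, and so on):

```java
import java.util.*;
import java.util.function.Function;

public class FallbackSearch {
    // Try each query-building strategy in priority order; return the hits
    // of the first strategy that matches anything.
    static List<String> searchWithFallback(String userQuery,
            List<Function<String, List<String>>> strategies) {
        for (Function<String, List<String>> strategy : strategies) {
            List<String> hits = strategy.apply(userQuery);
            if (!hits.isEmpty()) {
                return hits;
            }
        }
        return Collections.emptyList();
    }

    static List<String> demo() {
        // Stub strategies standing in for the ordered list above:
        // exact -> wildcard -> fuzzy; only the fuzzy pass finds anything here.
        List<Function<String, List<String>>> strategies = Arrays.asList(
                q -> Collections.<String>emptyList(),                 // 1. exact query
                q -> Collections.<String>emptyList(),                 // 2. wildcard
                q -> Arrays.asList("doc 7 (fuzzy match: " + q + "~)") // 3. fuzzy
        );
        return searchWithFallback("lucene", strategies);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

An alternative to sequential fallback is to submit all variants at once in a single BooleanQuery and give each clause a decreasing boost, which avoids multiple round trips at the cost of scoring complexity.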


Re: Difference in Retrieving values of a docvalue field using atomicreader over duplicate StoredField

2016-06-03 Thread Kumaran Ramasubramanian
Thanks a lot for the clarification mike :-)



--
Kumaran R




On Thu, Jun 2, 2016 at 2:38 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Well, each access may involve disk seeks (if the pages are not already hot
> in the OS's IO cache).
>
> So if you do some doc values + some stored fields you're looking at
> multiple seeks per docID you retrieve.
>
> Really you should only do this for a few hits, e.g. one page worth.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Jun 1, 2016 at 3:15 PM, Kumaran Ramasubramanian <
> kums@gmail.com>
> wrote:
>
> > Hi All,
> >
> >
> > In javadoc of every docvalue field
> > <
> >
> https://lucene.apache.org/core/4_10_4/core/org/apache/lucene/document/SortedNumericDocValuesField.html
> > >,
> > there is line like
> >
> > ​
> > > If you also need to store the value, you should add a separate
> > StoredField
> > > <
> >
> https://lucene.apache.org/core/4_10_4/core/org/apache/lucene/document/StoredField.html
> > >
> > >  instance.​
> >
> >
> >
> >
> https://lucene.apache.org/core/4_10_4/core/org/apache/lucene/document/SortedNumericDocValuesField.html
> >
> >
> >
> >
> > But, we are able to retrieve the value of SortedNumericDocValuesField
> > using AtomicReader also
> >
> > for (AtomicReaderContext context : indexReader.leaves())
> > > {
> > >  AtomicReader atomicReader = context.reader();
> > >  SortedNumericDocValues
> > >  sortedDocValues=DocValues.getSortedNumeric(atomicReader, "Numeric
> > > ​_docvalue​
> > > _price");
> > > }
> >
> >
> > ​So what is the difference actually? Is there any performance or time
> taken
> > issue in retrieving values of a docvalue field using atomicreader over
> > StoredField? ​
> >
> > Please clarify me what am i missing?​
> >
> > Thanks in advance
> >
> >
> > ​--​
> >
> > Kumaran R
> >
>


Difference in Retrieving values of a docvalue field using atomicreader over duplicate StoredField

2016-06-01 Thread Kumaran Ramasubramanian
Hi All,


In the javadoc of every doc values field, there is a line like:

> If you also need to store the value, you should add a separate StoredField
> instance.


https://lucene.apache.org/core/4_10_4/core/org/apache/lucene/document/SortedNumericDocValuesField.html




But, we are able to retrieve the value of SortedNumericDocValuesField
using AtomicReader also

for (AtomicReaderContext context : indexReader.leaves())
> {
>  AtomicReader atomicReader = context.reader();
>  SortedNumericDocValues
>  sortedDocValues=DocValues.getSortedNumeric(atomicReader, "Numeric
> ​_docvalue​
> _price");
> }


So what is the difference, actually? Is there any performance or
retrieval-time cost to reading the values of a doc values field via an
AtomicReader rather than from a StoredField?

Please clarify what I am missing.

Thanks in advance


​--​

Kumaran R
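For reference, both retrieval paths read the same value; the difference (as Mike explains in the reply earlier in this thread) is the access pattern. Stored fields are row-oriented — potentially a disk seek per retrieved document, fine for one page of hits — while doc values are column-oriented and cheap to scan across many documents, e.g. for sorting or faceting. A minimal sketch assuming Lucene 4.10.4 (class name and values are illustrative):

```java
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.*;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class DocValuesVsStored {
    public static long[] run() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_4_10_4, new KeywordAnalyzer()));
        Document doc = new Document();
        // same field name, added once as doc values and once as a stored field
        doc.add(new NumericDocValuesField("price", 499L));
        doc.add(new StoredField("price", 499L));
        writer.addDocument(doc);
        writer.close();

        DirectoryReader reader = DirectoryReader.open(dir);

        // Path 1: stored field -- row-oriented, random access per document.
        long storedValue =
                reader.document(0).getField("price").numericValue().longValue();

        // Path 2: doc values -- column-oriented, walked per segment.
        long dvValue = -1;
        for (AtomicReaderContext ctx : reader.leaves()) {
            NumericDocValues dv = ctx.reader().getNumericDocValues("price");
            dvValue = dv.get(0); // note: doc IDs here are segment-local
        }
        reader.close();
        return new long[] { storedValue, dvValue };
    }

    public static void main(String[] args) throws Exception {
        long[] v = run();
        System.out.println(v[0] + " " + v[1]);
    }
}
```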


Re: Need change one field type from IntField to String including indexOptions to store positions & Norms

2015-12-17 Thread Kumaran Ramasubramanian
Hi Jack Krupansky

Thanks for the reply. That will work fine. But i am trying to use the stored
values instead of hitting the database for the reindex. Isn't that a better
way to reindex? Any inputs?


--
​Kumaran R




On Thu, Dec 17, 2015 at 11:50 PM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> Delete the full index and create from scratch with the correct field type,
> re-adding all documents. Any remnants of the old field must be removed.
>
> -- Jack Krupansky
>
> On Thu, Dec 17, 2015 at 11:48 AM, Kumaran R <kums@gmail.com> wrote:
>
> > While Reindexing only am facing this problem.
> >
> > Just to confirm what do you mean by reindex. You mean "delete and add"
> > for all documents by taking data one by one right??
> >
> > Sent from Phone
> >
> > > On 17-Dec-2015, at 8:53 PM, Jack Krupansky <jack.krupan...@gmail.com>
> > wrote:
> > >
> > > The standard answer is that you need to reindex all of your data.
> > >
> > > -- Jack Krupansky
> > >
> > > On Thu, Dec 17, 2015 at 6:10 AM, Kumaran Ramasubramanian <
> > kums@gmail.com
> > >> wrote:
> > >
> > >> Dear All
> > >>
> > >> i am using lucene 4.10.4. Is there any more information i missed to
> > >> provide? Please let me know.
> > >>
> > >>
> > >> --
> > >> Kumaran R*​*
> > >>
> > >>
> > >>
> > >>
> > >> On Wed, Dec 16, 2015 at 10:35 PM, Kumaran Ramasubramanian <
> > >> kums@gmail.com> wrote:
> > >>
> > >>>
> > >>> Hi All,
> > >>>
> > >>> Previous Post -
> > >>> http://www.gossamer-threads.com/lists/lucene/java-user/289159
> > >>>
> > >>>  i have indexed one field "STATUS" as both IntField & String
> field
> > >> in
> > >>> same index. Now i want to take IntField containing documents and
> change
> > >> the
> > >>> value of field "STATUS" to string with norms & positions ( to achieve
> > >>> phrase query).
> > >>>
> > >>> But even if i delete that field and index again as String field,
> > ​*STATUS
> > >>> field property of "omitNorms & no positions" are not changing *(
> which
> > >>> are set when it was IntField)
> > >>>
> > >>> There are around 2 million documents in that index. indexed STATUS
> > field
> > >>> as
> > >>> IntField - in 1 million documents
> > >>> Analyzed String Field - in another 1 million doucments
> > >>>
> > >>> Basically, am trying to change STATUS field into only one type ( to
> > solve
> > >>> http://www.gossamer-threads.com/lists/lucene/java-user/289159)
> > >>>
> > >>>
> > >>> *In index when it was IntField*
> > >>>
> > >>>
> <stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY
> > >>>
> > >>>
> > >>>
> > >>> *​​when​ i try to change to string from​ IntField*
> > >>>
> > >>> ​stored,indexed,tokenized
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> *This is how STATUS field looks again in index*
> > >>>
> > >>>
> > ​<stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY > >>>> index cleared>​
> > >>> ​
> > >>>
> > >>>
> > >>>
> > >>> *code details i am using:*
> > >>>
> > >>> for IntField,
> > >>> IntField intField = new IntField("STATUS", Integer.parseInt("
> > >>> ​222​
> > >>> "), Field.Store.YES);
> > >>> doc
> > >>> ​ument​
> > >>> .add(intField);
> > >>>
> > >>> ​for string field,
> > >>> ​document.add(new Field("STATUS", "lucene index cleared",
> > >> Field.Store.YES,
> > >>> Field.Index.ANALYZED));
> > >>>
> > >>>
> > >>>
> > >>> ​Thanks in advance​ :-)
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> ​K​
> > >>> umaran
> > >>> ​R​
> > >>
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>


Re: Need change one field type from IntField to String including indexOptions to store positions & Norms

2015-12-17 Thread Kumaran Ramasubramanian
Yes, all fields are stored in the index, so that sounds workable.

Thanks a lot jack :-) Nice to meet you here :-)

--
​ Kumaran R​



On Fri, Dec 18, 2015 at 12:49 AM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> You could certainly read your stored values from your current index and
> then write new documents to a new index and then use the new index. That's
> if all of the indexed field values are stored.
>
> -- Jack Krupansky
>
> On Thu, Dec 17, 2015 at 2:10 PM, Kumaran Ramasubramanian <
> kums@gmail.com
> > wrote:
>
> > Hi Jack Krupansky
> >
> > Thanks for the reply. That will work fine. But i am trying to use the
> > stored values instead of hitting database for reindex. Isn't it better
> way
> > to reindex? Any inputs?
> >
> >
> > --
> > ​Kumaran R
> >
> >
> >
> >
> > On Thu, Dec 17, 2015 at 11:50 PM, Jack Krupansky <
> jack.krupan...@gmail.com
> > >
> > wrote:
> >
> > > Delete the full index and create from scratch with the correct field
> > type,
> > > re-adding all documents. Any remnants of the old field must be removed.
> > >
> > > -- Jack Krupansky
> > >
> > > On Thu, Dec 17, 2015 at 11:48 AM, Kumaran R <kums@gmail.com>
> wrote:
> > >
> > > > While Reindexing only am facing this problem.
> > > >
> > > > Just to confirm what do you mean by reindex. You mean "delete and
> add"
> > > > for all documents by taking data one by one right??
> > > >
> > > > Sent from Phone
> > > >
> > > > > On 17-Dec-2015, at 8:53 PM, Jack Krupansky <
> jack.krupan...@gmail.com
> > >
> > > > wrote:
> > > > >
> > > > > The standard answer is that you need to reindex all of your data.
> > > > >
> > > > > -- Jack Krupansky
> > > > >
> > > > > On Thu, Dec 17, 2015 at 6:10 AM, Kumaran Ramasubramanian <
> > > > kums@gmail.com
> > > > >> wrote:
> > > > >
> > > > >> Dear All
> > > > >>
> > > > >> i am using lucene 4.10.4. Is there any more information i missed
> to
> > > > >> provide? Please let me know.
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Kumaran R*​*
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Wed, Dec 16, 2015 at 10:35 PM, Kumaran Ramasubramanian <
> > > > >> kums@gmail.com> wrote:
> > > > >>
> > > > >>>
> > > > >>> Hi All,
> > > > >>>
> > > > >>> Previous Post -
> > > > >>> http://www.gossamer-threads.com/lists/lucene/java-user/289159
> > > > >>>
> > > > >>>  i have indexed one field "STATUS" as both IntField & String
> > > field
> > > > >> in
> > > > >>> same index. Now i want to take IntField containing documents and
> > > change
> > > > >> the
> > > > >>> value of field "STATUS" to string with norms & positions ( to
> > achieve
> > > > >>> phrase query).
> > > > >>>
> > > > >>> But even if i delete that field and index again as String field,
> > > > ​*STATUS
> > > > >>> field property of "omitNorms & no positions" are not changing *(
> > > which
> > > > >>> are set when it was IntField)
> > > > >>>
> > > > >>> There are around 2 million documents in that index. indexed
> STATUS
> > > > field
> > > > >>> as
> > > > >>> IntField - in 1 million documents
> > > > >>> Analyzed String Field - in another 1 million doucments
> > > > >>>
> > > > >>> Basically, am trying to change STATUS field into only one type (
> to
> > > > solve
> > > > >>> http://www.gossamer-threads.com/lists/lucene/java-user/289159)
> > > > >>>
> > > > >>>
> > > > >>> *In index when it was IntField*
> > > > >>>
> > > > >>>
> > > <stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> *​​when​ i try to change to string from​ IntField*
> > > > >>>
> > > > >>> ​stored,indexed,tokenized
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> *This is how STATUS field looks again in index*
> > > > >>>
> > > > >>>
> > > >
> > ​<stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY > > > >>>> index cleared>​
> > > > >>> ​
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> *code details i am using:*
> > > > >>>
> > > > >>> for IntField,
> > > > >>> IntField intField = new IntField("STATUS", Integer.parseInt("
> > > > >>> ​222​
> > > > >>> "), Field.Store.YES);
> > > > >>> doc
> > > > >>> ​ument​
> > > > >>> .add(intField);
> > > > >>>
> > > > >>> ​for string field,
> > > > >>> ​document.add(new Field("STATUS", "lucene index cleared",
> > > > >> Field.Store.YES,
> > > > >>> Field.Index.ANALYZED));
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> ​Thanks in advance​ :-)
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> --
> > > > >>> ​K​
> > > > >>> umaran
> > > > >>> ​R​
> > > > >>
> > > >
> > > > -
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > > >
> > >
> >
>
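The reindex-from-stored-values approach Jack confirms above can be sketched roughly as follows, assuming Lucene 4.10.4 (the field handling is illustrative, and this only works if every field you need was stored in the old index — analyzed-only content cannot be recovered from stored fields):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.Version;

public class ReindexFromStored {
    public static void reindex(Directory oldDir, Directory newDir) throws Exception {
        IndexReader oldReader = DirectoryReader.open(oldDir);
        IndexWriter writer = new IndexWriter(newDir, new IndexWriterConfig(
                Version.LUCENE_4_10_4, new StandardAnalyzer(Version.LUCENE_4_10_4)));
        Bits liveDocs = MultiFields.getLiveDocs(oldReader);
        for (int docId = 0; docId < oldReader.maxDoc(); docId++) {
            if (liveDocs != null && !liveDocs.get(docId)) continue; // skip deleted docs
            Document stored = oldReader.document(docId); // stored fields only
            Document fresh = new Document();
            // Re-add each field with the type it should have going forward;
            // here STATUS becomes an analyzed text field with positions.
            fresh.add(new TextField("STATUS", stored.get("STATUS"), Field.Store.YES));
            // ... copy the remaining stored fields the same way ...
            writer.addDocument(fresh);
        }
        writer.close();
        oldReader.close();
    }

    // tiny round trip: one document with a stored STATUS field
    static int demo() throws Exception {
        Directory oldDir = new RAMDirectory(), newDir = new RAMDirectory();
        IndexWriter w = new IndexWriter(oldDir, new IndexWriterConfig(
                Version.LUCENE_4_10_4, new StandardAnalyzer(Version.LUCENE_4_10_4)));
        Document d = new Document();
        d.add(new StoredField("STATUS", "lucene index cleared"));
        w.addDocument(d);
        w.close();
        reindex(oldDir, newDir);
        try (DirectoryReader r = DirectoryReader.open(newDir)) {
            return r.numDocs();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

Writing into a fresh directory (rather than deleting fields in place) is what guarantees the old FieldInfos — with their omitNorms / DOCS_ONLY settings — are gone.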


IntField is not indexing properly - field properties changed to "just stored"

2015-12-17 Thread Kumaran Ramasubramanian
Hi All,

i am using lucene 4.10.4 and the code below to index an IntField. All
properties of the IntField vanish except stored. Any help is much
appreciated.


IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_4, new
> ClassicAnalyzer(Version.LUCENE_30, new StringReader("")));
> iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
> IndexWriter writer = new IndexWriter(NIOFSDirectory.open(new
> File(indexDir)), iwc);
> Document doc = new Document();
> IntField intField = new IntField("STATUS", Integer.parseInt("178"),
> Field.Store.YES);
> doc.add(intField);

writer.addDocument(doc);

writer.forceMerge(1);

writer.close();




*document looks like this before adding to index*

stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY,numericType=INT,numericPrecisionStep=8 ​
>

*document looks like this after reading it back from the index*

stored




​Thanks in advance​ :-)




--
​ ​
Kumaran R


Re: Need change one field type from IntField to String including indexOptions to store positions & Norms

2015-12-17 Thread Kumaran Ramasubramanian
Dear All

 i am using lucene 4.10.4. Is there any more information I have missed
providing? Please let me know.


--
Kumaran R*​*




On Wed, Dec 16, 2015 at 10:35 PM, Kumaran Ramasubramanian <
kums@gmail.com> wrote:

>
> Hi All,
>
> Previous Post -
> http://www.gossamer-threads.com/lists/lucene/java-user/289159
>
>   i have indexed one field "STATUS" as both IntField & String field in
> same index. Now i want to take IntField containing documents and change the
> value of field "STATUS" to string with norms & positions ( to achieve
> phrase query).
>
> But even if i delete that field and index again as String field, ​*STATUS
> field property of "omitNorms & no positions" are not changing *( which
> are set when it was IntField)
>
> There are around 2 million documents in that index. indexed STATUS field
> as
> IntField - in 1 million documents
> Analyzed String Field - in another 1 million doucments
>
> Basically, am trying to change STATUS field into only one type ( to solve
> http://www.gossamer-threads.com/lists/lucene/java-user/289159)
>
>
> *In index when it was IntField*
>
> <stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY
>
>
>
> *​​when​ i try to change to string from​ IntField*
>
> ​stored,indexed,tokenized
>
>
>
>
> *This is how STATUS field looks again in index*
>
> ​<stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY> index cleared>​
>>
> ​
>
>
>
> *code details i am using:*
>
> for IntField,
> IntField intField = new IntField("STATUS", Integer.parseInt("
> ​222​
> "), Field.Store.YES);
> doc
> ​ument​
> .add(intField);
>
> ​for string field,
> ​document.add(new Field("STATUS", "lucene index cleared", Field.Store.YES,
> Field.Index.ANALYZED));
>
>
>
> ​Thanks in advance​ :-)
>
>
>
> --
> ​K​
> umaran
> ​R​
>
>
>
>


Need change one field type from IntField to String including indexOptions to store positions & Norms

2015-12-16 Thread Kumaran Ramasubramanian
Hi All,

Previous Post -
http://www.gossamer-threads.com/lists/lucene/java-user/289159

  i have indexed one field "STATUS" as both an IntField & a String field in
the same index. Now i want to take the documents containing the IntField and
change the value of the field "STATUS" to a string with norms & positions (to
make phrase queries possible).

But even if i delete that field and index it again as a String field, the
*STATUS field properties of "omitNorms & no positions" are not changing* (they
were set when it was an IntField).

There are around 2 million documents in that index: STATUS is indexed as an
IntField in 1 million documents, and as an analyzed String field in another 1
million documents.

Basically, am trying to change STATUS field into only one type ( to solve
http://www.gossamer-threads.com/lists/lucene/java-user/289159)


*In index when it was IntField*

<stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<STATUS:222>>

*when i try to change to string from IntField*

stored,indexed,tokenized<STATUS:lucene index cleared>

*This is how STATUS field looks again in index*

<stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<STATUS:lucene index cleared>>



*code details i am using:*

for IntField,
IntField intField = new IntField("STATUS", Integer.parseInt("
​222​
"), Field.Store.YES);
doc
​ument​
.add(intField);

​for string field,
​document.add(new Field("STATUS", "lucene index cleared", Field.Store.YES,
Field.Index.ANALYZED));



​Thanks in advance​ :-)



--
​K​
umaran
​R​


Re: Using ​phrase query in Termfilters

2015-11-25 Thread Kumaran Ramasubramanian
Hi Uwe

Thanks for the clarification.

​--​
​Kumaran R​


On Wed, Nov 25, 2015 at 2:32 PM, Uwe Schindler <u...@thetaphi.de> wrote:

> Hi,
>
> To use a real phrase (more than one term) as part of a filter, you have to
> convert the PhraseQuery to a Filter: new QueryWrapperFilter(phrasequery).
> The phrasequery can be built using QueryBuilder that analyzes the string
> and splits it into tokens to create a PhraseQuery: https://goo.gl/HDhn6R
>
> Please note: Filters are deprecated in Lucene 5. In Lucene 5 this is all
> easier! Just use the PhraseQuery as filtering clause in a BooleanQuery
> (using Occur.FILTER).
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -Original Message-
> > From: Kumaran Ramasubramanian [mailto:kums@gmail.com]
> > Sent: Wednesday, November 25, 2015 9:13 AM
> > To: java-user@lucene.apache.org
> > Subject: Using ​phrase query in Termfilters
> >
> > Hi All,
> >
> >Am using lucene 4.10.4. Is it right to add analyzed multi valued
> fields
> > & phrase query for the same field in boolean filter. i believe we could
> not
> > apply analyzers to values in filters. So am not getting results for those
> > filters' match.
> >
> > String phraseTerm = "hello world"
> > > Term term = new Term(key,
> > > ​
> > > phraseTerm);
> > > TermsFilter filter = new TermsFilter(term);
> > > FilterClause filterClause = new FilterClause(filter,
> > > BooleanClause.Occur.SHOULD);
> > > boolFilter.add(filterClause);
> >
> >
> > Please
> > ​guide me with
> >  related articles.
> > ​ Thanks.​
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Using ​phrase query in Termfilters

2015-11-25 Thread Kumaran Ramasubramanian
Hi All,

   Am using lucene 4.10.4. Is it right to add analyzed multi-valued fields
& a phrase query for the same field in a BooleanFilter? i believe we cannot
apply analyzers to the values in filters, so am not getting results for those
filters' matches.

String phraseTerm = "hello world"
> Term term = new Term(key,
> ​​
> phraseTerm);
> TermsFilter filter = new TermsFilter(term);
> FilterClause filterClause = new FilterClause(filter,
> BooleanClause.Occur.SHOULD);
> boolFilter.add(filterClause);


Please guide me with related articles. Thanks.

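Uwe's suggestion from the reply above can be put together roughly like this — a sketch assuming Lucene 4.10.4, with made-up field and phrase text. QueryBuilder runs the analyzer over the text and builds a PhraseQuery from the resulting tokens, and QueryWrapperFilter then turns that query into a Filter usable inside a BooleanFilter:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queries.BooleanFilter;
import org.apache.lucene.queries.FilterClause;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.util.QueryBuilder;
import org.apache.lucene.util.Version;

public class PhraseFilterDemo {
    public static BooleanFilter buildFilter(String field, String phraseText) {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_10_4);
        // QueryBuilder analyzes the text; with more than one resulting token
        // it produces a PhraseQuery, so the filter really matches the phrase.
        QueryBuilder builder = new QueryBuilder(analyzer);
        Query phrase = builder.createPhraseQuery(field, phraseText);
        BooleanFilter boolFilter = new BooleanFilter();
        boolFilter.add(new FilterClause(
                new QueryWrapperFilter(phrase), BooleanClause.Occur.SHOULD));
        return boolFilter;
    }

    public static void main(String[] args) {
        System.out.println(buildFilter("title", "hello world"));
    }
}
```

As Uwe notes, Filters are deprecated in Lucene 5; there the PhraseQuery goes directly into a BooleanQuery clause with `Occur.FILTER` instead.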

Re: Two different types of values in same field name in single index

2015-10-28 Thread Kumaran Ramasubramanian
Hi Erick

Thanks a lot for the clarification.


​Regards,​
​K​
umaran
​R​


On Wed, Oct 28, 2015 at 1:47 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> I would be pretty skeptical of this approach You're
> mixing numeric data with textual data and I expect
> the results to be unpredictable. You already said
> "it is working for most of the
> documents except one or two documents." I predict
> you'll find more and more of these as time passes.
>
> Expect many more anomalies. At best you need to
> index both forms as text rather than mixing numeric
> and text data. Since you're not sorting you should be
> OK with the caveat that searching for
> "02" won't match an indexed value of "2" unless you
> remove all leading zeros at both index and query
> time.
>
> Best,
> Erick
>
> On Tue, Oct 27, 2015 at 11:31 AM, Kumaran Ramasubramanian
> <kums@gmail.com> wrote:
> > Yes Will, you are right. But I don't use the "status" field for sorting;
> > I have other fields that are used specifically for sorting, and so I
> > don't face any sorting issues as of now.
> >
> > --
> > Kumaran R
> >
> >
> > On Tue, Oct 27, 2015 at 7:20 PM, will <wmartin...@gmail.com> wrote:
> >
> >> Kumaran -
> >>
> >> Aren't you creating an unworkable scenario for sorting?
> >>
> >> -will
> >>
> >> On 10/27/15 5:49 AM, Kumaran Ramasubramanian wrote:
> >>
> >>> Hi All,
> >>>
> >>>   I have indexed module-wise data in the same index. In this case, we
> >>> index two types of field under the same name in two different
> >>> documents, like this.
> >>>
> >>> *document1:*
> >>>
> >>> module:1
> >>>> status:4 ( as LongField )
> >>>>
> >>> *code:*
> >>>
> >>> long longValue = Long.parseLong(value);
> >>>> LongField field = new LongField(fieldName, longValue, Field.Store.YES);
> >>>> document.add(field);
> >>>>
> >>>>
> >>> *document2:*
> >>>
> >>> module:2
> >>>> status:open ( as Field )
> >>>
> >>> *code:*
> >>>
> >>>> Field field = new Field(fieldName, value, (Field.Store) stored, (Field.Index) indexType);
> >>> document.add(field);
> >>>
> >>>
> >>>
> >>>>
> >>> There are around 50 lakh (5 million) documents in my index which have
> >>> this kind of mixed field type.
> >>>
> >>> When I query status:4 or status:open, it works for most of the
> >>> documents, except one or two.
> >>>
> >>> I am not able to reproduce this in other indexes, so I want to confirm
> >>> whether it is supported to have both Field and LongField under the
> >>> same field name in the same index. Please also suggest any articles
> >>> discussing this kind of problem.
> >>>
> >>> Thanks :-)
> >>>
> >>> Related links:
> >>> http://www.gossamer-threads.com/lists/lucene/java-user/109530
> >>>
> >>> --
> >>> Kumaran R
> >>>
> >>>
> >>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
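Erick's leading-zero caveat above ("02" won't match an indexed "2") can be handled with one normalization function applied identically at index time and query time. A small self-contained sketch (the class name and the exact rules are illustrative assumptions, not a Lucene API):

```java
// Sketch of the advice above: index numeric-looking values as text, but
// normalize them the same way at index time and query time so "02" and "2"
// produce the same indexed term.
public class StatusNormalizer {
    static String normalize(String raw) {
        String v = raw.trim();
        // Strip leading zeros from purely numeric values; leave text untouched
        if (v.matches("\\d+")) {
            v = v.replaceFirst("^0+(?=\\d)", "");
        }
        return v.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(normalize("02"));    // "2"
        System.out.println(normalize("2"));     // "2"
        System.out.println(normalize("open"));  // "open"
    }
}
```

Because both sides of the match go through the same function, a query for status "02" and an indexed value "2" meet at the identical normalized term, without mixing numeric and text field types.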


Two different types of values in same field name in single index

2015-10-27 Thread Kumaran Ramasubramanian
Hi All,

 I have indexed module-wise data in the same index. In this case, we index
two types of field under the same name in two different documents, like this.

*document1:*

> module:1
> status:4 ( as LongField )
>
*code:*

> long longValue = Long.parseLong(value);
> LongField field = new LongField(fieldName, longValue, Field.Store.YES);
> document.add(field);
>


*document2:*

> module:2
> status:open ( as Field )

*code:*

> Field field = new Field(fieldName, value, (Field.Store) stored, (Field.Index) indexType);
> document.add(field);

There are around 50 lakh (5 million) documents in my index which have this
kind of mixed field type.

When I query status:4 or status:open, it works for most of the documents,
except one or two.

I am not able to reproduce this in other indexes, so I want to confirm
whether it is supported to have both Field and LongField under the same
field name in the same index. Please also suggest any articles discussing
this kind of problem.

Thanks :-)

Related links:
http://www.gossamer-threads.com/lists/lucene/java-user/109530

--
Kumaran R


Re: Two different types of values in same field name in single index

2015-10-27 Thread Kumaran Ramasubramanian
Yes Will, you are right. But I don't use the "status" field for sorting; I
have other fields that are used specifically for sorting, and so I don't
face any sorting issues as of now.

--
Kumaran R


On Tue, Oct 27, 2015 at 7:20 PM, will <wmartin...@gmail.com> wrote:

> Kumaran -
>
> Aren't you creating an unworkable scenario for sorting?
>
> -will
>
> On 10/27/15 5:49 AM, Kumaran Ramasubramanian wrote:
>
>> Hi All,
>>
> >>   I have indexed module-wise data in the same index. In this case, we
> >> index two types of field under the same name in two different
> >> documents, like this.
>>
>> *document1:*
>>
>> module:1
>>> status:4 ( as LongField )
>>>
>>> *code:*
>>
> >> long longValue = Long.parseLong(value);
> >>> LongField field = new LongField(fieldName, longValue, Field.Store.YES);
> >>> document.add(field);
>>>
>>>
>> *document2:*
>>
>> module:2
>>> status:open ( as Field )
>>>
>> *code:*
>>
> >>> Field field = new Field(fieldName, value, (Field.Store) stored, (Field.Index) indexType);
> >> document.add(field);
> >>
> >> There are around 50 lakh (5 million) documents in my index which have
> >> this kind of mixed field type.
> >>
> >> When I query status:4 or status:open, it works for most of the
> >> documents, except one or two.
>>
> >> I am not able to reproduce this in other indexes, so I want to confirm
> >> whether it is supported to have both Field and LongField under the same
> >> field name in the same index. Please also suggest any articles
> >> discussing this kind of problem.
>>
>> Thanks :-)
>>
>> Related links:
>> http://www.gossamer-threads.com/lists/lucene/java-user/109530
>>
>> --
>> Kumaran R
>>
>>
>


Re: How does Lucene decides which fields have termvectors stored and which not?

2014-08-19 Thread Kumaran Ramasubramanian
Hi Sachin Kulkarni,

If possible, please share your code.


-
Kumaran R





On Tue, Aug 19, 2014 at 9:07 AM, Sachin Kulkarni kulk...@hawk.iit.edu
wrote:

 Hi,

 I am using Lucene 4.6.0.

 I have been storing 5 fields for my documents in the index, namely body,
 title, docname, docdate and docid.

 But when I get the fields using IndexReader.getTermVectors(indexedDocID) I
 only get
 the docname and body fields and can retrieve the term vectors for those
 fields, but not others.

 I check to see if all five fields are stored using
 IndexableFieldType.stored(), and all return true. I also check that all
 the fields are indexed, and they are, but still when I call
 getTermVectors I only receive two fields back.

 Is there any other config setting that I am missing while indexing that is
 causing this behavior?

 Thanks to Kumaran and Ian for their answers to my previous questions but I
 have not been able to figure out the above one yet.

 Thank you very much.

 Regards,
 Sachin



Re: How does Lucene decides which fields have termvectors stored and which not?

2014-08-19 Thread Kumaran Ramasubramanian
Hi Sachin

I want to look into your indexing code; please share it.

-
Kumaran R





On Tue, Aug 19, 2014 at 7:18 PM, Sachin Kulkarni kulk...@hawk.iit.edu
wrote:

 Hi,

 Sorry for all the code; it got sent out accidentally.

 The following code is part of the Benchmark utility in Lucene, specifically
 SubmissionReport.java


 // Here reader is the IndexReader.


  Iterator itr = docMap.entrySet().iterator();
  int totalNumDocuments = reader.numDocs();
  ScoreDoc sd[] = td.scoreDocs;
  String sep = "\t";
  DocNameExtractor docext = new DocNameExtractor(docNameField);
  for (int i = 0; i < sd.length; i++)
  {
    String docName = docext.docName(searcher, sd[i].doc);
    // The map of documents will help us get the docid
    int indexedDocID = docMap.get(docName);
    Fields fields = reader.getTermVectors(indexedDocID);
    Iterator<String> strItr = fields.iterator();

    // The following while loop prints the field names, which only
    // show 2 fields out of the 5 that I am looking for.
    while (strItr.hasNext())
    {
      String fieldName = strItr.next();
      System.out.println("next field " + fieldName);
    }
    Document DocList = reader.document(indexedDocID);
    List<IndexableField> field_list = DocList.getFields();

    // The following for loop prints the five fields and their
    // related information.
    for (int j = 0; j < field_list.size(); j++)
    {
      System.out.println("list field is : " + field_list.get(j).name());
      IndexableFieldType IFT = field_list.get(j).fieldType();
      System.out.println("Field storeTermVectorOffsets : " + IFT.storeTermVectorOffsets());
      System.out.println("Field stored : " + IFT.stored());
    }
  }


 // THE OUTPUT for this section of code is:
 fields size : 2
 next field body
 next field docname

 list field is : docid
  Field storeTermVectorOffsets : false
 list field is : docname
  Field storeTermVectorOffsets : false
 list field is : docdate
  Field storeTermVectorOffsets : false
 list field is : doctitle
  Field storeTermVectorOffsets : false
 list field is : body
  Field storeTermVectorOffsets : false

 ***/

 Hope this code comes out legible in the email.

 Thank you.

 Regards,
 Sachin Kulkarni


 On Tue, Aug 19, 2014 at 8:39 AM, Sachin Kulkarni kulk...@hawk.iit.edu
 wrote:

  Hi Kumaran,
 
 
 
  The following code is part of the Benchmark utility in Lucene,
  specifically SubmissionReport.java
 
 
  Iterator itr = docMap.entrySet().iterator();
   int totalNumDocuments = reader.numDocs();
  ScoreDoc sd[] = td.scoreDocs;
   String sep =  \t ;
  DocNameExtractor docext = new DocNameExtractor(docNameField);
   for (int i = 0; i < sd.length; i++)
  {
  System.out.println(i =  + i);
String docName = docext.docName(searcher,sd[i].doc);
System.out.println(docName :  + docName + \t map size  +
  docMap.size());
   // * The Map will help us get the docid and
  int indexedDocID = docMap.get(docName);
   System.out.println(indexed doc id :  + indexedDocID + \t docname : 
  + docName);
   //  GET THE tf-idf data now  //
  Fields fields = reader.getTermVectors(indexedDocID);
   System.out.println(fields size :  + fields.size());
   //  Print log output for testing  //
   Iterator<String> strItr = fields.iterator();
  while(strItr.hasNext())
  {
   String fieldName = strItr.next();
  System.out.println(next field  + fieldName);
  }
   Document DocList= reader.document(indexedDocID);
   List<IndexableField> field_list = DocList.getFields();
   for (int j = 0; j < field_list.size(); j++)
  {
  System.out.println ( list field is :  + field_list.get(j).name() );
   IndexableFieldType IFT = field_list.get(j).fieldType();
  System.out.println( Field storeTermVectorOffsets :  +
  IFT.storeTermVectorOffsets());
   //System.out.println( Field stored : + IFT.stored());
  //for (FieldInfo.IndexOptions c : IFT.indexOptions().values())
   // System.out.println(c);
  }
  // *88 //
 
 
  On Tue, Aug 19, 2014 at 2:04 AM, Kumaran Ramasubramanian 
  kums@gmail.com wrote:
 
  Hi Sachin Kulkarni,
 
  If possible, Please share your code.
 
 
  -
  Kumaran R
 
 
 
 
 
  On Tue, Aug 19, 2014 at 9:07 AM, Sachin Kulkarni kulk...@hawk.iit.edu
  wrote:
 
   Hi,
  
   I am using Lucene 4.6.0.
  
   I have been storing 5 fields for my documents in the index, namely
 body,
   title, docname, docdate and docid.
  
   But when I get the fields using
  IndexReader.getTermVectors(indexedDocID) I
   only get
   the docname and body fields and can retrieve the term vectors for
 those
   fields, but not others.
  
   I check to see if all the five fields are stored using
   IndexedFieldType.stored()
   and all return true. I also check to see that all the fields are
 indexed
   and they are, but
   still when I try to getTermVectors I only receive two fields back.
  
   Is there any other config setting that I am missing while indexing
 that
  is
   causing this behavior?
  
   Thanks to Kumaran and Ian
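A likely cause worth checking in the indexing code: in Lucene 4.x term vectors are opt-in per field via the FieldType, independently of stored() and indexed() being true, so only the fields indexed with vector-enabled types come back from getTermVectors. A configuration sketch (untested, against the Lucene 4.6 API; variable names are illustrative):

```java
// Term vectors must be enabled explicitly on the FieldType at index time.
FieldType withVectors = new FieldType(TextField.TYPE_STORED);
withVectors.setStoreTermVectors(true);
withVectors.setStoreTermVectorPositions(true);
withVectors.setStoreTermVectorOffsets(true);
withVectors.freeze();

Document doc = new Document();
doc.add(new Field("body",  bodyText,  withVectors)); // vectors retrievable
doc.add(new Field("title", titleText, withVectors)); // now also retrievable
// A plain StringField/TextField stores and indexes the value, but
// getTermVectors() will not return it:
doc.add(new StringField("docid", docId, Field.Store.YES));
```

This matches the symptom in the thread: stored() and indexed() report true for all five fields, while only the two fields indexed with vector-enabled types (here, presumably body and docname) appear in the term-vector reader.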

Re: Is housekeeping of Lucene indexes block index update but allow search ?

2014-08-05 Thread Kumaran Ramasubramanian
Hi Gaurav

  Thanks for the clarification. If possible, please share your NRT manager
API code example; I believe it will help me understand it a little better.


-
Kumaran R

On Tue, Aug 5, 2014 at 12:39 PM, Gaurav gupta gupta.gaurav0...@gmail.com
wrote:

 Thanks Kumaran and Erick for resolving my queries.

 Kumaran,
 You are right that only one IndexWriter can write, since it acquires the
 lock, but using the NRT manager APIs (TrackingIndexWriter,
 http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/TrackingIndexWriter.html
 ), multiple concurrent updates/deletes/appends are possible.

 Thanks again !







 On Mon, Aug 4, 2014 at 10:29 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Right.
  1 Occasionally the merge will require 2x the disk space (3x with the
  compound file system). The merging is, indeed, done in the background; it
  is NOT a blocking operation.
 
  2 n/a. It shouldn't block at all.
 
  Here's a cool video by Mike McCandless on the merging process, plus some
  explanations:
 
 
 
 http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
 
  Best,
  Erick
 
 
 
 
  On Mon, Aug 4, 2014 at 8:45 AM, Kumaran R kums@gmail.com wrote:
 
   Hi Gaurav
  
    1. When you open an index for writing, there will be a lock preventing
    further writes until you close that index, but not for search. During a
    merge, the index needs 3X (or is it 2X?) more storage space; I believe
    that is the reason search is not blocked. (Other experts can clarify
    this further.)

    2. Merging is taken care of by Lucene's default values (merge factor 2).
    If you need more control over merging, read up on merging by size, by
    number of segments, or the various merge policies.
  
  
   Hope this will help you a little bit.
  
   --
   Kumaran R
   Sent from Phone
  
On 04-Aug-2014, at 8:04 pm, Gaurav gupta gupta.gaurav0...@gmail.com
 
   wrote:
   
Hi,
   
     We are planning to use Lucene 4.8.1 over Oracle (1 to 2 TB of data)
     and are seeking information on how Lucene conducts housekeeping or
     maintenance of indexes over a period of time. *Is it a blocking
     operation for write and search, or will it not block anything while
     merging is going on?*
   
     I found: *Since Lucene adds the updated document to the index and
     marks all previous versions as deleted, to get rid of deleted
     documents Lucene needs to do some housekeeping over a period of time.
     Under the hood, from time to time segments are merged into (usually)
     bigger segments using a configurable MergePolicy (
     http://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/index/MergePolicy
     ), by default TieredMergePolicy.*
   
     1- Is it a blocking operation for both write and search, or will it
     not block anything while merging is going on?

     2- What is the best practice to avoid any blocking on production
     servers? I am not sure how Solr or Elasticsearch handles it.
     Should we control merging by calling *forceMerge(int) at low-traffic
     times* to avoid any unpredictable blocking operation? Is that
     recommended, or does Lucene do intelligent merging without blocking
     anything (updates and searches), or are there ways to reduce the
     blocking time to a very small duration (1-2 minutes) using some API
     or daemon thread, etc.?
   
Looking for your professional guidance on it.
   
Regards
Gaurav
  
   -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  
 

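For the NRT manager example requested in this thread, here is a hedged sketch against the Lucene 4.x API (untested here; `dir`, `analyzer`, and `doc` are assumed to exist in the surrounding code):

```java
// One IndexWriter holds the write lock; TrackingIndexWriter wraps it so that
// many threads can issue updates and get back a generation token.
IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_4_8, analyzer));
TrackingIndexWriter trackingWriter = new TrackingIndexWriter(writer);
SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());

// Reopen thread keeps searchers at most ~1s stale without blocking writers.
ControlledRealTimeReopenThread<IndexSearcher> reopener =
        new ControlledRealTimeReopenThread<>(trackingWriter, manager, 1.0, 0.025);
reopener.setDaemon(true);
reopener.start();

// Updates return a generation; wait on it only when a caller must
// immediately see its own write.
long gen = trackingWriter.updateDocument(new Term("id", "42"), doc);
reopener.waitForGeneration(gen);

IndexSearcher searcher = manager.acquire();
try {
    // search here: background merges block neither this search nor writers
} finally {
    manager.release(searcher);
}
```

With this setup, merges run in background merge threads and block neither searches nor the reopen loop, which matches Erick's answer above; an off-peak forceMerge is optional rather than required.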


Is it right to index multi-valued fields with hash value?

2014-07-30 Thread Kumaran Ramasubramanian
Hi All

   During search, I find it difficult to handle every multi-valued field
with a different analyzer. So I believe indexing multi-valued fields with
hash values may solve the problem of searching with different analyzers.
Have any of you tried hash values in a Lucene index? If so, please share
the limitations and drawbacks of using hash values in a Lucene index.


-
Kumaran R
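Two notes on this idea. First, the per-field analyzer problem is usually solved with Lucene's PerFieldAnalyzerWrapper rather than hashing. Second, if hashing is still attractive, the main drawback is that hashed terms support only exact equality: no prefix, range, phrase, or fuzzy queries, and the indexed terms are unreadable when debugging. A self-contained sketch of the hashing side (the class name and normalization rules are illustrative assumptions, not a Lucene API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the hash-value idea: store a stable digest of each value in an
// untokenized field so exact-match lookups bypass per-field analysis.
public class ValueHasher {
    static String hash(String value) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(
                    value.trim().toLowerCase().getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // SHA-1 is always available
        }
    }

    public static void main(String[] args) {
        // Identical (normalized) values hash identically, regardless of which
        // analyzer would otherwise have processed the field:
        System.out.println(hash("Hello World").equals(hash("hello world"))); // true
    }
}
```

Both the indexed value and the query value must pass through the same hash function, which is exactly the same "normalize identically on both sides" constraint as any other exact-match scheme.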