Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-29 Thread Chitra R
Thank you so much, Mike... I gained a lot of understanding of doc
values faceting, and it cleared up all my doubts. Thanks..!!


*Another use case:*

After getting matching documents for the given query, is there any way to
calculate min and max values on a NumericDocValuesField (say a date field)?


I would like to implement numeric range faceting by splitting the numeric
values (taken from the resulting documents) into ranges.
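The range-splitting step described above can be sketched in plain Java. This is a toy helper, not a Lucene API; it assumes min and max have already been gathered from the matching documents' NumericDocValues:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: once min/max have been gathered from the matching
// documents' doc values, split [min, max] into n roughly equal ranges
// that can then be fed to range faceting.
public class RangeSplit {
    // Returns up to n inclusive [start, end] pairs covering min..max.
    static List<long[]> split(long min, long max, int n) {
        n = (int) Math.min(n, max - min + 1); // no more buckets than values
        List<long[]> ranges = new ArrayList<>();
        long width = Math.max(1, (max - min + 1) / n);
        long start = min;
        for (int i = 0; i < n; i++) {
            long end = (i == n - 1) ? max : start + width - 1;
            ranges.add(new long[] { start, end });
            start = end + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(0, 99, 4)) {
            System.out.println(r[0] + ".." + r[1]); // 0..24, 25..49, 50..74, 75..99
        }
    }
}
```

With Lucene, each resulting pair would become one range (e.g. a LongRange) when counting.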


Chitra


On Wed, Nov 30, 2016 at 3:51 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Doc values fields are never loaded into memory; at most some small
> index structures are.
>
> When you use those fields, the bytes (for just the one doc values
> field you are using) are pulled from disk, and the OS will cache them
> in memory if available.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Nov 28, 2016 at 6:01 AM, Chitra R  wrote:
> > Hi,
> >  When opening a SortedSetDocValuesReaderState at search time,
> > is the whole doc values file (.dvd & .dvm) information loaded into
> > memory, or is only the specified field's information (say the $facets
> > field) loaded into memory?
> >
> >
> >
> >
> > Any help is much appreciated.
> >
> >
> > Regards,
> > Chitra
> >
> > On Tue, Nov 22, 2016 at 5:47 PM, Chitra R  wrote:
> >>
> >>
> >> Kindly post your suggestions.
> >>
> >> Regards,
> >> Chitra
> >>
> >>
> >> On Sat, Nov 19, 2016 at 1:38 PM, Chitra R 
> wrote:
> >>>
> >>> Hey, I got it clearly. Thank you so much. Could you please help us to
> >>> implement it in our use case?
> >>>
> >>>
> >>> In our case, we have a dynamic index and it is variable depth too, so
> >>> flat facets are enough; no need for hierarchical facets.
> >>>
> >>> What I think is:
> >>>
> >>> Index my facet field as a normal doc values field, so that no special
> >>> operation (like the taxonomy and sorted-set doc values facet fields) is
> >>> done at index time, and only the doc values field stores its ordinals in
> >>> the respective field.
> >>> At search time, I will pass the query (user search query) and filter
> >>> (path traversed list) and collect the matching documents in a
> >>> FacetsCollector.
> >>> To compute the facet count for a specific field, I will gather those
> >>> resulting docs, then move through each segment collecting the matching
> >>> ordinals using AtomicReader.
> >>>
> >>>
> >>> And I know that when I use this approach, I can't calculate facet
> >>> counts for more than one field (facet) in a search.
> >>>
> >>> Instead of loading all the dimensions in DocValuesReaderState (which
> >>> will take more time and memory) at search time, loading specific fields
> >>> will take less time and memory, I hope. Kindly help to solve this.
> >>>
> >>>
> >>> This should keep index and search cost minimal, I think. And I hope
> >>> this won't add overhead at index time; at search time this should also
> >>> be better.
> >>>
> >>>
> >>> Kindly post your suggestions.
> >>>
> >>>
> >>> Regards,
> >>> Chitra
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, Nov 18, 2016 at 7:15 PM, Michael McCandless
> >>>  wrote:
> 
>  I think you've summed up exactly the differences!
> 
>  And, yes, it would be possible to emulate hierarchical facets on top
>  of flat facets, if the hierarchy is fixed depth like year/month/day.
> 
>  But if it's variable depth, it's trickier (but I think still
>  possible).  See e.g. the Committed Paths drill-down on the left, on
>  our dog-food server
>  http://jirasearch.mikemccandless.com/search.py?index=jira
> 
>  Mike McCandless
> 
>  http://blog.mikemccandless.com
> 
> 
>  On Fri, Nov 18, 2016 at 1:43 AM, Chitra R 
> wrote:
>  > case 1:
>  > In taxonomy, for each indexed document, the facet labels are
>  > examined and their ordinals and mappings computed, which are stored
>  > in the sidecar index at index time.
>  >
>  > case 2:
>  > In doc values, these ordinals are computed at search time, so
>  > there will be a time and memory trade-off between the two cases, I
>  > think.
>  >
>  >
>  > In taxonomy, building hierarchical facets at index time makes the
>  > faceting cost at search time lower than for flat facets in doc
>  > values.
>  >
>  > Apart from memory, time and NRT latency, is there any other contrast
>  > between hierarchical and flat facets at search time?
>  >
>  >
>  > Kindly post your suggestions...
>  >
>  >
>  > Regards,
>  > Chitra
>  >
>  > On Thu, Nov 17, 2016 at 6:40 PM, Chitra R 
>  > wrote:
>  >>
>  >> Okay. I agree with you, Taxonomy maintains and supports
> hierarchical
>  >> facets during indexing. 

Re: how do lucene read large index files?

2016-11-29 Thread Kumaran Ramasubramanian
Thanks Mike. We are planning to move to MMapDirectory for both indexing and
searching. Regarding the ulimit change and reads during merging, I just
wanted to know the impact of MMapDirectory during indexing.

-
Kumaran R


On Nov 30, 2016 4:18 AM, "Michael McCandless" 
wrote:
>
> It's OK to use NIOFSDirectory for indexing only in that nothing will
break.
>
> But, MMapDirectory already uses normal IO for writing
> (java.io.FileOutputStream), and indexing does sometimes need to
> read (for merging segments) though that's largely sequential reading
> so perhaps NIOFSDirectory won't be much slower.
>
> Why not use MMapDirectory for both indexing and searching?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Nov 28, 2016 at 7:20 AM, Kumaran Ramasubramanian
>  wrote:
> > Thanks a lot Uwe!!! Do we get any benefit from using MMapDirectory over
> > NIOFSDir during indexing? During merging? Is it OK to change to
> > MMapDirectory during search alone?
> >
> > --
> > Kumaran R
> >
> >
> > On Nov 24, 2016 11:27 PM, "Erick Erickson" 
wrote:
> >>
> >> Thanks Uwe!
> >>
> >>
> >>
> >>
> >> On Thu, Nov 24, 2016 at 9:41 AM, Uwe Schindler  wrote:
> >> > Hi Kumaran, hi Erick,
> >> >
> >> >> Not really, as I don't know that code well, Uwe and company
> >> >> are the masters of that realm ;)
> >> >>
> >> >> Sorry I can't be more help there
> >> >
> >> > I can help!
> >> >
> >> >> On Thu, Nov 24, 2016 at 7:29 AM, Kumaran Ramasubramanian
> >> >>  wrote:
> >> >> > Erick, Thanks a lot for sharing an excellent post...
> >> >> >
> >> >> > Btw, I am using NIOFSDirectory; could you please elaborate on the
> >> >> > lines quoted below? Or any further pointers?
> >> >> > NIOFSDirectory or SimpleFSDirectory, we have to pay another price:
> > Our
> >> >> code
> >> >> >> has to do a lot of syscalls to the O/S kernel to copy blocks of
data
> >> >> >> between the disk or filesystem cache and our buffers residing in
> > Java
> >> >> heap.
> >> >> >> This needs to be done on every search request, over and over
again.
> >> >
> >> > the blog post says it simply: you should use MMapDirectory and
> >> > avoid SimpleFSDir or NIOFSDir! The blog post explains why: SimpleFSDir
> >> > and NIOFSDir extend BufferedIndexInput. This class uses an on-heap
> >> > buffer (16 KiB) for reading index files. For some parts of the index
> >> > (like doc values), this is not ideal. E.g. if you sort against a doc
> >> > values field and it needs to access a sort value (e.g. a short,
> >> > integer or byte, which is very small), it will ask the buffer for
> >> > just those few bytes. In most cases when sorting, the buffer will not
> >> > contain those bytes, as sorting requires random access over a huge
> >> > file (so it is unlikely that the buffer will help). Then
> >> > BufferedIndexInput will seek the NIO/Simple file pointer and read
> >> > 16 KiB into the buffer. This requires a syscall to the OS kernel,
> >> > which is expensive. During sorting of search results this can happen
> >> > millions or billions of times. In addition it will copy chunks of
> >> > memory between the Java heap and the operating system cache over and
> >> > over.
> >> >
> >> > With MMapDirectory no buffering is done; the Lucene code directly
> >> > accesses the file system cache, and this is much more optimized.
> >> >
> >> > So for fast index access:
> >> > - avoid SimpleFSDir and NIOFSDir (those are only there for legacy
> >> >   32-bit operating systems and JVMs)
> >> > - configure your operating system kernel as described in the blog
> >> >   post and use MMapDirectory
> >> > - tell the sysadmin to get familiar with the output of the Linux
> >> >   commands free/top/... (or their Windows equivalents)
> >> >
> >> > Uwe
> >> >
> >> >> > --
> >> >> > Kumaran R
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Wed, Nov 23, 2016 at 9:17 PM, Erick Erickson
> >> >> 
> >> >> > wrote:
> >> >> >
> >> >> >> see Uwe's blog:
> >> >> >> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-
> >> >> 64bit.html
> >> >> >>
> >> >> >> Short form: files are read into the OS's memory as needed. the
whole
> >> >> >> file isn't read at once.
> >> >> >>
> >> >> >> Best,
> >> >> >> Erick
> >> >> >>
> >> >> >> On Wed, Nov 23, 2016 at 12:04 AM, Kumaran Ramasubramanian
> >> >> >>  wrote:
> >> >> >> > Hi All,
> >> >> >> >
> >> >> >> > How does Lucene read large index files?
> >> >> >> > For example, if one file (e.g. a .dat file) is 4 GB,
> >> >> >> > does Lucene read only part of the file into RAM? Or
> >> >> >> > is there a different approach for different Lucene file formats?
> >> >> >> >
> >> >> >> >
> >> >> >> > Related Link:
> >> >> >> > How do applications (and OS) handle very big files?
> >> >> >> > http://superuser.com/a/361201
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> > Kumaran R
> >> >> >>
> >> >> >>
> > -
> >> >> >> To unsubscribe, e-mail: 

Re: Query expansion

2016-11-29 Thread Michael McCandless
This is likely tricky to do correctly.

E.g., MultiFieldQueryParser.getFieldQuery is invoked on whole chunks
of text.  If you search for:

  apple orange

I suspect it won't do what you want, since the whole string "apple
orange" is passed to getFieldQuery.

How do you want to handle e.g. a phrase query (user types "apple
orange", with the double quotes)?  Or a prefix query (app*)?

Maybe you could instead override newTermQuery?  In the example above
it would be invoked twice, once for apple and once for orange.

Finally, all this being said, making everything fuzzy likely means a big
performance hit and often poor results (massive recall, poor
precision) for the user!

Mike McCandless

http://blog.mikemccandless.com


On Mon, Nov 28, 2016 at 6:24 AM, hariram ravichandran
 wrote:
> I need to perform a *fuzzy search* for the whole search term. I
> extended MultiFieldQueryParser and overrode getFieldQuery():
>
>
> protected Query getFieldQuery(String field, String fieldText, boolean
> quoted) throws ParseException {
>     return super.getFuzzyQuery(field, fieldText, 3.0f); // construct a fuzzy query
> }
>
> For example, if I give the search term "(apple AND orange) OR (mango)", the
> query should be expanded to "(apple~ AND orange~) OR (mango~)".
>
> I need to search in multiple fields, and I need to implement this
> without affecting any other Lucene features. Is there any other simple way?
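As a rough illustration only (overriding newTermQuery, as Mike suggests, is the more robust approach), one could pre-rewrite the query string so every bare term becomes fuzzy. This toy `FuzzyRewrite` helper is hypothetical and deliberately ignores phrases, wildcards, and field prefixes:

```java
// Toy pre-parse rewrite: append '~' to every bare term while leaving the
// boolean operators AND/OR/NOT and parentheses untouched. Phrases,
// wildcards, and field:term syntax are NOT handled; this only sketches
// the expansion the poster describes.
public class FuzzyRewrite {
    static String rewrite(String query) {
        StringBuilder out = new StringBuilder();
        for (String tok : query.split("\\s+")) {
            String core = tok.replaceAll("[()]", ""); // term without parens
            boolean isOp = core.equals("AND") || core.equals("OR") || core.equals("NOT");
            if (!isOp && !core.isEmpty() && !core.endsWith("~")) {
                tok = tok.replace(core, core + "~");
            }
            if (out.length() > 0) out.append(' ');
            out.append(tok);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("(apple AND orange) OR (mango)"));
        // (apple~ AND orange~) OR (mango~)
    }
}
```

The rewritten string would then be handed to the parser as usual; the newTermQuery override avoids all the tokenization pitfalls this string surgery has.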

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how do lucene read large index files?

2016-11-29 Thread Michael McCandless
It's OK to use NIOFSDirectory for indexing only in that nothing will break.

But, MMapDirectory already uses normal IO for writing
(java.io.FileOutputStream), and indexing does sometimes need to
read (for merging segments) though that's largely sequential reading
so perhaps NIOFSDirectory won't be much slower.

Why not use MMapDirectory for both indexing and searching?

Mike McCandless

http://blog.mikemccandless.com


On Mon, Nov 28, 2016 at 7:20 AM, Kumaran Ramasubramanian
 wrote:
> Thanks a lot Uwe!!! Do we get any benefit from using MMapDirectory over
> NIOFSDir during indexing? During merging? Is it OK to change to
> MMapDirectory during search alone?
>
> --
> Kumaran R
>
>
> On Nov 24, 2016 11:27 PM, "Erick Erickson"  wrote:
>>
>> Thanks Uwe!
>>
>>
>>
>>
>> On Thu, Nov 24, 2016 at 9:41 AM, Uwe Schindler  wrote:
>> > Hi Kumaran, hi Erick,
>> >
>> >> Not really, as I don't know that code well, Uwe and company
>> >> are the masters of that realm ;)
>> >>
>> >> Sorry I can't be more help there
>> >
>> > I can help!
>> >
>> >> On Thu, Nov 24, 2016 at 7:29 AM, Kumaran Ramasubramanian
>> >>  wrote:
>> >> > Erick, Thanks a lot for sharing an excellent post...
>> >> >
>> >> > Btw, I am using NIOFSDirectory; could you please elaborate on the
>> >> > lines quoted below? Or any further pointers?
>> >> > NIOFSDirectory or SimpleFSDirectory, we have to pay another price:
> Our
>> >> code
>> >> >> has to do a lot of syscalls to the O/S kernel to copy blocks of data
>> >> >> between the disk or filesystem cache and our buffers residing in
> Java
>> >> heap.
>> >> >> This needs to be done on every search request, over and over again.
>> >
>> > the blog post says it simply: you should use MMapDirectory and
>> > avoid SimpleFSDir or NIOFSDir! The blog post explains why: SimpleFSDir
>> > and NIOFSDir extend BufferedIndexInput. This class uses an on-heap
>> > buffer (16 KiB) for reading index files. For some parts of the index
>> > (like doc values), this is not ideal. E.g. if you sort against a doc
>> > values field and it needs to access a sort value (e.g. a short,
>> > integer or byte, which is very small), it will ask the buffer for
>> > just those few bytes. In most cases when sorting, the buffer will not
>> > contain those bytes, as sorting requires random access over a huge
>> > file (so it is unlikely that the buffer will help). Then
>> > BufferedIndexInput will seek the NIO/Simple file pointer and read
>> > 16 KiB into the buffer. This requires a syscall to the OS kernel,
>> > which is expensive. During sorting of search results this can happen
>> > millions or billions of times. In addition it will copy chunks of
>> > memory between the Java heap and the operating system cache over and
>> > over.
>> >
>> > With MMapDirectory no buffering is done; the Lucene code directly
>> > accesses the file system cache, and this is much more optimized.
>> >
>> > So for fast index access:
>> > - avoid SimpleFSDir and NIOFSDir (those are only there for legacy
>> >   32-bit operating systems and JVMs)
>> > - configure your operating system kernel as described in the blog
>> >   post and use MMapDirectory
>> > - tell the sysadmin to get familiar with the output of the Linux
>> >   commands free/top/... (or their Windows equivalents)
>> >
>> > Uwe
>> >
>> >> > --
>> >> > Kumaran R
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Nov 23, 2016 at 9:17 PM, Erick Erickson
>> >> 
>> >> > wrote:
>> >> >
>> >> >> see Uwe's blog:
>> >> >> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-
>> >> 64bit.html
>> >> >>
>> >> >> Short form: files are read into the OS's memory as needed. the whole
>> >> >> file isn't read at once.
>> >> >>
>> >> >> Best,
>> >> >> Erick
>> >> >>
>> >> >> On Wed, Nov 23, 2016 at 12:04 AM, Kumaran Ramasubramanian
>> >> >>  wrote:
>> >> >> > Hi All,
>> >> >> >
>> >> >> > How does Lucene read large index files?
>> >> >> > For example, if one file (e.g. a .dat file) is 4 GB,
>> >> >> > does Lucene read only part of the file into RAM? Or
>> >> >> > is there a different approach for different Lucene file formats?
>> >> >> >
>> >> >> >
>> >> >> > Related Link:
>> >> >> > How do applications (and OS) handle very big files?
>> >> >> > http://superuser.com/a/361201
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Kumaran R
>> >> >>
>> >> >>
> -
>> >> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >> >>
>> >> >>
>> >>
>> >> -
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > For additional commands, 

Re: Understanding Query Parser Behavior

2016-11-29 Thread Michael McCandless
Can you try escaping the / character for the query parser?  E.g. pass
this string instead:

String value = "http\\:\\/\\/www.google.com";

Mike McCandless

http://blog.mikemccandless.com
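Lucene's classic QueryParser also ships a static escape(String) method for exactly this. As an illustration, here is a standalone helper in the same spirit; the character set below is an approximation (not the parser's exact list), and it includes '/', which Lucene 4.x treats as starting a regexp term:

```java
// Standalone sketch of query-parser escaping: backslash-escape the
// special characters so a literal like a URL survives parsing. The
// character set is an approximation for illustration; prefer Lucene's
// own QueryParser.escape in real code.
public class QueryEscape {
    static String escape(String s) {
        String special = "\\+-!():^[]\"{}~*?|&/";
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (special.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("http://www.google.com"));
        // http\:\/\/www.google.com
    }
}
```

The printed result is exactly the literal Mike suggests passing to the parser.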


On Tue, Nov 29, 2016 at 11:38 AM, Peru Redmi  wrote:
> Hello ,
>
> It would be great if someone could help with this.
> *Note : I am using Lucene 4.10.4 version*
>
> On Mon, Nov 28, 2016 at 5:37 PM, Peru Redmi  wrote:
>
>> Any help on this would be greatly appreciated.
>>
>> Thanks.
>>
>> On Thu, Nov 24, 2016 at 8:14 PM, Peru Redmi  wrote:
>>
>>>
>>> Hello Mike,
>>>
>>> Here is how I analyze my text using QueryParser (with ClassicAnalyzer)
>>> and with plain ClassicAnalyzer. On checking the same in Luke, I get "//"
>>> as a RegexQuery.
>>>
>>> Here is my code snippet:
>>>
>>> String value = "http\\://www.google.com";
>>> Analyzer anal = new ClassicAnalyzer(Version.LUCENE_30, new
>>> StringReader(""));
>>> QueryParser parser = new QueryParser(Version.LUCENE_30, "name", anal);
>>> Query query = parser.parse(value);
>>> System.out.println(" output terms from query parser ::" + query);
>>>
>>> ArrayList<String> list = new ArrayList<String>();
>>> TokenStream stream = anal.tokenStream("name", new StringReader(value));
>>> stream.reset();
>>> while (stream.incrementToken())
>>> {
>>>     list.add(stream.getAttribute(CharTermAttribute.class).toString());
>>> }
>>> System.out.println(" output terms from analyzer " + list);
>>>
>>>
>>>
>>> output:
>>>
>>> output terms from query parser ::name:http name:// name:www.google.com
>>> output terms from analyzer [http, www.google.com]
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Nov 24, 2016 at 5:10 PM, Michael McCandless <
>>> luc...@mikemccandless.com> wrote:
>>>
 Hi,

 You should double check which analyzer you are using during indexing.

 The same analyzer on the same string should produce the same tokens.

 Mike McCandless

 http://blog.mikemccandless.com


 On Wed, Nov 23, 2016 at 9:38 PM, Peru Redmi 
 wrote:
 > Could someone elaborate this.
 >
 > On Tue, Nov 22, 2016 at 11:41 AM, Peru Redmi 
 wrote:
 >
 >> Hello,
 >> Can you help me out on your "No" .
 >>
 >> On Mon, Nov 21, 2016 at 11:16 PM, wmartin...@gmail.com <
 >> wmartin...@gmail.com> wrote:
 >>
 >>> No
 >>>
 >>> Sent from my LG G4, an AT&T 4G LTE smartphone
 >>>
 >>> -- Original message--
 >>> *From: *Peru Redmi
 >>> *Date: *Mon, Nov 21, 2016 10:44 AM
 >>> *To: *java-user@lucene.apache.org;
 >>> *Cc: *
 >>> *Subject:*Understanding Query Parser Behavior
 >>>
Hello All,

Could someone explain *QueryParser* behavior in these cases?

*1.* While indexing:

Document doc = new Document();
doc.add(new Field("*Field*", "*http://www.google.com*", Field.Store.YES,
Field.Index.ANALYZED));

The index has *two* terms - *http* & *www.google.com*

*2.* While searching:

Analyzer anal = new *ClassicAnalyzer*(Version.LUCENE_30, new
StringReader(""));
QueryParser parser = new *MultiFieldQueryParser*(Version.LUCENE_30, new
String[]{"*Field*"}, anal);
Query query = parser.parse("*http://www.google.com*");

Now the query has *three* terms - (Field:http) *(Field://)* (Field:www.google.com)

i) Why do I get 3 terms while parsing, but 2 terms while indexing (using
the same ClassicAnalyzer in both cases)?
ii) Is this the expected behavior of ClassicAnalyzer(Version.LUCENE_30) in
the parser?
iii) What should be done to avoid the query part *(Field://)*?

Thanks,
Peru.
 >>>
 >>>
 >>

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


>>>
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-29 Thread Michael McCandless
Doc values fields are never loaded into memory; at most some small
index structures are.

When you use those fields, the bytes (for just the one doc values
field you are using) are pulled from disk, and the OS will cache them
in memory if available.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Nov 28, 2016 at 6:01 AM, Chitra R  wrote:
> Hi,
>  When opening a SortedSetDocValuesReaderState at search time, is
> the whole doc values file (.dvd & .dvm) information loaded into memory, or
> is only the specified field's information (say the $facets field) loaded
> into memory?
>
>
>
>
> Any help is much appreciated.
>
>
> Regards,
> Chitra
>
> On Tue, Nov 22, 2016 at 5:47 PM, Chitra R  wrote:
>>
>>
>> Kindly post your suggestions.
>>
>> Regards,
>> Chitra
>>
>>
>>
>> On Sat, Nov 19, 2016 at 1:38 PM, Chitra R  wrote:
>>>
>>> Hey, I got it clearly. Thank you so much. Could you please help us to
>>> implement it in our use case?
>>>
>>>
>>> In our case, we have a dynamic index and it is variable depth too, so
>>> flat facets are enough; no need for hierarchical facets.
>>>
>>> What I think is:
>>>
>>> Index my facet field as a normal doc values field, so that no special
>>> operation (like the taxonomy and sorted-set doc values facet fields) is
>>> done at index time, and only the doc values field stores its ordinals in
>>> the respective field.
>>> At search time, I will pass the query (user search query) and filter
>>> (path traversed list) and collect the matching documents in a
>>> FacetsCollector.
>>> To compute the facet count for a specific field, I will gather those
>>> resulting docs, then move through each segment collecting the matching
>>> ordinals using AtomicReader.
>>>
>>>
>>> And I know that when I use this approach, I can't calculate facet counts
>>> for more than one field (facet) in a search.
>>>
>>> Instead of loading all the dimensions in DocValuesReaderState (which will
>>> take more time and memory) at search time, loading specific fields will
>>> take less time and memory, I hope. Kindly help to solve this.
>>>
>>>
>>> This should keep index and search cost minimal, I think. And I hope this
>>> won't add overhead at index time; at search time this should also be
>>> better.
>>>
>>>
>>> Kindly post your suggestions.
>>>
>>>
>>> Regards,
>>> Chitra
>>>
>>>
>>>
>>>
>>> On Fri, Nov 18, 2016 at 7:15 PM, Michael McCandless
>>>  wrote:

 I think you've summed up exactly the differences!

 And, yes, it would be possible to emulate hierarchical facets on top
 of flat facets, if the hierarchy is fixed depth like year/month/day.

 But if it's variable depth, it's trickier (but I think still
 possible).  See e.g. the Committed Paths drill-down on the left, on
 our dog-food server
 http://jirasearch.mikemccandless.com/search.py?index=jira

 Mike McCandless

 http://blog.mikemccandless.com


 On Fri, Nov 18, 2016 at 1:43 AM, Chitra R  wrote:
 > case 1:
 > In taxonomy, for each indexed document, the facet labels are
 > examined and their ordinals and mappings computed, which are stored in
 > the sidecar index at index time.
 >
 > case 2:
 > In doc values, these ordinals are computed at search time, so
 > there will be a time and memory trade-off between the two cases, I
 > think.
 >
 >
 > In taxonomy, building hierarchical facets at index time makes the
 > faceting cost at search time lower than for flat facets in doc values.
 >
 > Apart from memory, time and NRT latency, is there any other contrast
 > between hierarchical and flat facets at search time?
 >
 >
 > Kindly post your suggestions...
 >
 >
 > Regards,
 > Chitra
 >
 > On Thu, Nov 17, 2016 at 6:40 PM, Chitra R 
 > wrote:
 >>
 >> Okay. I agree with you: Taxonomy maintains and supports hierarchical
 >> facets during indexing. By hierarchical I mean we might index the
 >> field Publish date: 2010/10/15 as Publish date: 2010, Publish date:
 >> 2010/10 and Publish date: 2010/10/15; their facet ordinals are
 >> maintained in the sidecar index and mapped to the main index.
 >>
 >> For example:
 >>
 >> On search-lucene.com, I enter a term (say "facet"); after the search
 >> is performed, the top documents and their categories are displayed.
 >> Say I drill down through Publish date/2010 to collect its child
 >> counts, and afterwards I pass through Publish date/2010/10 to collect
 >> their child counts. For each drill-down, a search is performed to
 >> collect its top docs and categories.
 >>
 >>
 >>Even I can achieve this in 

Re: Understanding Query Parser Behavior

2016-11-29 Thread Peru Redmi
Hello ,

It would be great if someone could help with this.
*Note : I am using Lucene 4.10.4 version*

On Mon, Nov 28, 2016 at 5:37 PM, Peru Redmi  wrote:

> Any help on this would be greatly appreciated.
>
> Thanks.
>
> On Thu, Nov 24, 2016 at 8:14 PM, Peru Redmi  wrote:
>
>>
>> Hello Mike,
>>
>> Here is how I analyze my text using QueryParser (with ClassicAnalyzer)
>> and with plain ClassicAnalyzer. On checking the same in Luke, I get "//"
>> as a RegexQuery.
>>
>> Here is my code snippet:
>>
>>> String value = "http\\://www.google.com";
>>> Analyzer anal = new ClassicAnalyzer(Version.LUCENE_30, new
>>> StringReader(""));
>>> QueryParser parser = new QueryParser(Version.LUCENE_30, "name", anal);
>>> Query query = parser.parse(value);
>>> System.out.println(" output terms from query parser ::" + query);
>>>
>>> ArrayList<String> list = new ArrayList<String>();
>>> TokenStream stream = anal.tokenStream("name", new StringReader(value));
>>> stream.reset();
>>> while (stream.incrementToken())
>>> {
>>>     list.add(stream.getAttribute(CharTermAttribute.class).toString());
>>> }
>>> System.out.println(" output terms from analyzer " + list);
>>
>>
>>
>> output:
>>
>> output terms from query parser ::name:http name:// name:www.google.com
>> output terms from analyzer [http, www.google.com]
>>
>>
>>
>>
>>
>>
>> On Thu, Nov 24, 2016 at 5:10 PM, Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>>> Hi,
>>>
>>> You should double check which analyzer you are using during indexing.
>>>
>>> The same analyzer on the same string should produce the same tokens.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Wed, Nov 23, 2016 at 9:38 PM, Peru Redmi 
>>> wrote:
>>> > Could someone elaborate this.
>>> >
>>> > On Tue, Nov 22, 2016 at 11:41 AM, Peru Redmi 
>>> wrote:
>>> >
>>> >> Hello,
>>> >> Can you help me out on your "No" .
>>> >>
>>> >> On Mon, Nov 21, 2016 at 11:16 PM, wmartin...@gmail.com <
>>> >> wmartin...@gmail.com> wrote:
>>> >>
>>> >>> No
>>> >>>
>>> >>> Sent from my LG G4, an AT&T 4G LTE smartphone
>>> >>>
>>> >>> -- Original message--
>>> >>> *From: *Peru Redmi
>>> >>> *Date: *Mon, Nov 21, 2016 10:44 AM
>>> >>> *To: *java-user@lucene.apache.org;
>>> >>> *Cc: *
>>> >>> *Subject:*Understanding Query Parser Behavior
>>> >>>
>>> >>> Hello All,
>>> >>>
>>> >>> Could someone explain *QueryParser* behavior in these cases?
>>> >>>
>>> >>> *1.* While indexing:
>>> >>>
>>> >>> Document doc = new Document();
>>> >>> doc.add(new Field("*Field*", "*http://www.google.com*",
>>> >>> Field.Store.YES, Field.Index.ANALYZED));
>>> >>>
>>> >>> The index has *two* terms - *http* & *www.google.com*
>>> >>>
>>> >>> *2.* While searching:
>>> >>>
>>> >>> Analyzer anal = new *ClassicAnalyzer*(Version.LUCENE_30, new
>>> >>> StringReader(""));
>>> >>> QueryParser parser = new *MultiFieldQueryParser*(Version.LUCENE_30,
>>> >>> new String[]{"*Field*"}, anal);
>>> >>> Query query = parser.parse("*http://www.google.com*");
>>> >>>
>>> >>> Now the query has *three* terms - (Field:http) *(Field://)*
>>> >>> (Field:www.google.com)
>>> >>>
>>> >>> i) Why do I get 3 terms while parsing, but 2 terms while indexing
>>> >>> (using the same ClassicAnalyzer in both cases)?
>>> >>> ii) Is this the expected behavior of
>>> >>> ClassicAnalyzer(Version.LUCENE_30) in the parser?
>>> >>> iii) What should be done to avoid the query part *(Field://)*?
>>> >>>
>>> >>> Thanks,
>>> >>> Peru.
>>> >>>
>>> >>>
>>> >>
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>
>


Re: BlockJoin with RAM Directory

2016-11-29 Thread Mikhail Khludnev
Use the purpose-built
https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/index/IndexWriter.html#addDocuments-java.lang.Iterable-
which prevents a flush from cutting a block in the middle.
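The contract addDocuments relies on (children first, parent last, the whole block passed in one call) can be sketched generically; the `BlockLayout` helper and the String stand-ins for Lucene Documents are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the block layout IndexWriter.addDocuments expects: each
// block is one Iterable with the children first and the parent last.
// Because the whole block is added in a single call, a flush can never
// separate a parent from its children into different segments.
public class BlockLayout {
    // Build one block from its children plus the parent (parent goes last).
    static List<String> block(List<String> children, String parent) {
        List<String> b = new ArrayList<>(children);
        b.add(parent);
        return b;
    }

    public static void main(String[] args) {
        List<String> b = block(List.of("child1_1", "child1_2", "child1_3"), "parent1");
        System.out.println(b); // [child1_1, child1_2, child1_3, parent1]
        // With Lucene, this would be a List<Document> passed to
        // writer.addDocuments(b) in one call, instead of per-document
        // addDocument calls that a flush could split.
    }
}
```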

On Tue, Nov 29, 2016 at 12:27 PM, ASKozitsin  wrote:

> Hi everyone!
>
> I'm trying to fill RAMDirectory with documents according to BlockJoin
> structure:
>* child1_1
>* child1_2
>* child1_3
>- parent1
>* child2_1
>* child2_2
>- parent2
> and so on.
>
> If I have a small number of documents (fewer than 10,000), everything is
> okay: I can search among them.
> Also CheckJoinIndex.check(directoryReader, new
> QueryBitSetProducer(IntPoint.newExactQuery(IS_PARENT_DOCUMENT, PARENT)));
> works fine.
>
> But if the document count is greater than 10k (for example 50k), I receive
> the error message "Every segment should have at least one parent, but
> _0(6.3.0):c1 does not have any".
>
> I suppose that my documents are split across several RAMFiles, and this
> split divides a child-parent block into two files.
>
> My own check on the document collection does not show errors.
> Should I avoid RAMDirectory? Or is there any option to control the
> RAMFile split?
>
> Thanks in advance!
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/BlockJoin-with-RAM-Directory-tp4307818.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Sincerely yours
Mikhail Khludnev


BlockJoin with RAM Directory

2016-11-29 Thread ASKozitsin
Hi everyone!

I'm trying to fill RAMDirectory with documents according to BlockJoin
structure:
   * child1_1
   * child1_2
   * child1_3
   - parent1
   * child2_1
   * child2_2
   - parent2
and so on.

If I have a small number of documents (fewer than 10,000), everything is
okay: I can search among them.
Also CheckJoinIndex.check(directoryReader, new
QueryBitSetProducer(IntPoint.newExactQuery(IS_PARENT_DOCUMENT, PARENT)));
works fine.

But if the document count is greater than 10k (for example 50k), I receive
the error message "Every segment should have at least one parent, but
_0(6.3.0):c1 does not have any".

I suppose that my documents are split across several RAMFiles, and this
split divides a child-parent block into two files.

My own check on the document collection does not show errors.
Should I avoid RAMDirectory? Or is there any option to control the
RAMFile split?

Thanks in advance!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/BlockJoin-with-RAM-Directory-tp4307818.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org