Re: [lucy-user] Lucy Benchmarking

2017-02-14 Thread Nick Wellnhofer

On 14/02/2017 00:57, Kasi Lakshman Karthi Anbumony wrote:

(1) What is the data structure used to represent the Lexicon? (Clownfish
supports a hash table. Does that mean Lucy uses a hash table?)


Lexicon is essentially a sorted on-disk array that is searched with binary 
search. Clownfish::Hash, on the other hand, is an in-memory data structure. 
Lucy doesn't build in-memory structures for most index data because this would 
incur a huge startup penalty. This also makes it possible to work with indices 
that don't fit in RAM, although performance deteriorates quickly in this case.
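
To illustrate the idea (a conceptual sketch only, not Lucy's actual lexicon
code): a term array sorted on disk can be probed with an ordinary binary
search, and the matching entry then points at that term's posting data.

    #include <stddef.h>
    #include <string.h>

    /* Conceptual sketch only -- not Lucy's real on-disk format. Each entry
     * pairs a term with the offset of its posting list. */
    typedef struct {
        const char *term;
        long        postings_offset;
    } LexEntry;

    /* Binary search over a lexicon sorted by term; returns NULL on a miss. */
    static const LexEntry*
    lex_lookup(const LexEntry *lex, size_t num_terms, const char *term) {
        size_t lo = 0, hi = num_terms;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            int cmp = strcmp(lex[mid].term, term);
            if (cmp == 0)     { return &lex[mid]; }
            else if (cmp < 0) { lo = mid + 1; }
            else              { hi = mid; }
        }
        return NULL;
    }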



(2) What is the data structure used to represent postings? (Clownfish
supports a hash table. Does that mean Lucy uses a hash table?)


Posting lists are stored in an on-disk array. The indices into that array are found in the Lexicon.


(3) Which compression method is used? Is it enabled by default?


Lexicon and posting list data is always compressed with delta encoding for 
numbers and incremental encoding for strings.
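
As a rough illustration of both techniques (a sketch of the general idea, not
Lucy's actual encoder): delta encoding stores each document ID as the gap from
the previous one, and incremental (prefix) encoding stores each term as the
length of the prefix shared with the previous term plus the remaining suffix.

    #include <stdio.h>
    #include <string.h>

    /* Delta-encode a sorted list of doc IDs: store gaps instead of absolute
     * values, so most numbers become small and compress well. */
    static void delta_encode(const int *ids, int *gaps, size_t n) {
        int prev = 0;
        for (size_t i = 0; i < n; i++) {
            gaps[i] = ids[i] - prev;
            prev = ids[i];
        }
    }

    /* Incremental (prefix) encoding of a sorted term list: emit the length of
     * the prefix shared with the previous term, then the new suffix. */
    static void prefix_encode(const char *prev, const char *term) {
        size_t shared = 0;
        while (prev[shared] && prev[shared] == term[shared]) { shared++; }
        printf("shared=%zu suffix=\"%s\"\n", shared, term + shared);
    }

    int main(void) {
        int ids[] = {3, 7, 21, 22, 40};
        int gaps[5];
        delta_encode(ids, gaps, 5);           /* -> 3, 4, 14, 1, 18 */
        prefix_encode("search", "searching"); /* -> shared=6 suffix="ing" */
        return 0;
    }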



(4) Why is there no API (function call) to get the number of terms in the
lexicon and posting list for a given cf.dat?


It's generally hard to tell why a certain feature wasn't implemented. The only 
answer I can give is that no one deemed it important enough so far. But Lucy 
is open-source software. So, basically, anyone can implement any features they 
want.



(5) Can I know whether searching through the lexicon/posting list is an
in-memory process or an I/O process?


Lucy uses memory-mapped files to access most index data, so the distinction 
between in-memory and I/O-based operation blurs quite a bit.
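
For reference, the general technique looks roughly like this (plain POSIX
mmap, not Lucy's actual FileHandle/InStream code; the path is made up): once a
file is mapped, "reading" it is just a memory access that may or may not fault
a page in from disk.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/path/to/index/seg_1/cf.dat", O_RDONLY); /* hypothetical path */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* Map the whole file read-only; the kernel pages it in lazily. */
        const unsigned char *data =
            mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* Accessing data[i] may be a pure memory read (page already cached)
         * or may trigger disk I/O (page fault) -- the code looks the same. */
        unsigned long checksum = 0;
        for (off_t i = 0; i < st.st_size; i++) { checksum += data[i]; }
        printf("%lu\n", checksum);

        munmap((void*)data, st.st_size);
        close(fd);
        return 0;
    }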


Nick



Re: [lucy-user] Lucy Benchmarking

2017-02-13 Thread Kasi Lakshman Karthi Anbumony
Hi Murphy:

Thanks for your detailed explanation.

Given the significance of inverted index compression, can I know the
following for a better understanding of the inner workings:

(1) What is the data structure used to represent the Lexicon? (Clownfish
supports a hash table. Does that mean Lucy uses a hash table?)

(2) What is the data structure used to represent postings? (Clownfish
supports a hash table. Does that mean Lucy uses a hash table?)

(3) Which compression method is used? Is it enabled by default?

(4) Why is there no API (function call) to get the number of terms in the
lexicon and posting list for a given cf.dat?

(5) Can I know whether searching through the lexicon/posting list is an
in-memory process or an I/O process?

Thanks
-Kasi


On Sat, Feb 11, 2017 at 1:30 PM, Marvin Humphrey 
wrote:

> On Thu, Feb 9, 2017 at 3:51 PM, Kasi Lakshman Karthi Anbumony
>  wrote:
>
> > As a follow on question, based on this link:
> > https://lucy.apache.org/docs/c/Lucy/Docs/FileFormat.html
> >
> > (1) Why does cf.dat have a document section?
>
> The search needs to give something back to you to identify which
> documents were hits. Lucy's internal document IDs change over time, so
> are not suitable for that purpose.  You need to at least store your
> own identifier, even if you choose not to store other parts of the
> document.
>
> > (2) Why is it not compressed?
>
> It's not done by default, but there are extension points allowing that
> behavior to be overridden. There's even example code which ships with
> Lucy which does exactly what you suggest.  It's in Perl, but could be
> ported to C.
>
>  $REPO/perl/lib/LucyX/Index/ZlibDocReader.pm
>  $REPO/perl/lib/LucyX/Index/ZlibDocWriter.pm
>
> > I see most of the content of the books I have indexed being part of the
> > cf.dat file, and I can read the text as-is! Is this how inverted indexing
> > works?
>
> The document storage part of a Lucy datastore is separate from the
> inverted index.  The inverted index data structures are definitely
> compressed, using algorithms tuned to the task of search. The first
> part of the search yields a set of internal Lucy document IDs, which
> are then used to look up whatever's in document storage.
>
> From a performance perspective, the cost to perform the inverted index
> search is roughly proportional to the size of the corpus, whereas the
> cost to retrieve the document content afterwards is proportional to
> the number of documents retrieved.  When scaling to larger
> collections, compressing the inverted index is more important than
> compressing document storage, since the number of documents searched
> grows while the number of documents retrieved often stays the same.
>
> Of course it may still be reasonable to compress document storage,
> depending on usage pattern. But if for example you're only storing
> short identifiers, there's no need.
>
> Marvin Humphrey
>


Re: [lucy-user] Lucy Benchmarking

2017-02-11 Thread Marvin Humphrey
On Thu, Feb 9, 2017 at 3:51 PM, Kasi Lakshman Karthi Anbumony
 wrote:

> As a follow on question, based on this link:
> https://lucy.apache.org/docs/c/Lucy/Docs/FileFormat.html
>
> (1) Why does cf.dat have a document section?

The search needs to give something back to you to identify which
documents were hits. Lucy's internal document IDs change over time, so
are not suitable for that purpose.  You need to at least store your
own identifier, even if you choose not to store other parts of the
document.

> (2) Why is it not compressed?

It's not done by default, but there are extension points allowing that
behavior to be overridden. There's even example code which ships with
Lucy which does exactly what you suggest.  It's in Perl, but could be
ported to C.

 $REPO/perl/lib/LucyX/Index/ZlibDocReader.pm
 $REPO/perl/lib/LucyX/Index/ZlibDocWriter.pm

> I see most of the content of the books I have indexed being part of the
> cf.dat file, and I can read the text as-is! Is this how inverted indexing
> works?

The document storage part of a Lucy datastore is separate from the
inverted index.  The inverted index data structures are definitely
compressed, using algorithms tuned to the task of search. The first
part of the search yields a set of internal Lucy document IDs, which
are then used to look up whatever's in document storage.
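
In C, that second step looks roughly like the loop in Lucy's
c/sample/search.c (a sketch that assumes `searcher` and `query` already exist
and uses "title" as an example stored field; treat the exact calls as an
approximation of that sample rather than gospel):

    /* Sketch modeled loosely on c/sample/search.c: run the query, then pull
     * a stored field out of document storage for each hit. Assumes `searcher`
     * (IndexSearcher*) and `query` (Query*) were built earlier. */
    String *field = Str_newf("title");   /* example stored field */
    Hits   *hits  = IxSearcher_Hits(searcher, (Obj*)query, 0, 10, NULL);

    HitDoc *hit;
    while (NULL != (hit = Hits_Next(hits))) {
        /* The inverted-index search produced the hit; this lookup reads the
         * separate document storage. */
        String *value = (String*)HitDoc_Extract(hit, field);
        char *utf8 = Str_To_Utf8(value);
        printf("%s\n", utf8);
        free(utf8);
        DECREF(value);
        DECREF(hit);
    }

    DECREF(hits);
    DECREF(field);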

From a performance perspective, the cost to perform the inverted index
search is roughly proportional to the size of the corpus, whereas the
cost to retrieve the document content afterwards is proportional to
the number of documents retrieved.  When scaling to larger
collections, compressing the inverted index is more important than
compressing document storage, since the number of documents searched
grows while the number of documents retrieved often stays the same.

Of course it may still be reasonable to compress document storage,
depending on usage pattern. But if for example you're only storing
short identifiers, there's no need.

Marvin Humphrey


Re: [lucy-user] Lucy Benchmarking

2017-02-09 Thread Peter Karman

Kasi Lakshman Karthi Anbumony wrote on 2/9/17 5:51 PM:

Thanks for the explanation.

As a follow on question, based on this link:
https://lucy.apache.org/docs/c/Lucy/Docs/FileFormat.html

(1) Why does cf.dat have a document section?

(2) Why is it not compressed?

I see most of the content of the books I have indexed being part of the
cf.dat file, and I can read the text as-is! Is this how inverted indexing
works?


Do you have the "stored" flag or "highlightable" flag set to true for your 
Plan::FullTextType schema definitions?


IIRC that's why doc text is stored, which seems to be confirmed in that URL you 
reference.


As for why it is not compressed, I'm not sure. I expect that decompression 
would incur a performance hit.



--
Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman


Re: [lucy-user] Lucy Benchmarking

2017-02-09 Thread Kasi Lakshman Karthi Anbumony
Thanks for the explanation.

As a follow on question, based on this link:
https://lucy.apache.org/docs/c/Lucy/Docs/FileFormat.html

(1) Why does cf.dat have a document section?

(2) Why is it not compressed?

I see most of the content of the books I have indexed being part of the
cf.dat file, and I can read the text as-is! Is this how inverted indexing
works?

Thanks
-Kasi

On Thu, Feb 9, 2017 at 1:21 PM, Peter Karman  wrote:

> >
> >
> >- vary relationship of terms (e.g., proximity)
> >>   - How to do it? Is there an operator like NEAR?
> >>
> >
> > There's ProximityQuery but I'm not sure how it works:
> >
> > http://lucy.apache.org/docs/c/LucyX/Search/ProximityQuery.html
> >
> >
>
> You can see one example of ProximityQuery usage here (Perl):
>
> https://metacpan.org/source/KARMAN/Search-Query-Dialect-Lucy-0.202/lib/Search/Query/Dialect/Lucy.pm#L701
>
> Of note:
>
> * `within` is like NEAR - it takes an integer argument
> * order of terms is respected. It's like a phrase.
>
>
>
> --
> Peter Karman . https://peknet.com/  .
> https://keybase.io/peterkarman
>


Re: [lucy-user] Lucy Benchmarking

2017-02-09 Thread Peter Karman
>
>
>- vary relationship of terms (e.g., proximity)
>>   - How to do it? Is there an operator like NEAR?
>>
>
> There's ProximityQuery but I'm not sure how it works:
>
> http://lucy.apache.org/docs/c/LucyX/Search/ProximityQuery.html
>
>

You can see one example of ProximityQuery usage here (Perl):

https://metacpan.org/source/KARMAN/Search-Query-Dialect-Lucy-0.202/lib/Search/Query/Dialect/Lucy.pm#L701

Of note:

* `within` is like NEAR - it takes an integer argument
* order of terms is respected. It's like a phrase.
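
In the C binding, constructing one would presumably look something like this
(a sketch only; the field/terms/within arguments mirror the Perl usage above,
so double-check the LucyX::Search::ProximityQuery C docs for the exact
constructor signature):

    /* Sketch: match docs where "apache" and "lucy" occur within 10 positions
     * of each other, in that order. Argument order mirrors the Perl binding
     * and should be verified against the C docs. */
    String *field = Str_newf("content");
    Vector *terms = Vec_new(2);
    Vec_Push(terms, (Obj*)Str_newf("apache"));
    Vec_Push(terms, (Obj*)Str_newf("lucy"));

    ProximityQuery *prox_query = ProximityQuery_new(field, terms, 10);

    /* ... hand prox_query to the searcher like any other Query ... */

    DECREF(prox_query);
    DECREF(terms);
    DECREF(field);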



-- 
Peter Karman . https://peknet.com/  .
https://keybase.io/peterkarman


Re: [lucy-user] Lucy Benchmarking

2017-02-09 Thread Nick Wellnhofer

On 09/02/2017 01:46, Kasi Lakshman Karthi Anbumony wrote:

(1) Plan is to report the below metrics:

   - Index creation: tokens/second
     - Can I know how to obtain the number of tokens in the lucy_index
       created? Do you think a better metric would be number of terms in
       the posting list per second? If so, how to obtain the number of
       terms in the posting list?


AFAIK, the total number of terms in all input documents isn't available 
because the term frequencies aren't stored separately. I'd simply use the 
total size of the input documents in bytes.
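
For instance, the harness can just time the indexing run and divide by the
input size (a sketch; index_documents() and its byte count are stand-ins for
whatever your benchmark actually does):

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical harness function: index the corpus and return how many
     * bytes of input text were processed. */
    extern long long index_documents(void);

    int main(void) {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);

        long long bytes = index_documents();

        clock_gettime(CLOCK_MONOTONIC, &end);
        double secs = (end.tv_sec - start.tv_sec)
                      + (end.tv_nsec - start.tv_nsec) / 1e9;

        printf("indexed %lld bytes in %.2f s (%.2f MB/s)\n",
               bytes, secs, bytes / secs / 1e6);
        return 0;
    }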



(2) What are the different query types possible?

   - vary document weighting
     - Is it possible, or is it fixed for a given lucy_index generated?


You can apply a boost to queries at query time:

http://lucy.apache.org/docs/c/Lucy/Search/Query.html#func_Set_Boost

And to fields and documents at indexing time:

http://lucy.apache.org/docs/c/Lucy/Plan/FieldType.html#func_Set_Boost
http://lucy.apache.org/docs/c/Lucy/Index/Indexer.html#func_Add_Doc
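
For example, a query-time boost might look like this (a minimal sketch;
title_query and body_query are assumed to exist already, and Query_Set_Boost
is the C short name for the Set_Boost method linked above):

    /* Weight title matches twice as heavily as body matches, then OR them.
     * Following the refcount convention of the C samples, Vec_Push takes
     * over the pushed references. */
    Query_Set_Boost(title_query, 2.0);
    Query_Set_Boost(body_query, 1.0);

    Vector *children = Vec_new(2);
    Vec_Push(children, (Obj*)title_query);
    Vec_Push(children, (Obj*)body_query);
    Query *combined = (Query*)ORQuery_new(children);
    DECREF(children);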

But for benchmarking purposes, it mostly matters whether you sort by score, 
document id, or a field value. See


http://lucy.apache.org/docs/c/Lucy/Search/SortSpec.html


   - vary relationship of terms (e.g., proximity)
     - How to do it? Is there an operator like NEAR?


There's ProximityQuery but I'm not sure how it works:

http://lucy.apache.org/docs/c/LucyX/Search/ProximityQuery.html


   - vary operations (e.g., AND, OR)
     - I see that support is available for a boolean query parser. Can I
       know whether, for a given search instance, I can have multiple
       boolean queries like the ones below?


Yes, that's possible.

Nick



Re: [lucy-user] Lucy Benchmarking

2017-02-08 Thread Kasi Lakshman Karthi Anbumony
Dear Experts:

I am trying to measure the indexing and searching performance using a toy
benchmark and have a few questions.

(1) Plan is to report the below metrics:

   - Index creation: tokens/second
     - Can I know how to obtain the number of tokens in the lucy_index
       created? Do you think a better metric would be number of terms in
       the posting list per second? If so, how to obtain the number of
       terms in the posting list?
   - Search: free-text queries/second
     - The search metric is clean and clear since the toy benchmark
       controls the number of queries.

(2) What are the different query types possible?

   - vary document weighting
     - Is it possible, or is it fixed for a given lucy_index generated?
   - vary number of terms
   - vary relationship of terms (e.g., proximity)
     - How to do it? Is there an operator like NEAR?
   - vary operations (e.g., AND, OR)
     - I see that support is available for a boolean query parser. Can I
       know whether, for a given search instance, I can have multiple
       boolean queries like the ones below?

if (category) {
    String *category_name = Str_newf("category");
    String *category_str  = Str_newf("%s", category);
    TermQuery *category_query
        = TermQuery_new(category_name, (Obj*)category_str);

    // AND the category filter onto query1.
    Vector *children1 = Vec_new(2);
    Vec_Push(children1, (Obj*)query1);
    // category_query goes into two Vectors; since Vec_Push takes over a
    // reference, grab an extra one before the first push.
    INCREF(category_query);
    Vec_Push(children1, (Obj*)category_query);
    query1 = (Query*)ANDQuery_new(children1);

    // AND the same category filter onto query2.
    Vector *children2 = Vec_new(2);
    Vec_Push(children2, (Obj*)query2);
    Vec_Push(children2, (Obj*)category_query);
    query2 = (Query*)ANDQuery_new(children2);

    DECREF(children2);
    DECREF(children1);
    DECREF(category_str);
    DECREF(category_name);
}


Thanks
-Kasi



On Thu, Feb 2, 2017 at 5:26 PM, Nick Wellnhofer  wrote:

> On 02/02/2017 21:44, Kasi Lakshman Karthi Anbumony wrote:
>
>> Can I know how to build lucy and lucy-clownfish for ARM (AARCH64)?
>>
>> I do have the ARM cross-compiler tool chain and would like to know which
>> files to change?
>>
>
> Cross compiling Lucy isn't supported yet. I haven't tried to build Lucy on
> ARM myself, but we have successful test reports from CPAN Testers with
> Raspberry Pis. So, if you're feeling adventurous:
>
> 1. Build the Clownfish compiler normally for the host platform.
> 2. Configure the Clownfish runtime using the host compiler.
> 3. Edit the generated Makefile.
>- Replace CC with the cross compiler.
>- Check CFLAGS etc.
> 4. Edit the generated charmony.h file to match the target
>platform.
>- CHY_SIZEOF macros
>- Endian macro
>- Possibly other stuff
> 5. (Maybe) Run `make autogen/hierarchy.json` first and edit the
>generated file autogen/include/cfish_platform.h to match the
>target platform.
> 6. Run `make`. If you run into errors, adjust charmony.h or the
>Makefile.
> 7. Make sure to make backups of Makefile, charmony.h, and
>cfish_platform.h. These files might be recreated and you'll
>lose your changes.
> 8. Repeat steps 2-7 for Lucy.
>
> Nick
>
>


Re: [lucy-user] Lucy Benchmarking

2017-02-02 Thread Nick Wellnhofer

On 02/02/2017 21:44, Kasi Lakshman Karthi Anbumony wrote:

Can I know how to build lucy and lucy-clownfish for ARM (AARCH64)?

I do have the ARM cross-compiler tool chain and would like to know which
files to change?


Cross compiling Lucy isn't supported yet. I haven't tried to build Lucy on ARM 
myself, but we have successful test reports from CPAN Testers with Raspberry 
Pis. So, if you're feeling adventurous:


1. Build the Clownfish compiler normally for the host platform.
2. Configure the Clownfish runtime using the host compiler.
3. Edit the generated Makefile.
   - Replace CC with the cross compiler.
   - Check CFLAGS etc.
4. Edit the generated charmony.h file to match the target
   platform.
   - CHY_SIZEOF macros
   - Endian macro
   - Possibly other stuff
5. (Maybe) Run `make autogen/hierarchy.json` first and edit the
   generated file autogen/include/cfish_platform.h to match the
   target platform.
6. Run `make`. If you run into errors, adjust charmony.h or the
   Makefile.
7. Make sure to make backups of Makefile, charmony.h, and
   cfish_platform.h. These files might be recreated and you'll
   lose your changes.
8. Repeat steps 2-7 for Lucy.

Nick



Re: [lucy-user] Lucy Benchmarking

2017-02-02 Thread Kasi Lakshman Karthi Anbumony
Thanks Nick.

Can I know how to build lucy and lucy-clownfish for ARM (AARCH64)?

I do have the ARM cross-compiler tool chain and would like to know which
files to change?

Thanks
-Kasi

On Wed, Feb 1, 2017 at 7:42 AM, Nick Wellnhofer  wrote:

> On 01/02/2017 01:44, Kasi Lakshman Karthi Anbumony wrote:
>
>> (1)  Is Lucy multithreaded or single threaded?
>>
>
> Single-threaded.
>
> (2) Are "C" runtime and bindings stable?
>>
>
> Yes.
>
> (3) Is there preexisting benchmark code written in "C" to measure Lucy
> performance?
>>
>
> No.
>
> (4) I am seeing one under devel/benchmarks/indexers/LuceneIndexer.java.
> But this one is written in Java and looks like it benchmarks Lucene, not
> Lucy. Am I right in my observation?
>>
>
> The corresponding Perl benchmark script for Lucy is lucy_indexer.plx:
>
>
> https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=tree;f=devel/benchmarks/indexers;h=77626c37285602941376c5e5950a20e50683da40;hb=HEAD
>
> (5) I was thinking of modifying the lucy/c/sample applications as a
> benchmarking application. Is this a good strategy?
> Btw, is there a good way to build the sample files? I have to modify the
> Makefile in the lucy/c/ directory to build the sample files, and I am not
> sure if this is the correct way.
>>
>
> You can find some guidance on how to compile Lucy applications in the
> comment on top of getting_started.c:
>
>
> https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=c/sample/getting_started.c;h=6d6193d772f2ceaac86c67cc49169878b4d4d2f6;hb=HEAD
>
> Basically, you have to run the Clownfish compiler "cfc" to generate header
> files, then you can compile your code and link against libclownfish and
> liblucy.
>
> Benchmark results for the indexer will largely depend on the particular
> Analyzer chain and the total size of your index. The default EasyAnalyzer
> consists of
>
> - StandardTokenizer
> - Unicode Normalizer
> - SnowballStemmer
>
> StandardTokenizer is pretty fast, but Normalizer and Stemmer are
> CPU-intensive. Last time I checked, they account for about two-thirds of
> the processing time for small indices.
>
> A better benchmarking framework would be a much-needed contribution.
>
> Nick
>
>


Re: [lucy-user] Lucy Benchmarking

2017-02-01 Thread Nick Wellnhofer

On 01/02/2017 01:44, Kasi Lakshman Karthi Anbumony wrote:

(1)  Is Lucy multithreaded or single threaded?


Single-threaded.


(2) Are "C" runtime and bindings stable?


Yes.


(3) Is there preexisting benchmark code written in "C" to measure Lucy 
performance?


No.


(4) I am seeing one under devel/benchmarks/indexers/LuceneIndexer.java. But 
this one is written in Java and looks like it benchmarks Lucene, not Lucy. Am 
I right in my observation?


The corresponding Perl benchmark script for Lucy is lucy_indexer.plx:


https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=tree;f=devel/benchmarks/indexers;h=77626c37285602941376c5e5950a20e50683da40;hb=HEAD


(5) I was thinking of modifying the lucy/c/sample applications as a 
benchmarking application. Is this a good strategy?
Btw, is there a good way to build the sample files? I have to modify the 
Makefile in the lucy/c/ directory to build the sample files, and I am not 
sure if this is the correct way.


You can find some guidance on how to compile Lucy applications in the comment 
on top of getting_started.c:



https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=c/sample/getting_started.c;h=6d6193d772f2ceaac86c67cc49169878b4d4d2f6;hb=HEAD

Basically, you have to run the Clownfish compiler "cfc" to generate header 
files, then you can compile your code and link against libclownfish and liblucy.


Benchmark results for the indexer will largely depend on the particular 
Analyzer chain and the total size of your index. The default EasyAnalyzer 
consists of


- StandardTokenizer
- Unicode Normalizer
- SnowballStemmer

StandardTokenizer is pretty fast, but Normalizer and Stemmer are 
CPU-intensive. Last time I checked, they account for about two-thirds of the 
processing time for small indices.
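
One way to quantify that in your own benchmark is to index the same corpus
with two schema variants, the full EasyAnalyzer chain versus the tokenizer
alone, and compare wall-clock time (a sketch following the pattern of
c/sample/getting_started.c; using StandardTokenizer by itself as the analyzer
is an assumption worth checking against the docs):

    /* Variant A: the full EasyAnalyzer chain (tokenize + normalize + stem). */
    String       *language = Str_newf("en");
    EasyAnalyzer *easy     = EasyAnalyzer_new(language);
    FullTextType *full     = FullTextType_new((Analyzer*)easy);

    /* Variant B: tokenizer only, to isolate the Normalizer/Stemmer cost. */
    StandardTokenizer *tokenizer = StandardTokenizer_new();
    FullTextType      *tok_only  = FullTextType_new((Analyzer*)tokenizer);

    Schema *schema  = Schema_new();
    String *content = Str_newf("content");
    Schema_Spec_Field(schema, content, (FieldType*)full);   /* or tok_only */

    /* ... index the same corpus with each schema and compare wall time ... */

    DECREF(content);
    DECREF(schema);
    DECREF(tok_only);
    DECREF(tokenizer);
    DECREF(full);
    DECREF(easy);
    DECREF(language);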


A better benchmarking framework would be a much-needed contribution.

Nick