Re: [lucy-user] Re: Regarding document Ids

2016-11-21 Thread Serkan Mulayim
Thank you Peter for your comments. Regards...

On Wed, Nov 16, 2016 at 3:05 PM, Peter Karman  wrote:

> Serkan Mulayim wrote on 11/16/16, 2:21 PM:
>
>> Thank you Peter for your quick response.
>>
>> As I understand it, before adding new documents to the index, you delete by
>> query (using your primary key). How is the performance on your end,
>> then? Since delete by query will search through all segments in the index
>> for the deletion, I feel like the performance would be affected. Roughly,
>> how many documents do you have in your index, and what is the document
>> size?
>>
>> BTW, my document sizes are very small, and I think I will have around 40K
>> documents.
>>
>>
> performance is fast enough for me. I have 1MM+ docs but not much churn
> (not updating docs constantly). IME the bottleneck is not the search. It's
> a search engine; it's pretty fast. The bottleneck is updating the index.
> That's true whether you delete first or not.
>
>
>
> --
> Peter Karman  .  http://peknet.com/  .  pe...@peknet.com
>


Re: [lucy-user] Re: Regarding document Ids

2016-11-16 Thread Peter Karman

Serkan Mulayim wrote on 11/16/16, 2:21 PM:

> Thank you Peter for your quick response.
>
> As I understand it, before adding new documents to the index, you delete by
> query (using your primary key). How is the performance on your end,
> then? Since delete by query will search through all segments in the index
> for the deletion, I feel like the performance would be affected. Roughly,
> how many documents do you have in your index, and what is the document size?
>
> BTW, my document sizes are very small, and I think I will have around 40K
> documents.



performance is fast enough for me. I have 1MM+ docs but not much churn (not 
updating docs constantly). IME the bottleneck is not the search. It's a search 
engine; it's pretty fast. The bottleneck is updating the index. That's true 
whether you delete first or not.



--
Peter Karman  .  http://peknet.com/  .  pe...@peknet.com


Re: [lucy-user] Re: Regarding document Ids

2016-11-16 Thread Peter Karman

Serkan Mulayim wrote on 11/16/16, 1:17 PM:

> Hi guys,
>
> I think I need to simplify my question. After reading it one more time, I
> realized I touched on many things, and it seems confusing.
>
> It seems that if we index the same document twice, a new document is
> created. And as per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html, "If
> you truly need a primary key field, you must define it and populate it
> yourself". How can we do this? Are there any examples of this? Should I
> search for the document with the primary key before indexing and, if it
> exists, not index it?


What I do in all my apps is use delete_by_term
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Index/Indexer.pod#delete_by_term

I have my own primary key system that varies based on the application. Sometimes 
it is a URI, sometimes a db PK. I maintain the document integrity myself.
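
To make the delete-then-add idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not Lucy's API: the names ToyIndex, live_docs and upsert are hypothetical, and only delete_by_term is named after the Lucy Indexer method linked above. It assumes the primary key lives in a field called "id" and that the key is matched exactly.

```python
class ToyIndex:
    """Toy append-only index: adding a document never overwrites an old one."""

    def __init__(self):
        self._docs = []        # position in this list acts as the internal doc id
        self._deleted = set()  # internal ids that have been marked deleted

    def add_doc(self, doc):
        # Always appends, so indexing the same document twice yields two entries.
        self._docs.append(dict(doc))

    def delete_by_term(self, field, term):
        # Mark every live document whose field equals term as deleted.
        for doc_id, doc in enumerate(self._docs):
            if doc_id not in self._deleted and doc.get(field) == term:
                self._deleted.add(doc_id)

    def live_docs(self):
        # Return the documents that have not been marked deleted.
        return [d for i, d in enumerate(self._docs) if i not in self._deleted]


def upsert(index, doc, key_field="id"):
    """Delete any document with the same primary key, then add the new one."""
    index.delete_by_term(key_field, doc[key_field])
    index.add_doc(doc)


if __name__ == "__main__":
    plain = ToyIndex()
    plain.add_doc({"id": "http://example.com/a", "body": "first version"})
    plain.add_doc({"id": "http://example.com/a", "body": "second version"})
    print(len(plain.live_docs()))   # number of copies without delete-first

    keyed = ToyIndex()
    upsert(keyed, {"id": "http://example.com/a", "body": "first version"})
    upsert(keyed, {"id": "http://example.com/a", "body": "second version"})
    print(keyed.live_docs())        # documents left after delete-then-add
```

The point of the sketch is only that no lookup is needed before indexing: deleting by the key is harmless when nothing matches, and it guarantees a single live copy when something does.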


One example from how Dezi solves this more generally:

https://metacpan.org/source/KARMAN/Dezi-App-0.014/lib/Dezi/Lucy/Indexer.pm#L451

Lucy isn't an RDBMS. It just tokenizes the fields you shove into it, and
retrieves them very quickly.



--
Peter Karman  .  http://peknet.com/  .  pe...@peknet.com