Thank you, Peter, for your quick response.

As I understand it, before adding new documents to the index, you delete
any existing copy by term (using your primary key). How is the performance
on your end, then? Since the delete has to search through all segments in
the index, I feel like the performance would be affected. Roughly, how many
documents do you have in your index, and what is the document size?

BTW, my documents are very small, and I think I will have around 40K of
them.
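
To check that I have the pattern right, here is a minimal sketch of
"delete by primary key, then add" in Python. It is a toy in-memory model,
not Lucy's API: ToyIndex, upsert, new_segment, and the "id" field are all
names I made up for illustration. Deletion here just marks matching
documents in every segment, which is the part whose cost I am asking about.

```python
# Toy in-memory model of "delete by primary key, then add".
# This is NOT Lucy's API; every name here is hypothetical.

class ToyIndex:
    def __init__(self):
        self.segments = [[]]   # each segment is a list of docs (dicts)
        self.deleted = set()   # (segment_no, position) pairs marked deleted

    def new_segment(self):
        # Start a fresh segment, as a later indexing session might.
        self.segments.append([])

    def delete_by_term(self, field, term):
        # Visit every segment and mark live docs whose field equals term.
        count = 0
        for seg_no, seg in enumerate(self.segments):
            for pos, doc in enumerate(seg):
                key = (seg_no, pos)
                if doc.get(field) == term and key not in self.deleted:
                    self.deleted.add(key)
                    count += 1
        return count

    def add_doc(self, doc):
        # New documents always go into the newest segment.
        self.segments[-1].append(doc)

    def upsert(self, doc, key_field="id"):
        # Emulate a primary key: remove any old copy, then add the new one.
        self.delete_by_term(key_field, doc[key_field])
        self.add_doc(doc)

    def live_docs(self):
        # All documents that have not been marked deleted.
        return [doc
                for seg_no, seg in enumerate(self.segments)
                for pos, doc in enumerate(seg)
                if (seg_no, pos) not in self.deleted]


index = ToyIndex()
index.upsert({"id": "doc-1", "body": "first version"})
index.new_segment()
index.upsert({"id": "doc-1", "body": "second version"})
index.upsert({"id": "doc-2", "body": "another document"})

for doc in index.live_docs():
    print(doc["id"], "->", doc["body"])
```

In this model, indexing "doc-1" twice leaves only the second version
visible, because the first copy is marked deleted before the new one is
added.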

Thanks,
Serkan

On Wed, Nov 16, 2016 at 11:25 AM, Peter Karman <[email protected]> wrote:

> Serkan Mulayim wrote on 11/16/16, 1:17 PM:
>
>> Hi guys,
>>
>> I think I need to simplify my question. After reading it one more time, I
>> realized I touched on many things, and it seems confusing.
>>
>> It seems like if we index the same document twice, a new document is
>> created. And as per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html,
>> "If you truly need a primary key field, you must define it and populate
>> it yourself". How can we do this? Are there any examples of this? Should
>> I search for the document by its primary key before indexing and, if it
>> exists, skip indexing it?
>>
>
> What I do in all my apps is use delete_by_term:
> https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Index/Indexer.pod#delete_by_term
>
> I have my own primary key system that varies based on the application.
> Sometimes it is a URI, sometimes a db PK. I maintain the document integrity
> myself.
>
> One example from how Dezi solves this more generally:
>
> https://metacpan.org/source/KARMAN/Dezi-App-0.014/lib/Dezi/Lucy/Indexer.pm#L451
>
> Lucy isn't an RDBMS. It just tokenizes the fields you shove into it, and
> retrieves them very quickly.
>
>
> --
> Peter Karman  .  http://peknet.com/  .  [email protected]
>
