Re: [lucy-user] Re: Regarding document Ids
Thank you Peter for your comments.

Regards...

On Wed, Nov 16, 2016 at 3:05 PM, Peter Karman wrote:

> Serkan Mulayim wrote on 11/16/16, 2:21 PM:
>
>> Thank you Peter for your quick response.
>>
>> As I understand it, before adding new documents to the index, you delete by
>> query (using your primary key). How is the performance on your end, then?
>> Since delete by query will search through all segments in the index for the
>> deletion, I feel like the performance would be affected. Roughly, how many
>> documents do you have in your index, and what is the document size?
>>
>> BTW, my document sizes are very small, and I think I will have around 40K
>> documents.
>
> Performance is fast enough for me. I have 1MM+ docs but not much churn
> (not updating docs constantly). IME the bottleneck is not the search. It's
> a search engine; it's pretty fast. The bottleneck is updating the index.
> That's true whether you delete first or not.
>
> --
> Peter Karman . http://peknet.com/ . pe...@peknet.com
Re: [lucy-user] Re: Regarding document Ids
Serkan Mulayim wrote on 11/16/16, 2:21 PM:

> Thank you Peter for your quick response.
>
> As I understand it, before adding new documents to the index, you delete by
> query (using your primary key). How is the performance on your end, then?
> Since delete by query will search through all segments in the index for the
> deletion, I feel like the performance would be affected. Roughly, how many
> documents do you have in your index, and what is the document size?
>
> BTW, my document sizes are very small, and I think I will have around 40K
> documents.

Performance is fast enough for me. I have 1MM+ docs but not much churn (not updating docs constantly). IME the bottleneck is not the search. It's a search engine; it's pretty fast. The bottleneck is updating the index. That's true whether you delete first or not.

--
Peter Karman . http://peknet.com/ . pe...@peknet.com
Re: [lucy-user] Re: Regarding document Ids
Thank you Peter for your quick response.

As I understand it, before adding new documents to the index, you delete by query (using your primary key). How is the performance on your end, then? Since delete by query will search through all segments in the index for the deletion, I feel like the performance would be affected. Roughly, how many documents do you have in your index, and what is the document size?

BTW, my document sizes are very small, and I think I will have around 40K documents.

Thanks,
Serkan

On Wed, Nov 16, 2016 at 11:25 AM, Peter Karman wrote:

> Serkan Mulayim wrote on 11/16/16, 1:17 PM:
>
>> Hi guys,
>>
>> I think I need to simplify my question. After reading it one more time, I
>> realized I touched on many things, and it seems confusing.
>>
>> It seems that if we index the same document twice, a new document is
>> created. And as per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html,
>> "If you truly need a primary key field, you must define it and populate it
>> yourself". How can we do this? Are there any examples of this? Should I
>> search for the document with the primary key before indexing and, if it
>> exists, not index it?
>
> What I do in all my apps is use delete_by_term
> https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Index/Indexer.pod#delete_by_term
>
> I have my own primary key system that varies based on the application.
> Sometimes it is a URI, sometimes a db PK. I maintain the document integrity
> myself.
>
> One example from how Dezi solves this more generally:
>
> https://metacpan.org/source/KARMAN/Dezi-App-0.014/lib/Dezi/Lucy/Indexer.pm#L451
>
> Lucy isn't an RDBMS. It just tokenizes the fields you shove into it, and
> retrieves very quickly.
>
> --
> Peter Karman . http://peknet.com/ . pe...@peknet.com
Re: [lucy-user] Re: Regarding document Ids
Serkan Mulayim wrote on 11/16/16, 1:17 PM:

> Hi guys,
>
> I think I need to simplify my question. After reading it one more time, I
> realized I touched on many things, and it seems confusing.
>
> It seems that if we index the same document twice, a new document is
> created. And as per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html,
> "If you truly need a primary key field, you must define it and populate it
> yourself". How can we do this? Are there any examples of this? Should I
> search for the document with the primary key before indexing and, if it
> exists, not index it?

What I do in all my apps is use delete_by_term
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Index/Indexer.pod#delete_by_term

I have my own primary key system that varies based on the application. Sometimes it is a URI, sometimes a db PK. I maintain the document integrity myself.

One example from how Dezi solves this more generally:

https://metacpan.org/source/KARMAN/Dezi-App-0.014/lib/Dezi/Lucy/Indexer.pm#L451

Lucy isn't an RDBMS. It just tokenizes the fields you shove into it, and retrieves very quickly.

--
Peter Karman . http://peknet.com/ . pe...@peknet.com
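The delete-then-add pattern described in this message can be sketched roughly as follows. This is only an illustration, not Lucy's API or Dezi's code: `ToyIndex` is a hypothetical in-memory stand-in whose method names merely echo `delete_by_term`, `add_doc`, and `commit`, and the `uri` key field and the `upsert` helper are assumptions made for the example. The toy applies pending deletes only to already committed documents, which is a simplification of this sketch and not a statement about how Lucy behaves.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for a search index (not Lucy's API)."""

    def __init__(self):
        self._docs = []             # committed documents
        self._pending_adds = []     # documents waiting for commit()
        self._pending_deletes = []  # (field, term) pairs waiting for commit()

    def delete_by_term(self, field, term):
        # Record the deletion; nothing changes for searchers until commit().
        self._pending_deletes.append((field, term))

    def add_doc(self, doc):
        # Copy the document so later changes by the caller do not leak in.
        self._pending_adds.append(dict(doc))

    def commit(self):
        # Drop committed documents matching a pending delete, then append adds.
        for field, term in self._pending_deletes:
            self._docs = [d for d in self._docs if d.get(field) != term]
        self._docs.extend(self._pending_adds)
        self._pending_adds = []
        self._pending_deletes = []

    def search(self, field, term):
        # Exact-match lookup over committed documents only.
        return [d for d in self._docs if d.get(field) == term]


def upsert(index, doc, key_field="uri"):
    """Emulate a primary key: delete any old copy by key, then add the new one."""
    index.delete_by_term(key_field, doc[key_field])
    index.add_doc(doc)


index = ToyIndex()
upsert(index, {"uri": "/page/1", "title": "first version"})
index.commit()
upsert(index, {"uri": "/page/1", "title": "second version"})
index.commit()
hits = index.search("uri", "/page/1")
```

After the second commit, `hits` holds a single document, the newer one. The application owns the key and its uniqueness; the index only stores whatever it is given.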
[lucy-user] Re: Regarding document Ids
Hi guys,

I think I need to simplify my question. After reading it one more time, I realized I touched on many things, and it seems confusing.

It seems that if we index the same document twice, a new document is created. And as per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html, "If you truly need a primary key field, you must define it and populate it yourself". How can we do this? Are there any examples of this? Should I search for the document with the primary key before indexing and, if it exists, not index it?

Thanks,
Serkan

On Tue, Nov 15, 2016 at 2:22 PM, Serkan Mulayim wrote:

> Hi,
>
> As far as I can see, if we add the same document twice, it creates a new
> document. As per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html, "If
> you truly need a primary key field, you must define it and populate it
> yourself". Can you please elaborate on this? Does it mean choosing a field
> to be the primary key, deleting the document with that primary key, and
> re-adding it? If so, the document might not have been created until we
> commit, so deletion would not be possible, right? Performance would be
> another issue.
>
> Another solution might be hashing the "primary key" and using it as the
> documentId (but the referenced page also says that docIds are ephemeral).
> If the ephemeralness of the docId is not a problem, my concern is
> collisions, considering that I might need to have many documents in the
> same index. This boils down to the birthday problem, and we might not be
> safe in the range of an integer.
>
> Do you have any suggestions about this?
>
> Thanks,
> Serkan
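To put a rough number on the birthday-problem concern raised in the quoted message, here is a small sketch using the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n keys hashed uniformly into N possible values. The figures are assumptions chosen for illustration: n = 40,000 is the index size mentioned elsewhere in this thread, and N = 2**31 assumes the hash must fit in the non-negative range of a signed 32-bit integer.

```python
import math


def collision_probability(num_docs, id_space):
    """Approximate chance that at least two of num_docs uniformly hashed
    keys land on the same value out of id_space possible values."""
    pairs = num_docs * (num_docs - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny.
    return -math.expm1(-pairs / id_space)


# Assumed figures: 40,000 documents, 2**31 possible hashed IDs.
p_32bit = collision_probability(40_000, 2**31)
# For comparison, the same documents spread over a 64-bit space.
p_64bit = collision_probability(40_000, 2**64)
```

Under these assumptions `p_32bit` comes out at roughly 0.31, so a collision somewhere in the index is quite plausible, while `p_64bit` is negligible. That supports keeping the primary key as an ordinary field rather than squeezing a hash of it into an integer ID.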