RE: [lucy-user] Doc id from hits and remove redundant documents

Gupta, Rajiv Wed, 23 Nov 2016 08:21:40 -0800

Since I have the line number and seek position, what I'm doing now is moving 
forward line by line from the last record that I got. I'm also adding an 
end_point marker, which my search uses to decide whether to move forward.
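
A minimal sketch of that resume-from-last-record idea, assuming the stored position is a byte offset as returned by tell and linenum is the 1-based number of the line at that offset. The sub name read_from, the sample text, and the END_POINT marker string are hypothetical and only for illustration:

```perl
use strict;
use warnings;

# Read lines starting at byte offset $pos (whose line number is $linenum)
# and stop at the first line containing $end_marker or at end of file.
# Returns a list of [linenum, offset, text] records.
# All names here are hypothetical.
sub read_from {
    my ($fh, $pos, $linenum, $end_marker) = @_;
    seek($fh, $pos, 0) or die "seek failed: $!";
    my @records;
    while (1) {
        my $offset = tell($fh);      # offset of the line about to be read
        my $line   = <$fh>;
        last unless defined $line;   # end of file
        chomp $line;
        last if index($line, $end_marker) >= 0;
        push @records, [$linenum, $offset, $line];
        $linenum++;
    }
    return @records;
}

# Demo on an in-memory file handle; offset 6 is the start of line 2.
my $text = "first\nsecond\nthird\nEND_POINT\nfourth\n";
open(my $fh, '<', \$text) or die "open failed: $!";
my @records = read_from($fh, 6, 2, 'END_POINT');
print join(',', @$_), "\n" for @records;
```

Each record carries the offset and line number that would go into the position and linenum fields of the next add_doc call.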


Thanks,
Rajiv Gupta

-----Original Message-----
From: Nick Wellnhofer [mailto:[email protected]] 
Sent: Wednesday, November 23, 2016 9:30 PM
To: [email protected]
Subject: Re: [lucy-user] Doc id from hits and remove redundant documents

On 23/11/2016 16:31, Gupta, Rajiv wrote:
> Thanks for your reply Nick.
>
> I wanted to delete the old documents, which is why I was trying to get the 
> doc_id and use that to delete them. However, that does not help: it deleted 
> other documents and kept changing the document. I wanted to use delete by 
> term, but my docs don't have any primary key.
>
> I add document like this:
>
> $indexer->add_doc({
>                 title    => $mytitle,
>                 content  => substr($mybodytext,0,1024),
>                 url      => $onlyfilename,
>                 urlpath  => $filpath,
>                 position => $fileseektostart,
>                 linenum  => $filelinenumtostart,
>                 jobtype  => $self->{_logfile_hash}{$filetoindex}[5] ,
>             });

You can use any field as primary key if the field's value is guaranteed to be 
unique for all your documents. But it seems that you index the contents of 
files line by line, so "urlpath" isn't unique. Your primary key is probably the 
tuple (urlpath, linenum).

If you update all the lines of a file at once, this isn't a problem. You can 
simply delete all documents relating to the file with

     $indexer->delete_by_term(
         field => 'urlpath',
         term  => $filepath,
     );

If you only want to update certain lines, you'll have to construct an ANDQuery 
for each line and use delete_by_query. For example:

     $indexer->delete_by_query(Lucy::Search::ANDQuery->new(
         children => [
             Lucy::Search::TermQuery->new(
                 field => 'urlpath',
                 term  => $filepath,
             ),
             Lucy::Search::TermQuery->new(
                 field => 'linenum',
                 term  => $linenum,
             ),
         ],
     ));
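
Another option, not part of the suggestion above, would be to store the (urlpath, linenum) tuple as one extra field so that delete_by_term can target a single line. A minimal sketch, assuming a hypothetical field named doc_key that is declared in the schema as an indexed, non-analyzed string field; the helper make_doc_key and the sample path are illustrative only:

```perl
use strict;
use warnings;

# Build one unique key from the (urlpath, linenum) tuple.
# The line number is purely numeric and comes last, so everything after
# the final colon is the line number even if the path contains colons.
sub make_doc_key {
    my ($urlpath, $linenum) = @_;
    return "$urlpath:$linenum";
}

my $key = make_doc_key('/var/log/app.log', 42);   # hypothetical path
print "$key\n";

# With Lucy, the key would be stored when adding the document and reused
# to delete exactly that document later. Shown as comments because
# $indexer and %fields are not defined in this sketch:
#
#   $indexer->add_doc({ %fields, doc_key => $key });
#   $indexer->delete_by_term( field => 'doc_key', term => $key );
```

The cost is one more indexed field per document; documents already in the index would not have it until they are re-added.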

Or maybe use a RangeQuery to delete a contiguous range of lines.
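
A rough sketch of the RangeQuery variant. It rests on two assumptions that should be checked against the schema: that linenum is a sortable string field, and that its values were zero-padded at add_doc time so that string order matches numeric order (unpadded, "9" sorts after "120"). The helper names, the padding width, and the sample path are hypothetical:

```perl
use strict;
use warnings;

# Zero-pad a line number so that string comparison matches numeric order.
# Assumption: the indexed linenum values were padded the same way.
sub pad_linenum {
    my ($n) = @_;
    return sprintf('%010d', $n);
}

# Constructor arguments for a RangeQuery covering lines $first..$last inclusive.
sub linenum_range_args {
    my ($first, $last) = @_;
    return (
        field         => 'linenum',
        lower_term    => pad_linenum($first),
        upper_term    => pad_linenum($last),
        include_lower => 1,
        include_upper => 1,
    );
}

my %args = linenum_range_args(9, 120);

# Build the query only if Lucy is installed, so the sketch also runs without it.
if (eval { require Lucy; 1 }) {
    my $filepath = '/var/log/app.log';   # hypothetical path
    my $query = Lucy::Search::ANDQuery->new(
        children => [
            Lucy::Search::TermQuery->new(field => 'urlpath', term => $filepath),
            Lucy::Search::RangeQuery->new(%args),
        ],
    );
    # $indexer->delete_by_query($query);   # $indexer as in the examples above
}
```

The TermQuery restricts the deletion to one file; without it the range would match those line numbers in every indexed file.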

Nick
