Hi Patrick,

Yes, I did want to mention that it will not affect previous fetch lists. Sorry for the confusion.
Thanks,
Shashanka Balakuntala

On Thu, 23 Jul 2020, 22:40 Patrick Mézard, <[email protected]> wrote:

> Hello,
>
> On 23/07/2020 14:37, Shashanka Balakuntala wrote:
> > Hi Patrick,
> >
> > Yes, the idea that you have suggested would work, but I do have to mention
> > that it might just affect the next iteration. So you can just clean the
> > last parse segment, parse again, and run updatedb with the plugins
> > activated, and that would do.
>
> I do not follow you. How could the similarity scores of all documents be
> collected and used by updatedb without reparsing all content? From what I
> see, the similarity scorer operates during the parse phase and the score
> should be recorded in crawl_parse.
>
> > Deleting all the parsed segments might not work because a url with a
> > score below the threshold will not be generated or fetched, so none of
> > its outlinks will be fetched either. So if you just delete the parse
> > segment and repeat the process, none of the already fetched segments
> > will be impacted. So it will update the scoring; if you just need the
> > score for something else, please do go ahead with this.
>
> Again, I am confused. My mental model is:
>
> - Delete and reparse everything. This means similarity scores are taken
> into account and included in every segment's crawl_parse.
> - Run updatedb on all segments. CrawlDatum entries will be gathered by
> "url" and some final score will be generated in the reduce phase, probably
> favoring the more recent score.
>
> Now, maybe the existing crawldb might interfere during the final merge and
> I should clear it somehow, but otherwise, once the similarity scores are
> reflected in the updated crawldb, the next generate phase will take them
> into account.
>
> Obviously, they will not retroactively affect the previous fetch lists. Is
> that what you tried to tell me?
>
> Thanks for your comments,
> --
> Patrick Mézard
>
> > Let's see if anyone has any other items to add or clear here.
> > *Regards*
> > Shashanka Balakuntala Srinivasa
> >
> >
> > On Thu, Jul 23, 2020 at 2:40 PM Patrick Mézard <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> I have crawled a first document set using a combination of depth and opic
> >> scoring plugins. I would like to add the similarity scoring plugin, but
> >> obviously the crawldb scores should be updated for it and the following
> >> "generate" phases to be effective. Is there a recommended approach to
> >> achieve this?
> >>
> >> My current understanding is that since the similarity plugin operates in
> >> the parse phase, I would have to remove all parsed data from segments,
> >> re-parse them and run updatedb? Would that work? Is there anything smarter?
> >>
> >> Thanks,
> >> --
> >> Patrick Mézard
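For reference, the delete-reparse-updatedb sequence discussed above can be sketched roughly as follows. This is a hedged sketch, not a tested recipe: the `crawl/` layout and the `bin/nutch` path are assumptions to adapt to your installation, and it assumes the similarity scoring plugin has already been enabled via `plugin.includes` in your Nutch configuration.

```shell
#!/bin/sh
# Sketch only: adjust CRAWL and NUTCH to your actual layout.
CRAWL=crawl        # hypothetical crawl directory
NUTCH=bin/nutch    # path to the Nutch launcher script

for seg in "$CRAWL"/segments/*/; do
  # Drop the old parse output so the segment can be re-parsed with the
  # similarity scoring plugin active; fetch data is left untouched.
  rm -rf "${seg}crawl_parse" "${seg}parse_data" "${seg}parse_text"

  # Re-parse: the new similarity scores end up in crawl_parse.
  "$NUTCH" parse "$seg"
done

# Fold the regenerated scores from all segments back into the crawldb;
# the next generate phase will then see the similarity scores.
"$NUTCH" updatedb "$CRAWL/crawldb" -dir "$CRAWL/segments"
```

As Patrick notes, this only affects future generate phases; segments fetched before the update keep the fetch lists they were generated from.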

