Luigi, is there a reason for not indexing all of your on-disk pages? That seems to be the first step, but I do not understand what your goal is. Cheers -- Rick
On January 30, 2018 1:33:27 PM EST, Luigi Caiazza <lcaiazz...@gmail.com> wrote:

> Hello,
>
> I am working on a project that simulates selective, large-scale crawling. The system adapts its behaviour according to external user queries received at crawling time. Briefly, it analyzes the already-crawled pages in the top-k results for each query, and prioritizes the visit of the discovered links accordingly. In a generic experiment, I measure time units as the number of crawling cycles completed so far, i.e., as an integer value. Finally, I evaluate the experiment by analyzing the documents fetched over the crawling cycles. In this work I am using Lucene 7.2.1, but this should not be an issue, since I just need some conceptual help.
>
> In my current implementation, an experiment starts with an empty index. When a Web page is fetched during crawling cycle *x*, the system builds a document with the URL as a StringField, the title and the body as TextFields, and *x* as an IntPoint. When I get an external user query, I submit it to get the top-k relevant documents crawled so far. When I need to retrieve the documents indexed from cycle *i* to cycle *j*, I execute a range query over this last IntPoint field. This strategy does the job, but the write operations take some hours overall for a single experiment, even if I crawl just half a million Web pages.
>
> Since I am not crawling real-time data, but working over a static set of many billions of Web pages (whose contents are already stored on disk), I am investigating some opportunities to reduce the number of writes during an experiment. For instance, I could avoid indexing everything from scratch for each run. I would be happy to index all the static contents of my dataset (i.e., the URL, title and body of each Web page) once and for all.
> Then, for a single experiment, I would mark a document as crawled at cycle *x* without storing this information permanently, both to filter out the documents that have not been crawled in the current simulation when processing the external queries, and to still perform the range queries at evaluation time. Do you have any idea how to do that?
>
> Thank you in advance for your support.

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com
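For concreteness, the per-cycle schema and the cycle range query described in the quoted message could be sketched against the Lucene 7.2.1 API roughly as follows. This is an illustrative sketch, not Luigi's actual code: the field names ("url", "title", "body", "cycle") are taken from his description, while the in-memory `RAMDirectory` and the sample pages are assumptions for the sake of a self-contained example.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class CrawlIndexSketch {

    // One Lucene document per fetched page; cycle is the crawling cycle x
    // at which the page was fetched, indexed as an IntPoint for range queries.
    static Document pageDoc(String url, String title, String body, int cycle) {
        Document doc = new Document();
        doc.add(new StringField("url", url, Field.Store.YES));   // exact, not tokenized
        doc.add(new TextField("title", title, Field.Store.YES)); // analyzed full text
        doc.add(new TextField("body", body, Field.Store.YES));
        doc.add(new IntPoint("cycle", cycle));                   // indexed, not stored
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // on disk: FSDirectory.open(path)
        try (IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addDocument(pageDoc("http://example.com/a", "Title A", "body a", 3));
            writer.addDocument(pageDoc("http://example.com/b", "Title B", "body b", 7));
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // "Documents indexed from cycle i to cycle j": an inclusive
            // range query over the IntPoint field, here i = 1, j = 5.
            Query byCycle = IntPoint.newRangeQuery("cycle", 1, 5);
            TopDocs hits = searcher.search(byCycle, 10);
            System.out.println(hits.totalHits); // only the cycle-3 page matches
        }
    }
}
```

Note that in this layout the cycle number is baked into the index at write time, which is exactly the cost Luigi is trying to avoid: re-running an experiment means rewriting every document's IntPoint value.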