Hello, I am working on a project that simulates selective, large-scale crawling. The system adapts its behaviour according to external user queries received at crawling time. Briefly, for each query it analyzes the already crawled pages in the top-k results and prioritizes visits to the discovered links accordingly. In an experiment, I measure time as the number of crawling cycles completed so far, i.e., as an integer. Finally, I evaluate the experiment by analyzing the documents fetched over the crawling cycles. I am using Lucene 7.2.1, but this should not be an issue since I only need conceptual help.
In my current implementation, an experiment starts with an empty index. When a Web page is fetched during crawling cycle *x*, the system builds a document with the URL as a StringField, the title and the body as TextFields, and *x* as an IntPoint. When I receive an external user query, I run it against the index to get the top-k relevant documents crawled so far. When I need to retrieve the documents indexed from cycle *i* to cycle *j*, I execute a range query over the IntPoint field.
This strategy does the job, but the write operations take some hours overall for a single experiment, even if I crawl just half a million Web pages. Since I am not crawling real-time data, but working over a static set of many billions of Web pages (whose contents are already stored on disk), I am looking for ways to reduce the number of writes during an experiment. For instance, I could avoid indexing everything from scratch for each run.
I would be happy to index all the static contents of my dataset (i.e., URL, title and body of each Web page) once and for all. Then, for a single experiment, I would mark a document as crawled at cycle *x* without storing this information permanently. The mark would serve two purposes: filtering out the documents that have not yet been crawled in the current simulation when processing the external queries, and still supporting the range queries at evaluation time.
Do you have any idea how to do that? Thank you in advance for your support.
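P.S. To make the requirement concrete, here is a minimal plain-Java sketch of the behaviour I am after. It does not use Lucene at all, and everything in it is hypothetical: the class name `CrawlOverlay`, its methods, and the assumption that every page has a stable integer ordinal assigned when the static index is built (Lucene's internal doc IDs can change when segments are merged, so they would not serve as that ordinal directly).

```java
import java.util.Arrays;
import java.util.BitSet;

/**
 * Hypothetical per-experiment overlay (not a Lucene class).
 * It maps a stable document ordinal to the crawling cycle in which the
 * document was fetched during the current simulation. Nothing is written
 * to the index; the data lives in memory only.
 */
final class CrawlOverlay {
    private static final int NOT_CRAWLED = -1;

    // cycleOf[ordinal] == NOT_CRAWLED means "not fetched in this experiment".
    private final int[] cycleOf;

    CrawlOverlay(int numDocs) {
        cycleOf = new int[numDocs];
        Arrays.fill(cycleOf, NOT_CRAWLED);
    }

    /** Marks a document as fetched during the given crawling cycle. */
    void markCrawled(int ordinal, int cycle) {
        if (cycle < 0) {
            throw new IllegalArgumentException("cycle must be >= 0");
        }
        cycleOf[ordinal] = cycle;
    }

    /** Filter used when answering external queries: only crawled documents count. */
    boolean isCrawled(int ordinal) {
        return cycleOf[ordinal] != NOT_CRAWLED;
    }

    /** Evaluation-time range query: documents crawled in cycles from..to, inclusive. */
    BitSet crawledBetween(int from, int to) {
        BitSet hits = new BitSet(cycleOf.length);
        for (int i = 0; i < cycleOf.length; i++) {
            int c = cycleOf[i];
            if (c != NOT_CRAWLED && c >= from && c <= to) {
                hits.set(i);
            }
        }
        return hits;
    }

    /** Clears all marks before the next experiment; the static index is untouched. */
    void reset() {
        Arrays.fill(cycleOf, NOT_CRAWLED);
    }
}
```

In an experiment I would call `markCrawled(ordinal, x)` instead of adding a document to the index, use `isCrawled` to drop uncrawled hits from the top-k results, and call `reset()` between runs. What I do not know is how to get the equivalent of this inside Lucene, so that the filter and the range query are applied during search rather than afterwards.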