Hello,

I am working on a project that simulates selective, large-scale crawling.
The system adapts its behaviour according to external user queries
received at crawling time. Briefly, it analyzes the already crawled pages
in the top-k results for each query, and prioritizes the discovered links
accordingly. In a generic experiment, I measure time as the number of
crawling cycles completed so far, i.e., as an integer value. Finally, I
evaluate the experiment by analyzing the documents fetched over the
crawling cycles. I am using Lucene 7.2.1 for this work, but the version
should not matter since I only need some conceptual help.

In my current implementation, an experiment starts with an empty index.
When a Web page is fetched during crawling cycle *x*, the system builds
a document with the URL as a StringField, the title and the body as
TextFields, and *x* as an IntPoint. When I get an external user query, I
submit it to the index to get the top-k relevant documents crawled so far.
When I need to retrieve the documents indexed from cycle *i* to cycle *j*,
I execute a range query over the IntPoint field. This strategy does the
job, but the write operations take some hours overall for a single
experiment, even if I crawl just half a million Web pages.

I am not crawling real-time data: I am working over a static set of many
billions of Web pages, whose contents are already stored on disk. I am
therefore looking for ways to reduce the number of writes during an
experiment. For instance, I could avoid indexing everything from scratch
for each run. I would be happy to index all the static contents of my
dataset (i.e., URL, title and body of each Web page) once and for all.
Then, for a single experiment, I would mark a document as crawled at cycle
*x* without storing this information permanently. This mark would serve
two purposes: to filter out the documents that have not yet been crawled
in the current simulation when processing the external queries, and to
still support the range queries at evaluation time. Do you have any idea
how to do that?
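
To make the requirement concrete, here is a minimal, Lucene-independent
sketch of the behaviour I am after. It is only an illustration, not
something I have integrated with the index: the class CrawlOverlay and all
its methods are hypothetical names, and it assumes that every page in the
static index has a stable integer id in the range 0..numDocs-1 (for a very
large collection, a sparse map would probably be needed instead of an
array).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical per-experiment overlay kept in memory only.
// Nothing in this class is a Lucene API.
class CrawlOverlay {
    static final int NOT_CRAWLED = -1;

    // crawledAt[docId] = cycle at which the page was fetched, or NOT_CRAWLED.
    private final int[] crawledAt;

    CrawlOverlay(int numDocs) {
        crawledAt = new int[numDocs];
        Arrays.fill(crawledAt, NOT_CRAWLED);
    }

    // Record that docId was fetched during the given cycle (first fetch wins).
    void markCrawled(int docId, int cycle) {
        if (cycle < 0) {
            throw new IllegalArgumentException("cycle must be >= 0");
        }
        if (crawledAt[docId] == NOT_CRAWLED) {
            crawledAt[docId] = cycle;
        }
    }

    boolean isCrawled(int docId) {
        return crawledAt[docId] != NOT_CRAWLED;
    }

    // Keep the first k ids of a ranked result list that are already crawled.
    List<Integer> topKCrawled(List<Integer> rankedDocIds, int k) {
        List<Integer> out = new ArrayList<>();
        for (int docId : rankedDocIds) {
            if (out.size() == k) {
                break;
            }
            if (isCrawled(docId)) {
                out.add(docId);
            }
        }
        return out;
    }

    // Ids of all pages crawled in cycles i..j (inclusive), in id order.
    List<Integer> crawledBetween(int i, int j) {
        List<Integer> out = new ArrayList<>();
        for (int docId = 0; docId < crawledAt.length; docId++) {
            int cycle = crawledAt[docId];
            if (cycle != NOT_CRAWLED && cycle >= i && cycle <= j) {
                out.add(docId);
            }
        }
        return out;
    }

    // Forget everything before starting the next experiment.
    void reset() {
        Arrays.fill(crawledAt, NOT_CRAWLED);
    }
}
```

In this sketch, topKCrawled plays the role of the filter applied to the
results of an external query, crawledBetween plays the role of the range
query at evaluation time, and reset is all that would be needed between
two runs. What I do not know is how best to obtain the same effect inside
Lucene itself.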

Thank you in advance for your support.
