Luigi
Is there a reason for not indexing all of your on-disk pages? That seems 
like the natural first step. But I do not understand what your goal is.
Cheers -- Rick

On January 30, 2018 1:33:27 PM EST, Luigi Caiazza <lcaiazz...@gmail.com> wrote:
>Hello,
>
>I am working on a project that simulates selective, large-scale
>crawling. The system adapts its behaviour according to external user
>queries received at crawl time. Briefly, it analyzes the
>already-crawled pages in the top-k results for each query and
>prioritizes visits to the discovered links accordingly. In a generic
>experiment, I measure time as the number of crawling cycles completed
>so far, i.e., as an integer. Finally, I evaluate the experiment by
>analyzing the documents fetched over the crawling cycles. In this
>work I am using Lucene 7.2.1, but the version should not matter since
>I just need some conceptual help.
>
>In my current implementation, an experiment starts with an empty
>index. When a Web page is fetched during crawling cycle *x*, the
>system builds a document with the URL as a StringField, the title and
>the body as TextFields, and *x* as an IntPoint. When I get an
>external user query, I submit it to get the top-k most relevant
>documents crawled so far. When I need to retrieve the documents
>indexed from cycle *i* to cycle *j*, I execute a range query over the
>IntPoint field. This strategy does the job, but the write operations
>take some hours overall for a single experiment, even when I crawl
>just half a million Web pages.
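>
>For reference, the write path and the range query look roughly like
>this (a minimal sketch; the field names and the surrounding
>IndexWriter setup are illustrative, not my actual code):
>
>import java.io.IOException;
>import org.apache.lucene.document.Document;
>import org.apache.lucene.document.Field;
>import org.apache.lucene.document.IntPoint;
>import org.apache.lucene.document.StringField;
>import org.apache.lucene.document.TextField;
>import org.apache.lucene.index.IndexWriter;
>import org.apache.lucene.search.Query;
>
>class CrawlCycleIndexing {
>    // Index one page fetched during crawling cycle x.
>    static void indexPage(IndexWriter writer, String url, String title,
>                          String body, int x) throws IOException {
>        Document doc = new Document();
>        doc.add(new StringField("url", url, Field.Store.YES));
>        doc.add(new TextField("title", title, Field.Store.NO));
>        doc.add(new TextField("body", body, Field.Store.NO));
>        // IntPoint is index-only; it supports the range query below,
>        // but needs a parallel StoredField to read the value back.
>        doc.add(new IntPoint("cycle", x));
>        writer.addDocument(doc);
>    }
>
>    // Documents indexed from cycle i to cycle j, bounds inclusive.
>    static Query cycleRange(int i, int j) {
>        return IntPoint.newRangeQuery("cycle", i, j);
>    }
>}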
>
>Since I am not crawling real-time data but working over a static set
>of many billions of Web pages (whose contents are already stored on
>disk), I am investigating ways to reduce the number of writes during
>an experiment. For instance, I could avoid indexing everything from
>scratch for each run. I would be happy to index all the static
>contents of my dataset (i.e., URL, title and body of a Web page) once
>and for all. Then, for a single experiment, I would mark a document
>as crawled at cycle *x* without storing this information permanently.
>The mark should both filter out the documents that have not yet been
>crawled in the current simulation when processing the external
>queries, and still support the range queries at evaluation time. Do
>you have any idea on how to do that?
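>
>To make the intended semantics concrete, here is a rough sketch of
>what I mean, with the per-experiment marks kept in memory outside the
>index and enforced by over-fetching and post-filtering (class, field
>and variable names are hypothetical):
>
>import java.io.IOException;
>import java.util.ArrayList;
>import java.util.HashMap;
>import java.util.List;
>import java.util.Map;
>import org.apache.lucene.document.Document;
>import org.apache.lucene.search.IndexSearcher;
>import org.apache.lucene.search.Query;
>import org.apache.lucene.search.ScoreDoc;
>
>class ExperimentState {
>    // Per-experiment marks: URL -> cycle at which the page was
>    // "crawled" in this simulation. Discarded when the run ends.
>    final Map<String, Integer> crawledAtCycle = new HashMap<>();
>
>    // Top-k over the static index, keeping only documents marked as
>    // crawled within cycles [i, j]. Over-fetches, then post-filters.
>    List<Document> topKCrawled(IndexSearcher searcher, Query query,
>                               int k, int i, int j) throws IOException {
>        List<Document> hits = new ArrayList<>();
>        int fetch = Math.max(100, 10 * k); // crude over-fetch factor
>        for (ScoreDoc sd : searcher.search(query, fetch).scoreDocs) {
>            Document doc = searcher.doc(sd.doc);
>            Integer cycle = crawledAtCycle.get(doc.get("url"));
>            if (cycle != null && cycle >= i && cycle <= j) {
>                hits.add(doc);
>                if (hits.size() == k) {
>                    break;
>                }
>            }
>        }
>        return hits;
>    }
>}
>
>Of course the fixed over-fetch factor is fragile, which is why I
>would prefer the filtering to happen inside Lucene itself.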
>
>Thank you in advance for your support.

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 
