Hi Sebastian!

Glad to hear that you have read my work.

*Generating URL filters.*
This part of my work was not clearly described in my thesis, and it is now out of date.

The main idea is to reduce the crawldb size (reducing junk in the fetch list is a side effect).
Suppose we have some method to detect whether a downloaded document is "useful".
Then we consider pages with enough outlinks (e.g. 10) to useful documents to be "semi-useful". After that we can count the number of "useful" and "semi-useful" pages for each URL prefix (each URL is split into the set of its prefixes) and add prefixes to the filter using some heuristic (I blocked prefixes without any useful or semi-useful URLs).
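
Roughly, the counting looks like the sketch below (standalone Java for illustration only, not the actual code from my crawler; the class name, the threshold of 10 outlinks and the prefix splitting are simplified assumptions):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the prefix-counting heuristic: count "useful" and "semi-useful"
// pages per URL prefix and block prefixes that have none of either.
public class PrefixFilterBuilder {

    private final Map<String, Integer> usefulPerPrefix = new HashMap<>();
    private final Map<String, Integer> totalPerPrefix = new HashMap<>();

    // Each URL is split into the set of its directory prefixes, e.g.
    // http://host/a/b -> http://host/, http://host/a/
    private static List<String> prefixes(String url) {
        List<String> result = new ArrayList<>();
        int start = url.indexOf("://") + 3;
        int pos = url.indexOf('/', start);
        while (pos != -1) {
            result.add(url.substring(0, pos + 1));
            pos = url.indexOf('/', pos + 1);
        }
        return result;
    }

    // A page counts if it is "useful" itself, or "semi-useful",
    // i.e. it has enough (e.g. 10) outlinks to useful documents.
    public void addPage(String url, boolean useful, int outlinksToUseful) {
        boolean countsAsUseful = useful || outlinksToUseful >= 10;
        for (String p : prefixes(url)) {
            totalPerPrefix.merge(p, 1, Integer::sum);
            if (countsAsUseful) {
                usefulPerPrefix.merge(p, 1, Integer::sum);
            }
        }
    }

    // The heuristic I used: block prefixes without any useful or semi-useful URLs.
    public List<String> blockedPrefixes() {
        List<String> blocked = new ArrayList<>();
        for (String p : totalPerPrefix.keySet()) {
            if (usefulPerPrefix.getOrDefault(p, 0) == 0) {
                blocked.add(p);
            }
        }
        return blocked;
    }
}
```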

Actually it did not work as well in production as I had expected =( Maybe using a discrete value for "usefulness" instead of a float is not a good idea.

Much more effective in my case was adding robots.txt rules as a url-filter.
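
For illustration, here is roughly what such a filter does (a plain-Java sketch, not a real Nutch plugin; the rules file format is an assumption of mine):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Rough sketch: reject URLs matching Disallow prefixes collected from the
// robots.txt files of the crawled hosts. The prefixes are assumed to be
// pre-collected into a local file, one "host-prefix disallow-path" pair per line.
public class RobotsTxtPrefixFilter {

    private final List<String> blockedPrefixes = new ArrayList<>();

    public RobotsTxtPrefixFilter(String rulesFile) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(rulesFile))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length == 2) {
                    // e.g. "http://example.com" + "/search" -> "http://example.com/search"
                    blockedPrefixes.add(parts[0] + parts[1]);
                }
            }
        }
    }

    // Same contract as a Nutch url-filter: return the URL to keep it, null to drop it.
    public String filter(String url) {
        for (String prefix : blockedPrefixes) {
            if (url.startsWith(prefix)) {
                return null;
            }
        }
        return url;
    }
}
```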

*Deleting duplicates.*
This tweak works really well in a very specific case, when I don't need to update documents in the index (I'm working with news sites, so for me it was OK). The main idea is to use a key-value storage instead of DeleteDuplicates. In this case we don't have to read the full index. This is not the MapReduce way, but it is simple and works really well, because even with very limited resources KV storages can handle a few queries per millisecond (we transfer only url+hash to the server), so indexing performance changes insignificantly.
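
The logic in the indexing step looks roughly like this (a simplified sketch; the KvStore interface stands in for whatever remote KV server is used, and the class names are made up):

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of index-time duplicate detection against a key-value storage
// instead of running DeleteDuplicates over the whole index afterwards.
public class KvDeduplicator {

    // Stand-in for the remote KV server; only url+hash travel over the wire.
    public interface KvStore {
        // Stores value under key only if the key is absent;
        // returns the previously stored value, or null if the key was free.
        String putIfAbsent(String key, String value);
    }

    private final KvStore store;

    public KvDeduplicator(KvStore store) {
        this.store = store;
    }

    // Called from an indexing filter: returns true if the document should be
    // indexed, false if another URL with the same content hash was already seen.
    public boolean shouldIndex(String url, String contentHash) {
        String existingUrl = store.putIfAbsent(contentHash, url);
        return existingUrl == null || existingUrl.equals(url);
    }

    // In-memory stand-in, just to test the logic locally.
    public static class LocalStore implements KvStore {
        private final ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
        public String putIfAbsent(String key, String value) {
            return map.putIfAbsent(key, value);
        }
    }
}
```

Because documents are never re-indexed in my setup, the first URL that claims a hash simply wins and later duplicates are dropped.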

There would be no need for a KV storage if Lucene could store fields in separate files, but as far as I know it can't.

>Did you work with TextProfileSignature or MD5Signature?
I used an MD5 signature of the cleaned text. It is not the same as TextProfileSignature, but it serves a similar purpose.
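
A minimal sketch of the signature (assuming a trivial whitespace/case normalization as the "cleaning"; the real cleaning depends on the parser and the site):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Signature = MD5 of the "cleaned" text of the document.
public class CleanedTextSignature {

    public static String signature(String text) throws NoSuchAlgorithmException {
        // Illustrative cleaning only: lower-case and collapse whitespace.
        String cleaned = text.toLowerCase().replaceAll("\\s+", " ").trim();
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(cleaned.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```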

Thanks,
Sergey.

On 11/26/2011 01:36 AM, Sebastian Nagel wrote:
Hi Sergey,

a late answer, but I just read your work and found it very interesting and
inspiring, especially your description of "a system for the automatic
construction of URL filters". Why? - We recently had to set up URL filter
and normalization rules for a customer to limit the number of
crawled documents to a reasonable number. URLs were long (>300 chars)
and contained many query parameters. Hard work, and I thought that
it would be a nice machine learning problem with
- some similarity measure based on content data on the one side
- URL features (path and query parts, host, etc.) on the other side
and the question: which parts of the URL lead to (near) duplicates and
can or should be removed? Thus, the target would be to find URL
normalization rules, not filters.
Your algorithm transferred to my problem would roughly mean:
if addition or variation of one parameter does not lead
to new interesting links, the parameter could be skipped.
I have to think about it. Unfortunately, there is little time at work
to solve nice machine learning problems.

Just for all others: Sergey's algorithm eliminates URLs
(by constructing URL filters) by counting the number of good links
with this prefix.
A question whether my understanding is right:
I guess "u_p - number of "useful" links with the given prefix"
means that there are u_p useful links pointing to URLs with this prefix.
Or is it the opposite: outlinks pointing from documents with the prefix
to useful documents?

You also addressed duplicate detection performed at an earlier
step (realized as an indexing filter, not operating on the index).
I think this is of general interest. Just one further question:
Did you work with TextProfileSignature or MD5Signature?

And finally, as Markus pointed out there are many problems
of common and/or academic interest around crawlers and links.
I think the most important thing for a PhD is to find a problem
without a bulk of papers already written about it.

So, hope to hear more from you.

Sebastian

On 11/16/2011 03:06 AM, Sergey A Volkov wrote:
Thanks!

Unfortunately my work is only in Russian - anyway, here is the link
https://github.com/volkov/diploma/blob/master/main.pdf?raw=true
Actually it is a technical work containing some specific optimizations for crawling news sites. At the moment my English writing skills aren't good, but I'll try to present some interesting aspects of
my graduate paper somehow =)

On Wed 16 Nov 2011 05:16:05 AM MSK, Lewis John Mcgibbney wrote:
---------- Forwarded message ----------
From: Lewis John Mcgibbney<[email protected]>
Date: Wed, Nov 16, 2011 at 1:15 AM
Subject: Re: Nutch project and my Ph.D. thesis.
To: [email protected]


Hi Sergey,

There was a Professor from somewhere in S. America who posted recently
regarding some work he did; if you search the archives you may get a taster
of work related to Nutch.

Also, can you provide a link to your work? I would be very interested in
having a look at the areas you have been working on. Also feel free to add
your work to the wiki page references for others to see.

Thank you.

http://wiki.apache.org/nutch/AcademicArticles


On Wed, Nov 16, 2011 at 12:39 AM, Sergey A Volkov <[email protected]> wrote:

Hi!

I am a postgraduate student at Saint Petersburg State University. I have been
working with Nutch for about 3 years and have written my graduate work based
on it, and now I don't know what to do for my Ph.D. work. (Nobody in my
department (System Programming) deals with web crawling.)

I hope someone knows of problems in web crawling whose solutions could help
the Nutch project and me in my future Ph.D. thesis.

Any ideas?

Thanks,
Sergey.







