Hi Sergey,
A late answer, but I just read your work and found it very interesting and
inspiring, especially your description of "a system for the automatic
construction of URL filters". Why? We recently had to set up URL filter
and normalization rules for a customer to limit the number of crawled
documents to a reasonable number. The URLs were long (>300 chars)
and contained many query parameters. It was hard work, and I thought that
it would be a nice machine learning problem with
- some similarity measure based on content data on the one side
- URL features (path and query parts, host, etc.) on the other side
and the question: which parts of the URL lead to (near) duplicates and
can or should be removed? So the goal would be to find URL
normalization rules, not filters.
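Just to make the idea concrete, here is a minimal sketch of what such a
learned rule could boil down to (plain Java; the parameter names are made up):
drop query parameters that turned out not to change the content.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

// Minimal sketch (parameter names are hypothetical): apply a learned
// normalization rule by dropping query parameters that were found not to
// change the page content.
public class QueryParamNormalizer {

  private final Set<String> droppable;

  public QueryParamNormalizer(Set<String> droppable) {
    this.droppable = droppable;
  }

  public String normalize(String url) {
    int q = url.indexOf('?');
    if (q < 0) return url;                      // no query part, nothing to do
    String base = url.substring(0, q);
    StringJoiner kept = new StringJoiner("&");
    for (String pair : url.substring(q + 1).split("&")) {
      String name = pair.split("=", 2)[0];
      if (!droppable.contains(name)) {          // keep only "meaningful" parameters
        kept.add(pair);
      }
    }
    return kept.length() == 0 ? base : base + "?" + kept;
  }

  public static void main(String[] args) {
    QueryParamNormalizer n = new QueryParamNormalizer(
        new HashSet<>(Arrays.asList("sessionid", "utm_source")));
    System.out.println(n.normalize("http://www.example.com/list?page=2&sessionid=abc"));
    // prints http://www.example.com/list?page=2
  }
}

In Nutch this would of course live behind the URLNormalizer extension point;
the sketch only shows the rule itself.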
Your algorithm, transferred to my problem, would roughly mean:
if the addition or variation of a parameter does not lead
to new interesting links, the parameter could be skipped.
I have to think about it. Unfortunately, there is little time at work
to solve nice machine learning problems.
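But roughly, the check for "droppable" could look like this (only an
illustration of the heuristic, not your algorithm; all class and field names
are made up): a parameter is a removal candidate if pages whose URLs differ
only in that parameter always carry the same content signature.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

// Illustration of the heuristic only (names made up): a query parameter is a
// removal candidate if pages whose URLs differ only in that parameter always
// carry the same content signature.
public class ParamRelevance {

  /** A crawled page: its URL plus some content signature (e.g. an MD5 of the text). */
  public static class Page {
    final String url;
    final String signature;
    Page(String url, String signature) { this.url = url; this.signature = signature; }
  }

  public static boolean isDroppable(String param, List<Page> pages) {
    // group pages by their URL with the parameter removed
    Map<String, Set<String>> signaturesPerGroup = new HashMap<>();
    for (Page p : pages) {
      String key = removeParam(p.url, param);
      signaturesPerGroup.computeIfAbsent(key, k -> new HashSet<>()).add(p.signature);
    }
    // droppable if no group contains more than one distinct signature
    return signaturesPerGroup.values().stream().allMatch(s -> s.size() <= 1);
  }

  static String removeParam(String url, String param) {
    int q = url.indexOf('?');
    if (q < 0) return url;
    StringJoiner kept = new StringJoiner("&");
    for (String pair : url.substring(q + 1).split("&")) {
      if (!pair.split("=", 2)[0].equals(param)) kept.add(pair);
    }
    return kept.length() == 0 ? url.substring(0, q) : url.substring(0, q) + "?" + kept;
  }
}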
Just for all others: Sergey's algorithm eliminates URLs
(by constructing URL filters) by counting the number of good links
with a given prefix.
A question to check whether my understanding is right:
I take "u_p - number of 'useful' links with the given prefix"
to mean that there are u_p useful links pointing to URLs with this prefix.
Or is it the opposite: outlinks pointing from documents with the prefix
to useful documents?
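And, for what it's worth, a very rough sketch of how I currently read the
prefix counting (my reconstruction from the description, not your code; the
fixed prefix length and the threshold are simplifications):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// My reconstruction of the prefix-counting idea (not Sergey's code):
// u_p = number of useful links with prefix p, n_p = total links with prefix p;
// prefixes with a low ratio u_p / n_p become exclude rules for a URL filter.
public class PrefixFilterBuilder {

  public static List<String> buildExcludeRules(Map<String, Boolean> linkIsUseful,
                                               int prefixLength,
                                               double minUsefulRatio) {
    Map<String, int[]> counts = new HashMap<>();     // prefix -> {u_p, n_p}
    for (Map.Entry<String, Boolean> e : linkIsUseful.entrySet()) {
      String url = e.getKey();
      String prefix = url.length() > prefixLength ? url.substring(0, prefixLength) : url;
      int[] c = counts.computeIfAbsent(prefix, k -> new int[2]);
      if (e.getValue()) c[0]++;                      // u_p
      c[1]++;                                        // n_p
    }
    List<String> rules = new ArrayList<>();
    for (Map.Entry<String, int[]> e : counts.entrySet()) {
      double usefulRatio = (double) e.getValue()[0] / e.getValue()[1];
      if (usefulRatio < minUsefulRatio) {
        rules.add("-" + e.getKey());                 // '-prefix' as an exclude line, illustrative
      }
    }
    return rules;
  }
}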
You also addressed duplicate detection performed at an earlier
stage (realized as an indexing filter, not operating on the index).
I think this is of general interest. Just one further question:
Did you work with TextProfileSignature or MD5Signature?
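(For other readers: as far as I know, the signature implementation Nutch uses
for deduplication is selected via the db.signature.class property, normally in
nutch-site.xml; set programmatically it would look roughly like this.)

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

// As far as I know, the signature used for duplicate detection is chosen via
// the db.signature.class property (normally set in nutch-site.xml).
public class SignatureConfigExample {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    // exact duplicates via an MD5 hash of the raw content:
    conf.set("db.signature.class", "org.apache.nutch.crawl.MD5Signature");
    // or near duplicates via a tokenized text profile:
    // conf.set("db.signature.class", "org.apache.nutch.crawl.TextProfileSignature");
    System.out.println(conf.get("db.signature.class"));
  }
}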
And finally, as Markus pointed out, there are many problems
of common and/or academic interest around crawlers and links.
I think the most important thing for a PhD is to find a problem
that does not already have a bulk of papers written about it.
So, hope to hear more from you.
Sebastian
On 11/16/2011 03:06 AM, Sergey A Volkov wrote:
Thanks!
Unfortunately my work is only in Russian - anyway, here is the link:
https://github.com/volkov/diploma/blob/master/main.pdf?raw=true
Actually it is a technical work containing some specific optimizations for
crawling news sites. At the moment my English writing skills aren't good,
but I'll try to present some interesting aspects of my graduate paper
somehow =)
On Wed 16 Nov 2011 05:16:05 AM MSK, Lewis John Mcgibbney wrote:
---------- Forwarded message ----------
From: Lewis John Mcgibbney<[email protected]>
Date: Wed, Nov 16, 2011 at 1:15 AM
Subject: Re: Nutch project and my Ph.D. thesis.
To: [email protected]
Hi Sergey,
There was a professor from somewhere in South America who posted recently
regarding some work he did; if you search the archives you may get a taste
of work related to Nutch.
Also, can you provide a link to your work? I would be very interested in
having a look at the areas you have been working on. Also feel free to add
your work to the wiki page of references for others to see.
Thank you.
http://wiki.apache.org/nutch/AcademicArticles
On Wed, Nov 16, 2011 at 12:39 AM, Sergey A Volkov <[email protected]>
wrote:
Hi!
I am a postgraduate student at Saint Petersburg State University. I have been
working with Nutch for about 3 years, have written my graduate work based
on it, and now I don't know what to do for my Ph.D. work. (Nobody in my
department (System Programming) deals with web crawling.)
I hope someone knows problems in web crawling whose solutions could help
the Nutch project and me in my future Ph.D. thesis.
Any ideas?
Thanks,
Sergey.