Hi Sebastian!

Glad to hear that you have read my work.

*Generating URL filters.*
This part of my work was not clearly described in my thesis, and it is now out of date.

The main idea is to reduce the crawldb size (reducing junk in the fetch list is a side effect).
Suppose we have some method to detect whether a downloaded document is "useful".
Then we consider pages with enough outlinks (e.g. 10) to useful documents to be "semi-useful". After that we can count the number of "useful" and "semi-useful" pages for each URL prefix (each URL is split into the set of its prefixes) and add prefixes to the filter using some heuristic (I blocked prefixes without any useful or semi-useful URLs).
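
Roughly, the counting looks like the sketch below (standalone Java for illustration only, not the actual code from my crawler; the class name, the threshold of 10 outlinks and the prefix splitting are simplified assumptions):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the prefix-counting heuristic: count "useful" and "semi-useful"
// pages per URL prefix and block prefixes that have none of either.
public class PrefixFilterBuilder {

    private final Map<String, Integer> usefulPerPrefix = new HashMap<>();
    private final Map<String, Integer> totalPerPrefix = new HashMap<>();

    // Each URL is split into the set of its directory prefixes, e.g.
    // http://host/a/b -> http://host/, http://host/a/
    private static List<String> prefixes(String url) {
        List<String> result = new ArrayList<>();
        int start = url.indexOf("://") + 3;
        int pos = url.indexOf('/', start);
        while (pos != -1) {
            result.add(url.substring(0, pos + 1));
            pos = url.indexOf('/', pos + 1);
        }
        return result;
    }

    // A page counts if it is "useful" itself, or "semi-useful",
    // i.e. it has enough (e.g. 10) outlinks to useful documents.
    public void addPage(String url, boolean useful, int outlinksToUseful) {
        boolean countsAsUseful = useful || outlinksToUseful >= 10;
        for (String p : prefixes(url)) {
            totalPerPrefix.merge(p, 1, Integer::sum);
            if (countsAsUseful) {
                usefulPerPrefix.merge(p, 1, Integer::sum);
            }
        }
    }

    // The heuristic I used: block prefixes without any useful or semi-useful URLs.
    public List<String> blockedPrefixes() {
        List<String> blocked = new ArrayList<>();
        for (String p : totalPerPrefix.keySet()) {
            if (usefulPerPrefix.getOrDefault(p, 0) == 0) {
                blocked.add(p);
            }
        }
        return blocked;
    }
}
```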

Actually it did not work as well in production as I had expected =( Maybe using a discrete value for "usefulness" instead of a float is not a good idea.

Much more effective in my case was adding robots.txt rules as a url-filter.
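
For illustration, here is roughly what such a filter does (a plain-Java sketch, not a real Nutch plugin; the rules file format is an assumption of mine):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Rough sketch: reject URLs matching Disallow prefixes collected from the
// robots.txt files of the crawled hosts. The prefixes are assumed to be
// pre-collected into a local file, one "host-prefix disallow-path" pair per line.
public class RobotsTxtPrefixFilter {

    private final List<String> blockedPrefixes = new ArrayList<>();

    public RobotsTxtPrefixFilter(String rulesFile) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(rulesFile))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length == 2) {
                    // e.g. "http://example.com" + "/search" -> "http://example.com/search"
                    blockedPrefixes.add(parts[0] + parts[1]);
                }
            }
        }
    }

    // Same contract as a Nutch url-filter: return the URL to keep it, null to drop it.
    public String filter(String url) {
        for (String prefix : blockedPrefixes) {
            if (url.startsWith(prefix)) {
                return null;
            }
        }
        return url;
    }
}
```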

*Deleting duplicates.*
This tweak works really well in a very specific case, when I don't need to update documents in the index (I'm working with news sites, so for me it was OK). The main idea is to use a key-value storage instead of DeleteDuplicates. In this case we don't have to read the full index. This is not the MapReduce way, but it is simple and works really well, because even with very limited resources KV storages can handle a few queries per millisecond (we transfer only url+hash to the server), so indexing performance changes insignificantly.
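
The logic in the indexing step looks roughly like this (a simplified sketch; the KvStore interface stands in for whatever remote KV server is used, and the class names are made up):

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of index-time duplicate detection against a key-value storage
// instead of running DeleteDuplicates over the whole index afterwards.
public class KvDeduplicator {

    // Stand-in for the remote KV server; only url+hash travel over the wire.
    public interface KvStore {
        // Stores value under key only if the key is absent;
        // returns the previously stored value, or null if the key was free.
        String putIfAbsent(String key, String value);
    }

    private final KvStore store;

    public KvDeduplicator(KvStore store) {
        this.store = store;
    }

    // Called from an indexing filter: returns true if the document should be
    // indexed, false if another URL with the same content hash was already seen.
    public boolean shouldIndex(String url, String contentHash) {
        String existingUrl = store.putIfAbsent(contentHash, url);
        return existingUrl == null || existingUrl.equals(url);
    }

    // In-memory stand-in, just to test the logic locally.
    public static class LocalStore implements KvStore {
        private final ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
        public String putIfAbsent(String key, String value) {
            return map.putIfAbsent(key, value);
        }
    }
}
```

Because documents are never re-indexed in my setup, the first URL that claims a hash simply wins and later duplicates are dropped.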

There would be no need for a KV storage if Lucene could store fields in separate files, but as far as I know it can't.

>Did you work with TextProfileSignature or MD5Signature?
I used an MD5 signature of the cleaned text. It is not the same as TextProfileSignature, but it serves a similar purpose.
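
A minimal sketch of the signature (assuming a trivial whitespace/case normalization as the "cleaning"; the real cleaning depends on the parser and the site):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Signature = MD5 of the "cleaned" text of the document.
public class CleanedTextSignature {

    public static String signature(String text) throws NoSuchAlgorithmException {
        // Illustrative cleaning only: lower-case and collapse whitespace.
        String cleaned = text.toLowerCase().replaceAll("\\s+", " ").trim();
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(cleaned.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```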

Thanks,
Sergey.

On 11/26/2011 01:36 AM, Sebastian Nagel wrote:
Hi Sergey,

a late answer, but I just read your work and found it very interesting and
inspiring, especially your description of "a system for the automatic
construction of URL filters". Why? - We recently had to set up URL filter
and normalization rules for a customer to limit the number of
crawled documents to a reasonable number. URLs were long (>300 chars)
and contained many query parameters. Hard work, and I thought that
it would be a nice machine learning problem with
- some similarity measure based on content data on the one side
- URL features (path and query parts, host, etc.) on the other side
and the question: which parts of the URL lead to (near) duplicates and
can or should be removed? Thus, the target would be to find URL
normalization rules, not filters.
Your algorithm transferred to my problem would roughly mean:
if addition or variation of one parameter does not lead
to new interesting links, the parameter could be skipped.
I have to think about it. Unfortunately, there is little time at work
to solve nice machine learning problems.

Just for all others: Sergey's algorithm eliminates URLs
(by constructing URL filters) by counting the number of good links
with this prefix.
A question whether my understanding is right:
I guess "u_p - number of "useful" links with the given prefix"
means that there are u_p useful links pointing to URLs with this prefix.
Or is it the opposite: outlinks pointing from documents with the prefix
to useful documents?

You also addressed duplicate detection performed at an earlier
step (realized as an indexing filter, not operating on the index).
I think this is of general interest. Just one further question:
Did you work with TextProfileSignature or MD5Signature?

And finally, as Markus pointed out there are many problems
of common and/or academic interest around crawlers and links.
I think the most important thing for a PhD is to find a problem
without a bulk of papers already written about it.

So, hope to hear more from you.

Sebastian

On 11/16/2011 03:06 AM, Sergey A Volkov wrote:
Thanks!

Unfortunately my work is only in Russian - anyway, here is the link
https://github.com/volkov/diploma/blob/master/main.pdf?raw=true
Actually it is a technical work containing some specific optimizations for crawling news sites. At the moment my English writing skills aren't good, but I'll try to present some interesting aspects of
my graduate paper somehow =)

On Wed 16 Nov 2011 05:16:05 AM MSK, Lewis John Mcgibbney wrote:
---------- Forwarded message ----------
From: Lewis John Mcgibbney<[email protected]>
Date: Wed, Nov 16, 2011 at 1:15 AM
Subject: Re: Nutch project and my Ph.D. thesis.
To: [email protected]


Hi Sergey,

There was a Professor from somewhere in S. America who posted recently
regarding some work he did; if you search the archives you may get a taster
of work related to Nutch.

Also, can you provide a link to your work? I would be very interested in
having a look at the areas you have been working on. Also feel free to add
your work to the wiki page references for others to see.

Thank you.

http://wiki.apache.org/nutch/AcademicArticles


On Wed, Nov 16, 2011 at 12:39 AM, Sergey A Volkov <[email protected]> wrote:

Hi!

I am a postgraduate student at Saint Petersburg State University. I have been
working with Nutch for about 3 years and have written my graduate work based
on it, and now I don't know what to do for my Ph.D. work. (Nobody in my
department (System Programming) deals with web crawling.)

I hope someone knows of problems in web crawling whose solutions could help
the Nutch project and me in my future Ph.D. thesis.

Any ideas?

Thanks,
Sergey.







