Alexander, I apologize. I misunderstood the intent of your message and I was very rude in my response. I will think about what you've asked and get back to you.
Also, I enjoyed your slide presentation. It's very pleasing to the eye.

Sincerely,
Jessica

On Fri, Oct 2, 2015 at 11:51 AM, Mattmann, Chris A (3980)
<[email protected]> wrote:

> Hi,
>
> I don’t think Alexander is doing anything wrong. In fact, he’s asking
> for input on his web crawling framework on the Nutch user list, which I
> imagine contains many people interested in distributed web crawling.
>
> There doesn’t appear to be a direct Nutch connection here in his
> framework; however, it uses other Apache technologies, Kafka, HBase,
> etc., that we are using (or thinking of using) and are interested in,
> at least from my perspective as a Nutch developer and PMC member.
> There are also several efforts to figure out how to use Scrapy with
> Nutch, and this may be an interesting connection.
>
> If Alexander and people like him who aren’t using Nutch per se never
> came to the Nutch list and discussed common web crawling topics of
> interest, we’d continue to have our silos, our own separate lists, and
> our own discussions, instead of trying to work together as a broader
> community of folks, and we’d miss out on potential opportunities where,
> in the future, perhaps we could share more than simply ideas, but also
> software too.
>
> I applaud Alexander for coming to this list, not staying in his own
> silo, and trying to get input from the Apache Nutch community.
>
> Thank you, Alexander.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
> -----Original Message-----
> From: Jessica Glover <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Friday, October 2, 2015 at 8:45 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Frontera: large-scale, distributed web crawling framework
>
> >Hmm... you're asking for a free consultation on an open source software
> >user mailing list? First, this doesn't exactly seem like the appropriate
> >place for that. Second, offer some incentive if you want someone to help
> >you with your business.
> >
> >On Fri, Oct 2, 2015 at 11:33 AM, Alexander Sibiryakov
> ><[email protected]> wrote:
> >
> >> Hi Nutch users!
> >>
> >> For the last 8 months at Scrapinghub we’ve been working on a new web
> >> crawling framework called Frontera. It is a distributed implementation
> >> of the crawl frontier part of a web crawler: the component that
> >> decides what to crawl next, when, and when to stop. So it’s not a
> >> complete web crawler; however, it suggests an overall crawler design,
> >> and there is a clean, tested way to build such a crawler from existing
> >> components in half a day.
> >>
> >> Here is a list of the main features:
> >>
> >> - Online operation: scheduling of new batches and updating of DB
> >>   state. No need to stop crawling to change the crawling strategy.
> >> - Storage abstraction: write your own backend (SQLAlchemy and HBase
> >>   backends are included).
> >> - Canonical URL resolution abstraction: each document has many URLs;
> >>   which one should be used? We provide a place where you can code
> >>   your own logic.
> >> - Scrapy ecosystem: good documentation, a big community, and ease of
> >>   customization.
> >> - The communication layer is Apache Kafka: easy to plug in somewhere
> >>   and to debug.
> >> - Crawling strategy abstraction: the crawling goal, URL ordering, and
> >>   scoring model are coded in a separate module.
> >> - Polite by design: each website is downloaded by at most one spider
> >>   process.
> >> - Workers are implemented in Python.
> >>
> >> In general, such a web crawler should be very easy to customize and
> >> to plug into existing infrastructure, and its online operation could
> >> be useful for crawling frequently changing web pages, news websites
> >> for example. We tested it at some scale by crawling part of the
> >> Spanish internet; you can find details in my presentation:
> >>
> >> http://events.linuxfoundation.org/sites/events/files/slides/Frontera-crawling%20the%20spanish%20web.pdf
> >>
> >> The project is currently on GitHub; it’s open source, under its own
> >> license.
> >>
> >> https://github.com/scrapinghub/frontera
> >> https://github.com/scrapinghub/distributed-frontera
> >>
> >> The questions are: what do you think? Is this a useful thing? If yes,
> >> what kinds of use cases do you see? Currently I’m looking for
> >> businesses that could benefit from it; please write to me if you have
> >> any ideas on that.
> >>
> >> A.
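
To make the crawl-frontier idea from the quoted message concrete, here is a
minimal, self-contained Python sketch of the concept: a frontier that decides
what to fetch next, with a pluggable storage backend behind it. This is not
Frontera's actual API; every class and method name below is an illustrative
assumption, and real backends would add scoring, politeness, and persistence.

# A minimal sketch of a "crawl frontier": the component that decides what
# to crawl next. NOT Frontera's real API; all names here are illustrative.
from abc import ABC, abstractmethod
from collections import deque


class Backend(ABC):
    """Storage abstraction: SQL, HBase, etc. could sit behind this interface."""

    @abstractmethod
    def add_links(self, links): ...

    @abstractmethod
    def next_batch(self, max_requests): ...


class MemoryBackend(Backend):
    """Simplest possible backend: an in-memory FIFO queue with de-duplication."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def add_links(self, links):
        for url in links:
            if url not in self.seen:  # URL ordering/scoring logic would go here
                self.seen.add(url)
                self.queue.append(url)

    def next_batch(self, max_requests):
        n = min(max_requests, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]


class Frontier:
    """Decides what to crawl next; fetching happens elsewhere (e.g. in
    Scrapy spiders), which is why a frontier is not a complete crawler."""

    def __init__(self, backend, seeds):
        self.backend = backend
        self.backend.add_links(seeds)

    def get_next_requests(self, max_requests=10):
        return self.backend.next_batch(max_requests)

    def page_crawled(self, url, extracted_links):
        # Feed newly discovered links back into the frontier's state.
        self.backend.add_links(extracted_links)


frontier = Frontier(MemoryBackend(), seeds=["http://example.com/"])
batch = frontier.get_next_requests()  # -> ["http://example.com/"]
frontier.page_crawled(batch[0], ["http://example.com/a", "http://example.com/b"])
print(frontier.get_next_requests())   # the next URLs a spider should fetch

The Backend split mirrors the "storage abstraction" feature in the list above:
the same frontier logic can sit on top of an in-memory queue or a database,
which is what makes the design pluggable.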

