Alexander, I apologize. I misunderstood the intent of your message and I was very rude in my response. I will think about what you've asked and get back to you.
Also, I enjoyed your slide presentation. It's very pleasing to the eye.

Sincerely,
Jessica

On Fri, Oct 2, 2015 at 11:51 AM, Mattmann, Chris A (3980)
<[email protected]> wrote:

> Hi,
>
> I don’t think Alexander is doing anything wrong. In fact, he’s asking
> for input on his web crawling framework on the Nutch user list, which I
> imagine contains many people interested in distributed web crawling.
>
> There doesn’t appear to be a direct Nutch connection here in his
> framework; however, it uses other Apache technologies, Kafka, HBase,
> etc., that we are using (or thinking of using) and are interested in,
> at least from my perspective as a Nutch developer and PMC member.
> There are also several efforts to figure out how to use Scrapy with
> Nutch, and this may be an interesting connection.
>
> If Alexander and people like him who aren’t using Nutch per se never
> came to the Nutch list and discussed common web crawling topics of
> interest, we’d continue to have our silos, our own separate lists, and
> our own discussions, instead of trying to work together as a broader
> community of folks, and we’d miss out on potential opportunities where,
> in the future, perhaps we could share more than simply ideas, but also
> software too.
>
> I applaud Alexander for coming to this list, not staying in his own
> silo, and trying to get input from the Apache Nutch community.
>
> Thank you, Alexander.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
> -----Original Message-----
> From: Jessica Glover <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Friday, October 2, 2015 at 8:45 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Frontera: large-scale, distributed web crawling framework
>
> >Hmm... you're asking for a free consultation on an open source software
> >user mailing list? First, this doesn't exactly seem like the appropriate
> >place for that. Second, offer some incentive if you want someone to help
> >you with your business.
> >
> >On Fri, Oct 2, 2015 at 11:33 AM, Alexander Sibiryakov
> ><[email protected]> wrote:
> >
> >> Hi Nutch users!
> >>
> >> For the last 8 months at Scrapinghub we’ve been working on a new web
> >> crawling framework called Frontera. It is a distributed implementation
> >> of the crawl frontier part of a web crawler: the component that
> >> decides what to crawl next, when, and when to stop. So it’s not a
> >> complete web crawler; however, it suggests an overall crawler design,
> >> and there is a clean, tested way to build such a crawler from existing
> >> components in half a day.
> >>
> >> Here is a list of the main features:
> >>
> >> - Online operation: scheduling of new batches and updating of DB
> >>   state. No need to stop crawling to change the crawling strategy.
> >> - Storage abstraction: write your own backend (SQLAlchemy and HBase
> >>   backends are included).
> >> - Canonical URL resolution abstraction: each document has many URLs;
> >>   which one should be used? We provide a place where you can code
> >>   your own logic.
> >> - Scrapy ecosystem: good documentation, a big community, and ease of
> >>   customization.
> >> - The communication layer is Apache Kafka: easy to plug in somewhere
> >>   and to debug.
> >> - Crawling strategy abstraction: the crawling goal, URL ordering, and
> >>   scoring model are coded in a separate module.
> >> - Polite by design: each website is downloaded by at most one spider
> >>   process.
> >> - Workers are implemented in Python.
> >>
> >> In general, such a web crawler should be very easy to customize and
> >> to plug into existing infrastructure, and its online operation could
> >> be useful for crawling frequently changing web pages, news websites
> >> for example. We tested it at some scale by crawling part of the
> >> Spanish internet; you can find details in my presentation:
> >>
> >> http://events.linuxfoundation.org/sites/events/files/slides/Frontera-crawling%20the%20spanish%20web.pdf
> >>
> >> The project is currently on GitHub; it’s open source, under its own
> >> license.
> >>
> >> https://github.com/scrapinghub/frontera
> >> https://github.com/scrapinghub/distributed-frontera
> >>
> >> The questions are: what do you think? Is this a useful thing? If yes,
> >> what kinds of use cases do you see? Currently I’m looking for
> >> businesses that could benefit from it; please write to me if you have
> >> any ideas on that.
> >>
> >> A.
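
To make the crawl-frontier idea from the quoted message concrete, here is a
minimal, self-contained Python sketch of the concept: a frontier that decides
what to fetch next, with a pluggable storage backend behind it. This is not
Frontera's actual API; every class and method name below is an illustrative
assumption, and real backends would add scoring, politeness, and persistence.

# A minimal sketch of a "crawl frontier": the component that decides what
# to crawl next. NOT Frontera's real API; all names here are illustrative.
from abc import ABC, abstractmethod
from collections import deque


class Backend(ABC):
    """Storage abstraction: SQL, HBase, etc. could sit behind this interface."""

    @abstractmethod
    def add_links(self, links): ...

    @abstractmethod
    def next_batch(self, max_requests): ...


class MemoryBackend(Backend):
    """Simplest possible backend: an in-memory FIFO queue with de-duplication."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def add_links(self, links):
        for url in links:
            if url not in self.seen:  # URL ordering/scoring logic would go here
                self.seen.add(url)
                self.queue.append(url)

    def next_batch(self, max_requests):
        n = min(max_requests, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]


class Frontier:
    """Decides what to crawl next; fetching happens elsewhere (e.g. in
    Scrapy spiders), which is why a frontier is not a complete crawler."""

    def __init__(self, backend, seeds):
        self.backend = backend
        self.backend.add_links(seeds)

    def get_next_requests(self, max_requests=10):
        return self.backend.next_batch(max_requests)

    def page_crawled(self, url, extracted_links):
        # Feed newly discovered links back into the frontier's state.
        self.backend.add_links(extracted_links)


frontier = Frontier(MemoryBackend(), seeds=["http://example.com/"])
batch = frontier.get_next_requests()  # -> ["http://example.com/"]
frontier.page_crawled(batch[0], ["http://example.com/a", "http://example.com/b"])
print(frontier.get_next_requests())   # the next URLs a spider should fetch

The Backend split mirrors the "storage abstraction" feature in the list above:
the same frontier logic can sit on top of an in-memory queue or a database,
which is what makes the design pluggable.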

