Awesome, thanks for sharing!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Julien Nioche <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, September 18, 2015 at 2:54 AM
To: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Fwd: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

>Nutch people,
>
>Just in case you missed the announcement below. As you probably know,
>CC uses Nutch for its crawls, so this is a fantastic opportunity to put
>your Nutch skills to great use!
>
>Julien
>
>---------- Forwarded message ----------
>From: Sara Crouse <[email protected]>
>Date: 17 September 2015 at 22:51
>Subject: Job Opening at Common Crawl - Crawl Engineer / Data Scientist
>To: Common Crawl <[email protected]>
>
>Hello again CC community,
>
>In addition to my appointment, another staff transition is on the
>horizon, and I would like to ask for your help finding candidates to
>fill a critical role. At the end of this month, Stephen Merity (data
>scientist, crawl engineer, and much more!) will leave Common Crawl to
>work on image recognition and language understanding using deep
>learning at MetaMind, a new startup. Stephen has been a great asset to
>Common Crawl, and we are grateful that he wishes to remain engaged with
>us in a volunteer capacity going forward.
>
>This week we therefore launch a search to fill the role of Crawl
>Engineer/Data Scientist. The job description is below and is also
>posted at https://commoncrawl.org/jobs/. We appreciate any help you can
>provide in spreading the word about this unique opportunity. If you
>have specific referrals or wish to apply, please contact
>[email protected].
>
>Many thanks,
>
>Sara
>-------------------------------------------------------
>
>_CRAWL ENGINEER / DATA SCIENTIST at THE COMMON CRAWL FOUNDATION_
>
>*Location*
>San Francisco or Remote
>
>*Job Summary*
>Common Crawl (CC) is the non-profit organization that builds and
>maintains the single largest publicly accessible dataset of the world’s
>knowledge, encompassing petabytes of web crawl data.
>
>If democratizing access to web information and tackling the engineering
>challenges of working with data at the scale of the web sound exciting
>to you, we would love to hear from you. If you have worked on open
>source projects before or can share code samples with us, please don't
>hesitate to send relevant links along with your application.
>
>*Description*
>
>/Primary Responsibilities/
>
>_Running the crawl_
>* Spinning up and managing Hadoop clusters on Amazon EC2
>* Running regular comprehensive crawls of the web using Nutch
>* Preparing and publishing crawl data to our data hosting partner,
>Amazon Web Services
>* Incident response and diagnosis of crawl issues as they occur, e.g.:
>** Replacing lost instances due to EC2 problems / spot instance losses
>** Responding to and remedying webmaster queries and issues
>
>_Crawl engineering_
>* Maintaining, developing, and deploying new features as required to
>run the Nutch crawler, e.g.:
>** Providing netiquette features, such as following robots.txt and
>load balancing a crawl across millions of domains
>** Implementing and improving ranking algorithms to prioritize the
>crawling of popular pages
>* Extending existing tools to work efficiently with large datasets
>* Working with the Nutch community to push improvements to the crawler
>out to the public
>
>/Other Responsibilities/
>* Building support tools and artifacts, including documentation,
>tutorials, and example code or supporting frameworks for processing CC
>data using different tools
>* Identifying and reporting on research and innovations that result
>from analysis and derivative use of CC data
>* Community evangelism:
>** Collaborating with partners in academia and industry
>** Engaging regularly with the user discussion group and responding to
>frequent inquiries about how to use CC data
>** Writing technical blog posts
>** Presenting on or representing CC at conferences, meetups, etc.
>
>*Qualifications*
>
>/Minimum qualifications/
>* Fluency in Java (Nutch and Hadoop are core to our mission)
>* Familiarity with the JVM big data ecosystem (Hadoop, HDFS, ...)
>* Knowledge of the Amazon Web Services (AWS) ecosystem
>* Experience with Python
>* Basic command-line Unix knowledge
>* BS in Computer Science or equivalent work experience
>
>/Preferred qualifications/
>* Experience running web crawlers
>* Cluster computing experience (Hadoop preferred)
>* Experience running parallel jobs over dozens of terabytes of data
>* Experience committing to open source projects and participating in
>open source forums
>
>*About Common Crawl*
>The Common Crawl Foundation is a California 501(c)(3) registered
>non-profit with the goal of democratizing access to web information by
>producing and maintaining an open repository of web crawl data that is
>universally accessible and analyzable.
>
>Our vision is of a truly open web that allows open access to
>information and enables greater innovation in research, business, and
>education. We level the playing field by making wholesale extraction,
>transformation, and analysis of web data cheap and easy.
>
>The Common Crawl Foundation is an Equal Opportunity Employer.
>
>*To Apply*
>Please send your cover letter and résumé to [email protected].
>
>
>--
>
>Open Source Solutions for Text Engineering
>
>http://digitalpebble.blogspot.com/
>http://www.digitalpebble.com
>http://twitter.com/digitalpebble

