Awesome, thanks for sharing!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Julien Nioche <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, September 18, 2015 at 2:54 AM
To: "[email protected]" <[email protected]>, "[email protected]"
<[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Fwd: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

>Nutch people, 
>
>
>Just in case you missed the announcement below. As you probably know, CC
>uses Nutch for their crawls, so this is a fantastic opportunity to put
>your Nutch skills to great use!
>
>
>Julien
>
>---------- Forwarded message ----------
>From: Sara Crouse <[email protected]>
>Date: 17 September 2015 at 22:51
>Subject: Job Opening at Common Crawl - Crawl Engineer / Data Scientist
>To: Common Crawl <[email protected]>
>
>
>Hello again CC community,
>
>In addition to my appointment, another staff transition is on the
>horizon, and I would like to ask for your help finding candidates to fill
>a critical role. At the end of this month, Stephen Merity (data
>scientist, crawl engineer, and much more!) will leave
> Common Crawl to work on image recognition and language understanding
>using deep learning at MetaMind, a new startup. Stephen has been a great
>asset to Common Crawl, and we are grateful that he wishes to remain
>engaged with us in a volunteer capacity going
> forward.
>
>This week, we are therefore launching a search to fill the role of Crawl
>Engineer/Data Scientist. The job description is below and also posted at
>https://commoncrawl.org/jobs/. We appreciate any help you can provide in
>spreading the word about this unique opportunity. If you have specific
>referrals, or wish to apply, please contact [email protected].
>
>Many thanks,
>
>Sara
>-------------------------------------------------------
>
>_CRAWL ENGINEER / DATA SCIENTIST at THE COMMON CRAWL FOUNDATION_
>
>*Location* 
>San Francisco or Remote
>
>
>*Job Summary*
>Common Crawl (CC) is the non-profit organization that builds and
>maintains the single largest publicly accessible dataset of the world’s
>knowledge, encompassing petabytes of web crawl data.
>
>If democratizing access to web information and tackling the engineering
>challenges of working with data at the scale of the web sounds exciting
>to you, we would love to hear from you. If you have worked on open source
>projects before or can share code samples
> with us, please don't hesitate to send relevant links along with your
>application.
>
>
>
>*Description*
>
>
>/Primary Responsibilities/
>_Running the crawl_
>* Spinning up and managing Hadoop clusters on Amazon EC2
>* Running regular comprehensive crawls of the web using Nutch
>* Preparing and publishing crawl data to our data hosting partner, Amazon
>Web Services
>* Incident response and diagnosis of crawl issues as they occur, e.g.
>** Replacing lost instances due to EC2 problems / spot instance losses
>** Responding to and remedying webmaster queries and issues
>
>_Crawl engineering_
>* Maintaining, developing, and deploying new features required to run the
>Nutch crawler, e.g.:
>** Providing netiquette features, such as following robots.txt, as
>required, and load balancing a crawl across millions of domains
>
>** Implementing and improving ranking algorithms to prioritize the
>crawling of popular pages
>* Extending existing tools to work efficiently with large datasets
>* Working with the Nutch community to push improvements to the crawler to
>the public
>
>/Other Responsibilities/
>* Building support tools and artifacts, including documentation,
>tutorials, and example code or supporting frameworks for processing CC
>data using different tools.
>* Identifying and reporting on research and innovations that result from
>analysis and derivative use of CC data.
>* Community evangelism:
>** Collaborating with partners in academia and industry
>** Engaging regularly with the user discussion group and responding to
>frequent inquiries about how to use CC data
>** Writing technical blog posts
>** Presenting on or representing CC at conferences, meetups, etc.
>
>
>*Qualifications*
>/Minimum qualifications/
>* Fluent in Java (Nutch and Hadoop are core to our mission)
>* Familiarity with the JVM big data ecosystem (Hadoop, HDFS, ...)
>* Knowledge of the Amazon Web Services (AWS) ecosystem
>* Experience with Python
>* Basic command line Unix knowledge
>* BS in Computer Science or equivalent work experience
>
>/Preferred qualifications/
>* Experience with running web crawlers
>* Cluster computing experience (Hadoop preferred)
>* Experience running parallel jobs over dozens of terabytes of data
>* Experience committing to open source projects and participating in open
>source forums
>
>
>*About Common Crawl*
>The Common Crawl Foundation is a California 501(c)(3) registered
>non-profit with the goal of democratizing access to web information by
>producing and maintaining an open repository of web crawl data that is
>universally accessible and analyzable.
>
>Our vision is of a truly open web that allows open access to information
>and enables greater innovation in research, business and education. We
>level the playing field by making wholesale extraction, transformation
>and analysis of web data cheap and easy.
>
>The Common Crawl Foundation is an Equal Opportunity Employer.
>
>
>*To Apply* 
>Please send your cover letter and résumé to [email protected].
>
>
>
>
>-- 
>
>Open Source Solutions for Text Engineering
>
>http://digitalpebble.blogspot.com/
>http://www.digitalpebble.com
>http://twitter.com/digitalpebble
>
>
