Hey Guys,

I submitted the talk below on Apache Tika, Nutch and Solr to ApacheCon NA
2014:

Real Data Science: Exploring the FBI's Vault dataset with Apache Tika,
Nutch and Solr
Event ApacheCon North America
Submission Type Lightning Talk
Category Developer
Biography Chris Mattmann has a wealth of experience in software design,
and in the construction of large-scale data-intensive systems. His work
has infected a broad set of communities, ranging from helping NASA unlock
data from its next generation of earth science system satellites, to
assisting graduate students at the University of Southern California (his
alma mater) in the study of software architecture, all the way to helping
industry and open source as a member of the Apache Software Foundation.
When he's not busy being busy, he's spending time with his lovely wife and
son braving the mean streets of Southern California.
Abstract Apache Tika is a content detection and analysis toolkit allowing
automated MIME type identification and rapid parsing of text and metadata
from over 1200 types of files, including all major file types from the
Internet Assigned Numbers Authority's MIME database. In this talk I'll show
you, in practical terms, how to use Apache Tika to explore the FBI's vault
of declassified PDF documents, how to use Apache Nutch to pull down the
dataset, and how to use Solr to ingest and geoclassify the documents so
that you can build a map of FBI PDF documents corresponding to your
favorite conspiracies throughout the USA. I've taught this material in my
CSCI 572 Search Engines class at USC and it's a big hit. These are normally
three assignments, so I will do my best to boil down their essence into a
45-60 minute talk replete with danger and excitement.
Audience Developers interested in using Tika, Nutch and Solr. Folks
interested in the FBI vault dataset. GIS wonks. The like.
Experience Level Intermediate
Benefits to the Ecosystem The core of the talk will be Tika, but there
will be some Nutch magic, and some Solr magic at very basic levels. The
benefit to the ecosystem will be a real display of data science on a
real dataset.
Technical Requirements I need an internet connection and a projector.
Status New




Cheers,
Chris

