Re: [CODE4LIB] A suggested role for text mining in library catalogs?

2011-02-22 Thread Cindy Harper
It's not ironic - my post was musing inspired by your work.  I guess I
wasn't sure if I understood your results. You were looking at the overall
POS usage in the entire texts as a possible way of ranking the texts. I was
wondering about POS of particular search terms - those that could take on
several POS. A related question - does SOLR use stemming to widen the search
to various POS?  Then would it be meaningful to rank the given texts by the
POS of the actual search terms?  And has anyone looked at samples of user
search terms - are they almost always noun phrases?  Just wanting to
understand what you have explored.  And I probably should have added to your
thread on NGC4LIB, rather than Code4lib - I tend to conflate them.

Cindy Harper, Systems Librarian
Colgate University Libraries
char...@colgate.edu
315-228-7363



On Sat, Feb 19, 2011 at 5:42 PM, Eric Lease Morgan  wrote:

> On Feb 19, 2011, at 11:26 AM, Cindy Harper wrote:
>
> > I just was testing our discovery engine for any technical issues after a
> > reboot. I was just using random single words, and one word I used was
> > "correct".  Looking at the first ranked items, I wondered if there's some
> > role for parts-of-speech in ranking hits - are nouns and, in this case,
> > adjectives more indicative of aboutness than verbs?  The first items were
> > "Miss Manners ...  excruciatingly correct behavior", then a bunch of govdocs
> > on "an act to correct".  I don't think there's any reason to prefer
> > nouns over verbs, but I thought I'd throw the thought at you anyway.
>
>
>
> Ironically, I was playing with parts-of-speech (POS) analysis the other
> day. [1]
>
> Using a pseudo-random sample of texts, I found there to be surprisingly
> similar POS usage between texts. With such similarity, I thought it would be
> difficult to use general POS as a means for ranking or sorting. On the other
> hand, specific POS may be useful. For example, Thoreau was dominated by
> first-person male pronouns but Austen was dominated by third-person female
> pronouns.
>
> I think there is something to be explored here.
>
> [1] POS - http://bit.ly/hsxD2i
>
> --
> Eric "Still Counting Tweets and Chats" Morgan
>


Re: [CODE4LIB] Trial run of Virtual Lightning Talks

2011-02-22 Thread Peter Murray
A couple of clarifications.  This is just a trial run to see if the software 
works; a prepared talk isn't necessary or expected.  The time is also 2pm EST.

Room for a few more volunteers...


Peter

On Feb 21, 2011, at 12:10 PM, Peter Murray wrote:
> 
> All,
> 
> I'm looking for some volunteers to make a trial run at virtual lightning 
> talks.  This is an idea that came to me during Code4Lib earlier this month -- 
> use a webinar tool to replicate the environment of the conference lightning 
> talks.  The outline of the concept is at:
> 
>  http://wiki.code4lib.org/index.php/Virtual_Lightning_Talks
> 
> LYRASIS has a subscription to a 100-seat instance of Centra Saba that we can 
> try.  It is Java-based with claimed support for sharing desktops under Mac, 
> Linux and Windows.  I'd like to test that support to see if it can be used.  
> So I'm looking for a half dozen volunteers to sign into a test room on 
> Wednesday at 2pm.
> 
> Please let me know if you can help.  Read the presenter guidelines at the URL 
> above to make sure you have the minimum requirements and for links to install 
> the webinar client software.  The URL to the trial run space is 
> http://tinyurl.com/5vzd8st and it will be active on Wednesday at 2pm.
> 
> Thanks,
> 
> 
> Peter


-- 
Peter Murray        peter.mur...@lyrasis.org        tel:+1-678-235-2955
 
Ass't Director, Technology Services Development   http://dltj.org/about/
Lyrasis   --   Great Libraries. Strong Communities. Innovative Answers.
The Disruptive Library Technology Jester          http://dltj.org/
Attrib-Noncomm-Share   http://creativecommons.org/licenses/by-nc-sa/2.5/


Re: [CODE4LIB] A suggested role for text mining in library catalogs?

2011-02-22 Thread Rob Casson
>And I probably should have added to your thread on NGC4LIB, rather than 
>Code4lib - I tend to conflate them.

i'm offended ;)


[CODE4LIB] Job Posting - Scholars' Lab, University of Virginia

2011-02-22 Thread Graham, Wayne (wsg4w)
http://www.scholarslab.org/announcements/web-applications-specialist/

The Scholars’ Lab at the University of Virginia seeks an enthusiastic web 
applications specialist with a background in programming and the humanities or 
cultural heritage.  As a Web Applications Specialist reporting to the Head of 
R&D for the Scholars’ Lab, you will be responsible for building, testing, and 
debugging code. You should possess an extreme attention to detail and a high 
level of accountability and responsibility. We’re looking for someone who 
enjoys technical challenges, likes to figure out how things work, and stays 
involved in the latest Web and digital humanities technologies. You will need 
to be able to fit into a creative and collaborative environment.

Web Applications Specialist Responsibilities

 *   Build, test, and debug code
 *   Write test cases
 *   Estimate coding projects
 *   Provide consultation on collaborative projects
 *   Develop documentation
 *   Assist in debugging and system troubleshooting for existing software 
written in a variety of languages and on a variety of platforms

Qualifications

 *   1+ years full-time experience with web development (Rails and PHP 
preferred)
 *   2+ years' experience with standards-compliant HTML, CSS, and JavaScript
 *   JavaScript skills (AJAX, jQuery or similar JS framework)
 *   Experience with Test Driven Development (Shoulda, RSpec, PHPUnit)
 *   Experience with relational database management systems (MySQL, PostgreSQL)
 *   Familiarity with version control systems
 *   Understanding of software life cycle
 *   Strong foundation in OO programming and practices
 *   Experience with Omeka a plus

Salary is commensurate with experience, and expected to range between 
approximately $43,500 and $75,500 per annum. We’re looking to fill this 
position quickly, so please don’t delay!

Consideration of applications will begin immediately and continue until the 
position is filled.

Job posting: http://jobs.virginia.edu/applicants/Central?quickFind=63332


Re: [CODE4LIB] A suggested role for text mining in library catalogs?

2011-02-22 Thread Erik Hatcher
Solr _can_ use stemming, but to do it with POS would be flaky, I'd think.  Is 
"work" a verb or a noun?

Some of the (Solr-using) customers that I work with have done POS tagging 
(using tools like BasisTech Solr plugins for entity tagging).  Payloads can be 
assigned to terms during indexing and then used to weight the score when query 
terms match.  Lucene supports payloads and scoring based on them natively, but 
it requires some code to wire together.  Solr supports a little in terms of 
payloads, but to really use them effectively custom coding is needed.  See 
 for example.
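
The payload idea can be illustrated outside of Lucene. What follows is a toy sketch in plain Python, not Lucene or Solr code: the index structure, the `POS_WEIGHTS` values, and the function names are all hypothetical, and the documents are assumed to be POS-tagged by some upstream tool. Each posting carries a payload derived from the term's part of speech, and matching postings contribute their payload to the document score.

```python
from collections import defaultdict

# Hypothetical per-POS weights; the values are illustrative, not tuned.
POS_WEIGHTS = {"NOUN": 2.0, "ADJ": 1.5, "VERB": 1.0}


def build_index(docs):
    """Map each term to a list of (doc_id, payload) postings.

    docs: {doc_id: [(token, pos_tag), ...]}, already POS-tagged upstream.
    The payload is simply the weight assigned to the token's POS tag.
    """
    index = defaultdict(list)
    for doc_id, tagged_tokens in docs.items():
        for token, pos in tagged_tokens:
            index[token.lower()].append((doc_id, POS_WEIGHTS.get(pos, 1.0)))
    return index


def search(index, term):
    """Score each document by summing the payloads of matching postings."""
    scores = defaultdict(float)
    for doc_id, payload in index.get(term.lower(), []):
        scores[doc_id] += payload
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    "manners": [("excruciatingly", "ADV"), ("correct", "ADJ"), ("behavior", "NOUN")],
    "govdoc": [("an", "DET"), ("act", "NOUN"), ("to", "PART"), ("correct", "VERB")],
}
index = build_index(docs)
print(search(index, "correct"))  # [('manners', 1.5), ('govdoc', 1.0)]
```

With these made-up weights the adjective use of "correct" outranks the verb use. Whether that ordering is actually desirable is the open question in this thread.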

Erik



[CODE4LIB] Job Posting: Systems Engineer, Sheridan Libraries, Johns Hopkins University

2011-02-22 Thread Sean Hannan
We’re looking for a sysadmin at Hopkins.  Come work with me.  It’ll be cool, I 
promise.

-Sean
---

https://hrnt.jhu.edu/jhujobs/job_view.cfm?view_req_id=46964

The Systems Engineer will provide systems administration and, to a lesser 
extent, programming support for the Systems department’s multi-platform 
(primarily Linux, but also some Windows and Solaris) environment. This position 
will support services provided by the Systems department, including, but not 
limited to, library catalog, search interface, federated search tools, library 
web sites, blogs, file and print shares, desktop applications and mobile 
interfaces. The Systems department shares server infrastructure with the Digital 
Research and Curation Center (DRCC), and collaborates closely with the DRCC 
systems administrator.

Primary Duties and Responsibilities:
* Installing, upgrading and patching operating systems; installing, upgrading 
and maintaining server hardware and peripheral devices (disk arrays, tape 
libraries).
* Working with other systems administrators and programmers to proactively and 
appropriately monitor hardware, operating systems, and applications in support 
of services provided by Systems department.
* Providing support to programmers in selecting, packaging, deploying and 
configuring applications across a diverse server environment.
* Managing system backup and recovery across all supported servers.
* Supporting a virtual machine infrastructure as well as stand-alone servers.
* Troubleshooting problems across several areas, including application, 
network, OS, hardware.
* Installing, configuring, maintaining and providing security for all 
Linux/Unix systems and peripheral devices.
* Installing and maintaining small to mid-range UPS equipment.
* Configuring and managing infrastructure services, which include DNS, DHCP, 
SMTP, SSH, FTP and SMB services and software; web servers; servlet containers; 
database software (MySQL, Postgres, MSSQL).
* Serving as the point of contact for software and hardware vendors and 
vendors' technical support staff.
* Participating in the analysis and planning of systems and services, including 
recommending server configurations and purchasing.
* Serving as the liaison to the University IT community on issues related to 
Unix/Linux and systems administration.
* Participating in the Systems Office 24x7 on-call plan – includes being 
available by cell phone and participating in the on-call pager rotation.
* Sharing responsibility for the physical and server environment in the data center.
* Programming support for optimizing system performance.
* Identifying areas for improvement in server and/or application management, 
and proposing/implementing solutions to improve processes.

Qualifications:
* Bachelor’s degree and five years related experience required. Additional 
education may substitute for required experience and additional related 
experience may substitute for required education, to the extent permitted by 
the JHU equivalency formula.
* The candidate will support a variety of applications and services running on 
Linux, Unix (Solaris), and Windows. Individual must work closely with other 
staff in the Library Systems department, DRCC, central IT department, and with 
external vendors and developers. Excellent oral and written communication and 
interpersonal skills are essential. Position may require lifting of materials 
less than 50 pounds occasionally.

Preferred Qualifications:
* Working experience with a virtual machine framework, such as XenServer; 
experience with Windows AD; experience with deploying software packages; 
experience with Tomcat, MySQL and PostgreSQL; programming experience in Unix 
shells, Ruby, Java, and Perl; and knowledge or experience with libraries are 
desirable.

The Sheridan Libraries encompass the Milton S. Eisenhower Library and its 
collections at the John Work Garrett Library, the George Peabody Library, the 
Albert D. Hutzler Reading Room, and the DC Centers. Its primary constituency is 
the students and faculty in the schools of Arts & Sciences, Engineering, Carey 
Business School and the School of Education. A key partner in the academic 
enterprise, the library is a leader in the innovative application of 
information technology and has implemented notable diversity and organizational 
development programs. The Sheridan Libraries are strongly committed to 
diversity. A strategic goal of the Libraries is to 'work toward achieving 
diversity when recruiting new and promoting existing staff.' The Libraries 
prize initiative, creativity, professionalism, and teamwork. For information on 
the Sheridan Libraries, visit www.library.jhu.edu .


[CODE4LIB] Job opening in Atlanta - U.S. Court of Appeals, 11th Circuit

2011-02-22 Thread Carol Bean
This is primarily a technology training position, within the Circuit
Library, but will also involve technology development. Yeah, you'd have to
work with me, but don't hold that against the job! ;-)

http://www.ca11.uscourts.gov/hr/listings/Information_Services_Specialist_2-2011.pdf

-- 
Carol Bean
beanwo...@gmail.com


Re: [CODE4LIB] A suggested role for text mining in library catalogs?

2011-02-22 Thread Jodi Schneider
On Tue, Feb 22, 2011 at 3:02 PM, Erik Hatcher  wrote:
> Solr _can_ use stemming, but to do it with POS would be flakey I'd think.  Is 
> "work" a verb or noun?

First you detect POS on tokens, *then* you stem. The other way around
wouldn't work.
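
A minimal sketch of that ordering, assuming a toy tagger and a toy stemmer (both hypothetical stand-ins for real tools, written only to show the pipeline shape). The tag is decided from the surface form in context, and only afterwards is the token reduced to its stem, so the tag survives alongside the stem.

```python
DETERMINERS = {"the", "a", "an"}
PRONOUNS = {"i", "you", "he", "she", "we", "they"}


def toy_tag(tokens):
    """Tag each token using only the word before it (a toy heuristic)."""
    tagged = []
    previous = None
    for token in tokens:
        word = token.lower()
        if word in DETERMINERS:
            tag = "DET"
        elif word in PRONOUNS:
            tag = "PRON"
        elif previous in DETERMINERS:
            tag = "NOUN"
        elif previous in PRONOUNS or previous == "to":
            tag = "VERB"
        else:
            tag = "X"  # unknown
        tagged.append((word, tag))
        previous = word
    return tagged


def toy_stem(word):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tag_then_stem(tokens):
    """Tag the surface forms first, then stem, keeping the tag with the stem."""
    return [(toy_stem(word), tag) for word, tag in toy_tag(tokens)]


print(tag_then_stem(["the", "works"]))  # [('the', 'DET'), ('work', 'NOUN')]
print(tag_then_stem(["she", "works"]))  # [('she', 'PRON'), ('work', 'VERB')]
```

Both inputs end up with the stem "work", but each keeps the tag it earned in context, which is the information a POS-aware ranker would need.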

-Jodi

PS-I loved your "When Solr is your hammer..." post on randomly
choosing names, Erik!


Re: [CODE4LIB] A suggested role for text mining in library catalogs?

2011-02-22 Thread Eric Lease Morgan
On Feb 22, 2011, at 9:02 AM, Cindy Harper wrote:

> It's not ironic - my post was musing inspired by your work.  I guess I wasn't 
> sure if I understood your results. You were looking at the overall POS usage 
> in the entire texts as a possible way of ranking the texts. I was wondering 
> about POS of particular search terms - those that could take on several 
> POS


Initially I wanted to see if I could classify works based on their POS usage. 
[1] I was hoping to find lots of action verbs in one work and call it an action 
story. I was hoping to find lots of nouns in another story and call it... I 
don't know, something else. Instead, after rudimentary investigation, I 
discovered that all of the works I analyzed had the same relative percentage 
of nouns, pronouns, verbs, adverbs, adjectives, etc. Maybe such a thing is 
indicative of the English language.
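
To make that comparison concrete, here is a small sketch, assuming each work has already been reduced to a list of POS tags by some tagger. The tag sequences below are made up for illustration and are not the data from my analysis; the function names are hypothetical.

```python
from collections import Counter


def pos_distribution(tags):
    """Return each tag's share of all tags, as a percentage."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100.0 * n / total for tag, n in counts.items()}


def max_difference(dist_a, dist_b):
    """Largest absolute gap, in percentage points, across all tags."""
    tags = set(dist_a) | set(dist_b)
    return max(abs(dist_a.get(t, 0.0) - dist_b.get(t, 0.0)) for t in tags)


# Made-up tag sequences standing in for two tagged works.
work_a = ["NOUN"] * 5 + ["VERB"] * 3 + ["ADJ"] * 2
work_b = ["NOUN"] * 9 + ["VERB"] * 7 + ["ADJ"] * 4

dist_a = pos_distribution(work_a)
dist_b = pos_distribution(work_b)
print(dist_a)
print(dist_b)
print(max_difference(dist_a, dist_b))
```

A small maximum gap across many works would be one way to quantify "the same relative percentage"; what counts as small is a judgment call.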

On the other hand, I did notice a difference in the use of particular pronouns 
between works. In Walden by Thoreau, a story about an individual living on the 
banks of a "pond", there was a lot of use of the word "I", but in a different 
story, where the author and his brother canoe down a river, the word "we" 
predominated. Similarly, three Jane Austen stories have many words like "she" 
and "her" where those words are less frequent in the works by Thoreau. While my 
analysis was trivial and thin, I think we might be able to classify some works 
by gender or speaking voice. 
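
A rough sketch of that kind of pronoun profile, assuming plain text input and a hand-picked grouping of pronouns (the groups and function names are hypothetical, and the sample sentences are invented, not quotations from the works).

```python
import re
from collections import Counter

# Hypothetical grouping of pronouns by "speaking voice".
PRONOUN_GROUPS = {
    "first_singular": {"i", "me", "my", "mine", "myself"},
    "first_plural": {"we", "us", "our", "ours", "ourselves"},
    "third_feminine": {"she", "her", "hers", "herself"},
    "third_masculine": {"he", "him", "his", "himself"},
}


def pronoun_profile(text):
    """Count how many words in the text fall into each pronoun group."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for group, members in PRONOUN_GROUPS.items():
            if word in members:
                counts[group] += 1
    return counts


def dominant_group(text):
    """Return the most frequent pronoun group, or None if no pronouns occur."""
    counts = pronoun_profile(text)
    return counts.most_common(1)[0][0] if counts else None


# Invented sample sentences, for illustration only.
print(dominant_group("I went to the woods and I built my house."))
print(dominant_group("She told her sister that she would write."))
```

On real texts the raw counts would need normalizing by length before comparing works.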

Similar things may be possible with other parts-of-speech, like adjectives, 
specifically colors. For example 214 of the 117,540 words in Walden (0.18%) are 
color words. [2] But only 13 of 121,917 words in Pride and Prejudice (0.01%) are 
color words. [3] Despite the similar lengths of the works, Walden is roughly 17 
times more "colorful" than Pride. Interesting? This only raises other questions. 
Is 0.18% a high value or a low value? Is the relative use of colors similar 
within a particular author or not? Has the use of color changed over time, or is 
it indicative of genre? Does the use of specific colors actually denote mood?
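
The arithmetic is simple enough to sketch. The word counts below are the ones quoted in this message; the `COLOR_WORDS` list and the `color_share` helper are assumptions for illustration, since the exact vocabulary behind those counts is not spelled out here.

```python
import re

# Hypothetical color vocabulary; the list behind the quoted counts is not given.
COLOR_WORDS = {"red", "orange", "yellow", "green", "blue", "purple",
               "brown", "black", "white", "gray", "grey"}


def color_share(text, colors=COLOR_WORDS):
    """Return (color_word_count, total_word_count, fraction) for a text."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = sum(1 for word in words if word in colors)
    fraction = hits / len(words) if words else 0.0
    return hits, len(words), fraction


# Counts reported in this message.
walden = 214 / 117_540
pride = 13 / 121_917
print(f"Walden: {walden:.2%}  Pride: {pride:.2%}  ratio: {walden / pride:.1f}")
# Walden: 0.18%  Pride: 0.01%  ratio: 17.1
```

Computed from the raw counts rather than the rounded percentages, the ratio comes out at about 17.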

In the past libraries did not have a whole lot of full text in order to 
evaluate content. That is not true nowadays. It is now possible to literally 
count and measure a book's characteristics. Since this metadata is numeric in 
nature, it lends itself to visualization. (Think Karen C's presentation at 
Code4Lib.) And this whole thing is good fodder for search, discovery, and 
evaluation. Too much of our metadata is qualitative.


[1] forays into POS - http://bit.ly/aM2eZx
[2] color words in Walden - http://t.co/hlg5ibL
[3] color words in Pride - http://t.co/VflNf3n

-- 
Eric Lease Morgan