Re: Advice

2005-02-06 Thread Ken Williams
Aha, yes. AI::Categorizer lets you customize the tokenization behavior to be however you want, by subclassing the Document class and overriding the tokenize() method. You could do something like this: { package My::Documents; @ISA = qw(AI::Categorizer::Document::Text); sub tokenize {

Re: Advice

2005-02-05 Thread Jason Armstrong
Thanks for all the good feedback; I'll certainly be following up on it. I did find one reason why I wasn't getting good matches ... when I looked more carefully at the Perl data structure, I found that the 'features' hash only contained alphabetic characters. So, for example, in the string 'WARRIO
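
To illustrate the effect described in this message, here is a minimal standalone sketch contrasting alphabetic-only tokenization with one that also keeps digits. The function names and the sample string are hypothetical, and this is not AI::Categorizer's actual tokenizer, only an approximation of the behavior reported above.

```perl
use strict;
use warnings;

# Hypothetical helper: keep runs of letters only, so digits never
# become features (the behavior reported in the message above).
sub tokenize_alpha {
    my ($text) = @_;
    my @tokens = $text =~ /([A-Za-z]+)/g;
    return [ map { lc $_ } @tokens ];
}

# Hypothetical helper: keep runs of letters and digits, so model
# numbers and engine sizes survive as features.
sub tokenize_alnum {
    my ($text) = @_;
    my @tokens = $text =~ /([A-Za-z0-9]+)/g;
    return [ map { lc $_ } @tokens ];
}

# Made-up sample description, for illustration only.
my $sample = 'GOLF 1.6 GTI';
print join(' ', @{ tokenize_alpha($sample) }), "\n";
print join(' ', @{ tokenize_alnum($sample) }), "\n";
```

With the alphabetic-only version the numeric parts of a description vanish, which would make descriptions that differ only in their numbers indistinguishable.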

Re: Advice

2005-02-05 Thread Ken Williams
On Feb 5, 2005, at 1:26 AM, Richard Jelinek wrote: True true. And while this is true, the reports about nonfunctional SVM are also true. At least I can confirm them and have mentioned them here some time ago already. What can/will "we" do about this? Oh yes, sorry I forgot to address this in my mes

Re: Advice

2005-02-04 Thread Richard Jelinek
Hi Ken, On Fri, Feb 04, 2005 at 08:36:10PM -0600, Ken Williams wrote: > What this means is that in order to use AI::Categorizer in the obvious > way for this project, you're going to have to get your hands on some > training data that has the same statistical properties as what you'll > see at

Re: Advice

2005-02-04 Thread Ken Williams
ueries" are the noisy strings you're trying to clean up. Sometimes that works pretty well. Or you could try the Levenshtein edit distance that Samy suggested. Or you could try something else that you invent. =) -Ken On Feb 4, 2005, at 4:18 AM, Jason Armstrong wrote: Perhaps someo
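
For the Levenshtein suggestion, a minimal pure-Perl sketch might look like the following. It is one possible way to pick the nearest valid description for a noisy string, not code from this thread; the names `levenshtein` and `closest_match` and the sample data are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Dynamic-programming Levenshtein distance, keeping only two rows.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);
    for my $i (1 .. @s) {
        my @cur = ($i);
        for my $j (1 .. @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $cur[$j] = min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # substitution
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Return the valid description nearest to a noisy query, with its
# distance. Comparison is case-insensitive; ties keep the first hit.
sub closest_match {
    my ($query, @valid) = @_;
    my ($best, $best_dist);
    for my $candidate (@valid) {
        my $d = levenshtein(lc $query, lc $candidate);
        ($best, $best_dist) = ($candidate, $d)
            if !defined $best_dist || $d < $best_dist;
    }
    return ($best, $best_dist);
}

# Made-up sample data, for illustration only.
my @valid = ('TOYOTA COROLLA', 'TOYOTA CAMRY');
my ($match, $dist) = closest_match('TOYTA COROLA', @valid);
print "$match ($dist)\n";
```

A linear scan like this costs one distance computation per valid description, so with a list of several thousand entries it may be worth rejecting matches above some distance threshold rather than always accepting the nearest one.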

Re: Advice

2005-02-04 Thread Marco Baroni
It is not in Perl, but SVMlight (http://svmlight.joachims.org/) offers a very efficient (in my experience) C implementation of support vector machines, and, being a command-line tool, it's easy to interface it to Perl. Regards, Marco -- Marco Baroni SSLMIT, University of Bologna http://sslm

Re: Advice

2005-02-04 Thread Tim Allwine
Jason Armstrong wrote: ... I've been looking at AI::Categorizer. I have a list of all valid vehicle descriptions (about 8200). I create for each of these a knowledge set, with the content the same as the category: Briefly: my $c = new AI::Categorizer( knowledge_set => AI::Categorizer:

Re: Advice

2005-02-04 Thread Samy Kamkar
wrote: Perhaps someone on this list has some good advice for me. I am working on a project that imports vehicle descriptions. Very often, the data capturers give invalid information, or mistyped data. I am looking for a way to intelligently reformat the data, and add the mistyped entry for future

Advice

2005-02-04 Thread Jason Armstrong
Perhaps someone on this list has some good advice for me. I am working on a project that imports vehicle descriptions. Very often, the data capturers give invalid information, or mistyped data. I am looking for a way to intelligently reformat the data, and add the mistyped entry for future use. I