David Legg wrote:
That's pretty straightforward actually. Suppose you have the sentence "Mary had a
little lamb". If you were capturing a phrase size of 2, you would generate the
following token values in addition to the single-word tokens:
Maryhad
hada
alittle
littlelamb
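
For illustration only, here is a minimal Java sketch of the pairing described
above. The class and method names (PhraseTokens, phrases) are made up for this
example and are not taken from any existing filter code. It assumes words are
separated by whitespace and are simply concatenated, as in the list above.

```java
import java.util.ArrayList;
import java.util.List;

class PhraseTokens {
    // Joins every run of `size` adjacent words into one token,
    // e.g. "Mary" + "had" -> "Maryhad" for size 2.
    static List<String> phrases(String text, int size) {
        List<String> tokens = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || size < 1) {
            return tokens;
        }
        String[] words = trimmed.split("\\s+");
        // Slide a window of `size` words across the sentence.
        for (int i = 0; i + size <= words.length; i++) {
            StringBuilder token = new StringBuilder();
            for (int j = i; j < i + size; j++) {
                token.append(words[j]);
            }
            tokens.add(token.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [Maryhad, hada, alittle, littlelamb]
        System.out.println(phrases("Mary had a little lamb", 2));
    }
}
```

A sentence of n words yields n - 1 extra tokens at phrase size 2, so the
token count roughly doubles per message.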
Neat trick, I wonder how it works out.
The token set might get too large, though, especially with malformed MIME types.
I recommend you read Paul Graham's 'Better Bayesian Filtering' [2] (especially
the bit titled 'Tokens'). It's fascinating stuff... or maybe I'm getting too
old and geeky :-)
Sure I did, quite a while ago.
Image info needs extracting too. So things like the width, height, bit depth,
type of encoding, Exif data and any tags should all be captured.
...what would you use to extract image info?
I haven't used any graphics libraries recently but a quick scan suggests
'Commons Sanselan' [3] which happily is an Apache project now.
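
As a rough sketch of the kind of extraction discussed here, the example below
reads format, width, height and bits per pixel from an image held in memory.
Note that it uses the JDK's own javax.imageio instead of Sanselan, purely so
it runs without extra dependencies. It does not cover Exif data or tags. The
class and method names (ImageInfoTokens, describe, tinyPng) are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

class ImageInfoTokens {
    // Reads basic properties of an image. Returns an empty map when the
    // data is malformed or not in a format the JDK can decode.
    static Map<String, String> describe(byte[] data) {
        Map<String, String> info = new LinkedHashMap<>();
        try (ImageInputStream in =
                 new MemoryCacheImageInputStream(new ByteArrayInputStream(data))) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                return info; // no decoder recognises these bytes
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                info.put("format", reader.getFormatName().toLowerCase(Locale.ROOT));
                info.put("width", String.valueOf(reader.getWidth(0)));
                info.put("height", String.valueOf(reader.getHeight(0)));
                int bits = reader.getImageTypes(0).next().getColorModel().getPixelSize();
                info.put("bitsPerPixel", String.valueOf(bits));
            } finally {
                reader.dispose();
            }
        } catch (IOException e) {
            info.clear(); // treat a half-read image as unreadable
        }
        return info;
    }

    // Builds a blank PNG in memory so the example needs no file on disk.
    static byte[] tinyPng(int width, int height) {
        ImageIO.setUseCache(false); // keep everything in memory
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(img, "png", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(describe(tinyPng(4, 3)));
    }
}
```

Each map entry could then be turned into a token such as "width=4" and fed to
the filter alongside the text tokens.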
Seems easy.
The link to MetadataExample.java is broken, though :/
Well, you got it all covered.
Regards...