You have a lot of problems to solve here.

1) Can you find the price? Is it in text or in structured data? If it is in 
text, you have an NLP problem, though a regex will find most money amounts.
2) How do you associate a price with the object? There may be several money 
amounts in the ad. Some people do this by proximity: how many characters away 
from the item id is the price?
3) Can you find the item id? Some ads say iphone, some iPhone, some "iPhone 6 
plus", and it gets worse for things with lots of numbers and modifiers in the 
name, like "super whiz bang deLux 5G XLS". The right level of de-duplication 
vs. fragmentation is a deep and hard problem.
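
To make the three points above concrete, here is a rough Python sketch. It is 
only an illustration under assumptions of mine, not something Mahout or Solr 
provides: the price pattern (a currency symbol followed by digits), the 
function names find_prices, price_for_item and canonical_key, and the sample 
ad are all made up for the example.

```python
import re

# Assumed price format: a currency symbol, then digits with optional
# thousands separators and optional cents, e.g. "$1,299.00" or "$25".
PRICE_RE = re.compile(r"[$€£]\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?")


def find_prices(text):
    """Return (character offset, amount) for every price-like match."""
    prices = []
    for m in PRICE_RE.finditer(text):
        amount = float(m.group(1).replace(",", "") + (m.group(2) or ""))
        prices.append((m.start(), amount))
    return prices


def price_for_item(text, item_name):
    """Pick the price closest (in characters) to the first mention of item_name."""
    hit = re.search(re.escape(item_name), text, flags=re.IGNORECASE)
    prices = find_prices(text)
    if hit is None or not prices:
        return None
    return min(prices, key=lambda p: abs(p[0] - hit.start()))[1]


def canonical_key(name):
    """Crude item key: lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", name.lower()))


# Hypothetical ad text with several money amounts in it.
ad = "Selling my iPhone 6 Plus for $450, case included (worth $25). Shipping $10."

print(find_prices(ad))
print(price_for_item(ad, "iphone 6 plus"))
print(canonical_key("iPhone  6-Plus"))
```

Proximity is only a heuristic: if an accessory price happens to sit nearer to 
the item name than the asking price, this picks the wrong amount. Likewise 
canonical_key only removes case and punctuation differences; it does not decide 
whether "iPhone 6" and "iPhone 6 plus" should count as the same item, which is 
the hard part of point 3.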

How much of this is an NLP problem, and what structure does the data have? 
Unless I misunderstand your problem, extracting the data will be the hardest 
part, and that is not something Mahout can help with.

On Aug 12, 2015, at 5:49 AM, David Kaplan <[email protected]> wrote:

Hi all,
I hope someone can point me in the right direction.
I'm very new to Mahout.
Here's my scenario:

I have written a system that uses scrapy to collect classifieds items from
multiple websites: phones, cars, antiques and many more. All the items are
then ingested into Solr, roughly 3 million entries.
This is the backend for my search engine.

I want to be able to extract meaningful information so that I can accurately
calculate a realistic average price, etc. I need guidance, and perhaps
examples, on accurate outlier detection, categorization, etc. I am an extreme
beginner in machine learning, so I need to know whether that is what I should
be using.

Part of my challenge is the broad range of items/categories, different
levels of skewed data, etc. For example, finding outliers in "iphone" results
when many of those results are cheap iphone accessories.

Basically it seems I need to cluster/classify, but I am not sure exactly how
to go about it, because I already have the categories for 500K of the
entries, for example the category "Cell Phones & Accessories - Accessories".

And then actually connecting Mahout to Solr...

Many thanks!
David
