You have several problems to solve here.

1) Can you find the price? Is it in text or in structured data? If it is in text you have an NLP problem. You can use a regex for the price.
2) How do you associate a price with the object? There may be several money amounts in the ad. Some people do this by proximity, i.e. how many characters away from the item id the price is.
3) Can you find the item id? Some say iphone, some iPhone, some "iPhone 6 plus", and it gets worse for things with lots of numbers and modifiers in the name, like "super whiz bang deLux 5G XLS". The right level of de-duplication vs fragmentation is a deep and hard problem.
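As a rough sketch of points 1) to 3), here is what the regex and proximity ideas might look like in plain Python. This is only an illustration and has nothing to do with Mahout. It assumes prices are written with a leading currency symbol and that you already know the item name you are looking for. The names PRICE_RE, find_prices, normalize and nearest_price are made up for this example.

```python
import re

# Assumption: a price is a currency symbol followed by digits, with optional
# thousands separators and cents, e.g. "$1,299.00" or "$450".
PRICE_RE = re.compile(r"[$£€]\s?(\d{1,3}(?:,\d{3})+|\d+)(?:\.(\d{1,2}))?")


def find_prices(text):
    """Return a list of (amount, character_offset) for each money amount found."""
    prices = []
    for m in PRICE_RE.finditer(text):
        whole = m.group(1).replace(",", "")
        cents = m.group(2) or "0"
        prices.append((float(f"{whole}.{cents}"), m.start()))
    return prices


def normalize(name):
    """Lowercase and collapse punctuation/whitespace so name variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def nearest_price(text, item_name):
    """Return the amount closest (in characters) to the first mention of item_name."""
    mention = re.search(re.escape(item_name), text, flags=re.IGNORECASE)
    prices = find_prices(text)
    if mention is None or not prices:
        return None
    # Proximity heuristic: smallest character distance between price and mention.
    amount, _offset = min(prices, key=lambda p: abs(p[1] - mention.start()))
    return amount


ad = "Leather case $15. Also selling iPhone 6 Plus, $450 firm."
print(nearest_price(ad, "iphone 6 plus"))
```

The proximity rule is crude and will pick the wrong amount when an accessory price sits closer to the item name than the real price, so treat it as a baseline to measure against rather than a solution. Likewise, normalize only handles case and punctuation; it does nothing about the harder question of whether "iPhone 6" and "iPhone 6 plus" should be the same item.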
How much of this is an NLP problem, and what structure does the data have? Unless I misunderstand your problem, extracting the data will be the hardest part, and that is not something Mahout can help with.

On Aug 12, 2015, at 5:49 AM, David Kaplan <[email protected]> wrote:

Hi all,

Hope someone can please point me in the right direction. I am very new to Mahout.

Here's my scenario: I have written a system that collects classifieds items from multiple websites (phones, cars, antiques and many more) using Scrapy. All the items are then ingested into Solr, roughly 3 million entries. This is the backend for my search engine.

I want to be able to extract meaningful information to accurately calculate a realistic average price etc. I need guidance, perhaps examples, in accurate outlier detection, categorization etc. I am an extreme beginner in machine learning, so I need to know if that's what I should be using.

Part of my challenge is the broad range of items/categories, different levels of skewed data etc., e.g. finding outliers in "iphone" results when many of those are cheap iPhone accessories.

Basically it seems I need to cluster/classify, but I am not sure exactly how to go about it, because I already have the categories for 500K of the entries, for example the category "Cell Phones & Accessories - Accessories".

And then actually connecting Mahout to Solr...

Many thanks!
David
