seq2sparse in 0.8 throwing class not found for analyzers

2013-04-22 Thread Chris Harrington
Hi all, I'm trying to run the seq2sparse tool with one of the Lucene analyzers, but it throws a class-not-found exception: mahout seq2sparse -i ./contentDataDir/sequenced -o ./contentDataDir/sparseVectors --namedVector -wt tf -a org.apache.lucene.analysis.EnglishAnalyzer

Re: Does it make sense to use Mahout for text classification when I have a huge number of documents but a small number of labels?

2013-04-22 Thread Ryan Josal
Ryan, Hadoop computes splits from the min split size, as Matt mentioned, together with the max split size and dfs.block.size. You can calculate the split size from those as max(minSplit, min(maxSplit, blockSize)). I have found that for CPU-intensive operations on smaller data sets, like I was doing with
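As a rough sketch of the formula quoted above, the shell function below evaluates max(minSplit, min(maxSplit, blockSize)) for sizes given in bytes. The function name `split_size` and the sample values (a 1-byte min split, a 16 MB max split, a 64 MB block) are illustrative assumptions and do not come from the thread.

```shell
# split_size MIN_SPLIT MAX_SPLIT BLOCK_SIZE  (all in bytes)
# Hypothetical helper: prints max(minSplit, min(maxSplit, blockSize)).
split_size() {
  min_split=$1
  max_split=$2
  block_size=$3

  # inner = min(maxSplit, blockSize)
  if [ "$max_split" -lt "$block_size" ]; then
    inner=$max_split
  else
    inner=$block_size
  fi

  # result = max(minSplit, inner)
  if [ "$min_split" -gt "$inner" ]; then
    echo "$min_split"
  else
    echo "$inner"
  fi
}

# Assumed example values: 1-byte min split, 16 MB max split, 64 MB block.
split_size 1 16777216 67108864   # prints 16777216
```

With these assumed inputs the max split size is below the block size, so it becomes the effective split size; smaller splits generally mean more map tasks for the same input.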

Re: seq2sparse in 0.8 throwing class not found for analyzers

2013-04-22 Thread Suneel Marthi
Chris, Thanks for reporting this. I am able to replicate this problem with trunk. Created MAHOUT-1195 to track this; I'll take a look sometime today. From: Chris Harrington ch...@heystaks.com To: user@mahout.apache.org Sent: Monday, April 22, 2013 6:08 AM

Re: seq2sparse in 0.8 throwing class not found for analyzers

2013-04-22 Thread Suneel Marthi
Phew... The fix for this was a DUD. In Lucene 4.2.1 the package name for this class was changed to org.apache.lucene.analysis.en.EnglishAnalyzer. Notice the 'en' in the package path. This should work. From: Chris Harrington ch...@heystaks.com To:
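A minimal sketch of the suggested workaround: the same seq2sparse invocation from the original report, with only the `-a` value changed to the class name given in the reply above. The variable `ANALYZER` and the wrapper function `run_seq2sparse` are hypothetical names used for illustration; the paths and flags are copied from the original command, and the sketch assumes `mahout` is on the PATH.

```shell
# Analyzer class name as given in the reply above (note the "en" subpackage).
ANALYZER=org.apache.lucene.analysis.en.EnglishAnalyzer

# Hypothetical wrapper: same flags as the original report, only -a differs.
run_seq2sparse() {
  mahout seq2sparse \
    -i ./contentDataDir/sequenced \
    -o ./contentDataDir/sparseVectors \
    --namedVector \
    -wt tf \
    -a "$ANALYZER"
}
```

Calling `run_seq2sparse` then runs the job. Whether this resolves the exception depends on the Lucene version bundled with the Mahout build in use.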

Does seq2sparse drop empty documents?

2013-04-22 Thread Matt Molek
I'm losing some documents when running seq2sparse. I think it's because the documents are composed of common terms, and end up having no terms at all once common words are pruned. I couldn't find documentation that this is what's supposed to be happening though, so I wanted to ask if this is

Mahout Similarity Caching

2013-04-22 Thread Gabor Bernat
Hello, I'm using Mahout in a system where the typical response time should be below 100 ms. I'm using an item-based recommender with float preference values (with Tanimoto similarity for now, which is passed into a CachingItemSimilarity object for performance reasons). My model has around 7k

Re: Mahout Similarity Caching

2013-04-22 Thread Sean Owen
49 seconds is orders of magnitude too long -- something is very wrong here, for so little data. Are you running this off a database? Or are you somehow counting the overhead of 3-4K network calls? On Mon, Apr 22, 2013 at 11:22 PM, Gabor Bernat ber...@primeranks.net wrote: Hello, I'm using

Fwd: mahout lucene.vector from multiple solrcloud index directories for kmeans

2013-04-22 Thread Sebastian Ramirez
Hello everyone, I want to know if it's possible to cluster documents in SolrCloud indices (multiple index directories) and how one would accomplish that. --- I'm using Solr 4.2.1 and Mahout 0.8-SNAPSHOT. I can cluster documents from one Lucene/Solr index. I can even cluster documents

Re: Mahout Similarity Caching

2013-04-22 Thread Gabor Bernat
Nope, and nope. Note that this is an outlier example; however, even in other cases it takes 500 ms+, which is way too much for what I need. Thanks, Bernát GÁBOR On Tue, Apr 23, 2013 at 12:53 AM, Sean Owen sro...@gmail.com wrote: 49 seconds is orders of magnitude too long -- something is