[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836543#action_12836543 ] Ankur commented on MAHOUT-305: -- Sean, Thanks for filing the jira. Nothing points from our

Re: Algorithm implementations in Pig

2010-02-22 Thread Jeff Zhang
Hi, Glad to hear that Mahout devs are interested in Pig. Actually, I believe Pig is very helpful when you want to quickly implement a prototype of machine learning algorithms. And Pig has a Java API, so it is easy to integrate Pig scripts with Java. Maybe we can start with implementing NB using

Re: Algorithm implementations in Pig

2010-02-22 Thread Ted Dunning
I see pig as useful for data preparation, but for any numerical tasks, it is likely to be completely hopeless. On Mon, Feb 22, 2010 at 12:16 AM, Jeff Zhang zjf...@gmail.com wrote: Glad to hear that Mahout devs are interested in Pig. Actually, I believe Pig is very helpful when you want

Re: Algorithm implementations in Pig

2010-02-22 Thread Jeff Zhang
Pig can only make the implementation of map-reduce easier; the numerical computation can be done in a UDF. And piglet is a DSL on top of Pig Latin which makes Pig support loops. http://github.com/iconara/piglet On Mon, Feb 22, 2010 at 4:25 PM, Ted Dunning ted.dunn...@gmail.com wrote: I see pig as

Re: Algorithm implementations in Pig

2010-02-22 Thread Robin Anil
On Mon, Feb 22, 2010 at 1:55 PM, Ted Dunning ted.dunn...@gmail.com wrote: I see pig as useful for data preparation, but for any numerical tasks, it is likely to be completely hopeless. PIG will be a great tool to experiment quickly on algorithms. But, with people here trying to focus on

Re: Algorithm implementations in Pig

2010-02-22 Thread Ankur C. Goel
Ted, The latest pig release 0.6.0 on hadoop 20 is a clear winner not just for performance but also for doing a better job of managing memory in its MR job pipeline. Also support for both inner and outer skewed join is something that I found indispensable when dealing with really large

Re: Algorithm implementations in Pig

2010-02-22 Thread Grant Ingersoll
I'm all for Pig, especially once we are a TLP. I haven't had the proper time to review the PLSI implementation, but it looks useful. I agree on the other points, though, in that I think it would be nice to have consistent formats based on Vector so that things can be more portable. On

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836598#action_12836598 ] Robin Anil commented on MAHOUT-300: --- We should be multiplying using sparsity instead of
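The comment above is cut off, so the exact proposal is not visible here. As a rough illustration of what "multiplying using sparsity" can mean in general, the sketch below computes a dot product by walking only the non-zero entries of the smaller vector. The class name SparseDot and the map-based representation are assumptions made for this example; this is not Mahout's Vector API.

```java
import java.util.Map;

// Hypothetical sketch: sparse vectors are modelled as index -> value maps
// that hold only non-zero entries. This is not Mahout's Vector API.
final class SparseDot {
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        // Iterate over the vector with fewer non-zeros and probe the other,
        // so the cost scales with the sparser operand.
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }
}
```

Indices missing from either map contribute nothing, which is the point: zero entries are never visited.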

Re: Algorithm implementations in Pig

2010-02-22 Thread David Stuart
Seems like the guys at Twitter are going down the Pig/Hadoop route (http://highscalability.com/blog/2010/2/19/twitters-plan-to-analyze-100-billion-tweets.html). Could be worth getting them on board the Mahout wagon, especially given the previous discussion about classification efforts

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836624#action_12836624 ] Robin Anil commented on MAHOUT-300: --- I think the irregularity is due to the sparse vector

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836630#action_12836630 ] Robin Anil commented on MAHOUT-300: --- Ted, your loop structure seems to be slower by about

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836633#action_12836633 ] Sean Owen commented on MAHOUT-300: -- Tiny comment -- will probably be wise to use BitSet

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836649#action_12836649 ] Robin Anil commented on MAHOUT-300: --- On dense data 1000, 1000 {noformat} BenchMarks

[jira] Updated: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-300: -- Attachment: MAHOUT-300.patch Solve performance issues with Vector Implementations

[jira] Updated: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-300: -- Attachment: MAHOUT-300.patch Increased loop by 3x to give more stability to perf values Solve

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1283#action_1283 ] Ankur commented on MAHOUT-305: -- Hey Sean, Have you played with the Netflix dataset?

[jira] Assigned: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur reassigned MAHOUT-305: Assignee: Ankur Combine both cooccurrence-based CF M/R jobs ---

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836679#action_12836679 ] Robin Anil commented on MAHOUT-300: --- I found the anomaly Jake was talking about. It was

[jira] Updated: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-300: -- Attachment: MAHOUT-300.patch Solve performance issues with Vector Implementations

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836689#action_12836689 ] Sean Owen commented on MAHOUT-305: -- Yes there are some prolific users. I don't have

[jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-304: -- Attachment: MAHOUT-304.patch Jeff, Meanshift uses only ids generated by the mapper to keep vector

Re: Algorithm implementations in Pig

2010-02-22 Thread Ted Dunning
Actually, no. I meant other programs written in pure Java. It used to be that the very restricted scripting ability of Pig made processing chains composed of Pig and map-reduce programs very brittle. In fact, just gluing together multiple Pig programs used to be very ugly. On Mon, Feb 22, 2010

Re: Algorithm implementations in Pig

2010-02-22 Thread Ted Dunning
Has the interface for writing UDFs stabilized? For quite some time, the UDF API was changing every 3 months. On Mon, Feb 22, 2010 at 12:35 AM, Jeff Zhang zjf...@gmail.com wrote: Pig can only make the implementation of map-reduce easier; the numerical computation can be done in a UDF. --

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836706#action_12836706 ] Jake Mannix commented on MAHOUT-300: The sparse data is odd... (-vs 50 -sp 5000)

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836713#action_12836713 ] Robin Anil commented on MAHOUT-300: --- Can I commit the latest? If you don't have any

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836725#action_12836725 ] Ankur commented on MAHOUT-305: -- Typically when doing train-test data split, we divide the data

Re: Algorithm implementations in Pig

2010-02-22 Thread Ankur C. Goel
In the next pig release (0.7) Pig's load/store func would be moving to use hadoop's input/output format. So there are some changes planned for that - http://wiki.apache.org/pig/Pig070IncompatibleChanges After that I don't expect any interface level change in UDF. -...@nkur On 2/22/10 10:10

Re: Algorithm implementations in Pig

2010-02-22 Thread Ankur C. Goel
I agree with you and while some of that has been remedied, I wouldn't say things are perfect. Scripting ability while still limited has better streaming support so you can have relations streamed into a custom script executing in either map or reduce phase depending upon where it is placed. If

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836733#action_12836733 ] Sean Owen commented on MAHOUT-305: -- Say I've made the following ratings: 5 stars: Harry

Re: Algorithm implementations in Pig

2010-02-22 Thread Ted Dunning
That isn't an issue here. It is the invocation of pig programs and passing useful information to them that is the problem. On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel gan...@yahoo-inc.com wrote: Scripting ability while still limited has better streaming support so you can have relations

Re: Algorithm implementations in Pig

2010-02-22 Thread Ted Dunning
As an interesting test case, can you write a pig program that counts words? BUT, it takes an input file name AND an input field name. On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning ted.dunn...@gmail.com wrote: That isn't an issue here. It is the invocation of pig programs and passing useful

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836815#action_12836815 ] Jake Mannix commented on MAHOUT-300: With these opts: -vs 50 -sp 500 -nv 50 -l 500

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836817#action_12836817 ] Ted Dunning commented on MAHOUT-300: These are getting respectable! As a quick hack,

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836818#action_12836818 ] Jake Mannix commented on MAHOUT-300: agreed, Ted. I'm liking that we're getting

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836819#action_12836819 ] Robin Anil commented on MAHOUT-300: --- Seq.rand and rand.seq should get the same perf level

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836826#action_12836826 ] Jake Mannix commented on MAHOUT-300: and now that my run (of three comments ago) is

Re: [jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable

2010-02-22 Thread Jeff Eastman
If the Vector-MSCanopy pre-job outputs all of its canopies then each of those canopies would contain the generated canopyId and its canopy center would contain the original vector with its docId. Seems like one could use that data set to get the membership information in a separate

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836839#action_12836839 ] Robin Anil commented on MAHOUT-300: --- {noformat} seq.seq= 46,855 rand.seq = 37,397

Re: [jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable

2010-02-22 Thread Robin Anil
after the List<Vector> -> List<canopyId> optimization. I did that in the patch. Take a look :)

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836848#action_12836848 ] Robin Anil commented on MAHOUT-300: --- {noformat} rand.rand = 14,435 dense.rand = 9,172

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836909#action_12836909 ] Jake Mannix commented on MAHOUT-300: New benchmark additions: {code}INFO: BenchMarks

[jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-304: -- Affects Version/s: (was: 0.3) 0.4 Fix Version/s: (was: 0.3)

[jira] Updated: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-22 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix updated MAHOUT-301: --- Attachment: MAHOUT-301.patch Fancy new version. Run as follows: Set your $MAHOUT_CONF_DIR to a

[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-22 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836952#action_12836952 ] Jake Mannix commented on MAHOUT-301: Oh, I forgot to finish my sentence which began run

[jira] Created: (MAHOUT-306) Profile and improve perfomance of algorithms based on vectors

2010-02-22 Thread Robin Anil (JIRA)
Profile and improve perfomance of algorithms based on vectors - Key: MAHOUT-306 URL: https://issues.apache.org/jira/browse/MAHOUT-306 Project: Mahout Issue Type: Improvement

[jira] Resolved: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil resolved MAHOUT-300. --- Resolution: Fixed Assignee: Robin Anil Solve performance issues with Vector Implementations

[jira] Updated: (MAHOUT-306) Profile and improve performance of algorithms based on vectors

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-306: -- Summary: Profile and improve performance of algorithms based on vectors (was: Profile and improve

[jira] Updated: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-300: -- Issue Type: Sub-task (was: Improvement) Parent: MAHOUT-306 Solve performance issues with

[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836962#action_12836962 ] Robin Anil commented on MAHOUT-301: --- The help comments are missing from the mahout/bin

Look! No more ISSUES

2010-02-22 Thread Robin Anil
Waiting for 301 to get committed. https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310751&styleName=Html&version=12314281 PMCs, it's in your hands now :D Robin

Re: [jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable

2010-02-22 Thread Jeff Eastman
Robin Anil wrote: after the List<Vector> -> List<canopyId> optimization. I did that in the patch. Take a look :) +1 Simply marvelous

Re: more svn:ignore

2010-02-22 Thread Drew Farris
Ok, I've committed the ignores for .classpath, .project, .settings created by Eclipse and a couple target directories that hadn't been excluded. I'll get the IDEA stuff on another pass once I figure out how to do global wildcard ignores. On Sun, Feb 21, 2010 at 7:53 AM, Sean Owen sro...@gmail.com

[jira] Updated: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-22 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-301: --- Attachment: MAHOUT-301-drew.patch Did some testing, here's a patch to clean some of these things up

Re: Algorithm implementations in Pig

2010-02-22 Thread Ankur C. Goel
Those would be passed as parameters either through the -param option or through a parameter file with the -param_file option, and Pig's preprocessor just substitutes the values in your script. Since it's just a blind parameter substitution, in my shingling script I even had the schema definition
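To make the "blind parameter substitution" idea concrete, here is a minimal Java sketch that replaces $NAME placeholders in a script template with supplied values, roughly what the message describes the preprocessor doing for -param values. It is an illustration only and not Pig's actual implementation; the class name ParamSubstitution, the placeholder pattern, and the error handling are assumptions made for this example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of blind $NAME substitution in a script template.
// It mimics the behaviour described in the thread; it is not Pig's code.
final class ParamSubstitution {
    // Assumed placeholder syntax: '$' followed by an identifier.
    private static final Pattern PARAM =
            Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");

    static String substitute(String script, Map<String, String> params) {
        Matcher m = PARAM.matcher(script);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = params.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException(
                        "Undefined parameter: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in values literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With a word-count template such as "A = LOAD '$INPUT' AS ($SCHEMA); B = FOREACH A GENERATE FLATTEN(TOKENIZE($FIELD)) AS word;", supplying INPUT, SCHEMA and FIELD yields a concrete script. Because the substitution is purely textual, even a schema definition can be passed in as a parameter, which matches the point made in the message above.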

Re: Algorithm implementations in Pig

2010-02-22 Thread Ted Dunning
Good answer. On Mon, Feb 22, 2010 at 8:52 PM, Ankur C. Goel gan...@yahoo-inc.com wrote: Those would be passed as parameters either through the -param option or through a parameter file with the -param_file option, and Pig's preprocessor just substitutes the values in your script. Since it's just a

[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837120#action_12837120 ] Robin Anil commented on MAHOUT-301: --- including the job jar is much cleaner than adding

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837123#action_12837123 ] Ankur commented on MAHOUT-305: -- With co-occurrence analysis we are dropping ratings. So if
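For readers unfamiliar with the term, the sketch below shows one simple in-memory form of co-occurrence counting in which rating values play no part: only the fact that a user touched two items is counted. It is a hypothetical illustration, not the MAHOUT-305 map-reduce jobs; the class name CooccurrenceSketch, the "a|b" pair key, and the assumption that each user's item list has no duplicates are all choices made for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: count how many users have both items of each pair.
// Ratings are not an input at all, only item membership per user.
final class CooccurrenceSketch {
    static Map<String, Integer> countPairs(List<List<String>> itemsPerUser) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> items : itemsPerUser) {
            // Assumes each user's list contains no duplicate items.
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    String a = items.get(i);
                    String b = items.get(j);
                    // Order the pair so (a,b) and (b,a) share one key.
                    String key = a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
                    counts.merge(key, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

The quadratic inner loop per user is why very prolific users, mentioned earlier in the thread, can dominate the cost of this kind of computation.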

[jira] Updated: (MAHOUT-283) Update assemblies to include mahout-collections for release build

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-283: -- Fix Version/s: (was: 0.4) 0.3 Update assemblies to include mahout-collections

[jira] Updated: (MAHOUT-281) scm urls are wrong in the poms

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-281: -- Fix Version/s: (was: 0.4) 0.3 scm urls are wrong in the poms

[jira] Updated: (MAHOUT-280) Clean some redundant POM declarations

2010-02-22 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-280: -- Fix Version/s: (was: 0.4) 0.3 Clean some redundant POM declarations