[jira] Updated: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value
[ https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-239: - Resolution: Fixed Assignee: Benson Margulies Status: Resolved (was: Patch Available) Committed Complete set of open hash maps with primitive types as both key and value - Key: MAHOUT-239 URL: https://issues.apache.org/jira/browse/MAHOUT-239 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 0.3 Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 0.3 Attachments: MAHOUT-239.diff Here is the template providing the hash map and the test for all the primitive type pairs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-85) Perceptron/Winnow Trainer
[ https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost resolved MAHOUT-85. Resolution: Fixed Finally committed. Perceptron/Winnow Trainer - Key: MAHOUT-85 URL: https://issues.apache.org/jira/browse/MAHOUT-85 Project: Mahout Issue Type: New Feature Components: Classification Affects Versions: 0.1 Reporter: Isabel Drost Assignee: Isabel Drost Fix For: 0.3 Attachments: MAHOUT-85.patch, MAHOUT-85.patch, perceptronWinnowTrainer.diff Please find attached a first sketch for perceptron and winnow training. Please look very, very carefully at the patch, as I added the heart of the algorithms in the emergency room at Charite Berlin (after I broke my leg when cycling to the Hadoop Get Together ;) ). The patch does not yet feature unit tests nor is it parallelised. Currently my plan is to set up an example with the webKb dataset, add unit tests to the code and after that go parallel. I would like to get some feedback early on, in addition I would feel a lot better, if a second and third pair of eyes had a look at the code to make sure all obvious mistakes are out as early as possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-240) Parallel version of Perceptron
Parallel version of Perceptron -- Key: MAHOUT-240 URL: https://issues.apache.org/jira/browse/MAHOUT-240 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 0.3 Reporter: Isabel Drost Fix For: 0.3 So far Perceptron (as well as Winnow) training is still implemented to run w/o parallelization. The goal of this issue is to explore ways for parallelization and if possible to provide a parallel version, that is one that is based on map reduce. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-241) Example for perceptron
Example for perceptron -- Key: MAHOUT-241 URL: https://issues.apache.org/jira/browse/MAHOUT-241 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 0.3 Reporter: Isabel Drost Fix For: 0.3 The goal is to provide an end-to-end example based on the 20-newsgroups dataset to show how to get from a set of labelled training examples to a trained model that can later be reused. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
SparseVectors writing out a lot of data
I have been testing out the DictionaryVectorizer on the 20news dataset. It's writing out 2GB of vector files for the 38MB dataset. This is what I am doing; tell me where I am going wrong. First I create an infinite-dimensional vector of size 10: SparseVector vector = new SparseVector(key.toString(), Integer.MAX_VALUE, 10); Then, for each (word, int id) entry in the dictionary: vector.setQuick(dictionary.get(word), weight); Finally: output.write(docid, vector) Robin
Re: SparseVectors writing out a lot of data
https://issues.apache.org/jira/secure/attachment/12429846/DictionaryVectorizer.patch Reduce = PartialVectorGenerator Class
[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-237: Attachment: DictionaryVectorizer.patch
Some tidying up. The large output bug still remains.
Map/Reduce Implementation of Document Vectorizer
Key: MAHOUT-237 URL: https://issues.apache.org/jira/browse/MAHOUT-237 Project: Mahout Issue Type: New Feature Affects Versions: 0.3 Reporter: Robin Anil Assignee: Robin Anil Fix For: 0.3 Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, DictionaryVectorizer.patch
The current Vectorizer uses a Lucene index to convert documents into SparseVectors. Ted is working on a hash-based Vectorizer which can map features into Vectors of fixed size and sum them up to get the document Vector. This is a pure bag-of-words Vectorizer written in Map/Reduce. The input documents are in a SequenceFile<Text,Text> with key = docid, value = content.
First: Map/Reduce over the document collection to generate the feature counts.
Second: a sequential pass reads the output of the Map/Reduce and converts it to a SequenceFile<Text,LongWritable> where key = feature, value = unique id. This second stage should create shards of features of a given split size.
Third: Map/Reduce over the document collection, using each shard, to create partial SparseVectors (containing only the features of the given shard).
Fourth: Map/Reduce over the partial vectors, group by docid, and create the full document Vector.
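The four stages described in MAHOUT-237 can be sketched in memory to make the data flow easier to follow. This is only an illustration under stated assumptions, not the code in DictionaryVectorizer.patch: plain Java maps stand in for the SequenceFiles and Map/Reduce jobs, the weight is assumed to be a raw term count, and every class and method name below is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the four stages; all names are hypothetical.
final class DictionaryVectorizerSketch {

    // Stage 1: count how often each feature (token) occurs in the collection.
    static Map<String, Integer> countFeatures(Map<String, String> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String content : docs.values()) {
            for (String token : content.split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Stage 2: assign a unique id to every feature and cut the dictionary
    // into shards holding at most shardSize features each.
    static List<Map<String, Integer>> buildShards(Map<String, Integer> counts, int shardSize) {
        List<Map<String, Integer>> shards = new ArrayList<>();
        Map<String, Integer> current = new LinkedHashMap<>();
        int nextId = 0;
        for (String feature : counts.keySet()) {
            if (current.size() == shardSize) {
                shards.add(current);
                current = new LinkedHashMap<>();
            }
            current.put(feature, nextId++);
        }
        if (!current.isEmpty()) {
            shards.add(current);
        }
        return shards;
    }

    // Stage 3: partial vector of one document, restricted to one shard.
    // Only features that actually occur are stored (id -> term count).
    static Map<Integer, Double> partialVector(String content, Map<String, Integer> shard) {
        Map<Integer, Double> partial = new TreeMap<>();
        for (String token : content.split("\\s+")) {
            Integer id = shard.get(token);
            if (id != null) {
                partial.merge(id, 1.0, Double::sum);
            }
        }
        return partial;
    }

    // Stage 4: group the partial vectors by docid and merge them.
    // Shards hold disjoint ids, so putAll never overwrites an entry.
    static Map<String, Map<Integer, Double>> vectorize(Map<String, String> docs, int shardSize) {
        List<Map<String, Integer>> shards = buildShards(countFeatures(docs), shardSize);
        Map<String, Map<Integer, Double>> vectors = new TreeMap<>();
        for (Map<String, Integer> shard : shards) {
            for (Map.Entry<String, String> doc : docs.entrySet()) {
                vectors.computeIfAbsent(doc.getKey(), k -> new TreeMap<>())
                       .putAll(partialVector(doc.getValue(), shard));
            }
        }
        return vectors;
    }
}
```

Sharding the dictionary is what keeps the third stage bounded: each pass only needs one shard of the feature-to-id mapping in memory, at the cost of one pass over the documents per shard.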
[jira] Commented: (MAHOUT-85) Perceptron/Winnow Trainer
[ https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798489#action_12798489 ] Grant Ingersoll commented on MAHOUT-85: Why is PerceptronTrainingMapper empty? Are there Driver programs for this? How do you use the model once it is trained?
Re: SparseVectors writing out a lot of data
Any clue why this is happening? I am running it over a small sample. I will try and pin point the issue On Sun, Jan 10, 2010 at 5:30 PM, Robin Anil robin.a...@gmail.com wrote: https://issues.apache.org/jira/secure/attachment/12429846/DictionaryVectorizer.patch Reduce = PartialVectorGenerator Class
Re: SparseVectors writing out a lot of data
Lot of zeros being printed in the JSON string. Is that normal for an infinite cardinality vector? http://pastebin.com/m6ff5f0ef Same is true if I type cast to a Vector. On Sun, Jan 10, 2010 at 8:08 PM, Grant Ingersoll gsing...@apache.org wrote: Have you dumped out the file? What's in it? Also, if you can use Vector instead of SparseVector in the API (it's fine to bind to SparseVector in the implementation) I think that would be better. On Jan 10, 2010, at 7:00 AM, Robin Anil wrote: https://issues.apache.org/jira/secure/attachment/12429846/DictionaryVectorizer.patch Reduce = PartialVectorGenerator Class
Re: SparseVectors writing out a lot of data
On Jan 10, 2010, at 9:43 AM, Robin Anil wrote:
> Lot of zeros being printed in the Json string. Is that normal for an infinite cardinality vector?
It shouldn't print them if you are using a SparseVector, but my guess is there is something odd going on here when writing it out such that it is writing all the zeros. Also, is it writing JSON to the SeqFile or is that just the result of the dumper? Sounds like you need to hook up a debugger.
> http://pastebin.com/m6ff5f0ef Same is true if I type cast to a Vector
Sure, it's still a SparseVector. My comment about using Vector was just for the API level, not the actual implementation.
> On Sun, Jan 10, 2010 at 8:08 PM, Grant Ingersoll gsing...@apache.org wrote:
>> Have you dumped out the file? What's in it? Also, if you can use Vector instead of SparseVector in the API (it's fine to bind to SparseVector in the implementation) I think that would be better.
>> On Jan 10, 2010, at 7:00 AM, Robin Anil wrote:
>>> https://issues.apache.org/jira/secure/attachment/12429846/DictionaryVectorizer.patch Reduce = PartialVectorGenerator Class
Re: SparseVectors writing out a lot of data
I've noticed the same thing when looking at SparseVectors contained within the results of ClusterDumper. I didn't explore very far into why, but it seems that the JSON representation of the SparseVector doesn't use a map but instead uses parallel arrays of certain sizes. I'm not certain how the sizes are determined, but I assumed that this had something to do with how SparseVector is implemented. Perhaps this is/will be remedied in some of Jake's recent work? On Sun, Jan 10, 2010 at 9:43 AM, Robin Anil robin.a...@gmail.com wrote: Lot of zeros being printed in the Json string. Is that normal for an infinite cardinality vector? http://pastebin.com/m6ff5f0ef Same is true if I type cast to a Vector On Sun, Jan 10, 2010 at 8:08 PM, Grant Ingersoll gsing...@apache.org wrote: Have you dumped out the file? What's in it? Also, if you can use Vector instead of SparseVector in the API (it's fine to bind to SparseVector in the implementation) I think that would be better. On Jan 10, 2010, at 7:00 AM, Robin Anil wrote: https://issues.apache.org/jira/secure/attachment/12429846/DictionaryVectorizer.patch Reduce = PartialVectorGenerator Class
[jira] Commented: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value
[ https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798499#action_12798499 ] Benson Margulies commented on MAHOUT-239: Sean, I don't see the deletes.
[jira] Reopened: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value
[ https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies reopened MAHOUT-239: Assignee: Sean Owen (was: Benson Margulies) These files still need to be deleted from math/src/main/java/org/apache/mahout/math/map: OpenDoubleIntHashMap.java, OpenIntDoubleHashMap.java, OpenIntIntHashMap.java
[jira] Commented: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value
[ https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798503#action_12798503 ] Sean Owen commented on MAHOUT-239: I don't see these files and see no diff from svn diff...? anyone? That said I'm not surprised if there is a glitch. IntelliJ wouldn't apply the patch. I did apply with 'patch' but didn't see deletes. Not sure what's up.
[jira] Updated: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value
[ https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-239: Attachment: MAHOUT-239.diff The previous one wasn't all there.
[jira] Commented: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value
[ https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798538#action_12798538 ] Benson Margulies commented on MAHOUT-239: When I went back to my tree today, to my horror I discovered that I hadn't done all the svn adds I thought I had done. So I've reposted the patch.
[math] Another taste question about collection
The hash map classes define 'pairsSortedByValue'. In the case of a map that delivers objects, this requires the value type to implement Comparable, which, of course, not all classes will. It seems insane to only support these maps (e.g. OpenIntObjectHashMap<T>) for 'T extends Comparable'. So I plan to make it throw if the type happens not to implement Comparable.
Re: [math] Question of taste: 'ObjectArrayList'
I weakly vote for chucking it out. On Sun, Jan 10, 2010 at 8:17 PM, Benson Margulies bimargul...@gmail.com wrote: Colt brought us 'ObjectArrayList'. You might ask, what advantage does it have over ArrayList<T>?
Re: [math] Another taste question about collection
Agree with this as well. On Sun, Jan 10, 2010 at 8:22 PM, Benson Margulies bimargul...@gmail.com wrote: The hash map classes define 'pairsSortedByValue'. In the case of a map that delivers objects, this requires the value type to implement Comparable, which, of course, not all classes will. It seems insane to only support these maps (e.g. OpenIntObjectHashMap<T>) for 'T extends Comparable'. So I plan to make it throw if the type happens not to implement Comparable.
[jira] Commented: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value
[ https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798540#action_12798540 ] Sean Owen commented on MAHOUT-239: Committed, but the deletes still appear to delete files not in SVN according to svn and IntelliJ. I'm hoping that's not some glitch on my end.
[jira] Resolved: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value
[ https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-239. Resolution: Fixed Assignee: Benson Margulies (was: Sean Owen)
Re: SparseVectors writing out a lot of data
On Sun, Jan 10, 2010 at 6:43 AM, Robin Anil robin.a...@gmail.com wrote: Lot of zeros being printed in the Json string. Is that normal for an infinite cardinality vector? http://pastebin.com/m6ff5f0ef Same is true if I type cast to a Vector Where did this JSON output come from, Robin? That isn't what's in the file, is it? That is the output of AbstractVector.decodeVector(v), right? In that form, there can certainly be zeroes in the output, because the GSON decoder is just spitting out the whole hashmap entries (both keys and values as separate arrays), and it's got a bunch of zeroes still in it (because it's not anywhere near perfectly packed). That's not the problem, I don't think. This doesn't really shed much light on what the vector contents are. The Vectorizer is checked in, right? Or is it still on the JIRA patch? -jake
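Jake's point about the zeroes can be illustrated with a toy open-addressing map. The sketch below is not Mahout's or Colt's implementation; the class, its linear probing, and its fixed capacity are assumptions made purely for illustration. It shows why dumping the raw backing arrays prints zeroes for unused slots even though only a few entries are actually stored.

```java
import java.util.Arrays;

// Toy int -> double open-addressing map (hypothetical, for illustration only).
final class ToyOpenIntDoubleMap {
    private final int[] keys;
    private final double[] values;
    private final boolean[] used;
    private int size;

    ToyOpenIntDoubleMap(int capacity) {
        keys = new int[capacity];
        values = new double[capacity];
        used = new boolean[capacity];
    }

    // Linear probing, no resizing: the toy refuses any put once it is full.
    void put(int key, double value) {
        if (size == keys.length) {
            throw new IllegalStateException("toy map is full");
        }
        int slot = Math.floorMod(key, keys.length);
        while (used[slot] && keys[slot] != key) {
            slot = (slot + 1) % keys.length;
        }
        if (!used[slot]) {
            used[slot] = true;
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    int size() {
        return size;
    }

    // What a naive serializer sees: the whole backing array, empty slots included.
    String dumpRawValues() {
        return Arrays.toString(values);
    }

    // What is logically stored: only the occupied slots, as key:value pairs.
    String dumpEntries() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < keys.length; i++) {
            if (used[i]) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(keys[i]).append(':').append(values[i]);
            }
        }
        return sb.toString();
    }
}
```

With two entries in a table of capacity 8, `dumpRawValues()` shows eight numbers, six of them zero, while `dumpEntries()` shows only the two pairs. Zeroes in such a dump therefore say little about how many bytes a compact on-disk encoding would need.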
Re: [math] Question of taste: 'ObjectArrayList'
I take it that the point of this code is to allow filling an ArrayList with null values efficiently. This might sometimes be useful, I suppose. It sounds like you are saying that the virtue of the ObjectArrayList is that we own it and can make this resizing method efficient. I don't see any advantage of that strategy versus forking some other implementation or even starting a new implementation from scratch. I am also not clear on the virtues of this resizing in general. In particular, the idiom if (l.size() > size) l.subList(size, l.size() - size).clear() seems a better way to clear a bunch of values. Collections.fill() applied to a sublist should be good as well. If you just need to fill in an ArrayList quickly to a desired size, then addAll from a static list of nulls could be a bit faster. Is speed really important here, though? As a side note, the tests in your code appear to be backwards. On Sun, Jan 10, 2010 at 12:17 PM, Benson Margulies bimargul...@gmail.com wrote: Colt brought us 'ObjectArrayList'. You might ask, what advantage does it have over ArrayList<T>? Well, I just found myself writing the following to use ArrayList<T> in the xxxObjectHashMap set. I could rework ObjectArrayList to be a subclass of ArrayList that provided this efficiently, instead of my current plan to throw it out altogether once I've got other things cleaned up. Thoughts? private void resizeArrayList(ArrayList<T> l, int size) { while (size > l.size()) { l.add(null); } while (size < l.size()) { l.remove(l.size()-1); } } -- Ted Dunning, CTO DeepDyve
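Ted's two suggestions, clearing a tail through a sublist view and padding with a list of nulls, might be combined along these lines. This is a sketch, not code from the patch; the class and method names are made up. Note that `List.subList` takes an exclusive end index rather than a length, so the sketch passes `list.size()` as the second argument.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: resize a list in place using only java.util calls.
final class ListResize {
    static <T> void resize(List<T> list, int size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative: " + size);
        }
        if (list.size() > size) {
            // Drop the tail in one call; subList(from, to) is a view on the list.
            list.subList(size, list.size()).clear();
        } else if (list.size() < size) {
            // Pad with nulls in one call; nCopies is an immutable, constant-space list.
            list.addAll(Collections.<T>nCopies(size - list.size(), null));
        }
    }

    public static void main(String[] args) {
        List<String> l = new ArrayList<>(List.of("a", "b", "c"));
        resize(l, 1);
        System.out.println(l);
        resize(l, 3);
        System.out.println(l);
    }
}
```

For an `ArrayList`, both branches let the list do a single bulk operation instead of one `add` or `remove` per element, which is the main reason to prefer them over the hand-written loops.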
Re: [math] Another taste question about collection
Totally agree with this. IllegalArgumentException seems made for this. On Sun, Jan 10, 2010 at 12:22 PM, Benson Margulies bimargul...@gmail.com wrote: So I plan to make it throw if the type happens not to implement Comparable. -- Ted Dunning, CTO DeepDyve
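A minimal sketch of the agreed behaviour, using a plain `java.util.Map` in place of the Colt-derived OpenIntObjectHashMap: the map stays usable for any value type, and only the sort-by-value operation throws when a value does not implement Comparable. The class and method names are hypothetical, and returning just the keys is a simplification of 'pairsSortedByValue'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a sort-by-value method on an int -> object map.
final class SortedByValueSketch {

    // Returns the keys ordered by their values. Throws IllegalArgumentException
    // at call time if any value does not implement Comparable.
    @SuppressWarnings("unchecked")
    static <T> List<Integer> keysSortedByValue(Map<Integer, T> map) {
        for (T value : map.values()) {
            if (!(value instanceof Comparable)) {
                throw new IllegalArgumentException(
                    "Value does not implement Comparable: "
                        + (value == null ? "null" : value.getClass().getName()));
            }
        }
        List<Map.Entry<Integer, T>> entries = new ArrayList<>(map.entrySet());
        // The cast is safe after the check above as long as values are mutually comparable.
        entries.sort((a, b) -> ((Comparable<Object>) a.getValue()).compareTo(b.getValue()));
        List<Integer> keys = new ArrayList<>();
        for (Map.Entry<Integer, T> entry : entries) {
            keys.add(entry.getKey());
        }
        return keys;
    }
}
```

The trade-off is that the check moves from compile time to run time: callers with non-comparable values can still use the map, but only find out about the restriction when they call the sorting method.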
Re: SparseVectors writing out a lot of data
The only major cases that I know that we have are matrices of small integers. The most impressive case is where the sparse matrix can only contain 1 for non-zero values, since you don't have to store the value at all, just the index. If the indexes are sorted, then delta-PFOR coding or some such could give some very impressively nice scanning behavior but pretty hideous indexing behavior. Just keeping the indexes in sorted order gives you pretty good memory performance (4 bytes per value) which might be sufficient. For small integer needs, I would recommend something akin to PFOR so that you could store bytes for most values and have an exception table of some kind for larger values. One off-the-cuff design would keep the small values in a sparse byte matrix and keep the (hopefully much sparser) larger values in a tower of larger stores. Such a matrix would have pretty bizarre properties, but would be an ace on storage size. On Sun, Jan 10, 2010 at 1:03 PM, Robin Anil robin.a...@gmail.com wrote: I used the VIntWritable and VLongWritable in the IntTupleWritable to compress the space (variable 2-5 bytes to store integers) needed to represent smaller integers. That gave me a lot of savings in the PFPgrowth algorithm. Does someone have a similar representation for double values? -- Ted Dunning, CTO DeepDyve
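The idea of storing only sorted indexes compactly can be sketched with delta coding plus a byte-aligned variable-length integer. To be clear about the assumptions: this is neither PFOR nor the exact byte format of Hadoop's VIntWritable; it is a simpler 7-bits-per-byte encoding chosen for illustration, and all names are hypothetical. It assumes the indexes are non-negative and sorted in ascending order.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: delta + varint coding of sorted, non-negative indexes.
final class DeltaVarintSketch {

    // Writes a non-negative int using 7 payload bits per byte (1 to 5 bytes).
    // The high bit of each byte says whether another byte follows.
    static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Stores the gap to the previous index instead of the index itself,
    // so densely packed indexes cost about one byte each.
    static byte[] encodeSortedIndexes(int[] sortedIndexes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int index : sortedIndexes) {
            writeVarint(out, index - previous);
            previous = index;
        }
        return out.toByteArray();
    }

    // Reverses encodeSortedIndexes; decoding is strictly sequential, which is
    // why scanning is cheap but random access by position is not.
    static int[] decodeSortedIndexes(byte[] bytes, int count) {
        int[] result = new int[count];
        int pos = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0;
            int shift = 0;
            int b;
            do {
                b = bytes[pos++] & 0xFF;
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;
            result[i] = previous;
        }
        return result;
    }
}
```

For the indexes 3, 10, 1000 and 1000000 the gaps are 3, 7, 990 and 999000, which take 1 + 1 + 2 + 3 = 7 bytes instead of the 16 bytes of four plain ints. This also shows the trade-off Ted describes: to read the last index, every earlier gap has to be decoded first.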
Re: SparseVectors writing out a lot of data
I don't know if it helps, but I have a sparse vector file that is based off a 1.8 MB Lucene index and it takes up 143 kb. Earlier, I had a Lucene index that was several megs (20+) and the vectors only took 1 mb. Have you tried debugging? If I can finish up my chapter tonight, I will try to take a closer look. On Jan 10, 2010, at 6:45 AM, Robin Anil wrote: I have been testing out the DictionaryVectorizer on the 20news dataset. It's writing out 2GB of vector files for the 38MB dataset. This is what I am doing; tell me where I am going wrong. First I create an infinite-dimensional vector of size 10: SparseVector vector = new SparseVector(key.toString(), Integer.MAX_VALUE, 10); Then, for each (word, int id) entry in the dictionary: vector.setQuick(dictionary.get(word), weight); Finally: output.write(docid, vector) Robin -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: [math] Question of taste: 'ObjectArrayList'
Ted, I had much the same thoughts while driving to the grocery store after sending that message. Death to the class, and I'll clean up the implementation of the users of the resize business. --benson On Sun, Jan 10, 2010 at 4:38 PM, Ted Dunning ted.dunn...@gmail.com wrote: I take it that the point of this code is to allow filling an ArrayList with null values efficiently. This might sometimes be useful, I suppose. It sounds like you are saying that the virtue of the ObjectArrayList is that we own it and can make this resizing method efficient. I don't see any advantage of that strategy versus forking some other implementation or even starting a new implementation from scratch. I am also not clear on the virtues of this resizing in general. In particular, the idiom if (l.size() > size) l.subList(size, l.size() - size).clear() seems a better way to clear a bunch of values. Collections.fill() applied to a sublist should be good as well. If you just need to fill in an ArrayList quickly to a desired size, then addAll from a static list of nulls could be a bit faster. Is speed really important here, though? As a side note, the tests in your code appear to be backwards. On Sun, Jan 10, 2010 at 12:17 PM, Benson Margulies bimargul...@gmail.com wrote: Colt brought us 'ObjectArrayList'. You might ask, what advantage does it have over ArrayList<T>? Well, I just found myself writing the following to use ArrayList<T> in the xxxObjectHashMap set. I could rework ObjectArrayList to be a subclass of ArrayList that provided this efficiently, instead of my current plan to throw it out altogether once I've got other things cleaned up. Thoughts? private void resizeArrayList(ArrayList<T> l, int size) { while (size > l.size()) { l.add(null); } while (size < l.size()) { l.remove(l.size()-1); } } -- Ted Dunning, CTO DeepDyve
Re: [math] Question of taste: 'ObjectArrayList'
p.s. .sdrawkcab si od I gnihtyreve os cibaraA tuoba gnikniht neeb evah I. On Sun, Jan 10, 2010 at 4:38 PM, Ted Dunning ted.dunn...@gmail.com wrote: I take it that the point of this code is to allow filling an ArrayList with null values efficiently. This might sometimes be useful, I suppose. It sounds like you are saying that the virtue of the ObjectArrayList is that we own it and can make this resizing method efficient. I don't see any advantage of that strategy versus forking some other implementation or even starting a new implementation from scratch. I am also not clear on the virtues of this resizing in general. In particular, the idiom if (l.size() > size) l.subList(size, l.size() - size).clear() seems a better way to clear a bunch of values. Collections.fill() applied to a sublist should be good as well. If you just need to fill in an ArrayList quickly to a desired size, then addAll from a static list of nulls could be a bit faster. Is speed really important here, though? As a side note, the tests in your code appear to be backwards. On Sun, Jan 10, 2010 at 12:17 PM, Benson Margulies bimargul...@gmail.com wrote: Colt brought us 'ObjectArrayList'. You might ask, what advantage does it have over ArrayList<T>? Well, I just found myself writing the following to use ArrayList<T> in the xxxObjectHashMap set. I could rework ObjectArrayList to be a subclass of ArrayList that provided this efficiently, instead of my current plan to throw it out altogether once I've got other things cleaned up. Thoughts? private void resizeArrayList(ArrayList<T> l, int size) { while (size > l.size()) { l.add(null); } while (size < l.size()) { l.remove(l.size()-1); } } -- Ted Dunning, CTO DeepDyve
[math] no-such-integer value
Colt code is inconsistent in dealing with the following case: iIntSomethingHashMap.keyOf(someValue). Some code we got from them returns 0 if there is no such value; other code returns MIN_VALUE. In floating-point land, it returns NaN. Personally, I'd be inclined to nuke the entire API. It's implemented as the obvious iteration, and the caller can iterate for themselves without creating a shoot-yourself opportunity for the unwary. Another alternative is to make the signature: keyType keyOf(valueType value, boolean[] present). I'm in favor of removal, but I could live with the boolean. Thoughts?
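The alternative signature with a `boolean[] present` out-parameter could look roughly like this. The sketch uses two parallel arrays in place of a real hash map, and the class name and storage layout are assumptions made for illustration; only the shape of the `keyOf` signature comes from the proposal above.

```java
// Hypothetical sketch of keyOf with an explicit "present" flag.
final class KeyOfSketch {

    // Scans the values for the wanted one and returns its key.
    // present[0] tells the caller whether the returned key is meaningful,
    // so no in-band sentinel such as 0 or MIN_VALUE is needed.
    static int keyOf(int[] keys, double[] values, double wanted, boolean[] present) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == wanted) {
                present[0] = true;
                return keys[i];
            }
        }
        present[0] = false;
        return 0; // ignored by callers that check present[0]
    }
}
```

With keys {0, 7} and values {1.0, 2.0}, looking up 1.0 returns key 0 with the flag set, while looking up 9.0 also returns 0 but with the flag cleared. A sentinel-only API cannot distinguish those two cases when 0 is a legal key, which is the shoot-yourself opportunity mentioned above.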
Re: [math] no-such-integer value
Nuke it if you don't use it. On Sun, Jan 10, 2010 at 7:36 PM, Benson Margulies bimargul...@gmail.com wrote: Personally, I'd be inclined to nuke the entire API. It's implemented as the obvious iteration, and the caller can iterate for themselves without creating a shoot-yourself opportunity for the unwary. Another alternative is to make the signature keyType keyOf(valueType value, boolean[] present) I'm in favor of removal, but I could live with the boolean. Thoughts? -- Ted Dunning, CTO DeepDyve
[jira] Created: (MAHOUT-242) LLR Collocation Identifier
LLR Collocation Identifier
Key: MAHOUT-242 URL: https://issues.apache.org/jira/browse/MAHOUT-242 Project: Mahout Issue Type: New Feature Affects Versions: 0.3 Reporter: Drew Farris Priority: Minor Attachments: mahout-colloc.tar.gz
Identifies interesting Collocations in text using ngrams scored via the LogLikelihoodRatio calculation. As discussed in:
* http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2
* http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e
Current form is a tar of a maven project that depends on mahout. Build as usual with 'mvn clean install', can be executed using:
{noformat}
mvn -e exec:java -Dexec.mainClass=org.apache.mahout.colloc.CollocDriver -Dexec.args="--input src/test/resources/article --colloc target/colloc --output target/output -w"
{noformat}
Output will be placed in target/output and can be viewed nicely using:
{noformat}
sort -rn -k1 target/output/part-0
{noformat}
Includes rudimentary unit tests. Please review and comment. Needs more work to get this into patch state and integrate with Robin's document vectorizer work in MAHOUT-237.
Some basic TODO/FIXME's include:
* use mahout math's ObjectInt map implementation when available
* make the analyzer configurable
* better input validation + negative unit tests.
* more flexible ways to generate units of analysis (n-1)grams.
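For readers unfamiliar with the score, the log-likelihood ratio for a bigram is commonly computed from a 2x2 contingency table of counts. The sketch below uses the entropy-based formulation; it is an illustration only and is not taken from mahout-colloc.tar.gz. The class name, method names, and the meaning assigned to the four counts (k11 = A followed by B, k12 = A followed by something else, k21 = something else followed by B, k22 = neither) are assumptions.

```java
// Hypothetical sketch of a log-likelihood ratio score for a 2x2 count table.
final class LlrSketch {

    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts:
    // N * ln(N) - sum(k * ln(k)), where N is the total.
    static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long count : counts) {
            result += xLogX(count);
            sum += count;
        }
        return xLogX(sum) - result;
    }

    // LLR = 2 * (H(row sums) + H(column sums) - H(all four cells)).
    // Clamped at zero to hide tiny negative values caused by rounding.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }
}
```

A table whose rows are proportional, such as (1, 1, 1, 1), scores 0 because the two words occur together no more often than chance predicts. A table such as (10, 0, 0, 10), where the words only ever occur together, scores 40 * ln 2, about 27.7. Sorting n-grams by this score in descending order, as the `sort -rn` command does with the job output, puts the strongest collocations first.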
[jira] Updated: (MAHOUT-242) LLR Collocation Identifier
[ https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-242: Attachment: mahout-colloc.tar.gz
[jira] Commented: (MAHOUT-242) LLR Collocation Identifier
[ https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798575#action_12798575 ] Robin Anil commented on MAHOUT-242:
* Try to stick to SequenceFile<Text,Text> (key = docid) for input content and leave it to the user to generate this file. There is a SequenceFileFromDirectory class in the examples in my patch which does this conversion and writes SequenceFiles on HDFS directly.
* Also take a look at the TermCountMapper, where I have parameterized the Lucene Analyzer through the conf.
* If you need to pass a tuple as input or output, check out the StringTupleWritable class instead of appending stuff to Text or splitting it.