[jira] Updated: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value

2010-01-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-239:
-

Resolution: Fixed
  Assignee: Benson Margulies
Status: Resolved  (was: Patch Available)

Committed

 Complete set of open hash maps with primitive types as both key and value
 -

 Key: MAHOUT-239
 URL: https://issues.apache.org/jira/browse/MAHOUT-239
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 0.3

 Attachments: MAHOUT-239.diff


 Here is the template providing the hash map and the test for all the 
 primitive type pairs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAHOUT-85) Perceptron/Winnow Trainer

2010-01-10 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost resolved MAHOUT-85.


Resolution: Fixed

Finally committed.

 Perceptron/Winnow Trainer
 -

 Key: MAHOUT-85
 URL: https://issues.apache.org/jira/browse/MAHOUT-85
 Project: Mahout
  Issue Type: New Feature
  Components: Classification
Affects Versions: 0.1
Reporter: Isabel Drost
Assignee: Isabel Drost
 Fix For: 0.3

 Attachments: MAHOUT-85.patch, MAHOUT-85.patch, 
 perceptronWinnowTrainer.diff


 Please find attached a first sketch for perceptron and winnow training. 
 Please look very, very carefully at the patch, as I added the heart of the 
 algorithms in the emergency room at Charite Berlin (after I broke my leg when 
 cycling to the Hadoop Get Together ;) ). 
 The patch does not yet feature unit tests nor is it parallelised. Currently 
 my plan is to set up an example with the webKb dataset, add unit tests to the 
 code and after that go parallel. I would like to get some feedback early on, 
 in addition I would feel a lot better, if a second and third pair of eyes had 
 a look at the code to make sure all obvious mistakes are out as early as 
 possible.




[jira] Created: (MAHOUT-240) Parallel version of Perceptron

2010-01-10 Thread Isabel Drost (JIRA)
Parallel version of Perceptron
--

 Key: MAHOUT-240
 URL: https://issues.apache.org/jira/browse/MAHOUT-240
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Affects Versions: 0.3
Reporter: Isabel Drost
 Fix For: 0.3


So far Perceptron (as well as Winnow) training is still implemented to run w/o 
parallelization. The goal of this issue is to explore ways for parallelization 
and if possible to provide a parallel version, that is one that is based on map 
reduce.




[jira] Created: (MAHOUT-241) Example for perceptron

2010-01-10 Thread Isabel Drost (JIRA)
Example for perceptron
--

 Key: MAHOUT-241
 URL: https://issues.apache.org/jira/browse/MAHOUT-241
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Affects Versions: 0.3
Reporter: Isabel Drost
 Fix For: 0.3


The goal is to provide an end-to-end example based on the 20-newsgroups dataset 
to show how to get from a set of labelled training examples to a trained model 
that can later be reused.




SparseVectors writing out a lot of data

2010-01-10 Thread Robin Anil
I have been testing the DictionaryVectorizer on the 20news dataset. It's
writing out 2GB of vector files for the 38MB dataset.

This is what I am doing. Tell me where I am going wrong.

First I create a vector of effectively unbounded cardinality
(Integer.MAX_VALUE) with an initial size of 10:
 SparseVector vector = new SparseVector(key.toString(), Integer.MAX_VALUE,
  10);

Then, for each (word => int id) entry in the dictionary:
  vector.setQuick(dictionary.get(word), weight);

output.write(docid, vector);
Robin
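
For reference, here is a minimal, self-contained sketch of the loop described in this message. It uses a plain TreeMap as a stand-in for Mahout's SparseVector, and the class DocVectorSketch and its method are hypothetical names, not Mahout API. The only point it makes is that a sparse representation should hold one entry per distinct word of the document, whatever cardinality is declared, so the serialized size would be expected to track the number of non-zero entries.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the vectorizer loop; not Mahout code.
final class DocVectorSketch {

  // Maps each word of one document to its dictionary id and stores only
  // the non-zero weights. Words missing from the dictionary are skipped.
  static Map<Integer, Double> vectorize(Map<String, Integer> dictionary,
                                        Map<String, Double> wordWeights) {
    Map<Integer, Double> vector = new TreeMap<>();
    for (Map.Entry<String, Double> e : wordWeights.entrySet()) {
      Integer id = dictionary.get(e.getKey());
      if (id != null && e.getValue() != 0.0) {
        vector.put(id, e.getValue());
      }
    }
    return vector;
  }
}
```

Even with ids close to Integer.MAX_VALUE, the map above holds only as many entries as the document has known words.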


Re: SparseVectors writing out a lot of data

2010-01-10 Thread Robin Anil
https://issues.apache.org/jira/secure/attachment/12429846/DictionaryVectorizer.patch

Reduce = PartialVectorGenerator Class


[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-01-10 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-237:
--

Attachment: DictionaryVectorizer.patch

Some tidying up. Still the large output bug remains

 Map/Reduce Implementation of Document Vectorizer
 

 Key: MAHOUT-237
 URL: https://issues.apache.org/jira/browse/MAHOUT-237
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3

 Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
 DictionaryVectorizer.patch


 The current Vectorizer uses a Lucene index to convert documents into SparseVectors.
 Ted is working on a hash-based Vectorizer which can map features into Vectors 
 of fixed size and sum them up to get the document Vector.
 This is a pure bag-of-words Vectorizer written in Map/Reduce. 
 The input documents are in a SequenceFile<Text,Text>, with key = docid, value = 
 content.
 First: Map/Reduce over the document collection to generate the feature counts.
 Second: a sequential pass reads the output of the Map/Reduce and converts it 
 to a SequenceFile<Text, LongWritable>, where key = feature, value = unique id. 
 This second stage should create shards of features of a given split size.
 Third: Map/Reduce over the document collection, using each shard to create 
 partial SparseVectors (containing only the features of the given shard). 
 Fourth: Map/Reduce over the partial vectors, group by docid, and create the full 
 document Vector.
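
The four stages in this description can be walked through in memory as a rough sketch. Everything below is an assumption made for illustration: the class and method names are hypothetical, whitespace tokenization stands in for a real analyzer, plain java.util maps stand in for SequenceFiles and SparseVectors, raw term counts are used as weights, and int ids are used where the issue mentions LongWritable. It is not the code in DictionaryVectorizer.patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory walk-through of the four stages; all names are hypothetical.
final class VectorizerPipelineSketch {

  // Stage 1: count how often each feature occurs in the collection.
  static Map<String, Long> countFeatures(Map<String, String> docs) {
    Map<String, Long> counts = new TreeMap<>();
    for (String content : docs.values()) {
      for (String token : content.split("\\s+")) {
        if (!token.isEmpty()) {
          counts.merge(token, 1L, Long::sum);
        }
      }
    }
    return counts;
  }

  // Stage 2: assign a unique id to each feature and cut the dictionary
  // into shards holding at most shardSize features each.
  static List<Map<String, Integer>> buildShards(Map<String, Long> counts,
                                                int shardSize) {
    List<Map<String, Integer>> shards = new ArrayList<>();
    Map<String, Integer> current = new HashMap<>();
    int nextId = 0;
    for (String feature : counts.keySet()) {
      if (current.size() == shardSize) {
        shards.add(current);
        current = new HashMap<>();
      }
      current.put(feature, nextId++);
    }
    if (!current.isEmpty()) {
      shards.add(current);
    }
    return shards;
  }

  // Stage 3: for one shard, build a partial vector per document that
  // contains only the features present in that shard.
  static Map<String, Map<Integer, Double>> partialVectors(
      Map<String, String> docs, Map<String, Integer> shard) {
    Map<String, Map<Integer, Double>> partials = new HashMap<>();
    for (Map.Entry<String, String> doc : docs.entrySet()) {
      Map<Integer, Double> partial = new TreeMap<>();
      for (String token : doc.getValue().split("\\s+")) {
        Integer id = shard.get(token);
        if (id != null) {
          partial.merge(id, 1.0, Double::sum);
        }
      }
      partials.put(doc.getKey(), partial);
    }
    return partials;
  }

  // Stage 4: group the partial vectors by docid and merge them.
  static Map<String, Map<Integer, Double>> mergeByDocId(
      List<Map<String, Map<Integer, Double>>> allPartials) {
    Map<String, Map<Integer, Double>> full = new TreeMap<>();
    for (Map<String, Map<Integer, Double>> partials : allPartials) {
      for (Map.Entry<String, Map<Integer, Double>> e : partials.entrySet()) {
        full.computeIfAbsent(e.getKey(), k -> new TreeMap<>())
            .putAll(e.getValue());
      }
    }
    return full;
  }
}
```

Because each feature id lives in exactly one shard, merging partial vectors with putAll never overwrites an entry from another shard.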




[jira] Commented: (MAHOUT-85) Perceptron/Winnow Trainer

2010-01-10 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798489#action_12798489
 ] 

Grant Ingersoll commented on MAHOUT-85:
---

Why is PerceptronTrainingMapper empty?  Are there Driver programs for this?  
How do you use the model once it is trained?

 Perceptron/Winnow Trainer
 -

 Key: MAHOUT-85
 URL: https://issues.apache.org/jira/browse/MAHOUT-85
 Project: Mahout
  Issue Type: New Feature
  Components: Classification
Affects Versions: 0.1
Reporter: Isabel Drost
Assignee: Isabel Drost
 Fix For: 0.3

 Attachments: MAHOUT-85.patch, MAHOUT-85.patch, 
 perceptronWinnowTrainer.diff


 Please find attached a first sketch for perceptron and winnow training. 
 Please look very, very carefully at the patch, as I added the heart of the 
 algorithms in the emergency room at Charite Berlin (after I broke my leg when 
 cycling to the Hadoop Get Together ;) ). 
 The patch does not yet feature unit tests nor is it parallelised. Currently 
 my plan is to set up an example with the webKb dataset, add unit tests to the 
 code and after that go parallel. I would like to get some feedback early on, 
 in addition I would feel a lot better, if a second and third pair of eyes had 
 a look at the code to make sure all obvious mistakes are out as early as 
 possible.




Re: SparseVectors writing out a lot of data

2010-01-10 Thread Robin Anil
Any clue why this is happening? I am running it over a small sample. I will
try to pinpoint the issue.







Re: SparseVectors writing out a lot of data

2010-01-10 Thread Robin Anil
A lot of zeros are being printed in the JSON string. Is that normal for an
infinite-cardinality vector?
http://pastebin.com/m6ff5f0ef
The same is true if I typecast to a Vector.


On Sun, Jan 10, 2010 at 8:08 PM, Grant Ingersoll gsing...@apache.org wrote:

 Have you dumped out the file?  What's in it?

 Also, if you can use Vector instead of SparseVector in the API (it's fine
 to bind to SparseVector in the implementation) I think that would be better.

 On Jan 10, 2010, at 7:00 AM, Robin Anil wrote:

 
 https://issues.apache.org/jira/secure/attachment/12429846/DictionaryVectorizer.patch
 
  Reduce = PartialVectorGenerator Class





Re: SparseVectors writing out a lot of data

2010-01-10 Thread Grant Ingersoll

On Jan 10, 2010, at 9:43 AM, Robin Anil wrote:

 Lot of zeros being printed in the Json string. Is that normal for an
 infinite cardinality vector?

It shouldn't print them if you are using a SparseVector, but my guess is there 
is something odd going on here when writing it out such that it is writing all 
the zeros.  Also, is it writing JSON to the SeqFile or is that just the result 
of the dumper?

Sounds like you need to hook up a debugger.

 http://pastebin.com/m6ff5f0ef
 Same is true if I type cast to a Vector

Sure, it's still a SparseVector.  My comment about using Vector was just for 
the API level, not the actual implementation 

 
 
 
 
 



Re: SparseVectors writing out a lot of data

2010-01-10 Thread Drew Farris
I've noticed the same thing when looking at SparseVectors contained
within the results of ClusterDumper. I didn't explore very far into why,
but it seems that the JSON representation of the SparseVector doesn't use a
map but instead uses parallel arrays of certain sizes. I'm not certain how
the sizes are determined, but I assumed that this had something to do with
how SparseVector is implemented.

Perhaps this is/will be remedied in some of Jake's recent work?

 
 
 



[jira] Commented: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value

2010-01-10 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798499#action_12798499
 ] 

Benson Margulies commented on MAHOUT-239:
-

Sean, I don't see the deletes.


 Complete set of open hash maps with primitive types as both key and value
 -

 Key: MAHOUT-239
 URL: https://issues.apache.org/jira/browse/MAHOUT-239
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 0.3

 Attachments: MAHOUT-239.diff


 Here is the template providing the hash map and the test for all the 
 primitive type pairs.




[jira] Reopened: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value

2010-01-10 Thread Benson Margulies (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies reopened MAHOUT-239:
-

  Assignee: Sean Owen  (was: Benson Margulies)

These files still need to be deleted:

math/src/main/java/org/apache/mahout/math/map:

OpenDoubleIntHashMap.java
OpenIntDoubleHashMap.java
OpenIntIntHashMap.java

 Complete set of open hash maps with primitive types as both key and value
 -

 Key: MAHOUT-239
 URL: https://issues.apache.org/jira/browse/MAHOUT-239
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Sean Owen
 Fix For: 0.3

 Attachments: MAHOUT-239.diff


 Here is the template providing the hash map and the test for all the 
 primitive type pairs.




[jira] Commented: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value

2010-01-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798503#action_12798503
 ] 

Sean Owen commented on MAHOUT-239:
--

I don't see these files and see no diff from svn diff...? anyone?
That said I'm not surprised if there is a glitch. IntelliJ wouldn't apply the 
patch. I did apply with 'patch' but didn't see deletes. Not sure what's up.

 Complete set of open hash maps with primitive types as both key and value
 -

 Key: MAHOUT-239
 URL: https://issues.apache.org/jira/browse/MAHOUT-239
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Sean Owen
 Fix For: 0.3

 Attachments: MAHOUT-239.diff


 Here is the template providing the hash map and the test for all the 
 primitive type pairs.




[jira] Updated: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value

2010-01-10 Thread Benson Margulies (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated MAHOUT-239:


Attachment: MAHOUT-239.diff

The previous one wasn't all there.

 Complete set of open hash maps with primitive types as both key and value
 -

 Key: MAHOUT-239
 URL: https://issues.apache.org/jira/browse/MAHOUT-239
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Sean Owen
 Fix For: 0.3

 Attachments: MAHOUT-239.diff, MAHOUT-239.diff


 Here is the template providing the hash map and the test for all the 
 primitive type pairs.




[jira] Commented: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value

2010-01-10 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798538#action_12798538
 ] 

Benson Margulies commented on MAHOUT-239:
-

When I went back to my tree today, to my horror I discovered that I hadn't done 
all the svn adds I thought I had done. So I've reposted the patch.



 Complete set of open hash maps with primitive types as both key and value
 -

 Key: MAHOUT-239
 URL: https://issues.apache.org/jira/browse/MAHOUT-239
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Sean Owen
 Fix For: 0.3

 Attachments: MAHOUT-239.diff, MAHOUT-239.diff


 Here is the template providing the hash map and the test for all the 
 primitive type pairs.




[math] Another taste question about collection

2010-01-10 Thread Benson Margulies
The hash map classes define 'pairsSortedByValue'. In the case of a map
that delivers objects, this requires the value type to implement
Comparable, which, of course, not all classes will. It seems insane to
only support these maps (e.g. OpenIntObjectHashMap<T>) for 'T extends
Comparable'. So I plan to make it throw if the type happens not to
implement Comparable.
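
A rough sketch of what "throw if the type happens not to implement Comparable" could look like, kept independent of the actual map classes. The helper name sortedValues, the use of a plain List, and the choice of IllegalArgumentException are all assumptions for illustration, not the Colt or Mahout code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; not the Colt or Mahout implementation.
final class SortByValueSketch {

  // Returns the values in ascending order, or throws if any value does
  // not implement Comparable (a null value is rejected the same way).
  @SuppressWarnings({"unchecked", "rawtypes"})
  static <T> List<T> sortedValues(List<T> values) {
    for (T value : values) {
      if (!(value instanceof Comparable)) {
        throw new IllegalArgumentException("Values must implement Comparable");
      }
    }
    // Raw types are used on purpose: T carries no Comparable bound, so the
    // check above is the only guard before sorting.
    List sorted = new ArrayList(values);
    Collections.sort(sorted);
    return (List<T>) sorted;
  }
}
```

The map type itself stays usable for any T this way; only the sorting call fails for value types that cannot be ordered.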


Re: [math] Question of taste: 'ObjectArrayList'

2010-01-10 Thread Sean Owen
I weakly vote for chucking it out.

On Sun, Jan 10, 2010 at 8:17 PM, Benson Margulies bimargul...@gmail.com wrote:
 Colt brought us 'ObjectArrayList'. You might ask, what advantage does
 it have over ArrayList<T>?



Re: [math] Another taste question about collection

2010-01-10 Thread Sean Owen
Agree with this as well.




[jira] Commented: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value

2010-01-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798540#action_12798540
 ] 

Sean Owen commented on MAHOUT-239:
--

Committed, but the deletes still appear to delete files not in SVN according to 
svn and IntelliJ. I'm hoping that's not some glitch on my end.

 Complete set of open hash maps with primitive types as both key and value
 -

 Key: MAHOUT-239
 URL: https://issues.apache.org/jira/browse/MAHOUT-239
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Sean Owen
 Fix For: 0.3

 Attachments: MAHOUT-239.diff, MAHOUT-239.diff


 Here is the template providing the hash map and the test for all the 
 primitive type pairs.




[jira] Resolved: (MAHOUT-239) Complete set of open hash maps with primitive types as both key and value

2010-01-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved MAHOUT-239.
--

Resolution: Fixed
  Assignee: Benson Margulies  (was: Sean Owen)

 Complete set of open hash maps with primitive types as both key and value
 -

 Key: MAHOUT-239
 URL: https://issues.apache.org/jira/browse/MAHOUT-239
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 0.3

 Attachments: MAHOUT-239.diff, MAHOUT-239.diff


 Here is the template providing the hash map and the test for all the 
 primitive type pairs.




Re: SparseVectors writing out a lot of data

2010-01-10 Thread Jake Mannix
On Sun, Jan 10, 2010 at 6:43 AM, Robin Anil robin.a...@gmail.com wrote:

 Lot of zeros being printed in the Json string. Is that normal for an
 infinite cardinality vector?
 http://pastebin.com/m6ff5f0ef
 Same is true if I type cast to a Vector


Where did this JSON output come from, Robin?  That isn't what's in the file,
is it?  That is the output of AbstractVector.decodeVector(v), right?
In that form, there can certainly be zeroes in the output, because the GSON
decoder is just spitting out the whole hashmap's entries (both keys and values
as separate arrays), and it's got a bunch of zeroes still in it (because
it's not anywhere near perfectly packed). That's not the problem, I don't
think.

This doesn't really shed much light on what the vector contents are.  The
Vectorizer is checked in, right? Or is it still on the JIRA patch?

  -jake
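
To illustrate why a dump of the raw backing arrays can be full of zeros even when the vector is sparse, here is a tiny open-addressing map. It assumes, as this thread suggests, that the sparse vector sits on a hash table with parallel key and value arrays; the class below is a hypothetical toy with linear probing and no resizing, not Mahout's or Colt's implementation.

```java
import java.util.Arrays;

// Tiny open-addressing int -> double map; a hypothetical illustration,
// not Mahout's or Colt's implementation.
final class OpenHashDumpSketch {
  private final int[] keys;
  private final double[] values;
  private final boolean[] used;

  OpenHashDumpSketch(int capacity) {
    keys = new int[capacity];
    values = new double[capacity];
    used = new boolean[capacity];
  }

  // Linear probing; assumes the table never fills up completely.
  void put(int key, double value) {
    int slot = Math.floorMod(key, keys.length);
    while (used[slot] && keys[slot] != key) {
      slot = (slot + 1) % keys.length;
    }
    used[slot] = true;
    keys[slot] = key;
    values[slot] = value;
  }

  // Dumps the raw backing array, empty slots included.
  String dumpRawValues() {
    return Arrays.toString(values);
  }

  // Counts only the slots that actually hold an entry.
  int size() {
    int n = 0;
    for (boolean u : used) {
      if (u) {
        n++;
      }
    }
    return n;
  }
}
```

With a capacity of 8 and two entries, dumpRawValues() shows six zero slots although size() is 2, which is the effect described for the JSON dump.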


Re: [math] Question of taste: 'ObjectArrayList'

2010-01-10 Thread Ted Dunning
I take it that the point of this code is to allow filling an ArrayList with
null values efficiently.  This might sometimes be useful, I suppose.

It sounds like you are saying that the virtue of the ObjectArrayList is that
we own it and can make this resizing method efficient.  I don't see any
advantage of that strategy versus forking some other implementation or even
starting a new implementation from scratch.  I am also not clear on the
virtues of this resizing in general.  In particular, the idiom

if (l.size() > size) l.subList(size, l.size()).clear()

seems a better way to clear a bunch of values.  Collections.fill() applied
to a sublist should be good as well.

If you just need to fill in an ArrayList quickly to a desired size, then
addAll from a static list of nulls could be a bit faster.  Is speed really
important here, though?

As a side note, the tests in your code appear to be backwards.

On Sun, Jan 10, 2010 at 12:17 PM, Benson Margulies bimargul...@gmail.com wrote:

 Colt brought us 'ObjectArrayList'. You might ask, what advantage does
 it have over ArrayList<T>?

 Well, I just found myself writing the following to use ArrayList<T> in
 the xxxObjectHashMap set. I could rework ObjectArrayList to be a
 subclass of ArrayList that provided this efficiently, instead of my
 current plan to throw it out altogether once I've got other things
 cleaned up. Thoughts?

  private void resizeArrayList(ArrayList<T> l, int size) {
    while (size < l.size()) {
      l.add(null);
    }
    while (size > l.size()) {
      l.remove(l.size() - 1);
    }
  }




-- 
Ted Dunning, CTO
DeepDyve
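
The idioms mentioned in this message can be combined into one resize helper, sketched below with the comparisons the right way round. The class and method names are hypothetical, and Collections.nCopies is used here as one way to get the "static list of nulls" that addAll needs; it is an illustration, not code from the patch.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper illustrating the idioms mentioned in this thread.
final class ResizeSketch {

  // Grows the list with nulls or truncates it so that it ends up with
  // exactly `size` elements.
  static <T> void resize(List<T> list, int size) {
    if (list.size() > size) {
      // subList takes (fromIndex, toIndex); clearing the view removes the tail.
      list.subList(size, list.size()).clear();
    } else if (list.size() < size) {
      // nCopies returns an immutable list of nulls without allocating them.
      list.addAll(Collections.<T>nCopies(size - list.size(), null));
    }
  }
}
```

Both branches hand the bulk work to the list implementation instead of adding or removing one element per loop iteration.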


Re: [math] Another taste question about collection

2010-01-10 Thread Ted Dunning
Totally agree with this.  IllegalArgumentException seems made for this.

On Sun, Jan 10, 2010 at 12:22 PM, Benson Margulies bimargul...@gmail.com wrote:

 So I plan to make it throw if the type happens not to
 implement comparable.




-- 
Ted Dunning, CTO
DeepDyve


Re: SparseVectors writing out a lot of data

2010-01-10 Thread Ted Dunning
The only major cases that I know that we have are matrices of small
integers.  The most impressive case is where the sparse matrix can only
contain 1 for non-zero values since you don't have to store the value at
all, just the index.  If the indexes are sorted, then delta-PFOR coding or
some such could give some very impressively nice scanning behavior but
pretty hideous indexing behavior.  Just keeping the indexes in sorted order
gives you pretty good memory performance (4 bytes per value) which might be
sufficient.

For small integer needs, I would recommend something akin to PFOR so that
you could store bytes for most values and have an exception table of some
kind for larger values.  One off the cuff design would keep the small values
in a sparse byte matrix and keep the (hopefully much sparser) larger values
in a tower of larger stores.  Such a matrix would have pretty bizarre
properties, but would be an ace on storage size.

On Sun, Jan 10, 2010 at 1:03 PM, Robin Anil robin.a...@gmail.com wrote:

 I used the VIntWritable and VLongWritable in the IntTupleWritable to
 compress the space (variable 2-5 bytes to store integers) needed to
 represent smaller integers. That gave me a lot of savings in the PFPgrowth
 algorithm. Does someone have a similar representation for double values?




-- 
Ted Dunning, CTO
DeepDyve
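
As a rough illustration of the trade-off described in this message (compact scanning, poor random access), the sketch below stores the sorted indexes of a binary sparse vector as gaps in a variable-length byte code. This is deliberately a simpler stand-in and not PFOR itself, and it is not Hadoop's VIntWritable format either; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Delta + variable-length byte coding of sorted indexes. A simpler stand-in
// for the PFOR-style coding mentioned in this thread, not PFOR itself.
final class DeltaIndexSketch {

  // Encodes non-negative indexes in ascending order as gaps, 7 bits per byte.
  static byte[] encode(int[] sortedIndexes) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int previous = 0;
    for (int index : sortedIndexes) {
      int gap = index - previous;
      previous = index;
      while ((gap & ~0x7F) != 0) {
        out.write((gap & 0x7F) | 0x80); // high bit set: more bytes follow
        gap >>>= 7;
      }
      out.write(gap);
    }
    return out.toByteArray();
  }

  // Reverses encode(). Decoding has to scan from the start, which is why
  // random access into such a structure is slow.
  static int[] decode(byte[] bytes) {
    List<Integer> indexes = new ArrayList<>();
    int previous = 0;
    int gap = 0;
    int shift = 0;
    for (byte b : bytes) {
      gap |= (b & 0x7F) << shift;
      if ((b & 0x80) != 0) {
        shift += 7;
      } else {
        previous += gap;
        indexes.add(previous);
        gap = 0;
        shift = 0;
      }
    }
    int[] result = new int[indexes.size()];
    for (int i = 0; i < result.length; i++) {
      result[i] = indexes.get(i);
    }
    return result;
  }
}
```

Small gaps between neighbouring indexes take a single byte each, compared with the 4 bytes per value of a plain sorted int array.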


Re: SparseVectors writing out a lot of data

2010-01-10 Thread Grant Ingersoll
I don't know if it helps, but I have a sparse vector file that is based off a 
1.8 MB Lucene index and it takes up 143 KB.  Earlier, I had a Lucene index that 
was several megs (20+) and the vectors only took 1 MB.

Have you tried debugging?  If I can finish up my chapter tonight, I will try to 
take a closer look.


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: [math] Question of taste: 'ObjectArrayList'

2010-01-10 Thread Benson Margulies
Ted,

I had much the same thoughts while driving to the grocery store after
sending that message.

Death to the class, and I'll clean up the implementation of the users
of the resize business.

--benson





Re: [math] Question of taste: 'ObjectArrayList'

2010-01-10 Thread Benson Margulies
p.s.

.sdrawkcab si od I gnihtyreve os cibaraA tuoba gnikniht neeb evah I.




[math] no-such-integer value

2010-01-10 Thread Benson Margulies
Colt code is inconsistent in dealing with the following case:

 iIntSomethingHashMap.keyOf(someValue)

Some code we got from them returns 0 if there is no such value, other
code returns MIN_VALUE. In floating-point land, it returns NaN.

Personally, I'd be inclined to nuke the entire API. It's implemented
as the obvious iteration, and the caller can iterate for themselves
without creating a shoot-yourself opportunity for the unwary.

Another alternative is to make the signature

keyType keyOf(valueType value, boolean[] present)

I'm in favor of removal, but I could live with the boolean. Thoughts?
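
For comparison, the boolean[] variant could look roughly like the sketch below. It is written against java.util.Map as a stand-in for the primitive hash maps, and the class name, the static form, and the extra map parameter are assumptions made so that the example is self-contained; it is not the Colt API.

```java
import java.util.Map;

// Hypothetical illustration using java.util maps; not the Colt API.
final class KeyOfSketch {

  // Variant with an out-parameter: present[0] says whether the returned
  // key is real, so no sentinel such as 0 or MIN_VALUE is needed.
  static int keyOf(Map<Integer, Integer> map, int value, boolean[] present) {
    for (Map.Entry<Integer, Integer> e : map.entrySet()) {
      if (e.getValue() == value) {
        present[0] = true;
        return e.getKey();
      }
    }
    present[0] = false;
    return 0; // meaningless when present[0] is false
  }
}
```

The body is the same obvious iteration a caller could write directly, which is the argument for removing the method rather than changing its signature.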


Re: [math] no-such-integer value

2010-01-10 Thread Ted Dunning
Nuke it if you don't use it.

On Sun, Jan 10, 2010 at 7:36 PM, Benson Margulies bimargul...@gmail.com wrote:

 Personally, I'd be inclined to nuke the entire API. It's implemented
 as the obvious iteration, and the caller can iterate for themselves
 without creating a shoot-yourself opportunity for the unwary.

 Another alternative is to make the signature

 keyType keyOf(valueType value, boolean[] present)

 I'm in favor of removal, but I could live with the boolean. Thoughts?




-- 
Ted Dunning, CTO
DeepDyve


[jira] Created: (MAHOUT-242) LLR Collocation Identifier

2010-01-10 Thread Drew Farris (JIRA)
LLR Collocation Identifier
--

 Key: MAHOUT-242
 URL: https://issues.apache.org/jira/browse/MAHOUT-242
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.3
Reporter: Drew Farris
Priority: Minor
 Attachments: mahout-colloc.tar.gz

Identifies interesting Collocations in text using ngrams scored via the 
LogLikelihoodRatio calculation. 

As discussed in: 
* 
http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2
* 
http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e

Current form is a tar of a maven project that depends on mahout. Build as usual 
with 'mvn clean install', can be executed using:

{noformat}
mvn -e exec:java -Dexec.mainClass=org.apache.mahout.colloc.CollocDriver \
  -Dexec.args="--input src/test/resources/article --colloc target/colloc --output target/output -w"
{noformat}

Output will be placed in target/output and can be viewed nicely using:

{noformat}
sort -rn -k1 target/output/part-0
{noformat}

Includes rudimentary unit tests. Please review and comment. Needs more work to 
get this into patch state and integrate with Robin's document vectorizer work 
in MAHOUT-237

Some basic TODO/FIXME's include:
* use mahout math's ObjectInt map implementation when available
* make the analyzer configurable
* better input validation + negative unit tests.
* more flexible ways to generate units of analysis (n-1)grams.
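
For readers unfamiliar with the scoring step, the log-likelihood ratio for a candidate bigram can be computed from a 2x2 table of counts, as sketched below. This is a self-contained illustration of the standard G^2 statistic with hypothetical class and method names; it is not the code in mahout-colloc.tar.gz, and how the attached project derives the four counts is not shown here.

```java
// Log-likelihood ratio (G^2) for a 2x2 contingency table; a self-contained
// sketch, not the code from the attached tarball.
final class LlrSketch {

  // k11: both words together; k12: first word without the second;
  // k21: second word without the first; k22: neither word.
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double n = k11 + k12 + k21 + k22;
    double row1 = k11 + k12;
    double row2 = k21 + k22;
    double col1 = k11 + k21;
    double col2 = k12 + k22;
    return 2.0 * (term(k11, row1, col1, n) + term(k12, row1, col2, n)
        + term(k21, row2, col1, n) + term(k22, row2, col2, n));
  }

  // One cell's contribution: k * ln(k * n / (row * col)), taken as 0 when k == 0.
  private static double term(long k, double row, double col, double n) {
    return k == 0 ? 0.0 : k * Math.log(k * n / (row * col));
  }
}
```

A table in which the two words occur independently scores close to zero, and stronger association gives a larger score, which is why sorting the output in descending order brings the most interesting collocations to the top.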







[jira] Updated: (MAHOUT-242) LLR Collocation Identifier

2010-01-10 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-242:
---

Attachment: mahout-colloc.tar.gz





[jira] Commented: (MAHOUT-242) LLR Collocation Identifier

2010-01-10 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798575#action_12798575
 ] 

Robin Anil commented on MAHOUT-242:
---


* Try to stick to SequenceFile<Text,Text> (docid => content) for input, and
leave it to the user to generate this file. There is a
SequenceFileFromDirectory class in the examples in my patch that does this
conversion and writes SequenceFiles to HDFS directly.
* Also take a look at the TermCountMapper, where I have parameterized the
Lucene Analyzer through the conf.
* If you need to pass a tuple as input or output, check out the
StringTupleWritable class instead of appending stuff to Text or splitting it.



