Actually I think I found something that would work: SequenceFilesFromCsvFilter.java

I am trying to use it as follows:

bin/mahout seqdirectory -i input -o output -filter org.apache.mahout.text.SequenceFilesFromCsvFilter -ow

But I am receiving the following exception:

Caused by: java.lang.NumberFormatException: null
    at java.lang.Integer.parseInt(Integer.java:417)
    at java.lang.Integer.parseInt(Integer.java:499)
    at org.apache.mahout.text.SequenceFilesFromCsvFilter.<init>(SequenceFilesFromCsvFilter.java:56)

I believe this is because the class requires a keyColumn and a valueColumn option. Is there any way for me to pass these options along?

When I try adding -kcol to the above seqdirectory command, I receive:

  Unexpected -kcol while processing Job-Specific Options:


Any ideas?

Thanks

On 6/6/11 10:30 AM, Mark wrote:
Thanks

On 6/6/11 10:28 AM, Robin Anil wrote:
Mark, you need to write your own tool to convert the data into sequence files. It's pretty easy: instantiate SequenceFile.Writer with both key and value as
Text and write your data to the file.

If your data is very large, you might want to consider writing a map-only
MapReduce job that reads your input and writes <Text,Text> output using
SequenceFileOutputFormat.

Robin
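
A minimal sketch of the per-line conversion suggested above, written in plain Java so it runs without Hadoop on the classpath. PairSink and LineDocuments are hypothetical names, and the key scheme (file name plus line number) is an assumption, not something from this thread. In a real tool the sink would presumably be a Hadoop SequenceFile.Writer created with Text as both key and value class, with each string wrapped in a Text before calling append.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for SequenceFile.Writer.append(key, value).
interface PairSink {
    void append(String key, String value) throws IOException;
}

class LineDocuments {
    // Emits one (key, value) pair per non-blank line and returns the number emitted.
    // Assumption: the key is "<fileName>:<lineNumber>", the value is the raw line.
    static int convert(BufferedReader in, String fileName, PairSink sink) throws IOException {
        int lineNo = 0;
        int written = 0;
        String line;
        while ((line = in.readLine()) != null) {
            lineNo++;
            if (line.trim().isEmpty()) {
                continue; // skip blank lines, they carry no document
            }
            sink.append(fileName + ":" + lineNo, line);
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Demo input: two documents separated by a blank line.
        List<String> pairs = new ArrayList<>();
        BufferedReader in = new BufferedReader(new StringReader("first doc\n\nsecond doc\n"));
        int n = convert(in, "part-0.txt", (k, v) -> pairs.add(k + "\t" + v));
        System.out.println(n + " documents");
        pairs.forEach(System.out::println);
    }
}
```

Keeping the sink behind a small interface means the line-splitting logic can be checked locally before it is wired to an actual sequence file writer.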

On Mon, Jun 6, 2011 at 10:53 PM, Mark <[email protected]> wrote:

I am looking to run clustering algorithms on these documents, which I thought (I could be wrong) requires sequence files. Is this not the case?

Thanks


On 6/6/11 10:11 AM, Daniel McEnnis wrote:

Mark,

Generally speaking, Mahout has pretty good performance over log files
like the ones you're describing, so they typically don't get changed
into sequence files.  You'll need to write a converter yourself if you
really need sequence files (such as for key management).

Daniel.

On Mon, Jun 6, 2011 at 12:04 PM, Mark <[email protected]> wrote:

I've been running through the examples as described in the Mahout in
Action book and I have some questions regarding the
SequenceFilesFromDirectory.java class.

This class expects a directory of files that contain 1 document per
file. Is there another Mahout class, or some options I can supply to
SequenceFilesFromDirectory.java, to parse multiple documents per file?
For example, my files contain 1 document per line. I would like to parse
each line of each file and create a sequence file from this. Is this
possible with SequenceFilesFromDirectory, or would I have to write this
myself?

Thanks

