Actually, I think I found something that would work:
SequenceFilesFromCsvFilter.java
I am trying to use it as follows:
bin/mahout seqdirectory -i input -o output -filter org.apache.mahout.text.SequenceFilesFromCsvFilter -ow
But I am receiving the following exception:
Caused by: java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Integer.java:417)
at java.lang.Integer.parseInt(Integer.java:499)
at org.apache.mahout.text.SequenceFilesFromCsvFilter.<init>(SequenceFilesFromCsvFilter.java:56)
I believe this is because this class requires a keyColumn and a
valueColumn option. Is there any way for me to pass these options along?
When I try adding them to the seqdirectory command above, I receive:
Unexpected -kcol while processing Job-Specific Options:
Any ideas?
Thanks
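
The trace is consistent with the filter parsing an option that was never set: Integer.parseInt(null) throws NumberFormatException. Below is a minimal sketch of that failure and of a clearer check, assuming the options live in a Map<String, String>. The class MissingOptionDemo and the helper requiredInt are hypothetical names for illustration, not Mahout code.

```java
import java.util.HashMap;
import java.util.Map;

public class MissingOptionDemo {

    /**
     * Parses a required integer option and fails with a readable message
     * when it is absent. The Map-based option store is an assumption.
     */
    public static int requiredInt(Map<String, String> options, String name) {
        String raw = options.get(name); // null when the option was never supplied
        if (raw == null) {
            throw new IllegalArgumentException("Missing required option: " + name);
        }
        return Integer.parseInt(raw);
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>(); // keyColumn not supplied

        try {
            // Same shape as the failure in the trace: parseInt on a null value.
            Integer.parseInt(options.get("keyColumn"));
        } catch (NumberFormatException e) {
            System.out.println("Unchecked parse failed: " + e);
        }

        try {
            requiredInt(options, "keyColumn");
        } catch (IllegalArgumentException e) {
            System.out.println("Checked parse failed: " + e.getMessage());
        }
    }
}
```

Checking for null first does not make the option reachable from the command line, but it turns an opaque NumberFormatException into a message that names the missing option.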
On 6/6/11 10:30 AM, Mark wrote:
Thanks
On 6/6/11 10:28 AM, Robin Anil wrote:
Mark, you need to write your own tool to convert the data into sequence
files.
It's pretty easy: instantiate a SequenceFile.Writer with Text as both the
key and the value class, and write your data to the file.
If your data is very large, you might want to consider writing a map-only
MapReduce job that reads your input and writes Output<Text,Text> using
SequenceFileOutputFormat.
Robin
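
The per-line conversion described above can be sketched as follows. To keep the example runnable without Hadoop on the classpath, records go to a hypothetical RecordSink interface. In a real tool the sink would wrap a SequenceFile.Writer created with Text as the key and value classes and call writer.append(new Text(key), new Text(value)). The class name LinesToRecords and the key format "<sourceName>-<lineNumber>" are assumptions; any unique key per document would do.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LinesToRecords {

    /** Hypothetical stand-in for SequenceFile.Writer; not a Hadoop or Mahout type. */
    public interface RecordSink {
        void append(String key, String value) throws IOException;
    }

    /**
     * Emits one (key, value) record per non-blank line of the input.
     * Returns the number of records handed to the sink.
     */
    public static int convert(String sourceName, Reader input, RecordSink sink)
            throws IOException {
        BufferedReader reader = new BufferedReader(input);
        int lineNumber = 0;
        int written = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            if (line.trim().isEmpty()) {
                continue; // blank lines carry no document
            }
            // Assumed key format: source name plus 1-based line number.
            sink.append(sourceName + "-" + lineNumber, line);
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        List<String> records = new ArrayList<>();
        // With Hadoop available, this lambda would instead call
        // writer.append(new Text(key), new Text(value)).
        RecordSink sink = (key, value) -> records.add(key + "\t" + value);

        String sample = "first document\n\nsecond document\n";
        int count = convert("docs.txt", new StringReader(sample), sink);
        System.out.println(count + " records: " + records);
    }
}
```

Keeping the line-splitting logic separate from the writer means the same convert method could also be called from the map function of a map-only job, with the sink forwarding to the job's output collector.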
On Mon, Jun 6, 2011 at 10:53 PM, Mark<[email protected]> wrote:
I am looking to run clustering algorithms on these documents, which I
thought (I could be wrong) requires sequence files. Is this not the
case?
Thanks
On 6/6/11 10:11 AM, Daniel McEnnis wrote:
Mark,
Generally speaking, Mahout has pretty good performance over log files
like the ones you're describing, so they typically don't get changed
into sequence files. You'll need to write a converter yourself if you
really need sequence files (such as for key management).
Daniel.
On Mon, Jun 6, 2011 at 12:04 PM, Mark<[email protected]> wrote:
I've been running through the examples as described in the Mahout in
Action book and I have some questions regarding the
SequenceFilesFromDirectory.java class.
This class expects a directory of files that contain 1 document per
file. Is there another Mahout class, or some options I can supply to
SequenceFilesFromDirectory.java, to parse multiple documents per file?
For example, my files contain 1 document per line. I would like to
parse each line of each file and create a sequence file from this. Is
this possible with SequenceFilesFromDirectory, or would I have to write
this myself?
Thanks