You're exactly right... with the sequential flag, my filter is found. An exception is thrown, but for now the problem seems to be the json-reading filter itself and not Mahout. Thanks!
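A note on how I read the trace below, in case it helps someone else: ClassUtils.instantiateAs seems to call Class.getConstructor with a fixed list of parameter types and to wrap the resulting NoSuchMethodException in an IllegalStateException. Here is a minimal sketch of that mechanism using plain JDK reflection. Everything in it is a stand-in: Filter, NoArgFilter, MatchingFilter and the three-parameter list are hypothetical and are not Mahout classes or Mahout's actual code.

```java
import java.lang.reflect.Constructor;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ConstructorLookupDemo {

    // Hypothetical stand-in for a filter base type; not a Mahout class.
    public interface Filter {}

    // Declares only a no-arg constructor, so the lookup below finds no match.
    public static class NoArgFilter implements Filter {
        public NoArgFilter() {}
    }

    // Declares a public constructor with exactly the expected parameter types.
    public static class MatchingFilter implements Filter {
        public final String keyPrefix;

        public MatchingFilter(String keyPrefix, Map<String, String> options, Charset charset) {
            this.keyPrefix = keyPrefix;
        }
    }

    // Parameter types the caller insists on (shortened, assumed list for illustration).
    static final Class<?>[] EXPECTED = {String.class, Map.class, Charset.class};

    // Looks up a public constructor by exact parameter types and invokes it.
    // Any reflective failure is rethrown as IllegalStateException, as in the trace.
    public static Filter instantiate(Class<? extends Filter> cls, Object... args) {
        try {
            Constructor<? extends Filter> ctor = cls.getConstructor(EXPECTED);
            return ctor.newInstance(args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] argv) {
        Map<String, String> options = new HashMap<>();

        Filter ok = instantiate(MatchingFilter.class, "", options, StandardCharsets.UTF_8);
        System.out.println("instantiated " + ok.getClass().getSimpleName());

        try {
            instantiate(NoArgFilter.class, "", options, StandardCharsets.UTF_8);
        } catch (IllegalStateException e) {
            // The cause is a NoSuchMethodException naming the missing <init> signature.
            System.out.println("failed: " + e.getCause());
        }
    }
}
```

If that reading is right, my JsonFilter probably needs a public constructor taking the six types listed in the exception (Configuration, String, Map, ChunkedWriter, Charset, FileSystem), but I have not confirmed that yet.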
For completeness, the command is now:

mahout seqdirectory -o test_json -i json_stems.json -filter MahoutFilter.JsonFilter -ow -xm sequential

And the stack trace, apparently caused by problems in my filter, is:

Exception in thread "main" java.lang.IllegalStateException: java.lang.NoSuchMethodException: MahoutFilter.JsonFilter.<init>(org.apache.hadoop.conf.Configuration, java.lang.String, java.util.Map, org.apache.mahout.utils.io.ChunkedWriter, java.nio.charset.Charset, org.apache.hadoop.fs.FileSystem)
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:53)
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:36)
    at org.apache.mahout.text.SequenceFilesFromDirectory.runSequential(SequenceFilesFromDirectory.java:109)
    at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:87)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:63)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:194)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.NoSuchMethodException: MahoutFilter.JsonFilter.<init>(org.apache.hadoop.conf.Configuration, java.lang.String, java.util.Map, org.apache.mahout.utils.io.ChunkedWriter, java.nio.charset.Charset, org.apache.hadoop.fs.FileSystem)
    at java.lang.Class.getConstructor0(Class.java:2754)
    at java.lang.Class.getConstructor(Class.java:1684)
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:47)
    ... 18 more

On Mon, Aug 26, 2013 at 1:35 PM, Suneel Marthi <[email protected]> wrote:

> Seems like a bug in the MR version of seqdirectory. (I am assuming u r
> working off of trunk or Mahout 0.8)
>
> Could you try running this again by specifying the '-xm sequential' option
> and check if the behavior is correct?
>
> ________________________________
> From: Liz Merkhofer <[email protected]>
> To: [email protected]
> Sent: Monday, August 26, 2013 1:19 PM
> Subject: seqdirectory -filter arg: not found, default used, no exception
>
> Hello list,
>
> I'm trying to inject my own filter into "seqdirectory" so I can use a .json
> file in the format {"docid": "text", } as input. I understand that a custom
> filter can be specified as -filter, replacing the default
> PrefixAdditionFilter.
>
> However, when I put what I thought was a json-reading filter in the
> dependencies as MahoutFilter.JsonFilter, it read the whole json file up
> with the file's path as the key and the whole json file as the value - that
> is, exactly as if the default filter were working.
>
> Command for that:
>
> mahout seqdirectory -o test_json -i json_stems.json -filter MahoutFilter.JsonFilter -ow
>
> (MahoutFilter.JsonFilter is the whole classpath.)
>
> Then I tried putting a filter name in there that definitely didn't
> exist:
>
> mahout seqdirectory -o test_json -i json_stems.json -filter NoSuchFilter -ow
>
> Once again, no exception was thrown, and the default filter seems to have
> been used. Still, it does recognize that it was given the argument:
>
> Command line arguments: {--charset=[UTF-8], --chunkSize=[64],
> --endPhase=[2147483647], --fileFilterClass=[NoSuchFilter],
> --input=[json_stems.json], --keyPrefix=[], --method=[mapreduce],
> --output=[test_json], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
>
> My take-away from this is:
>
> 1. When mahout does not find the filter specified, it uses the default.
> Minimally, a user should be warned when their argument is ignored. Perhaps
> I should document this in the jira.
>
> 2. Any ideas on helping mahout find my filter?
>
> 3. There was a csv filter up to 0.5 that also would have done the trick
> here - any reason it's no longer included?
>
> Thanks,
> Liz
