I am assuming your custom JsonFilter extends Mahout's SequenceFilesFromDirectoryFilter and overrides the 'process()' method. Correct?
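
If so, the NoSuchMethodException in the trace below is raised by Class.getConstructor, which ClassUtils.instantiateAs calls with six parameter types (Configuration, String, Map, ChunkedWriter, Charset, FileSystem). My guess is that JsonFilter does not declare a public constructor with exactly that parameter list. Here is a minimal sketch of the lookup using only the JDK. The class names (ConstructorLookupDemo, NoArgFilter, MatchingFilter) are hypothetical stand-ins, and the parameter list is shortened to three JDK types so the example compiles without Mahout or Hadoop; it is not Mahout's actual code.

```java
import java.nio.charset.Charset;
import java.util.Map;

public class ConstructorLookupDemo {

    // Hypothetical stand-in for a filter that declares only a no-arg constructor.
    public static class NoArgFilter {
        public NoArgFilter() {}
    }

    // Hypothetical stand-in that declares the exact parameter list being looked up.
    public static class MatchingFilter {
        final String keyPrefix;

        public MatchingFilter(String keyPrefix, Map<String, String> options, Charset charset) {
            this.keyPrefix = keyPrefix;
        }
    }

    // Returns true only if clazz has a public constructor whose declared
    // parameter types match paramTypes exactly, in order.
    public static boolean hasConstructor(Class<?> clazz, Class<?>... paramTypes) {
        try {
            clazz.getConstructor(paramTypes);
            return true;
        } catch (NoSuchMethodException e) {
            // Same exception type as the "Caused by" entry in the trace.
            return false;
        }
    }

    public static void main(String[] args) {
        Class<?>[] expected = {String.class, Map.class, Charset.class};
        System.out.println(hasConstructor(NoArgFilter.class, expected));    // false
        System.out.println(hasConstructor(MatchingFilter.class, expected)); // true
    }
}
```

If that is what is happening, adding a public constructor to JsonFilter that takes the six types listed in the exception message, and passes them to the superclass constructor, should get you past this error.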

________________________________
From: Liz Merkhofer <[email protected]>
To: [email protected]; Suneel Marthi <[email protected]>
Sent: Monday, August 26, 2013 1:56 PM
Subject: Re: seqdirectory -filter arg: not found, default used, no exception

You're exactly right... with the sequential flag, my filter is found. An exception is thrown, but for now the problem seems to be the json-reading filter itself and not Mahout. Thanks!

For completeness, the command is now:

mahout seqdirectory -o test_json -i json_stems.json -filter MahoutFilter.JsonFilter -ow -xm sequential

And the stacktrace, apparently caused by problems in my filter, is:

Exception in thread "main" java.lang.IllegalStateException: java.lang.NoSuchMethodException: MahoutFilter.JsonFilter.<init>(org.apache.hadoop.conf.Configuration, java.lang.String, java.util.Map, org.apache.mahout.utils.io.ChunkedWriter, java.nio.charset.Charset, org.apache.hadoop.fs.FileSystem)
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:53)
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:36)
    at org.apache.mahout.text.SequenceFilesFromDirectory.runSequential(SequenceFilesFromDirectory.java:109)
    at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:87)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:63)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:194)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.NoSuchMethodException: MahoutFilter.JsonFilter.<init>(org.apache.hadoop.conf.Configuration, java.lang.String, java.util.Map, org.apache.mahout.utils.io.ChunkedWriter, java.nio.charset.Charset, org.apache.hadoop.fs.FileSystem)
    at java.lang.Class.getConstructor0(Class.java:2754)
    at java.lang.Class.getConstructor(Class.java:1684)
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:47)
    ... 18 more

On Mon, Aug 26, 2013 at 1:35 PM, Suneel Marthi <[email protected]> wrote:

> Seems like a bug in the MR version of seqdirectory. (I am assuming u r
> working off of trunk or Mahout 0.8)
>
> Could you try running this again by specifying the '-xm sequential' option
> and check if the behavior is correct?
>
>
> ________________________________
> From: Liz Merkhofer <[email protected]>
> To: [email protected]
> Sent: Monday, August 26, 2013 1:19 PM
> Subject: seqdirectory -filter arg: not found, default used, no exception
>
>
> Hello list,
>
> I'm trying to inject my own filter into "seqdirectory" so I can use a .json
> file in the format {"docid": "text", } as input. I understand that a custom
> filter can be specified as -filter, replacing the default
> PrefixAdditionFilter.
>
> However, when I put what I thought was a json-reading filter in the
> dependencies as MahoutFilter.JsonFilter, it read the whole json file up
> with the file's path as the key and the whole json file as the value - that
> is, exactly as if the default filter were working.
>
> Command for that: mahout seqdirectory -o test_json -i json_stems.json
> -filter MahoutFilter.JsonFilter -ow
>
> (MahoutFilter.JsonFilter is the whole classpath.)
>
> Then I tried putting a filter name in there that definitely didn't
> exist:
>
> mahout seqdirectory -o test_json -i json_stems.json -filter NoSuchFilter
> -ow
>
> Once again, no exception thrown, and the default filter seems to have been
> used. Still, it does recognize that it was given the argument:
> Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=
> [2147483647], --fileFilterClass=[NoSuchFilter], --input=[json_stems.json],
> --keyPrefix=[], --method=[mapreduce], --output=[test_json],
> --overwrite=null, --startPhase=[0], --tempDir=[temp]}
>
> My take-away from this is:
>
> 1. When mahout does not find the filter specified, it uses the default.
> Minimally, a user should be warned when their argument is ignored. Perhaps
> I should document this in the jira.
>
> 2. Any ideas on helping mahout find my filter?
>
> 3. There was a csv filter up to 0.5 that also would have done the trick
> here - any reason it's no longer included?
>
> Thanks,
> Liz
>
