You're exactly right... with the sequential flag, my filter is found. An
exception is thrown, but for now the problem seems to be the json-reading
filter itself and not Mahout. Thanks!

For completeness, the command is now:
mahout seqdirectory -o test_json -i json_stems.json -filter MahoutFilter.JsonFilter -ow -xm sequential

And the stack trace, apparently caused by problems in my filter, is:
Exception in thread "main" java.lang.IllegalStateException: java.lang.NoSuchMethodException: MahoutFilter.JsonFilter.<init>(org.apache.hadoop.conf.Configuration, java.lang.String, java.util.Map, org.apache.mahout.utils.io.ChunkedWriter, java.nio.charset.Charset, org.apache.hadoop.fs.FileSystem)
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:53)
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:36)
    at org.apache.mahout.text.SequenceFilesFromDirectory.runSequential(SequenceFilesFromDirectory.java:109)
    at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:87)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:63)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:194)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.NoSuchMethodException: MahoutFilter.JsonFilter.<init>(org.apache.hadoop.conf.Configuration, java.lang.String, java.util.Map, org.apache.mahout.utils.io.ChunkedWriter, java.nio.charset.Charset, org.apache.hadoop.fs.FileSystem)
    at java.lang.Class.getConstructor0(Class.java:2754)
    at java.lang.Class.getConstructor(Class.java:1684)
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:47)
    ... 18 more
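
Reading the trace: ClassUtils.instantiateAs calls Class.getConstructor with six parameter types (Configuration, String, Map, ChunkedWriter, Charset, FileSystem) and finds no public constructor on JsonFilter with that exact signature. Below is a minimal sketch of that kind of reflective lookup. It uses hypothetical stand-in classes (NoArgFilter, MatchingFilter) and a shortened three-argument signature rather than the real Mahout and Hadoop types, so it only illustrates the mechanism and is not Mahout's code.

```java
import java.nio.charset.Charset;
import java.util.Map;

public class ConstructorLookupDemo {

    // Hypothetical stand-in: a filter that only declares a no-arg constructor.
    public static class NoArgFilter {
        public NoArgFilter() {}
    }

    // Hypothetical stand-in: a filter whose public constructor matches the
    // parameter types that the caller looks up reflectively.
    public static class MatchingFilter {
        public MatchingFilter(String keyPrefix, Map<String, String> options, Charset charset) {}
    }

    // Returns true if cls has a public constructor with exactly these
    // parameter types; Class.getConstructor throws NoSuchMethodException otherwise.
    public static boolean hasConstructor(Class<?> cls, Class<?>... paramTypes) {
        try {
            cls.getConstructor(paramTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Class<?>[] expected = {String.class, Map.class, Charset.class};
        System.out.println("NoArgFilter has it: " + hasConstructor(NoArgFilter.class, expected));
        System.out.println("MatchingFilter has it: " + hasConstructor(MatchingFilter.class, expected));
    }
}
```

If the same mechanism applies here, the likely fix is to give JsonFilter a public constructor whose parameter types match the six listed in the exception, in that order.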


On Mon, Aug 26, 2013 at 1:35 PM, Suneel Marthi <[email protected]> wrote:

> Seems like a bug in the MR version of seqdirectory. (I am assuming you are
> working off of trunk or Mahout 0.8.)
>
> Could you try running this again by specifying the '-xm sequential' option
> and check if the behavior is correct?
>
>
>
>
> ________________________________
>  From: Liz Merkhofer <[email protected]>
> To: [email protected]
> Sent: Monday, August 26, 2013 1:19 PM
> Subject: seqdirectory -filter arg: not found, default used, no exception
>
>
> Hello list,
>
> I'm trying to inject my own filter into "seqdirectory" so I can use a .json
> file in the format {"docid": "text", } as input. I understand that a custom
> filter can be specified as -filter, replacing the default
> PrefixAdditionFilter.
>
> However, when I put what I thought was a json-reading filter in the
> dependencies as MahoutFilter.JsonFilter, it read the whole json file up
> with the file's path as the key and the whole json file as the value - that
> is, exactly as if the default filter were working.
>
> Command for that: mahout seqdirectory -o test_json -i json_stems.json
> -filter MahoutFilter.JsonFilter -ow
>
> (MahoutFilter.JsonFilter is the fully qualified class name.)
>
> Then I tried putting a filter name in there that definitely didn't
> exist:
>
> mahout seqdirectory -o test_json -i json_stems.json -filter NoSuchFilter
> -ow
>
> Once again, no exception thrown, and the default filter seems to have been
> used. Still, it does recognize that it was given the argument:
> Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=
> [2147483647], --fileFilterClass=[NoSuchFilter], --input=[json_stems.json],
> --keyPrefix=[], --method=[mapreduce], --output=[test_json],
> --overwrite=null, --startPhase=[0], --tempDir=[temp]}
>
> My take-away from this is:
>
> 1. When mahout does not find the filter specified, it uses the default.
> Minimally, a user should be warned when their argument is ignored. Perhaps
> I should document this in the jira.
>
> 2. Any ideas on helping mahout find my filter?
>
> 3. There was a csv filter up to 0.5 that also would have done the trick
> here - any reason it's no longer included?
>
> Thanks,
> Liz
>
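
Regarding point 1 of the quoted message, here is a minimal sketch of what a warning on fallback could look like. This is not Mahout's actual code: FilterResolver, DefaultFilter, and resolve are hypothetical names used only for illustration, and the real seqdirectory job may resolve the class differently.

```java
public class FilterResolver {

    // Hypothetical stand-in for the default filter class.
    public static class DefaultFilter {}

    // Tries to load the named class; if it cannot be found, prints a warning
    // to stderr and returns the fallback instead of failing silently.
    public static Class<?> resolve(String className, Class<?> fallback) {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            System.err.println("WARN: filter class '" + className
                    + "' not found; using " + fallback.getName() + " instead");
            return fallback;
        }
    }

    public static void main(String[] args) {
        // A name that is not on the classpath falls back with a warning.
        System.out.println(resolve("NoSuchFilter", DefaultFilter.class).getName());
        // A name that does exist is returned as-is.
        System.out.println(resolve("java.lang.String", DefaultFilter.class).getName());
    }
}
```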
