Hello list,
I'm trying to inject my own filter into "seqdirectory" so I can use a .json
file in the format {"docid": "text", } as input. I understand that a custom
filter can be specified as -filter, replacing the default
PrefixAdditionFilter.
However, when I put what I thought was a JSON-reading filter in the
dependencies as MahoutFilter.JsonFilter, seqdirectory read in the whole
JSON file with the file's path as the key and the entire file contents as
the value - that is, exactly as if the default filter were in use.
Command for that: mahout seqdirectory -o test_json -i json_stems.json
-filter MahoutFilter.JsonFilter -ow
(MahoutFilter.JsonFilter is the fully qualified class name.)
Then I tried passing a filter name that definitely doesn't exist:
mahout seqdirectory -o test_json -i json_stems.json -filter NoSuchFilter -ow
Once again, no exception thrown, and the default filter seems to have been
used. Still, it does recognize that it was given the argument:
Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=
[2147483647], --fileFilterClass=[NoSuchFilter], --input=[json_stems.json],
--keyPrefix=[], --method=[mapreduce], --output=[test_json],
--overwrite=null, --startPhase=[0], --tempDir=[temp]}
My take-aways and questions:
1. When mahout does not find the specified filter, it appears to fall back
silently to the default. At a minimum, a user should be warned when their
argument is ignored. Perhaps I should file this in JIRA.
2. Any ideas on helping mahout find my filter?
3. There was a csv filter up to 0.5 that also would have done the trick
here - any reason it's no longer included?
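
To make point 1 concrete, here is a minimal sketch of the behaviour I would
have expected: load the class named on the command line by reflection and
fail loudly if it cannot be found, instead of quietly using the default.
This is not Mahout's actual code. FilterLoader, DefaultFilter and loadFilter
are hypothetical names, and java.util.function.Predicate<String> is only a
stand-in for whatever filter type seqdirectory really expects.

```java
import java.util.function.Predicate;

// Hypothetical helper, not part of Mahout: resolves a filter class by name.
final class FilterLoader {

    // Stand-in for the default filter (assumption): accepts every path.
    public static class DefaultFilter implements Predicate<String> {
        @Override
        public boolean test(String path) {
            return true;
        }
    }

    // Returns the default filter only when no class name was given.
    // A name that cannot be loaded is reported instead of being ignored.
    static Predicate<String> loadFilter(String className) {
        if (className == null || className.isEmpty()) {
            return new DefaultFilter();
        }
        try {
            Class<?> cls = Class.forName(className);
            Object instance = cls.getDeclaredConstructor().newInstance();
            if (!(instance instanceof Predicate)) {
                throw new IllegalArgumentException(
                    "Class '" + className + "' is not a filter");
            }
            @SuppressWarnings("unchecked")
            Predicate<String> filter = (Predicate<String>) instance;
            return filter;
        } catch (ReflectiveOperationException e) {
            // Covers a missing class, a missing no-arg constructor,
            // and a constructor that fails.
            throw new IllegalArgumentException(
                "Could not load filter class '" + className + "'", e);
        }
    }
}
```

With something like this, FilterLoader.loadFilter("NoSuchFilter") would throw
an IllegalArgumentException rather than carrying on with the default.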
Thanks,
Liz