Hello list,

I'm trying to inject my own filter into "seqdirectory" so I can use a .json
file in the format {"docid": "text", } as input. I understand that a custom
filter can be specified as -filter, replacing the default
PrefixAdditionFilter.

However, when I put what I thought was a JSON-reading filter in the
dependencies as MahoutFilter.JsonFilter, it read in the whole JSON file
with the file's path as the key and the entire file contents as the value -
that is, exactly as if the default filter were being used.
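
To make the difference concrete, here is a small Python sketch (not Mahout
code, just an illustration) of the key/value pairs I was hoping for versus
what I actually got. The function names and the sample documents are made up
for this example, and the sample is strict JSON without the trailing comma.

```python
import json


def expected_records(json_text):
    """One (key, value) pair per document: what I hoped the filter would emit."""
    docs = json.loads(json_text)
    return [(doc_id, text) for doc_id, text in docs.items()]


def observed_records(path, json_text):
    """A single pair for the whole file: what seqdirectory actually produced."""
    return [(path, json_text)]


# Hypothetical sample input in the {"docid": "text"} format.
sample = '{"doc1": "first stemmed text", "doc2": "second stemmed text"}'

print(expected_records(sample))
print(observed_records("json_stems.json", sample))
```

In other words, I want one record per docid, not one record per file.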

The command I used:

mahout seqdirectory -o test_json -i json_stems.json -filter MahoutFilter.JsonFilter -ow

(MahoutFilter.JsonFilter is the fully qualified class name.)

Then I tried putting in a filter name that definitely didn't exist:

mahout seqdirectory -o test_json -i json_stems.json -filter NoSuchFilter -ow

Once again, no exception thrown, and the default filter seems to have been
used. Still, it does recognize that it was given the argument:
Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=
[2147483647], --fileFilterClass=[NoSuchFilter], --input=[json_stems.json],
--keyPrefix=[], --method=[mapreduce], --output=[test_json],
--overwrite=null, --startPhase=[0], --tempDir=[temp]}

My take-away from this is:

1. When Mahout does not find the specified filter, it appears to fall back
to the default without saying so. At a minimum, a user should be warned when
their argument is ignored. Perhaps I should file this in JIRA.

2. Any ideas on helping mahout find my filter?

3. There was a csv filter up to 0.5 that also would have done the trick
here - any reason it's no longer included?

Thanks,
Liz
