On 06/10/15 12:36, Eva Schlauch wrote:
Thanks for the nice introduction to apache tika at the Budapest
apache:big_data conference!
Glad to see we've inspired you to join the community :)
I am considering using Apache Tika together with Apache OODT. Some of
the files that are generated here (in the mm-wave lab at MPIfR) are of
the touchstone file format, namely .s2p and similar. (I think the
general file would be .snp). I have almost no experience with writing
parsers, but as far as I see, I would need a new parser to be used from
within Tika, is that right? I haven't seen any parser in a short web search,
but would be glad to be pointed in the right direction.
I'd suggest first starting with the "5 minute parser quickstart" guide -
http://tika.apache.org/1.10/parser_guide.html - though be aware that
it's only 5 minutes in the same way that a Jamie Oliver 30 minute meal
is 30 minutes, i.e. after practice and with the environment prepared!
How would I go about getting Tika to parse the metadata from the header?
Here is a link to the format description:
http://na.support.keysight.com/plts/help/WebHelp/FilePrint/SnP_File_Format.htm
It looks like the header should be fairly easy to spot. First thing to
do would be to create/get a small test file for each of the s#p variants.
Open an enhancement request in JIRA for mime detection, and attach those.
Next, follow the guide to add mime entries and mime magic for those, along
with writing a small detection test using those files. Attach all that
to the JIRA issue.
With detection working, it's on to the parser. With a plain text based
format, it should be fairly easy. Looks from that page like using a
BufferedReader to read it line-by-line should work. Tokenize/split to
get the key, map that onto a suitable Tika metadata key, and save the
value (converting if needed). Looks like you might need to have two sets
of logic, one for s1p and one for s2p, as they look to have slightly
different headers, though I guess the data is the same?
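To make the line-by-line idea concrete, here is a rough sketch in plain Java with no Tika dependency. It assumes the header is the Touchstone option line ("# <unit> <parameter> <format> R <n>"), that "!" starts a comment, and that the defaults are GHz / S / MA / 50 ohms, which is my reading of the format description. The "touchstone:..." key names are made up for illustration; a real parser would map onto proper Tika metadata keys.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class TouchstoneHeaderSketch {
    // Hypothetical key names, not existing Tika metadata properties.
    static final String FREQ_UNIT = "touchstone:frequency-unit";
    static final String PARAMETER = "touchstone:parameter";
    static final String FORMAT = "touchstone:format";
    static final String RESISTANCE = "touchstone:reference-resistance";

    /** Reads up to the first option line ("# ...") and returns its fields. */
    static Map<String, String> parseHeader(BufferedReader reader) throws IOException {
        Map<String, String> meta = new LinkedHashMap<>();
        // Assumed defaults, used when a field (or the whole option line) is missing.
        meta.put(FREQ_UNIT, "GHZ");
        meta.put(PARAMETER, "S");
        meta.put(FORMAT, "MA");
        meta.put(RESISTANCE, "50");

        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("!")) {
                continue; // skip blank lines and comment lines
            }
            if (!trimmed.startsWith("#")) {
                break; // data started without an option line, keep the defaults
            }
            // Option line fields are treated as case-insensitive.
            String[] tokens = trimmed.substring(1).trim()
                    .toUpperCase(Locale.ROOT).split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                switch (tokens[i]) {
                    case "HZ": case "KHZ": case "MHZ": case "GHZ":
                        meta.put(FREQ_UNIT, tokens[i]);
                        break;
                    case "S": case "Y": case "Z": case "H": case "G":
                        meta.put(PARAMETER, tokens[i]);
                        break;
                    case "DB": case "MA": case "RI":
                        meta.put(FORMAT, tokens[i]);
                        break;
                    case "R":
                        // "R" is followed by the reference resistance value.
                        if (i + 1 < tokens.length) {
                            meta.put(RESISTANCE, tokens[++i]);
                        }
                        break;
                    default:
                        break; // ignore anything unrecognised
                }
            }
            break; // only the first option line is considered
        }
        return meta;
    }
}
```

Inside a real Tika parser you would wrap the incoming InputStream in a reader and copy these values into the Metadata object, rather than returning a Map.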
Finally, for the values, probably outputting in a table would be the
least worst option.
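As a rough illustration of the table idea, the plain Java sketch below turns whitespace-separated data lines into table rows, one cell per value. It assumes a "!" may also start a trailing comment on a data line, and it builds a string purely to stay self-contained; in an actual Tika parser the rows would be emitted as XHTML events on the content handler instead. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TouchstoneTableSketch {

    /** Splits one data line into cells, dropping any trailing "!" comment. */
    static List<String> toCells(String line) {
        int bang = line.indexOf('!');
        String data = (bang >= 0 ? line.substring(0, bang) : line).trim();
        List<String> cells = new ArrayList<>();
        if (data.isEmpty()) {
            return cells; // blank or comment-only line
        }
        for (String token : data.split("\\s+")) {
            cells.add(token);
        }
        return cells;
    }

    /** Renders the data lines as a simple table, one row per non-empty line. */
    static String toTable(List<String> dataLines) {
        StringBuilder sb = new StringBuilder("<table>\n");
        for (String line : dataLines) {
            List<String> cells = toCells(line);
            if (cells.isEmpty()) {
                continue;
            }
            sb.append("  <tr>");
            for (String cell : cells) {
                // No escaping here: the cells are expected to be plain numbers.
                sb.append("<td>").append(cell).append("</td>");
            }
            sb.append("</tr>\n");
        }
        return sb.append("</table>").toString();
    }
}
```

The column count is not checked here, so s1p and s2p lines would simply produce rows of different widths.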
Depending on your preferences, you can use a GitHub fork with a custom
branch, or Review Board, or just patches in JIRA, to get feedback on your
work.
Oh, and if you're looking for some existing code to "crib" from, you
might be better looking at some of the non-scientific parsers. Quite a
few of the scientific parsers (though not all!) have been written by
other new community members, so may not always follow best practice.
Don't get me wrong - they work, and it's awesome that we have them! They
just might not always be the best example to learn from.
We'll look forward to your patches soon :)
Thanks
Nick