Hi,
I have a few hundred CSV files on a web server that I can pull down
via HTTPS without any credentials. I wonder how I can write a storage
plugin for Drill that pulls these files directly from the web server
without having to download them to the local file system.
I have a couple of options:
1) The plugin could do a simple HTTP directory listing to get these
files.
2) I could provide a text file with the URLs of the files, simply like:
https://mywebserver.com/myfolder/myfile1.csv
https://mywebserver.com/myfolder/myfile2.csv
3) The web server supports a JSON file listing, like this:
curl -s https://mywebserver.com/myfolder?format=json | python -m json.tool
[
    {
        "hash": "e5f62378c79ec9c491aa130374dba93b",
        "last_modified": "2016-09-30T19:15:45.730950",
        "bytes": 211169,
        "name": "myfile1.csv",
        "content_type": "text/csv"
    },
    {
Option 3 would be the most elegant to me.
Does something like this already exist, or would I duplicate the S3 plugin
and modify it, along the lines of the config below?
Thanks for your help!
dipe
{
  "type": "file",
  "enabled": true,
  "connection": "https://mywebserver.com/myfolder?format=json",
  "config": null,
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}
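
To make option 3 concrete, here is a rough sketch, outside of Drill, of how the JSON listing could be turned into the list of file URLs a plugin would need to read. It is only an illustration, not a Drill plugin: the function names csv_urls and fetch_listing are made up for this example, and it assumes every listing entry has a "name" field as in the curl output above and that the files sit directly under the folder URL.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def csv_urls(base_url, listing):
    """Build a full URL for every .csv entry in a parsed JSON listing.

    Assumption: each entry is a dict with a "name" key, as in the
    sample listing output.
    """
    base = base_url.rstrip("/")
    return [
        f"{base}/{quote(entry['name'])}"
        for entry in listing
        if entry["name"].endswith(".csv")
    ]


def fetch_listing(base_url):
    """Download and parse the JSON listing (performs a network request).

    Assumption: the server answers ?format=json on the folder URL.
    """
    with urlopen(f"{base_url.rstrip('/')}?format=json") as resp:
        return json.load(resp)


# Sample entry copied from the listing output above, so the sketch
# can be tried without touching the network.
sample = [
    {
        "hash": "e5f62378c79ec9c491aa130374dba93b",
        "last_modified": "2016-09-30T19:15:45.730950",
        "bytes": 211169,
        "name": "myfile1.csv",
        "content_type": "text/csv",
    }
]

for url in csv_urls("https://mywebserver.com/myfolder", sample):
    print(url)
```

Against the real server, csv_urls(base, fetch_listing(base)) would give the same kind of list as the text file in option 2, so either option could feed the same reading logic.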