>> 1. "only for matched ones the binary data is read" is not
>> true, as ORC does not know the offset of each BINARY, so things like seek
>> could not happen.
>>
>> 2. I've tried ORC and it does skip the partitions that have no hit. This
>> could be a solution, but the performance depends on the distribution of the
>> given ID list. No partition could be skipped in the worst case.
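The worst case described above, that nothing can be skipped when the requested IDs are spread uniformly, can be illustrated with a quick pure-Python simulation; the partition layout and sizes below are made-up numbers, not figures from the thread:

```python
import random

random.seed(0)

# Made-up layout: 10M ids spread evenly over 1,000 partitions.
ids_per_partition = 10_000
total_ids = 1_000 * ids_per_partition

def partitions_hit(requested_ids):
    """Number of partitions containing at least one requested id."""
    return len({i // ids_per_partition for i in requested_ids})

# Clustered request: 10K consecutive ids touch a single partition,
# so min/max statistics let almost everything be skipped.
clustered = range(5_000_000, 5_010_000)

# Uniform request: 10K random ids land in nearly every partition,
# so almost nothing can be skipped -- the worst case above.
uniform = random.sample(range(total_ids), 10_000)

print(partitions_hit(clustered), partitions_hit(uniform))
```

With a clustered ID list one partition is read; with a uniform list nearly all 1,000 partitions contain a hit, so partition-level skipping buys almost nothing.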
Time: 2017-04-17 16:48:47
To: Mo Tao
Cc: user
Subject: Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?
How about the event timeline on executors? It seems adding more executors
could help.
1. I found a JIRA (https://issues.apache.org/jira/browse/SPARK-11621) that
states the ppd
It's Hadoop Archive.
https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html
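The HAR idea from the thread, with the binaries archived in HDFS and a small id-to-path table kept in ORC/Parquet, can be sketched in plain Python. All paths, id formats, and the in-memory index below are hypothetical stand-ins for the real 10M-row table:

```python
# Sketch of the lookup-table side of the HAR approach: the binaries live
# inside a Hadoop Archive, and a small id -> path table (stored in
# ORC/Parquet in practice) maps each id to its archive member file.
# Paths and id format are made up for illustration.
har = "har:///data/videos.har"
index = {f"id-{i:07d}": f"{har}/part-{i % 100:02d}/id-{i:07d}.bin"
         for i in range(1000)}  # stand-in for the real 10M-row table

def paths_for(requested_ids):
    """Filter the small table first; only matching archive members get read."""
    return [index[i] for i in requested_ids if i in index]

hits = paths_for(["id-0000042", "id-9999999"])  # second id is absent
```

The point of the design is that the filter runs against the small id/path table, so only the requested binaries are ever opened, instead of scanning 5 MB values for every record.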
From: Alonso Isidoro Roman
Sent: 2017-04-20 17:03:33
To: Mo Tao
Cc: Jörn Franke; user@spark.apache.org
Subject: Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?
> --
> *From:* Jörn Franke
> *Sent:* 2017-04-17 22:37:48
> *To:* Mo Tao
> *Cc:* user@spark.apache.org
> *Subject:* Re: Re: How to store 10M records in HDFS to speed up further
> filtering?
>
> Yes, 5 MB is a difficult size: too small for HDFS, too big for parquet/orc.
> Maybe you can put the data in a HAR and store id, path in orc/parquet.
Yes, 5 MB is a difficult size: too small for HDFS, too big for parquet/orc.
Maybe you can put the data in a HAR and store id, path in orc/parquet.
On 17. Apr 2017, at 10:52, Mo Tao <mo...@sensetime.com> wrote:
Hi Jörn,
I do think a 5 MB column is odd but I don't have any other idea before asking
this question. The binary data is a short video and the maximum size is no more
than 50 MB.
Hadoop archive sounds very interesting and I'll try it first to check whether
filtering is fast on it.
This could be a solution, but the performance depends on the distribution of the given ID
list. No partition could be skipped in the worst case.
Mo Tao
From: Ryan
Sent: 2017-04-17 15:42:46
To: Mo Tao
Cc: user
Subject: Re: Re: How to store 10M records in HDFS to speed up further filtering?
1. Per my understanding, for ORC files, it should push down the filters,
which means all id columns will be scanned but the binary data is read only
for the matched ones. I haven't dug into the spark-orc reader though.
2. ORC itself has a row-group index and a bloom filter index. You may try
configurations
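Point 2 can be illustrated with a toy pure-Python bloom filter. This is only a sketch of the idea behind ORC's per-stripe bloom filter index, not ORC's actual implementation; the sizes and hash scheme are arbitrary choices:

```python
import hashlib

class BloomFilter:
    """Toy bloom filter backed by a Python int used as a bitset."""
    def __init__(self, size=65536, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0
    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size
    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p
    def might_contain(self, key):
        # False positives possible, false negatives impossible.
        return all(self.bits >> p & 1 for p in self._positions(key))

# One filter per "stripe" of ids, analogous to ORC's per-stripe index.
stripes = [range(0, 1000), range(1000, 2000), range(2000, 3000)]
filters = [BloomFilter() for _ in stripes]
for f, ids in zip(filters, stripes):
    for i in ids:
        f.add(i)

# A lookup only has to read stripes whose filter reports a possible hit.
wanted = 1500
to_read = [n for n, f in enumerate(filters) if f.might_contain(wanted)]
```

In Spark/Hive this would be driven by ORC writer properties such as `orc.bloom.filter.columns`, though the exact way to set them depends on the Spark and ORC versions involved.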
Hi Ryan,
1. "expected qps and response time for the filter request"
I expect that only the requested BINARY values are scanned instead of all records, so
the response time would be "10K * 5 MB / disk read speed", or several times
that.
In practice, our cluster has 30 SAS disks and scanning all the
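The estimate above can be written out as a back-of-envelope calculation. The 150 MB/s per-disk sequential throughput is an assumed figure for SAS disks, not a number from the thread:

```python
# Back-of-envelope for the filter response time described above.
requested_records = 10_000   # "10K" ids per filter request
record_size_mb = 5           # ~5 MB of binary data per record
disks = 30                   # SAS disks in the cluster
per_disk_mb_s = 150          # assumed sequential throughput per SAS disk

total_mb = requested_records * record_size_mb  # 50,000 MB to read
aggregate_mb_s = disks * per_disk_mb_s         # 4,500 MB/s if reads spread evenly
best_case_seconds = total_mb / aggregate_mb_s

# Ideal lower bound; seeks and skew make the real time "several times" this.
print(round(best_case_seconds, 1))
```

That gives roughly 11 seconds in the ideal case, which is why avoiding a full scan of all 10M records matters so much here.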