Re: Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Ryan
>> "…the binary data is read" is not true, as ORC does not know the offset of each BINARY, so things like seek could not happen. >> 2. I've tried ORC and it does skip the partitions that have no hit. This could be a solution but the performance depends…
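
For reference, the lookup pattern being debated might look like the following sketch, assuming a spark-shell session (so spark is predefined) and placeholder paths, column names, and IDs:

    import org.apache.spark.sql.functions.col

    // Placeholder request; the thread is about ~10K ids out of 10M records.
    val wanted = Seq("id-0001", "id-0002")

    // If pushdown (spark.sql.orc.filterPushdown) is active, ORC can skip
    // stripes whose statistics rule out every requested id; in the worst
    // case every stripe is still scanned.
    val hits = spark.read.orc("/data/records_orc")
      .filter(col("id").isin(wanted: _*))
    hits.select("id").show()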

Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
…Sent: April 17, 2017 16:48:47 To: 莫涛 Cc: user Subject: Re: Re: Re: How to store 10M records in HDFS to speed up further filtering? How about the event timeline on the executors? It seems adding more executors could help. 1. I found a JIRA (https://issues.apache.org/jira/browse/SPARK-11621) that states the ppd…
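
For context, a minimal sketch of switching that pushdown on; the config key is real, but whether it is on by default varies by Spark version, so treat the value as something to verify:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("orc-ppd-sketch")
      // ORC predicate pushdown; off by default in Spark versions
      // contemporary with this thread, so it must be set explicitly.
      .config("spark.sql.orc.filterPushdown", "true")
      .getOrCreate()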

Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
It's Hadoop Archive: https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html From: Alonso Isidoro Roman Sent: April 20, 2017 17:03:33 To: 莫涛 Cc: Jörn Franke; user@spark.apache.org Subject: Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?
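
A hedged sketch of the round trip, with placeholder paths: the archive is built once with the hadoop archive command from the page above, and Spark then addresses files inside it through the har:// filesystem (spark-shell assumed, so spark is predefined):

    // One-off MapReduce job that packs the small files into one archive
    // (command from the Hadoop Archives guide; paths are placeholders):
    //   hadoop archive -archiveName videos.har -p /raw/videos /archives

    // Each element is (path, stream) for one archived file.
    val videos = spark.sparkContext
      .binaryFiles("har:///archives/videos.har/*")
    videos.keys.take(5).foreach(println)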

Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Alonso Isidoro Roman
> -- > *From:* Jörn Franke > *Sent:* April 17, 2017 22:37:48 > *To:* 莫涛 > *Cc:* user@spark.apache.org > *Subject:* Re: Re: How to store 10M records in HDFS to speed up further filtering? > > Yes, 5 MB is a difficult size, too small for HDFS, too big…

Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
…How to store 10M records in HDFS to speed up further filtering? Yes, 5 MB is a difficult size, too small for HDFS, too big for parquet/orc. Maybe you can put the data in a HAR and store (id, path) in ORC/Parquet. On 17 Apr 2017, at 10:52, 莫涛 <mo...@sensetime.com> wrote: Hi Jörn, I do think…

Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Jörn Franke
Yes, 5 MB is a difficult size, too small for HDFS, too big for parquet/orc. Maybe you can put the data in a HAR and store (id, path) in ORC/Parquet. > On 17 Apr 2017, at 10:52, 莫涛 wrote: > > Hi Jörn, > > I do think a 5 MB column is odd but I don't have any other idea before asking this question…
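
A sketch of how that two-level layout might be queried, assuming spark-shell, an ORC index with (id, path) columns, and the videos packed in a HAR; every path and name here is a placeholder:

    import org.apache.spark.sql.functions.col
    import spark.implicits._

    val wantedIds = Seq("id-0001", "id-0002")   // placeholder ID list

    // The (id, path) index is tiny compared to the videos, so scanning
    // or push-down filtering it is cheap.
    val paths = spark.read.orc("/index/id_path")
      .filter(col("id").isin(wantedIds: _*))
      .select("path")
      .as[String]
      .collect()

    // Only the matched videos are then read; binaryFiles accepts a
    // comma-separated list of (har://) paths.
    val videos = spark.sparkContext.binaryFiles(paths.mkString(","))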

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread 莫涛
Hi Jörn, I do think a 5 MB column is odd, but I don't have any other idea before asking this question. The binary data is a short video and the maximum size is no more than 50 MB. Hadoop Archive sounds very interesting and I'll try it first to check whether filtering on it is fast. To my be…

Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
> …depends on the distribution of the given ID list. No partition could be skipped in the worst case. > > Mo Tao > > ------------------ > *From:* Ryan > *Sent:* April 17, 2017 15:42:46 > *To:* 莫涛 > *Cc:* user > *Subject:* Re: Re: How to store 10M records in HDFS…

Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread 莫涛
…distribution of the given ID list. No partition could be skipped in the worst case. Mo Tao From: Ryan Sent: April 17, 2017 15:42:46 To: 莫涛 Cc: user Subject: Re: Re: How to store 10M records in HDFS to speed up further filtering? 1. Per my understanding, for ORC files, it…
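
One possible mitigation for that worst case, not proposed in the thread itself: cluster the records by id at write time so each output file spans a narrow id range and stripe-level min/max statistics become selective. A sketch assuming spark-shell, Spark 2.3+'s repartitionByRange, and placeholder paths and partition count:

    import org.apache.spark.sql.functions.col

    // Range-partition then sort, so files cover narrow, disjoint id
    // ranges; a lookup for any given id then touches few files/stripes.
    spark.read.orc("/data/records_orc")
      .repartitionByRange(200, col("id"))
      .sortWithinPartitions("id")
      .write
      .orc("/data/records_orc_clustered")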

Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
1. Per my understanding, for ORC files, it should push down the filters, which means the id column of every row will be scanned but the binary data is read only for the matched ones. I haven't dug into the spark-orc reader though. 2. ORC itself has a row group index and a bloom filter index; you may try the configurations…
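
A minimal sketch of those configurations on the write side, assuming a Spark version with the native ORC writer (2.3+); the orc.* keys are standard ORC table properties, while the column name and paths are placeholders:

    // Bloom filters are written per stripe for the listed columns; the
    // row group index (every 10,000 rows by default) comes with ORC.
    spark.read.parquet("/data/records_raw")        // hypothetical input
      .write
      .option("orc.bloom.filter.columns", "id")
      .option("orc.bloom.filter.fpp", "0.05")      // false-positive rate
      .orc("/data/records_orc")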

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread 莫涛
Hi Ryan, 1. "expected qps and response time for the filter request": I expect that only the requested BINARY values are scanned instead of all records, so the response time would be "10K * 5 MB / disk read speed", or several times that. In practice, our cluster has 30 SAS disks and scanning all the…
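
Spelling that estimate out (every figure except the disk speed comes from the message; ~150 MB/s sequential per SAS disk is an assumed round number):

    val requestedIds = 10000     // 10K ids per filter request
    val mbPerRecord  = 5.0       // ~5 MB of BINARY per record
    val disks        = 30        // SAS disks in the cluster
    val mbPerSec     = 150.0     // assumed per-disk sequential read speed

    // Best case: only the requested records are read, spread evenly
    // across all disks.
    val seconds = requestedIds * mbPerRecord / (disks * mbPerSec)
    println(f"best-case response time: $seconds%.1f s")   // ≈ 11.1 s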