Re: How to store 10M records in HDFS to speed up further filtering?

莫涛 Mon, 17 Apr 2017 00:02:27 -0700

Hi Ryan,

1. "expected qps and response time for the filter request"

I expect that only the requested BINARY values are scanned instead of all
records, so the response time would be about "10K * 5MB / disk read speed", or
a few times that.

In practice, our cluster has 30 SAS disks, and scanning all 10M * 5MB of data
currently takes about 6 hours. With selective reads it should drop to several
minutes.
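
The estimate above can be worked through as a small calculation. This is only a sketch: it assumes that a selective read achieves the same aggregate throughput as the 6-hour full scan, which random reads of 5MB blobs may not reach in practice. The object and field names are illustrative.

```scala
// Back-of-the-envelope estimate using only the figures quoted in this message.
// Assumption: selective reads reach the same aggregate throughput as the full scan.
object ScanEstimate {
  val recordBytes: Double      = 5e6        // 5MB per BINARY on average
  val totalRecords: Double     = 10e6       // 10M records in total
  val requestedRecords: Double = 10e3       // 10K records requested
  val fullScanSeconds: Double  = 6 * 3600.0 // observed full scan: about 6 hours

  // Aggregate throughput observed during the full scan, in bytes per second.
  val throughputBytesPerSec: Double = totalRecords * recordBytes / fullScanSeconds

  // Time to read only the requested records at that throughput.
  val idealSeconds: Double = requestedRecords * recordBytes / throughputBytesPerSec
}
```

Since only 1 in 1000 records is requested, the ideal time is about 6 hours / 1000, or roughly 22 seconds; "several minutes" leaves room for seek overhead and a few times that figure.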

2. "build a search tree using ids within each partition to act like an index,
or create a bloom filter to see if current partition would have any hit"

Sounds like the thing I'm looking for!

Could you kindly provide some links for reference? I found nothing in the Spark
documentation about an index or bloom filter that works within a partition.
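
For illustration, a hand-built per-partition index could look like the sketch below. It is plain Scala and not a Spark API; `BlobLocation`, `PartitionIndex`, and the offset/length layout are hypothetical names assumed for the example.

```scala
import scala.collection.immutable.TreeMap

// Hypothetical location of one BINARY inside a partition file.
final case class BlobLocation(offset: Long, length: Int)

object PartitionIndex {
  // Build a sorted ID -> location index (a search tree) for one partition.
  def build(entries: Seq[(String, BlobLocation)]): TreeMap[String, BlobLocation] =
    TreeMap(entries: _*)

  // Look up only the wanted IDs; IDs missing from this partition are skipped.
  // Results are ordered by offset so the file can be read front to back.
  def locate(index: TreeMap[String, BlobLocation],
             wanted: Set[String]): Seq[(String, BlobLocation)] =
    wanted.toSeq
      .flatMap(id => index.get(id).map(loc => (id, loc)))
      .sortBy(_._2.offset)
}
```

With such an index stored next to each partition, a reader would seek to the returned offsets instead of scanning every record.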

Thanks very much!

Mo Tao

________________________________
From: Ryan <ryan.hd....@gmail.com>
Sent: April 17, 2017 14:32:00
To: 莫涛
Cc: user
Subject: Re: How to store 10M records in HDFS to speed up further filtering?

You can build a search tree over the IDs within each partition to act as an
index, or create a bloom filter to check whether the current partition would
have any hits.
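
The bloom filter idea could be sketched as follows. This is a minimal illustration in plain Scala, not an existing Spark feature: `IdBloomFilter`, `PartitionPruning`, the partition names, and the filter sizes are all assumptions made for the example.

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

// Minimal Bloom filter over string IDs (sizes are illustrative assumptions).
final class IdBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new BitSet(numBits)

  // Derive numHashes bit positions from two base hashes (double hashing).
  private def positions(id: String): Seq[Int] = {
    val h1 = MurmurHash3.stringHash(id, 0)
    val h2 = MurmurHash3.stringHash(id, h1)
    (0 until numHashes).map { i =>
      val combined = h1.toLong + i.toLong * h2.toLong
      (((combined % numBits) + numBits) % numBits).toInt
    }
  }

  def add(id: String): Unit = positions(id).foreach(i => bits.set(i))

  // false means "definitely absent"; true means "possibly present".
  def mightContain(id: String): Boolean = positions(id).forall(i => bits.get(i))
}

object PartitionPruning {
  // Hypothetical input: partition name -> IDs stored in that partition.
  def buildFilters(partitions: Map[String, Seq[String]]): Map[String, IdBloomFilter] =
    partitions.map { case (name, ids) =>
      val filter = new IdBloomFilter(numBits = 1 << 16, numHashes = 3)
      ids.foreach(id => filter.add(id))
      name -> filter
    }

  // Keep only partitions whose filter reports a possible hit for some wanted ID.
  def candidatePartitions(filters: Map[String, IdBloomFilter],
                          wanted: Set[String]): Set[String] =
    filters.filter { case (_, f) => wanted.exists(f.mightContain) }.keys.toSet
}
```

A Bloom filter never reports a stored ID as absent, so a partition that is pruned this way cannot contain a requested record; false positives only cost an unnecessary read.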

What's your expected qps and response time for the filter request?

On Mon, Apr 17, 2017 at 2:23 PM, MoTao <mo...@sensetime.com> wrote:
Hi all,

I have 10M (ID, BINARY) records, and each BINARY is 5MB on average.
In my daily application, I need to extract 10K BINARY values according to an
ID list.
How should I store the whole data to make the filtering faster?

I'm using DataFrame in Spark 2.0.0, and I've tried a row-based format (avro)
and a column-based format (orc).
However, both of them require scanning almost ALL records, making the
filtering stage very slow.
The code block for filtering looks like:

val IDSet: Set[String] = ...
val checkID = udf { ID: String => IDSet(ID) }
spark.read.orc("/path/to/whole/data")
  .filter(checkID($"ID"))
  .select($"ID", $"BINARY")
  .write...

Thanks for any advice!

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-10M-records-in-HDFS-to-speed-up-further-filtering-tp28605.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

