Re: Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Ryan
> ...I need to sort the data by id, which leads to a shuffle of 50T of data. That's somewhat crazy. I'm on my way to testing HAR, but the discussion brings me lots of insight about ORC. Thanks for your help! ---------------------- *From:* R
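For context, the 50T shuffle comes from a total sort (orderBy); a per-partition sort avoids it. A minimal sketch, assuming an ORC input at a made-up path and an "id" column (neither is from the thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sort-sketch").getOrCreate()
    val df = spark.read.orc("/data/records")        // hypothetical input path

    // orderBy is a total sort: Spark range-partitions and shuffles every row.
    val globallySorted = df.orderBy("id")

    // sortWithinPartitions sorts each partition locally with no shuffle; this is
    // often enough for ORC min/max statistics on id to become selective.
    val locallySorted = df.sortWithinPartitions("id")
    locallySorted.write.orc("/data/records_sorted") // hypothetical output path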

Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
Time: 2017-04-17 16:48:47 To: 莫涛 Cc: user Subject: Re: Re: Re: How to store 10M records in HDFS to speed up further filtering? How about the event timeline on executors? It seems adding more executors could help. 1. I found a JIRA (https://issues.apache.org/jira/browse/SPARK-11621) that states the ppd
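The "ppd" here is ORC predicate pushdown. A minimal sketch of enabling it in Spark; the path and filter value are assumptions for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("orc-ppd")
      .config("spark.sql.orc.filterPushdown", "true") // pushdown is off by default in Spark 2.x
      .getOrCreate()

    // With pushdown enabled, the filter is handed to the ORC reader, which can
    // skip row groups whose min/max statistics rule out a match.
    val hits = spark.read.orc("/data/records.orc").filter("id = 'some-id'")
    hits.show()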

Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
It's Hadoop Archive. https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html From: Alonso Isidoro Roman Sent: 2017-04-20 17:03:33 To: 莫涛 Cc: Jörn Franke; user@spark.apache.org Subject: Re: Re: Re: How to store 10M records in HDFS to speed up further filt
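For reference, an archive is built with the hadoop archive tool and read back through the har:// filesystem. A hedged sketch, with made-up names and paths:

    // Build the archive from a shell (archives /user/mo/records into
    // /user/mo/archives/records.har; all names here are hypothetical):
    //   hadoop archive -archiveName records.har -p /user/mo records /user/mo/archives
    //
    // Spark then reads files inside it through the har:// scheme:
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("har-read").getOrCreate()
    val record = spark.sparkContext
      .binaryFiles("har:///user/mo/archives/records.har/records/part-00001")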

Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Alonso Isidoro Roman
Forgive my ignorance, but what does HAR mean? An acronym for "highly available record"? Thanks Alonso Isidoro Roman about.me/alonso.isidoro.roman 2017-04-2

Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
Hi Jörn, HAR is a great idea! For a POC, I've archived 1M records and stored the id -> path mapping in plain text (for better readability). Filtering 1K records now takes only 2 minutes (30 seconds to get the path list and 0.5 seconds per thread to read a record). Such performance is exactly what I
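A sketch of that two-step lookup as I read it; the file layout, separator, and ids below are assumptions, not the poster's actual code:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("har-lookup").getOrCreate()
    import spark.implicits._

    // Step 1 (~30 s above): resolve the wanted ids to har:// paths
    // via the plain-text mapping, one "id<TAB>path" pair per line.
    val wanted = Set("id-001", "id-002")                       // stand-ins for the 1K ids
    val paths = spark.read.textFile("/user/mo/id_to_path.txt") // hypothetical mapping file
      .map(_.split("\t"))
      .filter(a => wanted.contains(a(0)))
      .map(a => a(1))
      .collect()

    // Step 2 (~0.5 s per record per thread): read only the matched records.
    val records = spark.sparkContext.binaryFiles(paths.mkString(","))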

Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
How about the event timeline on executors? It seems adding more executors could help. 1. I found a JIRA (https://issues.apache.org/jira/browse/SPARK-11621) that states the PPD should work. And I think "only for matched ones the binary data is read" is true if a proper index is configured. The row group w
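On "proper index": row-group min/max statistics only help when equal ids are clustered together, and ORC can additionally carry bloom filters. A hedged sketch; whether the bloom-filter option is honored depends on the Spark/ORC version, and the paths are made up:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("orc-index")
      .config("spark.sql.orc.filterPushdown", "true")
      .getOrCreate()

    val df = spark.read.orc("/data/records")      // hypothetical input
    df.repartition(df("id"))                      // co-locate equal ids
      .sortWithinPartitions("id")                 // sorted row groups -> tight min/max stats
      .write
      .option("orc.bloom.filter.columns", "id")   // request a bloom filter on id, if supported
      .orc("/data/records_indexed")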

Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread 莫涛
Hi Ryan, The attachment is a screenshot of the Spark job, and this is the only stage for this job. I've changed the partition size to 1GB with "--conf spark.sql.files.maxPartitionBytes=1073741824". 1. spark-orc seems not that smart. The input size is almost the whole dataset. I guess "only for
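The same setting, applied in code instead of on the command line (1073741824 bytes = 1GB, matching the flag above; the input path is a stand-in):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partition-size")
      .config("spark.sql.files.maxPartitionBytes", "1073741824") // max bytes per file-scan partition
      .getOrCreate()

    // Each scan task now reads up to 1GB of input, so the stage has fewer,
    // larger partitions.
    val df = spark.read.orc("/data/records")
    println(df.rdd.getNumPartitions)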