I'm using a Spark Streaming program to store log messages into Parquet files 
every 10 minutes.
Now, when I query the Parquet data, a single count usually takes hundreds of 
thousands of stages to compute.
I looked into the Parquet output path and found a huge number of small files.

Are the small files causing the problem? Can I merge them, or is there a 
better way to solve this?
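
For reference, this is roughly the compaction I had in mind: read all the
small files back and rewrite them with fewer partitions. The paths and the
partition count below are just placeholders for my setup, not my actual job:

import org.apache.spark.sql.SparkSession

object CompactParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CompactParquet").getOrCreate()

    // Read all the small Parquet files produced by the streaming job.
    // "hdfs:///logs/parquet" is a placeholder input path.
    val logs = spark.read.parquet("hdfs:///logs/parquet")

    // coalesce(16) reduces the number of output partitions (and thus files)
    // without a full shuffle; 16 is a guess aimed at ~128 MB files.
    // Write to a separate path so the source isn't clobbered while reading.
    logs.coalesce(16)
      .write
      .mode("overwrite")
      .parquet("hdfs:///logs/parquet_compacted")

    spark.stop()
  }
}

Would something like this be the right approach, or is there a standard way
to avoid producing the small files in the first place?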

Many thanks.
