Re: Moving column family into new table

2017-01-19 Thread Mark Heppner
I'll check when I'm on site tomorrow, but our (much smaller) local cluster is using the default hbase.hregion.max.filesize of 10 GB for HDP. hbase.hregion.majorcompaction is set to 7 days, so I'm sure it would have run by now. What would be the best filesize limit? Cloudera suggests having 20-200
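[Editor's note: for reference, the two settings discussed above map to these hbase-site.xml properties. The values shown are the defaults mentioned in the message (10 GB expressed in bytes, 7 days in milliseconds); this is a sketch for orientation, not a tuning recommendation.]

```xml
<!-- hbase-site.xml: region split threshold and major-compaction interval -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value> <!-- 10 GB in bytes; regions larger than this are split -->
</property>
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>604800000</value> <!-- 7 days in milliseconds -->
</property>
```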

Re: Moving column family into new table

2017-01-19 Thread Josh Mahonin
It's a bit peculiar that you've got it pre-split to 10 salt buckets, but seeing 400+ partitions. It sounds like HBase is splitting the regions on you, possibly due to the 'hbase.hregion.max.filesize' setting. You should be able to check the table details in the HBase Master UI to see how many

Re: Moving column family into new table

2017-01-19 Thread Mark Heppner
Jonathan, I do check the queries using EXPLAIN, but it doesn't work the same way in Spark. In Spark, I can only see a very generic plan, and it only tells me whether certain filters are pushed down to Phoenix or not. Query hints are ignored, since they're first translated by the Spark or Hive query

Re: Moving column family into new table

2017-01-19 Thread Jonathan Leech
Do an EXPLAIN on your query to confirm that it's doing a full scan and not a skip scan. I typically use an IN () clause instead of OR, especially with compound keys. I have also had to hint queries to use a skip scan, e.g. /*+ SKIP_SCAN */. Phoenix seems to do a very good job not reading data
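[Editor's note: a minimal sketch of the pattern described above. The table and key column names are hypothetical, not taken from the thread.]

```sql
-- Compound-key lookup written with IN () over row value constructors
-- rather than chained ORs, and hinted to use a skip scan.
-- MY_TABLE, TENANT_ID, and EVENT_ID are illustrative names.
EXPLAIN
SELECT /*+ SKIP_SCAN */ *
FROM MY_TABLE
WHERE (TENANT_ID, EVENT_ID) IN (('t1', 100), ('t1', 200), ('t2', 100));
```

In the resulting plan, a line containing SKIP SCAN (rather than FULL SCAN) indicates the hint took effect.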

Re: Moving column family into new table

2017-01-19 Thread Mark Heppner
Thanks for the quick reply, Josh! For our demo cluster, we have 5 nodes, so the table was already set to 10 salt buckets. I know you can increase the salt buckets after the table is created, but how do you change the split points? The repartition in Spark seemed to be extremely inefficient, so we
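[Editor's note: in Phoenix, explicit split points are supplied at table creation with SPLIT ON, so one common way to "change" them is to create a new pre-split table and copy the data across. A sketch under assumed, illustrative names; SPLIT ON and SALT_BUCKETS are mutually exclusive on a table.]

```sql
-- New table pre-split at explicit row-key boundaries
-- (table, column, and split-point values are illustrative).
CREATE TABLE IMAGES_V2 (
    IMG_ID VARCHAR NOT NULL PRIMARY KEY,
    META.FILENAME VARCHAR
) SPLIT ON ('d', 'h', 'l', 'p', 't');

-- Copy rows from the old table into the new layout.
UPSERT INTO IMAGES_V2 SELECT * FROM IMAGES;
```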

Re: Moving column family into new table

2017-01-19 Thread Josh Mahonin
Hi Mark, At present, the Spark partitions are basically equivalent to the number of regions in the underlying HBase table. This is typically something you can control yourself, either using pre-splitting or salting ( https://phoenix.apache.org/faq.html#Are_there_any_tips_for_optimizing_Phoenix).
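[Editor's note: the salting approach from the linked FAQ looks roughly like this; the table and column names are illustrative.]

```sql
-- SALT_BUCKETS prepends a salt byte to the row key and pre-splits
-- the table into 10 regions, so a Spark read of this table would
-- initially see on the order of 10 partitions.
CREATE TABLE EVENTS (
    EVENT_ID VARCHAR NOT NULL PRIMARY KEY,
    PAYLOAD VARBINARY
) SALT_BUCKETS = 10;
```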

Moving column family into new table

2017-01-19 Thread Mark Heppner
Our use case is to analyze images using Spark. The images are typically ~1MB each, so in order to prevent the small files problem in HDFS, we went with HBase and Phoenix. For 20+ million images and metadata, this has been working pretty well so far. Since this is pretty new to us, we didn't create
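[Editor's note: a hypothetical sketch of the kind of schema this use case implies, with the image bytes and metadata kept in separate column families, matching the thread's subject. All names are assumptions, not taken from the poster's actual table.]

```sql
-- Image bytes in one column family (IMG), metadata in another (META),
-- salted to pre-split across the cluster. Names are illustrative.
CREATE TABLE IMAGES (
    IMG_ID VARCHAR NOT NULL PRIMARY KEY,
    IMG.DATA VARBINARY,          -- raw image bytes, ~1 MB each
    META.FILENAME VARCHAR,
    META.CAPTURED_AT TIMESTAMP
) SALT_BUCKETS = 10;
```

Separating the large binary column into its own family means metadata-only scans avoid reading the image bytes, which is one motivation for moving a column family into its own family or table.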