#general


@lorinlee1996: @lorinlee1996 has joined the channel
@linkedcarbon: @linkedcarbon has joined the channel
@varun.mukundhan: @varun.mukundhan has joined the channel
@varun.mukundhan640: @varun.mukundhan640 has joined the channel
@varun.mukundhan640: Hi folks, I am new to Pinot, so I'd like to apologize in advance for the noob questions that will be incoming: 1. Is there any major performance difference between using PQL and SQL? For example, we have a use case where we need top X aggregations. I can do this through `TOP X` using PQL and `ORDER BY <aggregation> DESC LIMIT X` using SQL. Which one do you recommend? 2. What are the differences between metrics and dimensions? I could see aggregation queries are allowed on non-string dimensions as well
  @richard892: hi Varun, 1. PQL is in the process of being deprecated, so don't start using it. There should be no performance problems with SQL. 2. Dimensions are things you might group by; metrics are things you would aggregate. So in `select type,name,sum(value) from table group by type, name`, type and name would be dimensions and value would be a metric. They are also stored differently: metrics don't have dictionaries, which makes operations like sums over them efficient, while dimensions generally do have dictionaries, which compresses the data and makes group-bys easier.
  @varun.mukundhan640: Thanks so much Richard! So `order by <aggregation> desc LIMIT X` is as efficient as using `TOP X` from PQL?
  @richard892: same query engine -> same performance
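For reference, the two top-X query shapes side by side; `myTable`, `name`, and `value` are placeholder identifiers, not names from the thread:

```sql
-- PQL style (deprecated): SELECT SUM(value) FROM myTable GROUP BY name TOP 10
-- SQL equivalent, executed by the same engine:
SELECT name, SUM(value)
FROM myTable
GROUP BY name
ORDER BY SUM(value) DESC
LIMIT 10
```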
@samkiller: @samkiller has joined the channel
@tonya: @tonya has joined the channel
@mourad.dlia: Hi team, we want to paginate over a table, but the offsets keep changing due to newly arriving events. Is there a way to ignore new events during pagination?
  @walterddr: you have to do so by adding a reasonable time-column filter, e.g. something like `WHERE timeColumn <= NOW() - 10s`
  @walterddr: the -10s is to prevent late-arriving Kafka messages from affecting the results coming back. Also, you could ask questions in the <#C011C9JHN7R|troubleshooting> channel, which will get visibility from more developers.
  @mourad.dlia: Ok, thanks for your response.
  @walterddr: FYI, also note that if you issue multiple queries against Pinot, the result set is not guaranteed to come back in the same order. You will also have to add an ORDER BY clause (I assume you are already doing so, but just in case you weren't)
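Putting Walter's two suggestions together, a stable-pagination query might look like the sketch below; `timeColumn`, `id`, and the page size are assumptions. The cutoff timestamp should be computed once client-side (e.g. now minus 10s) and reused for every page:

```sql
-- Freeze the snapshot at a fixed cutoff so new events don't shift pages,
-- and order deterministically so OFFSET stays stable across queries.
SELECT *
FROM myTable
WHERE timeColumn <= 1700000000000  -- fixed cutoff, reused for all pages
ORDER BY timeColumn, id
LIMIT 100 OFFSET 200               -- page 3 at a page size of 100
```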
@ahsen.m: @ahsen.m has joined the channel
@abhinav.wagle1: Hi, I am a little confused on the `Note` mentioned here : ```NOTE: Please specify StorageClass based on your cloud vendor. For Pinot Server, please don't mount blob store like AzureFile/GoogleCloudStorage/S3 as the data serving file system. Only use Amazon EBS/GCP Persistent Disk/Azure Disk style disks.```
  @abhinav.wagle1: For AWS, the recommendation is to go with `gp2`?
  @abhinav.wagle1: And no S3?
  @abhinav.wagle1: Any particular reason why?
  @bagi.priyank: I think it refers to the storage for completed segments and not the segment store.
  @mayanks: Deepstore should be S3 on AWS. For local attached disk on serving nodes, the recommendation is to use EBS
  @ahsen.m: what about DigitalOcean? I'm on D/O
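For AWS, the note translates to an EBS-backed StorageClass for the servers' local disks, with S3 configured only as the deep store. A minimal Helm values sketch; the field names are assumed from the Pinot Helm chart's `values.yaml`, so verify them against your chart version:

```yaml
server:
  persistence:
    enabled: true
    storageClass: gp2   # EBS-backed; gp3 on newer EKS setups
    size: 100G
# The S3 deep store is configured via controller/server configs,
# not mounted as a data-serving filesystem.
```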
@revathibalakr: @revathibalakr has joined the channel
@nizar.hejazi: Hi team, one of the requirements for supporting stream ingestion w/ upsert is to partition the input stream by the primary key. What if the input stream partition key (e.g. company id) is different from the record primary key (e.g. employee id)?
  @tisantos: @nizar.hejazi can you elaborate more on your "record primary key"? Are you referring to the column used as the sorted index in your Pinot table?
  @nizar.hejazi: No. I am referring to the primary key defined in the table schema.
  @g.kishore: Hi Nizar, you will have to repartition the input stream according to the primary key (employee id in this case)
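Kishore's point is that every record with the same primary key must land on the same stream partition, which the default Kafka partitioner already gives you if you set the message key to the primary key. A minimal sketch of the idea (hypothetical helper, not a Pinot or Kafka API):

```python
import hashlib

def partition_for(primary_key: str, num_partitions: int) -> int:
    """Map a record's primary key (e.g. employee id) to a partition.

    A stable hash of the primary key guarantees that all updates for the
    same key land on the same partition, which is what Pinot's upsert mode
    requires even when the stream was originally keyed by something else
    (e.g. company id).
    """
    digest = hashlib.md5(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every update for employee "emp-42" maps to the same partition.
assert partition_for("emp-42", 8) == partition_for("emp-42", 8)
```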

#random


@lorinlee1996: @lorinlee1996 has joined the channel
@linkedcarbon: @linkedcarbon has joined the channel
@varun.mukundhan: @varun.mukundhan has joined the channel
@varun.mukundhan640: @varun.mukundhan640 has joined the channel
@samkiller: @samkiller has joined the channel
@tonya: @tonya has joined the channel
@ahsen.m: @ahsen.m has joined the channel
@revathibalakr: @revathibalakr has joined the channel

#feat-presto-connector


@samkiller: @samkiller has joined the channel

#feat-upsert


@samkiller: @samkiller has joined the channel

#qps-metric


@samkiller: @samkiller has joined the channel

#feat-better-schema-evolution


@samkiller: @samkiller has joined the channel

#fraud


@samkiller: @samkiller has joined the channel

#inconsistent-segment


@samkiller: @samkiller has joined the channel

#pinot-power-bi


@samkiller: @samkiller has joined the channel

#pinot-website


@samkiller: @samkiller has joined the channel

#minion-star-tree


@samkiller: @samkiller has joined the channel

#troubleshooting


@lorinlee1996: @lorinlee1996 has joined the channel
@weili99: Hi, I am setting up Pinot in AWS EKS. The clusters are successfully set up in EKS. However, when I try to create the schema and load data (Sec 3.4 in this doc) by running this script: `kubectl apply -f pinot/pinot-realtime-quickstart.yml`, I see the jobs are created but not running.
  @weili99: ```kubectl get job/pinot-realtime-quickstart-pinot-table-creation -n pinot-quickstart
NAME                                             COMPLETIONS   DURATION   AGE
pinot-realtime-quickstart-pinot-table-creation   0/1           59m        59m```
  @weili99: any pointers on how to debug? I suspect the yaml file may be broken? `pinot/pinot-realtime-quickstart.yml`
  @walterddr: could you share the log of that container?
  @walterddr: also how did you start the kafka pods?
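To pull the log Walter is asking for, something along these lines should work (the job and namespace names come from the earlier `kubectl` commands):

```
# Logs from the pod the job spawned
kubectl logs -n pinot-quickstart job/pinot-realtime-quickstart-pinot-table-creation
# Events explaining why the job hasn't completed (image pulls, scheduling, etc.)
kubectl describe job -n pinot-quickstart pinot-realtime-quickstart-pinot-table-creation
```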
@linkedcarbon: @linkedcarbon has joined the channel
@varun.mukundhan: @varun.mukundhan has joined the channel
@varun.mukundhan640: @varun.mukundhan640 has joined the channel
@samkiller: @samkiller has joined the channel
@tonya: @tonya has joined the channel
@ahsen.m: @ahsen.m has joined the channel
@ahsen.m: hello, is there any tutorial on connecting Pinot with MongoDB?
  @g.kishore: there isn't one but the right way to do that would be MongoDB -> Kafka -> Pinot
  @ahsen.m: thank you will look into that.
  @mayanks: In case you are looking for CDC, then there’s this article:
  @ahsen.m: so i was thinking to just use and send mongo data to kafka topics and get those in pinot.. do i need Debezium ?
  @g.kishore: No.. but we might need a decoder to parse the mongo db change log event in kafka.. if you can share sample events, we can write a decoder quickly or you can contribute it to Pinot
  @ahsen.m: in kafka connector i’m using ``` key.converter: org.apache.kafka.connect.json.JsonConverter```
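The decoder Kishore mentions would mostly be about unwrapping the JsonConverter envelope before handing fields to Pinot. A sketch under the assumption that the connector emits a `{"schema": ..., "payload": ...}` envelope with the changed document in `payload` — the event shape here is an assumption, not a confirmed connector format:

```python
import json

def decode_mongo_event(raw: bytes) -> dict:
    """Unwrap a Mongo change event delivered via Kafka Connect's JsonConverter.

    Assumes a {"schema": ..., "payload": ...} envelope; some connectors
    serialize the document itself as a JSON string inside "payload".
    """
    event = json.loads(raw)
    payload = event.get("payload", event)
    if isinstance(payload, str):
        # The document itself arrived as a JSON string; parse it too.
        payload = json.loads(payload)
    return payload

# Illustrative event, not a real connector capture:
sample = b'{"schema": {"type": "string"}, "payload": "{\\"_id\\": \\"1\\", \\"name\\": \\"acme\\"}"}'
doc = decode_mongo_event(sample)
assert doc["name"] == "acme"
```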
@revathibalakr: @revathibalakr has joined the channel

#pinot-s3


@samkiller: @samkiller has joined the channel

#pinot-k8s-operator


@samkiller: @samkiller has joined the channel

#onboarding


@samkiller: @samkiller has joined the channel

#feat-geo-spatial-index


@samkiller: @samkiller has joined the channel

#custom-aggregators


@samkiller: @samkiller has joined the channel

#inconsistent-perf


@samkiller: @samkiller has joined the channel

#aggregators


@samkiller: @samkiller has joined the channel

#query-latency


@samkiller: @samkiller has joined the channel

#dhill-date-seg


@samkiller: @samkiller has joined the channel

#enable-generic-offsets


@samkiller: @samkiller has joined the channel

#pinot-dev


@samkiller: @samkiller has joined the channel

#community


@samkiller: @samkiller has joined the channel

#announcements


@samkiller: @samkiller has joined the channel

#s3-multiple-buckets


@samkiller: @samkiller has joined the channel

#multiple_streams


@samkiller: @samkiller has joined the channel

#presto-pinot-connector


@samkiller: @samkiller has joined the channel

#latency-during-segment-commit


@samkiller: @samkiller has joined the channel

#pinot-realtime-table-rebalance


@samkiller: @samkiller has joined the channel

#new-office-space


@samkiller: @samkiller has joined the channel

#config-tuner


@samkiller: @samkiller has joined the channel

#getting-started


@octchristmas: Hi team. I'm looking at this article to check the pre-aggregation feature. How do I verify that metrics are pre-aggregated?
  @jadami: I've done it two ways: 1. Look for it in your logs — in general, search for `Metrics aggregation`, because there are several other logs that will tell you why it wasn't able to be turned on that are not immediately apparent (like not specifying noDictionaryColumns). 2. Run a query like `SELECT $segmentName, all_time_columns, all_dimension_columns, count(*) from your_table group by $segmentName, all_time_columns, all_dimension_columns having count(*) > 1`. That should return nothing, since everything in the segment is aggregated.
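jadami's verification query, reformatted with placeholder column names (substitute your table's actual time and dimension columns):

```sql
-- If aggregation during ingestion worked, each (segment, time, dimensions)
-- combination collapses to one row, so this should return no rows.
SELECT $segmentName, timeCol, dim1, dim2, COUNT(*)
FROM your_table
GROUP BY $segmentName, timeCol, dim1, dim2
HAVING COUNT(*) > 1
```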
@linkedcarbon: @linkedcarbon has joined the channel
@varun.mukundhan: @varun.mukundhan has joined the channel
@varun.mukundhan640: @varun.mukundhan640 has joined the channel
@samkiller: @samkiller has joined the channel
@tonya: @tonya has joined the channel
@ahsen.m: @ahsen.m has joined the channel
@revathibalakr: @revathibalakr has joined the channel

#feat-partial-upsert


@samkiller: @samkiller has joined the channel

#metrics-plugin-impl


@samkiller: @samkiller has joined the channel

#debug_upsert


@samkiller: @samkiller has joined the channel

#flink-pinot-connector


@samkiller: @samkiller has joined the channel

#minion-improvements


@samkiller: @samkiller has joined the channel

#fix-numerical-predicate


@samkiller: @samkiller has joined the channel

#complex-type-support


@samkiller: @samkiller has joined the channel

#product-launch


@samkiller: @samkiller has joined the channel

#pinot-trino


@samkiller: @samkiller has joined the channel

#kinesis_help


@samkiller: @samkiller has joined the channel

#udf-type-matching


@samkiller: @samkiller has joined the channel