It sounds like your pipeline is issuing a query rather than reading a whole
table.

Are you using Java or Python? I'm only familiar with the Java SDK, so my
answer may be Java-biased.

I would recommend materializing the query results to a table and then
configuring your pipeline to read that table rather than reading from a
query. In that case, no query job is involved, so you incur no query cost.
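
Roughly, that's the difference between these two reads (an untested sketch
with the Java SDK; project, dataset, and table names are placeholders):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadMaterializedTable {
      public static void main(String[] args) {
        Pipeline pipeline =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Before (a query job is created and billed on every run):
        //   BigQueryIO.readTableRows()
        //       .fromQuery("SELECT ... FROM `my_project.my_dataset.source`")
        //       .usingStandardSql()

        // After: read the materialized table directly; no query job runs.
        PCollection<TableRow> rows =
            pipeline.apply(
                BigQueryIO.readTableRows()
                    .from("my_project:my_dataset.materialized_results"));

        // ... your existing transforms on `rows` ...
        pipeline.run().waitUntilFinish();
      }
    }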

By default, reading from a table does an export to Avro files. There is no
GCP cost associated with that export, but there is a quota involved, which
you may run into if you run your pipeline repeatedly. So an even better
loop would be to do the export to GCS out of band and then reference those
Avro files directly. That would require more extensive code changes in your
pipeline, though, whereas switching from reading a query to reading a table
is a one-line change.
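
If you go that route, the shape would be roughly as follows (untested;
bucket, table, and schema are placeholders, and the bq extract command in
the comment is just one way to do the out-of-band export):

    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadCachedAvro {
      public static void main(String[] args) {
        Pipeline pipeline =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Exported once, out of band, e.g. with the bq CLI:
        //   bq extract --destination_format=AVRO \
        //       my_project:my_dataset.materialized_results \
        //       gs://my-bucket/dev-cache/results-*.avro

        // JSON Avro schema matching the exported files (placeholder).
        String schemaJson =
            "{\"type\":\"record\",\"name\":\"Row\","
                + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}";

        // Read the cached files; no BigQuery job of any kind is involved.
        PCollection<GenericRecord> records =
            pipeline.apply(
                AvroIO.readGenericRecords(schemaJson)
                    .from("gs://my-bucket/dev-cache/results-*.avro"));

        // ... your existing transforms on `records` ...
        pipeline.run().waitUntilFinish();
      }
    }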

You can also avoid the export to Avro files by configuring BigQueryIO to
use direct reads from your temporary table rather than file exports. There
is a cost associated with direct reads, but it should generally be much
smaller than the cost of repeatedly running a query.
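
Concretely, that's one extra method call on the read (again an untested
sketch; the table name is a placeholder):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class DirectReadFromTable {
      public static void main(String[] args) {
        Pipeline pipeline =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // DIRECT_READ uses the BigQuery Storage API instead of a file
        // export, so no export quota is consumed.
        PCollection<TableRow> rows =
            pipeline.apply(
                BigQueryIO.readTableRows()
                    .from("my_project:my_dataset.materialized_results")
                    .withMethod(TypedRead.Method.DIRECT_READ));

        // ... your existing transforms on `rows` ...
        pipeline.run().waitUntilFinish();
      }
    }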

On Thu, Jul 2, 2020 at 9:28 AM Matt Terwilliger <
[email protected]> wrote:

> Hello,
>
> I'm writing a Beam pipeline that does some relatively expensive reads from
> BigQuery. I want to be able to run the pipeline in a development loop
> without racking up a huge bill.
>
> I know BigQuery has support for query caching, but from the docs, that
> only works if you don't specify a destination table.
>
> For the purposes of development, I don't mind trading off stale data (i.e.
> reusing an existing destination table if it exists) to save money.
>
> Is there any way to do this now, or any relevant open issues? I did a
> quick pass through JIRA but couldn't find anything.
>
> Thanks,
> Matt
>
