Re: Generating data to beam.io.Write(beam.io.BigQuerySink(

2018-08-13 Thread Damien Hawes
Hi Eila,

To my knowledge, the BigQuerySink makes use of BigQuery's streaming insert
functionality. This means that if your data is successfully written to
BigQuery, it will not be immediately previewable (as you already know), but
it should be immediately queryable. If you look at the table details, you
should see records in the streaming buffer.
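
For example, a quick count should already see rows that are still in the
streaming buffer. A minimal sketch, assuming the google-cloud-bigquery client
library and the table names from the pipeline quoted below:

from google.cloud import bigquery

client = bigquery.Client()
# Rows in the streaming buffer are queryable before they appear in preview.
query = 'SELECT COUNT(*) AS n FROM `dataset_cell_lines.cell_lines_table`'
for row in client.query(query).result():
    print('rows visible to queries:', row.n)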

Kind Regards,

Damien
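
As for the format question in the quoted thread below: process() should
produce an iterable, with one dict per output row whose keys match the
schema's field names. A minimal sketch of what getValuesStrFn might look like
(the column handling here is an assumption based on the pipeline in the
quoted mail):

import apache_beam as beam

class getValuesStrFn(beam.DoFn):
    def __init__(self, col_list):
        self.col_list = col_list

    def process(self, element):
        # Emit one dict per BigQuery row, keyed by column name. Yielding
        # (or returning a list of dicts) keeps the output an iterable of rows.
        yield {col: str(val) for col, val in zip(self.col_list, element)}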

On Mon, 13 Aug 2018, 20:00 OrielResearch Eila Arich-Landkof, <
e...@orielresearch.org> wrote:

> [the previous email was sent too early by mistake]
> update:
>
> I tried the following options:
>
> 1. Return a dict from the DoFn; an error was fired:
> newRowDictlist = newRowDict  #[newRowDict]
> return(newRowDictlist)
> and the following warning:
>
> *Returning a dict from a ParDo or FlatMap is discouraged. Please use list("...*
>
>
> 2. Return a list with the dict in it:
> newRowDictlist = [newRowDict]
> return(newRowDictlist)
>
> No error was generated. I see the table, but the data hasn't been populated
> yet. A normal BQ delay, as far as I know.
>
> Since I cannot see the full BQ result, could you please let me know if
> I am writing the data in the right format for BQ? (I had no issues writing
> it to other types of outputs.)
>
> Thanks for any help,
> Eila
>
> On Mon, Aug 13, 2018 at 1:55 PM, OrielResearch Eila Arich-Landkof <
> e...@orielresearch.org> wrote:
>
>> update:
>>
>> I tried the following options:
>>
>> 1. Return a dict from the DoFn; an error was fired:
>> newRowDictlist = newRowDict  #[newRowDict]
>> return(newRowDictlist)
>>
>> 2. Return a list with the dict in it:
>> newRowDictlist = [newRowDict]
>> return(newRowDictlist)
>>
>>
>>
>> On Mon, Aug 13, 2018 at 12:51 PM, OrielResearch Eila Arich-Landkof <
>> e...@orielresearch.org> wrote:
>>
>>> Hello,
>>>
>>> I am generating data to be written to a new BQ table with a specific
>>> schema. The data is generated in a DoFn.
>>>
>>> My question is: what is the recommended format for the data I should
>>> return from the DoFn (getValuesStrFn below)? Is it a dictionary? A list?
>>> Something else?
>>> I tried list and str, and each fired an error.
>>>
>>>
>>> The pipeline is:
>>> p = beam.Pipeline(options=options)
>>> (p | 'Read From Data Frame' >> beam.Create(cellLinesTable.values.tolist())
>>>    | 'call Get Value Str' >> beam.ParDo(getValuesStrFn(colList))
>>>    | 'write to BQ' >> beam.io.Write(beam.io.BigQuerySink(
>>>          dataset='dataset_cell_lines', table='cell_lines_table',
>>>          schema=schema_bq)))
>>> Thanks,
>>> --
>>> Eila
>>> www.orielresearch.org
>>> https://www.meetup.com/Deep-Learning-In-Production/
>>>
>>>
>>>
>>
>>
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>>
>>
>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
>
>
>


Re: Running beam script using SparkRunner

2018-08-13 Thread Mahesh Vangala
Please ignore my question :)
This URL did it! https://beam.apache.org/documentation/runners/spark/
Thank you.

*--*
*Mahesh Vangala*
*(Ph) 443-326-1957*
*(web) mvangala.com *


On Mon, Aug 13, 2018 at 2:31 PM Mahesh Vangala 
wrote:

> Hello all -
>
> I have a barebones word_count script that runs locally, but I don't have a
> clue how to run it using SparkRunner.
>
> For example,
> Local:
> mvn exec:java -Dexec.mainClass="com.apache.beam.learning.WordCount" \
>   -Dexec.args="--runner=DirectRunner"
> Spark:
> mvn exec:java -Dexec.mainClass="com.apache.beam.learning.WordCount" \
>   -Dexec.args="--runner=SparkRunner"
>
> My code takes the runner from args:
>
> PipelineOptions opts = PipelineOptionsFactory.fromArgs(args).create();
>
> I have a local Spark cluster, but what additional parameters need to be
> given to make the Beam code run on Spark?
> (Sorry, the documentation for this use case seems a bit sparse, or perhaps
> I overlooked it?)
>
> Thank you for your help.
>
> *--*
> *Mahesh Vangala*
> *(Ph) 443-326-1957*
> *(web) mvangala.com *
>


Re: Running beam script using SparkRunner

2018-08-13 Thread Lukasz Cwik
Have you looked at https://beam.apache.org/get-started/quickstart-java/?

It suggests using:
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
 -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" \
 -Pspark-runner

This will launch a job that runs against a local Spark cluster.

If you want to use an existing Spark cluster, I believe you'll need to
build an uber jar of your application and use spark-submit.
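
The submit command is shaped roughly like this (the jar name, main class, and
master URL are placeholders for whatever your build produces):

spark-submit --class com.apache.beam.learning.WordCount \
  --master spark://your-master:7077 \
  target/word-count-bundled.jar --runner=SparkRunner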

On Mon, Aug 13, 2018 at 11:32 AM Mahesh Vangala 
wrote:

> Hello all -
>
> I have a barebones word_count script that runs locally, but I don't have a
> clue how to run it using SparkRunner.
>
> For example,
> Local:
> mvn exec:java -Dexec.mainClass="com.apache.beam.learning.WordCount" \
>   -Dexec.args="--runner=DirectRunner"
> Spark:
> mvn exec:java -Dexec.mainClass="com.apache.beam.learning.WordCount" \
>   -Dexec.args="--runner=SparkRunner"
>
> My code takes the runner from args:
>
> PipelineOptions opts = PipelineOptionsFactory.fromArgs(args).create();
>
> I have a local Spark cluster, but what additional parameters need to be
> given to make the Beam code run on Spark?
> (Sorry, the documentation for this use case seems a bit sparse, or perhaps
> I overlooked it?)
>
> Thank you for your help.
>
> *--*
> *Mahesh Vangala*
> *(Ph) 443-326-1957*
> *(web) mvangala.com *
>


Running beam script using SparkRunner

2018-08-13 Thread Mahesh Vangala
Hello all -

I have a barebones word_count script that runs locally, but I don't have a
clue how to run it using SparkRunner.

For example,
Local:
mvn exec:java -Dexec.mainClass="com.apache.beam.learning.WordCount" \
  -Dexec.args="--runner=DirectRunner"
Spark:
mvn exec:java -Dexec.mainClass="com.apache.beam.learning.WordCount" \
  -Dexec.args="--runner=SparkRunner"

My code takes the runner from args:

PipelineOptions opts = PipelineOptionsFactory.fromArgs(args).create();

I have a local Spark cluster, but what additional parameters need to be given
to make the Beam code run on Spark?
(Sorry, the documentation for this use case seems a bit sparse, or perhaps I
overlooked it?)

Thank you for your help.

*--*
*Mahesh Vangala*
*(Ph) 443-326-1957*
*(web) mvangala.com *


Re: Generating data to beam.io.Write(beam.io.BigQuerySink(

2018-08-13 Thread OrielResearch Eila Arich-Landkof
[the previous email was sent too early by mistake]
update:

I tried the following options:

1. Return a dict from the DoFn; an error was fired:
newRowDictlist = newRowDict  #[newRowDict]
return(newRowDictlist)
and the following warning:

*Returning a dict from a ParDo or FlatMap is discouraged. Please use list("...*
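
The warning reflects plain Python semantics: process() must return an
iterable of elements, and a dict is iterable over its keys, so returning it
bare would emit each key as a separate element. A tiny illustration (the
values are made up):

row = {'cell_line': 'K562', 'tissue': 'blood'}
list(row)  # -> ['cell_line', 'tissue']  (what Beam would iterate over)
[row]      # -> one element: the whole row dict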


2. Return a list with the dict in it:
newRowDictlist = [newRowDict]
return(newRowDictlist)

No error was generated. I see the table, but the data hasn't been populated
yet. A normal BQ delay, as far as I know.

Since I cannot see the full BQ result, could you please let me know if I
am writing the data in the right format for BQ? (I had no issues writing it
to other types of outputs.)

Thanks for any help,
Eila

On Mon, Aug 13, 2018 at 1:55 PM, OrielResearch Eila Arich-Landkof <
e...@orielresearch.org> wrote:

> update:
>
> I tried the following options:
>
> 1. Return a dict from the DoFn; an error was fired:
> newRowDictlist = newRowDict  #[newRowDict]
> return(newRowDictlist)
>
> 2. Return a list with the dict in it:
> newRowDictlist = [newRowDict]
> return(newRowDictlist)
>
>
>
> On Mon, Aug 13, 2018 at 12:51 PM, OrielResearch Eila Arich-Landkof <
> e...@orielresearch.org> wrote:
>
>> Hello,
>>
>> I am generating data to be written to a new BQ table with a specific
>> schema. The data is generated in a DoFn.
>>
>> My question is: what is the recommended format for the data I should
>> return from the DoFn (getValuesStrFn below)? Is it a dictionary? A list?
>> Something else?
>> I tried list and str, and each fired an error.
>>
>>
>> The pipeline is:
>> p = beam.Pipeline(options=options)
>> (p | 'Read From Data Frame' >> beam.Create(cellLinesTable.values.tolist())
>>    | 'call Get Value Str' >> beam.ParDo(getValuesStrFn(colList))
>>    | 'write to BQ' >> beam.io.Write(beam.io.BigQuerySink(
>>          dataset='dataset_cell_lines', table='cell_lines_table',
>>          schema=schema_bq)))
>> Thanks,
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>>
>>
>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
>
>
>


-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/



Re: Generating data to beam.io.Write(beam.io.BigQuerySink(

2018-08-13 Thread OrielResearch Eila Arich-Landkof
update:

I tried the following options:

1. Return a dict from the DoFn; an error was fired:
newRowDictlist = newRowDict  #[newRowDict]
return(newRowDictlist)

2. Return a list with the dict in it:
newRowDictlist = [newRowDict]
return(newRowDictlist)



On Mon, Aug 13, 2018 at 12:51 PM, OrielResearch Eila Arich-Landkof <
e...@orielresearch.org> wrote:

> Hello,
>
> I am generating data to be written to a new BQ table with a specific
> schema. The data is generated in a DoFn.
>
> My question is: what is the recommended format for the data I should
> return from the DoFn (getValuesStrFn below)? Is it a dictionary? A list?
> Something else?
> I tried list and str, and each fired an error.
>
>
> The pipeline is:
> p = beam.Pipeline(options=options)
> (p | 'Read From Data Frame' >> beam.Create(cellLinesTable.values.tolist())
>    | 'call Get Value Str' >> beam.ParDo(getValuesStrFn(colList))
>    | 'write to BQ' >> beam.io.Write(beam.io.BigQuerySink(
>          dataset='dataset_cell_lines', table='cell_lines_table',
>          schema=schema_bq)))
> Thanks,
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
>
>
>


-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/



[Feedback Request] Long term support releases

2018-08-13 Thread Ahmet Altay
Hi all,

In order to increase the predictability of Beam releases, I proposed
introducing long-term support releases with defined 12-month support
periods [1]. I would like to open this discussion to our user@ community and
receive your feedback. We would appreciate any input.

Here is a draft PR [2] with the actual proposal.

Thank you,
Ahmet

[1]
https://lists.apache.org/thread.html/b2e2e79e0e146ed5264a12831d9e40cabfcb75651a994484c58a6b01@%3Cdev.beam.apache.org%3E
[2] https://github.com/apache/beam-site/pull/537


Generating data to beam.io.Write(beam.io.BigQuerySink(

2018-08-13 Thread OrielResearch Eila Arich-Landkof
Hello,

I am generating data to be written to a new BQ table with a specific
schema. The data is generated in a DoFn.

My question is: what is the recommended format for the data I should return
from the DoFn (getValuesStrFn below)? Is it a dictionary? A list? Something
else?
I tried list and str, and each fired an error.


The pipeline is:
p = beam.Pipeline(options=options)
(p | 'Read From Data Frame' >> beam.Create(cellLinesTable.values.tolist())
   | 'call Get Value Str' >> beam.ParDo(getValuesStrFn(colList))
   | 'write to BQ' >> beam.io.Write(beam.io.BigQuerySink(
         dataset='dataset_cell_lines', table='cell_lines_table',
         schema=schema_bq)))
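
For reference, the schema argument in the Python SDK can also be given as a
comma-separated string of name:TYPE pairs matching the keys of the dicts the
DoFn emits; the field names here are purely illustrative:

schema_bq = 'cell_line:STRING,tissue:STRING,expression:FLOAT'
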
Thanks,
-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/



Re: Pass file metadata to TextIO

2018-08-13 Thread Lukasz Cwik
Not with the current capabilities of TextIO, since it only provides the
contents of the file and none of the metadata.

FileIO allows you to use a ReadableByteChannel, so you can read the file
incrementally in a loop; as you have already figured out, you'll need to
parse and emit on every newline yourself. Consider combining the
ReadableByteChannel with a BufferedReader and leveraging the readLine [1]
method in a loop.

1:
https://docs.oracle.com/javase/8/docs/api/java/io/BufferedReader.html#readLine--
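
For anyone on the Python SDK, a rough analogue of the same per-line loop, with
FileSystems.match/open standing in for FileIO and BufferedReader (the
timestamp regex and all names here are illustrative, not an established API):

import re
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class ReadLinesWithDate(beam.DoFn):
    def process(self, file_metadata):
        # Pull the date out of paths like gs://deathstar/2017-10-05/plans.csv.
        date = re.search(r'\d{4}-\d{2}-\d{2}', file_metadata.path).group(0)
        with FileSystems.open(file_metadata.path) as f:
            for line in iter(f.readline, b''):
                yield (date, line.rstrip(b'\r\n'))

# Expand the glob to per-file metadata, then read each file in the DoFn:
# (p | beam.Create(FileSystems.match(['gs://deathstar/*/plans.csv'])[0].metadata_list)
#    | beam.ParDo(ReadLinesWithDate()))

Note that either way each file is consumed by a single worker, so you give up
TextIO's ability to split within a file.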

On Mon, Aug 13, 2018 at 7:32 AM Akash Patel 
wrote:

> Hi,
>
> I have a large number of CSV files stored in a GCS bucket which are
> timestamped according to their file pattern, e.g.
> “gs://deathstar/2017-10-05/plans.csv”
> “gs://deathstar/2017-11-01/plans.csv”
>
> So basically I want to utilise the functionality of TextIO.read(), where
> each line in every file is read into a PCollection, but I also need to
> extract the timestamp from the file pattern and link it to each line (KV or
> something similar). However, it doesn’t seem possible to extract this
> metadata unless I use FileIO, and the problem there is that the entire
> file is read rather than split into individual lines. Is it possible to
> read each line in the specified globbed file pattern and have some parsed
> file metadata (i.e. the timestamp from the file pattern) linked to its
> respective element?
>
> Kind Regards,
> Akash
>
> --
>
> This message and any attachment(s) hereto are confidential and may be
> privileged or otherwise protected from disclosure. If you are not the
> intended recipient you are hereby notified that you have received this
> message in error and that you must not - in whole or in part - review,
> copy, distribute, retain copies or disclose the contents of this message or
> any attachments hereto. If you are not the intended recipient, please
> notify the sender immediately by return e-mail and delete this message and
> any attachment from your system.
>


Pass file metadata to TextIO

2018-08-13 Thread Akash Patel
Hi, 

I have a large number of CSV files stored in a GCS bucket which are
timestamped according to their file pattern, e.g.
“gs://deathstar/2017-10-05/plans.csv”
“gs://deathstar/2017-11-01/plans.csv”

So basically I want to utilise the functionality of TextIO.read(), where each
line in every file is read into a PCollection, but I also need to extract the
timestamp from the file pattern and link it to each line (KV or something
similar). However, it doesn’t seem possible to extract this metadata unless I
use FileIO, and the problem there is that the entire file is read rather than
split into individual lines. Is it possible to read each line in the specified
globbed file pattern and have some parsed file metadata (i.e. the timestamp
from the file pattern) linked to its respective element?

Kind Regards, 
Akash 
-- 
This message and any attachment(s) hereto are confidential and 
may be privileged or otherwise protected from disclosure. If you are not 
the intended recipient you are hereby notified that you have received this 
message in error and that you must not - in whole or in part - review, 
copy, distribute, retain copies or disclose the contents of this message or 
any attachments hereto. If you are not the intended recipient, please 
notify the sender immediately by return e-mail and delete this message and 
any attachment from your system.