Streaming output in just one file

2017-08-08 Thread Claire Yuan
Hi all,  I am currently running some jobs written in Beam in streaming mode on 
a Flink YARN session. My data sink is CSV files, like the one in the TfIdf 
example. I noticed that Beam's output produces one file for every record, plus 
temp files for them. As a result, my disk usage exceeds the maximum.   I am not 
sure whether the problem is that I used the API incorrectly, but I am wondering 
if there is any way I can put all those records into one file, keep updating 
that file, or delete those temp files by windowing or triggering?
Claire

Slack invite request

2017-08-08 Thread Steve Anderson
Hi there, can I please get an invite to the Beam Slack channel?

Thanks!
- Steve

-- 

Steven Anderson
Software Developer
Mobile: 650.455.6530
Email: st...@maestro.io
Website: http://www.maestro.io


Re: Slack invite request

2017-08-08 Thread Jason Kuster
Done!



-- 
---
Jason Kuster
Apache Beam / Google Cloud Dataflow


Re: Two example pipelines built by Yahoo intern

2017-08-08 Thread Jesse Anderson
Claire,

Interesting work.

In section 5, you talk about the Java language being difficult. Was there a
reason you didn't use Java lambdas for your work?

Thanks,

Jesse

On Tue, Aug 8, 2017 at 3:40 PM Claire Yuan wrote:

> Hi folks,
>   We are a two-member team interning at Yahoo! Inc., currently
> evaluating the performance and functionality of the Beam API. We built two
> pipelines with the Beam API, using the default examples as a reference. One
> is sentiment analysis and the other is flight performance analysis. Attached
> are the code for the two pipelines and a README with instructions on how to
> run them in our framework. We would like to share them with you. There is
> also a paper we wrote about our evaluation results and our experience using
> Beam over the last two months of our internship. It would be a great help if
> you could have a look at it and share any comments with us.
> Thanks!


Re: Two example pipelines built by Yahoo intern

2017-08-08 Thread Eugene Kirpichov
Hi Claire,

Thank you - happy to see a paper with such a detailed description of your
experience with both usability of Beam per se and the execution on the
Flink runner!
The paper looks well-written, and, from a quick look at the code, it seems
to be using the Beam API properly without obvious opportunities for large
improvement. Great work!

A couple of suggestions:
- I think it would be useful to state explicitly in the paper's abstract /
introduction that you are testing the Flink and Apex runners, which other
runners are currently available, and why you're testing Flink and Apex
specifically. This would be useful to people reading the
paper without much background in Beam, who might not realize that Beam has
many different runners with potentially very different performance or level
of support for features.
- As a member of the Dataflow team, I'm curious :) Have you considered also
benchmarking these pipelines on the Dataflow runner? (especially streaming)
- For the issues you found that are clearly not "intended behavior" (e.g.
unacceptably low performance in streaming mode; pipelines not working at
all with Apex runner, etc.), would it be possible to add JIRA IDs to the
paper, so that people who read the paper later can look at the JIRA and see
if it was already resolved?

Thanks.


Re: Two example pipelines built by Yahoo intern

2017-08-08 Thread Eugene Kirpichov
+Aljoscha Krettek for comments on the Flink runner
+Thomas Weise likewise for the Apex runner
