Stream and Batch Use Case

2017-02-15 Thread ankit beohar
Hi all, I have a use case with both Kafka and flat files as inputs. Can I write the code once and run it for both, or do I have to create two different pipelines, or join the two inputs in one pipeline? Which one is better? Best regards, ANKIT BEOHAR

Re: Stream and Batch Use Case

2017-02-15 Thread ankit beohar
Amit, thanks for your fast response, now I get it: my use case can be solved using composite transforms (which are in progress, I guess). But if I rework my logic the way you mentioned, just using different I/O and running on top of Spark, then I guess Beam will handle batch and streaming performance

Re: Metrics for Beam IOs.

2017-02-15 Thread Stas Levin
+1 to making the IO metrics (e.g. producers, consumers) available as part of the Beam pipeline metrics tree for debugging and visibility. As has already been mentioned, many IO clients have a metrics mechanism in place, so in these cases I think it could be beneficial to mirror their metrics

Re: Stream and Batch Use Case

2017-02-15 Thread Amit Sela
You can write one pipeline and simply replace the IO. For example, to read from (text) files you can use: *PCollection<String> lines = p.apply(TextIO.Read.from("file://some/inputData.txt"));* and from Kafka (I'm adding a generic key here because Kafka messages are keyed): *PCollection>
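The idea of keeping one set of processing logic and swapping only the source can be sketched without Beam at all. The plain-Java example below is an analogy under stated assumptions, not Beam API: `process`, `fileSource`, and `kafkaSource` are hypothetical names, and the in-memory data stands in for a text file and a keyed Kafka topic.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SwapSourceSketch {
    // Shared logic: it does not know or care where the lines came from.
    static List<String> process(Stream<String> lines) {
        return lines.map(String::trim)
                .filter(line -> !line.isEmpty())
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    // Hypothetical stand-in for a bounded text-file source.
    static Stream<String> fileSource() {
        return Stream.of("alpha", " beta ", "");
    }

    // Hypothetical stand-in for a keyed Kafka source; only the values are kept.
    static Stream<String> kafkaSource() {
        Map<String, String> records = new LinkedHashMap<>();
        records.put("k1", "gamma");
        records.put("k2", "delta");
        return records.values().stream();
    }

    public static void main(String[] args) {
        // Same processing code, two different inputs.
        System.out.println(process(fileSource()));
        System.out.println(process(kafkaSource()));
    }
}
```

Only the method that produces the input changes between the two calls; everything downstream is shared.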

Re: Stream and Batch Use Case

2017-02-15 Thread Amit Sela
Oh, I missed your question on which one is better: it really depends on your use case. If the data is homogeneous, and you want to write to the same IO, I don't see a reason not to Flatten them into one PCollection. If you want to write files-to-files and Kafka-to-Kafka you might be better off
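The Flatten suggestion for homogeneous data can be illustrated with a plain-Java analogy. This is a sketch under stated assumptions and does not use Beam's Flatten transform: `flatten` and `writeAll` are hypothetical names, and the lists stand in for records that were already read from files and from Kafka.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FlattenSketch {
    // Merge two inputs of the same element type into one collection.
    static List<String> flatten(List<String> fromFiles, List<String> fromKafka) {
        return Stream.concat(fromFiles.stream(), fromKafka.stream())
                .collect(Collectors.toList());
    }

    // Hypothetical single sink shared by both inputs: one record per line.
    static String writeAll(List<String> merged) {
        return String.join("\n", merged);
    }

    public static void main(String[] args) {
        List<String> merged = flatten(Arrays.asList("a", "b"), Arrays.asList("c"));
        System.out.println(writeAll(merged));
    }
}
```

Merging only makes sense when both inputs share an element type and a destination; otherwise two separate flows stay simpler.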

Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Dipti Kulkarni
Hello there! I am working on writing a Read IO for Hadoop InputFormat. This will enable reading from any datasource which supports Hadoop InputFormat, i.e. provides source to read from InputFormat for integration with Hadoop. It makes sense for the HadoopInputFormatIO to share some code with the

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Raghu Angadi
I skimmed through HdfsIO and I think it is essentially HadoopInputFormatIO with FileInputFormat. I would pretty much move most of the code to HadoopInputFormatIO (just make HdfsIO a specific instance of HIF_IO). On Wed, Feb 15, 2017 at 9:15 AM, Dipti Kulkarni < dipti_dkulka...@persistent.com>

Re: Better developer instructions for using Maven?

2017-02-15 Thread Jean-Baptiste Onofré
On Jenkins it's possible to run several jobs at the same time, but on different executors. That's the easiest way. Regards JB On Feb 15, 2017, at 10:15, "Ismaël Mejía" wrote: > This question got lost in the discussion, but there is a small improvement that we can do: >

Re: Better developer instructions for using Maven?

2017-02-15 Thread Ismaël Mejía
This question got lost in the discussion, but there is a small improvement that we can do: > Just to check, are we doing parallel builds? We are on Jenkins, but not on Travis; there is an ongoing PR to fix this. What we can improve is to check whether we can run some of the test suites in parallel to