Large number of sources in Flink Job

2018-05-27 Thread Chirag Dewan
Hi,
I am working on a use case where my Flink job needs to collect data from 
thousands of sources. 
As an example, I want to collect data from more than 2000 file directories,
process (filter, transform) the data, and distribute the processed data streams
to 200 different directories.
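Roughly, the job I have in mind looks like this (a simplified sketch only; the
directory list, filter, and transform below are placeholders, and readTextFile
is just a stand-in for whatever source I end up using):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// One source per input directory (~2000 of them), unioned into a single stream.
DataStream<String> merged = null;
for (String dir : inputDirectories) {
    DataStream<String> s = env.readTextFile(dir);
    merged = (merged == null) ? s : merged.union(s);
}

DataStream<String> processed = merged
    .filter(line -> !line.isEmpty())   // placeholder filter
    .map(line -> line.trim());         // placeholder transform

// ... then fan the processed stream out to the ~200 output directories.
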
Are there any caveats I should know about with such a large number of sources,
also taking into account per-operator parallelism? 
Regards,
Chirag  


Clarification in TumblingProcessingTimeWindows Documentation

2018-05-27 Thread Dhruv Kumar
Hi

I was looking at TumblingProcessingTimeWindows.java and was a bit confused with
the documentation at the start of this class. It says the following:

/**
 * A {@link WindowAssigner} that windows elements into windows based on the current
 * system time of the machine the operation is running on. Windows cannot overlap.
 *
 * For example, in order to window into windows of 1 minute, every 10 seconds:
 *  {@code
 * DataStream<Tuple2<String, Integer>> in = ...;
 * KeyedStream<Tuple2<String, Integer>, String> keyed = in.keyBy(...);
 * WindowedStream<Tuple2<String, Integer>, String, TimeWindows> windowed =
 *   keyed.window(TumblingProcessingTimeWindows.of(Time.of(1, MINUTES), Time.of(10, SECONDS));
 * }
 */


It says one can have tumbling windows of 1 minute, every 10 seconds. Doesn’t
this become a sliding window then? SlidingProcessingTimeWindows.java has the
exact same documentation with just one tiny change (“Windows can possibly
overlap”). It seems to me that in the above documentation, the second Time
argument of 10 seconds is for providing the window offset (as confirmed here)
and not for starting the tumbling window every 10 seconds.
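
To illustrate the distinction I mean, here is my own sketch (not from the Flink
docs; it assumes the usual assigner APIs and the Tuple2 stream from the javadoc
example):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

KeyedStream<Tuple2<String, Integer>, String> keyed = ...; // as in the javadoc

// Tumbling: 1-minute windows; the 10 seconds only shifts the window alignment
// (an offset), so windows still never overlap.
keyed.window(TumblingProcessingTimeWindows.of(Time.minutes(1), Time.seconds(10)));

// Sliding: 1-minute windows that start every 10 seconds, so windows do overlap.
keyed.window(SlidingProcessingTimeWindows.of(Time.minutes(1), Time.seconds(10)));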

Thanks


--
Dhruv Kumar
PhD Candidate
Department of Computer Science and Engineering
University of Minnesota
www.dhruvkumar.me



Writing Table API results to a csv file

2018-05-27 Thread chrisr123
I'm using Flink 1.4.0

I'm trying to save the results of a Table API query to a CSV file, but I'm
getting an error.
Here are the details:

My Input file looks like this:
id,species,color,weight,name
311,canine,golden,75,dog1
312,canine,brown,22,dog2
313,feline,gray,8,cat1

I run a query on this to select canines only, and I want to save this to a
csv file:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

String inputPath = "location-of-source-file";
CsvTableSource petsTableSource = CsvTableSource.builder()
    .path(inputPath)
    .ignoreFirstLine()
    .fieldDelimiter(",")
    .field("id", Types.INT())
    .field("species", Types.STRING())
    .field("color", Types.STRING())
    .field("weight", Types.DOUBLE())
    .field("name", Types.STRING())
    .build();

// Register our table source
tableEnv.registerTableSource("pets", petsTableSource);
Table pets = tableEnv.scan("pets");

Table counts = pets
    .groupBy("species")
    .select("species, species.count as count")
    .filter("species === 'canine'");

DataSet<Row> result = tableEnv.toDataSet(counts, Row.class);
result.print();

// Write Results to File
TableSink<Row> sink = new CsvTableSink("/home/hadoop/output/pets", ",");
counts.writeToSink(sink);

When I run this, I get the output from the result.print() call as this:

canine,2

but I do not see any results written
to the file, and I see the error below.
How can I save the results I'm seeing in stdout to a CSV file?
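In case it is relevant, this is the small variant I was going to try next, on
the assumption that the sink-writing part of the plan only runs when the
environment is executed explicitly (print() seems to trigger its own execution):

// Assumption: the sink needs its own execute() call after writeToSink().
counts.writeToSink(new CsvTableSink("/home/hadoop/output/pets", ","));
env.execute("pets to csv");
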
Thanks!



2018-05-27 12:49:44,411 INFO  [main] typeutils.TypeExtractor
(TypeExtractor.java:1873) - class org.apache.flink.types.Row does not
contain a getter for field fields
2018-05-27 12:49:44,411 INFO  [main] typeutils.TypeExtractor
(TypeExtractor.java:1876) - class org.apache.flink.types.Row does not
contain a setter for field fields
2018-05-27 12:49:44,411 INFO  [main] typeutils.TypeExtractor
(TypeExtractor.java:1911) - class org.apache.flink.types.Row is not a valid
POJO type because not all fields are valid POJO fields.
2018-05-27 12:49:44,411 INFO  [main] typeutils.TypeExtractor
(TypeExtractor.java:1873) - class org.apache.flink.types.Row does not
contain a getter for field fields
2018-05-27 12:49:44,412 INFO  [main] typeutils.TypeExtractor
(TypeExtractor.java:1876) - class org.apache.flink.types.Row does not
contain a setter for field fields
2018-05-27 12:49:44,412 INFO  [main] typeutils.TypeExtractor
(TypeExtractor.java:1911) - class org.apache.flink.types.Row is not a valid
POJO type because not all fields are valid POJO fields.

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Checkpointing when reading from files?

2018-05-27 Thread Padarn Wilson
I'm a bit confused about this too actually. I think the above would work as
a solution if you want to continuously monitor a directory, but for a
"PROCESS_ONCE" readFile source I don't think you will get a checkpoint
emitted indicating the end of the stream.

Trying to dig into the java code I found this:

case PROCESS_ONCE:
   synchronized (checkpointLock) {

      // the following check guarantees that if we restart
      // after a failure and we managed to have a successful
      // checkpoint, we will not reprocess the directory.

      if (globalModificationTime == Long.MIN_VALUE) {
         monitorDirAndForwardSplits(fileSystem, context);
         globalModificationTime = Long.MAX_VALUE;
      }
      isRunning = false;
   }
   break;

My understanding of this is that there can be no checkpoints created
while the file directory is read, and then once it is read the
isRunning flag is set to false, which means no new checkpoints are
emitted.

Is this correct? If so, is it possible to somehow force a checkpoint
to be emitted on the completion of the source?
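
For reference, the readFile-based setup being discussed looks roughly like the
sketch below. This is only my own sketch, assuming checkpointing is enabled on
the environment, and it shows the PROCESS_CONTINUOUSLY case where checkpoints
keep flowing:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);   // assumed 60s checkpoint interval

// With PROCESS_CONTINUOUSLY the monitoring source keeps running, so
// checkpoints continue to be taken; with PROCESS_ONCE the source finishes
// after one scan, which is where the "final checkpoint" question arises.
DataStream<String> text = env.readFile(
    new TextInputFormat(new Path(inputPath)),
    inputPath,
    FileProcessingMode.PROCESS_CONTINUOUSLY,
    1000);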



On Tue, May 22, 2018 at 3:24 AM Amit Jain  wrote:

> Hi Alex,
>
> StreamExecutionEnvironment#readFile is a helper function to create a
> file-reading streaming data source. It uses
> ContinuousFileReaderOperator and ContinuousFileMonitoringFunction
> internally.
>
> As both the file reader operator and the monitoring function use
> checkpointing, so does readFile [1]; you can go with the first approach.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.4/api/java/org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.html#readFile-org.apache.flink.api.common.io.FileInputFormat-java.lang.String-org.apache.flink.streaming.api.functions.source.FileProcessingMode-long-org.apache.flink.api.common.typeinfo.TypeInformation-
>
>
> --
> Thanks,
> Amit
>
>
> On Mon, May 21, 2018 at 9:39 PM, NEKRASSOV, ALEXEI  wrote:
> > I want to add checkpointing to my program that reads from a set of files in
> > a directory. Without checkpointing I use readFile():
> >
> >
> >
> >   DataStream<String> text = env.readFile(
> >      new TextInputFormat(new Path(inputPath)),
> >      inputPath,
> >      inputProcessingMode,
> >      1000);
> >
> >
> >
> > Should I use ContinuousFileMonitoringFunction / ContinuousFileReaderOperator
> > to add checkpointing? Or is there an easier way?
> >
> >
> >
> > How do I go from splits (that ContinuousFileMonitoringFunction provides) to
> > actual strings? I’m not clear how ContinuousFileReaderOperator can be used.
> >
> >
> >
> >   DataStreamSource<TimestampedFileInputSplit> split = env.addSource(
> >      new ContinuousFileMonitoringFunction<String>(
> >         new TextInputFormat(new Path(inputPath)),
> >         inputProcessingMode,
> >         1,
> >         1000)
> >   );
> >
> >
> >
> > Thanks,
> > Alex
>


Re: sharebuffer prune code

2018-05-27 Thread Dawid Wysakowicz
The logic for SharedBuffer, and as a result for pruning, will be changed in
FLINK-9418 [1]. We plan to make it backwards compatible. There is already an
open PR [2] (in review); you can check whether the problem persists there.

Regards,
Dawid

[1] https://issues.apache.org/jira/browse/FLINK-9418
[2] https://github.com/apache/flink/pull/6059


On 24.05.2018 12:21, aitozi wrote:
> Can you explain it more explicitly?
>
>
>
> --
> Sent from: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/



