Re: Hive bug? about no such table

2015-12-18 Thread Philip Lee
Oops, sorry.

I was supposed to email this one to the Hive mailing list.


On Fri, Dec 18, 2015 at 2:19 AM, Philip Lee  wrote:

> I think it is a Hive bug, something related to the metastore.
>
> Here is the thing.
>
> After I generated scale factor 300, named bigbench300 (bigbench100
> already existed before),
> I ran the Hive job with bigbench300. At first it was really fine.
> Then I ran the Hive job with bigbench100 again. It was still okay.
> *But then, when I ran bigbench300 again, the error "no such table"
> happened.*
>
> << FAILED: SemanticException [Error 10001]: Line 8:7 Table not found
> 'product_reviews' >>
>
> I tried deleting every "metastore_db" directory in the whole folder, but it
> still happens now.
>
> Have you ever seen this kind of issue before?
> I think something belonging to the Hive metastore configuration is leading
> to this problem.
> I have not seen much information about this on Stack Overflow yet.
>
> Best,
> Philip
>


RE: Reading Parquet/Hive

2015-12-18 Thread Gwenhael Pasquiers
I'll answer myself :)

I think I've managed to make it work by creating my "WrappingReadSupport" that
wraps the DataWritableReadSupport, into which I also insert my "WrappingMaterializer"
that converts the ArrayWritable produced by the original Materializer to
String[]. Later on, the String[] poses no issues with Tuple and it seems
to be OK.
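
Roughly, it looks like this (only a rough sketch: the parquet-mr package names
and exact signatures differ between versions, and these class names are just the
ones I chose for my code):

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
import org.apache.hadoop.io.ArrayWritable;

import parquet.hadoop.api.InitContext;
import parquet.hadoop.api.ReadSupport;
import parquet.io.api.GroupConverter;
import parquet.io.api.RecordMaterializer;
import parquet.schema.MessageType;

public class WrappingReadSupport extends ReadSupport<String[]> {

    private final DataWritableReadSupport wrapped = new DataWritableReadSupport();

    @Override
    public ReadContext init(InitContext context) {
        // delegate schema / projection handling to Hive's read support
        return wrapped.init(context);
    }

    @Override
    public RecordMaterializer<String[]> prepareForRead(Configuration conf,
            Map<String, String> keyValueMetaData, MessageType fileSchema,
            ReadContext readContext) {
        final RecordMaterializer<ArrayWritable> original =
                wrapped.prepareForRead(conf, keyValueMetaData, fileSchema, readContext);
        // the "WrappingMaterializer": same converters, but the produced record
        // is turned into a String[] that Flink's serializers can instantiate
        return new RecordMaterializer<String[]>() {
            @Override
            public String[] getCurrentRecord() {
                return original.getCurrentRecord().toStrings();
            }

            @Override
            public GroupConverter getRootConverter() {
                return original.getRootConverter();
            }
        };
    }
}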

Now ... let's write those String[] to Parquet too :)


From: Gwenhael Pasquiers [mailto:gwenhael.pasqui...@ericsson.com]
Sent: vendredi 18 décembre 2015 10:04
To: user@flink.apache.org
Subject: Reading Parquet/Hive

Hi,

I'm trying to read Parquet/Hive data using parquet's ParquetInputFormat and 
hive's DataWritableReadSupport.

I get an error when the TupleSerializer tries to create an instance of
ArrayWritable using reflection, because ArrayWritable has no no-args
constructor.

I've been able to make it work when executing in a local cluster by copying the
ArrayWritable class into my own sources and adding the constructor. I guess that
the classpath built by Maven puts my code first and allows me to override the
original class. However, when running in the real cluster (YARN @ Cloudera) the
exception comes back (I guess that the original class comes first in the
classpath).

Do you have an idea of how I could make it work?

I think I'm tied to the ArrayWritable type because of the
DataWritableReadSupport that extends ReadSupport<ArrayWritable>.

Would it be possible (and not too complicated) to make a DataSource that would
not generate Tuples and would allow me to convert the ArrayWritable to a
friendlier type like String[]? Or if you have any other ideas, they are
welcome!

B.R.

Gwenhaël PASQUIERS


Re: Size of a window without explicit trigger/evictor

2015-12-18 Thread Fabian Hueske
Hi Nirmalya,

sorry for the delayed answer.

First of all, Flink does not take care that your windows fit into memory.
The default trigger depends on the way in which you define a window. Given
a KeyedStream you can define a window in the following ways:

KeyedStream<T, KEY> s = ...
s.timeWindow() // this will use an EventTimeTrigger or
ProcessingTimeTrigger, depending on the time characteristics of the stream
s.countWindow() // this will use a CountTrigger
s.window(? extends WindowAssigner) // this will use the default trigger as
defined by the WindowAssigner

None of these triggers monitors the JVM heap to prevent OOMs. If you define
a TimeTrigger for one hour and receive too much data, the program will
crash. IMO, this behavior is preferable to early triggering, which would
cause semantically wrong results. If you use a ReduceFunction to compute
the result of a window (and no Evictor), the window result can be partially
aggregated and its state does not grow.
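
For example, a sketch of this case (assuming a DataStream<Tuple2<String, Long>>
called stream, the usual flink-streaming imports including TimeUnit, and
illustrative names only):

stream
    .keyBy(0)
    .timeWindow(Time.of(1, TimeUnit.HOURS)) // EventTimeTrigger or ProcessingTimeTrigger
    .reduce(new ReduceFunction<Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> reduce(Tuple2<String, Long> a, Tuple2<String, Long> b) {
            // the window keeps only this running partial aggregate per key,
            // not all of the elements
            return new Tuple2<String, Long>(a.f0, a.f1 + b.f1);
        }
    });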

Best, Fabian

2015-12-10 2:47 GMT+01:00 Nirmalya Sengupta :

> Hello Fabian 
>
> A small question: during the course of our recent conversation on the
> behaviour of window, trigger and evictor, you had mentioned that if I - the
> application programmer - do not attach a trigger to a window, Flink will
> attach one by itself. This trigger ensures that the size of the window
> never grows beyond a threshold, thereby ensuring that a burgeoning window
> never inflicts an OOM on Flink.
>
> Is this a special Trigger? What's the name of the class? Moreover, how is
> that threshold size (of the window) determined? Is it configurable?
>
> TIA.
>
> -- Nirmalya
>
> --
> Software Technologist
> http://www.linkedin.com/in/nirmalyasengupta
> "If you have built castles in the air, your work need not be lost. That is
> where they should be.
> Now put the foundation under them."
>


Re: Usecase for Flink

2015-12-18 Thread Stephan Ewen
If I understand you correctly, you want to write something like:

--

[cassandra]
  ^
  |
  V
(event source) > (Add event and lookup) ---> (further ops)

--

That should work with Flink, yes. You can communicate with an external
Cassandra service inside functions.
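
For illustration only, a rough sketch of the "(Add event and lookup)" operator,
assuming the DataStax Java driver and a made-up events table keyed by id (no
error handling, no async execution):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class AddEventAndLookup extends RichMapFunction<Tuple2<String, String>, Tuple2<String, Long>> {

    private transient Cluster cluster;
    private transient Session session;

    @Override
    public void open(Configuration parameters) {
        cluster = Cluster.builder().addContactPoint("cassandra-host").build();
        session = cluster.connect("my_keyspace");
    }

    @Override
    public Tuple2<String, Long> map(Tuple2<String, String> event) {
        // 1. store the incoming event (id, payload)
        session.execute("INSERT INTO events (id, payload) VALUES (?, ?)", event.f0, event.f1);
        // 2. look up the history for this id and compute something over it
        long historySize = session.execute(
                "SELECT COUNT(*) FROM events WHERE id = ?", event.f0).one().getLong(0);
        return new Tuple2<String, Long>(event.f0, historySize);
    }

    @Override
    public void close() {
        if (session != null) { session.close(); }
        if (cluster != null) { cluster.close(); }
    }
}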

We are also working on making larger-than-memory state easily supported in
Flink, so future versions may allow you to
do this without any external service.






On Thu, Dec 17, 2015 at 8:54 AM, igor.berman  wrote:

> Hi,
> We are looking at Flink and trying to understand if our usecase is relevant
> to it.
>
> We need to process a stream of events. Each event is for some id (e.g. device
> id),
> where each event should be
> 1. stored in some persistent storage(e.g. cassandra)
> 2. previously persisted events should be fetched, and some computation over
> the whole history may or may not trigger some other events (e.g. sending email)
>
> So yes, we have a stream of events, but we need a persistent store (aka an
> external service) in the middle,
> and there is no aggregation of those events into something smaller which
> could be stored in memory, i.e. the number of ids might be huge and the
> previous history of events per id can be considerable, so there is no way to
> store everything in memory
>
> I was wondering if Akka Streams is a possible alternative solution too
>
> please share your ideas :)
> thanks in advance,
> Igor
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Usecase-for-Flink-tp4076.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


Re: Problem to show logs in task managers

2015-12-18 Thread Till Rohrmann
In which log file are you exactly looking for the logging statements? And
on what machine? You have to look on the machines on which the YARN
containers were started. Alternatively, if you have log aggregation
activated, then you can simply retrieve the log files via yarn logs (e.g.
yarn logs -applicationId <application id>).

Cheers,
Till

On Fri, Dec 18, 2015 at 3:49 PM, Ana M. Martinez  wrote:

> Hi Till,
>
> Many thanks for your quick response.
>
> I have modified the WordCountExample to reproduce my problem in a
> simple example.
>
> I run the code below with the following command:
> ./bin/flink run -m yarn-cluster -yn 1 -ys 4 -yjm 1024 -ytm 1024 -c
> mypackage.WordCountExample ../flinklink.jar
>
> And if I check the log file I see all logger messages except the one in
> the flatMap function of the inner LineSplitter class, which is actually the
> one I am most interested in.
>
> Is that expected behaviour?
>
> Thanks,
> Ana
>
> import org.apache.flink.api.common.functions.FlatMapFunction;
> import org.apache.flink.api.java.DataSet;
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.util.Collector;
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
>
> import java.io.Serializable;
> import java.util.ArrayList;
> import java.util.List;
>
> public class WordCountExample {
> static Logger logger = LoggerFactory.getLogger(WordCountExample.class);
>
> public static void main(String[] args) throws Exception {
> final ExecutionEnvironment env = 
> ExecutionEnvironment.getExecutionEnvironment();
>
> logger.info("Entering application.");
>
> DataSet<String> text = env.fromElements(
> "Who's there?",
> "I think I hear them. Stand, ho! Who's there?");
>
> List<Integer> elements = new ArrayList<Integer>();
> elements.add(0);
>
>
> DataSet<TestClass> set = env.fromElements(new TestClass(elements));
>
> DataSet<Tuple2<String, Integer>> wordCounts = text
> .flatMap(new LineSplitter())
> .withBroadcastSet(set, "set")
> .groupBy(0)
> .sum(1);
>
> wordCounts.print();
>
>
> }
>
> public static class LineSplitter implements FlatMapFunction<String,
> Tuple2<String, Integer>> {
>
> static Logger loggerLineSplitter =
> LoggerFactory.getLogger(LineSplitter.class);
>
> @Override
> public void flatMap(String line, Collector<Tuple2<String, Integer>>
> out) {
> loggerLineSplitter.info("Logger in LineSplitter.flatMap");
> for (String word : line.split(" ")) {
> out.collect(new Tuple2<String, Integer>(word, 1));
> }
> }
> }
>
> public static class TestClass implements Serializable {
> private static final long serialVersionUID = -2932037991574118651L;
>
> static Logger loggerTestClass = 
> LoggerFactory.getLogger("WordCountExample.TestClass");
>
> List<Integer> integerList;
> public TestClass(List<Integer> integerList){
> this.integerList=integerList;
> loggerTestClass.info("Logger in TestClass");
> }
>
>
> }
> }
>
>
>
>
> On 17 Dec 2015, at 16:08, Till Rohrmann  wrote:
>
> Hi Ana,
>
> you can simply modify the `log4j.properties` file in the `conf` directory.
> It should be automatically included in the Yarn application.
>
> Concerning your logging problem, it might be that you have set the logging
> level too high. Could you share the code with us?
>
> Cheers,
> Till
>
> On Thu, Dec 17, 2015 at 1:56 PM, Ana M. Martinez  wrote:
>
>> Hi flink community,
>>
>> I am trying to show log messages using log4j.
>> It works fine overall except for the messages I want to show in an inner
>> class that implements
>> org.apache.flink.api.common.aggregators.ConvergenceCriterion.
>> I am very new to this, but it seems that I'm having problems showing the
>> messages included in the isConverged function, since it runs in the task
>> managers?
>> E.g. the log messages in the outer class (before map-reduce operations)
>> are properly shown.
>>
>> I am also interested in providing my own log4j.properties file. I am
>> using the ./bin/flink run -m yarn-cluster on Amazon clusters.
>>
>> Thanks,
>> Ana
>
>
>
>


Re: Problem to show logs in task managers

2015-12-18 Thread Ana M. Martinez
Hi Till,

Many thanks for your quick response.

I have modified the WordCountExample to reproduce my problem in a simple
example.

I run the code below with the following command:
./bin/flink run -m yarn-cluster -yn 1 -ys 4 -yjm 1024 -ytm 1024 -c 
mypackage.WordCountExample ../flinklink.jar

And if I check the log file I see all logger messages except the one in the 
flatMap function of the inner LineSplitter class, which is actually the one I 
am most interested in.

Is that expected behaviour?

Thanks,
Ana


import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class WordCountExample {
static Logger logger = LoggerFactory.getLogger(WordCountExample.class);

public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = 
ExecutionEnvironment.getExecutionEnvironment();

logger.info("Entering application.");

DataSet<String> text = env.fromElements(
"Who's there?",
"I think I hear them. Stand, ho! Who's there?");

List<Integer> elements = new ArrayList<Integer>();
elements.add(0);


DataSet<TestClass> set = env.fromElements(new TestClass(elements));

DataSet<Tuple2<String, Integer>> wordCounts = text
.flatMap(new LineSplitter())
.withBroadcastSet(set, "set")
.groupBy(0)
.sum(1);

wordCounts.print();


}

public static class LineSplitter implements FlatMapFunction<String,
Tuple2<String, Integer>> {

static Logger loggerLineSplitter =
LoggerFactory.getLogger(LineSplitter.class);

@Override
public void flatMap(String line, Collector<Tuple2<String, Integer>>
out) {
loggerLineSplitter.info("Logger in LineSplitter.flatMap");
for (String word : line.split(" ")) {
out.collect(new Tuple2<String, Integer>(word, 1));
}
}
}

public static class TestClass implements Serializable {
private static final long serialVersionUID = -2932037991574118651L;

static Logger loggerTestClass = 
LoggerFactory.getLogger("WordCountExample.TestClass");

List<Integer> integerList;
public TestClass(List<Integer> integerList){
this.integerList=integerList;
loggerTestClass.info("Logger in TestClass");
}


}
}



On 17 Dec 2015, at 16:08, Till Rohrmann wrote:

Hi Ana,

you can simply modify the `log4j.properties` file in the `conf` directory. It 
should be automatically included in the Yarn application.

Concerning your logging problem, it might be that you have set the logging 
level too high. Could you share the code with us?

Cheers,
Till

On Thu, Dec 17, 2015 at 1:56 PM, Ana M. Martinez wrote:
Hi flink community,

I am trying to show log messages using log4j.
It works fine overall except for the messages I want to show in an inner class 
that implements org.apache.flink.api.common.aggregators.ConvergenceCriterion.
I am very new to this, but it seems that I'm having problems showing the
messages included in the isConverged function, since it runs in the task managers?
E.g. the log messages in the outer class (before map-reduce operations) are 
properly shown.

I am also interested in providing my own log4j.properties file. I am using the 
./bin/flink run -m yarn-cluster on Amazon clusters.

Thanks,
Ana




Re: Streaming to db question

2015-12-18 Thread Flavio Pompermaier
I was thinking of something more like
http://www.infoq.com/articles/key-lessons-learned-from-transition-to-nosql
that basically implements what you call out-of-core state at
https://cwiki.apache.org/confluence/display/FLINK/Stateful+Stream+Processing.
Riak provides
some features to handle the eventually consistent nature of that use
case... or are you more likely to go with the currently proposed solution (the
one in the Flink wiki...)?
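
(For reference, this is how I understood the partitioned-sink suggestion in your
mail below; only a minimal sketch, with the record type, key field and sink body
being purely illustrative:)

DataStream<Tuple2<Long, String>> updates = ...; // (id, newValue)

updates
    .keyBy(0) // partition by the primary key, field 0
    .addSink(new RichSinkFunction<Tuple2<Long, String>>() {
        @Override
        public void invoke(Tuple2<Long, String> record) throws Exception {
            // all records with the same key reach this same parallel subtask,
            // so an upsert of the row with id = record.f0 never races with
            // another subtask; e.g. run the merge/UPDATE here over a JDBC
            // connection opened in open()
        }
    });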

On Mon, Dec 14, 2015 at 8:18 PM, Stephan Ewen  wrote:

> Hi!
>
> If the sink that writes to the Database executes partitioned by the
> primary key, then this should naturally prevent row conflicts.
>
> Greetings,
> Stephan
>
>
> On Mon, Dec 14, 2015 at 11:32 AM, Flavio Pompermaier wrote:
>
>> Hi flinkers,
>> I was going to evaluate if Flink streaming could fit a use case we have,
>> where data comes into the system, gets transformed and then added to a db
>> (a very common problem..).
>> In such a use case you have to manage the merge of existing records as new
>> data comes in. How can you ensure that only one row/entity of the db is
>> updated at a time with Flink?
>> Is there any example?
>>
>> Best,
>> Flavio
>>
>