Re: Hive bug? about no such table
Oops, sorry, I was supposed to send this one to the Hive mailing list.

On Fri, Dec 18, 2015 at 2:19 AM, Philip Lee wrote:
> I think it is a Hive bug, something related to the metastore.
>
> Here is the thing.
>
> After I generated scale factor 300, named bigbench300, alongside
> bigbench100, which already existed, I ran a Hive job against bigbench300.
> At first it was really fine.
> Then I ran a Hive job against bigbench100 again. It was still okay.
> *But then, when I ran bigbench300 again, the error "no such table"
> appeared:*
>
> << FAILED: SemanticException [Error 10001]: Line 8:7 Table not found
> 'product_reviews' >>
>
> I tried deleting every "metastore_db" directory in the whole folder, but
> it still happens.
>
> Have you ever seen this kind of issue before?
> I think something in the Hive metastore configuration is causing this
> problem.
> I haven't found much information about this on Stack Overflow yet.
>
> Best,
> Philip
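[Editor's note] A common cause of this symptom is Hive's embedded Derby metastore: by default it creates a `metastore_db` directory in whatever working directory Hive is launched from, so jobs started from different directories can silently see different metastores (and different sets of tables). One way to rule this out is to pin the metastore to a single fixed location in hive-site.xml. A sketch — the path below is only an example and must be adjusted to the local installation:

```xml
<!-- hive-site.xml: pin the embedded Derby metastore to one fixed path,
     so every Hive invocation sees the same set of tables.
     /var/hive/metastore_db is an example path, not a default. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/var/hive/metastore_db;create=true</value>
</property>
```

For multi-user or production setups, a standalone metastore database (e.g. MySQL or PostgreSQL) avoids the problem entirely.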
RE: Reading Parquet/Hive
I'll answer my own question :) I think I've managed to make it work by creating a "WrappingReadSupport" that wraps the DataWritableReadSupport, into which I also insert a "WrappingMaterializer" that converts the ArrayWritable produced by the original materializer to String[]. After that, the String[] poses no issues with Tuple and it seems to be OK. Now ... let's write those String[] to Parquet too :)

From: Gwenhael Pasquiers [mailto:gwenhael.pasqui...@ericsson.com]
Sent: Friday, December 18, 2015 10:04
To: user@flink.apache.org
Subject: Reading Parquet/Hive

Hi,

I'm trying to read Parquet/Hive data using Parquet's ParquetInputFormat and Hive's DataWritableReadSupport. I get an error when the TupleSerializer tries to create an instance of ArrayWritable by reflection, because ArrayWritable has no no-args constructor.

I've been able to make it work when executing in a local cluster by copying the ArrayWritable class into my own sources and adding the constructor. I guess that the classpath built by Maven puts my code first and lets me override the original class. However, when running on the real cluster (YARN @ Cloudera) the exception comes back (I guess the original class comes first in the classpath there).

Do you have an idea of how I could make it work? I think I'm tied to the ArrayWritable type because of the DataWritableReadSupport, which extends ReadSupport. Would it be possible (and not too complicated) to make a DataSource that would not generate Tuples and would allow me to convert the ArrayWritable to a friendlier type like String[]? Or if you have any other ideas, they are welcome!

B.R.
Gwenhaël PASQUIERS
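[Editor's note] For anyone hitting the same issue, the wrapping idea described above can be sketched in plain Java. This is a simplified illustration, not the real Parquet API: `Materializer` below is a stand-in for Parquet's `RecordMaterializer`, and `Object[]` stands in for the `ArrayWritable` rows. The point is only the conversion step that makes records serializer-friendly.

```java
import java.util.Arrays;

// Sketch of the "wrapping materializer" idea: the inner materializer
// produces records as Object[] (standing in for Hive's ArrayWritable),
// and the wrapper converts each record to String[], which has a no-args
// constructor-free serialization path and works fine in Tuples.
public class WrappingDemo {

    // Stand-in for parquet's RecordMaterializer<T> (hypothetical, simplified).
    interface Materializer<T> {
        T getCurrentRecord();
    }

    // Delegates to the wrapped materializer and converts its output.
    static class WrappingMaterializer implements Materializer<String[]> {
        private final Materializer<Object[]> inner;

        WrappingMaterializer(Materializer<Object[]> inner) {
            this.inner = inner;
        }

        @Override
        public String[] getCurrentRecord() {
            Object[] raw = inner.getCurrentRecord();
            String[] out = new String[raw.length];
            for (int i = 0; i < raw.length; i++) {
                out[i] = raw[i] == null ? null : raw[i].toString();
            }
            return out;
        }
    }

    public static void main(String[] args) {
        // Inner materializer emitting one fake "row".
        Materializer<Object[]> original = () -> new Object[]{"alice", 42, null};
        Materializer<String[]> wrapped = new WrappingMaterializer(original);
        System.out.println(Arrays.toString(wrapped.getCurrentRecord())); // [alice, 42, null]
    }
}
```

In the real solution the wrapper class would also delegate `init`/`prepareForRead` to DataWritableReadSupport; only the record-conversion step is shown here.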
Re: Size of a window without explicit trigger/evictor
Hi Nirmalya,

sorry for the delayed answer.

First of all, Flink does not ensure that your windows fit into memory. The default trigger depends on the way in which you define a window. Given a KeyedStream, you can define a window in the following ways:

KeyedStream s = ...
s.timeWindow(...)  // uses an EventTimeTrigger or ProcessingTimeTrigger, depending on the time characteristic of the stream
s.countWindow(...) // uses a CountTrigger
s.window(<? extends WindowAssigner>) // uses the default trigger defined by the WindowAssigner

None of these triggers monitors the JVM heap to prevent OOMs. If you define a time window of one hour and receive too much data, the program will crash. IMO, this behavior is preferable to early triggering, which would cause semantically wrong results.

If you use a ReduceFunction to compute the result of a window (and no Evictor), the window result can be partially aggregated and its state does not grow.

Best, Fabian

2015-12-10 2:47 GMT+01:00 Nirmalya Sengupta:
> Hello Fabian
>
> A small question: during the course of our recent conversation on the
> behaviour of window, trigger and evictor, you mentioned that if I - the
> application programmer - do not attach a trigger to a window, Flink will
> attach one by itself. This trigger ensures that the size of the window
> never grows beyond a threshold, thereby ensuring that a burgeoning window
> never inflicts an OOM on Flink.
>
> Is this a special Trigger? What's the name of the class? Moreover, how is
> that threshold size (of the window) determined? Is it configurable?
>
> TIA.
>
> -- Nirmalya
>
> --
> Software Technologist
> http://www.linkedin.com/in/nirmalyasengupta
> "If you have built castles in the air, your work need not be lost. That is
> where they should be. Now put the foundation under them."
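[Editor's note] The last point above — that a ReduceFunction keeps window state from growing — can be illustrated in plain Java. This is not Flink API code; it only shows why incremental aggregation keeps a single value per window instead of buffering every element until the trigger fires.

```java
import java.util.function.BinaryOperator;

// Sketch of incremental window aggregation: each arriving element is
// folded into one running aggregate, so the window's state is a single
// value regardless of how many elements arrive. (In Flink this is what
// WindowedStream.reduce(ReduceFunction) does under the hood.)
public class IncrementalWindowState {
    private Integer aggregate;                       // the only state kept
    private final BinaryOperator<Integer> reduce;

    IncrementalWindowState(BinaryOperator<Integer> reduce) {
        this.reduce = reduce;
    }

    void add(int element) {
        // Fold the new element into the running aggregate.
        aggregate = (aggregate == null) ? element : reduce.apply(aggregate, element);
    }

    Integer fire() {                                 // result when the trigger fires
        return aggregate;
    }

    public static void main(String[] args) {
        IncrementalWindowState w = new IncrementalWindowState(Integer::sum);
        for (int i = 1; i <= 100; i++) {
            w.add(i);                                // state stays one Integer
        }
        System.out.println(w.fire());                // 5050
    }
}
```

With an Evictor (or a WindowFunction that needs the raw elements), this optimization is not possible and all elements must be buffered, which is exactly when an oversized window can exhaust the heap.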
Re: Usecase for Flink
If I understand you correctly, you want to build something like:

                    [cassandra]
                       ^  |
                       |  v
(event source) --> (add event and lookup) --> (further ops)

That should work with Flink, yes. You can communicate with an external Cassandra service inside functions. We are also working on making larger-than-memory state easily supported in Flink, so future versions may allow you to do this without any external service.

On Thu, Dec 17, 2015 at 8:54 AM, igor.berman wrote:
> Hi,
> We are looking at Flink and trying to understand if our use case is a
> good fit for it.
>
> We need to process a stream of events. Each event is for some id (e.g. a
> device id), and each event should be
> 1. stored in some persistent storage (e.g. Cassandra);
> 2. previously persisted events should be fetched, and some computation
> over the whole history may or may not trigger some other event (e.g.
> sending an email).
>
> So yes, we have a stream of events, but we need a persistent store (aka
> an external service) in the middle, and there is no aggregation of those
> events into something smaller that could be stored in memory; i.e. the
> number of ids might be huge and the history of events per id can be
> considerable, so there is no way to store everything in memory.
>
> I was wondering whether Akka Streams would be a possible solution too.
>
> Please share your ideas :)
> thanks in advance,
> Igor
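[Editor's note] The per-id "add event and lookup" step could look roughly like the sketch below. This is plain Java with a HashMap standing in for Cassandra and a made-up three-event alert rule; in a real Flink job the same logic would sit inside a rich function whose open() method creates the Cassandra session.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Sketch of the pattern in the diagram above: for each incoming event,
// append it to an external per-id history store, then run a computation
// over the whole history which may emit a follow-up action.
public class EventHistoryLookup {
    // Stand-in for Cassandra: id -> full event history.
    private final Map<String, List<String>> store = new HashMap<>();
    private final Consumer<String> sideEffect;   // e.g. "send email"

    EventHistoryLookup(Consumer<String> sideEffect) {
        this.sideEffect = sideEffect;
    }

    void onEvent(String deviceId, String event) {
        List<String> history = store.computeIfAbsent(deviceId, k -> new ArrayList<>());
        history.add(event);                      // 1. persist the event
        if (history.size() >= 3) {               // 2. compute over whole history
            sideEffect.accept("alert for " + deviceId);
        }
    }

    public static void main(String[] args) {
        List<String> alerts = new ArrayList<>();
        EventHistoryLookup f = new EventHistoryLookup(alerts::add);
        for (String e : new String[]{"a", "b", "c"}) {
            f.onEvent("device-1", e);
        }
        System.out.println(alerts);              // [alert for device-1]
    }
}
```

Since the history lives in the external store rather than in operator state, the number of distinct ids is not bounded by the TaskManagers' memory, which matches the requirement described above.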
Re: Problem to show logs in task managers
In which log file exactly are you looking for the logging statements? And on which machine? You have to look on the machines on which the YARN containers were started. Alternatively, if you have log aggregation activated, you can simply retrieve the log files via "yarn logs".

Cheers,
Till

On Fri, Dec 18, 2015 at 3:49 PM, Ana M. Martinez wrote:
> Hi Till,
>
> Many thanks for your quick response.
>
> I have modified the WordCountExample to reproduce my problem in a simple
> example.
>
> I run the code below with the following command:
> ./bin/flink run -m yarn-cluster -yn 1 -ys 4 -yjm 1024 -ytm 1024 -c
> mypackage.WordCountExample ../flinklink.jar
>
> And if I check the log file I see all logger messages except the one in
> the flatMap function of the inner LineSplitter class, which is actually
> the one I am most interested in.
>
> Is that expected behaviour?
>
> Thanks,
> Ana
>
> import org.apache.flink.api.common.functions.FlatMapFunction;
> import org.apache.flink.api.java.DataSet;
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.util.Collector;
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
>
> import java.io.Serializable;
> import java.util.ArrayList;
> import java.util.List;
>
> public class WordCountExample {
>     static Logger logger = LoggerFactory.getLogger(WordCountExample.class);
>
>     public static void main(String[] args) throws Exception {
>         final ExecutionEnvironment env =
>                 ExecutionEnvironment.getExecutionEnvironment();
>
>         logger.info("Entering application.");
>
>         DataSet<String> text = env.fromElements(
>                 "Who's there?",
>                 "I think I hear them. Stand, ho! Who's there?");
>
>         List<Integer> elements = new ArrayList<Integer>();
>         elements.add(0);
>
>         DataSet<TestClass> set = env.fromElements(new TestClass(elements));
>
>         DataSet<Tuple2<String, Integer>> wordCounts = text
>                 .flatMap(new LineSplitter())
>                 .withBroadcastSet(set, "set")
>                 .groupBy(0)
>                 .sum(1);
>
>         wordCounts.print();
>     }
>
>     public static class LineSplitter
>             implements FlatMapFunction<String, Tuple2<String, Integer>> {
>
>         static Logger loggerLineSplitter =
>                 LoggerFactory.getLogger(LineSplitter.class);
>
>         @Override
>         public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
>             loggerLineSplitter.info("Logger in LineSplitter.flatMap");
>             for (String word : line.split(" ")) {
>                 out.collect(new Tuple2<String, Integer>(word, 1));
>             }
>         }
>     }
>
>     public static class TestClass implements Serializable {
>         private static final long serialVersionUID = -2932037991574118651L;
>
>         static Logger loggerTestClass =
>                 LoggerFactory.getLogger("WordCountExample.TestClass");
>
>         List<Integer> integerList;
>
>         public TestClass(List<Integer> integerList) {
>             this.integerList = integerList;
>             loggerTestClass.info("Logger in TestClass");
>         }
>     }
> }
>
> On 17 Dec 2015, at 16:08, Till Rohrmann wrote:
>
> Hi Ana,
>
> you can simply modify the `log4j.properties` file in the `conf` directory.
> It should be automatically included in the YARN application.
>
> Concerning your logging problem, it might be that you have set the logging
> level too high. Could you share the code with us?
>
> Cheers,
> Till
>
> On Thu, Dec 17, 2015 at 1:56 PM, Ana M. Martinez wrote:
>
>> Hi flink community,
>>
>> I am trying to show log messages using log4j.
>> It works fine overall, except for the messages I want to show in an inner
>> class that implements
>> org.apache.flink.api.common.aggregators.ConvergenceCriterion.
>> I am very new to this, but it seems that I'm having problems showing the
>> messages included in the isConverged function, as it runs in the task
>> managers?
>> E.g. the log messages in the outer class (before the map-reduce
>> operations) are shown properly.
>>
>> I am also interested in providing my own log4j.properties file. I am
>> using ./bin/flink run -m yarn-cluster on Amazon clusters.
>>
>> Thanks,
>> Ana
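[Editor's note] Regarding the log4j.properties question: the file in Flink's conf/ directory is shipped to the YARN containers, so raising the level for your own package there is usually enough to see statements from code that runs on the TaskManagers. A sketch along the lines of Flink's default file — the appender details may differ between Flink versions, and "mypackage" is a placeholder for your own package:

```properties
# conf/log4j.properties (shipped to the YARN containers)
log4j.rootLogger=INFO, file

# Show debug-level output from your own classes, e.g. LineSplitter.
log4j.logger.mypackage=DEBUG

# Flink's default file appender; ${log.file} is set by the startup scripts.
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.file=${log.file}
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{HH:mm:ss,SSS} %-5p %-60c %x - %m%n
```

Note that statements logged inside flatMap land in the TaskManager container logs (retrievable via "yarn logs"), not in the client-side log where the driver's messages appear.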
Re: Problem to show logs in task managers
Hi Till,

Many thanks for your quick response.

I have modified the WordCountExample to reproduce my problem in a simple example.

I run the code below with the following command:
./bin/flink run -m yarn-cluster -yn 1 -ys 4 -yjm 1024 -ytm 1024 -c mypackage.WordCountExample ../flinklink.jar

And if I check the log file I see all logger messages except the one in the flatMap function of the inner LineSplitter class, which is actually the one I am most interested in.

Is that expected behaviour?

Thanks,
Ana

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class WordCountExample {
    static Logger logger = LoggerFactory.getLogger(WordCountExample.class);

    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env =
                ExecutionEnvironment.getExecutionEnvironment();

        logger.info("Entering application.");

        DataSet<String> text = env.fromElements(
                "Who's there?",
                "I think I hear them. Stand, ho! Who's there?");

        List<Integer> elements = new ArrayList<Integer>();
        elements.add(0);

        DataSet<TestClass> set = env.fromElements(new TestClass(elements));

        DataSet<Tuple2<String, Integer>> wordCounts = text
                .flatMap(new LineSplitter())
                .withBroadcastSet(set, "set")
                .groupBy(0)
                .sum(1);

        wordCounts.print();
    }

    public static class LineSplitter
            implements FlatMapFunction<String, Tuple2<String, Integer>> {

        static Logger loggerLineSplitter =
                LoggerFactory.getLogger(LineSplitter.class);

        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            loggerLineSplitter.info("Logger in LineSplitter.flatMap");
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }

    public static class TestClass implements Serializable {
        private static final long serialVersionUID = -2932037991574118651L;

        static Logger loggerTestClass =
                LoggerFactory.getLogger("WordCountExample.TestClass");

        List<Integer> integerList;

        public TestClass(List<Integer> integerList) {
            this.integerList = integerList;
            loggerTestClass.info("Logger in TestClass");
        }
    }
}

On 17 Dec 2015, at 16:08, Till Rohrmann wrote:

Hi Ana,

you can simply modify the `log4j.properties` file in the `conf` directory. It should be automatically included in the YARN application.

Concerning your logging problem, it might be that you have set the logging level too high. Could you share the code with us?

Cheers,
Till

On Thu, Dec 17, 2015 at 1:56 PM, Ana M. Martinez wrote:

Hi flink community,

I am trying to show log messages using log4j. It works fine overall, except for the messages I want to show in an inner class that implements org.apache.flink.api.common.aggregators.ConvergenceCriterion. I am very new to this, but it seems that I'm having problems showing the messages included in the isConverged function, as it runs in the task managers? E.g. the log messages in the outer class (before the map-reduce operations) are shown properly.

I am also interested in providing my own log4j.properties file. I am using ./bin/flink run -m yarn-cluster on Amazon clusters.

Thanks,
Ana
Re: Streaming to db question
I was thinking of something more like
http://www.infoq.com/articles/key-lessons-learned-from-transition-to-nosql
which basically implements what you call out-of-core state at
https://cwiki.apache.org/confluence/display/FLINK/Stateful+Stream+Processing.
Riak provides some features to handle the eventually consistent nature of that use case... or are you more likely to go with the currently proposed solution (the one in the Flink wiki)?

On Mon, Dec 14, 2015 at 8:18 PM, Stephan Ewen wrote:
> Hi!
>
> If the sink that writes to the database executes partitioned by the
> primary key, then this should naturally prevent row conflicts.
>
> Greetings,
> Stephan
>
> On Mon, Dec 14, 2015 at 11:32 AM, Flavio Pompermaier wrote:
>
>> Hi flinkers,
>> I was going to evaluate whether Flink streaming could fit a use case we
>> have, where data comes into the system, gets transformed and is then
>> added to a db (a very common problem...).
>> In such a use case you have to manage the merging of existing records
>> as new data comes in. How can you ensure that only one row/entity of
>> the db is updated at a time with Flink?
>> Is there any example?
>>
>> Best,
>> Flavio
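[Editor's note] Stephan's partitioning argument can be made concrete: hash-partitioning the stream by primary key (which is what keying a stream does) sends every update for a given row to the same sink instance, so updates to any one row are applied sequentially by a single writer. A plain-Java sketch of the idea, not Flink API code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of key-based partitioning: events sharing a primary key always
// map to the same partition index, i.e. the same sink instance, so no
// two writers ever touch the same database row concurrently.
public class KeyPartitioning {
    static int partitionFor(String primaryKey, int parallelism) {
        // Non-negative hash modulo the number of sink instances.
        return (primaryKey.hashCode() & Integer.MAX_VALUE) % parallelism;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        Map<Integer, List<String>> sinkInstances = new HashMap<>();
        for (String event : new String[]{"row-A:v1", "row-B:v1", "row-A:v2"}) {
            String key = event.split(":")[0];   // primary key of the target row
            sinkInstances
                .computeIfAbsent(partitionFor(key, parallelism), k -> new ArrayList<>())
                .add(event);
        }
        // Both updates to row-A land in the same sink instance, in arrival order.
        System.out.println(sinkInstances);
    }
}
```

Within one sink instance, the merge of an incoming record with the existing row then becomes an ordinary read-modify-write with no row-level race, regardless of the total sink parallelism.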