Re: Get the value of DStream[(String, Iterable[String])]

2014-12-17 Thread Guillermo Ortiz
> > > The question is: what do you want to do with that count afterwards? > > -kr, Gerard. > > > On Wed, Dec 17, 2014 at 5:11 PM, Guillermo Ortiz > wrote: >> >> I'm a newbie with Spark, a simple question: >> >> val errorLines = lines.

Re: Get the value of DStream[(String, Iterable[String])]

2014-12-17 Thread Guillermo Ortiz
).size() > X){ //iterate values and do something for each element. } I think that it must be pretty basic, argh. 2014-12-17 18:43 GMT+01:00 Guillermo Ortiz : > What I would like to do is to count the number of elements and if > it's greater than a number, I have to iterat
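
A minimal sketch of the pattern this thread converges on, assuming the errorLines stream from the original question (the threshold X and the per-element work are illustrative):

    import org.apache.spark.streaming.dstream.DStream

    // errorLines: DStream[(String, Iterable[String])], as in the original question.
    // X is a hypothetical threshold; only groups larger than X are processed.
    def processLargeGroups(errorLines: DStream[(String, Iterable[String])], X: Int): Unit = {
      errorLines.foreachRDD { rdd =>
        rdd.foreach { case (key, values) =>
          if (values.size > X) {
            // iterate values and do something for each element
            values.foreach(v => println(s"$key -> $v"))
          }
        }
      }
    }

Note that the println runs on the executors, so its output lands in the executor logs, not on the driver console.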

Spark Streaming and Windows, it always counts the logs during all the windows. Why?

2014-12-26 Thread Guillermo Ortiz
I'm trying to make some operations with windows and intervals. I get data every 15 seconds, and want to have a window of 60 seconds with batch intervals of 15 seconds. I'm injecting data with ncat. If I inject 3 logs in the same interval I get into the "do something" each 15 seconds during one min
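
A minimal sketch of the setup being described, assuming a socket source fed by ncat (host, port and app name are illustrative). With a 60-second window sliding every 15 seconds, a log injected once belongs to four consecutive windows, which is why it keeps showing up in the output for a full minute:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("window-test")
    val ssc = new StreamingContext(conf, Seconds(15))   // 15-second batch interval

    // Hypothetical source, fed with e.g. "ncat -lk 9999".
    val lines = ssc.socketTextStream("localhost", 9999)

    // 60-second window sliding every 15 seconds: each batch contributes
    // to four consecutive windows before it ages out.
    val counts = lines.window(Seconds(60), Seconds(15)).count()
    counts.print()

    ssc.start()
    ssc.awaitTermination()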

Re: Spark Streaming and Windows, it always counts the logs during all the windows. Why?

2014-12-26 Thread Guillermo Ortiz
println("fin foreachRdd"); ///Just one time, when I start the program Why it's just executing the println("4...")?? shouldn't it execute all the code each 15 seconds that it's what it's defined on the context (val ssc = new StreamingCont

Re: Spark Streaming and Windows, it always counts the logs during all the windows. Why?

2014-12-26 Thread Guillermo Ortiz
Oh, I didn't understand what I was doing, my fault (too many parties this Christmas). I thought windows worked in another, weird way. Sorry for the questions.. 2014-12-26 13:42 GMT+01:00 Guillermo Ortiz : > I'm trying to understand why it's not working and I typed some println > t

Trying to execute Spark in Yarn

2015-01-08 Thread Guillermo Ortiz
I'm trying to execute Spark from a Hadoop cluster; I have created this script to try it: #!/bin/bash export HADOOP_CONF_DIR=/etc/hadoop/conf SPARK_CLASSPATH="" for lib in `ls /user/local/etc/lib/*.jar` do SPARK_CLASSPATH=$SPARK_CLASSPATH:$lib done /home/spark-1.1.1-bin-hadoop2.4/bin/spark

Re: Trying to execute Spark in Yarn

2015-01-08 Thread Guillermo Ortiz
Shixiong Zhu > > 2015-01-08 19:23 GMT+08:00 Guillermo Ortiz : >> >> I'm trying to execute Spark from a Hadoop cluster; I have created this >> script to try it: >> >> #!/bin/bash >> >> export HADOOP_CONF_DIR=/etc/hadoop/conf >> SPAR

Executing Spark, Error creating path from empty String.

2015-01-08 Thread Guillermo Ortiz
When I try to execute my task with Spark it starts to copy the jars it needs to HDFS and it finally fails; I don't know exactly why. I have checked HDFS and it copies the files, so that part seems to work. I changed the log level to debug but there's nothing else to help. What else does Spark n

Re: Executing Spark, Error creating path from empty String.

2015-01-08 Thread Guillermo Ortiz
I was adding some bad jars, I guess. I deleted all the jars and copied them again and now it works. 2015-01-08 14:15 GMT+01:00 Guillermo Ortiz : > When I try to execute my task with Spark it starts to copy the jars it > needs to HDFS and it finally fails; I don't know exactly why. I hav

Measure performance time in some spark transformations.

2018-05-12 Thread Guillermo Ortiz Fernández
I want to measure how long some different transformations in Spark take, such as map, joinWithCassandraTable and so on. What is the best approach to do it? def time[R](block: => R): R = { val t0 = System.nanoTime() val result = block val t1 = System.nanoTime() println("Elap
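
One caveat with the helper quoted above (completed here as a sketch, since the archive truncates it): transformations such as map or joinWithCassandraTable are lazy, so wrapping only the transformation call measures graph construction, not the actual work. Time an action instead:

    // Sketch of the timing helper; it measures driver-side wall time only.
    def time[R](block: => R): R = {
      val t0 = System.nanoTime()
      val result = block
      val t1 = System.nanoTime()
      println(s"Elapsed time: ${(t1 - t0) / 1e6} ms")
      result
    }

    // rdd.map(f) alone returns immediately; force the work with an action:
    // val n = time { rdd.map(f).count() }

For per-stage timing the Spark UI is usually the better tool, since it breaks the job down per stage and per task.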

Reset the offsets, Kafka 0.10 and Spark

2018-06-07 Thread Guillermo Ortiz Fernández
I'm consuming data from Kafka with createDirectStream and storing the offsets in Kafka ( https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html#kafka-itself ) val stream = KafkaUtils.createDirectStream[String, String]( streamingContext, PreferConsistent, Subscribe[String, S
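
A sketch of the pattern from the linked 0-10 integration guide, with hypothetical broker, topic and group names. Note that once the group has committed offsets to Kafka, auto.offset.reset no longer applies; resetting usually means switching to a fresh group.id (or externally clearing the group's committed offsets):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-group",
      "auto.offset.reset" -> "earliest",  // only used when the group has no committed offsets
      "enable.auto.commit" -> (false: java.lang.Boolean))

    val stream = KafkaUtils.createDirectStream[String, String](
      streamingContext, PreferConsistent,
      Subscribe[String, String](Seq("my-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd ...
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)  // commit back to Kafka
    }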

Refresh broadcast variable when it isn't the value.

2018-08-19 Thread Guillermo Ortiz Fernández
Hello, I want to set data in a broadcast (Map) variable in Spark. Sometimes there is new data, so I have to update/refresh the values, but I'm not sure how I could do this. My idea is to use accumulators as a flag for when a cache error occurs; at that point I could read the data and reload the broa
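
Broadcast variables are read-only once created, so "refreshing" in practice means re-broadcasting from the driver. A minimal sketch of that pattern (RefreshableBroadcast and the load function are illustrative, not a Spark API):

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast
    import scala.reflect.ClassTag

    // Holder that can rebuild the broadcast on demand; call refresh() from
    // driver-side code, e.g. at the start of a batch inside foreachRDD.
    class RefreshableBroadcast[T: ClassTag](sc: SparkContext, load: () => T) {
      @volatile private var bc: Broadcast[T] = sc.broadcast(load())
      def value: T = bc.value
      def refresh(): Unit = synchronized {
        bc.unpersist(blocking = false)  // drop cached copies on the executors
        bc = sc.broadcast(load())       // broadcast the fresh data
      }
    }

An accumulator can signal that a refresh is needed (its value is only readable on the driver), but the re-broadcast itself has to happen on the driver, as above.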

Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
I'm using Spark Streaming 2.0.1 with Kafka 0.10; sometimes I get this exception and Spark dies. I couldn't see any error or problem on the machines. Does anybody know the reason for this error? java.lang.IllegalStateException: This consumer has already been closed. at org.apache.kafka.clients

java.lang.OutOfMemoryError: Java heap space - Spark driver.

2018-08-29 Thread Guillermo Ortiz Fernández
I got this error from the Spark driver; it seems that I should increase the driver memory, although it's 5g (and 4 cores) right now. It seems weird to me because I'm not using Kryo or broadcast in this process, but in the log there are references to Kryo and broadcast. How could I figure out the r

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
I can't... do you think it's possibly a bug in this version, of Spark or of Kafka? On Wed, 29 Aug 2018 at 22:28, Cody Koeninger () wrote: > Are you able to try a recent version of spark? > > On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández > wrot

deploy-mode cluster. FileNotFoundException

2018-09-05 Thread Guillermo Ortiz Fernández
I want to execute my processes in cluster mode. As I don't know where the driver will be executed, I have to make available all the files it needs. I understand that there are two options: copy all the files to all nodes, or copy them to HDFS. My doubt is, if I want to put all the files in HDFS, isn't

Re: deploy-mode cluster. FileNotFoundException

2018-09-05 Thread Guillermo Ortiz Fernández
-2.0.2.jar:lib/kafka-clients-0.10.0.1.jar \ --files conf/${1}Conf.json iris-core-0.0.1-SNAPSHOT.jar conf/${1}Conf.json On Wed, 5 Sep 2018 at 11:11, Guillermo Ortiz Fernández (< guillermo.ortiz.f...@gmail.com>) wrote: > I want to execute my processes in cluster mode. As I don'

Trying to improve performance of the driver.

2018-09-13 Thread Guillermo Ortiz Fernández
I have a process in Spark Streaming which lasts 2 seconds. When I check where the time is spent I see about 0.8s-1s of processing time, although the global time is 2s. This one second is spent in the driver. I reviewed the code which is executed by the driver and I commented out some of this code with th

Putting record in HBase with Spark - error get regions.

2019-05-28 Thread Guillermo Ortiz Fernández
I'm executing a load process into HBase with Spark (around 150M records). At the end of the process there are a lot of failed tasks. I get this error: 19/05/28 11:02:31 ERROR client.AsyncProcess: Failed to get region location org.apache.hadoop.hbase.TableNotFoundException: my_table at org.

Re: Putting record in HBase with Spark - error get regions.

2019-05-28 Thread Guillermo Ortiz Fernández
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 19/05/28 11:11:18 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 369 On Tue, 28 May 2019 at 12:12, Guillermo Ortiz Fernández (< guillermo.ortiz.f...@gmail.com>) wrote

Parse RDD[Seq[String]] to DataFrame with types.

2019-07-17 Thread Guillermo Ortiz Fernández
I'm trying to parse an RDD[Seq[String]] to a DataFrame. Although it's a Seq of Strings, the values could have a more specific type such as Int, Boolean, Double, String and so on. For example, a line could be: "hello", "1", "bye", "1.1" "hello1", "11", "bye1", "2.1" ... The first column is always going to be a String,
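
A sketch of one way to do it, assuming the schema is known up front and that spark is a SparkSession and rdd the RDD[Seq[String]] in question (column names and types here are illustrative):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("c0", StringType),   // first column: always a String
      StructField("c1", IntegerType),
      StructField("c2", StringType),
      StructField("c3", DoubleType)))

    // Cast each string to the type its position expects before building the Row.
    val rows = rdd.map(seq => Row(seq(0), seq(1).toInt, seq(2), seq(3).toDouble))

    val df = spark.createDataFrame(rows, schema)
    df.printSchema()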
