Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

2019-05-03 Thread Olivier Girardot
r vendors? Also, on the kubelet nodes did you notice any pressure on the DNS side? Li On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote: Hi everyone, I have ~300 Spark jobs on Kubernetes (GKE) using the

Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

2019-04-29 Thread Olivier Girardot
sed in the kubernetes packaging) - We can add a simple step to the init container trying to do the DNS resolution and failing after 60s if it did not work. But these steps won't change the fact that the driver will stay stuck thinking we're still in the case of the Initial allocation d

Back to SQL

2018-10-03 Thread Olivier Girardot
Hi everyone, is there any known way to go from a Spark SQL Logical Plan (optimised?) back to a SQL query? Regards, Olivier.

Spark Structured Streaming and compacted topic in Kafka

2017-09-06 Thread Olivier Girardot
Hi everyone, I'm aware of the issue regarding the direct stream 0.10 consumer in Spark and compacted topics (c.f. https://issues.apache.org/jira/browse/SPARK-17147). Is there any chance that spark structured-streaming kafka is compatible with compacted topics? Regards, -- Olivier Girardot

Nested "struct" fonction call creates a compilation error in Spark SQL

2017-06-15 Thread Olivier Girardot
JIRA, or is there a workaround? Regards, -- Olivier Girardot | Associé o.girar...@lateral-thoughts.com

Re: Pyspark 2.1.0 weird behavior with repartition

2017-03-11 Thread Olivier Girardot
errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte. Input file contents: a b c d e f g h i j k l

Re: Nested ifs in sparksql

2017-01-10 Thread Olivier Girardot
n. 2017 at 20:04: I have around 41 levels of nested if/else in Spark SQL. I have programmed it using APIs on DataFrame, but it takes too much time. Is there anything I can do to improve the runtime here?
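
One way to flatten such a ladder of nested ifs is to fold the branches into chained when/otherwise expressions. A minimal sketch, assuming Spark 1.4+ and hypothetical column and rule names:

    import org.apache.spark.sql.functions.{col, when}

    // hypothetical rules: the value of "key" determines the label
    val rules = Seq(1 -> "a", 2 -> "b", 3 -> "c")

    // fold the rules into a single Column: when(...).when(...)....otherwise(...)
    val labelCol = rules.tail
      .foldLeft(when(col("key") === rules.head._1, rules.head._2)) {
        case (acc, (k, v)) => acc.when(col("key") === k, v)
      }
      .otherwise("unknown")

    // df is the input DataFrame carrying the branching logic (assumed)
    val labeled = df.withColumn("label", labelCol)

Catalyst evaluates the chain as a single expression, which tends to be easier to generate and maintain than deeply nested conditions.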

Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-10 Thread Olivier Girardot
ichael Gummelt mgumm...@mesosphere.io wrote: What do you mean your driver has all the dependencies packaged?  What are "all the dependencies"?  Is the distribution you use to launch your driver built with -Pmesos? On Tue, Jan 10, 2017 at 12:18 PM, Olivier Girardot < o.girar...@lateral-thoughts

Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-10 Thread Olivier Girardot
in the final dist of my app… So everything should work in theory. On Tue, Jan 10, 2017 7:22 PM, Michael Gummelt mgumm...@mesosphere.io wrote: Just build with -Pmesos http://spark.apache.org/docs/latest/building-spark.html#building-with-mesos-support On Tue, Jan 10, 2017 at 8:56 AM, Olivier Girardot

Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-10 Thread Olivier Girardot

Re: Spark SQL - Applying transformation on a struct inside an array

2017-01-05 Thread Olivier Girardot
utations, but that's bound to be inefficient, or to generate bytecode using the schema to do the nested "getRow, getSeq…" and re-create the rows once the transformation is applied. I'd like to open an issue regarding that use case because it's not the first or last time it comes up and I still don'

Re: Help in generating unique Id in spark row

2017-01-05 Thread Olivier Girardot
+--------------------+--------------------+
|           alarmUUID|           alarmUUID|
+--------------------+--------------------+
|7d33a516-5532-410...|                null|
|                null|2439d6db-16a2-44b...|
+--------------------+--------------------+
Thanks and Regards, Saurav Sinha
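
For the underlying question of generating a unique id per row, the built-in generator is the usual suggestion. A minimal sketch, assuming Spark 2.x naming (the 1.x equivalent was monotonicallyIncreasingId):

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // ids are unique and monotonically increasing, but not consecutive
    val withId = df.withColumn("uid", monotonically_increasing_id())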

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-29 Thread Olivier Girardot
that are used are all the same across these versions. That would be the thing that makes you need multiple versions of the artifact under multiple classifiers. On Wed, Sep 28, 2016 at 1:16 PM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: ok, don't you think it could be pub

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-28 Thread Olivier Girardot
chance publications of Spark 2.0.0 with different classifiers according to the different versions of Hadoop available? Thanks for your time! Olivier Girardot

Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-22 Thread Olivier Girardot
according to the different versions of Hadoop available? Thanks for your time! Olivier Girardot

Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-16 Thread Olivier Girardot
icks.com wrote: Is what you are looking for a withColumn that supports in-place modification of nested columns? Or is it some other problem? On Wed, Sep 14, 2016 at 11:07 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote: I tried to use the RowEncoder but got stuck along the way: Th

Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-15 Thread Olivier Girardot
y common in data cleaning applications for data in the early stages to have nested lists or sets with inconsistent or incomplete schema information. Fred On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote: Hi everyone, I'm currently trying to create a

Spark SQL - Applying transformation on a struct inside an array

2016-09-13 Thread Olivier Girardot
to find a way to apply a transformation on complex nested datatypes (arrays and structs) on a DataFrame, updating the value itself. Regards, Olivier Girardot
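
In the absence of a built-in nested-update primitive at the time, one workable sketch is to round-trip through a typed Dataset and rebuild the nested values; the case classes and the toUpperCase transformation below are hypothetical, and this assumes Spark 2.x:

    case class Item(name: String, value: Long)
    case class Record(id: String, items: Seq[Item])

    import spark.implicits._

    // map over the typed view and copy-on-write the nested structs
    val transformed = df.as[Record]
      .map(r => r.copy(items = r.items.map(i => i.copy(name = i.name.toUpperCase))))
      .toDF()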

Re: Aggregations with scala pairs

2016-08-18 Thread Olivier Girardot
=> strToExpr(pairExpr._2)(df(pairExpr._1).expr) }.toSeq) }
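
For context, this thread is about the Map-based aggregation API; a short sketch with hypothetical column names:

    // one aggregate expression per (column -> function name) pair
    val summary = df.agg(Map("price" -> "max", "quantity" -> "sum"))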

Re: Spark DF CacheTable method. Will it save data to disk?

2016-08-18 Thread Olivier Girardot

Re: error when running spark from oozie launcher

2016-08-18 Thread Olivier Girardot
but this did not help enough for me.

Re: Spark SQL 1.6.1 issue

2016-08-18 Thread Olivier Girardot

Re: tpcds for spark2.0

2016-07-29 Thread Olivier Girardot
List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

Re: OOM exception during Broadcast

2016-03-08 Thread Olivier Girardot
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) (the same four frames repeat) I'm using Spark 1.5.2. Cluster nodes are Amazon r3.2xlarge. The spark property maximizeResourceAllocation is set to true (executor.memory = 48G according to the Spark UI environment). We're also using Kryo serialization and Yarn is the resource manager. Any ideas as to what might be going wrong and how to debug this? Thanks, Arash

Re: Spark Certification

2016-02-14 Thread Olivier Girardot
t;> To: "user@spark.apache.org" <user@spark.apache.org> >>> Subject: Spark Certification >>> >>> Hello All, >>> >>> I am planning on taking Spark Certification and I was wondering If one >>> has to be well equipped with MLib & GraphX as well or not ? >>> >>> Please advise >>> >>> Thanks >>> >> >> > -- *Olivier Girardot* | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94

Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-14 Thread Olivier Girardot
We know our data is skewed so some of the executors will have large data (~2M RDD objects) to process. I used the following as executorJavaOpts but it doesn't seem to work: -XX:-HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -3 %p' -XX:HeapDumpPath=/opt/cores/spark

Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Olivier Girardot
Hi everyone, considering the new Datasets API, will there be Encoders defined for reading and writing Avro files? Will it be possible to use already generated Avro classes? Regards, -- Olivier Girardot

Re: Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Olivier Girardot
. 2016-01-05 19:01 GMT+01:00 Michael Armbrust <mich...@databricks.com>: You could try with the `Encoders.bean` method. It detects classes that have getters and setters. Please report back! On Tue, Jan 5, 2016 at 9:45 AM, Olivier Girardot <o.girar...@lateral-tho
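
A sketch of the Encoders.bean suggestion, assuming Spark 1.6 and a hypothetical Avro-generated class MyAvroRecord with JavaBean-style getters and setters:

    import org.apache.spark.sql.Encoders

    // avroRdd: RDD[MyAvroRecord] (assumed to exist)
    val avroEncoder = Encoders.bean(classOf[MyAvroRecord])
    val ds = sqlContext.createDataset(avroRdd)(avroEncoder)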

Re: Lookup / Access of master data in spark streaming

2015-10-06 Thread Olivier Girardot
time Regards, 2015-10-05 23:49 GMT+02:00 Tathagata Das <t...@databricks.com>: Yes, when old broadcast objects are not referenced any more in the driver, then associated data in the driver AND the executors will get cleared. On Mon, Oct 5, 2015 at 1:40 PM, Olivier Gir

Re: ClassCastException using DataFrame only when num-executors > 2 ...

2015-08-31 Thread Olivier Girardot
n$8$$anon$1.next(Window.scala:252) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 2015-08-26 11:47 GMT+02:00 Olivier Girardot <ssab...@gmail.com>: Hi everyone, I know this "post title" doesn't seem very logical and I agree, we have a very com

Re: Spark stages very slow to complete

2015-08-25 Thread Olivier Girardot

Re: Classifier for Big Data Mining

2015-07-21 Thread Olivier Girardot
depends on your data and, I guess, the time/performance goals you have for both training/prediction, but for a quick answer: yes :) 2015-07-21 11:22 GMT+02:00 Chintan Bhatt chintanbhatt...@charusat.ac.in: Which classifier can be useful for mining massive datasets in Spark? Decision Tree can be

Re: coalesce on dataFrame

2015-07-01 Thread Olivier Girardot
PySpark or Spark (Scala)? When you use coalesce with anything but a column you must use a literal, like this in PySpark: from pyspark.sql import functions as F F.coalesce(df.a, F.lit(True)) On Wed, Jul 1, 2015 at 12:03, Ewan Leith ewan.le...@realitymine.com wrote: It's in spark 1.4.0, or
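
For reference, a minimal Scala equivalent of the PySpark snippet above, assuming Spark 1.4+:

    import org.apache.spark.sql.functions.{coalesce, lit}

    // replace nulls in column "a" with the literal true
    val result = df.select(coalesce(df("a"), lit(true)))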

Re: Check for null in PySpark DataFrame

2015-07-01 Thread Olivier Girardot
I must admit I've been using the same back-to-SQL strategy for now :p So I'd be glad to have insights into that too. On Tue, Jun 30, 2015 at 23:28, pedro ski.rodrig...@gmail.com wrote: I am trying to find what is the correct way to programmatically check for null values for rows in a
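
The Column API does offer null checks directly, which avoids the back-to-SQL detour; a minimal sketch (shown in Scala here, though PySpark's Column API exposes the same isNull/isNotNull methods):

    import org.apache.spark.sql.functions.col

    // keep only the rows where "name" is null, or not null
    val nulls    = df.filter(col("name").isNull)
    val nonNulls = df.filter(col("name").isNotNull)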

Re: Spark Shell Hive Context and Kerberos ticket

2015-06-26 Thread Olivier Girardot
Nope, I have not, but I'm glad I'm not the only one :p On Fri, Jun 26, 2015 at 07:54, Tao Li litao.bupt...@gmail.com wrote: Hi Olivier, have you fixed this problem now? I still have this fasterxml NoSuchMethodError. 2015-06-18 3:08 GMT+08:00 Olivier Girardot o.girar...@lateral-thoughts.com

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Olivier Girardot
I would pretty much need exactly this kind of feature too. On Fri, Jun 26, 2015 at 21:17, Dave Ariens dari...@blackberry.com wrote: Hi Timothy, because I'm running Spark on Mesos alongside a secured Hadoop cluster, I need to ensure that my tasks running on the slaves perform a Kerberos

Re: GSSException when submitting Spark job in yarn-cluster mode with HiveContext APIs on Kerberos cluster

2015-06-22 Thread Olivier Girardot
Hi, I can't get this to work using CDH 5.4, Spark 1.4.0 in yarn-cluster mode. @andrew did you manage to get it to work with the latest version? On Tue, Apr 21, 2015 at 00:02, Andrew Lee alee...@hotmail.com wrote: Hi Marcelo, exactly what I need to track, thanks for the JIRA pointer. Date:

Re: Spark Shell Hive Context and Kerberos ticket

2015-06-17 Thread Olivier Girardot
classpath would be great. Regards, Olivier. On Wed, Jun 17, 2015 at 11:37, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi everyone, after copying the hive-site.xml from a CDH5 cluster, I can't seem to connect to the hive metastore using spark-shell; here's a part of the stack

Spark Shell Hive Context and Kerberos ticket

2015-06-17 Thread Olivier Girardot
Hi everyone, After copying the hive-site.xml from a CDH5 cluster, I can't seem to connect to the hive metastore using spark-shell, here's a part of the stack trace I get : 15/06/17 04:41:57 ERROR TSaslTransport: SASL negotiation failure javax.security.sasl.SaslException: GSS initiate failed

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Olivier Girardot
You can use it as a broadcast variable, but if it's too large (more than 1 GB, I guess) you may need to share it by joining it, using some kind of key, with the other RDDs. But this is the kind of thing broadcast variables were designed for. Regards, Olivier. On Thu, Jun 4, 2015 at 23:50, dgoldenberg
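
A minimal sketch of the broadcast-variable approach, with hypothetical dictionary contents:

    // ship the dictionary once per executor instead of once per task
    val dict = Map("foo" -> 1, "bar" -> 2)
    val bcDict = sc.broadcast(dict)

    // rdd: RDD[String] (assumed); look each element up in the shared dictionary
    val resolved = rdd.map(word => (word, bcDict.value.getOrElse(word, -1)))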

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
which expose Hive UDAFs as Spark SQL AggregateExpressions, but they are private. On Tue, Jun 2, 2015 at 8:28 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: I've finally come to the same conclusion, but isn't there any way to call these Hive UDAFs from the agg(percentile(key,0.5

Re: Best strategy for Pandas - Spark

2015-06-02 Thread Olivier Girardot
, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi everyone, Let's assume I have a complex workflow of more than 10 datasources as input - 20 computations (some creating intermediary datasets and some merging everything for the final computation) - some taking on average 1 minute

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF df.registerTempTable("table") sqlContext.sql("select percentile(key, 0.5) from table").show() On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi everyone, Is there any way

Re: RandomSplit with Spark-ML and Dataframe

2015-05-19 Thread Olivier Girardot
https://github.com/apache/spark/blob/master/python/pyspark/ml/tuning.py#L214 -Xiangrui On Thu, May 7, 2015 at 8:39 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi, is there any best practice
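
For the underlying question, DataFrame.randomSplit is available from Spark 1.4 on; a short sketch:

    // 80/20 training / cross-validation split, seeded for reproducibility
    val Array(train, cv) = df.randomSplit(Array(0.8, 0.2), 42L)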

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-18 Thread Olivier Girardot
A PR is open: https://github.com/apache/spark/pull/6237 On Fri, May 15, 2015 at 17:55, Olivier Girardot ssab...@gmail.com wrote: yes, please do and send me the link. @rxin I have trouble building master, but the code is done... On Fri, May 15, 2015 at 01:27, Haopu Wang hw...@qilinsoft.com

Re: Why so slow

2015-05-12 Thread Olivier Girardot
Can you post the explain too? On Tue, May 12, 2015 at 12:11, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I have a SQL query on tables containing big Map columns (thousands of keys). I found it to be very slow. select meta['is_bad'] as is_bad, count(*) as count, avg(nvar['var1']) as

Re: value toDF is not a member of RDD object

2015-05-12 Thread Olivier Girardot
` but the error remains. Do I need to import modules other than `import org.apache.spark.sql.{ Row, SQLContext }`? On Tue, May 12, 2015 at 5:56 PM, Olivier Girardot ssab...@gmail.com wrote: toDF is part of Spark SQL, so you need the Spark SQL dependency + import sqlContext.implicits._ to get
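
A minimal sketch of the fix being described, assuming Spark 1.3+ with the spark-sql artifact on the classpath:

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // brings toDF into scope for RDDs of case classes

    val df = sc.parallelize(Seq(Person("a", 1))).toDF()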

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-11 Thread Olivier Girardot
Hi Haopu, actually here `key` is nullable because this is your input's schema:
scala> result.printSchema
root
 |-- key: string (nullable = true)
 |-- SUM(value): long (nullable = true)
scala> df.printSchema
root
 |-- key: string (nullable = true)
 |-- value: long (nullable = false)
I tried it with a

RandomSplit with Spark-ML and Dataframe

2015-05-07 Thread Olivier Girardot
Hi, is there any best practice to do, like in MLlib, a randomSplit of training/cross-validation sets with dataframes and the pipeline API? Regards, Olivier.

Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Olivier Girardot
hdfs://some ip:8029/dataset/*/*.parquet doesn't work for you? On Thu, May 7, 2015 at 03:32, vasuki vax...@gmail.com wrote: Spark 1.3.1 - I have a parquet file on hdfs partitioned by some string, looking like this: /dataset/city=London/data.parquet /dataset/city=NewYork/data.parquet

Re: Re: sparksql running slow while joining 2 tables.

2015-05-05 Thread Olivier Girardot
. Thanks & Best regards! 罗辉 San.Luo - Original mail - From: Olivier Girardot ssab...@gmail.com To: luohui20...@sina.com, user user@spark.apache.org Subject: Re: sparksql running slow while joining 2 tables. Date: May 4, 2015, 20:46 Hi, what is your Spark version? Regards, Olivier. On Mon

Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Olivier Girardot
Hi, what is your Spark version? Regards, Olivier. On Mon, May 4, 2015 at 11:03, luohui20...@sina.com wrote: hi guys, when I am running a sql like select a.name,a.startpoint,a.endpoint, a.piece from db a join sample b on (a.name = b.name) where (b.startpoint a.startpoint + 25); I

Re: AJAX with Apache Spark

2015-05-04 Thread Olivier Girardot
Hi Sergio, you shouldn't architect it this way; rather, update a storage layer with Spark Streaming that your Play app will query. For example a Cassandra table, or Redis, or anything that will be able to answer you in milliseconds, rather than querying the Spark Streaming program. Regards, Olivier.

Re: Drop a column from the DataFrame.

2015-05-03 Thread Olivier Girardot
great, thx. On Sat, May 2, 2015 at 23:58, Ted Yu yuzhih...@gmail.com wrote: This is coming in 1.4.0: https://issues.apache.org/jira/browse/SPARK-7280 On May 2, 2015, at 2:27 PM, Olivier Girardot ssab...@gmail.com wrote: Sounds like a patch for a drop method... On Sat, May 2, 2015 at 21:03

Re: Can I group elements in RDD into different groups and let each group share some elements?

2015-05-02 Thread Olivier Girardot
Did you look at the cogroup transformation or the cartesian transformation? Regards, Olivier. On Sat, May 2, 2015 at 22:01, Franz Chien franzj...@gmail.com wrote: Hi all, can I group elements in an RDD into different groups and let each group share elements? For example, I have 10,000

Re: to split an RDD to multiple ones?

2015-05-02 Thread Olivier Girardot
I guess: val srdd_s1 = srdd.filter(_.startsWith("s1_")).sortBy(identity) val srdd_s2 = srdd.filter(_.startsWith("s2_")).sortBy(identity) val srdd_s3 = srdd.filter(_.startsWith("s3_")).sortBy(identity) Regards, Olivier. On Sat, May 2, 2015 at 22:53, Yifan LI iamyifa...@gmail.com wrote: Hi, I have an RDD srdd

Re: Drop a column from the DataFrame.

2015-05-02 Thread Olivier Girardot
Sounds like a patch for a drop method... On Sat, May 2, 2015 at 21:03, dsgriffin dsgrif...@gmail.com wrote: Just use select() to create a new DataFrame with only the columns you want. Sort of the opposite of what you want -- but you can select all the columns you want minus the one you
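
A sketch of that select-all-but-one workaround, for versions before drop() landed in 1.4 (column name hypothetical):

    import org.apache.spark.sql.functions.col

    // keep every column except the one to drop
    val remaining = df.columns.filter(_ != "unwanted").map(col)
    val dropped = df.select(remaining: _*)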

Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-02 Thread Olivier Girardot
Can you post your code? Otherwise there's not much we can do. Regards, Olivier. On Sat, May 2, 2015 at 21:15, shahab shahab.mok...@gmail.com wrote: Hi, I am using Spark 1.2.0 and I used Kryo serialization, but I get the following exception. java.io.IOException:

Best strategy for Pandas - Spark

2015-04-30 Thread Olivier Girardot
Hi everyone, Let's assume I have a complex workflow of more than 10 datasources as input - 20 computations (some creating intermediary datasets and some merging everything for the final computation) - some taking on average 1 minute to complete and some taking more than 30 minutes. What would be

Dataframe filter based on another Dataframe

2015-04-29 Thread Olivier Girardot
Hi everyone, what is the most efficient way to filter a DataFrame on a column from another DataFrame's column? The best idea I had was to join the two dataframes: val df1: DataFrame val df2: DataFrame df1.join(df2, df1("id") === df2("id"), "inner") But I end up (obviously) with the id column
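
One way to avoid the duplicated id column is a left-semi join, which filters df1 by the keys present in df2 without pulling in any of df2's columns; a sketch assuming Spark 1.4+, where the leftsemi join type is available:

    val filtered = df1.join(df2, df1("id") === df2("id"), "leftsemi")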

Re: Dataframe filter based on another Dataframe

2015-04-29 Thread Olivier Girardot
You mean after joining? Sure, my question was more whether there was any best practice preferred to joining the other dataframe for filtering. Regards, Olivier. On Wed, Apr 29, 2015 at 13:23, Olivier Girardot ssab...@gmail.com wrote: Hi everyone, what is the most efficient way to filter

How to distribute Spark computation recipes

2015-04-27 Thread Olivier Girardot
Hi everyone, I know that any RDD is related to its SparkContext and the associated variables (broadcast, accumulators), but I'm looking for a way to serialize/deserialize full RDD computations? @rxin Spark SQL is, in a way, already doing this but the parsers are private[sql]; is there any way to

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-21 Thread Olivier Girardot
Hi Sourav, can you post your updateFunc as well please? Regards, Olivier. On Tue, Apr 21, 2015 at 12:48, Sourav Chandra sourav.chan...@livestream.com wrote: Hi, we are building a spark streaming application which reads from kafka and does updateStateByKey based on the received message type

Re: Can a map function return null

2015-04-18 Thread Olivier Girardot
You can return an RDD with null values inside and afterwards filter on item != null. In Scala (or even in Java 8) you'd rather use Option/Optional, and in Scala they're directly usable from Spark. Example: sc.parallelize(1 to 1000).flatMap(item => if (item % 2 == 0) Some(item) else None)

Re: Build spark failed with maven

2015-02-14 Thread Olivier Girardot
Hi, I could not reproduce this; what kind of JDK are you using for the zinc server? Regards, Olivier. 2015-02-11 5:08 GMT+01:00 Yi Tian tianyi.asiai...@gmail.com: Hi all, I got an ERROR when building the spark master branch with maven (commit: 2d1e916730492f5d61b97da6c483d3223ca44315)

Re: Opening Spark on IntelliJ IDEA

2014-11-29 Thread Olivier Girardot
Hi, are you using Spark for a Java or Scala project, and can you post your pom file please? Regards, Olivier. 2014-11-27 7:07 GMT+01:00 Taeyun Kim taeyun@innowireless.com: Hi, some information about the error: on the File | Project Structure window, the following error message is

Re: Cannot access data after a join (error: value _1 is not a member of Product with Serializable)

2014-11-19 Thread Olivier Girardot
Can you please post the full source of your code and some sample data to run it on? 2014-11-19 16:23 GMT+01:00 YaoPau jonrgr...@gmail.com: I joined two datasets together, and my resulting logs look like this: (975894369,((72364,20141112T170627,web,MEMPHIS,AR,US,Central),(Male,John,Smith)))

Re: default parallelism bug?

2014-10-21 Thread Olivier Girardot
Hi, what do you mean by pretty small? How big is your file? Regards, Olivier. 2014-10-21 6:01 GMT+02:00 Kevin Jung itsjb.j...@samsung.com: I use Spark 1.1.0 and set these options in spark-defaults.conf: spark.scheduler.mode FAIR spark.cores.max 48 spark.default.parallelism 72 Thanks,

Re: Convert Iterable to RDD

2014-10-21 Thread Olivier Girardot
I don't think this is provided out of the box, but you can use toSeq on your Iterable, and if the Iterable is lazy it should stay that way for the Seq. Then you can use sc.parallelize(myIterable.toSeq) so you'll have your RDD. For the Iterable[Iterable[T]] you can flatten it and then create
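
A sketch of both cases mentioned here (myIterable and nested are hypothetical inputs):

    // Iterable[T] -> RDD[T]
    val rdd = sc.parallelize(myIterable.toSeq)

    // Iterable[Iterable[T]] -> RDD[T]: flatten first, then parallelize
    val flatRdd = sc.parallelize(nested.flatten.toSeq)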

Re: RDD to Multiple Tables SparkSQL

2014-10-21 Thread Olivier Girardot
If you already know your keys, the best way would be to extract one RDD per key (it would not bring the content back to the master and you can take advantage of the caching features) and then execute a registerTempTable per key. But I'm guessing you don't know the keys in advance, and in this
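
A sketch of the known-keys case described above, with hypothetical keys and table naming, assuming the Spark 1.x DataFrame API:

    // one cached, filtered DataFrame registered per key
    val keys = Seq("a", "b", "c")
    keys.foreach { k =>
      val part = df.filter(df("key") === k).cache()
      part.registerTempTable(s"table_$k")
    }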