Creating a DataFrame from scratch

2016-07-22 Thread Jean Georges Perrin
I am trying to build a DataFrame from a list, here is the code: private void start() { SparkConf conf = new SparkConf().setAppName("Data Set from Array").setMaster("local"); SparkContext sc = new SparkContext(conf); SQLContext sqlContext
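A minimal, self-contained sketch of what the finished program looks like in the Spark 1.6 Java API (the column name is borrowed from the reply downthread; the values are illustrative):

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.SparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class DataFrameFromList {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Data Set from Array").setMaster("local");
        SparkContext sc = new SparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Turn a plain Java list into rows (values are made up)
        List<Row> rows = Arrays.asList(RowFactory.create(1), RowFactory.create(2));

        // One nullable integer column, as in the reply below
        StructType schema = DataTypes.createStructType(Collections.singletonList(
            DataTypes.createStructField("int_field", DataTypes.IntegerType, true)));

        DataFrame intDataFrame = sqlContext.createDataFrame(rows, schema);
        intDataFrame.show();
      }
    }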

Re: Creating a DataFrame from scratch

2016-07-22 Thread Jean Georges Perrin
teStructType(Collections.singletonList( > DataTypes.createStructField("int_field", DataTypes.IntegerType, true))); > > DataFrame intDataFrame = sqlContext.createDataFrame(rows, schema); > > > > On Fri, Jul 22, 2016 at 7:53 AM, Jean Georges Perrin <j...@jg

UDF to build a Vector?

2016-07-24 Thread Jean Georges Perrin
Hi, Here is my UDF that should build a VectorUDT. How do I actually make sure the value ends up in the vector? package net.jgp.labs.spark.udf; import org.apache.spark.mllib.linalg.VectorUDT; import org.apache.spark.sql.api.java.UDF1; public class VectorBuilder implements UDF1
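A hypothetical completion of that UDF, as a sketch: wrap the incoming value into a dense mllib Vector, then register the function with VectorUDT as its return type.

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.sql.api.java.UDF1;

    public class VectorBuilder implements UDF1<Double, Vector> {
      @Override
      public Vector call(Double value) throws Exception {
        // Build a one-element dense vector from the input value
        return Vectors.dense(value);
      }
    }

Registration would then look something like sqlContext.udf().register("vectorBuilder", new VectorBuilder(), new VectorUDT());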

java.lang.RuntimeException: Unsupported type: vector

2016-07-24 Thread Jean Georges Perrin
I am trying to build a simple DataFrame that can be used for ML. SparkConf conf = new SparkConf().setAppName("Simple prediction from Text File").setMaster("local"); SparkContext sc = new SparkContext(conf); SQLContext sqlContext = new SQLContext(sc);
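For reference, a sketch of the schema such an ML job typically needs in Spark 1.6, assuming the mllib VectorUDT (the features column is usually what the "Unsupported type: vector" complaint points at):

    import org.apache.spark.mllib.linalg.VectorUDT;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    // A label column plus a features column explicitly typed as a vector
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("label", DataTypes.DoubleType, false),
        DataTypes.createStructField("features", new VectorUDT(), false) });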

MLlib, Java, and DataFrame

2016-07-21 Thread Jean Georges Perrin
? Do you know/have any example like that? Thanks! jg Jean Georges Perrin j...@jgp.net <mailto:j...@jgp.net> / @jgperrin

Java Recipes for Spark

2016-07-29 Thread Jean Georges Perrin
Sorry if this looks like shameless self-promotion, but some of you asked me to say when I'll have my Java recipes for Apache Spark updated. It's done here: http://jgp.net/2016/07/22/spark-java-recipes/ and in the GitHub repo. Enjoy / have a

Re: Java Recipes for Spark

2016-07-31 Thread Jean Georges Perrin
lly some Java love :-) > > Thank you. > > > On 29/07/2016 22:30, Jean Georges Perrin wrote: > Sorry if this looks like shameless self-promotion, but some of you asked me > to say when I'll have my Java recipes for Apache Spark updated. It's done > here: http://jgp

Raleigh, Durham, and around...

2016-08-04 Thread Jean Georges Perrin
Hi, With some friends, we are trying to develop the Apache Spark community in the Triangle area of North Carolina, USA. If you are from there, feel free to join our Slack team: http://oplo.io/td. Danny Siegle has also organized a lot of meetups around the edX courses (see

Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
Hi, I have a Java memory issue with Spark. The same application working on my 8GB Mac crashes on my 72GB Ubuntu server... I have changed things in the conf file, but it looks like Spark does not care, so I wonder if my issues are with the driver or executor. I set: spark.driver.memory

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
I have added: SparkConf conf = new SparkConf().setAppName("app").setExecutorEnv("spark.executor.memory", "8g") .setMaster("spark://10.0.100.120:7077"); but it did not change a thing > On Jul 13, 2016, at

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
er-memory and --executor-memory > because in local mode your settings for executor and driver didn't work the way > you expected. > > > > >> On Jul 14, 2016, at 8:43 AM, Jean Georges Perrin <j...@jgp.net >> <mailto:j...@jgp.net>> wrote: >> >

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
Looks like replacing setExecutorEnv() with set() did the trick... let's see how fast it'll process my 50x 10^15 data points... > On Jul 13, 2016, at 9:24 PM, Jean Georges Perrin <j...@jgp.net> wrote: > > I have added: > > SparkConf conf = new > Spa
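For the record, a sketch of the working configuration (master URL as used elsewhere in the thread): setExecutorEnv() only sets environment variables on the executors, whereas set() sets Spark properties.

    SparkConf conf = new SparkConf()
        .setAppName("app")
        .setMaster("spark://10.0.100.120:7077")
        // a Spark property, not an environment variable
        .set("spark.executor.memory", "8g");

Note that spark.driver.memory is only honored before the driver JVM starts, so it belongs in spark-defaults.conf or on the spark-submit command line (--driver-memory) rather than in code.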

Re: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-20 Thread Jean Georges Perrin
Hey, I love when questions are numbered, it's easier :) 1) Yes (but I am not an expert) 2) You don't control... One of my processes is going to 8k tasks, so... 3) Yes, if you have HT, it doubles. My servers have 12 cores, but with HT, it makes 24. 4) From my understanding: Slave is the logical

spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
Hi, Configuration: standalone cluster, Java, Spark 1.6.2, 24 cores My process uses all the cores of my server (good), but I am trying to limit it so I can actually submit a second job. I tried SparkConf conf = new SparkConf().setAppName("NC Eatery

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
conf = conf.set("spark.executor.cores", "2"); > } > JavaSparkContext javaSparkContext = new JavaSparkContext(conf); > > On Fri, Jul 15, 2016 at 2:31 PM, Jean Georges Perrin <j...@jgp.net > <mailto:j...@jgp.net>> wro
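Pieced together, a sketch of the approach discussed in this thread (app name from the original post; the master URL and the spark.cores.max value are assumptions):

    SparkConf conf = new SparkConf()
        .setAppName("NC Eatery")
        .setMaster("spark://10.0.100.120:7077")
        // cores each executor may use for this application
        .set("spark.executor.cores", "2")
        // total cores this application may claim on a standalone cluster
        .set("spark.cores.max", "12");
    JavaSparkContext javaSparkContext = new JavaSparkContext(conf);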

Re: Spark streaming takes longer time to read json into dataframes

2016-07-15 Thread Jean Georges Perrin
Do you need it on disk or just push it to memory? Can you try to increase memory or # of cores (I know it sounds basic) > On Jul 15, 2016, at 11:43 PM, Diwakar Dhanuskodi > wrote: > > Hello, > > I have 400K json messages pulled from Kafka into spark streaming

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
n.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> >> >> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> >> >> Disclaimer: Use it a

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
arise from > relying on this email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such > loss, damage or destruction. > > > On 15 July 2016 at 13:48, Jean Georges Perrin <j...@jgp.net > &l

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
at your own risk. Any and all responsibility for any loss, > damage or destruction of data or any other property which may arise from > relying on this email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such

log traces

2016-07-04 Thread Jean Georges Perrin
Hi, I have installed Apache Spark via Maven. How can I control the volume of logs it displays on my system? I tried different locations for a log4j.properties, but none seems to work for me. Thanks for your help... - To unsubscribe
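One approach that works when Spark is embedded in a Java application, as a sketch (Spark 1.x ships log4j 1.2): raise the log level programmatically before creating the context.

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    // Silence Spark's verbose INFO output; run this before creating the context
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN);
    Logger.getLogger("akka").setLevel(Level.WARN);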

Re: log traces

2016-07-04 Thread Jean Georges Perrin
nched by using the > $SPARK_HOME/bin/spark-submit script. It might be helpful to provide us more > details on how you are running your application. > > Regards, > Luis > > On 4 July 2016 at 16:57, Jean Georges Perrin <j...@jgp.net > <mailto:j...@jgp.net>> wrot

Re: log traces

2016-07-04 Thread Jean Georges Perrin
imer: Use it at your own risk. Any and all responsibility for any loss, > damage or destruction of data or any other property which may arise from > relying on this email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from

Re: log traces

2016-07-04 Thread Jean Georges Perrin
k.SparkContext > > <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext> > > Hope this helps, > Anupam > > > > > On Mon, Jul 4, 2016 at 2:18 PM, Jean Georges Perrin <j...@jgp.net > <mailto:j...@jgp.ne

Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Jean Georges Perrin
What are you running it on right now? > On Jul 6, 2016, at 3:25 PM, dabuki wrote: > > I was thinking about to replace a legacy batch job with Spark, but I'm not > sure if Spark is suited for this use case. Before I start the proof of > concept, I wanted to ask for opinions. >

Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Jean Georges Perrin
and and almost for free using a > cloud infrastructure. > > > > > On 6. Juli 2016 um 21:29:53 MESZ, Jean Georges Perrin <j...@jgp.net> wrote: >> What are you doing it on right now? >> >> > On Jul 6, 2016, at 3:25 PM, dabuki wrote: >> > >

Re: Processing json document

2016-07-07 Thread Jean Georges Perrin
Do you want id1, id2, id3 to be processed similarly? The Java code I use is: df = df.withColumn(K.NAME, df.col("fields.premise_name")); the original structure is something like {"fields":{"premise_name":"ccc"}}. Hope it helps > On Jul 7, 2016, at 1:48 AM, Lan Jiang
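The same snippet expanded into a runnable fragment (the file name and sqlContext are assumptions; the K.NAME constant is replaced by a literal column name):

    // Original structure: {"fields":{"premise_name":"ccc"}}
    DataFrame df = sqlContext.read().json("premises.json");
    // Promote the nested field to a top-level column
    df = df.withColumn("premise_name", df.col("fields.premise_name"));
    df.show();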

Re: "client / server" config

2016-07-10 Thread Jean Georges Perrin
s looking for local file on driver (ie your mac) @ > location: file:/Users/jgp/Documents/Data/restaurants-data.json > >> On Mon, Jul 11, 2016 at 12:33 PM, Jean Georges Perrin <j...@jgp.net> wrote: >> >> I have my dev environment on my Mac. I have a dev Spark server on

Re: Network issue on deployment

2016-07-10 Thread Jean Georges Perrin
WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address and still connection refused... but no luck > On Jul 10, 2016, at 1:26 PM, Jean Georges Perrin <j...@jgp.net> wrote: > > Hi, > > So far I have been using Spark "embedded" in my app. Now, I'd l

Re: Network issue on deployment

2016-07-10 Thread Jean Georges Perrin
It appears I had issues in my /etc/hosts... it seems OK now > On Jul 10, 2016, at 2:13 PM, Jean Georges Perrin <j...@jgp.net> wrote: > > I tested that: > > I set: > > _JAVA_OPTIONS=-Djava.net.preferIPv4Stack=true > SPARK_LOCAL_IP=10.0.100.120 > I still hav

"client / server" config

2016-07-10 Thread Jean Georges Perrin
I have my dev environment on my Mac. I have a dev Spark server on a freshly installed physical Ubuntu box. I had some connection issues, but it is now all fine. In my code, running on the Mac, I have: 1 SparkConf conf = new

Network issue on deployment

2016-07-10 Thread Jean Georges Perrin
Hi, So far I have been using Spark "embedded" in my app. Now, I'd like to run it on a dedicated server. Here is where I am: - fresh Ubuntu 16, server name is mocha / IP 10.0.100.120, installed Scala 2.10, installed Spark 1.6.2, recompiled - Pi test works - UI on port 8080 works Log says: Spark

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
ntroducing-dataframes-in-spark-for-large-scale-data-science.html> > > Cheers > Jules > > Sent from my iPhone > Pardon the dumb thumb typos :) > > > > Sent from my iPhone > Pardon the dumb thumb typos :) > On Jul 21, 2016, at 8:41 PM, Jean Georges Perr

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
RegressionWithElasticNetExample.java> > > This example uses a Dataset, which is type equivalent to a DataFrame. > > > On Thu, Jul 21, 2016 at 8:41 PM, Jean Georges Perrin <j...@jgp.net > <mailto:j...@jgp.net>> wrote: > Hi, > > I am looking for some

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
process is a little bit cumbersome > > > 1. go from DataFrame to Rdd of Rdd of [LabeledVectorPoint] > 2. run your ML model > > I'd suggest you stick to DataFrame + ml package :) > > hth > > > > On Fri, Jul 22, 2016 at 4:41 AM, Jean Georges Perrin <j...@jgp.

Spark 2 + Java + UDF + unknown return type...

2017-02-02 Thread Jean Georges Perrin
Hi fellow Sparkans, I am building a UDF (in Java) that can return various data types, basically the signature of the function itself is: public Object call(String a, Object b, String c, Object d, String e) throws Exception When I register my function, I need to provide a type, e.g.:
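A sketch of one workaround, not necessarily the list's answer: since Spark wants one concrete DataType per registration, register the same implementation under several names (spark is an existing SparkSession; the class name is hypothetical).

    import org.apache.spark.sql.api.java.UDF5;
    import org.apache.spark.sql.types.DataTypes;

    public class MultiTypeUdf implements UDF5<String, Object, String, Object, String, Object> {
      @Override
      public Object call(String a, Object b, String c, Object d, String e) throws Exception {
        return b; // illustrative body only
      }
    }

    // one registration per return type actually used in queries; at runtime,
    // each registered name must really return its declared type
    spark.udf().register("f_as_string", new MultiTypeUdf(), DataTypes.StringType);
    spark.udf().register("f_as_int", new MultiTypeUdf(), DataTypes.IntegerType);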

Re: eager? in dataframe's checkpoint

2017-02-02 Thread Jean Georges Perrin
age there, and the next operations all depend on the checkpointed > DataFrame. If you don't checkpoint, you continue to build the lineage, > therefore while that lineage is being resolved, you may hit the > StackOverflowException. > > HTH, > Burak > > On Thu,

eager? in dataframe's checkpoint

2017-01-26 Thread Jean Georges Perrin
Hey Sparkers, Trying to understand the Dataframe's checkpoint (not in the context of streaming) https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#checkpoint(boolean)
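A minimal sketch of the API in question (Spark 2.1+; the directory and input are hypothetical). Passing eager = true materializes the checkpoint immediately instead of waiting for the next action:

    SparkSession spark = SparkSession.builder().appName("checkpoint").master("local").getOrCreate();
    spark.sparkContext().setCheckpointDir("/tmp/checkpoints");

    Dataset<Row> df = spark.read().json("events.json");
    Dataset<Row> trimmed = df.checkpoint(true); // eager: lineage is truncated right here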

Re: JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
lar multi-line JSON file will most often fail. > > > > On Mon, Oct 10, 2016 at 9:57 AM, Jean Georges Perrin <j...@jgp.net > <mailto:j...@jgp.net>> wrote: > Hi folks, > > I am trying to parse JSON arrays and it’s getting a little crazy (for me at > leas

JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
Hi folks, I am trying to parse JSON arrays and it’s getting a little crazy (for me at least)… 1) If my JSON is: {"vals":[100,500,600,700,800,200,900,300]} I get:

    +--------------------+
    |                vals|
    +--------------------+
    |[100, 500, 600, 7...|
    +--------------------+

    root
     |-- vals:
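A sketch of getting the individual values out of such an array column (Spark 2 style; the input path and session are assumptions):

    import static org.apache.spark.sql.functions.explode;

    Dataset<Row> df = spark.read().json("vals.json");
    // explode() emits one output row per array element
    Dataset<Row> values = df.withColumn("val", explode(df.col("vals")));
    values.show();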

Re: JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
to see what it does not like? The JSON parser has been pretty good to me until recently. > On Oct 10, 2016, at 12:59 PM, Sudhanshu Janghel <> wrote: > > As far as my experience goes spark can parse only certain types of Json > correctly not all and has strict Parsing rul

Re: Inserting New Primary Keys

2016-10-10 Thread Jean Georges Perrin
Is there only one process adding rows? Because this seems a little risky if you have multiple threads doing that… > On Oct 8, 2016, at 1:43 PM, Benjamin Kim wrote: > > Mich, > > After much searching, I found and am trying to use “SELECT ROW_NUMBER() > OVER() + b.id_max

Custom Spark data source in Java

2017-03-22 Thread Jean Georges Perrin
Hi, I am trying to build a custom file data source for Spark, in Java. I have found numerous examples in Scala (including the CSV and XML data sources from Databricks), but I cannot bring Scala into this project. We also already have the parser itself written in Java; I just need to build the

Re: Custom Spark data source in Java

2017-03-22 Thread Jean Georges Perrin
ly in Java a data source that returns always a row with one > column containing a String. I fear in any case you need to import some Scala > classes in Java and/or have some wrappers in Scala. > If you use fileformat that you need at least spark 2.0. > > On 22 Mar 2017, at 20:27

Re: checkpoint

2017-04-14 Thread Jean Georges Perrin
Sorry - can't help with PySpark, but here is some Java code which you may be able to transform to Python? http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/ jg > On Apr 14, 2017, at 07:18, issues solution wrote: > > Hi > somone can give me an

Quick one on evaluation

2017-08-02 Thread Jean Georges Perrin
Hi Sparkians, I understand the lazy evaluation mechanism with transformations and actions. My question is simpler: 1) are show() and/or printSchema() actions? I would assume so... And an optional question: 2) is there a way to know if there are transformations "pending"? Thanks! jg
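For what it's worth, a two-line sketch of the distinction (df is any existing Dataset): show() triggers evaluation, printSchema() does not.

    df.printSchema(); // analysis only: the schema is known without running a job
    df.show();        // an action: runs a job, visible in the Spark UI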

Re: Is there a way to run Spark SQL through REST?

2017-07-22 Thread Jean Georges Perrin
There's Livy, but it's pretty resource intensive. I know it's not helpful, but my company has developed its own and I am trying to open-source it. Looks like there are quite a few companies who had the need and custom-built one. jg > On Jul 22, 2017, at 04:01, kant kodali

Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-11 Thread Jean Georges Perrin
Awesome! Congrats! Can't wait!! jg > On Jul 11, 2017, at 18:48, Michael Armbrust wrote: > > Hi all, > > Apache Spark 2.2.0 is the third release of the Spark 2.x line. This release > removes the experimental tag from Structured Streaming. In addition, this > release

Re: Quick one on evaluation

2017-08-02 Thread Jean Georges Perrin
> > What do you mean by pending? You can see the status of the job in the UI. > >> On 2. Aug 2017, at 14:16, Jean Georges Perrin <j...@jgp.net> wrote: >> >> Hi Sparkians, >> >> I understand the lazy evaluation mechanism with transforma

Re: SPARK Issue in Standalone cluster

2017-08-04 Thread Jean Georges Perrin
I use CIFS and it works reasonably well, is easily cross-platform, and is well documented... > On Aug 4, 2017, at 6:50 AM, Steve Loughran wrote: > > >> On 3 Aug 2017, at 19:59, Marco Mistroni wrote: >> >> Hello >> my 2 cents here, hope it helps >> If

Re: Quick one on evaluation

2017-08-04 Thread Jean Georges Perrin
<daniel.dara...@lynxanalytics.com> > wrote: > > > On Wed, Aug 2, 2017 at 2:16 PM, Jean Georges Perrin <j...@jgp.net > <mailto:j...@jgp.net>> wrote: > Hi Sparkians, > > I understand the lazy evaluation mechanism with transformations and actions. > My question

Spark 2 | Java | Dataset

2017-08-17 Thread Jean Georges Perrin
Hey, I was wondering if it would make sense to have a Dataset of something else than Row? Does anyone have an example (in Java) or use case? My use case would be to use Spark on existing objects we have and benefit from the distributed processing on those objects. jg
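A minimal sketch of a Dataset of a domain object in Java, using a bean encoder (the Book class is hypothetical; spark is an existing SparkSession):

    import java.io.Serializable;
    import java.util.Arrays;
    import org.apache.spark.api.java.function.FilterFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;

    public class Book implements Serializable {
      private String title;
      public String getTitle() { return title; }
      public void setTitle(String title) { this.title = title; }
    }

    Dataset<Book> books = spark.createDataset(Arrays.asList(new Book()), Encoders.bean(Book.class));
    // transformations stay typed on the bean
    Dataset<Book> titled = books.filter((FilterFunction<Book>) b -> b.getTitle() != null);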

Re: org.apache.spark.sql.types missing from spark-sql_2.11-2.1.1.jar?

2017-06-20 Thread Jean Georges Perrin
After investigation, it looks like my Spark 2.1.1 jars got corrupted during download - all good now... ;) > On Jun 20, 2017, at 4:14 PM, Jean Georges Perrin <j...@jgp.net> wrote: > > Hey all, > > i was giving a run to 2.1.1 and got an error on one of my test p

org.apache.spark.sql.types missing from spark-sql_2.11-2.1.1.jar?

2017-06-20 Thread Jean Georges Perrin
Hey all, i was giving a run to 2.1.1 and got an error on one of my test program: package net.jgp.labs.spark.l000_ingestion; import java.util.Arrays; import java.util.List; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; import

Re: "Sharing" dataframes...

2017-06-20 Thread Jean Georges Perrin
ver#persistent-context-mode---faster--required-for-related-jobs> > https://github.com/cloudera/livy#post-sessions > <https://github.com/cloudera/livy#post-sessions> > > On Tue, Jun 20, 2017 at 1:46 PM, Jean Georges Perrin <j...@jgp.net > <mailto:j...@jgp.net>

Re: "Sharing" dataframes...

2017-06-21 Thread Jean Georges Perrin
ting jobs is in general a bad idea, since it breaks the DAG > and thus prevents some potential push-down optimizations. > > On Tue, Jun 20, 2017 at 10:17 PM, Jean Georges Perrin <j...@jgp.net > <mailto:j...@jgp.net>> wrote: > Thanks Vadim & Jörn... I will look into thos

"Sharing" dataframes...

2017-06-20 Thread Jean Georges Perrin
Hey, Here is my need: program A does something on a set of data and produces results, program B does the same on another set, and finally, program C combines the data of A and B. Of course, the easy way is to dump everything to disk after A and B are done, but I wanted to avoid this. I was thinking of
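Besides the Livy/job-server route suggested downthread, a sketch of the in-application alternative, assuming A, B, and C can run inside the same Spark application (Spark 2.1+): publish the intermediate results as global temp views instead of dumping them to disk.

    // in A and B (sessions sharing the same SparkContext):
    dfA.createOrReplaceGlobalTempView("result_a");
    dfB.createOrReplaceGlobalTempView("result_b");

    // in C:
    Dataset<Row> combined = spark.sql(
        "SELECT * FROM global_temp.result_a UNION ALL SELECT * FROM global_temp.result_b");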

Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-07 Thread Jean Georges Perrin
Do you have some other security in place like Kerberos or impersonation? It may affect your access. jg > On Jun 7, 2017, at 02:15, Patrik Medvedev wrote: > > Hello guys, > > I need to execute hive queries on remote hive server from spark, but for some > reasons

Re: Nested RDD operation

2017-09-15 Thread Jean Georges Perrin
Hey Daniel, not sure this will help, but... I had a similar need where I wanted the content of a dataframe to become a "cell" or a row in the parent dataframe. I grouped the child dataframe, then collected it as a list in the parent dataframe after a join operation. As I said, not sure it
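A sketch of that pattern, with hypothetical column names:

    import static org.apache.spark.sql.functions.collect_list;

    // Collapse the child rows into a list per key, then attach them to the parent
    Dataset<Row> grouped = childDf.groupBy("parent_id")
        .agg(collect_list("value").alias("values"));
    Dataset<Row> joined = parentDf.join(grouped, "parent_id");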

[Timer-0:WARN] Logging$class: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2017-09-18 Thread Jean Georges Perrin
Hi, I am trying to connect to a new cluster I just set up. And I get... [Timer-0:WARN] Logging$class: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources I must have forgotten something really super obvious. My

Re: Nested RDD operation

2017-09-19 Thread Jean Georges Perrin
Context.parallelize(Seq(tool.toString)).toDF("event_name")).select("eventIndex").first().getDouble(0)) > }) > }) > > Wondering if there is any better/faster way to do this ? > > Thanks. > > > > On Fri, 15 Sep 2017 at 13:

Re: Spark code to get select firelds from ES

2017-09-20 Thread Jean Georges Perrin
Same issue with RDBMS ingestion (I think). I solved it with views. Can you do views on ES? jg > On Sep 20, 2017, at 09:22, Kedarnath Dixit > wrote: > > Hi, > > I want to get only select fields from ES using Spark ES connector. > > I have done some code

Re: How to convert Row to JSON in Java?

2017-09-10 Thread Jean Georges Perrin
Hey, I have a few examples https://github.com/jgperrin/net.jgp.labs.spark. I recently worked on such problems, so there's definitely a solution there or I'll be happy to write one for you. Look in l250 map... jg > On Sep 10, 2017, at 20:51, ayan guha wrote: > >

Re: How to convert Row to JSON in Java?

2017-09-10 Thread Jean Georges Perrin
Sorry - more likely l700 save. jg > On Sep 10, 2017, at 20:56, Jean Georges Perrin <j...@jgp.net> wrote: > > Hey, > > I have a few examples https://github.com/jgperrin/net.jgp.labs.spark. I > recently worked on such problems, so there's definitely a solution t
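For reference, the shortest route in the Dataset API: toJSON() maps each Row to a JSON string.

    Dataset<String> json = df.toJSON();
    json.show(5, false);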

Re: Spark 2.0.0 and Hive metastore

2017-08-29 Thread Jean Georges Perrin
Sorry if my comment is not helping, but... why do you need Hive? Can't you save your aggregation using parquet for example? jg > On Aug 29, 2017, at 08:34, Andrés Ivaldi wrote: > > Hello, I'm using Spark API and with Hive support, I dont have a Hive > instance, just

Re: Quick one... AWS SDK version?

2017-10-07 Thread Jean Georges Perrin
Hey Marco, I am actually reading from S3 and I use 2.7.3, but I inherited the project and it uses some AWS APIs from the Amazon SDK, whose version is like from yesterday :), so it’s confusing, and AMZ is changing its version like crazy, so it’s a little difficult to follow. Right now I went back to

Re: Hi all,

2017-11-03 Thread Jean Georges Perrin
Hi Oren, Why don’t you want to use a GroupBy? You can cache or checkpoint the result and use it in your process, keeping everything in Spark and avoiding save/ingestion... > On Oct 31, 2017, at 08:17, אורן שמון <oren.sha...@gmail.com> wrote: > > I have 2 spark jobs one is pre-process and

Re: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-11-03 Thread Jean Georges Perrin
Write a UDF? > On Oct 31, 2017, at 11:48, Aakash Basu > wrote: > > Hey all, > > Any help in the below please? > > Thanks, > Aakash. > > > -- Forwarded message -- > From: Aakash Basu

Re: How to get the data url

2017-11-03 Thread Jean Georges Perrin
I am a little confused by your question… Are you trying to ingest a file from S3? If so… look for net.jgp.labs.spark on GitHub and look for net.jgp.labs.spark.l000_ingestion.l001_csv_in_progress.S3CsvToDataset You can modify the file as the keys are yours… If you want to download first: look

Re: learning Spark

2017-12-05 Thread Jean Georges Perrin
When you pick a book, make sure it covers the version of Spark you want to deploy. There are a lot of books out there that focus a lot on Spark 1.x. Spark 2.x generalizes the dataframe API, introduces Tungsten, etc. All might not be relevant to a pure “sys admin” learning, but it is good to

A code example of Catalyst optimization

2018-06-04 Thread Jean Georges Perrin
Hi there, I am looking for an example of optimization through Catalyst, that you can demonstrate via code. Typically, you load some data in a dataframe, you do something, you do the opposite operation, and, when you collect, it’s super fast because nothing really happened to the data.
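A sketch of a demo along those lines (input path hypothetical): repeatedly add and drop a column, an operation followed by its inverse, and let Catalyst collapse the projections away.

    import static org.apache.spark.sql.functions.lit;

    Dataset<Row> df = spark.read().csv("data.csv");
    for (int i = 0; i < 200; i++) {
      df = df.withColumn("tmp", lit(i)).drop("tmp"); // operation + its opposite
    }
    df.explain(true); // optimized plan: the tmp column never appears
    df.count();       // fast, because nothing really happened to the data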

Re: submitting dependencies

2018-06-27 Thread Jean Georges Perrin
Have you tried to build an uber jar to bundle all your classes together? > On Jun 27, 2018, at 01:27, amin mohebbi > wrote: > > Could you please help me to understand how I should submit my spark > application ? > > I have used this connector

Re: What is the equivalent of forearchRDD in DataFrames?

2017-10-26 Thread Jean Georges Perrin
Just hints: Repartition into 10? Get the RDD from the dataframe? What about a foreach over the rows, sending every 100? (I just did that, actually) jg > On Oct 26, 2017, at 13:37, Noorul Islam Kamal Malmiyoda > wrote: > > Hi all, > > I have a Dataframe with 1000 records. I want to
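A sketch of the "send every 100" idea on a DataFrame (send() is a hypothetical delivery method; the cast disambiguates the Java overload):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.function.ForeachPartitionFunction;

    df.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
      List<Row> batch = new ArrayList<>();
      while (rows.hasNext()) {
        batch.add(rows.next());
        if (batch.size() == 100) { send(batch); batch.clear(); }
      }
      if (!batch.isEmpty()) send(batch); // flush the remainder
    });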

Re: Anyone knows how to build and spark on jdk9?

2017-10-27 Thread Jean Georges Perrin
May I ask what the use case is? It is a very interesting question, but I would be concerned about going further than a proof of concept. A lot of the enterprises I see and visit are barely on Java 8, so starting to talk about JDK 9 might be slight overkill, but if you have a good story, I’m

Storage at node or executor level

2017-12-22 Thread Jean Georges Perrin
Hi all, This is more of a general architecture question; I have my idea, but wanted to confirm or refute it... When your executor is accessing data, where is it stored: at the executor level or at the worker level? jg - To

Re: Custom Data Source for getting data from Rest based services

2017-12-24 Thread Jean Georges Perrin
If you need Java code, you can have a look @: https://github.com/jgperrin/net.jgp.labs.spark.datasources and: https://databricks.com/session/extending-apache-sparks-ingestion-building-your-own-java-data-source

Re: S3 token times out during data frame "write.csv"

2018-01-25 Thread Jean Georges Perrin
Are you writing from an Amazon instance or from an on-premises install to S3? How many partitions are you writing from? Maybe you can try to “play” with repartitioning to see how it behaves? > On Jan 23, 2018, at 17:09, Vasyl Harasymiv wrote: > > It is about 400
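A sketch of the repartitioning experiment suggested above (the bucket and partition count are placeholders):

    df.repartition(200).write().csv("s3a://some-bucket/out");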

Re: Type Casting Error in Spark Data Frame

2018-01-29 Thread Jean Georges Perrin
You can try to create new columns with the nested values. > On Jan 29, 2018, at 15:26, Arnav kumar wrote: > > Hello Experts, > > I would need your advice in resolving the below issue when I am trying to > retrieving the data from a dataframe. > > Can you please let me

Schema - DataTypes.NullType

2018-01-29 Thread Jean Georges Perrin
Hi Sparkians, Can someone tell me what is the purpose of DataTypes.NullType, especially as you are building a schema? Thanks jg - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: is there a way to create new column with timeuuid using raw spark sql ?

2018-02-01 Thread Jean Georges Perrin
Sure, use withColumn()... jg > On Feb 1, 2018, at 05:50, kant kodali wrote: > > Hi All, > > Is there any way to create a new timeuuid column of a existing dataframe > using raw sql? you can assume that there is a timeuuid udf function if that > helps. > > Thanks!
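Sketched out, assuming a UDF already registered under the name timeuuid, as the question states:

    import static org.apache.spark.sql.functions.callUDF;

    // DataFrame API route
    df = df.withColumn("time_uuid", callUDF("timeuuid"));

    // raw SQL route, after exposing df as a temp view
    df.createOrReplaceTempView("t");
    Dataset<Row> withUuid = spark.sql("SELECT *, timeuuid() AS time_uuid FROM t");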

Re: Schema - DataTypes.NullType

2018-02-04 Thread Jean Georges Perrin
Any takers on this one? ;) > On Jan 29, 2018, at 16:05, Jean Georges Perrin <j...@jgp.net> wrote: > > Hi Sparkians, > > Can someone tell me what is the purpose of DataTypes.NullType, especially as > you are building a sch

Re: Schema - DataTypes.NullType

2018-02-12 Thread Jean Georges Perrin
- id: long (nullable = true) > |-- val: null (nullable = true) > > > Nicholas Szandor Hakobian, Ph.D. > Staff Data Scientist > Rally Health > nicholas.hakob...@rallyhealth.com <mailto:nicholas.hakob...@rallyhealth.com> > > > On Sun, Feb 11, 2018 at 5:40 AM,

Re: Schema - DataTypes.NullType

2018-02-11 Thread Jean Georges Perrin
What is the purpose of DataTypes.NullType, especially as you are building a schema? Has anyone used it or seen it as part of a schema auto-generation? (If I keep asking long enough, I may get an answer, no? :) ) > On Feb 4, 2018, at 13:15, Jean Georges Perrin <j...@jgp.net> wrote:
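One place it shows up, as a sketch matching the schema output quoted in the reply: a column built from a bare null literal is typed NullType.

    import static org.apache.spark.sql.functions.lit;

    Dataset<Row> withNull = df.withColumn("val", lit(null));
    withNull.printSchema(); // |-- val: null (nullable = true)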

Re: How to merge multiple rows

2018-08-22 Thread Jean Georges Perrin
How do you do it now? You could use a withColumn(“newDetails”, ) jg > On Aug 22, 2018, at 16:04, msbreuer wrote: > > A dataframe with following contents is given: > > ID PART DETAILS > 11 A1 > 12 A2 > 13 A3 > 21 B1 > 31 C1 > > Target format should be as following: >
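A sketch of one way to merge the detail rows per ID (column names taken from the post; note that collect_list does not guarantee ordering):

    import static org.apache.spark.sql.functions.collect_list;
    import static org.apache.spark.sql.functions.concat_ws;

    Dataset<Row> merged = df.groupBy("ID")
        .agg(concat_ws(",", collect_list("DETAILS")).alias("DETAILS"));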

Re: spark sql data skew

2018-07-13 Thread Jean Georges Perrin
Just thinking out loud… Repartition by key? Create a composite key based on company and userId? How big is your dataset? > On Jul 13, 2018, at 06:20, 崔苗 wrote: > > Hi, > when I want to count(distinct userId) by company,I met the data skew and the > task takes too long time,how to count
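A sketch of the composite-key idea: de-duplicate on (company, userId) first, so a plain count replaces count(distinct) and the users of one big company no longer pile onto a single task.

    Dataset<Row> counts = df.select("company", "userId").distinct()
        .groupBy("company").count();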

Re: What's the best way to have Spark a service?

2018-03-15 Thread Jean Georges Perrin
Hi David, I ended up building my own. Livy sounded great on paper but was heavy to manipulate. I found out about Jobserver too late. We did not find it too complicated to build ours, with a small Spring Boot app that was holding the session (we did not need more than one session). jg > On Mar 15,

Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-29 Thread Jean Georges Perrin
did not see anything, but curious if you find something. I think one of the big benefits of using Java, for data engineering in the context of Spark, is that you do not have to train a lot of your team in Scala. Now if you want to do data science, Java is probably not the best tool yet... >

Re: Is there any Spark source in Java

2018-11-03 Thread Jean Georges Perrin
This one is very close to my heart :) Look at: https://github.com/jgperrin/net.jgp.labs.spark And if the examples are too weird, have a look at: http://jgp.net/book, published at Manning. Feedback appreciated! jg > On Nov 3, 2018, at 12:30, Jeyhun Karimov wrote: > > Hi Soheil,

Where is the DAG stored before catalyst gets it?

2018-10-04 Thread Jean Georges Perrin
Hi, I am assuming it is still in the master and when Catalyst is finished it sends the tasks to the workers. Correct? tia jg - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Triangle Apache Spark Meetup

2018-10-10 Thread Jean Georges Perrin
Hi, Just a small plug for the Triangle Apache Spark Meetup (TASM), which covers Raleigh, Durham, and Chapel Hill in North Carolina, USA. The group started back in July 2015. More details here: https://www.meetup.com/Triangle-Apache-Spark-Meetup/ .

Re: how to generate a larg dataset paralleled

2018-12-13 Thread Jean Georges Perrin
Do you just want to generate some data in Spark, or ingest a large dataset outside of Spark? What’s the ultimate goal you’re pursuing? jg > On Dec 13, 2018, at 21:38, lk_spark wrote: > > hi,all: > I want't to generate some test data , which contained about one hundred > million rows . >
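If the goal is test data generated inside Spark, a sketch: range() creates the rows in parallel across the cluster (the payload column is illustrative).

    import static org.apache.spark.sql.functions.rand;

    // one hundred million rows, generated in parallel
    Dataset<Row> df = spark.range(100000000L).withColumn("payload", rand());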

Re: OData compliant API for Spark

2018-12-05 Thread Jean Georges Perrin
I was involved in a project like that and we decided to deploy the data in https://ckan.org/. We used Spark for the data pipeline and transformation. Hih. jg > On Dec 4, 2018, at 21:14, Affan Syed wrote: > > All, > > We have been thinking about exposing our platform for analytics an OData

Multiple sessions in one application?

2018-12-19 Thread Jean Georges Perrin
Hi there, I was curious about what use cases would drive the use of newSession() (as in https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html#newSession-- ). I
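For context, a sketch of what newSession() provides, to my understanding: the same SparkContext and cached data, but separate SQL configuration, temp views, and registered functions.

    SparkSession spark2 = spark.newSession();
    spark2.conf().set("spark.sql.shuffle.partitions", "4"); // does not affect the original session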

Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-15 Thread Jean-Georges Perrin
class to extend this class to return a JavaDStream. > This is my real problem. > Tell me if the above description is not clear, because English is > not my native language. > > Thanks in advance > Gary > > On Tue, May 14, 2019 at 11:06 PM Jean Georges Perrin &l

Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-14 Thread Jean Georges Perrin
There is a little more to it than the list you specified. Nevertheless, some data types are not directly compatible between Scala and Java and require conversion, so it’s good not to pollute your code with plenty of conversions and to focus on using the straight API. I don’t remember from the top

Checkpointing and accessing the checkpoint data

2019-06-27 Thread Jean-Georges Perrin
of any performance comparison between the two? On small datasets, caching seems more performant, but I can imagine that there is a sweet spot… Thanks! jgp Jean-Georges Perrin j...@jgp.net