Re: spark.catalog.listFunctions type signatures

2023-03-28 Thread Guillaume Masse
yst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L91 > Regards, > Jacek Laskowski > "The Internals Of" Online Books <https://books.japila.pl/> > Follow me on https://twitter.com/jaceklaskowski

spark.catalog.listFunctions type signatures

2023-03-28 Thread Guillaume Masse
It would be convenient to get the type signature from org.apache.spark.sql.catalog.Function itself when available. -- Guillaume Massé [Gee-OHM] (马赛卫)
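A minimal PySpark sketch of the gap being described, assuming a local SparkSession: listFunctions exposes name and class metadata but no type signature, and the closest substitute today is the usage text from DESCRIBE FUNCTION.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Catalog metadata: name, className, isTemporary -- but no type signature.
for fn in spark.catalog.listFunctions()[:5]:
    print(fn.name, fn.className)

# The usage/arguments text is only reachable through SQL.
spark.sql("DESCRIBE FUNCTION EXTENDED abs").show(truncate=False)
```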

[Spark SQL] Structured Streaming in Python: can it connect to Cassandra?

2022-03-21 Thread guillaume farcy
share (mutualize) the connection object, but without success: _pickle.PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object. Can you please tell me how to do this? Or at least give me some advice? Sincerely, FARCY
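The usual way around this pickling error is to open the connection inside the function that uses it, so the driver never has to serialize a live session. A hedged sketch, assuming the DataStax cassandra-driver package and hypothetical host/keyspace/table names:

```python
from pyspark.sql import SparkSession
from cassandra.cluster import Cluster  # DataStax driver, assumed installed

def write_batch(batch_df, batch_id):
    # Opened here, per micro-batch: no unpicklable object is captured
    # in a closure that Spark would have to serialize.
    cluster = Cluster(["cassandra-host"])        # hypothetical host
    session = cluster.connect("my_keyspace")     # hypothetical keyspace
    for row in batch_df.collect():
        session.execute(
            "INSERT INTO events (id, payload) VALUES (%s, %s)",
            (row.id, row.payload))               # hypothetical table/columns
    cluster.shutdown()

spark = SparkSession.builder.getOrCreate()
stream = (spark.readStream.format("rate").load()
          .selectExpr("value AS id", "CAST(timestamp AS STRING) AS payload"))
query = stream.writeStream.foreachBatch(write_batch).start()
```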

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Guillaume Eynard Bontemps
efore calling spark-submit. Guillaume, thanks for the pointer. > Timothy, thanks for looking into this. Looking forward to seeing a fix soon. > Thanks, > Eran Chinthaka Withana > On Thu, Mar 10, 2016 at 10:10 AM, Tim Chen wrote: >> Hi Eran,

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Guillaume Eynard Bontemps
For an answer to my question see this: http://stackoverflow.com/a/35660466?noredirect=1. But for your problem, did you define the spark.mesos.docker.home property, or something like that? On Thu, Mar 10, 2016 at 04:26, Eran Chinthaka Withana wrote: > Hi > > I'm also having this issue and can

Re: Is Spark right for us?

2016-03-07 Thread Guillaume Bilodeau
ree text fields I would recommend Solr or > Elasticsearch, because they have a lot more text-analytics capabilities > that do not exist in a relational database or MongoDB and are not likely to > be there in the near future. > On 06 Mar 2016, at 18:25, Guillaume Bilodeau wrote:

Re: Is Spark right for us?

2016-03-06 Thread Guillaume Bilodeau
The data is currently stored in a relational database, but a migration to a document-oriented database such as MongoDB is something we are definitely considering. How does this factor in? On Sun, Mar 6, 2016 at 12:23 PM, Gourav Sengupta wrote: > Hi, > > That depends on a lot of things, but as a

mllib.recommendations.als recommendForAll not ported to ml?

2015-12-06 Thread guillaume
I have experienced very low performance with the ALSModel.transform method when feeding it with even a small cartesian product of users x items. The former mllib implementation has a recommendForAll method to return the top-n items per user in an efficient way (using the blockify method to distribute
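For reference, the gap described here was later closed: since Spark 2.2, ALSModel exposes recommendForAllUsers/recommendForAllItems, which do blocked top-n scoring instead of a full cartesian transform. A minimal sketch with toy data:

```python
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0)],
    ["user", "item", "rating"])

als = ALS(userCol="user", itemCol="item", ratingCol="rating", rank=4)
model = als.fit(ratings)
model.recommendForAllUsers(2).show(truncate=False)  # top-2 items per user
```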

Fwd: pyspark: Error when training a GMM with an initial GaussianMixtureModel

2015-11-25 Thread Guillaume Maze
to use a K-means result, for instance), we simply wanted to test this scenario. So we tried to initialize a 2nd training using the GaussianMixtureModel output from a 1st training. But this trivial scenario throws an error. Could you please help us determine what's going on here? Thanks a lot
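A minimal reproduction sketch of the scenario described, assuming the pyspark.mllib API of that era (GaussianMixture.train accepts an initialModel argument):

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import GaussianMixture

sc = SparkContext.getOrCreate()
data = sc.parallelize([[0.0], [0.1], [5.0], [5.1]])

first = GaussianMixture.train(data, k=2, seed=42)
# Feed the first model back in as the starting point -- the step
# that reportedly raised the error for the original poster.
second = GaussianMixture.train(data, k=2, initialModel=first)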

Model Save function (ML-Lib)

2015-07-17 Thread Guillaume Guy
SVM | SVMModel | NO. What is the recommended route to save a logistic regression or SVM? I tried to pickle the SVM but it failed when loading it back. Any advice appreciated. Thanks! Best, Guillaume Guy +1 919-972-8750
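For what it's worth, spark.mllib later gained native save/load for these models, which avoids pickling entirely. A hedged sketch with a hypothetical path:

```python
from pyspark import SparkContext
from pyspark.mllib.classification import SVMWithSGD, SVMModel
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext.getOrCreate()
points = sc.parallelize([LabeledPoint(0.0, [0.0]), LabeledPoint(1.0, [1.0])])

model = SVMWithSGD.train(points, iterations=10)
model.save(sc, "/tmp/svm-model")                 # hypothetical path
reloaded = SVMModel.load(sc, "/tmp/svm-model")   # no pickle involved
```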

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-23 Thread Guillaume Pitel
Hi, so I've done this "node-centered accumulator" and written a small piece about it: http://blog.guillaume-pitel.fr/2015/06/spark-trick-shot-node-centered-aggregator/ Hope it can help someone. Guillaume 2015-06-18 15:17 GMT+02:00 Guillaume Pitel

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Guillaume Pitel
ed to take care of the concurrency problem for my sketch. Guillaume Yeah, that's the problem. There is probably some "perfect" number of partitions that provides the best balance between partition size, memory, and merge overhead. Though it's not an ideal solution :( There could

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Guillaume Pitel
Hi, Thank you for this confirmation. Coalescing is what we do now. It creates, however, very big partitions. Guillaume Hey, I am not 100% sure, but from my understanding accumulators are per partition (so per task, as it's the same) and are sent back to the driver with the task result and

Re: Best way to randomly distribute elements

2015-06-18 Thread Guillaume Pitel
d to use mapPartitionsWithIndex first and seed your rand with the partition id (or compute your rand from a seed based on x). Guillaume Hello, In the context of a machine learning algorithm, I need to be able to randomly distribute the elements of a large RDD across partitions (i.e., essentially assign each
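A sketch of that advice, assuming the goal is a reproducible random reshuffle: seed a per-partition RNG with the partition id, tag each element with a random key, then repartition by that key.

```python
import random
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(1000), 8)

def tag_with_random_key(pid, elems):
    rng = random.Random(pid)            # partition id as the seed
    for x in elems:
        yield (rng.randrange(8), x)     # random target-partition key

shuffled = (rdd.mapPartitionsWithIndex(tag_with_random_key)
               .partitionBy(8)          # elements land in random partitions
               .values())
print(shuffled.glom().map(len).collect())
```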

Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Guillaume Pitel
mulator is initialized locally, updated, then sent back to the driver for merging? So I guess accumulators may not be the way to go after all. Any advice? Guillaume -- eXenSa Guillaume PITEL, Président +33(0)626 222 431 eXenSa S.A.S. <http://www.exensa.com/> 41, rue Périer
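That is indeed the model: each task updates a local copy of the accumulator, and Spark merges the copies into the driver-side value as tasks finish. A minimal sketch:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
acc = sc.accumulator(0)

def visit(x):
    acc.add(1)   # executor side: updates this task's local copy
    return x

sc.parallelize(range(100), 4).map(visit).count()
print(acc.value)  # driver side: the merged total, 100
```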

Re: Random pairs / RDD order

2015-04-16 Thread Guillaume Pitel
distribution due to collisions, but I don't think it should hurt too much. Guillaume Hi everyone, However I am not happy with this solution because each element is most likely to be paired with elements that are "close by" in the partition. This is because sample returns an

Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-14 Thread Guillaume Pitel
Right, I remember now: the only problematic case is when things go bad and the cleaner is not executed. Also, it can be a problem when reusing the same SparkContext for many runs. Guillaume It cleans the work dir, and SPARK_LOCAL_DIRS should be cleaned automatically. From the source code

Re: Spark Cluster: RECEIVED SIGNAL 15: SIGTERM

2015-04-13 Thread Guillaume Pitel
cess is selected for sacrifice because it is the main culprit of memory consumption. Guillaume Linux OOM throws SIGTERM, but if I remember correctly the JVM handles heap memory limits differently: it throws OutOfMemoryError and eventually sends SIGINT. Not sure what happened, but the worker s

Re: Spark Cluster: RECEIVED SIGNAL 15: SIGTERM

2015-04-13 Thread Guillaume Pitel
Very likely to be this: http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html?page=2 Your worker ran out of memory => maybe you're asking for too much memory for the JVM, or something else is running on the worker. Guillaume Any idea what this means, man

Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-13 Thread Guillaume Pitel
Does it also clean up the Spark local dirs? I thought it was only cleaning $SPARK_HOME/work/ Guillaume I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example: export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=" On 11.04.2015

Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-10 Thread Guillaume Pitel
both on the executors AND on the driver (the Spark local dir of the driver can be heavily used if using a lot of broadcasts). I think in recent versions of Spark, $SPARK_HOME/work is correctly cleaned up, but adding a cron job won't hurt. Guillaume Does anybody have an answer for this?

Re: Pairwise computations within partition

2015-04-09 Thread Guillaume Pitel
by blocks is probably not what you want, since it would restrict the scope of a vector to the vectors in the same block. Guillaume Hello everyone, I am a Spark novice facing a nontrivial problem to solve with Spark. I have an RDD consisting of many elements (say, 60K), where each element

Re: Join on Spark too slow.

2015-04-09 Thread Guillaume Pitel
join takes forever; it makes no sense as is, IMO. Guillaume Hello guys, I am trying to run the following dummy example for Spark on a dataset of 250MB, using 5 machines with >10GB RAM each, but the join seems to be taking too long (>2 hrs). I am using Spark 0.8.0, but I have also t

Re: Kryo exception : Encountered unregistered class ID: 13994

2015-04-09 Thread Guillaume Pitel
transiting on the wire during shuffles, the probability of an error occurring during deserialization or decompression is relatively high. In general, reducing the memory pressure also helps a lot. Guillaume Hi, I'm facing an issue when I try to run my Spark application. I keep getting the foll

Re: Incrementally load big RDD file into Memory

2015-04-08 Thread Guillaume Pitel
takes 100 bytes, the per-element cartesian will produce N*N*100*2 bytes, while the blocked version will produce X*X*100*2*N/X, i.e. X*N*100*2 bytes. Guillaume Hi Guillaume, Thanks for your reply. Can you please tell me how I can improve the top-k nearest points? P.S. My post is not accepted on the
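A rough sketch of the blocked pattern being described, under assumed toy data (ids paired with scalar values, absolute difference as the distance): broadcast one block of candidates at a time and keep a running top-k per element, instead of materializing the full cartesian.

```python
import heapq
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
points = sc.parallelize([(i, float(i)) for i in range(1000)]).cache()
num_blocks, k = 10, 5

topk = None
for b in range(num_blocks):
    # Materialize one block of candidates (~N/num_blocks elements).
    block = sc.broadcast(
        points.filter(lambda p, b=b: p[0] % num_blocks == b).collect())
    # Each point keeps its k nearest within this block only.
    partial = points.map(lambda p: (p[0], heapq.nsmallest(
        k, ((abs(p[1] - q[1]), q[0]) for q in block.value if q[0] != p[0]))))
    # Merge with the running top-k from the previous blocks.
    topk = partial if topk is None else (
        topk.join(partial).mapValues(lambda ab: heapq.nsmallest(k, ab[0] + ab[1])))

print(topk.lookup(0))
```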

Re: Error when using multiple python files spark-submit

2015-03-20 Thread Guillaume Charhon
I see. I will try the other way around. On Thu, Mar 19, 2015 at 8:06 PM, Davies Liu wrote: > the options of spark-submit should come before main.py, or they will > become the options of main.py, so it should be: > ../hadoop/spark-install/bin/spark-submit --py-files /home/poiuytrez

Re: Spark-submit and multiple files

2015-03-20 Thread Guillaume Charhon
Hi Davies, I am already using --py-files. The system does use the other file. The error I am getting is not trivial. Please check the error log. On Thu, Mar 19, 2015 at 8:03 PM, Davies Liu wrote: > You could submit additional Python source via --py-files, for example: > > $ bin/spark-submit

Re: Speed Benchmark

2015-03-04 Thread Guillaume Guy
Sorry for the confusion. All are running Hadoop services. Node 1 is the namenode, whereas Nodes 2 and 3 are datanodes. Best, Guillaume Guy +1 919-972-8750 On Sat, Feb 28, 2015 at 1:09 AM, Sean Owen wrote: > Is machine 1 the only one running an HDFS data node? You describe it

Re: Speed Benchmark

2015-02-27 Thread Guillaume Guy
> very slow, see https://issues.apache.org/jira/browse/SPARK-6055, will > be fixed very soon. > Davies > On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy wrote: > Hi Sean: > Thanks for your feedback. Scala is much faster. The count is performed

Re: Speed Benchmark

2015-02-27 Thread Guillaume Guy
Hi Sean: Thanks for your feedback. Scala is much faster. The count is performed in ~1 minute (vs. 17 min). I would expect Scala to be 2-5x faster, but this gap seems to be more than that. Is that also your conclusion? Thanks. Best, Guillaume Guy +1 919-972-8750 On Fri, Feb 27, 2015 at

Re: Speed Benchmark

2015-02-27 Thread Guillaume Guy
Guillaume Guy +1 919-972-8750 On Fri, Feb 27, 2015 at 9:06 AM, Jason Bell wrote: > How many machines are on the cluster? > And what is the configuration of those machines (cores/RAM)? > "Small cluster" is a very subjective statement. > Guillaum

Speed Benchmark

2015-02-27 Thread Guillaume Guy
f, I would appreciate any pointers as to ways to improve performance. Thanks. Best, Guillaume

Re: Movie Recommendation tutorial

2015-02-24 Thread Guillaume Charhon
I am using Spark 1.2.1. Thank you, Krishna; I am getting almost the same results as you, so it must be an error in the tutorial. Xiangrui, I made some additional tests with lambda = 0.1 and I am getting a much better RMSE: RMSE (validation) = 0.868981 for the model trained with rank = 8, lambda =
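A sketch of the kind of lambda sweep being discussed, in the pyspark.mllib tutorial style of that era; the training and validation RDDs of Rating objects are assumed to exist upstream.

```python
from math import sqrt
from pyspark.mllib.recommendation import ALS

def rmse(model, data):
    # data: RDD of Rating(user, product, rating)
    preds = model.predictAll(data.map(lambda r: (r.user, r.product)))
    scored = (data.map(lambda r: ((r.user, r.product), r.rating))
                  .join(preds.map(lambda p: ((p.user, p.product), p.rating))))
    return sqrt(scored.values().map(lambda tv: (tv[0] - tv[1]) ** 2).mean())

# training / validation assumed defined upstream
best = min((rmse(ALS.train(training, rank=8, iterations=10, lambda_=lam),
                 validation), lam)
           for lam in [0.01, 0.1, 1.0])
print("best (RMSE, lambda):", best)
```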

Re: Spark can't pickle class: error cannot lookup attribute

2015-02-19 Thread Guillaume Guy
Thanks Davies and Eric. I followed Davies' instructions and it works wonderfully. I would add that you can also add these scripts in the pyspark shell too: pyspark --py-files support.py where support.py is your script containing your class as Davies described. Best, Guillaume Guy

Spark can't pickle class: error cannot lookup attribute

2015-02-18 Thread Guillaume Guy
Hi, This is a duplicate of the Stack Overflow question here. I hope to generate more interest on this mailing list. *The problem:* I am running into some attribute lookup problems when trying to
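The fix described later in the thread (define the class in a separate module and ship it with --py-files, so pickle can look it up as a module attribute on the workers) sketches out roughly like this; the file names are hypothetical.

```python
# support.py -- shipped to the executors via --py-files. Classes defined
# in a real module (not in the __main__ script) can be pickled by
# reference and re-imported on the workers.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

# main.py -- submitted with:  spark-submit --py-files support.py main.py
from pyspark import SparkContext
from support import Point

sc = SparkContext()
pts = sc.parallelize([Point(1, 2), Point(3, 4)])
print(pts.map(lambda p: p.x + p.y).collect())
```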

Maven profile in MLLib netlib-lgpl not working (1.1.1)

2014-12-10 Thread Guillaume Pitel
help. Guillaume +1 with 1.3-SNAPSHOT. On Mon, Dec 1, 2014 at 5:49 PM, agg212 wrote: Thanks for your reply, but I'm still running into issues installing/configuring the native libraries for MLlib. Here are the steps I've t

Re: Mllib native netlib-java/OpenBLAS

2014-12-10 Thread Guillaume Pitel
, but the -Pnetlib-lgpl profile seems not to be transmitted to the child from the parent pom.xml. I don't know how to fix that cleanly (I just added true in mllib's pom.xml); maybe it's just a problem with my Maven version (3.0.5). Guillaume I tried building Spark from the source, by dow

Re: What does KryoException: java.lang.NegativeArraySizeException mean?

2014-10-20 Thread Guillaume Pitel
llelism, make sure that your combineByKey has enough distinct keys, and see what happens. Guillaume Thank you, Guillaume. My dataset is not that large, it's ~2GB in total. 2014-10-20 16:58 GMT+08:00 Guillaume Pitel: Hi, It happened t

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-20 Thread Guillaume Pitel
Hi, The array size you (or the serializer) are trying to allocate is just too big for the JVM. No configuration can help: https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit The only option is to split your problem further by increasing parallelism. Guillaume Hi, I'm using
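In practice, "split the problem further" usually just means more, smaller partitions, so no single task has to serialize an array near the JVM limit. A trivial sketch:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
big = sc.parallelize(range(10**7), 8)

# More partitions => smaller per-task chunks at serialization time.
smaller_chunks = big.repartition(256)   # factor of 32, illustrative only
print(smaller_chunks.getNumPartitions())
```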

Re: Delayed hotspot optimizations in Spark

2014-10-10 Thread Guillaume Pitel
Hi, could it be due to GC? I read it may happen if your program starts with a small heap. What are your -Xms and -Xmx values? Print GC stats with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps Guillaume Hello Spark users and developers! I am using HDFS + Spark SQL + Hive schema

Problem with very slow behaviour of TorrentBroadcast vs. HttpBroadcast

2014-10-01 Thread Guillaume Pitel
ect a configuration error on our side, but are unable to pin it down. Does someone have any idea of the origin of the problem? For now we're sticking with the HttpBroadcast workaround. Guillaume -- eXenSa Guillaume PITEL, Président +33(0)626 222 431 eXenSa S.A.S. <http://www.exensa.com/>
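For reference, the workaround mentioned maps to a single configuration switch in Spark 1.x, where HttpBroadcastFactory still existed (it was removed in later releases); a minimal sketch:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("http-broadcast-workaround")
        # Spark 1.x only: fall back from TorrentBroadcast to HttpBroadcast.
        .set("spark.broadcast.factory",
             "org.apache.spark.broadcast.HttpBroadcastFactory"))
sc = SparkContext(conf=conf)
```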

Re: Kryo deserialisation error

2014-07-24 Thread Guillaume Pitel
(a run of garbled binary bytes) Clearly a stream corruption problem. We've been running fine (AFAIK) on 1.0.0 for two weeks, switched to 1.0.1 this Monday, and since then this kind of problem occurs randomly. Guillaume Pitel Not sure if this helps, but it does seem to

Re: build spark assign version number myself?

2014-07-01 Thread Guillaume Ballet
Sorry, there's a typo in my previous post, the line should read: VERSION=$(mvn help:evaluate -Dexpression=project.version 2>/dev/null | grep -v "INFO" | tail -n 1 | sed -e "s/SNAPSHOT/$COMPANYNAME/g") On Tue, Jul 1, 2014 at 10:35 AM, Guillaume Ballet wrote: > You

Re: build spark assign version number myself?

2014-07-01 Thread Guillaume Ballet
You can specify a custom name with the --name option. It will still contain 1.1.0-SNAPSHOT, but at least you can specify your company name. If you want to replace SNAPSHOT with your company name, you will have to edit make-distribution.sh and replace the following line: VERSION=$(mvn help:evaluat

Re: Huge matrix

2014-04-13 Thread Guillaume Pitel
On 04/12/2014 06:35 PM, Xiaoli Li wrote: Hi Guillaume, This sounds like a good idea to me. I am a newbie here. Could you explain further how you will determine which clusters to keep? According to the distance between each

Re: Huge matrix

2014-04-12 Thread Guillaume Pitel
, and how many to keep. As a rule of thumb, I generally want 300-1 elements per cluster, and use 5-20 clusters. Guillaume I am implementing an algorithm using Spark. I have one million users. I need to compute the similarity be

Spark powered wikipedia analysis and exploration

2014-03-27 Thread Guillaume Pitel
kipedia in your spare time :) Guillaume -- Guillaume PITEL, Président +33(0)626 222 431

Re: K-means faster on Mahout then on Spark

2014-03-25 Thread Guillaume Pitel (eXenSa)
Maybe with "MEMORY_ONLY", spark has to recompute the RDD several times because they don't fit in memory. It makes things run slower. As a general safe rule, use MEMORY_AND_DISK_SER Guillaume Pitel - Président d'eXenSa Prashant Sharma a écrit : >I think Mahout use

Re: Spark temp dir (spark.local.dir)

2014-03-13 Thread Guillaume Pitel
? in SPARK_JAVA_OPTS during SparkContext creation? It should probably be passed in spark-env.sh, because it can differ on each node. Guillaume On 13 Mar, 2014, at 5:33 pm, Guillaume Pitel <guillaume.pi...@exensa.com>

Re: Spark temp dir (spark.local.dir)

2014-03-13 Thread Guillaume Pitel
Also, I think the Jetty connector will create a small file or directory in /tmp regardless of spark.local.dir. It's very small, about 10 KB. Guillaume I'm not 100% sure but I think it goes

Re: Spark temp dir (spark.local.dir)

2014-03-13 Thread Guillaume Pitel
e the jars and the log output of the executors are placed (default $SPARK_HOME/work/), and it should be cleaned regularly. In $SPARK_HOME/logs are found the logs of the workers and master. Guillaume Hi, I'm confused about the -Dspark.loc

Re: Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Guillaume Pitel
n the disk. Guillaume Hi, I asked a similar question a while ago and didn't get any answers. I'd like to share a 10 GB double arr