Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread Davies Liu
This PR fixes the problem: https://github.com/apache/spark/pull/2659 cc @josh Davies On Tue, Nov 11, 2014 at 7:47 PM, bliuab bli...@cse.ust.hk wrote: In spark-1.0.2, I have come across an error when I try to broadcast a quite large numpy array (with 35M dimensions). The error information except

Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread Davies Liu
. On Wed, Nov 12, 2014 at 12:29 PM, Davies Liu-2 [via Apache Spark User List] [hidden email] wrote: This PR fixes the problem: https://github.com/apache/spark/pull/2659 cc @josh Davies On Tue, Nov 11, 2014 at 7:47 PM, bliuab [hidden email] wrote: In spark-1.0.2, I have come across an error when I

Re: Getting py4j.protocol.Py4JError: An error occurred while calling o39.predict. while doing batch prediction using decision trees

2014-11-12 Thread Davies Liu
This is a bug; it will be fixed by https://github.com/apache/spark/pull/3230 On Wed, Nov 12, 2014 at 7:20 AM, rprabhu rpra...@ufl.edu wrote: Hello, I'm trying to run a classification task using MLlib decision trees. After successfully training the model, I was trying to test the model using some

Re: Which function in spark is used to combine two RDDs by keys

2014-11-13 Thread Davies Liu
rdd1.union(rdd2).groupByKey() On Thu, Nov 13, 2014 at 3:41 AM, Blind Faith person.of.b...@gmail.com wrote: Let us say I have the following two RDDs, with the following key-pair values. rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ] and rdd2 = [ (key1, [value5,
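
A minimal sketch of that one-liner, using the sample pairs from the question (the second RDD's contents are partly assumed, since the message is truncated) and a SparkContext `sc` as in the pyspark shell:

  rdd1 = sc.parallelize([("key1", ["value1", "value2"]), ("key2", ["value3", "value4"])])
  rdd2 = sc.parallelize([("key1", ["value5", "value6"]), ("key3", ["value7"])])

  # union() concatenates the two RDDs; groupByKey() then gathers all value-lists per key
  combined = rdd1.union(rdd2).groupByKey().mapValues(list)
  # combined.collect() -> (order may vary)
  #   [('key1', [['value1', 'value2'], ['value5', 'value6']]),
  #    ('key2', [['value3', 'value4']]),
  #    ('key3', [['value7']])]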

Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-13 Thread Davies Liu
worker nodes with a total of about 80 cores. Thanks again for the tips! On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu-2 [via Apache Spark User List] [hidden email] wrote: Could you tell how large is the data set? It will help us to debug this issue. On Thu, Nov 6, 2014 at 10:39 AM, skane [hidden

Re: pyspark and hdfs file name

2014-11-13 Thread Davies Liu
One option may be to call HDFS tools or a client to rename them after saveAsXXXFile(). On Thu, Nov 13, 2014 at 9:39 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi, I am running a pyspark job. I need to serialize the final result to HDFS in binary files and have the ability to give a name for the output
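
A minimal sketch of that option, assuming the `hdfs` command-line client is available on the driver machine; the RDD and both paths are hypothetical:

  import subprocess

  tmp_dir = "hdfs:///tmp/job-output"              # hypothetical temporary output directory
  result_rdd.saveAsPickleFile(tmp_dir)            # or saveAsTextFile / saveAsSequenceFile

  # rename the output afterwards with the HDFS CLI
  subprocess.check_call(["hdfs", "dfs", "-mv", tmp_dir, "hdfs:///data/results.bin"])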

Re: Using a compression codec in saveAsSequenceFile in Pyspark (Python API)

2014-11-13 Thread Davies Liu
You could use the following as compressionCodecClass:

  DEFLATE  org.apache.hadoop.io.compress.DefaultCodec
  gzip     org.apache.hadoop.io.compress.GzipCodec
  bzip2    org.apache.hadoop.io.compress.BZip2Codec
  LZO      com.hadoop.compression.lzo.LzopCodec

for gzip,
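
For example, with the gzip codec from the table above (the output path is made up; saveAsTextFile() accepts the same compressionCodecClass argument):

  pairs = sc.parallelize([("key1", "value1"), ("key2", "value2")])
  pairs.saveAsSequenceFile("hdfs:///tmp/pairs-gz",
                           compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")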

Re: Pyspark Error

2014-11-18 Thread Davies Liu
It seems that `localhost` cannot be resolved on your machines; I have filed https://issues.apache.org/jira/browse/SPARK-4475 to track it. On Tue, Nov 18, 2014 at 6:10 AM, amin mohebbi aminn_...@yahoo.com.invalid wrote: Hi there, I have already downloaded pre-built spark-1.1.0, I want to run

Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Davies Liu
On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com wrote: Use zipWithIndex but cache the data before you run zipWithIndex...that way your ordering will be consistent (unless the bug has been fixed where you don't have to cache the data)... Could you point some link about

Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Davies Liu
I see, thanks! On Tue, Nov 18, 2014 at 12:12 PM, Sean Owen so...@cloudera.com wrote: On Tue, Nov 18, 2014 at 8:26 PM, Davies Liu dav...@databricks.com wrote: On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com wrote: Use zipWithIndex but cache the data before you run

Re: Python Scientific Libraries in Spark

2014-11-24 Thread Davies Liu
These libraries can be used in PySpark easily. For example, MLlib uses NumPy heavily; it can accept np.array or SciPy sparse matrices as vectors. On Mon, Nov 24, 2014 at 10:56 AM, Rohit Pujari rpuj...@hortonworks.com wrote: Hello Folks: Since Spark exposes Python bindings and allows you to
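
A minimal sketch of that NumPy interoperability (the data is made up, and `sc` is the SparkContext from the pyspark shell):

  import numpy as np
  from pyspark.mllib.clustering import KMeans

  # dense NumPy vectors go straight into MLlib
  points = sc.parallelize([np.array([0.0, 0.1]), np.array([0.2, 0.0]),
                           np.array([9.0, 8.9]), np.array([9.1, 9.2])])
  model = KMeans.train(points, k=2, maxIterations=10)
  model.predict(np.array([0.1, 0.1]))   # -> cluster id of the nearest center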

Re: numpy arrays and spark sql

2014-12-01 Thread Davies Liu
applySchema() only accepts an RDD of Row/list/tuple; it does not work with numpy.array. After applySchema(), the Python RDD will be pickled and unpickled in the JVM, so you will not get any benefit from using numpy.array. It will work if you convert the ndarray into a list: schemaRDD =

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-03 Thread Davies Liu
inferSchema() will work better than jsonRDD() in your case:

  from pyspark.sql import Row
  srdd = sqlContext.inferSchema(rdd.map(lambda x: Row(**x)))
  srdd.first()
  Row(field1=5, field2='string', field3={'a': 1, 'c': 2})

On Wed, Dec 3, 2014 at 12:11 AM, sahanbull sa...@skimlinks.com wrote: Hi

Re: cannot submit python files on EC2 cluster

2014-12-03 Thread Davies Liu
On Wed, Dec 3, 2014 at 8:17 PM, chocjy jiyanyan...@gmail.com wrote: Hi, I am using spark with version number 1.1.0 on an EC2 cluster. After I submitted the job, it returned an error saying that a python module cannot be loaded due to missing files. I am using the same command that used to

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-04 Thread Davies Liu
Which version of Spark are you using? inferSchema() was improved to support empty dicts in 1.2+; could you try 1.2-RC1? Also, you can use applySchema(): from pyspark.sql import * fields = [StructField('field1', IntegerType(), True), StructField('field2', StringType(), True),
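
A hedged completion of that pattern as a separate sketch: the third field's type is not visible in the truncated message, so MapType is only a guess to match the dict-of-dicts in the question, and `rdd`/`sqlContext` are assumed to exist already:

  from pyspark.sql import *

  fields = [StructField('field1', IntegerType(), True),
            StructField('field2', StringType(), True),
            StructField('field3', MapType(StringType(), IntegerType()), True)]
  schema = StructType(fields)

  # applySchema() wants tuples/lists matching the schema, not dicts
  rows = rdd.map(lambda d: (d.get('field1'), d.get('field2'), d.get('field3')))
  srdd = sqlContext.applySchema(rows, schema)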

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Davies Liu
Could you post your script to reproduce the results (and how to generate the dataset)? That will help us investigate it. On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hmm, here I use Spark in local mode on my laptop with 8 cores. The data is on my local

Re: Error when mapping a schema RDD when converting lists

2014-12-08 Thread Davies Liu
This is fixed in 1.2. Also, in 1.2+ you can call row.asDict() to convert the Row object into a dict. On Mon, Dec 8, 2014 at 6:38 AM, sahanbull sa...@skimlinks.com wrote: Hi Guys, I used applySchema to store a set of nested dictionaries and lists in a parquet file.
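
A small sketch of that conversion, assuming Spark 1.2+ and an existing SchemaRDD `srdd`:

  row = srdd.first()
  d = row.asDict()                         # Row -> plain Python dict
  dicts = srdd.map(lambda r: r.asDict())   # convert every row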

Re: PySprak and UnsupportedOperationException

2014-12-09 Thread Davies Liu
On Tue, Dec 9, 2014 at 11:32 AM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: While trying simple examples of PySpark code, I systematically get these failures when I try this.. I dont see any prior exceptions in the output... How can I debug further to find root cause? es_rdd =

Re: Can spark job have sideeffects (write files to FileSystem)

2014-12-15 Thread Davies Liu
Bear in mind that any task could be launched concurrently on different nodes, so to make sure the generated files are valid you need some atomic operation (such as rename). For example, you could generate a random name for the output file, write the data into it, and rename it to
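
A minimal sketch of that write-then-rename pattern for a local or NFS filesystem; the output directory, the record formatting, and `rdd` are all made up:

  import os
  import uuid

  OUT_DIR = "/data/output"    # hypothetical directory visible on every worker

  def write_partition(records):
      tmp = os.path.join(OUT_DIR, ".%s.tmp" % uuid.uuid4().hex)
      with open(tmp, "w") as f:
          for r in records:
              f.write("%s\n" % r)
      # rename is atomic on POSIX filesystems, so readers never see a half-written file
      os.rename(tmp, os.path.join(OUT_DIR, "part-%s" % uuid.uuid4().hex))

  rdd.foreachPartition(write_partition)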

Re: Error when Applying schema to a dictionary with a Tuple as key

2014-12-16 Thread Davies Liu
It's a bug, could you file a JIRA for this? Thanks! On Tue, Dec 16, 2014 at 5:49 AM, sahanbull sa...@skimlinks.com wrote: Hi Guys, I'm running a Spark cluster in AWS with Spark 1.1.0 on EC2. I am trying to convert an RDD with tuple (u'string', int , {(int, int): int, (int, int): int})

Re: Error when Applying schema to a dictionary with a Tuple as key

2014-12-16 Thread Davies Liu
I had created https://issues.apache.org/jira/browse/SPARK-4866, it will be fixed by https://github.com/apache/spark/pull/3714. Thank you for reporting this. Davies On Tue, Dec 16, 2014 at 12:44 PM, Davies Liu dav...@databricks.com wrote: It's a bug, could you file a JIRA for this? thanks

Re: spark streaming python + kafka

2014-12-22 Thread Davies Liu
There is a WIP pull request[1] working on this, it should be merged into master soon. [1] https://github.com/apache/spark/pull/3715 On Fri, Dec 19, 2014 at 2:15 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , I've just seen that streaming spark supports python from 1.2 version.

Re: python: module pyspark.daemon not found

2014-12-30 Thread Davies Liu
Could you share a link about this? It's common to use Java 7; it would be nice if we could fix this. On Mon, Dec 29, 2014 at 1:27 PM, Eric Friedman eric.d.fried...@gmail.com wrote: Was your Spark assembly jarred with Java 7? There's a known issue with jar files made with that version. It

Re: Python:Streaming Question

2014-12-30 Thread Davies Liu
There is a known bug with local scheduler, will be fixed by https://github.com/apache/spark/pull/3779 On Sun, Dec 21, 2014 at 10:57 PM, Samarth Mailinglist mailinglistsama...@gmail.com wrote: I’m trying to run the stateful network word count at

Re: spark crashes on second or third call first() on file

2015-01-15 Thread Davies Liu
What's the version of Spark you are using? On Wed, Jan 14, 2015 at 12:00 AM, Linda Terlouw linda.terl...@icris.nl wrote: I'm new to Spark. When I use the Movie Lens dataset 100k (http://grouplens.org/datasets/movielens/), Spark crashes when I run the following code. The first call to

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Davies Liu
We have not seen this issue, so we are not sure whether there are bugs related to the reused worker or not. Could you provide more details about it? On Wed, Jan 21, 2015 at 2:27 AM, critikaled isasmani@gmail.com wrote: I'm also facing the same issue. Is this a bug? -- View this message in context:

Re: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Davies Liu
If the dataset is not huge (a few GB), you can set up NFS instead of HDFS (which is much harder to set up): 1. export a directory on the master (or any node in the cluster) 2. mount it at the same path across all slaves 3. read/write from it via file:///path/to/mountpoint On Tue, Jan 20, 2015 at
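
Once the mount is in place, reading and writing through it is just a matter of file:// URLs (the mount point and file names are hypothetical):

  # every node must see the same path, e.g. an NFS export mounted at /mnt/shared
  rdd = sc.textFile("file:///mnt/shared/input/data.txt")
  rdd.saveAsTextFile("file:///mnt/shared/output/result")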

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Davies Liu
among all the tasks within the same executor. 2015-01-21 15:04 GMT+08:00 Davies Liu dav...@databricks.com: Maybe some change related to serialize the closure cause LogParser is not a singleton any more, then it is initialized for every task. Could you change it to a Broadcast? On Tue, Jan 20

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Davies Liu
for it, thanks! On Wed, Jan 21, 2015 at 4:56 PM, Tassilo Klein tjkl...@gmail.com wrote: I set spark.python.worker.reuse = false and now it seems to run longer than before (it has not crashed yet). However, it is very very slow. How to proceed? On Wed, Jan 21, 2015 at 2:21 AM, Davies Liu dav

Re: Scala vs Python performance differences

2015-01-16 Thread Davies Liu
Hey Phil, Thank you for sharing this. The result didn't surprise me much; it's normal to do the prototype in Python, and once it gets stable and you really need the performance, rewriting part of it in C (or all of it in another language) does make sense and will not cost you much time. Davies On

Re: Processing .wav files in PySpark

2015-01-16 Thread Davies Liu
I think you cannot use textFile(), binaryFile() or pickleFile() here; wav is a different format. You could get a list of paths for all the files, then sc.parallelize() them and foreach(): def process(path): # use subprocess to launch a process to do the job, read the stdout as result
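
A minimal sketch of that approach, using map() so the per-file results come back to the driver; the external tool (`soxi` from SoX, printing the duration with -D) and the paths are assumptions:

  import subprocess

  def process(path):
      # launch an external process for one file and read its stdout as the result
      out = subprocess.check_output(["soxi", "-D", path])
      return (path, float(out.strip()))

  paths = ["/data/audio/a.wav", "/data/audio/b.wav"]   # hypothetical, visible on every worker
  durations = sc.parallelize(paths, len(paths)).map(process).collect()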

Re: Using third party libraries in pyspark

2015-01-22 Thread Davies Liu
You need to install these libraries on all the slaves, or submit via spark-submit: spark-submit --py-files xxx On Thu, Jan 22, 2015 at 11:23 AM, Mohit Singh mohit1...@gmail.com wrote: Hi, I might be asking something very trivial, but whats the recommend way of using third party libraries.

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread Davies Liu
Could you provide a short script to reproduce this issue? On Tue, Jan 20, 2015 at 9:00 PM, TJ Klein tjkl...@gmail.com wrote: Hi, I just recently tried to migrate from Spark 1.1 to Spark 1.2 - using PySpark. Initially, I was super glad, noticing that Spark 1.2 is way faster than Spark 1.1.

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread Davies Liu
. The theano function itself is a broadcast variable. Let me know if you need more information. Best, Tassilo On Wed, Jan 21, 2015 at 1:17 AM, Davies Liu dav...@databricks.com wrote: Could you provide a short script to reproduce this issue? On Tue, Jan 20, 2015 at 9:00 PM, TJ Klein tjkl

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-20 Thread Davies Liu
Maybe some change related to serializing the closure means LogParser is no longer a singleton, so it is initialized for every task. Could you change it to a Broadcast? On Tue, Jan 20, 2015 at 10:39 PM, Fengyun RAO raofeng...@gmail.com wrote: Currently we are migrating from Spark 1.1 to Spark

Re: iteratively modifying an RDD

2015-02-11 Thread Davies Liu
On Wed, Feb 11, 2015 at 10:47 AM, rok rokros...@gmail.com wrote: I was having trouble with memory exceptions when broadcasting a large lookup table, so I've resorted to processing it iteratively -- but how can I modify an RDD iteratively? I'm trying something like : rdd =

Re: iteratively modifying an RDD

2015-02-11 Thread Davies Liu
are you comparing? PySpark will try to combine the multiple map() calls together, and then you will get a task which needs all the lookup_tables (the same size as before). You could add a checkpoint after some of the iterations. On Feb 11, 2015, at 8:11 PM, Davies Liu dav...@databricks.com wrote: On Wed

Re: Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Davies Liu
. On Mon, Feb 16, 2015 at 4:04 PM, Davies Liu dav...@databricks.com wrote: It seems that the jar for cassandra is not loaded, you should have them in the classpath. On Mon, Feb 16, 2015 at 12:08 PM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: Hello all, Trying the example code from

Re: Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Davies Liu
It seems that the jar for Cassandra is not loaded; you should have it in the classpath. On Mon, Feb 16, 2015 at 12:08 PM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: Hello all, Trying the example code from this package (https://github.com/Parsely/pyspark-cassandra), I always get

Re: stack map functions in a loop (pyspark)

2015-02-19 Thread Davies Liu
On Thu, Feb 19, 2015 at 7:57 AM, jamborta jambo...@gmail.com wrote: Hi all, I think I have run into an issue on the lazy evaluation of variables in pyspark, I have to following functions = [func1, func2, func3] for counter in range(len(functions)): data = data.map(lambda value:

Re: spark-local dir running out of space during long ALS run

2015-02-16 Thread Davies Liu
For the last question, you can trigger GC in the JVM from Python by: sc._jvm.System.gc() On Mon, Feb 16, 2015 at 4:08 PM, Antony Mayi antonym...@yahoo.com.invalid wrote: thanks, that looks promising but I can't find any reference giving me more details - can you please point me to something? Also

Re: Shuffle on joining two RDDs

2015-02-16 Thread Davies Liu
? On 2015-02-12 19:27, Davies Liu wrote: The feature works as expected in Scala/Java, but is not implemented in Python. On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid iras...@cloudera.com wrote: I wonder if the issue is that these lines just need to add preservesPartitioning = true ? https

Re: Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Davies Liu
a lot, Mohamed. On Mon, Feb 16, 2015 at 5:46 PM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: Oh, I don't know. thanks a lot Davies, gonna figure that out now On Mon, Feb 16, 2015 at 5:31 PM, Davies Liu dav...@databricks.com wrote: It also need the Cassandra jar

Re: Spark can't pickle class: error cannot lookup attribute

2015-02-18 Thread Davies Liu
Currently, PySpark cannot pickle a class object defined in the current script ('__main__'). The workaround is to put the implementation of the class into a separate module, then use bin/spark-submit --py-files xxx.py to deploy it. In xxx.py: class test(object): def __init__(self, a, b):
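
A hedged, fuller version of that workaround; the module and class names simply mirror the truncated snippet, and the job itself is made up:

  # xxx.py -- ship this file with --py-files so every worker can import it
  class test(object):
      def __init__(self, a, b):
          self.a = a
          self.b = b

  # main.py
  from pyspark import SparkContext
  from xxx import test

  sc = SparkContext(appName="pickle-class-example")
  pairs = sc.parallelize([(1, 2), (3, 4)])
  print pairs.map(lambda p: test(p[0], p[1]).a).collect()

  # submitted as: bin/spark-submit --py-files xxx.py main.py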

Re: [documentation] Update the python example ALS of the site?

2015-01-27 Thread Davies Liu
will be fixed by https://github.com/apache/spark/pull/4226 On Tue, Jan 27, 2015 at 8:17 AM, gen tang gen.tan...@gmail.com wrote: Hi, In the spark 1.2.0, it requires the ratings should be a RDD of Rating or tuple or list. However, the current example in the site use still RDD[array] as the

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-27 Thread Davies Liu
Maybe it's caused by integer overflow; is it possible that one object or batch is bigger than 2GB (after pickling)? On Tue, Jan 27, 2015 at 7:59 AM, rok rokros...@gmail.com wrote: I've got a dataset saved with saveAsPickleFile using pyspark -- it saves without problems. When I try to read it back

Re: Define size partitions

2015-01-30 Thread Davies Liu
I think the new API sc.binaryRecords [1] (added in 1.2) can help in this case. [1] http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords Davies On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz konstt2...@gmail.com wrote: Hi, I want to process some
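
A small sketch of that API; the record layout (two little-endian doubles per 16-byte record) and the path are assumptions:

  import struct

  records = sc.binaryRecords("hdfs:///data/measurements.bin", recordLength=16)
  parsed = records.map(lambda r: struct.unpack("<dd", r))   # each record -> (float, float)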

Re: Large number of pyspark.daemon processes

2015-01-23 Thread Davies Liu
It should be a bug; the Python worker did not exit normally. Could you file a JIRA for this? Also, could you show how to reproduce this behavior? On Fri, Jan 23, 2015 at 11:45 PM, Sven Krasser kras...@gmail.com wrote: Hey Adam, I'm not sure I understand just yet what you have in mind. My

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-28 Thread Davies Liu
, Rok On Tue, Jan 27, 2015 at 7:55 PM, Davies Liu dav...@databricks.com wrote: Maybe it's caused by integer overflow, is it possible that one object or batch bigger than 2G (after pickling)? On Tue, Jan 27, 2015 at 7:59 AM, rok rokros...@gmail.com wrote: I've got an dataset saved

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Davies Liu
I ran your scripts on a 5-node (2 CPUs, 8G mem) cluster and cannot reproduce your failure. Should I test it with a big-memory node? On Mon, Jan 5, 2015 at 4:00 PM, Sven Krasser kras...@gmail.com wrote: Thanks for the input! I've managed to come up with a repro of the error with test data only

Re: Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-09 Thread Davies Liu
In the current implementation of TorrentBroadcast, the blocks are fetched one by one in a single thread, so it cannot fully utilize the network bandwidth. Davies On Fri, Jan 9, 2015 at 2:11 AM, Jun Yang yangjun...@gmail.com wrote: Guys, I have a question regarding Spark 1.1 broadcast

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Davies Liu
at 12:46 AM, Davies Liu dav...@databricks.com wrote: I had ran your scripts in 5 nodes ( 2 CPUs, 8G mem) cluster, can not reproduce your failure. Should I test it with big memory node? On Mon, Jan 5, 2015 at 4:00 PM, Sven Krasser kras...@gmail.com wrote: Thanks for the input! I've managed

Re: save spark streaming output to single file on hdfs

2015-01-13 Thread Davies Liu
On Tue, Jan 13, 2015 at 10:04 AM, jamborta jambo...@gmail.com wrote: Hi all, Is there a way to save dstream RDDs to a single file so that another process can pick it up as a single RDD? It does not need to be a single file; Spark can pick up any directory as a single RDD. Also, it's easy to union

Re: save spark streaming output to single file on hdfs

2015-01-13 Thread Davies Liu
13 2015 at 18:15:15 Davies Liu dav...@databricks.com wrote: On Tue, Jan 13, 2015 at 10:04 AM, jamborta jambo...@gmail.com wrote: Hi all, Is there a way to save dstream RDDs to a single file so that another process can pick it up as a single RDD? It does not need to a single file

Re: Spark on very small files, appropriate use case?

2015-02-10 Thread Davies Liu
Spark is a framework for doing things in parallel very easily; it will definitely help your case.

  def read_file(path):
      lines = open(path).readlines()  # bzip2
      return lines

  filesRDD = sc.parallelize(path_to_files, N)
  lines = filesRDD.flatMap(read_file)

Then you could do other transforms on

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-13 Thread Davies Liu
large -- I've now split it up into many smaller operations but it's still not quite there -- see http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-td21606.html Thanks, Rok On Wed, Feb 11, 2015, 19:59 Davies Liu dav...@databricks.com wrote: Could you share

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Davies Liu
-- but the dictionary is large, it's 8 Gb pickled on disk. On Feb 10, 2015, at 10:01 PM, Davies Liu dav...@databricks.com wrote: Could you paste the NPE stack trace here? It will better to create a JIRA for it, thanks! On Tue, Feb 10, 2015 at 10:42 AM, rok rokros...@gmail.com wrote: I'm trying

Re: Shuffle Problems in 1.2.0

2015-01-07 Thread Davies Liu
:29 PM, Davies Liu dav...@databricks.com wrote: I still can not reproduce it with 2 nodes (4 CPUs). Your repro.py could be faster (10 min) than before (22 min): inpdata.map(lambda (pc, x): (x, pc=='p' and 2 or 1)).reduceByKey(lambda x, y: x|y).filter(lambda (x, pc): pc==3).collect() (also

Re: spark there is no space on the disk

2015-03-19 Thread Davies Liu
Is it possible that `spark.local.dir` is overridden by something else? The docs say: NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) On Sat, Mar 14, 2015 at 5:29 PM, Peng Xia sparkpeng...@gmail.com wrote: Hi Sean, Thanks very much for

Re: Error when using multiple python files spark-submit

2015-03-19 Thread Davies Liu
the options of spark-submit should come before main.py, or they will become the options of main.py, so it should be: ../hadoop/spark-install/bin/spark-submit --py-files /home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py --master spark://spark-m:7077 main.py

Re: Spark 1.3 createDataframe error with pandas df

2015-03-19 Thread Davies Liu
On Mon, Mar 16, 2015 at 6:23 AM, kevindahl kevin.d...@gmail.com wrote: kevindahl wrote I'm trying to create a spark data frame from a pandas data frame, but for even the most trivial of datasets I get an error along the lines of this:

Re: Spark-submit and multiple files

2015-03-19 Thread Davies Liu
You could submit additional Python source via --py-files , for example: $ bin/spark-submit --py-files work.py main.py On Tue, Mar 17, 2015 at 3:29 AM, poiuytrez guilla...@databerries.com wrote: Hello guys, I am having a hard time to understand how spark-submit behave with multiple files. I

Re: Spark 1.2. loses often all executors

2015-03-20 Thread Davies Liu
Maybe this is related to a bug in 1.2 [1]; it's fixed in 1.2.2 (not yet released). Could you check out the 1.2 branch and verify that? [1] https://issues.apache.org/jira/browse/SPARK-5788 On Fri, Mar 20, 2015 at 3:21 AM, mrm ma...@skimlinks.com wrote: Hi, I recently changed from Spark 1.1 to Spark

Re: Spark-submit and multiple files

2015-03-20 Thread Davies Liu
the error log. On Thu, Mar 19, 2015 at 8:03 PM, Davies Liu dav...@databricks.com wrote: You could submit additional Python source via --py-files , for example: $ bin/spark-submit --py-files work.py main.py On Tue, Mar 17, 2015 at 3:29 AM, poiuytrez guilla...@databerries.com wrote: Hello guys

Re: Python script runs fine in local mode, errors in other modes

2015-03-08 Thread Davies Liu
?filter=-1 On Tue, Aug 19, 2014 at 12:12 PM, Davies Liu dav...@databricks.com wrote: This script runs very well without your CSV file. Could you download your CSV file to local disk, and narrow down to the lines which trigger this issue? On Tue, Aug 19, 2014 at 12:02 PM, Aaron aaron.doss

Re: How to consider HTML files in Spark

2015-03-12 Thread Davies Liu
sc.wholeTextFiles() is what you need. http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles On Thu, Mar 12, 2015 at 9:26 AM, yh18190 yh18...@gmail.com wrote: Hi. I am very much fascinated by the Spark framework. I am trying to use PySpark + BeautifulSoup to

Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
What's the version of Spark you are running? There is a bug in SQL Python API [1], it's fixed in 1.2.1 and 1.3, [1] https://issues.apache.org/jira/browse/SPARK-6055 On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: Hi Guys, I running the following

Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
batchsize parameter = 1 http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html if this does not work I will install 1.2.1 or 1.3 Regards On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu dav...@databricks.com wrote: What's the version of Spark you are running

Re: python : Out of memory: Kill process

2015-03-26 Thread Davies Liu
at 10:41 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: Hi Davies, I upgrade to 1.3.0 and still getting Out of Memory. I ran the same code as before, I need to make any changes? On Wed, Mar 25, 2015 at 4:00 PM, Davies Liu dav...@databricks.com wrote: With batchSize = 1, I

Re: Speed Benchmark

2015-02-27 Thread Davies Liu
@gmail.com wrote: It is a simple text file. I'm not using SQL. just doing a rdd.count() on it. Does the bug affect it? On Friday, February 27, 2015, Davies Liu dav...@databricks.com wrote: What is this dataset? text file or parquet file? There is an issue with serialization in Spark SQL, which

Re: Speed Benchmark

2015-02-27 Thread Davies Liu
What is this dataset? A text file or a Parquet file? There is an issue with serialization in Spark SQL which makes it very slow; see https://issues.apache.org/jira/browse/SPARK-6055, it will be fixed very soon. Davies On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy guillaume.c@gmail.com wrote:

Re: Spark Performance on Yarn

2015-02-21 Thread Davies Liu
How many executors do you have per machine? It will be helpful if you could list all the configs. Could you also try to run it without persist? Caching can hurt more than help if you don't have enough memory. On Fri, Feb 20, 2015 at 5:18 PM, Lee Bierman leebier...@gmail.com wrote: Thanks for the

Re: Spark 1.3 dataframe documentation

2015-02-24 Thread Davies Liu
Another way to see the Python docs: $ export PYTHONPATH=$SPARK_HOME/python $ pydoc pyspark.sql On Tue, Feb 24, 2015 at 2:01 PM, Reynold Xin r...@databricks.com wrote: The official documentation will be posted when 1.3 is released (early March). Right now, you can build the docs yourself by

Re: Python Example sql.py not working in version spark-1.3.0-bin-hadoop2.4

2015-03-27 Thread Davies Liu
This will be fixed in https://github.com/apache/spark/pull/5230/files On Fri, Mar 27, 2015 at 9:13 AM, Peter Mac peter.machar...@noaa.gov wrote: I downloaded spark version spark-1.3.0-bin-hadoop2.4. When the python version of sql.py is run the following error occurs: [root@nde-dev8-template

Re: python : Out of memory: Kill process

2015-03-26 Thread Davies Liu
(spark.kryoserializer.buffer.mb,512)) sc = SparkContext(conf=conf ) sqlContext = SQLContext(sc) On Thu, Mar 26, 2015 at 2:29 PM, Davies Liu dav...@databricks.com wrote: Could you try to remove the line `log2.cache()` ? On Thu, Mar 26, 2015 at 10:02 AM, Eduardo Cusa eduardo.c

Re: Error when get data from hive table. Use python code.

2015-01-29 Thread Davies Liu
On Thu, Jan 29, 2015 at 6:36 PM, QiuxuanZhu ilsh1...@gmail.com wrote: Dear all, I have no idea when it raises an error when I run the following code. def getRow(data): return data.msg first_sql = select * from logs.event where dt = '20150120' and et = 'ppc' LIMIT 10#error

Re: PySpark: slicing issue with dataframes

2015-05-17 Thread Davies Liu
Yes, it's a bug, please file a JIRA. On Sun, May 3, 2015 at 10:36 AM, Ali Bajwa ali.ba...@gmail.com wrote: Friendly reminder on this one. Just wanted to get a confirmation that this is not by design before I logged a JIRA Thanks! Ali On Tue, Apr 28, 2015 at 9:53 AM, Ali Bajwa

Re: how to set random seed

2015-05-17 Thread Davies Liu
The Python workers used for each stage may be different, so this may not work as expected. You can create a Random object, set the seed, and use it to do the shuffle():

  import random

  r = random.Random()
  r.seed(my_seed)

  def f(x):
      r.shuffle(x)   # shuffle the element (assumed to be a list) in place
      return x

  rdd.map(f)

On Thu, May 14, 2015 at 6:21 AM, Charles Hayden

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-18 Thread Davies Liu
SparkContext can be used in multiple threads (Spark Streaming works with multiple threads), for example:

  import threading
  import time

  def show(x):
      time.sleep(1)
      print x

  def job():
      sc.parallelize(range(100)).foreach(show)

  threading.Thread(target=job).start()

On Mon, May 18,

Re: pass configuration parameters to PySpark job

2015-05-18 Thread Davies Liu
In PySpark, functions/closures are serialized together with the global values they use. For example:

  global_param = 111

  def my_map(x):
      return x + global_param

  rdd.map(my_map)

- Davies On Mon, May 18, 2015 at 7:26 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi, I am looking for a way to

Re: Issue with pyspark 1.3.0, sql package and rows

2015-04-08 Thread Davies Liu
I will look into this today. On Wed, Apr 8, 2015 at 7:35 AM, Stefano Parmesan parme...@spaziodati.eu wrote: Did anybody by any chance had a look at this bug? It keeps on happening to me, and it's quite blocking, I would like to understand if there's something wrong in what I'm doing, or

Re: make two rdd co-partitioned in python

2015-04-09 Thread Davies Liu
In Spark 1.3+, PySpark also supports this kind of narrow dependency. For example, with

  N = 10
  a1 = a.partitionBy(N)
  b1 = b.partitionBy(N)

a1.union(b1) will only have N partitions, so a1.join(b1) does not need a shuffle anymore. On Thu, Apr 9, 2015 at 11:57 AM, pop xia...@adobe.com wrote: In

Re: Is this a good use case for Spark?

2015-05-20 Thread Davies Liu
Spark is a great framework for doing things in parallel with multiple machines, and it will be really helpful for your case. Once you can wrap your entire pipeline into a single Python function:

  def process_document(path, text):
      # you can call other tools or services here
      return xxx

then you can
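
A hedged continuation of that idea, pairing the function with wholeTextFiles(); the input/output directories and the return value are made up:

  def process_document(path, text):
      # call other tools or services here; return whatever the pipeline produces
      return (path, len(text))

  docs = sc.wholeTextFiles("hdfs:///data/documents")            # (path, content) pairs
  results = docs.map(lambda kv: process_document(kv[0], kv[1]))
  results.saveAsTextFile("hdfs:///data/documents-processed")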

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-20 Thread Davies Liu
AM, Davies Liu dav...@databricks.com wrote: SparkContext can be used in multiple threads (Spark streaming works with multiple threads), for example: import threading import time def show(x): time.sleep(1) print x def job(): sc.parallelize(range(100)).foreach(show

Re: Does Python 2.7 have to be installed on every cluster node?

2015-05-19 Thread Davies Liu
PySpark works with CPython by default, and you can specify which version of Python to use by: PYSPARK_PYTHON=path/to/python bin/spark-submit xxx.py When you do the upgrade, you could install Python 2.7 on every machine in the cluster and test it with PYSPARK_PYTHON=python2.7 bin/spark-submit xxx.py

Re: Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-19 Thread Davies Liu
It surprises me, could you list the owner information of /mnt/lustre/bigdata/med_home/tmp/test19EE/ ? On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes tomasz.frub...@fuw.edu.pl wrote: Dear Experts, we have a spark cluster (standalone mode) in which master and workers are started from root

Re: Spark 1.3.1 - SQL Issues

2015-05-20 Thread Davies Liu
The docs have been updated. You should convert the DataFrame to an RDD with `df.rdd`. On Mon, Apr 20, 2015 at 5:23 AM, ayan guha guha.a...@gmail.com wrote: Hi, Just upgraded to Spark 1.3.1. I am getting a warning Warning (from warnings module): File

Re: Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-20 Thread Davies Liu
On 19.05.2015 at 23:56, Davies Liu wrote: It surprises me, could you list the owner information of /mnt/lustre/bigdata/med_home/tmp/test19EE/ ? On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes tomasz.frub...@fuw.edu.pl

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Davies Liu
crashes. On Tue, Jun 2, 2015 at 5:06 AM, Davies Liu dav...@databricks.com wrote: Could you run the single thread version in worker machine to make sure that OpenCV is installed and configured correctly? On Sat, May 30, 2015 at 6:29 AM, Sam Stoelinga sammiest...@gmail.com wrote: I've verified

Re: SparkSQL nested dictionaries

2015-06-08 Thread Davies Liu
I think it works in Python:

  df = sqlContext.createDataFrame([(1, {'a': 1})])
  df.printSchema()
  root
   |-- _1: long (nullable = true)
   |-- _2: map (nullable = true)
   |    |-- key: string
   |    |-- value: long (valueContainsNull = true)
  df.select(df._2.getField('a')).show()
  +-----+
  |_2[a]|

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Davies Liu
, 2015 at 2:43 PM, Davies Liu dav...@databricks.com wrote: Please file a bug here: https://issues.apache.org/jira/browse/SPARK/ Could you also provide a way to reproduce this bug (including some datasets)? On Thu, Jun 4, 2015 at 11:30 PM, Sam Stoelinga sammiest...@gmail.com wrote: I've

Re: SQL vs. DataFrame API

2015-06-22 Thread Davies Liu
Right now, we cannot figure out which column you referenced in `select` if there are multiple columns with the same name in the joined DataFrame (for example, two `value` columns). A workaround could be: numbers2 = numbers.select(df.name, df.value.alias('other')) rows = numbers.join(numbers2,

Re: SQL vs. DataFrame API

2015-06-23 Thread Davies Liu
/dokipen/018a1deeab668efdf455 On Mon, Jun 22, 2015 at 4:33 PM Davies Liu dav...@databricks.com wrote: Right now, we can not figure out which column you referenced in `select`, if there are multiple row with the same name in the joined DataFrame (for example, two `value`). A workaround could

Re: Java Constructor Issues

2015-06-21 Thread Davies Liu
The compiled jar is not consistent with the Python source; maybe you are using an older version of PySpark, but with the assembly jar of Spark Core 1.4? On Sun, Jun 21, 2015 at 7:24 AM, Shaanan Cohney shaan...@gmail.com wrote: Hi all, I'm having an issue running some code that works on a build of Spark

Re: Got the exception when joining RDD with spark streamRDD

2015-06-18 Thread Davies Liu
This seems to be a bug, could you file a JIRA for it? RDDs should be serializable for Streaming jobs. On Thu, Jun 18, 2015 at 4:25 AM, Groupme grou...@gmail.com wrote: Hi, I am writing a pyspark streaming program. I have a training data set to compute the regression model. I want to use the stream

Re: SparkR - issue when starting the sparkR shell

2015-06-19 Thread Davies Liu
Yes, right now, we only tested SparkR with R 3.x On Fri, Jun 19, 2015 at 5:53 AM, Kulkarni, Vikram vikram.kulka...@hp.com wrote: Hello, I am seeing this issue when starting the sparkR shell. Please note that I have R version 2.14.1. [root@vertica4 bin]# sparkR R version 2.14.1

Re: Cassandra - Spark 1.3 - reading data from cassandra table with PYSpark

2015-06-19 Thread Davies Liu
On Fri, Jun 19, 2015 at 7:33 AM, Koen Vantomme koen.vanto...@gmail.com wrote: Hello, I'm trying to read data from a table stored in cassandra with pyspark. I found the scala code to loop through the table : cassandra_rdd.toArray.foreach(println) How can this be translated into PySpark ?

Re: ERROR in withColumn method

2015-06-19 Thread Davies Liu
This is an known issue: https://issues.apache.org/jira/browse/SPARK-8461?filter=-1 Will be fixed soon by https://github.com/apache/spark/pull/6898 On Fri, Jun 19, 2015 at 5:50 AM, Animesh Baranawal animeshbarana...@gmail.com wrote: I am trying to perform some insert column operations in

Re: SQL vs. DataFrame API

2015-06-23 Thread Davies Liu
/675968d2e4be68958df8 2015-06-23 23:11 GMT+02:00 Davies Liu dav...@databricks.com: I think it also happens in DataFrames API of all languages. On Tue, Jun 23, 2015 at 9:16 AM, Ignacio Blasco elnopin...@gmail.com wrote: That issue happens only in python dsl? El 23/6/2015 5:05 p. m., Bob Corsaro

Re: number of partitions in join: Spark documentation misleading!

2015-06-16 Thread Davies Liu
Please file a JIRA for it. On Mon, Jun 15, 2015 at 8:00 AM, mrm ma...@skimlinks.com wrote: Hi all, I was looking for an explanation on the number of partitions for a joined rdd. The documentation of Spark 1.3.1. says that: For distributed shuffle operations like reduceByKey and join, the
