Re: how to set database in DataFrame.saveAsTable?

2016-02-20 Thread gen tang
Hi, You can use sqlContext.sql("use <database>") before calling dataframe.saveAsTable. Hope it could be helpful. Cheers Gen On Sun, Feb 21, 2016 at 9:55 AM, Glen wrote: > For dataframe in spark, so the table can be visited by hive. > > -- > Jacky Wang >
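For reference, a minimal PySpark sketch of this suggestion (a hypothetical database my_db and table my_table; assumes a HiveContext named sqlContext and a Spark 1.6-era DataFrame API):
{code}
sqlContext.sql("use my_db")        # switch the current database before saving
df.write.saveAsTable("my_table")   # the table is then created as my_db.my_table and is visible to Hive
{code}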

Re: dataframe slow down with tungsten turn on

2015-11-04 Thread gen tang
Yes, the same code, the same result. In fact, the code has been running for more than one month. Before 1.5.0, the performance was quite the same, so I suspect that it is caused by tungsten. Gen On Wed, Nov 4, 2015 at 4:05 PM, Rick Moritz wrote: > Something to check (just in case): > Are you g

dataframe slow down with tungsten turn on

2015-11-03 Thread gen tang
tungsten turn on), it takes about 2 hours to finish the same job. I checked the detail of the tasks; almost all the time is consumed by computation. Any idea why this happens? Thanks a lot in advance for your help. Cheers Gen

Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread gen tang
Gen On Tue, Aug 25, 2015 at 6:26 PM, Nick Pentreath wrote: > While it's true locality might speed things up, I'd say it's a very bad > idea to mix your Spark and ES clusters - if your ES cluster is serving > production queries (and in particular using aggregations), you

Spark works with the data in another cluster(Elasticsearch)

2015-08-18 Thread gen tang
cluster. I would be grateful if someone can share his/her experience of using spark with elasticsearch. Thanks a lot in advance for your help. Cheers Gen

Re: Questions about SparkSQL join on not equality conditions

2015-08-11 Thread gen tang
) that I use is created from a hive table (about 1G). Therefore spark thinks df1 is larger than df2, although df1 is very small. As a result, spark tries to do df2.collect(), which causes the error. Hope this could be helpful Cheers Gen On Mon, Aug 10, 2015 at 11:29 PM, gen tang wrote: > Hi, >

Re: Questions about SparkSQL join on not equality conditions

2015-08-10 Thread gen tang
record is by no means bigger than 1G. When I do a join on just one condition or an equality condition, there is no problem. Could anyone help me, please? Thanks a lot in advance. Cheers Gen On Sun, Aug 9, 2015 at 9:08 PM, gen tang wrote: > Hi, > > I might have a stupid question about sparks

Questions about SparkSQL join on not equality conditions

2015-08-09 Thread gen tang
operation. So I would like to know how spark implements it. As I observe that such a join runs very slowly, I guess that spark implements it by doing a filter on top of the cartesian product. Is that true? Thanks in advance for your help. Cheers Gen

Re: Spark MLib v/s SparkR

2015-08-07 Thread gen tang
can be helpful Cheers Gen On Thu, Aug 6, 2015 at 2:24 AM, praveen S wrote: > I was wondering when one should go for MLib or SparkR. What is the > criteria or what should be considered before choosing either of the > solutions for data analysis? > or What is the advantages of Spa

Re: Problems getting expected results from hbase_inputformat.py

2015-08-07 Thread gen tang
are trying to use this new python script with an old jar. You can clone the newest code of spark from github and build the examples jar. Then you will get the correct result. Cheers Gen On Sat, Aug 8, 2015 at 5:03 AM, Eric Bless wrote: > I’m having some difficulty getting the desired results from

Re: How to get total CPU consumption for Spark job

2015-08-07 Thread gen tang
Hi, The Spark UI and logs don't show the state of the cluster. However, you can use Ganglia to monitor the cluster. In spark-ec2, there is an option to install ganglia automatically. If you use CDH, you can also use Cloudera Manager. Cheers Gen On Sat, Aug 8, 2015 at 6:06 AM,

Re: large scheduler delay in pyspark

2015-08-05 Thread gen tang
, it is not scheduler delay. When the computation finishes, the UI will show the correct scheduler delay time. Cheers Gen On Tue, Aug 4, 2015 at 3:13 PM, Davies Liu wrote: > On Mon, Aug 3, 2015 at 9:00 AM, gen tang wrote: > > Hi, > > > > Recently, I met some problems about schedule

large scheduler delay in pyspark

2015-08-03 Thread gen tang
tring) or Spark on Yarn. But the first code works fine on the same data. Is there any way to find the relevant log when spark stalls in scheduler delay, please? Or any ideas about this problem? Thanks a lot in advance for your help. Cheers Gen

Strange behavoir of pyspark with --jars option

2015-07-14 Thread gen tang
this interesting problem happens? Thanks a lot for your help in advance. Cheers Gen

Re: pyspark hbase range scan

2015-04-02 Thread gen tang
Hi, Maybe this might be helpful: https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala Cheers Gen On Thu, Apr 2, 2015 at 1:50 AM, Eric Kimbrel wrote: > I am attempting to read an hbase table in pyspark with a range scan. >
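For context, converters like the ones in that repository are typically wired into a PySpark HBase read roughly as below (adapted from the hbase_inputformat.py example shipped with Spark; the ZooKeeper host and table name are hypothetical, and the converter classes assume the Spark examples jar is on the classpath — a range scan additionally needs a scan passed to TableInputFormat, which is what the linked converters help with):
{code}
conf = {
    "hbase.zookeeper.quorum": "zk-host",       # hypothetical ZooKeeper quorum
    "hbase.mapreduce.inputtable": "my_table",  # hypothetical HBase table
}
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf)
{code}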

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread gen tang
eover, even if the program passes, the processing time will be very long. Maybe you should try to reduce the set to predict for each client, as in practice you never need to predict the preference of all products to make a recommendation. Hope this will be helpful. Cheers Gen On Wed, Mar 18, 2015 at 12:

Re: Does anyone integrate HBASE on Spark

2015-03-04 Thread gen tang
quite good. Hope it would be helpful Cheers Gen On Wed, Mar 4, 2015 at 6:51 PM, sandeep vura wrote: > Hi Sparkers, > > How do i integrate hbase on spark !!! > > Appreciate for replies !! > > Regards, > Sandeep.v >

Re: Spark on EC2

2015-02-24 Thread gen tang
familiar with spark. You can do this on your laptop as well as on ec2. In fact, running ./ec2/spark-ec2 means launching spark in standalone mode on a cluster; you can find more details here: https://spark.apache.org/docs/latest/spark-standalone.html Cheers Gen On Tue, Feb 24, 2015 at 4:07 PM, Deep

Re: Spark on EC2

2015-02-24 Thread gen tang
, but not on the utilisation of the machine. Hope it helps. Cheers Gen On Tue, Feb 24, 2015 at 3:55 PM, Deep Pradhan wrote: > Hi, > I have just signed up for Amazon AWS because I learnt that it provides > service for free for the first 12 months. > I want to run Spark on EC2 cluste

Re: Loading JSON dataset with Spark Mllib

2015-02-15 Thread gen tang
Cheers Gen On Mon, Feb 16, 2015 at 12:39 AM, pankaj channe wrote: > Hi, > > I am new to spark and planning on writing a machine learning application > with Spark mllib. My dataset is in json format. Is it possible to load data > into spark without using any external json libraries?

Re: Specifying AMI when using Spark EC-2 scripts

2015-02-15 Thread gen tang
Hi, You can use -a or --ami to launch the cluster using a specific AMI. If I remember well, the default system is Amazon Linux. Hope it will help Cheers Gen On Sun, Feb 15, 2015 at 6:20 AM, olegshirokikh wrote: > Hi there, > > Is there a way to specify the AWS AMI with particula

Re: Installing a python library along with ec2 cluster

2015-02-09 Thread gen tang
Hi, Please take a look at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-ebs.html Cheers Gen On Mon, Feb 9, 2015 at 6:41 AM, Chengi Liu wrote: > Hi I am very new both in spark and aws stuff.. > Say, I want to install pandas on ec2.. (pip install pandas) > How do

Re: no space left at worker node

2015-02-08 Thread gen tang
problem and find out the specific reason. Cheers Gen On Sun, Feb 8, 2015 at 10:45 PM, ey-chih chow wrote: > Thanks Gen. How can I check if /dev/sdc is well mounted or not? In > general, the problem shows up when I submit the second or third job. The > first job I submit most li

Re: no space left at worker node

2015-02-08 Thread gen tang
Hi, In fact, /dev/sdb is /dev/xvdb. It seems that there is no problem with a double mount. However, there is no information about /mnt2. You should check whether /dev/sdc is well mounted or not. Michael's reply is a good solution to this type of problem. You can check his site. Cheers Gen

Re: no space left at worker node

2015-02-08 Thread gen tang
. Cheers Gen On Sun, Feb 8, 2015 at 8:16 AM, ey-chih chow wrote: > Hi, > > I submitted a spark job to an ec2 cluster, using spark-submit. At a worker > node, there is an exception of 'no space left on device' as follows. > >

Re: Installing a python library along with ec2 cluster

2015-02-08 Thread gen tang
Hi, You can make an image of ec2 with all the python libraries installed and create a bash script to export PYTHONPATH in the /etc/init.d/ directory. Then you can launch the cluster with this image and ec2.py Hope this can be helpful Cheers Gen On Sun, Feb 8, 2015 at 9:46 AM, Chengi Liu

Re: Pyspark Hbase scan.

2015-02-05 Thread gen tang
Hi, In fact, this pull request https://github.com/apache/spark/pull/3920 is for doing an HBase scan. However, it is not merged yet. You can also take a look at the example code at http://spark-packages.org/package/20 which uses scala and python to read data from hbase. Hope this can be helpful. Cheers Gen

Re: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread gen tang
heers Gen On Thu, Jan 29, 2015 at 5:54 PM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > Install virtual box which run Linux? That does not help us. We have > business reason to run it on Windows operating system, e.g. Windows 2008 R2. > > > > If anybod

Re: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread gen tang
Hi, I tried to use spark under windows once. However, the only solution that I found was to install virtualbox. Hope this can help you. Best Gen On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > I deployed spark-1.1.0 on Windows 7 and

[documentation] Update the python example ALS of the site?

2015-01-27 Thread gen tang
nks a lot. Cheers Gen

Re: Cluster hangs in 'ssh-ready' state using Spark 1.2 EC2 launch script

2015-01-17 Thread gen tang
le more, the script will finish launching the cluster. Cheers Gen On Sat, Jan 17, 2015 at 7:00 PM, Nathan Murthy wrote: > Originally posted here: > http://stackoverflow.com/questions/28002443/cluster-hangs-in-ssh-ready-state-using-spark-1-2-ec2-launch-script > > I'm trying to la

Re: Did anyone tried overcommit of CPU cores?

2015-01-09 Thread gen tang
helps Cheers Gen On Fri, Jan 9, 2015 at 10:12 AM, Xuelin Cao wrote: > > Thanks, but, how to increase the tasks per core? > > For example, if the application claims 10 cores, is it possible to launch > 100 tasks concurrently? > > > > On Fri, Jan 9, 2015 at 2:57 PM

Re: Spark on teradata?

2015-01-08 Thread gen tang
Thanks a lot for your reply. In fact, I need to work on almost all the data in teradata (~100T). So, I don't think that jdbcRDD is a good choice. Cheers Gen On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin wrote: > Depending on your use cases. If the use case is to extract small amount o

Re: Spark Trainings/ Professional certifications

2015-01-07 Thread gen tang
Hi, I am sorry to bother you, but I couldn't find any information about online test of spark certification managed through Kryterion. Could you please give me the link about it? Thanks a lot in advance. Cheers Gen On Wed, Jan 7, 2015 at 6:18 PM, Paco Nathan wrote: > Hi Saurabh, >

Spark on teradata?

2015-01-07 Thread gen tang
Hi, I have a stupid question: Is it possible to use spark on a Teradata data warehouse, please? I read some news on the internet which says yes. However, I didn't find any example of this. Thanks in advance. Cheers Gen

Re: Using ec2 launch script with locally built version of spark?

2015-01-06 Thread gen tang
. Fork https://github.com/mesos/spark-ec2 and make a change in ./spark/init.sh (add wget ) 3. Change line 638 in ec2 launch script: git clone Hope this can be helpful. Cheers Gen On Tue, Jan 6, 2015 at 11:51 PM, Ganon Pierce wrote: > Is there a way to use the ec2 launch script with a loca

Re: Why so many tasks?

2014-12-16 Thread Gen
partitions). Cheers Gen bethesda wrote > Our job is creating what appears to be an inordinate number of very small > tasks, which blow out our os inode and file limits. Rather than > continually upping those limits, we are seeking to understand whether our > real problem is that to

Re: MLLib /ALS : java.lang.OutOfMemoryError: Java heap space

2014-12-16 Thread Gen
Hi, How many clients and how many products do you have? Cheers Gen jaykatukuri wrote > Hi all, I am running into an out of memory error while running ALS using > MLLIB on a reasonably small data set consisting of around 6 Million > ratings. The stack trace is below: java.lang.OutOfMemoryError: Java heap

Re: RDD.aggregate?

2014-12-12 Thread Gen
result is ([0,1, ..., 99], 4950, 100) Hope that it could help you. Cheers Gen ll wrote > can someone please explain how RDD.aggregate works? i looked at the > average example done with aggregate() but i'm still confused about this > function... much appreciated.
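To make the quoted result concrete, here is a small sketch of an aggregate call that would produce it (assuming the RDD is sc.parallelize(range(100)); the accumulator is an (elements, sum, count) triple):
{code}
rdd = sc.parallelize(range(100))
zero = ([], 0, 0)  # (elements collected so far, running sum, running count)
seq_op = lambda acc, x: (acc[0] + [x], acc[1] + x, acc[2] + 1)   # fold one element into a partition's accumulator
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2])   # merge accumulators from two partitions
elems, total, count = rdd.aggregate(zero, seq_op, comb_op)
# total == 4950, count == 100, elems contains the values 0..99
{code}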

Cannot pickle DecisionTreeModel in the pyspark

2014-12-12 Thread Gen
model in pyspark that we cannot pickle. FYI: I use spark 1.1.1. Do you have any idea how to solve this problem? (I don't know whether using scala can solve this problem or not.) Thanks a lot in advance for your help. Cheers Gen

Re: Does filter on an RDD scan every data item ?

2014-12-02 Thread Gen
Hi, For your first question, I think that we can use sc.parallelize(rdd.take(1000)). For your second question, I am not sure. But I don't think that we can restrict the filter to certain partitions without scanning every element. Cheers Gen nsareen wrote > Hi , > > I wanted som
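A minimal sketch of that first suggestion (take() only pulls the requested number of records to the driver, so it avoids a full scan; the 1000 is arbitrary):
{code}
first_records = rdd.take(1000)             # a Python list of at most 1000 elements
small_rdd = sc.parallelize(first_records)  # re-distribute them as a new, small RDD
{code}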

Re: driver memory

2014-11-21 Thread Gen
Hi, I am sorry for disturbing you, and thank you for your explanation. However, I find that "spark.driver.memory" is also used for standalone mode. (I set this in spark/conf/spark-defaults.conf). Cheers Gen Andrew Or-2 wrote > Hi Maria, > > SPARK_MEM is actually deprecated because
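For reference, the property discussed here is set in conf/spark-defaults.conf like this (the 4g value is only an example; adjust it to your cluster):
{code}
# conf/spark-defaults.conf
spark.driver.memory   4g
{code}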

Re: --executor-cores cannot change vcores in yarn?

2014-11-03 Thread Gen
ores. And I used the top command to monitor the cpu utilization during the spark task. Spark can use all the cpus even if I leave --executor-cores at the default (1). Hope that it can be of help. Cheers Gen Gen wrote > Hi, > > Maybe it is a stupid question, but I am running spark on yarn. I request

--executor-cores cannot change vcores in yarn?

2014-11-01 Thread Gen
-status ID / to monitor the situation of the cluster. It shows that the number of Vcores used for each container is always 1, no matter what number I pass to --executor-cores. Any ideas how to solve this problem? Thanks a lot in advance for your help. Cheers Gen

Re: Executor and BlockManager memory size

2014-10-31 Thread Gen
.compute.internal:38770 with 1294.1 MB RAM* So, according to the documentation, just 2156.83m is allocated to the executor. Moreover, according to yarn, 3072m of memory is used for this container. Do you have any ideas about this? Thanks a lot Cheers Gen Boromir Widas wrote > Hey Larry, >
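A plausible reading of these figures, assuming Spark 1.x defaults: the BlockManager registers with roughly the executor heap times spark.storage.memoryFraction, i.e. 2156.83 MB × 0.6 ≈ 1294.1 MB, and the 3072m YARN container presumably includes spark.yarn.executor.memoryOverhead (plus YARN's rounding of container sizes) on top of the executor heap.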

Re: Python code crashing on ReduceByKey if I return custom class object

2014-10-27 Thread Gen
https://issues.apache.org/jira/browse/SPARK-2652?filter=-2 . Cheers. Gen sid wrote > Hi, I am new to spark and I am trying to use pyspark. > > I am trying to find the mean of 128 dimension vectors present in a file.

Re: ALS implicit error pyspark

2014-10-20 Thread Gen
go there for more information or make a contribution to fix this problem. Cheers Gen Gen wrote > Hi, > > I am trying to use ALS.trainImplicit method in the > pyspark.mllib.recommendation. However it didn't work. So I tried use the > example in the python API docume

Re: How to aggregate data in Apach Spark

2014-10-20 Thread Gen
Hi, I will write the code in python
{code:title=test.py}
data = sc.textFile(...).map(...)
## Please make sure that the rdd is like [[id, c1, c2, c3], [id, c1, c2, c3], ...]
keypair = data.map(lambda l: ((l[0], l[1], l[2]), float(l[3])))
keypair = keypair.reduceByKey(add)
out = keypair.map(lambda l: l

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
Hi, I created an issue in JIRA: https://issues.apache.org/jira/browse/SPARK-3990 I uploaded the error information in JIRA. Thanks in advance for your help. Best Gen Davies Liu-2 wrote > It seems a bug, Could you create a JIRA for i

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
nt at ALS.scala:314 . I will take a look at the log and try to find the problem. Best Gen Davies Liu-2 wrote > I can run the following code against Spark 1.1 > > sc = SparkContext() > r1 = (1, 1, 1.0) > r2 = (1, 2, 2.0) > r3 = (2, 1, 2.0) > ratings = sc.parallelize([r1, r2, r3]) &

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
, for example, ALS.trainImplicit(ratings, rank, 10), and it didn't work. After several tests, I found that only iterations = 1 works for pyspark. But for scala, all the values work. Best Gen Davies Liu-2 wrote > On Thu, Oct 16, 2014 at 9:53 AM, Gen < > gen.tang86@ > > wrot

Re: ALS implicit error pyspark

2014-10-16 Thread Gen
; intentionally) 14/10/16 19:22:44 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 975.0, whose tasks have all completed, from pool Gen wrote > Hi, > > I am trying to use ALS.trainImplicit method in the > pyspark.mllib.recommendation. However it didn't work. So I tried use the > e

Re: How to make operation like cogrop() , groupbykey() on pair RDD = [ [ ], [ ] , [ ] ]

2014-10-16 Thread Gen
Hi, You just need to add list() in the sorted function. For example, map((lambda (x, y): (x, (list(y[0]), list(y[1])))), sorted(list(rdd1.cogroup(rdd2).collect()))) I think you just forgot the list()... PS: your post has NOT been accepted by the mailing list yet. Best Gen pm wrote >

ALS implicit error pyspark

2014-10-16 Thread Gen
Hi, I am trying to use the ALS.trainImplicit method in pyspark.mllib.recommendation. However it didn't work. So I tried to use the example in the python API documentation, such as: r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) ratings = sc.parallelize([r1, r2, r3]) model = ALS.trainImplicit
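For context, the documented example being attempted is roughly the following sketch (the rank/iterations values are illustrative; in this thread the call was failing in PySpark 1.1 and was later tracked as SPARK-3990):
{code}
from pyspark.mllib.recommendation import ALS

r1 = (1, 1, 1.0)
r2 = (1, 2, 2.0)
r3 = (2, 1, 2.0)
ratings = sc.parallelize([r1, r2, r3])
# rank=1, iterations=10: per the thread, only iterations=1 worked in pyspark at the time
model = ALS.trainImplicit(ratings, 1, 10)
{code}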

Re: How to make operation like cogrop() , groupbykey() on pair RDD = [ [ ], [ ] , [ ] ]

2014-10-15 Thread Gen
What results do you want? If your pair is like (a, b), where "a" is the key and "b" is the value, you can try rdd1 = rdd1.flatMap(lambda l: l) and then use cogroup. Best Gen
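A minimal sketch of that suggestion (assuming rdd1 is a nested RDD of key/value pairs like [[(k, v), ...], [(k, v), ...]] and rdd2 is already a pair RDD):
{code}
# Flatten one level of nesting so rdd1 becomes a plain pair RDD of (key, value)
rdd1 = rdd1.flatMap(lambda l: l)
# Now the two pair RDDs can be cogrouped on their keys
grouped = rdd1.cogroup(rdd2)
{code}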

Re: S3 Bucket Access

2014-10-14 Thread Gen
Hi, If I remember well, spark cannot use the IAM role credentials to access s3. It uses first the id/key in the environment. If they are null in the environment, it uses the values in the file core-site.xml. So, an IAM role is not useful for spark. The same problem happens if you want to use distcp co
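One common workaround at the time, sketched here with placeholder credentials, was to set the s3n keys explicitly on the Hadoop configuration from PySpark (equivalent to putting them in core-site.xml):
{code}
# Placeholder credentials and bucket; the Hadoop conf is reached through the underlying Java SparkContext
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
rdd = sc.textFile("s3n://your-bucket/path/")
{code}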

Re: SparkSQL: select syntax

2014-10-14 Thread Gen
l cause doublecolumn error. Cheers Gen Hao Ren wrote > Update: > > This syntax is mainly for avoiding retyping column names. > > Let's take the example in my previous post, where * > a * > is a table of 15 columns, * > b * > has 5 columns, after a join, I have

Re: S3 Bucket Access

2014-10-14 Thread Gen
Hi, Are you sure that the id/key that you used can access s3? You can try to use the same id/key through the python boto package to test it. Because I have almost the same situation as yours, but I can access s3. Best

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Gen
Hi, in fact, the same problem happens when I try several joins together: SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY) py4j.protocol.Py4JJavaError: An error occurred while callin

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread TANG Gen
Hi, the same problem happens when I try several joins together, such as 'SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)' The error information is as follows: py4j.protocol.Py4JJavaEr

Re: partitions number with variable number of cores

2014-10-03 Thread Gen
Maybe I am wrong, but how many resources a spark application can use depends on the mode of deployment (the type of resource manager); you can take a look at https://spark.apache.org/docs/latest/job-scheduling.html . For your case, I

Re: pyspark on python 3

2014-10-03 Thread Gen
According to the official site of spark, the latest version of spark (1.1.0) does not work with python 3: "Spark 1.1.0 works with Python 2.6 or higher (but not Python 3). It uses the standard CPython interpreter, so C libraries like NumPy can be used."

Re: Spark Monitoring with Ganglia

2014-10-03 Thread TANG Gen
Maybe you can follow the instructions at this link: https://github.com/mesos/spark-ec2/tree/v3/ganglia . For me it works well.

Re: The question about mount ephemeral disk in slave-setup.sh

2014-10-03 Thread TANG Gen
I have taken a look at the code of mesos spark-ec2 and the documentation of AWS. I think that maybe I found the answer. In fact, there are two types of AMI in AWS: EBS-backed AMI and instance-store-backed AMI. For an EBS-backed AMI, we can add instance store volumes when we create the image (the details can

The question about mount ephemeral disk in slave-setup.sh

2014-10-03 Thread TANG Gen
disks are only mounted if the instance type begins with r3. For other instance types, are their ephemeral disks mounted or not? If yes, which script mounts them, or are they mounted automatically by AWS? Thanks a lot for your help in advance. Best regards Gen