RowMatrix PCA out of heap space error

2014-10-13 Thread Yang
I got this error when trying to perform PCA on a sparse matrix. Each row has a nominal length of 8000, and there are 36k rows; each row has on average 3 non-zero elements. I guess the total size is not that big. Exception in thread main java.lang.OutOfMemoryError: Java heap space at
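A minimal sketch of the setup being described, assuming MLlib's RowMatrix (input sizes taken from the post, data values made up). Note that computePrincipalComponents materializes a dense 8000 x 8000 covariance matrix on the driver regardless of input sparsity (roughly 512 MB of doubles), which may explain the heap error.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// 36k sparse rows of nominal width 8000, a handful of non-zeros each (values made up)
val rows = sc.parallelize(0 until 36000).map { i =>
  Vectors.sparse(8000, Array(i % 8000), Array(1.0))
}
val mat = new RowMatrix(rows)

// Builds a dense 8000 x 8000 Gramian/covariance on the driver, so the driver
// heap has to be sized for that even though the rows themselves are sparse.
val pc = mat.computePrincipalComponents(10)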

buffer overflow when running Kmeans

2014-10-21 Thread Yang
this is the stack trace I got with yarn logs -applicationId. Really no idea where to dig further. thanks! yang 14/10/21 14:36:43 INFO ConnectionManager: Accepted connection from [ phxaishdc9dn1262.stratus.phx.ebay.com/10.115.58.21] 14/10/21 14:36:47 ERROR Executor: Exception in task ID 98

version mismatch issue with spark breeze vector

2014-10-22 Thread Yang
<artifactId>scala-library</artifactId> <version>2.10.4</version> </dependency> Thanks a lot Yang

how to run a dev spark project without fully rebuilding the fat jar ?

2014-10-22 Thread Yang
During tests, I often modify my code a little bit and want to see the result, but spark-submit requires the full fat jar, which takes quite a lot of time to build. I just need to run in --master local mode. Is there a way to run it without rebuilding the fat jar? thanks Yang
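One hedged way to get this (not from the thread itself): set the master to local[*] in code and run the main class straight from sbt or an IDE, assuming the Spark dependencies are on the compile classpath rather than marked "provided". Class and app names below are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object DevMain {
  def main(args: Array[String]): Unit = {
    // Hard-coding a local master lets "sbt run" (or an IDE run configuration)
    // execute the job directly, skipping spark-submit and the fat-jar build.
    val conf = new SparkConf().setAppName("dev-run").setMaster("local[*]")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).sum())
    sc.stop()
  }
}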

how to start reading the spark source code?

2015-07-19 Thread Yang
. thanks! yang

Re: how to start reading the spark source code?

2015-07-20 Thread Yang
: %s.format(numAs, numBs)) } } then I debug through this and it became fairly clear On Sun, Jul 19, 2015 at 10:13 PM, Yang tedd...@gmail.com wrote: thanks, my point is that earlier versions are normally much simpler so it's easier to follow. and the basic structure should at least bare great

Re: how to start reading the spark source code?

2015-07-20 Thread Yang
(Task[] ) through serialization. On Mon, Jul 20, 2015 at 12:38 AM, Yang tedd...@gmail.com wrote: ok got some headstart: pull the git source to 14719b93ff4ea7c3234a9389621be3c97fa278b9 (first release so that I could at least build it) then build it according to README.md, then get

Re: how to start reading the spark source code?

2015-07-19 Thread Yang
why you started with such an early commit. Spark project has evolved quite fast. I suggest you clone Spark project from github.com/apache/spark/ and start with core/src/main/scala/org/apache/spark/rdd/RDD.scala Cheers On Sun, Jul 19, 2015 at 7:44 PM, Yang tedd...@gmail.com wrote: I'm

how do I set TBLPROPERTIES in dataFrame.saveAsTable()?

2016-06-15 Thread Yang
I tried df.options(Map(prop_name -> prop_value)).saveAsTable(tb_name), but it doesn't seem to work. thanks a lot!
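A hedged workaround (not confirmed in the thread), assuming a Hive-backed sqlContext: write the table first, then set the property with a HiveQL ALTER TABLE statement. Table and property names are placeholders.

// Write the table, then attach the property afterwards via HiveQL.
df.write.saveAsTable("tb_name")
sqlContext.sql("ALTER TABLE tb_name SET TBLPROPERTIES ('prop_name' = 'prop_value')")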

type-safe join in the new DataSet API?

2016-11-10 Thread Yang
the new DataSet API is supposed to provide type safety and type checks at compile time https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#join-operations It does this indeed for a lot of places, but I found it still doesn't have a type safe join: val ds1 =
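A small sketch of the point being made, assuming a SparkSession named spark and hypothetical case classes: joinWith yields a typed Dataset[(User, Order)], but the join condition itself is still an untyped Column.

import org.apache.spark.sql.Dataset
import spark.implicits._

case class User(id: Long, name: String)
case class Order(userId: Long, amount: Double)

val users  = Seq(User(1L, "a"), User(2L, "b")).toDS()
val orders = Seq(Order(1L, 10.0), Order(1L, 5.0)).toDS()

// The result element type is checked at compile time...
val joined: Dataset[(User, Order)] =
  users.joinWith(orders, users("id") === orders("userId"))
// ...but the condition refers to columns by name, so a typo in "userId" only
// fails at runtime, which is the gap the original post is pointing at.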

Re: question about the new Dataset API

2016-10-19 Thread Yang
2| +-++ On Tue, Oct 18, 2016 at 11:30 PM, Yang <tedd...@gmail.com> wrote: > scala> val a = sc.parallelize(Array((1,2),(3,4))) > a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[243] at > parallelize at :38 > > scala> val a_ds = hc.di.createDa

question about the new Dataset API

2016-10-19 Thread Yang
scala> val a = sc.parallelize(Array((1,2),(3,4))) a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[243] at parallelize at :38 scala> val a_ds = hc.di.createDataFrame(a).as[(Long,Long)] a_ds: org.apache.spark.sql.Dataset[(Long, Long)] = [_1: int, _2: int] scala>

Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-19 Thread Yang
>> > per-iteration time). >> >> > >> >> > Note that the current impl forces dense arrays for intermediate data >> >> > structures, increasing the communication cost significantly. See this >> PR for >> >> > info: https://githu

Re: RDD groupBy() then random sort each group ?

2016-10-23 Thread Yang
ame wrapping your RDD, and $"id" % 10 with the key > to group by, then you can get the RDD from shuffled and do the following > operations you want. > > Cheng > > > > On 10/20/16 10:53 AM, Yang wrote: > >> in my application, I group by same training sa

Re: RDD groupBy() then random sort each group ?

2016-10-23 Thread Yang
uld avoid it for large groups. > > The key is to never materialize the grouped and shuffled data. > > To see one approach to do this take a look at > https://github.com/tresata/spark-sorted > > It's basically a combination of smart partitioning and secondary sort. &g

previous stage results are not saved?

2016-10-17 Thread Yang
while making small changes to the code. any idea what part of the spark framework might have caused this ? thanks Yang

RDD groupBy() then random sort each group ?

2016-10-20 Thread Yang
in my application, I group by same training samples by their model_id's (the input table contains training samples for 100k different models), then each group ends up having about 1 million training samples, then I feed that group of samples to a little Logistic Regression solver (SGD), but SGD
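A minimal sketch of the straightforward version (groupByKey followed by an in-memory shuffle of each group); as the replies note, this materializes each whole group on one executor, so it is only safe when groups fit in memory.

import scala.util.Random
import org.apache.spark.rdd.RDD

// Hypothetical (model_id, sample) pairs standing in for the training table.
val samples: RDD[(Int, String)] =
  sc.parallelize(Seq((1, "s1"), (1, "s2"), (1, "s3"), (2, "s4"), (2, "s5")))

// Collect each group and shuffle it in memory before handing it to the solver.
val randomized: RDD[(Int, Seq[String])] =
  samples.groupByKey().mapValues(group => Random.shuffle(group.toSeq))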

task not serializable in case of groupByKey() + mapGroups + map?

2016-10-31 Thread Yang
with the following simple code: val a = sc.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] val grouped = a.groupByKey({x:(Int,Int)=>x._1}) val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) val yyy = sc.broadcast(1) val last = mappedGroups.rdd.map(xx=>{

question on the structured DataSet API join

2016-10-17 Thread Yang
I'm trying to use the joinWith() method instead of join() since the former provides type checked result while the latter is a straight DataFrame. the signature is DataSet[(T,U)] joinWith(other:DataSet[U], col:Column) here the second arg, col:Column is normally provided by

L1 regularized Logistic regression ?

2017-01-04 Thread Yang
does mllib support this? I do see Lasso impl here https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/Lasso.scala if it supports LR , could you please show me a link? what algorithm does it use? thanks

Re: L1 regularized Logistic regression ?

2017-01-04 Thread Yang
regression.html#logistic-regression > > You'd set elasticnetparam = 1 for Lasso > > On Wed, Jan 4, 2017 at 7:13 PM, Yang <tedd...@gmail.com> wrote: > >> does mllib support this? >> >> I do see Lasso impl here https://github.com/apache >> /spark/blob/maste
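A short sketch of the suggestion in this reply: spark.ml's LogisticRegression with elasticNetParam = 1.0 gives pure L1 (lasso-style) regularization. trainingDF is a hypothetical DataFrame with "label" and "features" columns, and the regParam value is a placeholder.

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setRegParam(0.1)          // overall regularization strength (placeholder value)
  .setElasticNetParam(1.0)   // 1.0 = pure L1 (lasso), 0.0 = pure L2 (ridge)

// val model = lr.fit(trainingDF)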

spark-shell fails to redefine values

2016-12-21 Thread Yang
summary: spark-shell fails to redefine values in some cases; this is at least found in a case where "implicit" is involved, but not limited to such cases. Run the following in spark-shell and you can see that the last redefinition does not take effect. The same code runs in the plain scala REPL without

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
. This way when I encode the wrapper, the bean encoder simply encodes the getContent() output, I think. encoding a list of tuples is very fast. Yang On Tue, May 9, 2017 at 11:19 AM, Michael Armbrust <mich...@databricks.com> wrote: > I think you are supposed to set BeanProperty on a var a

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
> <https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala#L71-L83>. > If you are using scala though I'd consider using the case class encoders. > > On Tue, May 9, 2017 at 12:21 AM, Yang

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala#L71-L83>. > If you are using scala though I'd consider using the case class encoders. > > On Tue, May 9, 2017 at 12:21 AM, Yang <tedd...@gmail.com> wrote: > >> I'm trying to use Encoders.bean() t

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
2.0.2 with scala 2.11 On Tue, May 9, 2017 at 11:30 AM, Michael Armbrust <mich...@databricks.com> wrote: > Which version of Spark? > > On Tue, May 9, 2017 at 11:28 AM, Yang <tedd...@gmail.com> wrote: > >> actually with var it's the same:

how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
I'm trying to use Encoders.bean() to create an encoder for my custom class, but it fails, complaining that it can't find the schema: class Person4 { @scala.beans.BeanProperty def setX(x:Int): Unit = {} @scala.beans.BeanProperty def getX():Int = {1} } val personEncoder = Encoders.bean[
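A sketch of the fix discussed later in the thread: annotate a var with @BeanProperty so the generated getter/setter pair is what the bean encoder reflects on, instead of hand-written defs.

import scala.beans.BeanProperty
import org.apache.spark.sql.Encoders

class Person4 {
  // @BeanProperty on a var generates getX/setX, which Encoders.bean can pick up.
  @BeanProperty var x: Int = 1
}

val personEncoder = Encoders.bean(classOf[Person4])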

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
s.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/908554720841389/2840265927289860/latest.html> > in > Spark 2.1. > > On Tue, May 9, 2017 at 12:10 PM, Yang <tedd...@gmail.com> wrote: > >> somehow the schema check is here >> >> https://g

Re: Master registers itself at startup?

2014-04-13 Thread YouPeng Yang
Hi The 512MB is the default memory size that each executor needs, and actually your job does not need as much as the default. You can create a SparkContext with sc = new SparkContext("local-cluster[2,1,512]", "test") // suppose you use the local-cluster mode. Here the 512 is the
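A small sketch of setting these values explicitly through SparkConf (numbers are placeholders), for readers who do want to override the 512 MB default rather than rely on it:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local-cluster[2,1,512]")   // 2 workers, 1 core each, 512 MB per worker
  .setAppName("test")
  .set("spark.executor.memory", "512m")  // explicit executor heap, raise if needed
val sc = new SparkContext(conf)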

question about the SocketReceiver

2014-04-20 Thread YouPeng Yang
Hi I am studying the structure of Spark Streaming (my spark version is 0.9.0). I have a question about the SocketReceiver. In the onStart function: --- protected def onStart() { logInfo("Connecting to " + host + ":" + port) val socket

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread YouPeng Yang
Hi I am also curious about this question. Is the textFile function supposed to read an HDFS file? In this case, the file was taken from the local filesystem. Is there any way for textFile to tell the local filesystem apart from HDFS? Besides, the OOM

question about the license of akka and Spark

2014-05-20 Thread YouPeng Yang
Hi I just learned that Akka is under a commercial license, whereas Spark is under the Apache license. Is there any problem? Regards

Re: SQLContext and HiveContext Query Performance

2014-06-04 Thread Zongheng Yang
Hi, Just wondering if you can try this: val obj = sql("select manufacturer, count(*) as examcount from pft group by manufacturer order by examcount desc") obj.collect() obj.queryExecution.executedPlan.executeCollect() and time the third line alone. It could be that Spark SQL is taking some time to

Re: convert List to RDD

2014-06-13 Thread Zongheng Yang
I may be wrong, but I think RDDs must be created inside a SparkContext. To somehow preserve the order of the list, perhaps you could try something like: sc.parallelize((1 to xs.size).zip(xs)) On Fri, Jun 13, 2014 at 6:08 PM, SK skrishna...@gmail.com wrote: Hi, I have a List[ (String, Int,
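Spelling out the suggestion with a toy list (xs is hypothetical): pair each element with its original position so downstream code can sort or group on the index to recover the order.

val xs = List(("a", 1), ("b", 2), ("c", 3))

// Each element keeps its 1-based position from the original list.
val indexed = sc.parallelize((1 to xs.size).zip(xs))

// e.g. restore the original order after shuffle-heavy transformations:
val restored = indexed.sortByKey().values.collect()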

Re: convert List to RDD

2014-06-13 Thread Zongheng Yang
Sorry I wasn't being clear. The idea off the top of my head was that you could append an original position index to each element (using the line above), and modify whatever processing functions you have in mind to make them aware of these indices. And I think you are right that RDD collections

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Zongheng Yang
If your input data is JSON, you can also try out the recently merged in initial JSON support: https://github.com/apache/spark/commit/d2f4f30b12f99358953e2781957468e2cfe3c916 On Wed, Jun 18, 2014 at 5:27 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That’s pretty neat! So I guess if you

Re: SparkR Installation

2014-06-19 Thread Zongheng Yang
Hi Stuti, Yes, you do need to install R on all nodes. Furthermore, the rJava library is also required, which can be installed simply using install.packages("rJava") in the R shell. Some more installation instructions after that step can be found in the README here:

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Zongheng Yang
Hi durin, I just tried this example (nice data, by the way!), *with each JSON object on one line*, and it worked fine: scala rdd.printSchema() root |-- entities: org.apache.spark.sql.catalyst.types.StructType$@13b6cdef ||-- friends:

Re: SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread Zongheng Yang
Hi Haoming, For your spark-submit question: can you try using an assembly jar (sbt/sbt assembly will build it for you)? Another thing to check is if there is any package structure that contains your SimpleApp; if so you should include the hierarchal name. Zongheng On Thu, Jul 10, 2014 at 11:33

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-11 Thread Zongheng Yang
Hey Jerry, When you ran these queries using different methods, did you see any discrepancy in the returned results (i.e. the counts)? On Thu, Jul 10, 2014 at 5:55 PM, Michael Armbrust mich...@databricks.com wrote: Yeah, sorry. I think you are seeing some weirdness with partitioned tables that

Re: Count distinct with groupBy usage

2014-07-15 Thread Zongheng Yang
Sounds like a job for Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html ! On Tue, Jul 15, 2014 at 11:25 AM, Nick Pentreath nick.pentre...@gmail.com wrote: You can use .distinct.count on your user RDD. What are you trying to achieve with the time group by? — Sent from
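For readers landing here from search, a hedged sketch of the same idea in the later DataFrame API (df and the column names are hypothetical): a distinct count per group in one pass.

import org.apache.spark.sql.functions.countDistinct

// Distinct users per time bucket.
val counts = df.groupBy("timeBucket").agg(countDistinct("userId"))
counts.show()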

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
FWIW, I am unable to reproduce this using the example program locally. On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons keith.simm...@gmail.com wrote: Nope. All of them are registered from the driver program. However, I think we've found the culprit. If the join column between two tables is

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
- user@incubator Hi Keith, I did reproduce this using local-cluster[2,2,1024], and the errors look almost the same. Just wondering, despite the errors did your program output any result for the join? On my machine, I could see the correct output. Zongheng On Tue, Jul 15, 2014 at 1:46 PM,

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
Hi Keith gorenuru, This patch (https://github.com/apache/spark/pull/1423) solves the errors for me in my local tests. If possible, can you guys test this out to see if it solves your test programs? Thanks, Zongheng On Tue, Jul 15, 2014 at 3:08 PM, Zongheng Yang zonghen...@gmail.com wrote

Re: replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Zongheng Yang
One way is to set this in your conf/spark-defaults.conf: spark.executor.extraLibraryPath /path/to/native/lib The key is documented here: http://spark.apache.org/docs/latest/configuration.html On Thu, Jul 17, 2014 at 1:25 PM, Eric Friedman eric.d.fried...@gmail.com wrote: I used to use

error from DecisonTree Training:

2014-07-18 Thread Jack Yang
Hi All, I got an error while using DecisionTreeModel (my program is written in Java, spark 1.0.1, scala 2.10.1). I have read a local file, loaded it as an RDD, and then sent it to DecisionTree for training. See below for details: JavaRDD<LabeledPoint> Points = lines.map(new ParsePoint()).cache();

RE: error from DecisonTree Training:

2014-07-21 Thread Jack Yang
So is this a bug still unsolved (for Java)? From: Jack Yang [mailto:j...@uow.edu.au] Sent: Friday, 18 July 2014 4:52 PM To: user@spark.apache.org Subject: error from DecisonTree Training: Hi All, I got an error while using DecisionTreeModel (my program is written in Java, spark 1.0.1, scala

RE: error from DecisonTree Training:

2014-07-21 Thread Jack Yang
is working on it. -Xiangrui On Mon, Jul 21, 2014 at 4:20 PM, Jack Yang j...@uow.edu.au wrote: So this is a bug unsolved (for java) yet? From: Jack Yang [mailto:j...@uow.edu.au] Sent: Friday, 18 July 2014 4:52 PM To: user@spark.apache.org Subject: error from DecisonTree Training: Hi All

Re: How to do an interactive Spark SQL

2014-07-22 Thread Zongheng Yang
Do you mean that the texts of the SQL queries are hardcoded in the code? What do you mean by cannot share the sql to all workers? On Tue, Jul 22, 2014 at 4:03 PM, hsy...@gmail.com hsy...@gmail.com wrote: Hi guys, I'm able to run some Spark SQL example but the sql is static in the code. I

Re: How to do an interactive Spark SQL

2014-07-22 Thread Zongheng Yang
, Siyuan On Tue, Jul 22, 2014 at 4:15 PM, Zongheng Yang zonghen...@gmail.com wrote: Do you mean that the texts of the SQL queries being hardcoded in the code? What do you mean by cannot shar the sql to all workers? On Tue, Jul 22, 2014 at 4:03 PM, hsy...@gmail.com hsy...@gmail.com wrote: Hi guys

Re: SparkSQL can not use SchemaRDD from Hive

2014-07-29 Thread Zongheng Yang
As Hao already mentioned, using 'hive' (the HiveContext) throughout would work. On Monday, July 28, 2014, Cheng, Hao hao.ch...@intel.com wrote: In your code snippet, sample is actually a SchemaRDD, and SchemaRDD actually binds a certain SQLContext in runtime, I don't think we can

Re: the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-30 Thread Zongheng Yang
To add to this: for this many (>= 20) machines I usually use at least --wait 600. On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: William, The error you are seeing is misleading. There is no need to terminate the cluster and start over. Just re-run your

Re: SchemaRDD select expression

2014-07-31 Thread Zongheng Yang
countDistinct is recently added and is in 1.0.2. If you are using that or the master branch, you could try something like: r.select('keyword, countDistinct('userId)).groupBy('keyword) On Thu, Jul 31, 2014 at 12:27 PM, buntu buntu...@gmail.com wrote: I'm looking to write a select statement

Re: SchemaRDD select expression

2014-07-31 Thread Zongheng Yang
, Buntu Dev buntu...@gmail.com wrote: Thanks Zongheng for the pointer. Is there a way to achieve the same in 1.0.0 ? On Thu, Jul 31, 2014 at 1:43 PM, Zongheng Yang zonghen...@gmail.com wrote: countDistinct is recently added and is in 1.0.2. If you are using that or the master branch, you could

Re: Visualizing stage task dependency graph

2014-08-04 Thread Zongheng Yang
I agree that this is definitely useful. One related project I know of is Sparkling [1] (also see talk at Spark Summit 2014 [2]), but it'd be great (and I imagine somewhat challenging) to visualize the *physical execution* graph of a Spark job. [1] http://pr01.uml.edu/ [2]

Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-07 Thread Zongheng Yang
Hi Pranay, If this data format is to be assumed, then I believe the issue starts at lines <- textFile(sc, "/sparkdev/datafiles/covariance.txt") totals <- lapply(lines, function(lines) After the first line, `lines` becomes an RDD of strings, each of which is a line of the form 1,1.

Spark on Mesos cause mesos-master OOM

2014-08-22 Thread Chengwei Yang
Hi List, We're recently trying to run Spark on Mesos; however, we encountered a fatal error: the mesos-master process continuously consumes memory and is finally killed by the OOM killer. This only happens when a Spark job (fine-grained mode) is running. We finally root-caused the

dealing with large values in kv pairs

2014-11-10 Thread YANG Fan
Hi, I've got a huge list of key-value pairs, where the key is an integer and the value is a long string (around 1KB). I want to concatenate the strings with the same keys. Initially I did something like: pairs.reduceByKey((a, b) => a + " " + b) Then tried to save the result to HDFS. But it was
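One commonly suggested alternative (not necessarily what this thread settled on): build each value with aggregateByKey and a StringBuilder, so per-key concatenation does not allocate a fresh String on every merge the way (a, b) => a + " " + b does.

import org.apache.spark.rdd.RDD

// Toy stand-in for the real (Int, ~1KB String) pairs.
val pairs: RDD[(Int, String)] =
  sc.parallelize(Seq((1, "aaa"), (1, "bbb"), (2, "ccc")))

val concatenated: RDD[(Int, String)] = pairs
  .aggregateByKey(new StringBuilder)(
    (sb, v) => sb.append(v).append(' '),   // fold a value into a partition-local builder
    (sb1, sb2) => sb1.append(sb2))         // merge builders across partitions
  .mapValues(_.toString.trim)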

Questions Regarding to MPI Program Migration to Spark

2014-11-16 Thread Jun Yang
Guys, Recently we are migrating our backend pipeline from to Spark. In our pipeline, we have a MPI-based HAC implementation, to ensure the result consistency of migration, we also want to migrate this MPI-implemented code to Spark. However, during the migration process, I found that there are

Re: k-means clustering

2014-11-20 Thread Jun Yang
Guys, As to the question of pre-processing, you could just migrate your logic to Spark before using K-means. I have only used Scala on Spark, and haven't used the Python binding, but I think the basic steps must be the same. BTW, if your data set is big with a huge sparse feature dimension

Re: Book: Data Analysis with SparkR

2014-11-21 Thread Zongheng Yang
Hi Daniel, Thanks for your email! We don't have a book (yet?) specifically on SparkR, but here's a list of helpful tutorials / links you can check out (I am listing them in roughly basic - advanced order): - AMPCamp5 SparkR exercises http://ampcamp.berkeley.edu/5/exercises/sparkr.html. This

spark-ec2 Web UI Problem

2014-12-04 Thread Xingwei Yang
the connection to the port 8080. I could not figure out how to solve it. Any suggestion is appreciated. Thanks a lot. -- Sincerely Yours Xingwei Yang https://sites.google.com/site/xingweiyang1223/

Unable to run applications on clusters on EC2

2014-12-04 Thread Xingwei Yang
-2.compute.amazonaws.com:7070: akka.remote.EndpointAssociationException: Association failed with [akka.tcp:// sparkmas...@ec2-54-149-92-187.us-west-2.compute.amazonaws.com:7070] Please let me know if you any any clue about it. Thanks a lot. -- Sincerely Yours Xingwei Yang https://sites.google.com

Transfer from RDD to JavaRDD

2014-12-05 Thread Xingwei Yang
, it shows an error like this: The method fromRDD(RDD<T>, ClassTag<T>) in the type JavaRDD is not applicable for the arguments (RDD<Vector>, ClassTag<Object>) Is there anything wrong with the method? Thanks a lot. -- Sincerely Yours Xingwei Yang https://sites.google.com/site/xingweiyang1223/

How to get driver id?

2014-12-12 Thread Xingwei Yang
Hi Guys: I want to kill an application but I could not find the driver id of the application from web ui. Is there any way to get it from command line? Thanks -- Sincerely Yours Xingwei Yang https://sites.google.com/site/xingweiyang1223/

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread bo yang
suggestion is to build Spark by yourself. Anyway, would like to see your update once you figure out the solution. Best wishes! Bo On Wed, Feb 4, 2015 at 4:47 AM, Corey Nolet cjno...@gmail.com wrote: Bo yang- I am using Spark 1.2.0 and undoubtedly there are older Guava classes which are being

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-03 Thread bo yang
Corey, Which version of Spark do you use? I am using Spark 1.2.0, and guava 15.0. It seems fine. Best, Bo On Tue, Feb 3, 2015 at 8:56 PM, M. Dale medal...@yahoo.com.invalid wrote: Try spark.yarn.user.classpath.first (see https://issues.apache.org/jira/browse/SPARK-2996 - only works for

WebUI on yarn through ssh tunnel affected by ami filtered

2015-02-06 Thread Qichi Yang
Hi folks, I am new to spark. I just got spark 1.2 to run on emr ami 3.3.1 (hadoop 2.4). I ssh to the emr master node and submit the job or start the shell. Everything runs well except the webUI. In order to see the UI, I used an ssh tunnel which forwards my dev machine port to the emr master node webUI

spark fault tolerance mechanism

2015-01-15 Thread YANG Fan
Hi, I'm quite interested in how Spark's fault tolerance works and I'd like to ask a question here. According to the paper, there are two kinds of dependencies--the wide dependency and the narrow dependency. My understanding is, if the operations I use are all narrow, then when one machine

RE: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Yang, Yuhao
Check spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala It can be used through sliding(windowSize: Int) in spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala Yuhao From: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: Thursday, February 12, 2015
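A caveat worth adding, with a hedged sketch: sliding(n) from mllib's RDDFunctions gives overlapping windows rather than Guava-style disjoint chunks; grouping by a computed chunk index is one way to get the latter.

import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize(1 to 10)

// Overlapping windows: Array(1,2,3), Array(2,3,4), ...
val windows = rdd.sliding(3).collect()

// Disjoint chunks of size 3, via an explicit chunk index:
val chunkSize = 3
val chunks = rdd.zipWithIndex()
  .map { case (v, i) => (i / chunkSize, v) }
  .groupByKey()
  .values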

Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-09 Thread Jun Yang
Guys, I have a question regarding the Spark 1.1 broadcast implementation. In our pipeline, we have a large multi-class LR model, which is about 1GiB in size. To employ the benefit of Spark parallelism, a natural thought is to broadcast this model file to the worker nodes. However, it looks that

Re: Question on Spark 1.3 SQL External Datasource

2015-03-17 Thread Yang Lei
Thanks Cheng for the clarification. Looking forward to this new API mentioned below. Yang Sent from my iPad On Mar 17, 2015, at 8:05 PM, Cheng Lian lian.cs@gmail.com wrote: Hey Yang, My comments are in-lined below. Cheng On 3/18/15 6:53 AM, Yang Lei wrote: Hello, I am

Question on Spark 1.3 SQL External Datasource

2015-03-17 Thread Yang Lei
Spark which filters are handled already, so that there is no redundant filtering. Appreciate comments and links to any existing documentation or discussion. Yang

Cloudant as Spark SQL External Datastore on Spark 1.3.0

2015-03-19 Thread Yang Lei
Check this out: https://github.com/cloudant/spark-cloudant. It supports both the DataFrame and SQL approaches for reading data from Cloudant and saving it. Looking forward to your feedback on the project. Yang

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Jun Yang
of operations, then there will be a lot of shuffle data. So You need to check in the worker logs and see what happened (whether DISK full etc.), We have streaming pipelines running for weeks without having any issues. Thanks Best Regards On Mon, Mar 16, 2015 at 12:40 PM, Jun Yang yangjun...@gmail.com

Question about Spark Streaming Receiver Failure

2015-03-16 Thread Jun Yang
Guys, We have a project which builds upon Spark streaming. We use Kafka as the input stream, and create 5 receivers. When this application runs for around 90 hour, all the 5 receivers failed for some unknown reasons. In my understanding, it is not guaranteed that Spark streaming receiver will

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Jun Yang
On Mon, Mar 16, 2015 at 12:40 PM, Jun Yang yangjun...@gmail.com wrote: Guys, We have a project which builds upon Spark streaming. We use Kafka as the input stream, and create 5 receivers. When this application runs for around 90 hour, all the 5 receivers failed for some unknown reasons

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Jun Yang
spawn another receiver on another machine or on the same machine. Thanks Best Regards On Mon, Mar 16, 2015 at 1:08 PM, Jun Yang yangjun...@gmail.com wrote: Dibyendu, Thanks for the reply. I am reading your project homepage now. One quick question I care about is: If the receivers

Re: Combining Many RDDs

2015-03-27 Thread Yang Chen
Hi Kelvin, Thank you. That works for me. I wrote my own joins that produced Scala collections, instead of using rdd.join. Regards, Yang On Thu, Mar 26, 2015 at 5:51 PM, Kelvin Chu 2dot7kel...@gmail.com wrote: Hi, I used union() before and yes it may be slow sometimes. I _guess_ your variable

Re: Combining Many RDDs

2015-03-26 Thread Yang Chen
Hi Mark, That's true, but in neither way can I combine the RDDs, so I have to avoid unions. Thanks, Yang On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra m...@clearstorydata.com wrote: RDD#union is not the same thing as SparkContext#union On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen y...@yang

Issue of running partitioned loading (RDD) in Spark External Datasource on Mesos

2015-04-20 Thread Yang Lei
? Thanks in advance for any suggestions on how to resolve this. Yang

Re: StreamingContext.textFileStream issue

2015-04-25 Thread Yang Lei
I have no problem running the socket text stream sample in the same environment. Thanks Yang Sent from my iPhone On Apr 25, 2015, at 1:30 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Make sure you are having >= 2 cores for your streaming application. Thanks Best Regards On Sat

Re: StreamingContext.textFileStream issue

2015-04-24 Thread Yang Lei
I hit the same issue (as if the directory had no files at all) when running the sample examples/src/main/python/streaming/hdfs_wordcount.py with a local directory and adding a file into that directory. Appreciate comments on how to resolve this. -- View this message in context:

Re: Spark on Mesos

2015-04-24 Thread Yang Lei
is using ip addresses for all communication by defining spark.driver.host, SPARK_PUBLIC_DNS, SPARK_LOCAL_IP, SPARK_LOCAL_HOST in the right place. Hope this helps. Yang. On Fri, Apr 24, 2015 at 5:15 PM, Stephen Carman scar...@coldlight.com wrote: So I can’t for the life of me to get something even

RE: The explanation of input text format using LDA in Spark

2015-05-08 Thread Yang, Yuhao
Hi Cui, Try to read the scala version of LDAExample, https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala The matrix you're referring to is the corpus after vectorization. One example, given a dict, [apple, orange, banana] 3
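A tiny sketch of the corpus format being described, assuming the hypothetical dict [apple, orange, banana]: each document becomes a (docId, term-count vector) pair before being handed to LDA.

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Vocabulary: index 0 = apple, 1 = orange, 2 = banana.
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(2.0, 1.0, 0.0)),   // "apple apple orange"
  (1L, Vectors.dense(0.0, 1.0, 3.0))    // "orange banana banana banana"
))

val ldaModel = new LDA().setK(2).run(corpus)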

RE: MLLIB - Storing the Trained Model

2015-06-23 Thread Yang, Yuhao
Hi Samsudhin, If possible, can you please provide a part of the code? Or perhaps try with the ut in RandomForestSuite to see if the issue repros. Regards, yuhao -Original Message- From: samsudhin [mailto:samsud...@pigstick.com] Sent: Tuesday, June 23, 2015 2:14 PM To:

log file directory

2015-07-28 Thread Jack Yang
Hi all, I have questions with regarding to the log file directory. That say if I run spark-submit --master local[4], where is the log file? Then how about if I run standalone spark-submit --master spark://mymaster:7077? Best regards, Jack

Re: How to create DataFrame from a binary file?

2015-08-09 Thread bo yang
through Spark SQL: https://www.linkedin.com/pulse/light-weight-self-service-data-query-through-spark-sql-bo-yang Take a look and feel free to let me know for any question. Best, Bo On Sat, Aug 8, 2015 at 1:42 PM, unk1102 umesh.ka...@gmail.com wrote: Hi how do we create DataFrame from a binary

Re: How to create DataFrame from a binary file?

2015-08-09 Thread bo yang
yang bobyan...@gmail.com wrote: You can create your own data schema (StructType in spark), and use following method to create data frame with your own data schema: sqlContext.createDataFrame(yourRDD, structType); I wrote a post on how to do it. You can also get the sample code there: Light

Re: Accessing S3 files with s3n://

2015-08-09 Thread bo yang
Hi Akshat, I find some open source library which implements S3 InputFormat for Hadoop. Then I use Spark newAPIHadoopRDD to load data via that S3 InputFormat. The open source library is https://github.com/ATLANTBH/emr-s3-io. It is a little old. I look inside it and make some changes. Then it

assertion failed error with GraphX

2015-07-19 Thread Jack Yang
Hi there, I got an error when running one simple GraphX program. My setting is: spark 1.4.0, Hadoop yarn 2.5, scala 2.10, with four virtual machines. If I construct one small graph (6 nodes, 4 edges) and run: println("triangleCount: %s".format(hdfs_graph.triangleCount().vertices.count()))

standalone to connect mysql

2015-07-20 Thread Jack Yang
Hi there, I would like to use spark to access the data in mysql. So firstly I tried to run the program using: spark-submit --class sparkwithscala.SqlApp --driver-class-path /home/lib/mysql-connector-java-5.1.34.jar --master local[4] /home/myjar.jar that returns me the correct results. Then I
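For context, a hedged sketch of the JDBC read itself (Spark 1.4-era API; host, database, credentials and table name are placeholders). The cluster failure discussed in the replies is typically the connector jar not reaching the executors, which is what the --jars suggestion addresses.

val props = new java.util.Properties()
props.setProperty("user", "dbuser")         // placeholder credentials
props.setProperty("password", "dbpass")

// Works only if mysql-connector-java is on both driver and executor classpaths
// (e.g. passed with --jars), not just --driver-class-path.
val df = sqlContext.read.jdbc("jdbc:mysql://dbhost:3306/mydb", "otherStu", props)
df.show()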

RE: standalone to connect mysql

2015-07-21 Thread Jack Yang
: sqlContext.sql(s"insert into Table newStu select * from otherStu") that works. Is there any document addressing that? Best regards, Jack From: Terry Hole [mailto:hujie.ea...@gmail.com] Sent: Tuesday, 21 July 2015 4:17 PM To: Jack Yang; user@spark.apache.org Subject: Re: standalone to connect mysql

Re: standalone to connect mysql

2015-07-21 Thread Jack Yang
, at 9:21 pm, Jack Yang j...@uow.edu.aumailto:j...@uow.edu.au wrote: No. I did not use hiveContext at this stage. I am talking the embedded SQL syntax for pure spark sql. Thanks, mate. On 21 Jul 2015, at 6:13 pm, Terry Hole hujie.ea...@gmail.commailto:hujie.ea...@gmail.com wrote: Jack, You can

Re: standalone to connect mysql

2015-07-21 Thread Jack Yang
July 2015 4:17 PM To: Jack Yang; user@spark.apache.orgmailto:user@spark.apache.org Subject: Re: standalone to connect mysql Maybe you can try: spark-submit --class sparkwithscala.SqlApp --jars /home/lib/mysql-connector-java-5.1.34.jar --master spark://hadoop1:7077 /home/myjar.jar Thanks! -Terry Hi

error with saveAsTextFile in local directory

2015-11-03 Thread Jack Yang
Hi all, I am saving some Hive query results into a local directory: val hdfsFilePath = "hdfs://master:ip/ tempFile "; val localFilePath = "file:///home/hduser/tempFile"; hiveContext.sql(s"""my hql codes here""") res.printSchema() --working res.show() --working res.map{ x => tranRow2Str(x)
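One hedged workaround when a single local file on the driver is all that is needed (only viable if the result fits in driver memory; formatting and paths are placeholders): collect the rows and write them with plain Java IO instead of saveAsTextFile.

import java.io.PrintWriter

val res = hiveContext.sql("""my hql codes here""")
val lines = res.map(_.mkString("\t")).collect()   // pull results back to the driver

val writer = new PrintWriter("/home/hduser/tempFile")
lines.foreach(l => writer.println(l))
writer.close()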

RE: error with saveAsTextFile in local directory

2015-11-03 Thread Jack Yang
Yes, mine is 1.4.0. Then is this problem to do with the version? I doubt that. Any comments please? From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, 4 November 2015 11:52 AM To: Jack Yang Cc: user@spark.apache.org Subject: Re: error with saveAsTextFile in local directory Looks

RE: No space left on device when running graphx job

2015-10-05 Thread Jack Yang
September 2015 12:27 AM To: Jack Yang Cc: Ted Yu; Andy Huang; user@spark.apache.org Subject: Re: No space left on device when running graphx job Would you mind sharing what your solution was? It would help those on the forum who might run into the same problem. Even it it’s a silly ‘gotcha

No space left on device when running graphx job

2015-09-24 Thread Jack Yang
Hi folk, I have an issue of graphx. (spark: 1.4.0 + 4 machines + 4G memory + 4 CPU cores) Basically, I load data using GraphLoader.edgeListFile mthod and then count number of nodes using: graph.vertices.count() method. The problem is : Lost task 11972.0 in stage 6.0 (TID 54585, 192.168.70.129):
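"No space left on device" during shuffle-heavy GraphX jobs is often the scratch directory filling up rather than HDFS. A hedged sketch of pointing spark.local.dir at a larger disk (the thread never states what the actual fix was, so this is only a common mitigation; the path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("graphx-triangle-count")
  .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")  // shuffle spill / temp space
val sc = new SparkContext(conf)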

RE: No space left on device when running graphx job

2015-09-24 Thread Jack Yang
Hi all, I resolved the problems. Thanks folk. Jack From: Jack Yang [mailto:j...@uow.edu.au] Sent: Friday, 25 September 2015 9:57 AM To: Ted Yu; Andy Huang Cc: user@spark.apache.org Subject: RE: No space left on device when running graphx job Also, please see the screenshot below from spark web
