Re: parquet late column materialization

2018-03-18 Thread nguyen duc Tuan
ion (https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-redshift-introduces-late-materialization-for-faster-query-processing/) > > Thanks.. > > > On Mar 18, 2018 8:09 PM, "nguyen duc Tuan" <newvalu...@gmail.com> wrote: > >> Hi @CPC, >> Pa

Re: parquet late column materialization

2018-03-18 Thread nguyen duc Tuan
Hi @CPC, Parquet is a columnar storage format, so if you want to read data from only one column, you can do that without reading all of your data. Spark SQL also has a query optimizer (see https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html), so it will
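A minimal sketch of that column-pruning point (path and column names are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.avg

    val spark = SparkSession.builder().appName("parquet-column-pruning").getOrCreate()

    // Because Parquet is columnar, this query only reads the "price" column chunks from disk.
    val df = spark.read.parquet("/data/events.parquet")   // hypothetical path
    df.select("price").agg(avg("price")).show()

    // df.select("price").explain() shows the pruned ReadSchema in the Parquet scan.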

Re: How does spark work?

2017-09-12 Thread nguyen duc Tuan
In general, an RDD, which is the central concept of Spark, is just a definition of how to get data and how to process it. Each partition of an RDD defines how to get/process one partition of the data. A series of transformations will transform every partition of data from the previous RDD. I give you a very easy
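A tiny sketch of that lazy transformation chain (illustrative only):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

    // Each transformation only *describes* how to derive new partitions from the
    // parent RDD's partitions; nothing is computed yet.
    val numbers = sc.parallelize(1 to 1000, numSlices = 4)
    val squares = numbers.map(x => x * x)
    val evens   = squares.filter(_ % 2 == 0)

    // Only an action triggers the actual work, partition by partition.
    println(evens.count())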

Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread nguyen duc Tuan
Hi Chapman, You can use "cluster by" to do what you want. https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by/ 2017-06-24 17:48 GMT+07:00 Saliya Ekanayake : > I haven't worked with datasets but would this help https://stackoverflow. >
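A hedged sketch of both forms (table and column names are hypothetical); CLUSTER BY is DISTRIBUTE BY plus SORT BY on the same columns, which mirrors repartitionAndSortWithinPartitions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // SQL form on a (hypothetical) registered table.
    val clustered = spark.sql("SELECT * FROM events CLUSTER BY user_id")

    // Equivalent Dataset/DataFrame form: shuffle by key, then sort inside each partition.
    val result = spark.table("events").repartition($"user_id").sortWithinPartitions($"user_id")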

Re: Kafka failover with multiple data centers

2017-03-27 Thread nguyen duc Tuan
rts timestamp-based index. 2017-03-28 5:24 GMT+07:00 Soumitra Johri <soumitra.siddha...@gmail.com>: > Hi, did you guys figure it out? > > Thanks > Soumitra > > On Sun, Mar 5, 2017 at 9:51 PM nguyen duc Tuan <newvalu...@gmail.com> > wrote: > >> Hi everyone, &g

Kafka failover with multiple data centers

2017-03-05 Thread nguyen duc Tuan
Hi everyone, We are deploying a Kafka cluster for ingesting streaming data. But sometimes some of the nodes on the cluster have trouble (a node dies, the Kafka daemon is killed...), and recovering data in Kafka can be very slow. It takes several hours to recover from a disaster. I saw a slide here

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-23 Thread nguyen duc Tuan
> > On Thu, 23 Feb 2017 at 02:01, nguyen duc Tuan <newvalu...@gmail.com> > wrote: > >> Hi Seth, >> Here's the parameters that I used in my experiments. >> - Number of executors: 16 >> - Executor's memories: vary from 1G -> 2G -> 3G >> - N

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-22 Thread nguyen duc Tuan
ementation that I used before ( >>> https://github.com/karlhigley/spark-neighbors ). I can run on my >>> dataset now. If someone has any suggestion, please tell me. >>> Thanks. >>> >>> 2017-02-12 9:25 GMT+07:00 nguyen duc Tuan <newvalu...@gmail.com&

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-12 Thread nguyen duc Tuan
After all, I switched back to the LSH implementation that I used before (https://github.com/karlhigley/spark-neighbors). I can run it on my dataset now. If someone has any suggestion, please tell me. Thanks. 2017-02-12 9:25 GMT+07:00 nguyen duc Tuan <newvalu...@gmail.com>: > Hi Timur, >

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread nguyen duc Tuan
GMT+07:00 Nick Pentreath <nick.pentre...@gmail.com>: > What other params are you using for the lsh transformer? > > Are the issues occurring during transform or during the similarity join? > > > On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan <newvalu...@gmai

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread nguyen duc Tuan
Debasish Das <debasish.da...@gmail.com>: > If it is 7m rows and 700k features (or say 1m features) brute force row > similarity will run fine as well...check out spark-4823...you can compare > quality with approximate variant... > On Feb 9, 2017 2:55 AM, "nguyen duc Tuan"

Practical configuration to run LSH in Spark 2.1.0

2017-02-08 Thread nguyen duc Tuan
Hi everyone, Since Spark 2.1.0 introduces LSH (http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing), we want to use LSH to find approximate nearest neighbors. Basically, we have a dataset with about 7M rows. We want to use cosine distance to measure the similarity
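For reference, Spark 2.1's built-in LSH covers Euclidean distance (BucketedRandomProjectionLSH) and Jaccard distance (MinHashLSH), not cosine directly; normalizing vectors to unit length makes Euclidean distance monotonic in cosine distance, so the former can stand in for it. A minimal sketch with illustrative parameter values only:

    import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

    // dataset: DataFrame with a Vector column named "features"; parameters need tuning for 7M rows.
    val lsh = new BucketedRandomProjectionLSH()
      .setBucketLength(2.0)
      .setNumHashTables(3)
      .setInputCol("features")
      .setOutputCol("hashes")

    val model = lsh.fit(dataset)
    val neighbors = model.approxSimilarityJoin(dataset, dataset, 1.0)   // 1.0 = distance threshold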

Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-08-27 Thread nguyen duc Tuan
The simplest solution that I found: use a browser extension which does that for you :D. For example, if you are using Chrome, you can use this extension: https://chrome.google.com/webstore/detail/easy-auto-refresh/aabcgdmkeabbnleenpncegpcngjpnjkc/related?hl=en Another way, but a bit more

Re: How to recommend most similar users using Spark ML

2016-07-15 Thread nguyen duc Tuan
Hi jeremycod, If you want to find the top N nearest neighbors for all users using an exact top-k algorithm, I recommend using the same approach as used in MLlib:

Re: how to add a column according to an existing column of a dataframe?

2016-06-30 Thread nguyen duc tuan
About the Spark issue that you refer to, it is not related to your problem :D In this case, all you have to do is use the withColumn function. For example: import org.apache.spark.sql.functions._ val getRange = udf((x: Int) => get price range code ...) val priceRange = resultPrice.withColumn("range",
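A hedged completion of that pseudocode; the getRange body and its thresholds are hypothetical:

    import org.apache.spark.sql.functions.udf

    // Hypothetical price-range rules; replace with the real bucketing logic.
    val getRange = udf((price: Int) =>
      if (price < 100) "low"
      else if (price < 1000) "medium"
      else "high")

    val priceRange = resultPrice.withColumn("range", getRange(resultPrice("price")))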

Re: Verifying if DStream is empty

2016-06-20 Thread nguyen duc tuan
is to figure out if the DStream 'result' is empty or not > and based on the result, perform some operation on input1Pair DStream and > input2Pair RDD. > > > On Mon, Jun 20, 2016 at 7:05 PM, nguyen duc tuan <newvalu...@gmail.com> > wrote: > >> Hi Praseetha, >> In or

Re: Verifying if DStream is empty

2016-06-20 Thread nguyen duc tuan
Hi Praseetha, In order to check whether a DStream is empty or not, using the isEmpty method is correct. I think the problem here is calling input1Pair.leftOuterJoin(input2Pair). I guess the input1Pair RDD comes from the transformation above. You should do it on the DStream instead. In this case, do any transformation
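A sketch of keeping the join on the streaming side (names assumed from the thread: input1Pair as a DStream of key/value pairs, input2Pair as a plain pair RDD):

    // Join each batch's RDD with the static RDD via transform, then check emptiness per batch.
    val joined = input1Pair.transform { rdd =>
      rdd.leftOuterJoin(input2Pair)
    }

    joined.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // act on the non-empty batch here
      }
    }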

Re: ImportError: No module named numpy

2016-06-02 Thread nguyen duc tuan
You should set both PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON to the path of your Python interpreter. 2016-06-02 20:32 GMT+07:00 Bhupendra Mishra : > did not resolved. :( > > On Thu, Jun 2, 2016 at 3:01 PM, Sergio Fernández > wrote: > >> >> On Thu,

Re: Behaviour of RDD sampling

2016-05-31 Thread nguyen duc tuan
Spark will load the whole dataset. The sampling action can be viewed as a filter. The real implementation can be more complicated, but I'll give you the idea with a simple implementation: val rand = new Random(); val subRdd = rdd.filter(x => rand.nextDouble() <= 0.3) To prevent recomputing data, you
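A minimal sketch of the same idea using the real sampling API, plus caching so the parent data is not recomputed for repeated actions (fraction and seed are arbitrary examples):

    // sample() behaves like a per-element coin flip over the cached parent RDD.
    val cached = rdd.cache()
    val sampled = cached.sample(withReplacement = false, fraction = 0.3, seed = 42L)
    sampled.count()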

Re: Spark Streaming: Combine MLlib Prediction and Features on Dstreams

2016-05-31 Thread nguyen duc tuan
<<< As you can see here, no variable to > store the output from foreachRDD. My target is to get (pred, text) pair and > then use > > Whatever it is, the output from "extract_feature" is not what I want. > I will be more than happy if you please correct my mistakes

Re: Spark Streaming: Combine MLlib Prediction and Features on Dstreams

2016-05-31 Thread nguyen duc tuan
just to send result to > another place/context, like db,file etc. > I could use that but seems like over head of having another hop. > I wanted to make it simple and light. > > > On Tuesday, 31 May 2016, nguyen duc tuan <newvalu...@gmail.com> wrote: > >> How about using fore

Re: Spark Streaming: Combine MLlib Prediction and Features on Dstreams

2016-05-30 Thread nguyen duc tuan
y=features > # Retrun will be tuple of (score,'original text') > return predictions > > > Hope, it will help somebody who is facing same problem. > If anybody has better idea, please post it here. > > -Obaid > > On Mon, May 30, 2016 at 8:43 PM, nguyen duc tuan

Re: Spark Streaming: Combine MLlib Prediction and Features on Dstreams

2016-05-30 Thread nguyen duc tuan
How about this?

    def extract_feature(rf_model, x):
        text = getFeatures(x).split(',')
        fea = [float(i) for i in text]
        prediction = rf_model.predict(fea)
        return (prediction, x)

    output = texts.map(lambda x: extract_feature(rf_model, x))

2016-05-30 14:17 GMT+07:00 obaidul karim :

Re: job build cost more and more time

2016-05-25 Thread nguyen duc tuan
Take a look here: http://stackoverflow.com/questions/33424445/is-there-a-way-to-checkpoint-apache-spark-dataframes So all you have to do to create a checkpoint for a dataframe is as follows: df.rdd.checkpoint; df.rdd.count // or any action 2016-05-25 8:43 GMT+07:00 naliazheli <754405...@qq.com>: >
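A slightly fuller hedged sketch; a checkpoint directory has to be set first (the path is hypothetical):

    // Checkpointing writes the RDD to reliable storage and truncates its lineage,
    // so later jobs do not replay the whole plan.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // hypothetical path; sc is the SparkContext

    df.rdd.checkpoint()
    df.rdd.count()   // any action materializes the checkpoint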

Re: What's the best way to find the Nearest Neighbor row of a matrix with 10billion rows x 300 columns?

2016-05-17 Thread nguyen duc tuan
There's no *RowSimilarity* method in the RowMatrix class. You have to transpose your matrix and use the columnSimilarities method instead. However, when the number of rows is large, this approach is still very slow. Try to use approximate nearest neighbor (ANN) methods instead, such as LSH. There are several implementations of LSH on
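A hedged sketch of that transpose-then-columnSimilarities route via a CoordinateMatrix (exact brute force, so still expensive at this scale; the DIMSUM threshold value is illustrative):

    import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

    // rows: RDD[IndexedRow] built elsewhere from the original data.
    val mat = new IndexedRowMatrix(rows)

    // After the transpose, the original rows become columns, so columnSimilarities
    // compares the original rows.
    val similarities = mat.toCoordinateMatrix()
      .transpose()
      .toRowMatrix()
      .columnSimilarities(0.1)   // threshold trades accuracy for speed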

Re: Fw: Significant performance difference for same spark job in scala vs pyspark

2016-05-06 Thread nguyen duc tuan
Try to use Dataframe instead of RDD. Here's an introduction to Dataframe: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html 2016-05-06 21:52 GMT+07:00 pratik gawande : > Thanks Shao for quick reply. I will look
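A tiny hedged sketch of the suggestion (column names hypothetical); with DataFrames the work stays in Spark's optimized engine instead of per-record RDD code:

    // Spark 1.6-era sketch; on Spark 2.x use SparkSession instead of SQLContext.
    import sqlContext.implicits._

    // Hypothetical RDD of (userId, amount) pairs converted to a DataFrame.
    val df = rdd.toDF("user_id", "amount")
    val totals = df.groupBy("user_id").sum("amount")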

Re: (YARN CLUSTER MODE) Where to find logs within Spark RDD processing function ?

2016-04-29 Thread nguyen duc tuan
What does the WebUI show? What do you see when you click on the "stderr" and "stdout" links? These links should contain the standard output and standard error for each executor. About your custom logging in the executor, are you sure you checked "${spark.yarn.app.container.log.dir}/spark-app.log"? The actual location of this

Re: (YARN CLUSTER MODE) Where to find logs within Spark RDD processing function ?

2016-04-29 Thread nguyen duc tuan
These are the executors' logs, not the driver logs. To see these log files, you have to go to the executor machines where the tasks are running. To see what you print to stdout or stderr, you can either go to the executor machines directly (it will be stored in "stdout" and "stderr" files somewhere in the executor

Re: Compute

2016-04-27 Thread nguyen duc tuan
oint (like a join by point ids would). > > Here's an implementation of that idea in the context of finding nearest > neighbors: > > https://github.com/karlhigley/spark-neighbors/blob/master/src/main/scala/com/github/karlhigley/spark/neighbors/ANNModel.scala#L33-L34 > > Best, > Karl >

Compute

2016-04-27 Thread nguyen duc tuan
Hi all, Currently, I'm working on implementing LSH on Spark. This leads to the following problem. I have an RDD[(Int, Int)] that stores all pairs of vector ids for which I need to compute distances, and another RDD[(Int, Vector)] that stores all vectors with their ids. Can anyone suggest an efficient way to
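A hedged sketch of the join-by-id approach referenced in the reply above: attach each vector to its id in the pair, then compute the distance (Euclidean here as an example):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // pairs: RDD[(Int, Int)], vectors: RDD[(Int, Vector)] as described above.
    def pairDistances(pairs: RDD[(Int, Int)],
                      vectors: RDD[(Int, Vector)]): RDD[((Int, Int), Double)] = {
      pairs
        .join(vectors)                                       // (id1, (id2, vec1))
        .map { case (id1, (id2, vec1)) => (id2, (id1, vec1)) }
        .join(vectors)                                       // (id2, ((id1, vec1), vec2))
        .map { case (id2, ((id1, vec1), vec2)) =>
          ((id1, id2), math.sqrt(Vectors.sqdist(vec1, vec2)))
        }
    }

If the vectors fit in memory, broadcasting an (id -> vector) map instead of joining avoids the two shuffles.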

Re: [Spark 1.5.2]All data being written to only one part file rest part files are empty

2016-04-25 Thread nguyen duc tuan
Maybe the problem is the data itself. For example, the first dataframe might have common keys in only one part of the second dataframe. I think you can verify whether you are in this situation by repartitioning one dataframe and then joining. If this is the true reason, you might see the result distributed more
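A minimal sketch of that check (the join key, partition count and output path are hypothetical):

    // Redistribute the first dataframe before the join; if the written part files
    // then come out more evenly sized, the skewed key distribution was the cause.
    val joined = df1.repartition(200).join(df2, "join_key")
    joined.write.parquet("/tmp/joined-check")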

Re: Element appear in both 2 splits of RDD after using randomSplit

2016-02-21 Thread nguyen duc tuan
That's very useful information. The reason for the weird problem is the non-determinism of the RDD before applying randomSplit. By caching the RDD, we can make it deterministic, and so the problem is solved. Thank you for your help. 2016-02-21 11:12 GMT+07:00 Ted Yu : >
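A minimal sketch of that fix: cache (and materialize) the RDD before splitting so both splits see the same data (weights and seed are examples; rdd is the RDD being split):

    // Without caching, a non-deterministic parent RDD can be recomputed differently
    // for each split, so an element may land in both splits or in neither.
    val stable = rdd.cache()
    stable.count()   // materialize once

    val Array(train, test) = stable.randomSplit(Array(0.8, 0.2), seed = 12345L)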

Re: spark streaming: stderr does not roll

2014-11-12 Thread Nguyen, Duc
the properties are correctly set. But regardless of what I've tried, the stderr log file on the worker nodes does not roll and continues to grow...leading to a crash of the cluster once it claims 100% of disk. Has anyone else encountered this? Anyone? On Fri, Nov 7, 2014 at 3:35 PM, Nguyen, Duc duc.ngu
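For reference, the rolling settings under discussion are the spark.executor.logs.rolling.* properties added by the PR linked in the original post below (values here are illustrative; they must be in place before the executors launch, e.g. via spark-defaults.conf or SparkConf):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.logs.rolling.strategy", "time")          // or "size"
      .set("spark.executor.logs.rolling.time.interval", "daily")
      .set("spark.executor.logs.rolling.maxRetainedFiles", "7")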

spark streaming: stderr does not roll

2014-11-07 Thread Nguyen, Duc
We are running spark streaming jobs (version 1.1.0). After a sufficient amount of time, the stderr file grows until the disk is full at 100% and crashes the cluster. I've read this https://github.com/apache/spark/pull/895 and also read this