How to kill a query job when using spark thrift-server?

2017-11-27 Thread 张万新
Hi, I intend to use the Spark thrift-server as a service to support concurrent SQL queries. But in our situation we need a way to kill an arbitrary query job. Is there an API for this?
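
For what it's worth, a minimal sketch of the generic job-group mechanism Spark itself exposes, assuming you control the SparkContext rather than going through the Thrift server (the group id, table name, and threading setup are illustrative; this is not a Thrift-server API):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cancellable-sql").getOrCreate()
    sc = spark.sparkContext

    # Tag all jobs launched from this thread with a cancellable group id.
    sc.setJobGroup("query-42", "ad-hoc SQL query", interruptOnCancel=True)
    rows = spark.sql("SELECT count(*) FROM some_table").collect()  # triggers the jobs

    # From another thread, kill everything tagged with that group id:
    sc.cancelJobGroup("query-42")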

Re: Custom Data Source for getting data from Rest based services

2017-11-27 Thread Sourav Mazumder
It would be great if you could elaborate on the bulk provisioning use case. Regards, Sourav. On Sun, Nov 26, 2017 at 11:53 PM, shankar.roy wrote: > This would be a useful feature. > We can leverage it while doing bulk provisioning.

Re: Custom Data Source for getting data from Rest based services

2017-11-27 Thread smazumder
@sathich Here are my thoughts on your points - 1. Yes, this should be able to handle any complex JSON structure returned by the target REST API. Essentially, what it would be returning is Rows of that complex structure. Then one can use Spark SQL to further flatten it using functions like
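
For context, a hedged sketch of the kind of Spark SQL flattening referred to above, assuming the source hands back rows with a nested array column (the file name and column names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.json("response.json")  # stand-in for the REST data source

    flat = (df
            .select(col("id"), explode(col("items")).alias("item"))  # one row per array element
            .select("id", "item.name", "item.value"))                # pull nested fields up
    flat.show()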

Re: Loading a large parquet file how much memory do I need

2017-11-27 Thread Gourav Sengupta
Hi, 10 TB in Athena would cost $50 (Athena bills $5 per TB actually scanned). If your data is in Parquet, it will cost even less because of columnar striping. So I am genuinely not quite sure what you are referring to. Also, what do you mean by "I currently need"? Are you already processing the data? Since you mentioned that you

Re: Using MatrixFactorizationModel as a feature extractor

2017-11-27 Thread Corey Nolet
I know that the algorithm itself is not able to extract features for a user that it was not trained on. However, I'm trying to find a way to compare users for similarity, so that when I find a user who is really similar to another user, I can just use the similar user's recommendations until the
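
One possible way to do that comparison, sketched here under the assumption that the users being compared were all in the training set (user id 42 and the collect-to-driver step are illustrative, and collecting only scales to modest user counts):

    import numpy as np

    # model: a trained MatrixFactorizationModel
    features = dict(model.userFeatures().collect())  # {userId: latent factor vector}

    def cosine(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    target = features[42]
    most_similar = max((u for u in features if u != 42),
                       key=lambda u: cosine(target, features[u]))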

Using MatrixFactorizationModel as a feature extractor

2017-11-27 Thread Corey Nolet
I'm trying to use the MatrixFactorizationModel to, for instance, determine the latent factors of a user or item that were not used in the training data of the model. I'm not as concerned about the ratings as I am about the latent factors for the user/item. Thanks!
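
For ids that were in the training data, the latent factors are directly accessible; a minimal sketch (the ids here are illustrative):

    # model: a trained MatrixFactorizationModel
    # Each lookup returns a list holding the latent factor vector for that id.
    user_vec = model.userFeatures().lookup(42)
    item_vec = model.productFeatures().lookup(7)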

Re: Loading a large parquet file how much memory do I need

2017-11-27 Thread Alexander Czech
Hi, simply because you have to pay the EMR fee on top of every instance hour. I currently need about 4800 h of r3.2xlarge; EMR costs $0.18 per instance hour, so it would be $864 just in EMR fees (spot prices are around $0.12/h). Just to stay on topic: I thought about getting 40 i2.xlarge instances, which have about
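
A quick back-of-the-envelope check of those numbers, using the rates as quoted above (they are not authoritative prices):

    hours = 4800
    emr_fee = 0.18   # $/instance-hour EMR surcharge, as quoted
    spot = 0.12      # $/instance-hour spot price, as quoted
    print(hours * emr_fee)           # 864.0  -> EMR fees alone
    print(hours * (emr_fee + spot))  # 1440.0 -> total with spot instances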

Re: [Spark R]: dapply only works for very small datasets

2017-11-27 Thread Felix Cheung
What's the number of executors and/or the number of partitions you are working with? I'm afraid most of the problem is the serialization/deserialization overhead between the JVM and R...

[Spark R]: dapply only works for very small datasets

2017-11-27 Thread Kunft, Andreas
Hello, I tried to execute some user-defined functions with R using the airline arrival performance dataset. While the examples from the documentation for the `<-` apply operator work perfectly fine on a dataset of ~9 GB, the `dapply` operator fails to finish even after ~4 hours. I'm using a

Re: Loading a large parquet file how much memory do I need

2017-11-27 Thread Gourav Sengupta
Hi, I think that I have mentioned all the required alternatives. However, I am quite curious as to how you concluded that processing using EMR is going to be more expensive than any other stack. I have been using EMR for the last 6 years (almost since the time it came out), and have always

Re: Loading a large parquet file how much memory do I need

2017-11-27 Thread Alexander Czech
I don't use EMR; I spin my clusters up using flintrock (being a student, my budget is slim). My code is written in pyspark and my data is in the us-east-1 region (N. Virginia). I will do my best explaining it with tables: my input, with a size of ~10 TB, sits in multiple (~150) parquet files on S3

Re: Cosine Similarity between documents - Rows

2017-11-27 Thread Ge, Yao (Y.)
You are essentially doing document clustering. K-means will do it. You do have to specify the number of clusters up front.
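
A hedged sketch of that suggestion in PySpark, TF-IDF features feeding KMeans with a fixed k (the column names, k, and the docs DataFrame are illustrative):

    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),
        HashingTF(inputCol="words", outputCol="tf"),
        IDF(inputCol="tf", outputCol="features"),
        KMeans(k=10, featuresCol="features"),  # k must be chosen up front
    ])
    model = pipeline.fit(docs)         # docs: DataFrame with a "text" column
    clustered = model.transform(docs)  # adds a "prediction" cluster column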

Cosine Similarity between documents - Rows

2017-11-27 Thread Donni Khan
I have a Spark job to compute the similarity between text documents:

    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
    import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
    import org.apache.spark.mllib.linalg.distributed.RowMatrix;

    RowMatrix rowMatrix = new RowMatrix(vectorsRDD.rdd());
    // Note: columnSimilarities() computes similarities between *columns*,
    // so documents must be the columns of the matrix (or it must be transposed).
    CoordinateMatrix rowsimilarity = rowMatrix.columnSimilarities(0.5);
    JavaRDD<MatrixEntry> entries = rowsimilarity.entries().toJavaRDD();
    List<MatrixEntry> list = entries.collect(); // pulls every entry onto the driver
    for (MatrixEntry s : /* message truncated here */

Re: Loading a large parquet file how much memory do I need

2017-11-27 Thread Gourav Sengupta
Hi, it would be much simpler if you just provided two tables with samples of the input and output. Going through the verbose text and trying to figure out what is happening is a bit daunting. Personally, given that you have your entire data in Parquet, I do not think that you will

Re: Loading a large parquet file how much memory do I need

2017-11-27 Thread Alexander Czech
I have a temporary result file (the 10 TB one) that looks like this: around 3 billion rows of (url, url_list, language, vector, text). The bulk of the data is in url_list, and at the moment I can only guess how large url_list is. I want to give an ID to every url, and then this ID to every url in
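
One hedged sketch of that id assignment in pyspark (the bucket paths and column names are guesses at the schema described above):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, monotonically_increasing_id

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3a://bucket/temp-result/")  # the ~10 TB intermediate

    # One id per distinct url (unique, but not consecutive).
    ids = df.select("url").distinct() \
            .withColumn("url_id", monotonically_increasing_id())

    with_id = df.join(ids, "url")  # id attached to the url column
    # Ids for the urls inside url_list need an explode + join:
    links = (df.select("url", explode("url_list").alias("link"))
               .join(ids.withColumnRenamed("url", "link"), "link"))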

Re: Loading a large parquet file how much memory do I need

2017-11-27 Thread Georg Heiler
How many columns do you need from the big file? Also, how CPU / memory intensive are the computations you want to perform? Alexander Czech wrote on Mon, 27 Nov 2017 at 10:57: > I want to load a 10TB parquet File from S3 and I'm trying to decide what > EC2

Loading a large parquet file how much memory do I need

2017-11-27 Thread Alexander Czech
I want to load a 10 TB parquet file from S3 and I'm trying to decide what EC2 instances to use. Should I go for instances that in total have more memory than 10 TB? Or is it enough that they have enough SSD storage in total so that everything can be spilled to disk? Thanks
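
For what it's worth, a minimal sketch of why total RAM does not need to match the file size, assuming a straightforward read-transform-write job (the paths and columns are illustrative): Spark reads Parquet lazily, partition by partition, and only the columns actually selected; shuffles spill to local disk, so ample local SSD usually matters more than 10 TB of RAM.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3a://bucket/big-file/")  # lazy: nothing is loaded yet
    subset = df.select("url", "language")              # column pruning at the Parquet level
    subset.write.parquet("s3a://bucket/out/")          # streams through executors partition-wise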