RE: Need help with ALS Recommendation code

2015-04-05 Thread Phani Yadavilli -X (pyadavil)
Hi Xiangrui, Thank you for the response. I tried sbt package and sbt compile; both commands give me a success result. sbt compile [info] Set current project to machine-learning (in build file:/opt/mapr/spark/spark-1.2.1/SparkTraining/machine-learning/) [info] Updating {file:/opt/mapr/spark/spa

Re: Re: About Waiting batches on the spark streaming UI

2015-04-05 Thread bit1...@163.com
Thanks Tathagata for the explanation! bit1...@163.com From: Tathagata Das Date: 2015-04-04 01:28 To: Ted Yu CC: bit1129; user Subject: Re: About Waiting batches on the spark streaming UI Maybe that should be marked as waiting as well. Will keep that in mind. We plan to update the ui soon, so

Re: Add row IDs column to data frame

2015-04-05 Thread Xiangrui Meng
Sorry, it should be toDF("text", "id"). On Sun, Apr 5, 2015 at 9:21 PM, Xiangrui Meng wrote: > Try: sc.textFile("path/file").zipWithIndex().toDF("id", "text") -Xiangrui > > On Sun, Apr 5, 2015 at 7:50 PM, olegshirokikh wrote: >> What would be the most efficient neat method to add a column with r
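
A minimal end-to-end sketch of the corrected call, assuming Spark 1.3 with an existing SparkContext `sc` and SQLContext `sqlContext` (the path is a placeholder):

    import sqlContext.implicits._

    // zipWithIndex yields RDD[(String, Long)]: (line, rowId); the column names
    // passed to toDF must follow that element order, hence ("text", "id").
    val df = sc.textFile("path/file")
      .zipWithIndex()
      .toDF("text", "id")
    df.show()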

Re: Add row IDs column to data frame

2015-04-05 Thread Xiangrui Meng
Try: sc.textFile("path/file").zipWithIndex().toDF("id", "text") -Xiangrui On Sun, Apr 5, 2015 at 7:50 PM, olegshirokikh wrote: > What would be the most efficient neat method to add a column with row ids to > dataframe? > > I can think of something as below, but it completes with errors (at line 3

Re: Need help with ALS Recommendation code

2015-04-05 Thread Xiangrui Meng
Could you try `sbt package` or `sbt compile` and see whether there are errors? It seems that you haven't reached the ALS code yet. -Xiangrui On Sat, Apr 4, 2015 at 5:06 AM, Phani Yadavilli -X (pyadavil) wrote: > Hi , > > > > I am trying to run the following command in the Movie Recommendation exa

Create DataFrame from textFile with unknown columns

2015-04-05 Thread olegshirokikh
Assuming there is a text file with an unknown number of columns, how would one create a data frame? I have followed the example in the Spark docs where one first creates an RDD of Rows, but it seems that you have to know the exact number of columns in the file and can't just do this: val rowRDD = sc.textFile("pat
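
One possible sketch for this, assuming a Spark 1.3 SQLContext, a comma delimiter, and that every line has the same number of fields (all assumptions here): read the first line to get the column count, build a StructType programmatically, then create the DataFrame from an RDD of Rows.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val lines = sc.textFile("path/file")
    // Infer the column count from the first line instead of hard-coding it.
    val numCols = lines.first().split(",").length
    val schema = StructType((0 until numCols).map(i => StructField(s"c$i", StringType, nullable = true)))
    val rowRDD = lines.map(line => Row.fromSeq(line.split(",").toSeq))
    val df = sqlContext.createDataFrame(rowRDD, schema)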

Add row IDs column to data frame

2015-04-05 Thread olegshirokikh
What would be the most efficient, neat method to add a column with row IDs to a dataframe? I can think of something like the code below, but it completes with errors (at line 3), and anyway doesn't look like the best route possible: var dataDF = sc.textFile("path/file").toDF() val rowDF = sc.parallelize(1 to

Re: conversion from java collection type to scala JavaRDD

2015-04-05 Thread Dean Wampler
The runtime attempts to serialize everything required by records, and also any lambdas/closures you use. Small, simple types are less likely to run into this problem. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe <

Re: newAPIHadoopRDD Mutiple scan result return from Hbase

2015-04-05 Thread Jeetendra Gangele
I am already using STARTROW and ENDROW in HBase from newAPIHadoopRDD. Can I do something similar with an RDD? Let's say I use a filter on the RDD to get only those records which match the same criteria specified by STARTROW and STOPROW. Will it be much faster than querying HBase? On 6 April 2015 at 03:15, Ted Yu wr

Re: newAPIHadoopRDD Mutiple scan result return from Hbase

2015-04-05 Thread Ted Yu
bq. HBase scan operation like scan StartROW and EndROW in RDD? I don't think RDD supports the concept of a start row and an end row. In HBase, please take a look at the following methods of Scan: public Scan setStartRow(byte [] startRow) { public Scan setStopRow(byte [] stopRow) { Cheers On Sun, A
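
For reference, a sketch of wiring a range-restricted Scan into newAPIHadoopRDD so the row filtering happens on the HBase side. The table name and keys are placeholders, and the ProtobufUtil/Base64 encoding of the Scan is one commonly used approach assumed to be available in the HBase version in use:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.protobuf.ProtobufUtil
    import org.apache.hadoop.hbase.util.{Base64, Bytes}

    // Restrict the scan to a row-key range, evaluated server side.
    val scan = new Scan()
    scan.setStartRow(Bytes.toBytes("startKey"))
    scan.setStopRow(Bytes.toBytes("stopKey"))

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "myTable")
    hbaseConf.set(TableInputFormat.SCAN,
      Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray))

    // Each element is (rowKey, Result); only rows in [startKey, stopKey) are read.
    val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])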

Re: newAPIHadoopRDD Mutiple scan result return from Hbase

2015-04-05 Thread Jeetendra Gangele
I have a 2GB HBase table where the data is stored in the form of key and value (only one column per key), and the keys are also unique. What I am thinking is to load the complete HBase table into an RDD and then do operations like scans in the RDD rather than in HBase. Can I do an HBase scan operation like scan StartR

Re: NoSuchMethodException KafkaUtils.

2015-04-05 Thread Yamini
I customized spark-streaming-kafka_2.10-1.1.0.jar and included a new method in the KafkaUtils class to handle the byte array format. That helped. - Thanks, Yamini -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodException-KafkaUtils-tp17142p22384.html Sen
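
For what it's worth, a sketch of an alternative that avoids customizing the jar: the typed createStream overload can be asked for raw byte arrays by passing Kafka's DefaultDecoder. The StreamingContext `ssc`, topic name, and ZooKeeper settings are placeholder assumptions:

    import kafka.serializer.DefaultDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect" -> "localhost:2181",     // placeholder
      "group.id"          -> "byte-array-consumer")

    // Keys and values arrive as Array[Byte] instead of String.
    val stream = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
      ssc, kafkaParams, Map("myTopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)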

Re: conversion from java collection type to scala JavaRDD

2015-04-05 Thread Jeetendra Gangele
You are right. I have a class called VendorRecord which is not serializable, and this class contains many nested classes (maybe 30 or more). Do I need to recursively make them all serializable? On 4 April 2015 at 18:14, Dean Wampler wrote: > Without the rest of your code, it's hard to know what might be > unse
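
A tiny sketch of what that ends up meaning in practice: every type reachable from a record that is shipped to executors has to be serializable. SubRecord is a hypothetical stand-in for the nested types mentioned here, not a real class from the thread:

    // Placeholder classes for illustration only.
    class SubRecord(val code: String) extends Serializable
    class VendorRecord(val id: Long, val sub: SubRecord) extends Serializable

    // If SubRecord (or any other type reachable from VendorRecord) were not
    // serializable, tasks shipping VendorRecord instances would fail with a
    // NotSerializableException, so each nested type needs to be marked as well.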

Re: newAPIHadoopRDD Mutiple scan result return from Hbase

2015-04-05 Thread Jeetendra Gangele
Sure I will check. On 6 April 2015 at 02:45, Ted Yu wrote: > You do need to apply the patch since 0.96 doesn't have this feature. > > For JavaSparkContext.newAPIHadoopRDD, can you check region server metrics > to see where the overhead might be (compared to creating scan and firing > query using

Re: newAPIHadoopRDD Mutiple scan result return from Hbase

2015-04-05 Thread Ted Yu
You do need to apply the patch since 0.96 doesn't have this feature. For JavaSparkContext.newAPIHadoopRDD, can you check the region server metrics to see where the overhead might be (compared to creating a scan and firing the query using the native client)? Thanks On Sun, Apr 5, 2015 at 2:00 PM, Jeetendra Ga

Re: newAPIHadoopRDD Mutiple scan result return from Hbase

2015-04-05 Thread Jeetendra Gangele
That's true. I checked MultiRowRangeFilter and it serves my need. Do I need to apply the patch for this, since I am using HBase version 0.96? Also, I have noticed that when I use JavaSparkContext.newAPIHadoopRDD it is slow compared to creating a scan and firing the query; is there any reason? On 6 Apri

Re: newAPIHadoopRDD Mutiple scan result return from Hbase

2015-04-05 Thread Ted Yu
Looks like MultiRowRangeFilter would serve your need. See HBASE-11144. HBase 1.1 would be released in May. You can also backport it to the HBase release you're using. On Sat, Apr 4, 2015 at 8:45 AM, Jeetendra Gangele wrote: > Here is my conf object passing first parameter of API. > but here I
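
A rough sketch of what using MultiRowRangeFilter from HBASE-11144 might look like once the filter is available in the running HBase version. The ranges and key values are placeholders, and the constructor shape is an assumption based on the JIRA rather than something confirmed in this thread:

    import java.util.Arrays
    import org.apache.hadoop.hbase.client.Scan
    import org.apache.hadoop.hbase.filter.MultiRowRangeFilter
    import org.apache.hadoop.hbase.filter.MultiRowRangeFilter.RowRange
    import org.apache.hadoop.hbase.util.Bytes

    // Two disjoint row-key ranges scanned in a single pass, filtered server side.
    val ranges = Arrays.asList(
      new RowRange(Bytes.toBytes("key-a"), true, Bytes.toBytes("key-c"), false),
      new RowRange(Bytes.toBytes("key-m"), true, Bytes.toBytes("key-p"), false))

    val scan = new Scan()
    scan.setFilter(new MultiRowRangeFilter(ranges))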

Sending RDD object over the network

2015-04-05 Thread raggy
For a class project, I am trying to have 2 Spark applications communicate with each other by passing an RDD object that was created in one application to the other Spark application. The first application is developed in Scala, creates an RDD, and sends it to the 2nd application over the netwo

Diff between foreach and foreachsync

2015-04-05 Thread Jeetendra Gangele
Hi, can somebody explain to me the difference between the foreach and foreachAsync RDD actions? Which one will give the best result for maximum throughput? Does foreach run in a parallel way?
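
A small sketch of the difference, assuming an existing SparkContext `sc`: both actions run their tasks in parallel across partitions; they differ in whether the driver blocks while the job runs.

    import org.apache.spark.FutureAction
    import org.apache.spark.SparkContext._   // Async actions implicit on older Spark versions
    import scala.concurrent.Await
    import scala.concurrent.duration.Duration

    val rdd = sc.parallelize(1 to 1000, numSlices = 8)

    // Blocking: the driver waits here, but the work still runs as one task
    // per partition on the executors.
    rdd.foreach(x => println(x))

    // Non-blocking: the job is submitted and a FutureAction comes back right
    // away, so the driver can submit other jobs concurrently.
    val fut: FutureAction[Unit] = rdd.foreachAsync(x => println(x))
    Await.result(fut, Duration.Inf)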

Re: Low resource when upgrading from 1.1.0 to 1.3.0

2015-04-05 Thread nsalian
Could you check whether your workers are registered to the Master? Also, look at the heap size for each Worker. For reference, could you paste the exact command that you executed? You mentioned that you changed the script; what is the change? -- View this message in context: http://ap

Re: Spark + Kinesis

2015-04-05 Thread Vadim Bichutskiy
Hi all, Below is the output that I am getting. My Kinesis stream has 1 shard, and my Spark cluster on EC2 has 2 slaves (I think that's fine?). I should mention that my Kinesis producer is written in Python where I followed the example http://blogs.aws.amazon.com/bigdata/post/Tx2Z24D4T99AN35/Snak

Re: input size too large | Performance issues with Spark

2015-04-05 Thread Ted Yu
Reading Sandy's blog, there seems to be one typo. bq. Similarly, the heap size can be controlled with the --executor-cores flag or the spark.executor.memory property. '--executor-memory' should be the right flag. BTW bq. It defaults to max(384, .07 * spark.executor.memory) Default memory overhead
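
A quick back-of-the-envelope illustration of that default; the 10g figure is only an example value, not something from the thread:

    // With spark.executor.memory = 10g (10240 MB):
    val executorMemoryMb = 10240
    val overheadMb = math.max(384, (0.07 * executorMemoryMb).toInt)  // ~716 MB
    val containerMb = executorMemoryMb + overheadMb                  // ~10956 MB, roughly what YARN allocates per executor container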

Re: Pseudo Spark Streaming ?

2015-04-05 Thread Jörn Franke
Hello, Just because you receive the log files hourly does not mean that you have to use Spark Streaming. Spark Streaming is often used if you receive new events each minute/second, potentially at an irregular frequency. Of course your analysis window can be larger. I think your use case justifies stand

Pseudo Spark Streaming ?

2015-04-05 Thread Bahubali Jain
Hi, I have a requirement for which I plan to use Spark Streaming. I am supposed to calculate the access count for certain webpages. I receive the webpage access information through log files. By access count I mean "how many times was the page accessed *till now*". I have the log files for the past 2 yea
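
Along the lines of the reply above (a plain batch job rather than streaming), a minimal sketch; the paths, glob, and log-line layout are assumptions for illustration only:

    // Count "accesses till now" over all log files collected so far.
    val logs = sc.textFile("hdfs:///logs/access/*")      // hourly files, all of them
    val counts = logs
      .map(line => line.split(" ")(6))                   // assumed: 7th field is the page URL
      .map(page => (page, 1L))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///output/page-counts")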

Re: 4 seconds to count 13M lines. Does it make sense?

2015-04-05 Thread Horia
Are you pre-caching them in memory? On Apr 4, 2015 3:29 AM, "SamyaMaiti" wrote: > Reduce *spark.sql.shuffle.partitions* from default of 200 to total number > of > cores. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/4-seconds-to-count-13M-lines-D
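
A tiny sketch combining the two suggestions in this thread (caching before timing the count, and lowering spark.sql.shuffle.partitions toward the total core count); `sqlContext`, the value 16, and the path are assumptions:

    sqlContext.setConf("spark.sql.shuffle.partitions", "16")  // roughly the total core count

    val lines = sc.textFile("path/13M-line-file").cache()
    lines.count()   // first count reads from disk and materializes the cache
    lines.count()   // later counts are served from memory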