Re: Spark Streaming - Number of RDDs in DStream

2015-12-20 Thread Saisai Shao
Normally there will be one RDD in each batch. You could refer to the implementation of DStream#getOrCompute. On Mon, Dec 21, 2015 at 11:04 AM, Arun Patel wrote: > It may be a simple question... but I am struggling to understand this. > > A DStream is a sequence of RDDs
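
A minimal way to see this from the application side (a hedged sketch; `stream` is a placeholder DStream name): foreachRDD fires once per batch interval, and the single argument it receives is the one RDD generated for that batch.

    stream.foreachRDD { (rdd, time) =>
      // one callback per batch interval, one RDD per callback
      println(s"batch $time: one RDD with ${rdd.partitions.length} partitions")
    }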

Re: DataFrame operations

2015-12-20 Thread Jeff Zhang
If it does not return the column you expect, then what does it return? Will you then have 2 columns with the same column name? On Sun, Dec 20, 2015 at 7:40 PM, Eran Witkon wrote: > Hi, > > I am a bit confused with dataframe operations. > I have a function which takes a

Spark Streaming - Number of RDDs in DStream

2015-12-20 Thread Arun Patel
It may be a simple question... but I am struggling to understand this. A DStream is a sequence of RDDs created in a batch window. So, how do I know how many RDDs are created in a batch? I am clear about the number of partitions created, which is Number of Partitions = (Batch Interval /

Re: Spark batch getting hung up

2015-12-20 Thread swetha kasireddy
I see this happens when there is a deadlock situation. The RDD test1 has a Couchbase call and it seems to be having threads hanging there. Even though all the connections are closed, I see the threads related to Couchbase causing the job to hang for some time before it gets cleared up. Would the

Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Yin Huai
Hi Eran, Can you try 1.6? With the change in https://github.com/apache/spark/pull/10288, JSON data source will not throw a runtime exception if there is any record that it cannot parse. Instead, it will put the entire record to the column of "_corrupt_record". Thanks, Yin On Sun, Dec 20, 2015
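
A small sketch of what this looks like from the user side, assuming Spark 1.6+, the default corrupt-record column name, and that the input actually contains malformed records (otherwise the column is not part of the inferred schema); the path is a placeholder:

    val df = sqlContext.read.json("/path/to/json")
    // records that could not be parsed carry the raw text in _corrupt_record
    df.filter(df("_corrupt_record").isNotNull).show()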

Re: Spark batch getting hung up

2015-12-20 Thread Jeff Zhang
>>> Would the driver not wait till all the stuff related to test1 is completed before calling test2, as test2 is dependent on test1? >>> val test1 = RDD1.mapPartitions(...) >>> val test2 = test1.mapPartitions(...) On the driver side, actually these 2 lines of code will be executed, but the real
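
A tiny illustration of the point (the RDD names are from the thread, the function bodies are placeholders): both mapPartitions calls only build up the lineage on the driver; nothing runs on the cluster until an action is invoked.

    val test1 = RDD1.mapPartitions(iter => iter)   // transformation: recorded, not executed
    val test2 = test1.mapPartitions(iter => iter)  // transformation: recorded, not executed
    test2.count()                                  // action: both stages actually execute here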

Creating vectors from a dataframe

2015-12-20 Thread Arunkumar Pillai
Hi I'm trying to use Linear Regression from the ml library, but the problem is the independent variable should be a vector. My code snippet is as follows: var dataDF = sqlContext.emptyDataFrame dataDF = sqlContext.sql("SELECT "+ dependentVariable+","+independentVariables +" FROM " +

Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Eran Witkon
Once I removed the CR/LF from the file it worked OK. Eran On Mon, 21 Dec 2015 at 06:29 Yin Huai wrote: > Hi Eran, > > Can you try 1.6? With the change in > https://github.com/apache/spark/pull/10288, JSON data source will not > throw a runtime exception if there is any

Re: DataFrame operations

2015-12-20 Thread Eran Witkon
Problem resolved, syntax issue )-: On Mon, 21 Dec 2015 at 06:09 Jeff Zhang wrote: > If it does not return the column you expect, then what does it return? Will > you then have 2 columns with the same column name? > > On Sun, Dec 20, 2015 at 7:40 PM, Eran Witkon

Memory allocation for Broadcast values

2015-12-20 Thread Pat Ferrel
I have a large Map that is assembled in the driver and broadcast to each node. My question is how best to allocate memory for this. The Driver has to have enough memory for the Maps, but only one copy is serialized to each node. What type of memory should I size to match the Maps? Is the
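
A hedged sketch of the pattern being described (buildBigMap and someRdd are placeholders): the Map must fit in the driver heap (spark.driver.memory), and each executor holds one deserialized copy in its block-manager storage memory (spark.executor.memory), so both should be sized with that in mind.

    val bigMap: Map[String, Long] = buildBigMap()           // assembled in the driver
    val bc = sc.broadcast(bigMap)                           // serialized once, shipped to each executor
    val hits = someRdd.map(k => bc.value.getOrElse(k, 0L))  // tasks read the executor-local copy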

Getting estimates and standard error using ml.LinearRegression

2015-12-20 Thread Arunkumar Pillai
Hi I'm using the ml.LinearRegression package. How do I get the estimates and standard errors for the coefficients? PFB the code snippet val lr = new LinearRegression() lr.setMaxIter(10) .setRegParam(0.01) .setFitIntercept(true) val model = lr.fit(test) val estimates = model.summary
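
A hedged sketch of one way to get both, assuming Spark 1.6+ where the training summary exposes these statistics (they require the normal-equation solver); `test` is the training DataFrame from the snippet above.

    import org.apache.spark.ml.regression.LinearRegression

    val lr = new LinearRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
      .setFitIntercept(true)
      .setSolver("normal")                                    // needed for the error statistics
    val model = lr.fit(test)
    val estimates = model.coefficients                        // point estimates
    val stdErrs   = model.summary.coefficientStandardErrors   // standard errors (1.6+)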

Query partition keys for indexed parquet input

2015-12-20 Thread ajackson92
I have a directory that contains a set of parquet files that have been partitioned. As such, there are subdirectories all of the form MyKey=XXX. The structure is relatively large but I have a query that wants to use only the data from the maximum value for MyKey. I tried the trivial approach
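
One hedged approach (paths are placeholders): read the partitioned root, compute the maximum value of the partition column, then filter on it so Spark can prune down to the matching MyKey=XXX subdirectory.

    import org.apache.spark.sql.functions.max

    val df = sqlContext.read.parquet("/path/to/partitioned/table")
    val maxKey = df.agg(max("MyKey")).first().get(0)
    val latest = df.filter(df("MyKey") === maxKey)   // partition pruning should keep only MyKey=<max>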

pyspark streaming crashes

2015-12-20 Thread Antony Mayi
Hi, can anyone please help me troubleshoot this problem: I have a streaming pyspark application (spark 1.5.2 on yarn-client) which keeps crashing after a few hours. It doesn't seem to be running out of memory on either the driver or the executors. driver error: py4j.protocol.Py4JJavaError: An error occurred

Re: How to run multiple Spark jobs as a workflow that takes input from a Streaming job in Oozie

2015-12-20 Thread rahulkumar-aws
Use Spark job server https://github.com/spark-jobserver/spark-jobserver Additional: 1. You can also write your own job server with Spray (a Scala REST framework). 2. Create a Thrift server and pass the state of each job (as Thrift objects) between your different jobs. - Software

Re: combining multiple JSON files to one DataFrame

2015-12-20 Thread Alexander Pivovarov
Just point the loader to the folder. You do not need the * On Dec 19, 2015 11:21 PM, "Eran Witkon" wrote: > Hi, > Can I combine multiple JSON files to one DataFrame? > > I tried > val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/*") > but I get an empty DF > Eran >

Re: How to run multiple Spark jobs as a workflow that takes input from a Streaming job in Oozie

2015-12-20 Thread Jörn Franke
Flume could be interesting for you. > On 19 Dec 2015, at 00:27, SRK wrote: > > Hi, > > How to run multiple Spark jobs that takes Spark Streaming data as the > input as a workflow in Oozie? We have to run our Streaming job first and > then have a workflow of Spark

Re: How to do map join in Spark SQL

2015-12-20 Thread Alexander Pivovarov
spark.sql.autoBroadcastJoinThreshold default value in 1.5.2 is 10MB. According to the console output Spark is doing a broadcast, but a query which looks like the following does not perform well: select big_t.*, small_t.name range_name from big_t join small_t on (1=1) where small_t.min <= big_t.v

DataFrame operations

2015-12-20 Thread Eran Witkon
Hi, I am a bit confused with dataframe operations. I have a function which takes a string and returns a string. I want to apply this function to all rows of a single column in my dataframe. I was thinking of the following: jsonData.withColumn("computedField",computeString(jsonData("hse"))) BUT
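
For reference, the usual way to apply a plain String => String function to a column is to wrap it in a UDF; a minimal sketch reusing the names from the mail (the function body is a placeholder):

    import org.apache.spark.sql.functions.udf

    val computeString = udf((s: String) => s.toUpperCase)   // wrap the Scala function as a UDF
    val result = jsonData.withColumn("computedField", computeString(jsonData("hse")))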

Re: combining multiple JSON files to one DataFrame

2015-12-20 Thread Eran Witkon
disregard my last question - my mistake. I accessed it as a col not as a row : jsonData.first.getAs[String]("cty") Eran On Sun, Dec 20, 2015 at 11:42 AM Eran Witkon wrote: > Thanks, That's works. > One other thing - > I have the following code: > > val jsonData =

Re: Using Spark to process JSON with gzip filed

2015-12-20 Thread Akhil Das
Yes it is. You can actually use the java.util.zip.GZIPInputStream in your case. Thanks Best Regards On Sun, Dec 20, 2015 at 3:23 AM, Eran Witkon wrote: > Thanks, since it is just a snippt do you mean that Inflater is coming > from ZLIB? > Eran > > On Fri, Dec 18, 2015 at
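
A hedged sketch of the decompression step, assuming the field arrives as a base64-encoded gzip payload (the encoding is an assumption; adjust to however the bytes are actually carried):

    import java.io.ByteArrayInputStream
    import java.util.zip.GZIPInputStream
    import javax.xml.bind.DatatypeConverter
    import scala.io.Source

    def gunzipField(encoded: String): String = {
      val bytes = DatatypeConverter.parseBase64Binary(encoded)       // base64 -> compressed bytes
      val in = new GZIPInputStream(new ByteArrayInputStream(bytes))  // inflate the gzip stream
      try Source.fromInputStream(in, "UTF-8").mkString finally in.close()
    }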

Re: combining multiple JSON files to one DataFrame

2015-12-20 Thread Eran Witkon
Thanks, that works. One other thing - I have the following code: val jsonData = sqlContext.read.json("/home/eranw/Workspace/JSON/sample") jsonData.show() +----+----+----+----+ | cty| hse| nm| yrs|

Re: Yarn application ID for Spark job on Yarn

2015-12-20 Thread Steve Loughran
On 19 Dec 2015, at 13:34, Steve Loughran wrote: On 18 Dec 2015, at 21:39, Andrew Or wrote: Hi Roy, I believe Spark just gets its application ID from YARN, so you can just do

create hive table in Spark with Java code

2015-12-20 Thread Soni spark
Hi Friends, I have created a hive external table with partition. I want to alter the hive table partition through spark with java code. alter table table1 add if not exists partition(datetime='2015-12-01') location 'hdfs://localhost:54310/spark/twitter/datetime=2015-12-01/' The above query
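
A hedged sketch of running that statement through Spark (shown in Scala; the Java HiveContext call is the same sql(...) method), assuming Hive support is available:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql(
      "ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (datetime='2015-12-01') " +
      "LOCATION 'hdfs://localhost:54310/spark/twitter/datetime=2015-12-01/'")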

Re: Creating vectors from a dataframe

2015-12-20 Thread Yanbo Liang
Hi Arunkumar, If you want to create a vector from multiple columns of a DataFrame, Spark ML provides VectorAssembler to help with this. Yanbo 2015-12-21 13:44 GMT+08:00 Arunkumar Pillai: > Hi > > > I'm trying to use Linear Regression from ml library > > but the problem is the
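
A minimal VectorAssembler sketch, assuming the independent variables sit in columns named x1 and x2 (placeholder names) of the dataDF from the original mail:

    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))   // the independent-variable columns
      .setOutputCol("features")          // the vector column expected by LinearRegression
    val assembled = assembler.transform(dataDF)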

Re: error: not found: value StructType on 1.5.2

2015-12-20 Thread Peter Zhang
Hi Eran, Missing import package. import org.apache.spark.sql.types._ will work. please try. Peter Zhang --  Google Sent with Airmail On December 20, 2015 at 21:43:42, Eran Witkon (eranwit...@gmail.com) wrote: Hi, I am using spark-shell with version 1.5.2. scala>

Re: error: not found: value StructType on 1.5.2

2015-12-20 Thread Eran Witkon
Yes, this works... Thanks On Sun, Dec 20, 2015 at 3:57 PM Peter Zhang wrote: > Hi Eran, > > Missing import package. > > import org.apache.spark.sql.types._ > > will work. please try. > > Peter Zhang > -- > Google > Sent with Airmail > > On December 20, 2015 at 21:43:42,

Re: How to convert an RDD to DF?

2015-12-20 Thread Ted Yu
See the comment for createDataFrame(rowRDD: RDD[Row], schema: StructType) method: * Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s using the given schema. * It is important to make sure that the structure of every [[Row]] of the provided RDD matches * the provided schema.

error: not found: value StructType on 1.5.2

2015-12-20 Thread Eran Witkon
Hi, I am using spark-shell with version 1.5.2. scala> sc.version res17: String = 1.5.2 but when trying to use StructType I am getting an error: val struct = StructType( StructField("a", IntegerType, true) :: StructField("b", LongType, false) :: StructField("c", BooleanType, false) ::

Re: Getting an error in insertion to mysql through sparkcontext in java..

2015-12-20 Thread Ted Yu
Was there a stack trace following the error? Which Spark release are you using? Cheers > On Dec 19, 2015, at 10:43 PM, Sree Eedupuganti wrote: > > I had 9 rows in my MySQL table > > > options.put("dbtable", "(select * from employee"); > options.put("lowerBound",

How to convert an RDD to DF?

2015-12-20 Thread Eran Witkon
Hi, I have an RDD jsonGzip res3: org.apache.spark.rdd.RDD[(String, String, String, String)] = MapPartitionsRDD[8] at map at :65 which I want to convert to a DataFrame with a schema, so I created a schema: val schema = StructType( StructField("cty", StringType, false) ::

Re: How to convert an RDD to DF?

2015-12-20 Thread Eran Witkon
I might be missing your point but I don't get it. My understanding is that I need an RDD containing Rows, but how do I get it? I started with a DataFrame, ran a map on it and got the RDD[(String, String, String, String)]; now I want to convert it back to a DataFrame and it is failing. Why? On Sun, Dec 20,
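
For reference, a hedged sketch of the step Ted is pointing at, reusing the names from this thread (the field names follow the columns seen earlier and are assumptions): map each tuple to a Row so its structure matches the StructType, then call createDataFrame.

    import org.apache.spark.sql.Row

    // jsonGzip: RDD[(String, String, String, String)]; schema: the StructType built earlier
    val rowRDD = jsonGzip.map { case (cty, hse, nm, yrs) => Row(cty, hse, nm, yrs) }
    val df = sqlContext.createDataFrame(rowRDD, schema)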

Re: Kafka - streaming from multiple topics

2015-12-20 Thread Chris Fregly
separating out your code into separate streaming jobs - especially when there are no dependencies between the jobs - is almost always the best route. it's easier to combine atoms (fusion) than to split them (fission). I recommend splitting out jobs along batch window, stream window, and

Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Chris Fregly
hey Eran, I run into this all the time with Json. the problem is likely that your Json is "too pretty" and extends beyond a single line, which trips up the Json reader. my solution is usually to de-pretty the Json - either manually or through an ETL step - by stripping all white space before
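
One hedged way to do the de-prettying inside Spark itself, assuming each input file holds a single pretty-printed JSON record: read whole files, collapse the newlines, and hand the strings to read.json, which (pre-2.x) expects one record per line/string.

    val oneRecordPerString = sc.wholeTextFiles("/path/to/pretty/json")   // path is a placeholder
      .map { case (_, content) => content.replaceAll("\\s*\\n\\s*", " ") }
    val df = sqlContext.read.json(oneRecordPerString)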

Re: Hive error when starting up spark-shell in 1.5.2

2015-12-20 Thread Chris Fregly
hopping on a plane, but check the hive-site.xml that's in your spark/conf directory (or should be, anyway). I believe you can change the root path thru this mechanism. if not, this should give you more info to google on. let me know, as this comes up a fair amount. > On Dec 19, 2015, at 4:58 PM,
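
For reference (a hedged example; which property applies depends on which path the error message complains about), the hive-site.xml entries for these root paths look like:

    <property>
      <name>hive.exec.scratchdir</name>
      <value>/tmp/hive</value>            <!-- the scratch "root" dir; must be writable -->
    </property>
    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/user/hive/warehouse</value> <!-- where managed table data is rooted -->
    </property>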

Re: Hive error when starting up spark-shell in 1.5.2

2015-12-20 Thread Marco Mistroni
Thanks Chris, will give it a go and report back. Bizarrely, if I start the pyspark shell I don't see any issues. Kr Marco On 20 Dec 2015 5:02 pm, "Chris Fregly" wrote: > hopping on a plane, but check the hive-site.xml that's in your spark/conf > directory (or should be, anyway).

Re: How to do map join in Spark SQL

2015-12-20 Thread Chris Fregly
this type of broadcast should be handled by Spark SQL/DataFrames automatically. this is the primary cost-based, physical-plan query optimization that the Spark SQL Catalyst optimizer supports. in Spark 1.5 and before, you can trigger this optimization by properly setting the
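
A hedged sketch of the two knobs involved in Spark 1.5 (bigDF, smallDF, and the join key k are placeholders): raise the size threshold under which the smaller side is auto-broadcast, or hint the broadcast explicitly.

    import org.apache.spark.sql.functions.broadcast

    // raise the auto-broadcast threshold (in bytes); the 1.5.2 default is 10MB
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

    // or hint it explicitly on the small side of the join
    val joined = bigDF.join(broadcast(smallDF), bigDF("k") === smallDF("k"))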

Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Eran Witkon
Thanks for this! This was the problem... On Sun, 20 Dec 2015 at 18:49 Chris Fregly wrote: > hey Eran, I run into this all the time with Json. > > the problem is likely that your Json is "too pretty" and extending beyond > a single line which trips up the Json reader. > > my

Re: Word2Vec distributed?

2015-12-20 Thread Yao
I have a similar observation with 1.4.1, where the 3rd stage running mapPartitionsWithIndex at Word2Vec.scala:312 seems to run with a single thread (which takes forever for a reasonably large corpus). Can anyone help explain whether this is an algorithm limitation or whether model parameters can be

Numpy and dynamic loading

2015-12-20 Thread Abhinav M Kulkarni
I am running Spark programs on a large cluster (for which, I do not have administrative privileges). numpy is not installed on the worker nodes. Hence, I bundled numpy with my program, but I get the following error: Traceback (most recent call last): File "/home/user/spark-script.py", line 12,
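
NumPy includes compiled C extensions, so shipping it as a plain zip via --py-files generally does not work; a commonly used, hedged workaround is to ship an entire Python environment (e.g. a conda env or virtualenv tarball; env.tar.gz below is a placeholder) and point the executors at its interpreter:

    spark-submit \
      --master yarn-client \
      --archives env.tar.gz#pyenv \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/bin/python \
      --conf spark.executorEnv.PYSPARK_PYTHON=pyenv/bin/python \
      /home/user/spark-script.py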

Re: How to convert an RDD to DF?

2015-12-20 Thread Eran Witkon
Got it to work, thanks On Sun, 20 Dec 2015 at 17:01 Eran Witkon wrote: > I might be missing you point but I don't get it. > My understanding is that I need a RDD containing Rows but how do I get it? > > I started with a DataFrame > run a map on it and got the RDD

Re: Pyspark SQL Join Failure

2015-12-20 Thread Chris Fregly
how does Spark SQL/DataFrame know that train_users_2.csv has a field named "id" or anything else domain specific? is there a header? if so, does sc.textFile() know about this header? I'd suggest using the Databricks spark-csv package for reading csv data. there is an option in there to
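
A hedged sketch of the spark-csv read being suggested (shown in Scala; the pyspark call is analogous), assuming the package has been added, e.g. via --packages com.databricks:spark-csv_2.10:1.3.0:

    val users = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")         // use the first line for column names such as "id"
      .option("inferSchema", "true")    // optionally infer column types
      .load("train_users_2.csv")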

Re: Kafka - streaming from multiple topics

2015-12-20 Thread Neelesh
@Chris, There is a 1-1 mapping b/w spark partitions & kafka partitions out of the box. One can break it by repartitioning of course and add more parallelism, but that has its own issues around consumer offset management - when do I commit the offsets, for example. While it's trivial to