Re: Spark Intro

2015-07-14 Thread Akhil Das
This is where you can get started https://spark.apache.org/docs/latest/sql-programming-guide.html Thanks Best Regards On Mon, Jul 13, 2015 at 3:54 PM, vinod kumar vinodsachin...@gmail.com wrote: Hi Everyone, I am developing an application which handles bulk data, around millions (This may

Re: Spark executor memory information

2015-07-14 Thread Akhil Das
1. Yes, open up the web UI running on 8080 to see the memory/cores allocated to your workers, and open up the UI running on 4040 and click on the Executors tab to see the memory allocated for the executor. 2. The MLlib code can be found here https://github.com/apache/spark/tree/master/mllib and

Re: Does Spark Streaming support streaming from a database table?

2015-07-14 Thread Akhil Das
Why not add a trigger to your database table and, whenever it is updated, push the changes to Kafka etc. and use normal Spark Streaming? You can also write a receiver-based architecture https://spark.apache.org/docs/latest/streaming-custom-receivers.html for this, but that will be a bit time consuming.

Re: Standalone mode connection failure from worker node to master

2015-07-14 Thread sivarani
I am also facing the same issue, has anyone figured it out? Please help -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Standalone-mode-connection-failure-from-worker-node-to-master-tp23101p23816.html Sent from the Apache Spark User List mailing list archive at

RE: Share RDD from SparkR and another application

2015-07-14 Thread Sun, Rui
Hi, hari, I don't think job-server can work with SparkR (also pySpark). It seems it would be technically possible, but it needs support from job-server and SparkR (also pySpark), which doesn't exist yet. But there may be some indirect ways of sharing RDDs between SparkR and an application. For

Spark executor memory information

2015-07-14 Thread Naveen Dabas
Hi, I am new to Spark and need some guidance on the below-mentioned points: 1) I am using Spark 1.2; is it possible to see how much memory is being allocated to an executor from the web UI? If not, how can we figure that out? 2) I am interested in the source code of MLlib; is it possible to get access to

Re: How to speed up Spark process

2015-07-14 Thread ๏̯͡๏
genericRecordsAndKeys.persist(StorageLevel.MEMORY_AND_DISK) with 17 as repartitioning argument is throwing this exception: 7/13 23:26:36 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: org.apache.spark.SparkException: Job aborted due to

Re: Benchmark results between Flink and Spark

2015-07-14 Thread Jerry Lam
FYI, another benchmark: http://eastcirclek.blogspot.kr/2015/06/terasort-for-spark-and-flink-with-range.html quote: I have observed a lot of fetch failures while running Spark, which results in many restarted tasks and, therefore, takes the longest time. I suspect that executors are incapable of

Re: Spark Intro

2015-07-14 Thread vinod kumar
Hi Akhil Is my choice to switch to Spark a good one? I ask because I don't have enough information regarding the limitations and working environment of Spark. I tried Spark SQL but it seems to return data slower than MsSQL. (I have tested with data which has 4 records) On Tue, Jul 14, 2015 at

spark submit configuration on yarn

2015-07-14 Thread Pa Rö
hello community, i want to run my spark app on a cluster (cloudera 5.4.4) with 3 nodes (one pc has an i7 8-core with 16GB RAM). now i want to submit my spark job on yarn (20GB RAM). my script to submit the job is currently the following: export HADOOP_CONF_DIR=/etc/hadoop/conf/

Re: java.lang.IllegalStateException: unread block data

2015-07-14 Thread Akhil Das
Look in the worker logs and see what's going on. Thanks Best Regards On Tue, Jul 14, 2015 at 4:02 PM, Arthur Chan arthur.hk.c...@gmail.com wrote: Hi, I use Spark 1.4. When saving the model to HDFS, I got an error. Please help! Regards my scala command:

Re: Does Spark Streaming support streaming from a database table?

2015-07-14 Thread ayan guha
Hi At this moment we have the same requirement. Unfortunately, the database owners will not be able to push to a msg queue, but they have enabled Oracle CDC, which synchronously updates a replica of the production DB. Our task will be to query the replica and create msg streams to Kinesis. There is already an

Re: Including additional scala libraries in sparkR

2015-07-14 Thread Michal Haris
Ok thanks. It seems that --jars is not behaving as expected - getting class not found for even the most simple object from my lib. But anyways, I have to do at least a filter transformation before collecting the HBaseRDD into R so will have to go the route of using scala spark shell to transform

Udf's in spark

2015-07-14 Thread Ravisankar Mani
Hi Everyone, As mentioned in the Spark SQL programming guide, Spark SQL supports Hive UDFs. I have built the UDFs in the Hive metastore. It is working perfectly over a Hive connection, but it is not working in Spark (java.lang.RuntimeException: Couldn't find function DATE_FORMAT). Could you please help with how to

Re: Spark Intro

2015-07-14 Thread Akhil Das
It might take some time to understand the ecosystem. I'm not sure what kind of environment you have (#cores, memory, etc.). To start with, you can basically use a JDBC connector, or dump your data as CSV, load it into Spark, and query it (a rough sketch follows below). You get the advantage of caching if you
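For illustration, a minimal sketch of the JDBC-load-then-cache approach (Spark 1.4 style); the connection URL, table name, and credentials below are hypothetical placeholders, not values from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("jdbc-load"))
val sqlContext = new SQLContext(sc)

// Load the table through the JDBC data source (placeholder connection details).
val df = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:sqlserver://dbhost:1433;databaseName=mydb",
  "dbtable"  -> "mytable",
  "user"     -> "user",
  "password" -> "secret"
)).load()

df.cache()                        // keep the table in memory for repeated queries
df.registerTempTable("mytable")
sqlContext.sql("SELECT COUNT(*) FROM mytable").show()
```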

java.lang.IllegalStateException: unread block data

2015-07-14 Thread Arthur Chan
Hi, I use Spark 1.4. When saving the model to HDFS, I got an error. Please help! Regards my scala command: sc.makeRDD(model.clusterCenters,10).saveAsObjectFile(/tmp/tweets/model) The error log: 15/07/14 18:27:40 INFO SequenceFileRDDFunctions: Saving as sequence file of type

Re: Problems after upgrading to spark 1.4.0

2015-07-14 Thread Luis Ángel Vicente Sánchez
I have just restarted the job and it doesn't seem that the shutdown hook is executed. I have attached to this email the log from the driver. It seems that the slaves are not accepting the tasks... but we haven't changed anything on our mesos cluster, we have only upgraded one job to spark 1.4; is

Re: How to solve ThreadException in Apache Spark standalone Java Application

2015-07-14 Thread Hafsa Asif
I am still waiting for an answer. I want to know how to properly close everything related to Spark in a Java standalone app. -- View this message in context:

RE: Including additional scala libraries in sparkR

2015-07-14 Thread Sun, Rui
Could you give more details about the mis-behavior of --jars for SparkR? maybe it's a bug. From: Michal Haris [michal.ha...@visualdna.com] Sent: Tuesday, July 14, 2015 5:31 PM To: Sun, Rui Cc: Michal Haris; user@spark.apache.org Subject: Re: Including additional

About extra memory on yarn mode

2015-07-14 Thread Sea
Hi all: I have a question about why Spark on YARN needs extra memory. I apply for 10 executors with executor memory 6g, and I find that it allocates 1g more per executor, 7g in total for each executor. I tried to set spark.yarn.executor.memoryOverhead, but it did not help. 1g per executor is too

Re: java.lang.IllegalStateException: unread block data

2015-07-14 Thread Arthur Chan
Hi, Below is the log form the worker. 15/07/14 17:18:56 ERROR FileAppender: Error writing stream to file /spark/app-20150714171703-0004/5/stderr java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) at

Re: Research ideas using spark

2015-07-14 Thread Daniel Darabos
Hi Shahid, To be honest I think this question is better suited for Stack Overflow than for a PhD thesis. On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf sha...@trialx.com wrote: hi I have a 10 node cluster i loaded the data onto hdfs, so the no. of partitions i get is 9. I am running a spark

Re: Basic Spark SQL question

2015-07-14 Thread Ron Gonzalez
Cool thanks. Will take a look... Sent from my iPhone On Jul 13, 2015, at 6:40 PM, Michael Armbrust mich...@databricks.com wrote: I'd look at the JDBC server (a long running yarn job you can submit queries to)

Re: Does Spark Streaming support streaming from a database table?

2015-07-14 Thread focus
Hi, In our case, we have some data stored in an Oracle database table, and new records will be added to this table. We need to analyse the new records to calculate some values continuously, so we wrote a program to monitor the table every minute. Because every record has an increasing unique ID

Re: Few basic spark questions

2015-07-14 Thread Debasish Das
What do you need in SparkR that mllib / ml doesn't have? ...most of the basic analysis that you need on a stream can be done through mllib components... On Jul 13, 2015 2:35 PM, Feynman Liang fli...@databricks.com wrote: Sorry; I think I may have used poor wording. SparkR will let you use R to

Re: Create RDD from output of unix command

2015-07-14 Thread Hafsa Asif
Your question is very interesting. What I suggest is to copy your output to some text file, read the text file in your code, and create an RDD from it. Just consider the wordcount example from Spark; I love this example with the Java client. Well, Spark is an analytical engine and its slogan is to analyze big, big data

Re: java.lang.IllegalStateException: unread block data

2015-07-14 Thread Akhil Das
Someone else also reported this error with spark 1.4.0 Thanks Best Regards On Tue, Jul 14, 2015 at 6:57 PM, Arthur Chan arthur.hk.c...@gmail.com wrote: Hi, Below is the log form the worker. 15/07/14 17:18:56 ERROR FileAppender: Error writing stream to file

Re: Share RDD from SparkR and another application

2015-07-14 Thread harirajaram
A small correction to what I typed: it is not RDDBackend, it is RBackend, sorry. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Share-RDD-from-SparkR-and-another-application-tp23795p23828.html Sent from the Apache Spark User List mailing list archive at

Re: About extra memory on yarn mode

2015-07-14 Thread Jong Wook Kim
executor.memory only sets the maximum heap size of the executor; the JVM also needs non-heap memory to store class metadata, interned strings, and other native overheads coming from networking libraries, off-heap storage levels, etc. These are (of course) legitimate uses of resources, and you'll have
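As a rough illustration of where the extra ~1g comes from, under two assumptions (the defaults of the time: overhead = max(384 MB, 10% of executor memory), and a YARN scheduler that rounds container requests up to 1024 MB increments):

```scala
val executorMemoryMb = 6 * 1024                                        // --executor-memory 6g
val overheadMb       = math.max(384, (executorMemoryMb * 0.10).toInt)  // 614 MB by default
val requestedMb      = executorMemoryMb + overheadMb                   // 6758 MB requested from YARN
val yarnIncrementMb  = 1024                                            // assumed minimum-allocation-mb
val containerMb      = math.ceil(requestedMb.toDouble / yarnIncrementMb).toInt * yarnIncrementMb
println(containerMb)                                                   // 7168 MB, i.e. the observed 7g
```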

Re: Including additional scala libraries in sparkR

2015-07-14 Thread Shivaram Venkataraman
There was a fix for `--jars` that went into 1.4.1 https://github.com/apache/spark/commit/2579948bf5d89ac2d822ace605a6a4afce5258d6 Shivaram On Tue, Jul 14, 2015 at 4:18 AM, Sun, Rui rui@intel.com wrote: Could you give more details about the mis-behavior of --jars for SparkR? maybe it's a

Re: Few basic spark questions

2015-07-14 Thread Oded Maimon
Hi, Thanks for all the help. I'm still missing something very basic. If I won't use SparkR, which doesn't support streaming (I will use MLlib instead, as Debasish suggested), and I have my Scala receiver working, how should the receiver save the data in memory? I do see the store method, so if i use
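For reference, a minimal custom-receiver sketch showing where store() fits in; fetchNextRecord() is a hypothetical stand-in for reading from your own source:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart(): Unit = {
    // Receiving must not block onStart(), so do it on a separate thread.
    new Thread("my-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          val record = fetchNextRecord()   // hypothetical: read from your source
          store(record)                    // hand the record to Spark Streaming
        }
      }
    }.start()
  }

  def onStop(): Unit = { /* close connections, stop the thread */ }

  private def fetchNextRecord(): String = ???  // placeholder for the real source
}
```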

Re: Create RDD from output of unix command

2015-07-14 Thread Igor Berman
haven't you thought about spark streaming? there is a thread that could help https://www.mail-archive.com/user%40spark.apache.org/msg30105.html On 14 July 2015 at 18:20, Hafsa Asif hafsa.a...@matchinguu.com wrote: Your question is very interesting. What I suggest is, that copy your output in

How to maintain multiple JavaRDD created within another method like javaStreamRDD.forEachRDD

2015-07-14 Thread unk1102
I use Spark Streaming where messages read from Kafka topics are stored into a JavaDStream<String>; this rdd contains the actual data. Now, after going through the documentation and other help, I have found we traverse a JavaDStream using foreachRDD: javaDStreamRdd.foreachRDD(new Function<JavaRDD<String>, Void>() {

Re: Spark application with a RESTful API

2015-07-14 Thread Hafsa Asif
I have almost the same case. I will tell you what I am actually doing, if it is according to your requirement, then I will love to help you. 1. my database is aerospike. I get data from it. 2. written standalone spark app (it does not run in standalone mode, but with simple java command or maven

Re: correct Scala Imports for creating DFs from RDDs?

2015-07-14 Thread DW @ Gmail
You are mixing the 1.0.0 Spark SQL jar with Spark 1.4.0 jars in your build file Sent from my rotary phone. On Jul 14, 2015, at 7:57 AM, ashwang168 ashw...@mit.edu wrote: Hello! I am currently using Spark 1.4.0, scala 2.10.4, and sbt 0.13.8 to try and create a jar file from a scala file

No. of Task vs No. of Executors

2015-07-14 Thread shahid
hi, I have a 10 node cluster. i loaded the data onto hdfs, so the no. of partitions i get is 9. I am running a spark application; it gets stuck on one of the tasks, and looking at the UI it seems the application is not using all nodes to do the calculations. attached is the screenshot of the tasks; it seems tasks

Re: Share RDD from SparkR and another application

2015-07-14 Thread harirajaram
I appreciate your reply. Yes, you are right about putting it in parquet etc. and reading it from another app; I would rather use spark-jobserver or the IBM kernel to achieve the same if it is not SparkR, as it gives more flexibility/scalability. Anyway, I have found a way to run R for my poc from my existing app

Re: hive-site.xml spark1.3

2015-07-14 Thread Akhil Das
Try adding it in your SPARK_CLASSPATH inside conf/spark-env.sh file. Thanks Best Regards On Tue, Jul 14, 2015 at 7:05 AM, Jerrick Hoang jerrickho...@gmail.com wrote: Hi all, I'm having conf/hive-site.xml pointing to my Hive metastore but sparksql CLI doesn't pick it up. (copying the same

Re: Standalone mode connection failure from worker node to master

2015-07-14 Thread Akhil Das
Can you paste your conf/spark-env.sh file? Put SPARK_MASTER_IP as the master machine's host name in spark-env.sh file. Also add your slaves hostnames into conf/slaves file and do a sbin/start-all.sh Thanks Best Regards On Tue, Jul 14, 2015 at 1:26 PM, sivarani whitefeathers...@gmail.com wrote:

Re: How to solve ThreadException in Apache Spark standalone Java Application

2015-07-14 Thread Yana Kadiyska
Have you seen this SO thread: http://stackoverflow.com/questions/13471519/running-daemon-with-exec-maven-plugin This seems to be more related to the plugin than Spark, looking at the stack trace On Tue, Jul 14, 2015 at 8:11 AM, Hafsa Asif hafsa.a...@matchinguu.com wrote: I m still looking

Re: No. of Task vs No. of Executors

2015-07-14 Thread ayan guha
Hi As you can see, Spark has taken data locality into consideration and thus scheduled all tasks as node local. It is because spark could run task on a node where data is present, so spark went ahead and scheduled the tasks. It is actually good for reading. If you really want to fan out

correct Scala Imports for creating DFs from RDDs?

2015-07-14 Thread ashwang168
Hello! I am currently using Spark 1.4.0, scala 2.10.4, and sbt 0.13.8 to try and create a jar file from a scala file (attached above) and run it using spark-submit. I am also using Hive, Hadoop 2.6.0-cdh5.4.0 which has the files that I'm trying to read in. Currently I am very confused about how

Re: Spark application with a RESTful API

2015-07-14 Thread Debasish Das
How do you manage the spark context elastically when your load grows from 1000 users to 1 users ? On Tue, Jul 14, 2015 at 8:31 AM, Hafsa Asif hafsa.a...@matchinguu.com wrote: I have almost the same case. I will tell you what I am actually doing, if it is according to your requirement,

Why does SparkSubmit process takes so much virtual memory in yarn-cluster mode ?

2015-07-14 Thread Elkhan Dadashov
More particular example: I run pi.py Spark Python example in *yarn-cluster* mode (--master) through SparkLauncher in Java. While the program is running, these are the stats of how much memory each process takes: SparkSubmit process : 11.266 *gigabyte* Virtual Memory ApplicationMaster process:

Re: Finding moving average using Spark and Scala

2015-07-14 Thread Feynman Liang
If your rows may have NAs in them, I would process each column individually by first projecting the column ( map(x => x.nameOfColumn) ), filtering out the NAs, then running a summarizer over each column. Even if you have many rows, after summarizing you will only have a vector of length #columns.
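A small sketch of that per-column flow, assuming a hypothetical case class with one numeric field that may be NaN:

```scala
import org.apache.spark.rdd.RDD

case class Record(nameOfColumn: Double /*, other columns ... */)

def columnStats(rows: RDD[Record]) =
  rows.map(x => x.nameOfColumn)   // project the single column
      .filter(!_.isNaN)           // drop the NAs before summarizing
      .stats()                    // count, mean, stdev, min, max for that column
```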

spark on yarn

2015-07-14 Thread Shushant Arora
I am running spark application on yarn managed cluster. When I specify --executor-cores 4 it fails to start the application. I am starting the app as spark-submit --class classname --num-executors 10 --executor-cores 5 --master masteradd jarname Exception in thread main

ProcessBuilder in SparkLauncher is memory inefficient for launching new process

2015-07-14 Thread Elkhan Dadashov
Hi all, If you want to launch a Spark job from Java in a programmatic way, then you need to use SparkLauncher. SparkLauncher uses ProcessBuilder for creating the new process - Java seems to handle process creation in an inefficient way. When you execute a process, you must first fork() and then exec().

Re: spark on yarn

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 9:57 AM, Shushant Arora shushantaror...@gmail.com wrote: When I specify --executor-cores 4 it fails to start the application. When I give --executor-cores as 4 , it works fine. Do you have any NM that advertises more than 4 available cores? Also, it's always worth it

Re: spark on yarn

2015-07-14 Thread Shushant Arora
Ok thanks a lot! few more doubts : What happens in a streaming application say with spark-submit --class classname --num-executors 10 --executor-cores 4 --master masteradd jarname Will it allocate 10 containers throughout the life of streaming application on same nodes until any node failure

Re: spark on yarn

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 11:13 AM, Shushant Arora shushantaror...@gmail.com wrote: spark-submit --class classname --num-executors 10 --executor-cores 4 --master masteradd jarname Will it allocate 10 containers throughout the life of streaming application on same nodes until any node failure

To access elements of a org.apache.spark.mllib.linalg.Vector

2015-07-14 Thread Dan Dong
Hi, I'm wondering how to access elements of a linalg.Vector, e.g: sparseVector: Seq[org.apache.spark.mllib.linalg.Vector] = List((3,[1,2],[1.0,2.0]), (3,[0,1,2],[3.0,4.0,5.0])) scala> sparseVector(1) res16: org.apache.spark.mllib.linalg.Vector = (3,[0,1,2],[3.0,4.0,5.0]) How to get the

Re: Master vs. Slave Nodes Clarification

2015-07-14 Thread algermissen1971
On 14 Jul 2015, at 23:26, Tathagata Das t...@databricks.com wrote: Just to be clear, you mean the Spark Standalone cluster manager's master and not the application's driver, right? Sorry, by now I have understood that I would not necessarily put the driver app on the master node and that

Re: spark streaming with kafka reset offset

2015-07-14 Thread Cody Koeninger
You have access to the offset ranges for a given rdd in the stream by typecasting to HasOffsetRanges. You can then store the offsets wherever you need to. On Tue, Jul 14, 2015 at 5:00 PM, Chen Song chen.song...@gmail.com wrote: A follow up question. When using createDirectStream approach,

Data Frame for nested json

2015-07-14 Thread spark user
Does DataFrame support nested JSON to dump directly to a database? For simple JSON it is working fine: {id:2,name:Gerald,email:gbarn...@zimbio.com,city:Štoky,country:Czech Republic,ip:92.158.154.75}, but for nested JSON it fails to load: root |-- rows: array (nullable = true) |    |-- element:

Re: Spark on EMR with S3 example (Python)

2015-07-14 Thread Sujit Pal
Hi Roberto, I have written PySpark code that reads from private S3 buckets, it should be similar for public S3 buckets as well. You need to set the AWS access and secret keys into the SparkContext, then you can access the S3 folders and files with their s3n:// paths. Something like this: sc =
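A Scala sketch of the same idea (the thread's code is PySpark); the bucket path and keys are placeholders, and depending on the setup a public bucket may not need credentials at all:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3-read"))

// Put the S3 credentials on the SparkContext's Hadoop configuration, then use s3n:// paths.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

val lines = sc.textFile("s3n://some-bucket/some/prefix/*")  // placeholder path
println(lines.count())
```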

Re: Master vs. Slave Nodes Clarification

2015-07-14 Thread Tathagata Das
Yep :) On Tue, Jul 14, 2015 at 2:44 PM, algermissen1971 algermissen1...@icloud.com wrote: On 14 Jul 2015, at 23:26, Tathagata Das t...@databricks.com wrote: Just to be clear, you mean the Spark Standalone cluster manager's master and not the applications driver, right. Sorry, by now I

Re: spark on yarn

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 12:03 PM, Shushant Arora shushantaror...@gmail.com wrote: Can a container have multiple JVMs running in YARN? Yes and no. A container runs a single command, but that process can start other processes, and those also count towards the resource usage of the container

Re: How to speed up Spark process

2015-07-14 Thread ๏̯͡๏
Any solutions to solve this exception ? org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1 at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:389) at

Misaligned Rows with UDF

2015-07-14 Thread pedro
Hi, I am working at finding the root cause of a bug where rows in dataframes seem to have misaligned data. My dataframes have two types of columns: columns from data and columns from UDFs. I seem to be having trouble where for a given row, the row data doesn't match the data used to compute the

Re: master compile broken for scala 2.11

2015-07-14 Thread Josh Rosen
I've opened a PR to fix this; please take a look: https://github.com/apache/spark/pull/7405 On Tue, Jul 14, 2015 at 11:22 AM, Koert Kuipers ko...@tresata.com wrote: it works for scala 2.10, but for 2.11 i get: [ERROR]

Re: spark streaming with kafka reset offset

2015-07-14 Thread Chen Song
A follow up question. When using createDirectStream approach, the offsets are checkpointed to HDFS and it is understandable by Spark Streaming job. Is there a way to expose the offsets via a REST api to end users. Or alternatively, is there a way to have offsets committed to Kafka Offset Manager

Re: To access elements of a org.apache.spark.mllib.linalg.Vector

2015-07-14 Thread Dan Dong
Yes, it works! Thanks a lot Burak! Cheers, Dan 2015-07-14 14:34 GMT-05:00 Burak Yavuz brk...@gmail.com: Hi Dan, You could zip the indices with the values if you like. ``` val sVec = sparseVector(1).asInstanceOf[ org.apache.spark.mllib.linalg.SparseVector] val map =

Re: Master vs. Slave Nodes Clarification

2015-07-14 Thread Tathagata Das
Just to be clear, you mean the Spark Standalone cluster manager's master and not the application's driver, right? In that case, the earlier responses are correct. TD On Tue, Jul 14, 2015 at 11:26 AM, Mohammed Guller moham...@glassbeam.com wrote: The master node does not have to be similar to

Re: To access elements of a org.apache.spark.mllib.linalg.Vector

2015-07-14 Thread Burak Yavuz
Hi Dan, You could zip the indices with the values if you like. ``` val sVec = sparseVector(1).asInstanceOf[ org.apache.spark.mllib.linalg.SparseVector] val map = sVec.indices.zip(sVec.values).toMap ``` Best, Burak On Tue, Jul 14, 2015 at 12:23 PM, Dan Dong dongda...@gmail.com wrote: Hi,
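The snippet above, completed into a self-contained example:

```scala
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

val sparseVector = Seq(
  Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
  Vectors.sparse(3, Array(0, 1, 2), Array(3.0, 4.0, 5.0))
)

val sVec = sparseVector(1).asInstanceOf[SparseVector]
val map  = sVec.indices.zip(sVec.values).toMap   // Map(0 -> 3.0, 1 -> 4.0, 2 -> 5.0)
println(map(2))                                  // element at index 2 -> 5.0
```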

DataFrame.withColumn() recomputes columns even after cache()

2015-07-14 Thread pnpritchard
Hi! I am seeing some unexpected behavior with regards to cache() in DataFrames. Here goes: In my Scala application, I have created a DataFrame that I run multiple operations on. It is expensive to recompute the DataFrame, so I have called cache() after it gets created. I notice that the

Java 8 vs Scala

2015-07-14 Thread spark user
Hi All, To start a new project in Spark, which technology is good: Java 8 or Scala? I am a Java developer. Can I start with Java 8, or do I need to learn Scala? Which one is the better technology for a quick start on any POC project? Thanks - su

Re: Java 8 vs Scala

2015-07-14 Thread Ted Yu
See previous thread: http://search-hadoop.com/m/q3RTtaXamv1nFTGR On Tue, Jul 14, 2015 at 1:30 PM, spark user spark_u...@yahoo.com.invalid wrote: Hi All To Start new project in Spark , which technology is good .Java8 OR Scala . I am Java developer , Can i start with Java 8 or I Need to

Re: Java 8 vs Scala

2015-07-14 Thread Vineel Yalamarthy
Good question. Like you, many are in the same boat (coming from a Java background). Looking forward to responses from the community. Regards Vineel On Tue, Jul 14, 2015 at 2:30 PM, spark user spark_u...@yahoo.com.invalid wrote: Hi All To Start new project in Spark , which technology is good

RE: Spark on EMR with S3 example (Python)

2015-07-14 Thread Pagliari, Roberto
Hi Sujit, I just wanted to access public datasets on Amazon. Do I still need to provide the keys? Thank you, From: Sujit Pal [mailto:sujitatgt...@gmail.com] Sent: Tuesday, July 14, 2015 3:14 PM To: Pagliari, Roberto Cc: user@spark.apache.org Subject: Re: Spark on EMR with S3 example (Python)

Re: Sessionization using updateStateByKey

2015-07-14 Thread Tathagata Das
[Apologies for repost, for those who have seen this response already in the dev mailing list] 1. When you set ssc.checkpoint(checkpointDir), the spark streaming periodically saves the state RDD (which is a snapshot of all the state data) to HDFS using RDD checkpointing. In fact, a streaming app

Re: java.lang.IllegalStateException: unread block data

2015-07-14 Thread Arthur Chan
I found the reason, it is about sc. Thanks On Tue, Jul 14, 2015 at 9:45 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Someone else also reported this error with spark 1.4.0 Thanks Best Regards On Tue, Jul 14, 2015 at 6:57 PM, Arthur Chan arthur.hk.c...@gmail.com wrote: Hi, Below is

Re: rest on streaming

2015-07-14 Thread Chen Song
Thanks TD, that is very useful. On Tue, Jul 14, 2015 at 10:19 PM, Tathagata Das t...@databricks.com wrote: You can do this. // global variable to keep track of latest stuff var latestTime = _ var latestRDD = _ dstream.foreachRDD((rdd: RDD[..], time: Time) = { latestTime = time
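A compilable variant of the pattern quoted above; the element type String and the variable names are assumptions made to keep the sketch self-contained:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

@volatile var latestTime: Time = null
@volatile var latestRDD: RDD[String] = null

def trackLatest(dstream: DStream[String]): Unit =
  dstream.foreachRDD { (rdd: RDD[String], time: Time) =>
    latestTime = time
    latestRDD  = rdd        // a REST handler can now query latestRDD on demand
  }
```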

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
Of course, exactly-once receiving is not the same as exactly-once processing. In the case of the direct kafka stream, the data may actually be pulled multiple times. But even if the data of a batch is pulled twice because of some failure, the final result (that is, the transformed data accessed through foreachRDD) will

RE: How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Cheng, Hao
Can you describe how you cached the tables? In another HiveContext? AFAIK, a cached table is only visible within the same HiveContext; you probably need to execute a sql query like “cache table mytable as SELECT xxx” in the JDBC connection as well. Cheng Hao From: Brandon White
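A sketch of that suggestion: cache the table from within the same context the JDBC clients see (against the Thrift server the SQL would be issued directly, e.g. from Beeline); the table and column names here are placeholders:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: the existing SparkContext
hiveContext.sql("CACHE TABLE mytable_cached AS SELECT * FROM mytable WHERE year = 2015")

// Subsequent queries through the same context/connection hit the in-memory table.
hiveContext.sql("SELECT COUNT(*) FROM mytable_cached").show()
```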

RE: How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Cheng, Hao
So you’re using different HiveContext instances for the caching. Tables cached in one HiveContext instance are not expected to be visible from the other. From: Brandon White [mailto:bwwintheho...@gmail.com] Sent: Wednesday, July 15, 2015 8:48 AM To: Cheng, Hao Cc: user Subject: Re: How do you

Re: spark streaming with kafka reset offset

2015-07-14 Thread Chen Song
Thanks TD and Cody. I saw that. 1. By doing that (foreachRDD), does KafkaDStream checkpoints its offsets on HDFS at the end of each batch interval? 2. In the code, if I first apply transformations and actions on the directKafkaStream and then use foreachRDD on the original KafkaDStream to commit

Re: spark streaming with kafka reset offset

2015-07-14 Thread Chen Song
Thanks TD. As for 1), if the timing is not guaranteed, how are exactly-once semantics supported? It feels like exactly-once receiving is not necessarily exactly-once processing. Chen On Tue, Jul 14, 2015 at 10:16 PM, Tathagata Das t...@databricks.com wrote: On Tue, Jul 14, 2015 at 6:42 PM,

Re: Stopping StreamingContext before receiver has started

2015-07-14 Thread Tathagata Das
This is a known race condition - root cause of SPARK-5681 https://issues.apache.org/jira/browse/SPARK-5681 On Mon, Jul 13, 2015 at 3:35 AM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi, I have noticed that when StreamingContext.stop is called when no receiver has

Getting not implemented by the TFS FileSystem implementation

2015-07-14 Thread Jerrick Hoang
Hi all, I'm upgrading from spark 1.3 to spark 1.4, and when trying to run the spark-sql CLI it gave a ```java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation``` exception. I did not get this error with 1.3 and I don't use any TFS FileSystem. The full stack trace is

rest on streaming

2015-07-14 Thread Chen Song
I have been POC adding a rest service in a Spark Streaming job. Say I create a stateful DStream X by using updateStateByKey, and each time there is a HTTP request, I want to apply some transformations/actions on the latest RDD of X and collect the results immediately but not scheduled by streaming

Re: Why does SparkSubmit process takes so much virtual memory in yarn-cluster mode ?

2015-07-14 Thread Elkhan Dadashov
Thanks, Marcelo. That article confused me; thanks for correcting it and for the helpful tips. Looking into virtual memory usage (jmap + jvisualvm) does not show that 11.5 g of virtual memory usage - it is much less. I get the 11.5 g virtual memory usage using the top -p pid command for the SparkSubmit process. The

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
Relevant documentation - https://spark.apache.org/docs/latest/streaming-kafka-integration.html, towards the end. directKafkaStream.foreachRDD { rdd => val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges // offsetRanges.length = # of Kafka partitions being consumed ... } On Tue,
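The documented pattern, completed into a fuller sketch; ssc, kafkaParams, and topics are assumed to be defined as usual for a direct stream, and where the offsets are persisted is up to the application:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)   // ssc, kafkaParams, topics defined elsewhere

directKafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    // Persist topic/partition/offsets wherever the application needs them.
    println(s"${o.topic} ${o.partition} ${o.fromOffset} -> ${o.untilOffset}")
  }
}
```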

Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-14 Thread Kelly, Jonathan
I've set up my cluster with a pre-calculated value for spark.executor.instances in spark-defaults.conf such that I can run a job and have it maximize the utilization of the cluster resources by default. However, if I want to run a job with dynamicAllocation (by passing -c

SparkSQL 1.4 can't accept registration of UDF?

2015-07-14 Thread ogoh
Hello, I am using SparkSQL along with the ThriftServer so that we can access it using Hive queries. With Spark 1.3.1, I can register a UDF function, but Spark 1.4.0 doesn't work for that. The jar of the udf is the same. Below are the logs; I appreciate any advice. == With Spark 1.4 Beeline version 1.4.0 by

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread Ted Yu
Please take a look at SPARK-2365 which is in progress. On Tue, Jul 14, 2015 at 5:18 PM, swetha swethakasire...@gmail.com wrote: Hi, Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark Streaming to do lookups/updates/deletes in RDDs using keys by storing them as

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread Ted Yu
bq. that is, key-value stores Please consider HBase for this purpose :-) On Tue, Jul 14, 2015 at 5:55 PM, Tathagata Das t...@databricks.com wrote: I do not recommend using IndexRDD for state management in Spark Streaming. What it does not solve out-of-the-box is checkpointing of indexRDDs,

Re: Spark Streaming - Inserting into Tables

2015-07-14 Thread Tathagata Das
Why is .remember not ideal? On Sun, Jul 12, 2015 at 7:22 PM, Brandon White bwwintheho...@gmail.com wrote: Hi Yin, Yes there were no new rows. I fixed it by doing a .remember on the context. Obviously, this is not ideal. On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai yh...@databricks.com wrote:

Re: Why does SparkSubmit process takes so much virtual memory in yarn-cluster mode ?

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 3:42 PM, Elkhan Dadashov elkhan8...@gmail.com wrote: I looked into Virtual memory usage (jmap+jvisualvm) does not show that 11.5 g Virtual Memory usage - it is much less. I get 11.5 g Virtual memory usage using top -p pid command for SparkSubmit process. If you're

Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread swetha
Hi, Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark Streaming to do lookups/updates/deletes in RDDs using keys by storing them as key/value pairs. Thanks, Swetha -- View this message in context:

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread Tathagata Das
I do not recommend using IndexedRDD for state management in Spark Streaming. What it does not solve out-of-the-box is checkpointing of IndexedRDDs, which is important because long-running streaming jobs can lead to an infinite chain of RDDs. Spark Streaming solves this for the updateStateByKey operation, which

MLlib LogisticRegressionWithLBFGS error

2015-07-14 Thread Vi Ngo Van
Hi All, I've met an issue with MLlib when I use LogisticRegressionWithLBFGS. My sample data: *0 863:1 40646:1 37697:1 1423:1 38648:1 4230:1 23823:1 41594:1 27614:1 5689:1 18493:1 44187:1 5694:1 27799:1 12010:1* *0 863:1 40646:1 37697:1 1423:1 38648:1 4230:1 23823:1 41594:1 27614:1 5689:1 18493:1

Re: DataFrame.withColumn() recomputes columns even after cache()

2015-07-14 Thread pnpritchard
I was able to workaround this by converting the DataFrame to an RDD and then back to DataFrame. This seems very weird to me, so any insight would be much appreciated! Thanks, Nick P.S. Here's the updated code with the workaround: ``` // Examples udf's that println when called val twice
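For readers following along, a condensed sketch of the workaround described; names like expensiveDF, someCol, and the twice UDF are stand-ins inferred from the thread, not exact code:

```scala
val cachedDF = expensiveDF.cache()   // the expensive-to-recompute DataFrame

// Rebuild the DataFrame from its RDD of Rows plus the original schema,
// which pins the computed values instead of re-running the UDF lineage.
val pinnedDF = sqlContext.createDataFrame(cachedDF.rdd, cachedDF.schema)

val withExtra = pinnedDF.withColumn("twiceCol", twice(pinnedDF("someCol")))
```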

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
On Tue, Jul 14, 2015 at 6:42 PM, Chen Song chen.song...@gmail.com wrote: Thanks TD and Cody. I saw that. 1. By doing that (foreachRDD), does KafkaDStream checkpoints its offsets on HDFS at the end of each batch interval? The timing is not guaranteed. 2. In the code, if I first apply

How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Brandon White
Hello there, I have a JDBC connection set up to my Spark cluster but I cannot see the tables that I cache in memory. The only tables I can see are those that are in my Hive instance. I use a HiveContext to register a table and cache it in memory. How can I enable my JDBC connection to query this

Re: creating a distributed index

2015-07-14 Thread swetha
Hi Ankur, Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark Streaming to do lookups/updates/deletes in RDDs using keys by storing them as key/value pairs. Thanks, Swetha -- View this message in context:

Using reference for RDD is safe?

2015-07-14 Thread Abarah
Hello, I am wondering what will happen if I use a reference for transforming an RDD, for example: def func1(rdd: RDD[Int]): RDD[Int] = { rdd.map(x => x * 2) // example transformation, but I am using a more complex function } def main() { . val myrdd = sc.parallelize(1 to 100)
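A self-contained version of the pattern being asked about; passing an RDD reference into a function is safe because transformations are lazy and return a new RDD rather than mutating the original:

```scala
import org.apache.spark.rdd.RDD

def func1(rdd: RDD[Int]): RDD[Int] = rdd.map(x => x * 2)  // builds a new lineage only

val myrdd   = sc.parallelize(1 to 100)   // sc: the usual SparkContext
val doubled = func1(myrdd)               // myrdd itself is unchanged
println(doubled.sum())                   // 10100.0
```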

Re: Spark Intro

2015-07-14 Thread vinod kumar
Thank you Hafsa On Tue, Jul 14, 2015 at 11:09 AM, Hafsa Asif hafsa.a...@matchinguu.com wrote: Hi, I was also in the same situation as we were using MySQL. Let me give some clearfications: 1. Spark provides a great methodology for big data analysis. So, if you want to make your system more

Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
Dear all, I am trying to join two RDDs, named rdd1 and rdd2. rdd1 is loaded from a textfile with about 33000 records. rdd2 is loaded from a table in cassandra which has about 3 billion records. I tried the following code: ```scala val rdd1 : (String, XXX) = sc.textFile(...).map(...) import

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
Dear all, I have found a post discussing the same thing: https://groups.google.com/a/lists.datastax.com/forum/#!searchin/spark-connector-user/join/spark-connector-user/q3GotS-n0Wk/g-LPTteCEg0J The solution is using joinWithCassandraTable and the documentation is here:
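A hedged sketch of that joinWithCassandraTable approach (spark-cassandra-connector); the keyspace, table, input path, and the assumption of a single text partition-key column are placeholders, not details from the thread:

```scala
import com.datastax.spark.connector._

// Left side: the small RDD of keys loaded from a text file.
val rdd1 = sc.textFile("hdfs:///some/input").map { line =>
  val fields = line.split(",")
  (fields(0), fields(1))                        // (key, payload)
}

// Look each key up directly in Cassandra instead of shuffling the 3-billion-row table.
val joined = rdd1
  .map { case (key, _) => Tuple1(key) }         // only the partition-key column is needed
  .joinWithCassandraTable("my_keyspace", "my_table")
```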

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
I don't understand. By the way, the `joinWithCassandraTable` does improve my query time from 40 mins to 3 mins. 2015-07-15 13:19 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com: I have explored spark joins for last few months (you can search my posts) and its frustrating useless. On Tue, Jul
