Spark with MapDB

2015-12-07 Thread Ramkumar V
Hi, I'm running Java over Spark in cluster mode. I want to apply a filter on a JavaRDD based on some previous batch values. If I store those values in MapDB, is it possible to apply the filter during the current batch? *Thanks*,

Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Ruslan Dautkhanov
Try Phoenix from Cloudera parcel distribution https://blog.cloudera.com/blog/2015/11/new-apache-phoenix-4-5-2-package-from-cloudera-labs/ They may have better Kerberos support .. On Tue, Dec 8, 2015 at 12:01 AM Akhilesh Pathodia < pathodia.akhil...@gmail.com> wrote: > Yes, its a kerberized clus

Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Akhilesh Pathodia
Yes, it's a kerberized cluster and the ticket was generated using the kinit command before running the Spark job. That's why Spark on HBase worked, but when Phoenix is used to get the connection to HBase, it does not pass the authentication to all nodes. Probably it is not handled in Phoenix version 4.3 or Spark

HiveContext creation failed with Kerberos

2015-12-07 Thread Neal Yin
Hi I am using Spark 1.5.1 with CDH 5.4.2. My cluster is kerberos protected. Here is pseudocode for what I am trying to do. ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(“foo", “…") ugi.doAs( new PrivilegedExceptionAction() { val sparkConf: SparkConf = createSparkConf(…)
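
A fuller sketch of the doAs pattern in the pseudocode above, assuming a placeholder principal and keytab path (the SparkConf settings from the original message are omitted):

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Principal and keytab path are placeholders.
    val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "foo@EXAMPLE.COM", "/path/to/foo.keytab")

    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        val sparkConf = new SparkConf().setAppName("kerberized-hive")
        val sc = new SparkContext(sparkConf)
        // The HiveContext constructor is where the reported failure occurs.
        val hiveContext = new HiveContext(sc)
        hiveContext.sql("show databases").show()
        sc.stop()
      }
    })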

Re: NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable

2015-12-07 Thread Fengdong Yu
Can you try like this in your sbt: val spark_version = "1.5.2" val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api") val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty") libraryDependencies ++= Seq( "org.apache.spark" %% "spark

NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable

2015-12-07 Thread Sunil Tripathy
I am getting the following exception when I use spark-submit to submit a spark streaming job. Exception in thread "main" java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper;

Unable to acces hive table (created through hive context) in hive console

2015-12-07 Thread Divya Gehlot
Hi, I am a newbie to Spark, using HDP 2.2 which comes with Spark 1.3.1. I tried the following code example > import org.apache.spark.sql.SQLContext > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.implicits._ > > val personFile = "/user/hdfs/TestSpark/Person.csv" > val

Kryo Serialization in Spark

2015-12-07 Thread prasad223
Hi All, I'm unable to use Kryo serializer in my Spark program. I'm loading a graph from an edgelist file using GraphLoader and performing a BFS using pregel API. But I get the below mentioned error while I'm running. Can anybody tell me what is the right way to serialize a class in Spark and what

Best way to save key-value pair rdd ?

2015-12-07 Thread Anup Sawant
Hello, what would be the best way to save a key-value pair RDD so that I don't have to convert the saved records into tuples when reading the RDD back into Spark? -- Best, Anup

Re: python rdd.partionBy(): any examples of a custom partitioner?

2015-12-07 Thread Fengdong Yu
refer here: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html of section: Example 4-27. Python custom partitioner > On Dec 8, 2015, at 10:07 AM, Keith Freeman <8fo...@gmail.com> wrote: > > I'm not a python expert, so I'm wondering if anybody has a working

python rdd.partionBy(): any examples of a custom partitioner?

2015-12-07 Thread Keith Freeman
I'm not a python expert, so I'm wondering if anybody has a working example of a partitioner for the "partitionFunc" argument (default "portable_hash") to rdd.partitionBy()?

Re: Local Mode: Executor thread leak?

2015-12-07 Thread Richard Marscher
Thanks for the response. The version is Spark 1.5.2. Some examples of the thread names: pool-1061-thread-1 pool-1059-thread-1 pool-1638-thread-1 There become hundreds then thousands of these stranded in WAITING. I added logging to try to track the lifecycle of the thread pool in Executor as me

Re: Local Mode: Executor thread leak?

2015-12-07 Thread Shixiong Zhu
Which version are you using? Could you post these thread names here? Best Regards, Shixiong Zhu 2015-12-07 14:30 GMT-08:00 Richard Marscher : > Hi, > > I've been running benchmarks against Spark in local mode in a long running > process. I'm seeing threads leaking each time it runs a job. It doe

Spark SQL - saving to multiple partitions in parallel - FileNotFoundException on _temporary directory possible bug?

2015-12-07 Thread Deenar Toraskar
Hi I have a process that writes to multiple partitions of the same table in parallel using multiple threads sharing the same SQL context df.write.partitionedBy("partCol").insertInto("tableName") . I am getting FileNotFoundException on _temporary directory. Each write only goes to a single partitio

Local Mode: Executor thread leak?

2015-12-07 Thread Richard Marscher
Hi, I've been running benchmarks against Spark in local mode in a long running process. I'm seeing threads leaking each time it runs a job. It doesn't matter if I recycle SparkContext constantly or have 1 context stay alive for the entire application lifetime. I see a huge accumulation ongoing of

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Okay maybe these errors are more helpful - WARN server.TransportChannelHandler: Exception in connection from ip-10-0-0-138.ec2.internal/10.0.0.138:39723 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(

Re: Managed to make Hive run on Spark engine

2015-12-07 Thread Ashok Kumar
This is great news sir. It shows perseverance pays at last. Can you inform us when the write-up is ready so I can set it up as well please. I know a bit about the advantages of having Hive using Spark engine. However, the general question I have is when one should use Hive on spark as opposed to

issue creating pyspark Transformer UDF that creates a LabeledPoint: AttributeError: 'DataFrame' object has no attribute '_get_object_id'

2015-12-07 Thread Andy Davidson
Hi I am running into a strange error. I am trying to write a transformer that takes in two columns and creates a LabeledPoint. I can not figure out why I am getting AttributeError: 'DataFrame' object has no attribute '_get_object_id'. I am using spark-1.5.1-bin-hadoop2.6 Any idea what I am doin

Re: SparkSQL AVRO

2015-12-07 Thread Deenar Toraskar
By default Spark will create one file per partition. Spark SQL defaults to using 200 partitions. If you want to reduce the number of files written out, repartition your dataframe using repartition and give it the desired number of partitions. originalDF.repartition(10).write.avro("masterNew.avro")
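
A minimal sketch of that suggestion, assuming the spark-avro package (com.databricks:spark-avro) is on the classpath and originalDF is the DataFrame from the post:

    // Control the number of output files by repartitioning before the write;
    // written here with the generic data source API, equivalent to
    // originalDF.repartition(10).write.avro("masterNew.avro").
    originalDF
      .repartition(10)
      .write
      .format("com.databricks.spark.avro")
      .save("masterNew.avro")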

Re: Spark SQL 1.3 not finding attribute in DF

2015-12-07 Thread Jon Gregg
I'm working with a Hadoop distribution that doesn't support 1.5 yet, we'll be able to upgrade in probably two months. For now I'm seeing the same issue with spark not recognizing an existing column name in many hive-table-to-dataframe situations: Py4JJavaError: An error occurred while calling o37

Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Ruslan Dautkhanov
That error is not directly related to spark nor hbase javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] Is this a kerberized cluster? You likely don't have a good (non-expired) kerberos

Re: spark sql current time stamp function ?

2015-12-07 Thread Ted Yu
BTW I forgot to mention that this was added through SPARK-11736 which went into the upcoming 1.6.0 release FYI On Mon, Dec 7, 2015 at 12:53 PM, Ted Yu wrote: > scala> val test=sqlContext.sql("select monotonically_increasing_id() from > t").show > +---+ > |_c0| > +---+ > | 0| > | 1| > | 2| >
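
For reference, a short sketch of both spellings discussed in this thread: the camel-case DataFrame function available before 1.6, and the SQL-callable form that SPARK-11736 registers for 1.6.0.

    import org.apache.spark.sql.functions.monotonicallyIncreasingId

    // DataFrame API (works on 1.4/1.5):
    val withId = sqlContext.table("t").withColumn("id", monotonicallyIncreasingId())

    // SQL form (registered in 1.6.0 by SPARK-11736):
    sqlContext.sql("select monotonically_increasing_id() from t").show()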

Re: SparkSQL AVRO

2015-12-07 Thread Ruslan Dautkhanov
How many reducers you had that created those avro files? Each reducer very likely creates its own avro part- file. We normally use Parquet, but it should be the same for Avro, so this might be relevant http://stackoverflow.com/questions/34026764/how-to-limit-parquet-file-dimension-for-a-parquet-ta

Re: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-12-07 Thread Ali Tajeldin EDU
Checkout the Sameer Farooqui video on youtube for spark internals (https://www.youtube.com/watch?v=7ooZ4S7Ay6Y&list=PLIxzgeMkSrQ-2Uizm4l0HjNSSy2NxgqjX) Starting at 2:15:00, he describes YARN mode. btw, highly recommend the entire video. Very detailed and concise. -- Ali On Dec 7, 2015, at 8:3

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jian Feng
The only way I can think of is through some kind of wrapper. For java/scala, use JNI. For Python, use extensions. There should not be a lot of work if you know these tools.  From: Robin East To: Annabel Melongo Cc: Jia ; Dewful ; "user @spark" ; "d...@spark.apache.org" Sent: Monday,

Example of a Trivial Custom PySpark Transformer

2015-12-07 Thread Andy Davidson
FYI Hopefully others will find this example helpful. Andy Example of a Trivial Custom PySpark Transformer ref: NLTKWordPunctTokenizer example, pyspark.sql.functions.udf

Re: spark sql current time stamp function ?

2015-12-07 Thread Ted Yu
scala> val test=sqlContext.sql("select monotonically_increasing_id() from t").show +---+ |_c0| +---+ | 0| | 1| | 2| +---+ Cheers On Mon, Dec 7, 2015 at 12:48 PM, sri hari kali charan Tummala < kali.tumm...@gmail.com> wrote: > Hi Ted, > > Gave and exception am I following right approach ? > >

Re: Spark SQL 1.3 not finding attribute in DF

2015-12-07 Thread Davies Liu
Could you reproduce this problem in 1.5 or 1.6? On Sun, Dec 6, 2015 at 12:29 AM, YaoPau wrote: > If anyone runs into the same issue, I found a workaround: > df.where('state_code = "NY"') > > works for me. > df.where(df.state_code == "NY").collect() > > fails with the error from the firs

Re: spark sql current time stamp function ?

2015-12-07 Thread sri hari kali charan Tummala
Hi Ted, Gave an exception; am I following the right approach? val test=sqlContext.sql("select *, monotonicallyIncreasingId() from kali") On Mon, Dec 7, 2015 at 4:52 PM, Ted Yu wrote: > Have you tried using monotonicallyIncreasingId ? > > Cheers > > On Mon, Dec 7, 2015 at 7:56 AM, Sri wrote: >

RE: How to create dataframe from SQL Server SQL query

2015-12-07 Thread ayan guha
One more thing: I feel a better option for maintainability would be to create a DB view and then use the view in Spark. This will avoid burying complicated SQL queries within application code. On 8 Dec 2015 05:55, "Wang, Ningjun (LNG-NPV)" wrote: > This is a very helpful article. Thanks for the help. > > > >

Re: [streaming] KafkaUtils.createDirectStream - how to start streming from checkpoints?

2015-12-07 Thread Cody Koeninger
Just to be clear, spark checkpoints have nothing to do with zookeeper, they're stored in the filesystem you specify. On Sun, Dec 6, 2015 at 1:25 AM, manasdebashiskar wrote: > When you enable check pointing your offsets get written in zookeeper. If > you > program dies or shutdowns and later rest

Re: Implementing fail-fast upon critical spark streaming tasks errors

2015-12-07 Thread Cody Koeninger
Personally, for jobs that I care about I store offsets in transactional storage rather than checkpoints, which eliminates that problem (just enforce whatever constraints you want when storing offsets). Regarding the question of communication of errors back to the streamingListener, there is an onR

Re: Dataset and lambas

2015-12-07 Thread Deenar Toraskar
Michael Having VectorUnionSumUDAF implemented would be great. This is quite generic, it does element-wise sum of arrays and maps https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/timeseries/VectorUnionSumUDAF.java and would be massive benefit for a lot of risk analytics.

Re: Dataset and lambas

2015-12-07 Thread Koert Kuipers
great thanks On Mon, Dec 7, 2015 at 3:02 PM, Michael Armbrust wrote: > These specific JIRAs don't exist yet, but watch SPARK- as we'll make > sure everything shows up there. > > On Sun, Dec 6, 2015 at 10:06 AM, Koert Kuipers wrote: > >> that's good news about plans to avoid unnecessary conv

Re: Dataset and lambas

2015-12-07 Thread Michael Armbrust
On Sat, Dec 5, 2015 at 3:27 PM, Deenar Toraskar wrote: > > On a similar note, what is involved in getting native support for some > user defined functions, so that they are as efficient as native Spark SQL > expressions? I had one particular one - an arraySum (element wise sum) that > is heavily u

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
I’m not sure what point you’re trying to prove and I’m not particularly interested in getting into a protracted discussion. Here is what you wrote: The architecture of Spark is to run on top of HDFS. I interpreted that as a statement implying that Spark has to run on HDFS which is definitely not

Re: Dataset and lambas

2015-12-07 Thread Michael Armbrust
These specific JIRAs don't exist yet, but watch SPARK- as we'll make sure everything shows up there. On Sun, Dec 6, 2015 at 10:06 AM, Koert Kuipers wrote: > that's good news about plans to avoid unnecessary conversions, and allow > access to more efficient internal types. could you point me

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Nick Pentreath
SparkNet may have some interesting ideas - https://github.com/amplab/SparkNet. Haven't had a deep look at it yet but it seems to have some functionality allowing caffe to read data from RDDs, though I'm not certain the memory is shared. — Sent from Mailbox On Mon, Dec 7, 2015 at 9:55 PM, Rob

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
Robin, To prove my point, this is an unresolved issue still in the implementation stage. On Monday, December 7, 2015 2:49 PM, Robin East wrote: Hi Annabel I certainly did read your post. My point was that Spark can read from HDFS but is in no way tied to that storage layer . A very i

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
Hi Annabel I certainly did read your post. My point was that Spark can read from HDFS but is in no way tied to that storage layer . A very interesting use case that sounds very similar to Jia's (as mentioned by another poster) is contained in https://issues.apache.org/jira/browse/SPARK-10399. T

Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Akhilesh Pathodia
Hi, I am running a spark job on yarn in cluster mode in a secured cluster. I am trying to run Spark on HBase using Phoenix, but Spark executors are unable to get an hbase connection using phoenix. I am running the kinit command to get the ticket before starting the job and also the keytab file and principal are c

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
Robin, Maybe you didn't read my post in which I stated that Spark works on top of HDFS. What Jia wants is to have Spark interacts with a C++ process to read and write data. I've never heard about Jia's use case in Spark. If you know one, please share that with me. Thanks On Monday, Decemb

Re: How to build Spark with Ganglia to enable monitoring using Ganglia

2015-12-07 Thread swetha kasireddy
OK. I think the following can be used. mvn -Pspark-ganglia-lgpl -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package On Mon, Dec 7, 2015 at 10:13 AM, SRK wrote: > Hi, > > How to do a maven build to enable monitoring using Ganglia? What is the > command for the same? > > Thanks,

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Here is the trace I get from the command line: [Stage 4:> (60 + 60) / 200]15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.0.0.138:33822 15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnS

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
Annabel Spark works very well with data stored in HDFS but is certainly not tied to it. Have a look at the wide variety of connectors to things like Cassandra, HBase, etc. Robin Sent from my iPhone > On 7 Dec 2015, at 18:50, Annabel Melongo wrote: > > Jia, > > I'm so confused on this. The

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
Jia, I'm so confused on this. The architecture of Spark is to run on top of HDFS. What you're requesting, reading and writing to a C++ process, is not part of that requirement. On Monday, December 7, 2015 1:42 PM, Jia wrote: Thanks, Annabel, but I may need to clarify that I have no

Fwd: Oozie SparkAction not able to use spark conf values

2015-12-07 Thread Rajadayalan Perumalsamy
Hi We are trying to change our existing oozie workflows to use SparkAction instead of ShellAction. We are passing spark configuration in spark-opts with --conf, but these values are not accessible in Spark and it is throwing error. Please note we are able to use SparkAction successfully in yarn-c

RE: How to create dataframe from SQL Server SQL query

2015-12-07 Thread Wang, Ningjun (LNG-NPV)
This is a very helpful article. Thanks for the help. Ningjun From: Sujit Pal [mailto:sujitatgt...@gmail.com] Sent: Monday, December 07, 2015 12:42 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: How to create dataframe from SQL Server SQL query Hi Ningjun, Haven't done thi

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Thanks, Annabel, but I may need to clarify that I have no intention to write and run Spark UDF in C++, I'm just wondering whether Spark can read and write data to a C++ process with zero copy. Best Regards, Jia On Dec 7, 2015, at 12:26 PM, Annabel Melongo wrote: > My guess is that Jia want

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Hi, Kazuaki, It’s very similar to my requirement, thanks! It seems they want to write to a C++ process with zero copy, and I want to do both read/write with zero copy. Does anyone know how to obtain more information, like the current status of this JIRA entry? Best Regards, Jia On Dec 7, 2015, at

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
I have looked through the logs and do not see any WARNING or ERRORs - the executors just seem to stop logging. I am running Spark 1.5.2 on YARN. On Dec 7, 2015, at 1:20 PM, Ted Yu mailto:yuzhih...@gmail.com>> wrote: bq. complete a shuffle stage due to lost executors Have you taken a look at t

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm afraid this is not possible. Spark has support for Java, Python, Scala and R. The best way to achieve this is to run your application in C++ and used the data created by said application to do manipulation within Spark

SparkSQL AVRO

2015-12-07 Thread Test One
I'm using spark-avro with SparkSQL to process and output avro files. My data has the following schema: root |-- memberUuid: string (nullable = true) |-- communityUuid: string (nullable = true) |-- email: string (nullable = true) |-- firstName: string (nullable = true) |-- lastName: string (nu

Re: Removing duplicates from dataframe

2015-12-07 Thread Ted Yu
bq. complete a shuffle stage due to lost executors Have you taken a look at the log for the lost executor(s) ? Which release of Spark are you using ? Cheers On Mon, Dec 7, 2015 at 10:12 AM, wrote: > I have pyspark app loading a large-ish (100GB) dataframe from JSON files > and it turns out th

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Thanks, Dewful! My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storages. However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient in performance. But definitely I need to look at Tachyon m

Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
I have pyspark app loading a large-ish (100GB) dataframe from JSON files and it turns out there are a number of duplicate JSON objects in the source data. I am trying to find the best way to remove these duplicates before using the dataframe. With both df.dropDuplicates() and df.sqlContext.sql(

How to build Spark with Ganglia to enable monitoring using Ganglia

2015-12-07 Thread SRK
Hi, How to do a maven build to enable monitoring using Ganglia? What is the command for the same? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-build-Spark-with-Ganglia-to-enable-monitoring-using-Ganglia-tp25625.html Sent from the A

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Thanks, Robin, you have a very good point! We feel that the data copy and allocation overhead may become a performance bottleneck, and are evaluating it right now. We will do the shared memory stuff only if we’re sure about the potential performance gain and sure that there is no existing stuff in

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
I guess you could write a custom RDD that can read data from a memory-mapped file - not really my area of expertise so I’ll leave it to other members of the forum to chip in with comments as to whether that makes sense. But if you want ‘fancy analytics’ then won’t the processing time more than

Re: How to create dataframe from SQL Server SQL query

2015-12-07 Thread Sujit Pal
Hi Ningjun, Haven't done this myself, saw your question and was curious about the answer and found this article which you might find useful: http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/ According to this article, you can pass in your SQL statement in
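
The trick that article describes is to pass a parenthesized query with an alias as the "dbtable" option. A sketch with placeholder connection details (the URL, driver and query below are illustrative only):

    // A derived-table query can be used wherever a table name is expected.
    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:sqlserver://dbserver;databaseName=mydb;user=u;password=p",
      "driver"  -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
      "dbtable" -> "(select id, name from schema.tablename where name like 'A%') as tmp"
    )).load()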

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Hi, Robin, Thanks for your reply and thanks for copying my question to user mailing list. Yes, we have a distributed C++ application, that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that’s why

Re: spark.authenticate=true YARN mode doesn't work

2015-12-07 Thread Marcelo Vanzin
Prasad, As I mentioned in my first reply, you need to enable spark.authenticate in the shuffle service's configuration too for this to work. It doesn't seem like you have done that. On Sun, Dec 6, 2015 at 5:09 PM, Prasad Reddy wrote: > Hi Marcelo, > > I am attaching all container logs. can you p

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
-dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list) First up let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering but it’s not clear what you are trying to achieve.

Re: spark sql current time stamp function ?

2015-12-07 Thread Ted Yu
Have you tried using monotonicallyIncreasingId ? Cheers On Mon, Dec 7, 2015 at 7:56 AM, Sri wrote: > Thanks , I found the right function current_timestamp(). > > different Question:- > Is there a row_number() function in spark SQL ? Not in Data frame just > spark SQL? > > > Thanks > Sri > > Sen

Re: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-12-07 Thread Jacek Laskowski
Hi, That's my understanding, too. Just spent an entire morning today to check it out and would be surprised to hear otherwise. Pozdrawiam, Jacek -- Jacek Laskowski | https://medium.com/@jaceklaskowski/ | http://blog.jaceklaskowski.pl Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-a

How to change StreamingContext batch duration after loading from checkpoint

2015-12-07 Thread yam
Is there a way to change the streaming context batch interval after reloading from checkpoint? I would like to be able to change the batch interval after restarting the application without losing the checkpoint, of course. Thanks! -- View this message in context: http://apache-spark-user-lis

How to create dataframe from SQL Server SQL query

2015-12-07 Thread Wang, Ningjun (LNG-NPV)
How can I create a RDD from a SQL query against SQLServer database? Here is the example of dataframe http://spark.apache.org/docs/latest/sql-programming-guide.html#overview val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:postgresql:dbserver", "dbtable" -> "schema.tab

RE: Broadcasting a parquet file using spark and python

2015-12-07 Thread Shuai Zheng
Hi Michael, Thanks for the feedback. I am using version 1.5.2 now. Can you tell me how to enforce the broadcast join? I don’t want to let the engine decide the execution path of the join. I want to use a hint or parameter to enforce a broadcast join (because I also have some cases are inner jo
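
One way to force a broadcast join on 1.5.x is the broadcast() hint from org.apache.spark.sql.functions. The question is about Python, but a Scala sketch of the idea, with placeholder DataFrames largeDF and smallDF, looks like this (offered as a possibility, not necessarily the answer given later in the thread):

    import org.apache.spark.sql.functions.broadcast

    // Mark the smaller side so the planner picks a broadcast hash join
    // regardless of spark.sql.autoBroadcastJoinThreshold.
    val joined = largeDF.join(broadcast(smallDF), largeDF("key") === smallDF("key"))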

Getting error when trying to start master node after building spark 1.3

2015-12-07 Thread Mich Talebzadeh
Thanks sorted. Actually I used version 1.3.1 and now I managed to make it work as Hive execution engine. Cheers, Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_

RE: spark sql current time stamp function ?

2015-12-07 Thread Mich Talebzadeh
Or try this cast(from_unixtime(unix_timestamp()) AS timestamp) HTH Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf Author of t

Re: spark sql current time stamp function ?

2015-12-07 Thread Sri
Thanks , I found the right function current_timestamp(). different Question:- Is there a row_number() function in spark SQL ? Not in Data frame just spark SQL? Thanks Sri Sent from my iPhone > On 7 Dec 2015, at 15:49, Ted Yu wrote: > > Does unix_timestamp() satisfy your needs ? > See sql/co

FW: Managed to make Hive run on Spark engine

2015-12-07 Thread Mich Talebzadeh
For those interested. From: Mich Talebzadeh [mailto:m...@peridale.co.uk] Sent: 06 December 2015 20:33 To: u...@hive.apache.org Subject: Managed to make Hive run on Spark engine Thanks all, especially to Xuefu, for contributions. Finally it works, which means don’t give up until it works :)

Re: spark sql current time stamp function ?

2015-12-07 Thread Ted Yu
Does unix_timestamp() satisfy your needs ? See sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala On Mon, Dec 7, 2015 at 6:54 AM, kali.tumm...@gmail.com < kali.tumm...@gmail.com> wrote: > I found a way out. > > import java.text.SimpleDateFormat > import java.util.Date; > > val
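
Pulling the suggestions in this thread together, a short sketch using the built-in functions (column and table names are placeholders; current_timestamp() and unix_timestamp() are available from 1.5):

    val withTs = sqlContext.sql(
      """select column1, column2,
                current_timestamp() as load_ts,
                cast(from_unixtime(unix_timestamp()) as timestamp) as load_ts2
           from TestTable limit 10""")
    withTs.show()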

Spark sql random number or sequence numbers ?

2015-12-07 Thread kali.tumm...@gmail.com
Hi All, I implemented random_numbers using Scala Spark; is there a function to get a row_number equivalent in Spark SQL? example:- sql server:- row_number() Netezza:- sequence number mysql:- sequence number Example:- val testsql=sqlContext.sql("select column1,column2,column3,column4,colum

Re: Where to implement synchronization is GraphX Pregel API

2015-12-07 Thread Robineast
Not sure exactly what you're asking, but: 1) if you are asking do you need to implement synchronisation code - no, that is built into the call to Pregel 2) if you are asking how synchronisation is implemented in GraphX - the superstep starts and ends with the beginning and end of a while loop in the P

Re: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-12-07 Thread Nisrina Luthfiyati
Hi Jacek, thank you for your answer. I looked at TaskSchedulerImpl and TaskSetManager and it does look like tasks are directly sent to executors. Also would love to be corrected if mistaken, as I have little knowledge about Spark internals and am very new at Scala. On Tue, Dec 1, 2015 at 1:16 AM, Ja

Re: spark sql current time stamp function ?

2015-12-07 Thread kali.tumm...@gmail.com
I found a way out. import java.text.SimpleDateFormat import java.util.Date; val format = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss") val testsql=sqlContext.sql("select column1,column2,column3,column4,column5 ,'%s' as TIME_STAMP from TestTable limit 10".format(format.format(new Date( Thanks

Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-07 Thread Akhil Das
What's in your SparkIsAwesome class? Just make sure that you are giving enough partitions to Spark to evenly distribute the job throughout the cluster. Try submitting the job this way: ~/spark/bin/spark-submit --executor-cores 10 --executor-memory 5G --driver-memory 5G --class com.example.SparkIsAwe

spark sql current time stamp function ?

2015-12-07 Thread kali.tumm...@gmail.com
Hi All, Is there a spark sql function which returns current time stamp Example:- In Impala:- select NOW(); SQL Server:- select GETDATE(); Netezza:- select NOW(); Thanks Sri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-current-time-stamp-fu

Re: How to access a RDD (that has been broadcasted) inside the filter method of another RDD?

2015-12-07 Thread Sean Owen
You can't broadcast an RDD to begin with, and can't use RDDs inside RDDs. They are really driver-side concepts. Yes that's how you'd use a broadcast of anything else though, though you need to reference ".value" on the broadcast. The 'if' is redundant in that example, and if it's a map- or collect

Re: Spark applications metrics

2015-12-07 Thread Akhil Das
Usually your application is composed of jobs and jobs are composed of tasks; at the task level you can see how much read/write happened from the stages tab of your driver UI. Thanks Best Regards On Fri, Dec 4, 2015 at 6:20 PM, patcharee wrote: > Hi > > How can I see the summary of data read

Re: How to access a RDD (that has been broadcasted) inside the filter method of another RDD?

2015-12-07 Thread Akhil Das
Something like this? val broadcasted = sc.broadcast(...) RDD2.filter(value => { //simply use *broadcasted* if(broadcasted.contains(value)) true }) Thanks Best Regards On Fri, Dec 4, 2015 at 10:43 PM, Abhishek Shivkumar < abhishek.shivku...@bigdatapartnership.com> wrote: > Hi, > > I have R
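
Combining the snippet above with Sean Owen's point earlier in this thread about referencing .value, a minimal working sketch (assuming the lookup data is small enough to collect to the driver as a Set; rdd1 and rdd2 are placeholders):

    // Broadcast a plain collection (not an RDD) and reference it via .value in the closure.
    val lookup: Set[String] = rdd1.collect().toSet
    val broadcasted = sc.broadcast(lookup)

    val filtered = rdd2.filter(value => broadcasted.value.contains(value))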

Re: parquet file doubts

2015-12-07 Thread Shushant Arora
How to read it using parquet tools? When I ran hadoop parquet.tools.Main meta parquetfilename I didn't get any info on min and max values. How can I see the parquet version of my file? Is min/max specific to some parquet version or available since the beginning? On Mon, Dec 7, 2015 at 6:51 PM, Singh, A

Re: Getting error when trying to start master node after building spark 1.3

2015-12-07 Thread Akhil Das
Did you read http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support Thanks Best Regards On Fri, Dec 4, 2015 at 4:12 PM, Mich Talebzadeh wrote: > Hi, > > > > > > I am trying to make Hive work with Spark. > > > > I have been told that I need to use Spark 1.3

Re: Predictive Modeling

2015-12-07 Thread Akhil Das
You can write a simple python script to process the 1.5GB dataset, use the pandas library for building your predictive model. Thanks Best Regards On Fri, Dec 4, 2015 at 3:02 PM, Chintan Bhatt < chintanbhatt...@charusat.ac.in> wrote: > Hi, > I'm very much interested to make a predictive model usi

Re: Not able to receive data in spark from rsyslog

2015-12-07 Thread Akhil Das
Just make sure you are binding on the correct interface. - java.net.ConnectException: Connection refused means Spark was not able to connect to that host/port. You can validate it by telnetting to that host/port. Thanks Best Regards On Fri, Dec 4, 2015 at 1:00 PM, masoom alam wrote: > I a

Re: Spark Streaming Shuffle to Disk

2015-12-07 Thread Akhil Das
UpdateStateByKey and your batch data could be filling up your executor memory and hence it might be hitting the disk, you can verify it by looking at the memory footprint while your job is running. Looking at the executor logs will also give you a better understanding of whats going on. Thanks Bes

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-07 Thread Sean Owen
I'm not sure if this is available in Python but from 1.3 on you should be able to call ALS.setFinalRDDStorageLevel with level "none" to ask it to unpersist when it is done. On Mon, Dec 7, 2015 at 1:42 PM, Ewan Higgs wrote: > Jonathan, > Did you ever get to the bottom of this? I have some users wo
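
A sketch of the call Sean describes, against the MLlib ALS API (rank and iteration counts are placeholders, and ratings is an RDD[Rating] prepared elsewhere):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.storage.StorageLevel

    // Ask ALS not to persist its final user/product factor RDDs.
    val model = new ALS()
      .setRank(10)
      .setIterations(10)
      .setFinalRDDStorageLevel(StorageLevel.NONE)
      .run(ratings)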

Obtaining metrics of an individual Spark job

2015-12-07 Thread diplomatic Guru
Hello team, I need to present the Spark job performance to my management. I could get the execution time by measuring the starting and finishing time of the job (includes overhead). However, I'm not sure how to get the other metrics, e.g. cpu, i/o, memory etc. I want to measure the individual job, n

[SPARK] Obtaining matrices of an individual Spark job

2015-12-07 Thread diplomatic Guru
Hello team, I need to present the Spark job performance to my management. I could get the execution time by measuring the starting and finishing time of the job (includes overhead). However, I'm not sure how to get the other metrics, e.g. cpu, i/o, memory etc. I want to measure the individual job, n

Re: AWS CLI --jars comma problem

2015-12-07 Thread Akhil Das
Not a direct answer but you can create a big fat jar combining all the classes in the three jars and pass it. Thanks Best Regards On Thu, Dec 3, 2015 at 10:21 PM, Yusuf Can Gürkan wrote: > Hello > > I have a question about AWS CLI for people who use it. > > I create a spark cluster with aws cli

Re: How the cores are used in Directstream approach

2015-12-07 Thread Akhil Das
You will have to do a repartition after creating the dstream to utilize all cores. directStream keeps exactly the same partitions as in kafka for spark. Thanks Best Regards On Thu, Dec 3, 2015 at 9:42 AM, Charan Ganga Phani Adabala < char...@eiqnetworks.com> wrote: > Hi, > > We have* 1 kafka top
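
A sketch of that suggestion with the direct Kafka API (the broker list, topic name and target parallelism are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))   // one Spark partition per Kafka partition

    // Spread processing over more cores than the topic has partitions.
    val repartitioned = stream.repartition(12)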

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-07 Thread Ewan Higgs
Jonathan, Did you ever get to the bottom of this? I have some users working with Spark in a classroom setting and our example notebooks run into problems where there is so much spilled to disk that they run out of quota. A 1.5G input set becomes >30G of spilled data on disk. I looked into how I

RE: parquet file doubts

2015-12-07 Thread Singh, Abhijeet
Yes, Parquet has min/max. From: Cheng Lian [mailto:l...@databricks.com] Sent: Monday, December 07, 2015 11:21 AM To: Ted Yu Cc: Shushant Arora; user@spark.apache.org Subject: Re: parquet file doubts Oh sorry... At first I meant to cc spark-user list since Shushant and I had been discussed some S

RE: Spark and Kafka Integration

2015-12-07 Thread Singh, Abhijeet
For Q2: the order of the logs within each partition is guaranteed, but there cannot be any such thing as a global order. From: Prashant Bhardwaj [mailto:prashant2006s...@gmail.com] Sent: Monday, December 07, 2015 5:46 PM To: user@spark.apache.org Subject: Spark and Kafka Integration Hi Some Background

Re: how create hbase connect?

2015-12-07 Thread censj
OK! I'll try it. > On 7 Dec 2015, at 20:11, ayan guha wrote: > > Kindly take a look https://github.com/nerdammer/spark-hbase-connector > > > On Mon, Dec 7, 2015 at 10:56 PM, censj > wrote: > hi all, > I want to update

Spark and Kafka Integration

2015-12-07 Thread Prashant Bhardwaj
Hi Some Background: We have a Kafka cluster with ~45 topics. Some of the topics contain logs in Json format and some in PSV (pipe-separated value) format. Now I want to consume these logs using Spark Streaming and store them in Parquet format in HDFS. Now my question is: 1. Can we create a InputDStre
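
One common pattern for the JSON topics, sketched here with placeholders (messages is assumed to be a DStream[(String, String)] from KafkaUtils.createDirectStream, and the HDFS path is illustrative):

    // Convert each batch of JSON strings to a DataFrame and append it as Parquet.
    messages.map(_._2).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val df = sqlContext.read.json(rdd)
        df.write.mode("append").parquet("hdfs:///data/logs/json_topic")
      }
    }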

Re: how create hbase connect?

2015-12-07 Thread ayan guha
Kindly take a look https://github.com/nerdammer/spark-hbase-connector On Mon, Dec 7, 2015 at 10:56 PM, censj wrote: > hi all, > I want to update row on base. how to create connecting base on Rdd? > > - > To unsubscribe, e-

how create hbase connect?

2015-12-07 Thread censj
Hi all, I want to update a row on HBase. How do I create an HBase connection from an RDD?

Re: Scala 2.11 and Akka 2.4.0

2015-12-07 Thread RodrigoB
Hi Manas, Thanks for the reply. I've done that. The problem lies with the Spark + Akka 2.4.0 build. It seems the Maven shade plugin is altering some class files and breaking the Akka runtime. It seems the Spark build on Scala 2.11 using SBT is broken. I'm getting build errors using sbt due to the issues f
