Re: Spark to HBase Fast Bulk Upload

2016-09-19 Thread Kabeer Ahmed
Hi, Without using Spark there are a couple of options. You can refer to the link: http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/. The gist is that you convert the data into HFiles and use the bulk upload option to get the data quickly into HBase. HTH Kabeer. On
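For completeness, a minimal sketch of driving the same HFile route from Spark itself (not from this thread; the table name, column family and staging path are illustrative assumptions):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
val conn      = ConnectionFactory.createConnection(hbaseConf)
val tableName = TableName.valueOf("my_table")                      // assumed table

// Stand-in for the real 1TB dataset of (rowKey, value) pairs.
val records = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))

// 1. Sort by row key and write HFiles.
//    (A full job would also call HFileOutputFormat2.configureIncrementalLoad
//    so the HFiles are partitioned per region.)
val hfiles = records
  .map { case (rowKey, value) =>
    val kv = new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes("cf"),
                          Bytes.toBytes("col"), Bytes.toBytes(value))
    (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), kv)
  }
  .sortByKey()

hfiles.saveAsNewAPIHadoopFile("/tmp/hfiles", classOf[ImmutableBytesWritable],
  classOf[KeyValue], classOf[HFileOutputFormat2], hbaseConf)

// 2. Hand the HFiles to HBase in one bulk step.
new LoadIncrementalHFiles(hbaseConf).doBulkLoad(new Path("/tmp/hfiles"),
  conn.getAdmin, conn.getTable(tableName), conn.getRegionLocator(tableName))
```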

Re: Finding unique across all columns in dataset

2016-09-19 Thread ayan guha
Hi, if you want column-wise distinct, you may need to define it. Would it be possible to demonstrate your problem with an example, i.e. what the input and output look like, maybe with a few columns? On 19 Sep 2016 20:36, "Abhishek Anand" wrote: > Hi Ayan, > > How will I get column

cassandra.yaml configuration for cassandra spark connection

2016-09-19 Thread muhammet pakyürek
How do I configure the cassandra.yaml configuration file for the DataStax Cassandra Spark connection?

spark streaming slow checkpointing when calling Rserve

2016-09-19 Thread Piubelli, Manuel
Hello, I wrote a spark streaming application in Java. It reads stock trades off of a data feed receiver and converts them to Tick objects, and uses a microbatch interval, window interval and sliding interval of 10 seconds. A JavaPairDStream is created where the key is the

Spark to HBase Fast Bulk Upload

2016-09-19 Thread Punit Naik
Hi Guys I have a huge dataset (~ 1TB) which has about a billion records. I have to transfer it to an HBase table. What is the fastest way of doing it? -- Thank You Regards Punit Naik

Re: Finding unique across all columns in dataset

2016-09-19 Thread Abhishek Anand
Hi Ayan, How will I get column wise distinct items using this approach ? On Mon, Sep 19, 2016 at 3:31 PM, ayan guha wrote: > Create an array out of cilumns, convert to Dataframe, > explode,distinct,write. > On 19 Sep 2016 19:11, "Saurav Sinha"

Re: filling missing values in a sequence

2016-09-19 Thread ayan guha
Let me give you a possible direction, please do not use it as is :) >>> r = sc.parallelize([1,3,4,6,8,11,12,5],3) Here I am loading some numbers and partitioning. This partitioning is critical. You may just use the partitioning scheme that comes with Spark (like above) or use your own through
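For the gap-finding itself, a rough sketch of the simpler range-subtract variant (not ayan's partitioning-based approach; the ids are illustrative):

```scala
// Find the ids that never appear between the observed min and max.
val ids  = sc.parallelize(Seq(1L, 3L, 4L, 6L, 8L, 11L, 12L, 5L))
val full = sc.range(ids.min(), ids.max() + 1)   // the complete sequence 1..12
val missing = full.subtract(ids)                // ids that never appear: 2, 7, 9, 10
missing.collect().sorted.foreach(println)
```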

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-19 Thread Sean Owen
Yes, relevance is always 1. The label is not a relevance score so don't think it's valid to use it as such. On Mon, Sep 19, 2016 at 4:42 AM, Jong Wook Kim wrote: > Hi, > > I'm trying to evaluate a recommendation model, and found that Spark and > Rival give different results,

cassandra 3.7 is compatible with datastax Spark Cassandra Connector 2.0?

2016-09-19 Thread muhammet pakyürek

Re: filling missing values in a sequence

2016-09-19 Thread Sudhindra Magadi
Each of the records will have a sequence id. No duplicates. On Mon, Sep 19, 2016 at 11:42 AM, ayan guha wrote: > And how do you define missing sequence? Can you give an example? > > On Mon, Sep 19, 2016 at 3:48 PM, Sudhindra Magadi > wrote: > >> Hi

true conf for sparkconf().set().setMaster() to connect to cassandra

2016-09-19 Thread muhammet pakyürek

1TB shuffle failed with executor lost failure

2016-09-19 Thread Cyanny LIANG
My job is a 1TB join + 10 GB table on Spark 1.6.1 run in yarn mode: 1. If I open the shuffle service, the error is: Job aborted due to stage failure: ShuffleMapStage 2 (writeToDirectory at NativeMethodAccessorImpl.java:-2) has failed the maximum allowable number of times: 4. Most recent failure

Re: filling missing values in a sequence

2016-09-19 Thread ayan guha
And how do you define a missing sequence? Can you give an example? On Mon, Sep 19, 2016 at 3:48 PM, Sudhindra Magadi wrote: > Hi Jorn, > We have a file with a billion records. We want to find if there are any missing > sequences here. If so, what are they? > Thanks > Sudhindra > >

cassandra cannot be accessed via pyspark or spark-shell but it is accessible using cqlsh. what is the problem?

2016-09-19 Thread muhammet pakyürek
I have tried all the possible examples on the internet to access a Cassandra table via pyspark or the spark shell. However, all of the trials resulted in failures related to the Java gateway. What is the main problem?

Re: filling missing values in a sequence

2016-09-19 Thread ayan guha
Ok, so if you see 1,3,4,6. Will you say 2,5 are missing? On Mon, Sep 19, 2016 at 4:15 PM, Sudhindra Magadi wrote: > Each of the records will be having a sequence id .No duplicates > > On Mon, Sep 19, 2016 at 11:42 AM, ayan guha wrote: > >> And how

Re: filling missing values in a sequence

2016-09-19 Thread Sudhindra Magadi
that is correct On Mon, Sep 19, 2016 at 12:09 PM, ayan guha wrote: > Ok, so if you see > > 1,3,4,6. > > Will you say 2,5 are missing? > > On Mon, Sep 19, 2016 at 4:15 PM, Sudhindra Magadi > wrote: > >> Each of the records will be having a sequence id

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-19 Thread Nick Pentreath
The PR already exists for adding RankingEvaluator to ML - https://github.com/apache/spark/pull/12461. I need to revive and review it. DB, your review would be welcome too (and also on https://github.com/apache/spark/issues/12574 which has implications for the semantics of ranking metrics in the

Fwd: Write.df is failing on NFS and S3 based spark cluster

2016-09-19 Thread Sankar Mittapally
Hi, We have set up a Spark cluster on NFS shared storage; there are no permission issues with the NFS storage and all the users are able to write to it. When I fire the write.df command in SparkR, I get the error below. Can someone please help me to fix this issue? 16/09/17 08:03:28 ERROR

Get profile from sbt

2016-09-19 Thread Saurabh Malviya (samalviy)
Hi, Is there anything in sbt equivalent to profiles in Maven? I want the Spark build to pick up endpoints based on the environment the jar is built for. In build.sbt we are ingesting variables (dev, stage, etc.) and picking up all dependencies. Similarly, I need a way to pick up config for external dependencies
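One hedged way to approximate Maven profiles in sbt is to branch on a build-time property; a minimal sketch (the artifact names are hypothetical):

```scala
// build.sbt — select dependencies/config per environment, e.g. `sbt -Denv=stage package`
val env = sys.props.getOrElse("env", "dev")

libraryDependencies ++= (env match {
  case "stage" => Seq("com.example" % "endpoints-stage" % "1.0")   // hypothetical artifact
  case _       => Seq("com.example" % "endpoints-dev"   % "1.0")   // hypothetical artifact
})
```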

LDA and Maximum Iterations

2016-09-19 Thread Frank Zhang
Hi all, I have a question about parameter setting for the LDA model. When I tried to set a large number like 500 for setMaxIterations, the program always fails. There is a very straightforward LDA tutorial using an example data set in the mllib

How to know WHO are the slaves for an application

2016-09-19 Thread Xiaoye Sun
Hi all, I am currently making some changes in Spark in my research project. In my development, after an application has been submitted to the spark master, the master needs to get the IP addresses of all the slaves used by that application, so that the spark master is able to talk to the slave

Configuring Kinesis max records limit in KinesisReceiver

2016-09-19 Thread Aravindh
I use `KinesisUtil.createStream` to create a DStream from a kinesis stream. By default my spark receiver receives 1 events from the stream. I see that the default value for KCL comes from KCL Configuration

RE: as.Date can't be applied to Spark data frame in SparkR

2016-09-19 Thread xingye
Update: the job can finish, but takes a long time on 10M rows of data. Is there a better solution? From: xing_ma...@hotmail.com To: user@spark.apache.org Subject: as.Date can't be applied to Spark data frame in SparkR Date: Tue, 20 Sep 2016 10:22:17 +0800 Hi, all I've noticed that as.Date can't

as.Date can't be applied to Spark data frame in SparkR

2016-09-19 Thread xingye
Hi all, I've noticed that as.Date can't be applied to a Spark data frame. I've created the following UDF and used dapply to change an integer column "aa" to a date with origin 1960-01-01. change_date<-function(df){ df<-as.POSIXlt(as.Date(df$aa, origin = "1960-01-01", tz = "UTC")) }

it does not stop at the breakpoint line within an anonymous function concerning RDD

2016-09-19 Thread chen yong
Hello all, I am new to Spark. I have been using IDEA ver 14.0.3 to debug Spark recently. It is strange to me that any breakpoint set within an anonymous function concerning an RDD, such as breakpoint-1 in the below code snippet, is invalid (a red X appears on the left of the line, mouse hovering message showing

SPARK-10835 in 2.0

2016-09-19 Thread janardhan shetty
Hi, I am hitting this issue. https://issues.apache.org/jira/browse/SPARK-10835. Issue seems to be resolved but resurfacing in 2.0 ML. Any workaround is appreciated ? Note: Pipeline has Ngram before word2Vec. Error: val word2Vec = new

Re: Can I assign affinity for spark executor processes?

2016-09-19 Thread Xiaoye Sun
Hi Jakob, Yes. you are right. I should use taskset when I start the *.sh scripts. For more detail, I change the last line in ./sbin/start-slaves.sh on master to this "${SPARK_HOME}/sbin/slaves.sh" cd "${SPARK_HOME}" \; *"taskset" "0xffe"* "${SPARK_HOME}/sbin/start-slave.sh"

Is there any bug in the configuration of Spark 2.0, Cassandra Spark Connector 2.0 and Cassandra 3.0.8?

2016-09-19 Thread muhammet pakyürek
Please tell me the configuration, including the most recent versions of Cassandra, Spark and the Cassandra Spark Connector.

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-19 Thread Jong Wook Kim
Thanks for the clarification and the relevant links. I overlooked the comments explicitly saying that the relevance is binary. I understand that the label is not a relevance, but I have been, and I think many people are using the label as relevance in the implicit feedback context where the

write.df is failing on Spark Cluster

2016-09-19 Thread sankarmittapally
We have set up a Spark cluster on NFS shared storage; there are no permission issues with the NFS storage and all the users are able to write to it. When I fire the write.df command in SparkR, I get the error below. Can someone please help me to fix this issue? 16/09/17 08:03:28 ERROR

off heap to alluxio/tachyon in Spark 2

2016-09-19 Thread aka.fe2s
Hi folks, What has happened with Tachyon / Alluxio in Spark 2? The docs no longer mention it. -- Oleksiy Dyagilev

Re: Can not control bucket files number if it was speficed

2016-09-19 Thread Qiang Li
I tried the DataFrame writer with the coalesce or repartition API, but it cannot meet my requirements; I still get far more files than the bucket number, and the Spark jobs are very slow after I add coalesce or repartition. I've gone back to Hive and use Hive to do the data conversion. Thanks. On Sat, Sep 17, 2016

Re: Finding unique across all columns in dataset

2016-09-19 Thread Mich Talebzadeh
something like this df.filter('transactiontype > " ").filter(not('transactiontype ==="DEB") && not('transactiontype ==="BGC")).select('transactiontype).distinct.collect.foreach(println) HTH Dr Mich Talebzadeh LinkedIn *

driver OOM - need recommended memory for driver

2016-09-19 Thread Anand Viswanathan
Hi, Spark version: spark-1.5.2-bin-hadoop2.6, using pyspark. I am running a machine learning program, which runs perfectly when specifying 2G for --driver-memory. However the program cannot be run with the default 1G; the driver crashes with an OOM error. What is the recommended configuration for

NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Mohamed ismail
Hi all I am trying to read: sc.textFile(DataFile).mapPartitions(lines => { val parser = new CSVParser(",") lines.map(line=>parseLineToTuple(line, parser)) }) Data looks like: android

Re: Total Shuffle Read and Write Size of Spark workload

2016-09-19 Thread Mike Metzger
While the SparkListener method is likely all around better, if you just need this quickly you should be able to do a SSH local port redirection over putty. In the putty configuration: - Go to Connection: SSH: Tunnels - In the Source port field, enter 4040 (or another unused port on your machine)

Re: Can not control bucket files number if it was speficed

2016-09-19 Thread Fridtjof Sander
I didn't follow all of this thread, but if you want to have exactly one bucket-output-file per RDD-partition, you have to repartition (shuffle) your data on the bucket-key. If you don't repartition (shuffle), you may have records with different bucket-keys in the same RDD-partition, leading to
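A rough sketch of that idea (the bucket column, bucket count and table name are illustrative assumptions, not from the thread):

```scala
import org.apache.spark.sql.functions.col

val numBuckets = 32
// df is the DataFrame being bucketed (assumed to exist)
df.repartition(numBuckets, col("bucket_key"))   // shuffle so one bucket's rows land together
  .write
  .bucketBy(numBuckets, "bucket_key")
  .saveAsTable("bucketed_table")                // bucketBy requires saveAsTable in Spark 2.0
```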

Re: Missing output partition file in S3

2016-09-19 Thread Chen, Kevin
Hi Steve, Our S3 is on US east. But this issue also occurred when we used an S3 bucket on US west. We are using S3n. We use the Spark standalone deployment. We run the job in EC2. The datasets are about 25GB. We did not have speculative execution turned on. We did not use DirectCommiter. Thanks,

Re: take() works on RDD but .write.json() does not work in 2.0.0

2016-09-19 Thread Kevin Burton
I tried with write.json and write.csv. The write.text method won't work because I have more than one column and refuses to execute. Doesn't seem to work on any data. On Sat, Sep 17, 2016 at 10:52 PM, Hyukjin Kwon wrote: > Hi Kevin, > > I have few questions on this. > >

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-19 Thread Roshani Nagmote
Hello Sean, Can you please tell me how to set the checkpoint interval? I did set checkpointDir("hdfs:/"), but how do I reduce the default value of the checkpoint interval, which is 10? Sorry if it's a very basic question; I am a novice in Spark. Thanks, Roshani On Fri, Sep 16,

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-19 Thread Nick Pentreath
Try als.setCheckpointInterval ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS@setCheckpointInterval(checkpointInterval:Int):ALS.this.type ) On Mon, 19 Sep 2016 at 20:01 Roshani Nagmote wrote: > Hello Sean, > > Can
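A small sketch of that call (the directory, interval and sample ratings are illustrative):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")   // assumed path

val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(2, 10, 3.0)))  // stand-in data

val als = new ALS()
  .setRank(10)
  .setIterations(20)
  .setCheckpointInterval(5)    // default is 10; lower it to truncate lineage more often
val model = als.run(ratings)
```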

Spark DataFrame Join _ performance issues

2016-09-19 Thread Subhajit Purkayastha
I am running my spark (1.5.2) instance in a virtualbox VM. I have 10gb memory allocated to it. I have a fact table extract, with 1 rows var glbalance_df_select = glbalance_df.select ("LEDGER_ID","CODE_COMBINATION_ID","CURRENCY_CODE", "PERIOD_TYPE","TEMPLATE_ID",

Re: Java Compatibity Problems when we install rJava

2016-09-19 Thread Sean Owen
This isn't a Spark question, so I don't think this is the right place. It shows that compilation of rJava failed for lack of some other shared libraries (not Java-related). I think you'd have to get those packages installed locally too. If it ends up being Anaconda specific, you should try

RE: Java Compatibity Problems when we install rJava

2016-09-19 Thread Arif,Mubaraka
We are running Jupyter in yarn-client mode for pyspark (Python Spark), and we wanted to know if anybody has faced such issues while installing rJava on a Jupyter notebook. We are also reaching out to Cloudera for support. thanks, Muby From: Sean Owen

Re: Kinesis Receiver not respecting spark.streaming.receiver.maxRate

2016-09-19 Thread tosaigan...@gmail.com
Hi Aravindh, spark.streaming.receiver.maxRate is per receiver. You should multiply the max rate by the number of receivers to get the total. Regards, Sai On Mon, Sep 19, 2016 at 9:31 AM, Aravindh [via Apache Spark User List] < ml-node+s1001560n27754...@n3.nabble.com> wrote: > I am trying to throttle my spark
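A tiny sketch of the setting (values are illustrative; the effective total is maxRate × number of receivers):

```scala
import org.apache.spark.SparkConf

// Cap each receiver at 1000 records/sec; with N receivers the total ingest
// is capped at roughly N * 1000 records/sec.
val conf = new SparkConf()
  .setAppName("kinesis-throttled")
  .set("spark.streaming.receiver.maxRate", "1000")
```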

Re: Spark Job not failing

2016-09-19 Thread Mich Talebzadeh
As I understand it, you are inserting into an RDBMS from Spark and the insert is failing on the RDBMS due to a duplicate primary key, but this is not acknowledged by Spark? Is this correct? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Spark Job not failing

2016-09-19 Thread sai ganesh
yes. Regards, Sai On Mon, Sep 19, 2016 at 12:29 PM, Mich Talebzadeh wrote: > As I understanding you are inserting into RDBMS from Spark and the insert > is failing on RDBMS due to duplicate primary key but not acknowledged by > Spark? Is this correct > > HTH > > >

Re: off heap to alluxio/tachyon in Spark 2

2016-09-19 Thread Richard Catlin
Here is my understanding. Spark used Tachyon as an off-heap solution for RDDs. In certain situations, it would alleviate garbage collection of the RDDs. Tungsten, Spark 2’s off-heap (columnar) format, is much more efficient and used as the default. Alluxio no longer makes sense for this use.

Fwd: Missing output partition file in S3

2016-09-19 Thread Richard Catlin
> Begin forwarded message: > > From: "Chen, Kevin" > Subject: Re: Missing output partition file in S3 > Date: September 19, 2016 at 10:54:44 AM PDT > To: Steve Loughran > Cc: "user@spark.apache.org" > > Hi Steve, > >

Spark Job not failing

2016-09-19 Thread tosaigan...@gmail.com
Hi, I have a primary key on a SQL table and I am trying to insert a DataFrame into the table using insertIntoJDBC. I can see failure instances in the logs, but the Spark job still completes successfully. Do you know how we can handle this in code to make it fail? 16/09/19 18:52:51 INFO TaskSetManager: Starting task

Java Compatibity Problems when we install rJava

2016-09-19 Thread Arif,Mubaraka
We are trying to install rJava on SUSE Linux running Cloudera Hadoop CDH 5.7.2 with Spark 1.6. Anaconda 4.0 was installed using the CDH parcel. We have set up the Jupyter notebook, but there are Java compatibility problems. For Java we are running: java version "1.8.0_51" Java(TM) SE

Re: off heap to alluxio/tachyon in Spark 2

2016-09-19 Thread Sean Owen
It backed the "OFF_HEAP" storage level for RDDs. That's not quite the same thing that off-heap Tungsten allocation refers to. It's also worth pointing out that things like HDFS also can put data into memory already. On Mon, Sep 19, 2016 at 7:48 PM, Richard Catlin
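For context, the RDD storage level Sean mentions is set like this (a minimal illustration; in Spark 1.x this level was the Tachyon-backed one, which is distinct from Tungsten's off-heap memory):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)   // stand-in data
rdd.persist(StorageLevel.OFF_HEAP)    // the RDD-level OFF_HEAP storage level
```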

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-19 Thread Roshani Nagmote
Thanks Nick. It's working. On Mon, Sep 19, 2016 at 11:11 AM, Nick Pentreath wrote: > Try als.setCheckpointInterval (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS@ >

Re: Spark_JDBC_Partitions

2016-09-19 Thread Ajay Chander
Thank you all for your valuable inputs. Sorry for getting back late because of personal issues. Mich, answer to your earlier question, Yes it is a fact table. Thank you. Ayan, I have tried ROWNUM as split column with 100 partitions. But it was taking forever to complete the job. Thank you.

Re: driver OOM - need recommended memory for driver

2016-09-19 Thread Kevin Mellott
Hi Anand, Unfortunately, there is not really a "one size fits all" answer to this question; however, here are some things that you may want to consider when trying different sizes. - What is the size of the data you are processing? - Whenever you invoke an action that requires ALL of the

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-19 Thread janardhan shetty
Yes Sujit, I have tried that option as well. I also tried sbt assembly but am hitting the issue below: http://stackoverflow.com/questions/35197120/java-outofmemoryerror-on-sbt-assembly Just wondering if there is any clean approach to include the StanfordCoreNLP classes in Spark ML? On Mon, Sep 19, 2016 at

Re: Spark Job not failing

2016-09-19 Thread Mich Talebzadeh
I am not sure a commit or rollback by the RDBMS is acknowledged by Spark, hence it does not know what is going on. From my recollection this is an issue. Another alternative is to save it as a CSV file and load it into the RDBMS using a form of bulk copy. HTH Dr Mich Talebzadeh LinkedIn *

Similar Items

2016-09-19 Thread Kevin Mellott
Hi all, I'm trying to write a Spark application that will detect similar items (in this case products) based on their descriptions. I've got an ML pipeline that transforms the product data to TF-IDF representation, using the following components. - *RegexTokenizer* - strips out non-word
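A rough sketch of a pipeline along those lines (only RegexTokenizer is named in the preview above; the other stages, column names and sample data are assumptions):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, RegexTokenizer, StopWordsRemover}

// Example input (in spark-shell, where spark.implicits._ is in scope).
val products = Seq((1, "red cotton t-shirt"), (2, "blue cotton crew shirt")).toDF("id", "description")

val tokenizer = new RegexTokenizer()
  .setInputCol("description").setOutputCol("words").setPattern("\\W")   // split on non-word chars
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val tf  = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, tf, idf))
val tfidf = pipeline.fit(products).transform(products)   // TF-IDF vectors per product
```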

Re: very high maxresults setting (no collect())

2016-09-19 Thread Michael Gummelt
When you say "started seeing", do you mean after a Spark version upgrade? After running a new job? On Mon, Sep 19, 2016 at 2:05 PM, Adrian Bridgett wrote: > Hi, > > We've recently started seeing a huge increase in > spark.driver.maxResultSize - we are starting to set it

best versions for cassandra spark connection

2016-09-19 Thread muhammet pakyürek
Hi, in order to connect pyspark to Cassandra, which versions of the components must be installed for the connection? I think Cassandra 3.7 is not compatible with Spark 2.0 and the DataStax pyspark-cassandra connector 2.0. Please give me the correct versions and the steps to connect them
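For reference, a hedged sketch of the usual connector setup from the Scala shell (the package coordinates, host, keyspace and table are assumptions, not a verified compatibility answer):

```scala
// Launch the shell with the connector and point it at the Cassandra host, e.g.
//   spark-shell --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 \
//               --conf spark.cassandra.connection.host=127.0.0.1
import com.datastax.spark.connector._

val rows = sc.cassandraTable("my_keyspace", "my_table")   // illustrative keyspace/table
rows.take(5).foreach(println)
```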

Re: filling missing values in a sequence

2016-09-19 Thread Sudhindra Magadi
thanks ayan On Mon, Sep 19, 2016 at 12:25 PM, ayan guha wrote: > Let me give you a possible direction, please do not use as it is :) > > >>> r = sc.parallelize([1,3,4,6,8,11,12,5],3) > > here, I am loading some numbers and partitioning. This partitioning is > critical. You

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-19 Thread Yan Facai
Hi all, I find this really confusing. I can use Vectors.parse to create a DataFrame containing a Vector type. scala> val dataVec = Seq((0, Vectors.parse("[1,3,5]")), (1, Vectors.parse("[2,4,6]"))).toDF dataVec: org.apache.spark.sql.DataFrame = [_1: int, _2: vector] But using map to
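A hedged sketch of one way to do the conversion being asked about, using a UDF around the same mllib Vectors.parse (the sample DataFrame and column name are illustrative):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Example input (in spark-shell, where spark.implicits._ is in scope).
val df = Seq("[1,3,5]", "[2,4,6]").toDF("vecStr")

val parseVec = udf((s: String) => Vectors.parse(s))               // String -> Vector
val withVec  = df.withColumn("features", parseVec(col("vecStr"))) // adds a vector column
withVec.printSchema()
```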

Finding unique across all columns in dataset

2016-09-19 Thread Abhishek Anand
I have an rdd which contains 14 different columns. I need to find the distinct across all the columns of the rdd and write it to hdfs. How can I achieve this? Is there any distributed data structure that I can use and keep on updating as I traverse the new rows? Regards, Abhi

Re: Finding unique across all columns in dataset

2016-09-19 Thread Saurav Sinha
You can use distinct over your data frame or rdd: rdd.distinct It will give you the distinct rows. On Mon, Sep 19, 2016 at 2:35 PM, Abhishek Anand wrote: > I have an rdd which contains 14 different columns. I need to find the > distinct across all the columns of

Re: 1TB shuffle failed with executor lost failure

2016-09-19 Thread Divya Gehlot
The exit code 52 comes from org.apache.spark.util.SparkExitCode, and it is val OOM=52 - i.e. an OutOfMemoryError Refer https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/util/SparkExitCode.scala On 19 September 2016 at 14:57,

Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It does not seem to be an issue in Spark. Does "CSVParser" work fine with the data without Spark? On 20 Sep 2016 2:15 a.m., "Mohamed ismail" wrote: > Hi all > > I am trying to read: > > sc.textFile(DataFile).mapPartitions(lines => { > val parser = new

Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It does not seem to be an issue in Spark. Does "CSVParser" work fine with the data without Spark? BTW, it seems there is something wrong with your email address, so I am sending this again. On 20 Sep 2016 8:32 a.m., "Hyukjin Kwon" wrote: > It seems not an issue in Spark. Does

Re: feasibility of ignite and alluxio for interfacing MPI and Spark

2016-09-19 Thread Calvin Jia
Hi, Alluxio allows for data sharing between applications through a File System API (Native Java Alluxio client, Hadoop FileSystem, or POSIX through fuse). If your MPI applications can use any of these interfaces, you should be able to use Alluxio for data sharing out of the box. In terms of

Sending extraJavaOptions for Spark 1.6.1 on mesos 0.28.2 in cluster mode

2016-09-19 Thread sagarcasual .
Hello, I have my Spark application running in cluster mode in CDH with extraJavaOptions. However when I am attempting a same application to run with apache mesos, it does not recognize the properties below at all and code returns null that reads them. --conf

Re: Kinesis Receiver not respecting spark.streaming.receiver.maxRate

2016-09-19 Thread Aravindh
Hi Sai, I am running in local mode and there is only one receiver. Verified that. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kinesis-Receiver-not-respecting-spark-streaming-receiver-maxRate-tp27754p27760.html Sent from the Apache Spark User List

Re: driver OOM - need recommended memory for driver

2016-09-19 Thread Anand Viswanathan
Thank you so much Mich, I am using yarn as my master. I found a statement in Spark mentioning the amount of memory depends on individual application. http://spark.apache.org/docs/1.5.2/hardware-provisioning.html#memory I

Re: study materials for operators on Dataframe

2016-09-19 Thread Kevin Mellott
I would recommend signing up for a Databricks Community Edition account. It will give you access to a 6GB cluster, with many different example programs that you can use to get started. https://databricks.com/try-databricks If you are looking for a more formal training method, I just completed

Spark.1.6.1 on Apache Mesos : Log4j2 could not find a logging implementation

2016-09-19 Thread sagarcasual .
Hello, I am trying to run Spark.1.6.1 on Apache Mesos I have log4j-core and log4j-api 2.6.2 as part of my uber jar still I am getting following error while starting my spark app. ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-19 Thread Sujit Pal
Hi Janardhan, You need the classifier "models" attribute on the second entry for stanford-corenlp to indicate that you want the models JAR, as shown below. Right now you are importing two instances of stanford-corenlp JARs. libraryDependencies ++= { val sparkVersion = "2.0.0" Seq(
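For reference, that dependency block looks roughly like this (the version number is illustrative):

```scala
libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models"   // pulls the models JAR
)
```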

very high maxresults setting (no collect())

2016-09-19 Thread Adrian Bridgett
Hi, We've recently started seeing a huge increase in spark.driver.maxResultSize - we are starting to set it at 3GB (and increase our driver memory a lot to 12GB or so). This is on v1.6.1 with Mesos scheduler. All the docs I can see is that this is to do with .collect() being called on a

Re: Total Shuffle Read and Write Size of Spark workload

2016-09-19 Thread Mich Talebzadeh
Spark UI on port 4040 by default HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com *Disclaimer:* Use

Re: Total Shuffle Read and Write Size of Spark workload

2016-09-19 Thread Cristina Rozee
Hi Mich, I do not have access to the UI as I am running jobs on a remote system and I can access it using putty only, so only the console or log files are available to me. Thanks On Mon, Sep 19, 2016 at 11:36 AM, Mich Talebzadeh wrote: > Spark UI on port 4040 by default > >

Re: Total Shuffle Read and Write Size of Spark workload

2016-09-19 Thread Cristina Rozee
Could you please explain a little bit? On Sun, Sep 18, 2016 at 10:19 PM, Jacek Laskowski wrote: > SparkListener perhaps? > > Jacek > > On 15 Sep 2016 1:41 p.m., "Cristina Rozee" > wrote: > >> Hello, >> >> I am running a spark application and I would

Re: Finding unique across all columns in dataset

2016-09-19 Thread ayan guha
Create an array out of the columns, convert to a Dataframe, explode, distinct, write. On 19 Sep 2016 19:11, "Saurav Sinha" wrote: > You can use distinct over you data frame or rdd > > rdd.distinct > > It will give you distinct across your row. > > On Mon, Sep 19, 2016 at 2:35
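A rough sketch of that suggestion (casting to string so heterogeneous columns fit in one array; the output path is an assumption):

```scala
import org.apache.spark.sql.functions.{array, col, explode}

// df is the 14-column DataFrame from the question (assumed to exist).
val allValues = df
  .select(explode(array(df.columns.map(c => col(c).cast("string")): _*)).as("value"))
  .distinct()

allValues.write.text("hdfs:///tmp/distinct-values")   // illustrative path
```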

Re: Total Shuffle Read and Write Size of Spark workload

2016-09-19 Thread Jacek Laskowski
Hi Cristina, http://blog.jaceklaskowski.pl/spark-workshop/slides/08_Monitoring_using_SparkListeners.html http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.SparkListener Let me know if you've got more questions. Pozdrawiam, Jacek Laskowski
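A hedged sketch of such a listener for totaling shuffle read/write bytes (accessor names follow the Spark 2.0 TaskMetrics API; this is illustrative, not Jacek's code):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class ShuffleTotals extends SparkListener {
  var bytesRead, bytesWritten = 0L
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      bytesRead    += m.shuffleReadMetrics.totalBytesRead
      bytesWritten += m.shuffleWriteMetrics.bytesWritten
    }
  }
}

val totals = new ShuffleTotals
sc.addSparkListener(totals)
// ... run the workload, then inspect totals.bytesRead / totals.bytesWritten
```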

Re: Total Shuffle Read and Write Size of Spark workload

2016-09-19 Thread Jacek Laskowski
On Mon, Sep 19, 2016 at 11:36 AM, Mich Talebzadeh wrote: > Spark UI on port 4040 by default That's exactly *a* SparkListener + web UI :) Jacek - To unsubscribe e-mail:

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-19 Thread Jacek Laskowski
Hi Janardhan, What's the command to build the project (sbt package or sbt assembly)? What's the command you execute to run the application? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at

Anyone used Zoomdata visual dashboard with Spark

2016-09-19 Thread Mich Talebzadeh
Hi, Zoomdata is known to be a good tool for real-time dashboards. I am trying to have a look. Has anyone any experience of it with Spark, by any chance? https://demo.zoomdata.com/zoomdata/login Thanks Dr Mich Talebzadeh LinkedIn *

Re: off heap to alluxio/tachyon in Spark 2

2016-09-19 Thread Bin Fan
Hi, If you are looking for how to run Spark on Alluxio (formerly Tachyon), here is the documentation from Alluxio doc site: http://www.alluxio.org/docs/master/en/Running-Spark-on-Alluxio.html It still works for Spark 2.x. Alluxio team also published articles on when and why running Spark (2.x)

Re: driver OOM - need recommended memory for driver

2016-09-19 Thread Anand Viswanathan
Thank you so much, Kevin. My data size is around 4GB. I am not using collect(), take() or takeSample(). At the final job, the number of tasks grows up to 200,000. Still the driver crashes with OOM with the default --driver-memory 1g, but the job succeeds if I specify 2g. Thanks and regards, Anand Viswanathan

Re: driver OOM - need recommended memory for driver

2016-09-19 Thread Mich Talebzadeh
If you make your driver memory too low, it is likely you are going to hit an OOM error. You have not mentioned which Spark mode you are using (Local, Standalone, Yarn etc). HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-19 Thread Gene Pang
Hi Mich, While Alluxio is not a database (it exposes a file system interface), you can use Alluxio to keep certain data in memory. With Alluxio, you can selectively pin data in memory (http://www.alluxio.org/docs/master/en/Command-Line-Interface.html#pin). There are also ways to control how to