Unable to serialize dataframe

2016-03-15 Thread shyam
I am using Spark version 1.5.2 with Sparkling Water version 1.5.2. I am getting a runtime error when I try to write a dataframe to disk. My code looks as shown below, val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",inferSchema.toString).option("header",isHeader.to

Reg: Reading a csv file with String label into LabeledPoint

2016-03-15 Thread Dharmin Siddesh J
Hi, I am trying to read a csv with a few double attributes and a String label. How can I convert it to a LabeledPoint RDD so that I can run it with Spark MLlib classification algorithms? I have tried the LabeledPoint constructor (available only for regression) but it accepts only a double-format label. I

Re: How to add an accumulator for a Set in Spark

2016-03-15 Thread pppsunil
Have you looked at using the Accumulable interface? Take a look at the Spark documentation at http://spark.apache.org/docs/latest/programming-guide.html#accumulators; it gives an example of how to use a vector type for an accumulator, which might be very close to what you need -- View this message in context:
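
A minimal sketch of that idea adapted to a Set, using the Spark 1.x AccumulableParam API (the names SetParam and seen are illustrative, not from the thread):

    import org.apache.spark.AccumulableParam

    // accumulates Strings into a Set; merging two partial results is set union
    object SetParam extends AccumulableParam[Set[String], String] {
      def addAccumulator(acc: Set[String], elem: String): Set[String] = acc + elem
      def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
      def zero(initial: Set[String]): Set[String] = Set.empty[String]
    }

    val seen = sc.accumulable(Set.empty[String])(SetParam)
    sc.parallelize(Seq("a", "b", "a")).foreach(x => seen += x)
    // seen.value is Set("a", "b") back on the driver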

convert row to map of key as int and values as arrays

2016-03-15 Thread Divya Gehlot
Hi, as I can't add columns from another DataFrame, I am planning to convert my row columns to a map of key and arrays. As I am new to Scala and Spark I am trying like below // create an empty map import scala.collection.mutable.{ArrayBuffer => mArrayBuffer} var map = Map[Int,mArrayBuffer[Any]]() def addNode(

Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-15 Thread Imre Nagi
Hi, I'm just trying to process the data that comes from the Kafka source in my Spark Streaming application. What I want to do is get the pair of topic and message in a tuple from the message stream. Here are my streams: val streams = KafkaUtils.createDirectStream[String, Array[Byte], > StringDeco
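
For reference, the Spark 1.x direct-stream API has a createDirectStream overload that takes a messageHandler, which can emit (topic, message) pairs directly; a hedged sketch, where kafkaParams and fromOffsets are placeholders (this overload requires explicit starting offsets):

    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // messageHandler turns each record into a (topic, message) tuple
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc,
      kafkaParams,   // e.g. Map("metadata.broker.list" -> "broker:9092")
      fromOffsets,   // Map[TopicAndPartition, Long] of starting offsets
      (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message))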

Re: Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
Thanks Mark and Jeff On Wed, Mar 16, 2016 at 7:11 AM, Mark Hamstra wrote: > Looks to me like the one remaining Stage would execute 19788 Tasks if all > of those Tasks succeeded on the first try; but because of retries, 19841 > Tasks were actually executed. Meanwhile, there were 41405 Tasks in th

Re: Job failed while submitting python to yarn programatically

2016-03-15 Thread sychungd
Hi Jeff, sorry, forgot to mention that the same Java code works fine if we replace the Python pi.py file with the jar version of the pi example.

Re: Job failed while submitting python to yarn programatically

2016-03-15 Thread Saisai Shao
You cannot directly invoke a Spark application by using yarn#client as you mentioned; it is deprecated and not supported. You have to use spark-submit to submit a Spark application to YARN. Also, the specific problem here is that you're invoking yarn#client to run the Spark app in yarn-client mode
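
If the job must be driven from Java/Scala code rather than a shell, the SparkLauncher API (available since Spark 1.4) wraps spark-submit programmatically; a minimal sketch with placeholder paths:

    import org.apache.spark.launcher.SparkLauncher

    // launches a spark-submit child process; paths here are placeholders
    val process = new SparkLauncher()
      .setSparkHome("/usr/lib/spark")
      .setAppResource("/path/to/pi.py")
      .setMaster("yarn-cluster")
      .launch()
    process.waitFor()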

Re: Job failed while submitting python to yarn programatically

2016-03-15 Thread Jeff Zhang
Could you try yarn-cluster mode? Make sure your cluster nodes can reach your client machine and there is no firewall. On Wed, Mar 16, 2016 at 10:54 AM, wrote: > > Hi all, > > We're trying to submit a python file, pi.py in this case, to yarn from java > code but this kept failing (1.6.0). > It seems the A

Fwd: Connection failure followed by bad shuffle files during shuffle

2016-03-15 Thread Eric Martin
Hi, I'm running into consistent failures during a shuffle read while trying to do a group-by followed by a count aggregation (using the DataFrame API on Spark 1.5.2). The shuffle read (in stage 1) fails with org.apache.spark.shuffle.FetchFailedException: Failed to send RPC 7719188499899260109 to

Job failed while submitting python to yarn programatically

2016-03-15 Thread sychungd
Hi all, we're trying to submit a python file, pi.py in this case, to yarn from java code but this kept failing (1.6.0). It seems the AM uses the arguments we passed to pi.py as the driver IP address. Could someone help me figure out how to get the job done? Thanks in advance. The java code look

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
It's the same as the hive thrift server. I believe kerberos is supported. On Wed, Mar 16, 2016 at 10:48 AM, ayan guha wrote: > so, how about implementing security? Any pointer will be helpful > > On Wed, Mar 16, 2016 at 1:44 PM, Jeff Zhang wrote: > >> The spark thrift server allow you to run hive q

Re: Spark Thriftserver

2016-03-15 Thread ayan guha
so, how about implementing security? Any pointer will be helpful On Wed, Mar 16, 2016 at 1:44 PM, Jeff Zhang wrote: > The spark thrift server allow you to run hive query in spark engine. It > can be used as jdbc server. > > On Wed, Mar 16, 2016 at 10:42 AM, ayan guha wrote: > >> Sorry to be

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
The spark thrift server allows you to run hive queries on the spark engine. It can be used as a jdbc server. On Wed, Mar 16, 2016 at 10:42 AM, ayan guha wrote: > Sorry to be dumb-head today, but what is the purpose of spark thriftserver > then? In other words, should I view spark thriftserver as a better
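
As a concrete illustration of the jdbc-server role: the thrift server speaks the HiveServer2 protocol, so any client can connect with the Hive JDBC driver. A minimal sketch with placeholder host, port, and table:

    import java.sql.DriverManager

    // connect to the Spark thrift server exactly as if it were HiveServer2
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
    val rs = conn.createStatement().executeQuery("SELECT count(*) FROM some_table")
    while (rs.next()) println(rs.getLong(1))
    conn.close()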

Re: Spark Thriftserver

2016-03-15 Thread ayan guha
Sorry to be dumb-head today, but what is the purpose of spark thriftserver then? In other words, should I view spark thriftserver as a better version of hive one (with Spark as execution engine instead of MR/Tez) OR should I see it as a JDBC server? On Wed, Mar 16, 2016 at 11:44 AM, Jeff Zhang wr

Re: Does parallelize and collect preserve the original order of list?

2016-03-15 Thread Ted Yu
Not necessarily. > On Mar 15, 2016, at 7:16 PM, JoneZhang wrote: > > Step1 >List items = new ArrayList();items.addAll(XXX); >javaSparkContext.parallelize(items).saveAsTextFile(output); > Step2 >final List items2 = ctx.textFile(output).collect(); > > Does items and items

Does parallelize and collect preserve the original order of list?

2016-03-15 Thread JoneZhang
Step 1: List items = new ArrayList(); items.addAll(XXX); javaSparkContext.parallelize(items).saveAsTextFile(output); Step 2: final List items2 = ctx.textFile(output).collect(); Do items and items2 have the same order? Best wishes. Thanks. -- View this message i
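
Ted's "not necessarily" above concerns the read-back step: parallelize and collect preserve list order, but textFile over a directory of part files makes no ordering promise. A hedged workaround sketch, assuming the items are tab-free strings, is to carry an explicit index:

    // persist each element with its position, then sort on read-back
    val indexed = sc.parallelize(items).zipWithIndex()   // (item, position)
    indexed.map { case (item, i) => s"$i\t$item" }.saveAsTextFile(output)

    val restored = sc.textFile(output)
      .map(_.split("\t", 2))
      .map(a => (a(0).toLong, a(1)))
      .sortByKey()
      .values
      .collect()   // original order restored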

PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-03-15 Thread craigiggy
I am having trouble with my standalone Spark cluster and I can't seem to find a solution anywhere. I hope that maybe someone can figure out what is going wrong so this issue might be resolved and I can continue with my work. I am currently attempting to use Python and the pyspark library to do dis

Re: S3 Zip File Loading Advice

2016-03-15 Thread Benjamin Kim
Hi Xinh, I tried to wrap it, but it still didn't work. I got a "java.util.ConcurrentModificationException". All, I have been trying and trying with some help of a coworker, but it's slow going. I have been able to gather a list of the S3 files I need to download. ### S3 Lists ### import scala

Re: Spark UI Completed Jobs

2016-03-15 Thread Mark Hamstra
Looks to me like the one remaining Stage would execute 19788 Tasks if all of those Tasks succeeded on the first try; but because of retries, 19841 Tasks were actually executed. Meanwhile, there were 41405 Tasks in the 163 Stages that were skipped. I think -- but the Spark UI's accounting may n

Re: Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
Okay, so out of 164 stages, 163 are skipped. And how are 41405 tasks skipped if the total is only 19788? On Wed, Mar 16, 2016 at 6:31 AM, Mark Hamstra wrote: > It's not just if the RDD is explicitly cached, but also if the map outputs > for stages have been materialized into shuffle files and

RE: sparkR issues ?

2016-03-15 Thread Sun, Rui
I have submitted https://issues.apache.org/jira/browse/SPARK-13905 and a PR for it. From: Alex Kozlov [mailto:ale...@gmail.com] Sent: Wednesday, March 16, 2016 12:52 AM To: roni Cc: Sun, Rui ; user@spark.apache.org Subject: Re: sparkR issues ? Hi Roni, you can probably rename the as.data.frame

Re: Spark UI Completed Jobs

2016-03-15 Thread Mark Hamstra
It's not just if the RDD is explicitly cached, but also if the map outputs for stages have been materialized into shuffle files and are still accessible through the map output tracker. Because of that, explicitly caching RDD actions often gains you little or nothing, since even without a call to c
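
A small sketch of the effect Mark describes (an illustration, not from the thread): the second action below reuses the first action's shuffle output, so its map stage shows up as skipped in the UI even though nothing was explicitly cached:

    val grouped = sc.parallelize(1 to 1000000)
      .map(x => (x % 10, 1))
      .reduceByKey(_ + _)   // introduces a shuffle

    grouped.count()     // job 1: both stages run, shuffle files are written
    grouped.collect()   // job 2: the map stage is skipped, shuffle output reused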

Re: How to add an accumulator for a Set in Spark

2016-03-15 Thread Ted Yu
Please take a look at: core/src/test/scala/org/apache/spark/AccumulatorSuite.scala FYI On Tue, Mar 15, 2016 at 4:29 PM, SRK wrote: > Hi, > > How do I add an accumulator for a Set in Spark? > > Thanks! > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.c

Re: Streaming app consume multiple kafka topics

2016-03-15 Thread Imre Nagi
Hi Cody, can you give a small example of how to use mapPartitions with a switch on topic? I've tried, yet it still didn't work. On Tue, Mar 15, 2016 at 9:45 PM, Cody Koeninger wrote: > The direct stream gives you access to the topic. The offset range for > each partition contains the topic. That way

Re: Spark UI Completed Jobs

2016-03-15 Thread Jeff Zhang
If an RDD is cached, this RDD is only computed once, and the stages for computing this RDD in the following jobs are skipped. On Wed, Mar 16, 2016 at 8:14 AM, Prabhu Joseph wrote: > Hi All, > > > Spark UI Completed Jobs section shows below information, what is the > skipped value shown for Stages a

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
The spark thrift server is very similar to the hive thrift server. You can use the hive jdbc driver to access the spark thrift server. AFAIK, all the features of the hive thrift server are also available in the spark thrift server. On Wed, Mar 16, 2016 at 8:39 AM, ayan guha wrote: > Hi All > > My understanding about

Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
Right, it is a little confusing here. dropTempTable actually means unregister here. It only deletes the metadata of this table from the catalog, but you can still operate on the table through its DataFrame. On Wed, Mar 16, 2016 at 8:27 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > Thanks
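
A minimal sketch of that behavior, written as the Scala equivalent of the PySpark calls in this thread (the input path is a placeholder):

    val df = sqlContext.read.json("people.json")    // any DataFrame will do
    df.registerTempTable("people")
    sqlContext.sql("SELECT count(*) FROM people")   // works: the name is registered

    sqlContext.dropTempTable("people")              // removes only the catalog entry
    df.count()                                      // still fine: the DataFrame survives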

Re: Error building spark app with Maven

2016-03-15 Thread Ted Yu
bq. remove them after the job finished. bq. That will keep audit people happy Looks like the above two may not be achieved at the same time :-) On Tue, Mar 15, 2016 at 5:04 PM, Mich Talebzadeh wrote: > in mvn the build mvn package will look for a file called pom.xml > > in sbt the build sbt pa

Spark Thriftserver

2016-03-15 Thread ayan guha
Hi All My understanding about thriftserver is to use it to expose pre-loaded RDD/dataframes to tools who can connect through JDBC. Is there something like Spark JDBC server too? Does it do the same thing? What is the difference between these two? How does Spark JDBC/Thrift supports security? Can

Re: filter by dict() key in pySpark

2016-03-15 Thread Davies Liu
Another solution could be using a left-semi join: keys = sqlContext.createDataFrame(dict.keys()) DF2 = DF1.join(keys, DF1.a == keys.k, "leftsemi") On Wed, Feb 24, 2016 at 2:14 AM, Franc Carter wrote: > > A colleague found how to do this, the approach was to use a udf() > > cheers > > On 21 February
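
The same idea in Scala, as a hedged sketch (DF1 and the column names a/k are assumed from the thread; the key values are illustrative):

    // keep only DF1 rows whose column "a" appears in the keys column "k"
    val keys = sqlContext.createDataFrame(Seq("x", "y").map(Tuple1(_))).toDF("k")
    val DF2 = DF1.join(keys, DF1("a") === keys("k"), "leftsemi")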

Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Andy Davidson
Thanks Jeff. I was looking for something like 'unregister'. In SQL you use drop to delete a table. I always thought register was a strange function name. register^-1 = unregister, createTable^-1 = dropTable. Andy From: Jeff Zhang Date: Tuesday, March 15, 2016 at 4:44 PM To: Andrew Davidso

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
that should read anything.sbt Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 16 March 2016 at 00:04,

Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
Hi All, the Spark UI Completed Jobs section shows the information below; what is the skipped value shown for Stages and Tasks?
Job_ID | Description | Submitted | Duration | Stages (Succeeded/Total) | Tasks (for all stages): Succeeded/Total
11 | count | 2016/03/1

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
in mvn, the build "mvn package" will look for a file called pom.xml; in sbt, the build "sbt package" will look for a file called anything.smt. It works. Keep it simple. I will write a ksh script that will create both generic and sbt files on the fly in the correct directory (at the top of the tree) and

Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
>>> sqlContext.registerDataFrameAsTable(df, "table1") >>> sqlContext.dropTempTable("table1") On Wed, Mar 16, 2016 at 7:40 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > Thanks > > Andy > -- Best Regards Jeff Zhang

what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Andy Davidson
Thanks Andy

Re: Error building spark app with Maven

2016-03-15 Thread Jakob Odersky
The artifactId in maven basically (in a simple case) corresponds to name in sbt. Note however that you will manually need to append the _scalaBinaryVersion to the artifactId in case you would like to build against multiple scala versions (otherwise maven will overwrite the generated jar with the l
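
A hedged pom fragment illustrating Jakob's point, using the ImportCSV values quoted later in this thread (the groupId is an assumption):

    <!-- sketch: the Scala binary version is appended to artifactId by hand -->
    <groupId>com.example</groupId>
    <artifactId>importcsv_2.10</artifactId>
    <version>1.0</version>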

Re: Error building spark app with Maven

2016-03-15 Thread Ted Yu
Feel free to adjust artifact Id and version in maven. They're under your control. > On Mar 15, 2016, at 4:27 PM, Mich Talebzadeh > wrote: > > ok Ted > > In sbt I have > > name := "ImportCSV" > version := "1.0" > scalaVersion := "2.10.4" > > which ends up in importcsv_2.10-1.0.jar as part

How to add an accumulator for a Set in Spark

2016-03-15 Thread SRK
Hi, How do I add an accumulator for a Set in Spark? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-an-accumulator-for-a-Set-in-Spark-tp26510.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

what is the best practice to read configure file in spark streaming

2016-03-15 Thread yaoxiaohua
Hi guys, I'm using kafka+spark streaming to do log analysis. Now my requirement is that the log alarm rules may change sometimes. Rules may look like this: App=Hadoop,keywords=oom|Exception|error,threshold=10 The thresho
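
One common pattern (an assumption on my part, not from this thread) is to re-read the small rules file on the driver at every batch, so rule changes take effect without restarting the job; stream, parseRule, and applyRule below are hypothetical placeholders:

    stream.foreachRDD { rdd =>
      // runs on the driver once per micro-batch; cheap for a small rules file
      val rules = scala.io.Source.fromFile("/etc/alarm-rules.conf")
        .getLines().map(parseRule).toMap
      val bcRules = rdd.sparkContext.broadcast(rules)
      rdd.foreach(line => applyRule(bcRules.value, line))
    }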

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
ok Ted In sbt I have name := "ImportCSV" version := "1.0" scalaVersion := "2.10.4" which ends up in importcsv_2.10-1.0.jar as part of target/scala-2.10/importcsv_2.10-1.0.jar In mvn I have <version>1.0</version> <artifactId>scala</artifactId> Does it matter? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/vie

Re: Error building spark app with Maven

2016-03-15 Thread Ted Yu
<version>1.0</version> ... <artifactId>scala</artifactId> On Tue, Mar 15, 2016 at 4:14 PM, Mich Talebzadeh wrote: > An observation > > Once compiled with MVN the job submit works as follows: > > + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages > com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master spark:// > 50.1

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
An observation Once compiled with MVN the job submit works as follows: + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master spark:// 50.140.197.217:7077 --executor-memory=12G --executor-cores=12 --num-executors=2 target/s

Re: Get output of the ALS algorithm.

2016-03-15 Thread Bryan Cutler
Jacek is correct for using org.apache.spark.ml.recommendation.ALSModel If you are trying to save org.apache.spark.mllib.recommendation.MatrixFactorizationModel, then it is similar, but just a little different, see the example here https://github.com/apache/spark/blob/master/examples/src/main/scala

spark.ml : eval model outside sparkContext

2016-03-15 Thread Emmanuel
Hello, in MLlib with Spark 1.4, I was able to eval a model by loading it and using `predict` on a vector of features. I would train on Spark but use my model in my workflow. In `spark.ml` it seems like the only way to eval is to use `transform`, which only takes a DataFrame. To build a DataFrame

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
Many thanks Ted and thanks for the heads up Jakob Just these two changes to dependencies org.apache.spark spark-core_2.10 1.5.1 org.apache.spark spark-sql_2.10 1.5.1 [DEBUG] endProcessChildren: artifact=spark:scala:jar:1.0 [INFO] -

Re: Error building spark app with Maven

2016-03-15 Thread Jakob Odersky
Hi Mich, probably unrelated to the current error you're seeing, however the following dependencies will bite you later: spark-hive_2.10 spark-csv_2.11 the problem here is that you're using libraries built for different Scala binary versions (the numbers after the underscore). The simple fix here is

Re: Error building spark app with Maven

2016-03-15 Thread Ted Yu
Please suffix _2.10 to artifact name See: http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 On Tue, Mar 15, 2016 at 3:08 PM, Mich Talebzadeh wrote: > Hi, > > I normally use sbt and using this sbt file works fine for me > > cat ImportCSV.sbt > name := "ImportCSV" > version := "1

Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
Hi, I normally use sbt and using this sbt file works fine for me cat ImportCSV.sbt name := "ImportCSV" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1" libraryDependenc

Re: Microsoft SQL dialect issues

2016-03-15 Thread Suresh Thalamati
You should be able to register your own dialect if the default mappings are not working for your scenario. import org.apache.spark.sql.jdbc._ JdbcDialects.registerDialect(MyDialect) Please refer to JdbcDialects to find examples of the existing default dialects for your database or another databa
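
A hedged sketch of such a dialect for SQL Server, using the public JdbcDialect API from Spark 1.4+ (the type mappings shown are illustrative choices, not a vetted MSSQL dialect):

    import java.sql.Types
    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
    import org.apache.spark.sql.types._

    object MSSQLDialect extends JdbcDialect {
      // claim any jdbc:sqlserver URL
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")
      // override the default type mapping where MSSQL differs
      override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
        case StringType  => Some(JdbcType("NVARCHAR(MAX)", Types.NVARCHAR))
        case BooleanType => Some(JdbcType("BIT", Types.BIT))
        case _           => None   // fall back to Spark's defaults
      }
    }

    JdbcDialects.registerDialect(MSSQLDialect)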

Re: Installing Spark on Mac

2016-03-15 Thread Jakob Odersky
Hi, what do you get running just 'sudo netstat'? Also, what's the output of 'jps -mlv' when running your spark application? Can you post the contents of the files in $SPARK_HOME/conf ? Are there any special firewall rules in place, forbidding connections on localhost? Regarding the IP address chan

Spark streaming with akka association with remote system failure

2016-03-15 Thread David Gomez Saavedra
hi there, I'm trying to set up a simple spark streaming app using akka actors as receivers. I followed the example provided and created two apps, one creating an actor system and another one subscribing to it. I can see the subscription message, but a few seconds later I get an error [info] 20:37:40

Re: How to convert Parquet file to a text file.

2016-03-15 Thread Kevin Mellott
I'd recommend reading the parquet file into a DataFrame object, and then using spark-csv to write to a CSV file. On Tue, Mar 15, 2016 at 3:34 PM, Shishir Anshuman wrote: > I need to convert the parquet file generated by the spark to a text (csv > prefera
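
A minimal sketch of that route (paths are placeholders; spark-csv is the same com.databricks package used elsewhere in this digest):

    val df = sqlContext.read.parquet("/data/input.parquet")
    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/data/output_csv")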

Re: newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
Hi Frank, we have thousands of small files. Each file is between 6K and maybe 100K. Conductor looks interesting. Andy From: Frank Austin Nothaft Date: Tuesday, March 15, 2016 at 11:59 AM To: Andrew Davidson Cc: "user @spark" Subject: Re: newbie HDFS S3 best practices > Hard to say with #

Re: Spark and KafkaUtils

2016-03-15 Thread Vinti Maheshwari
Hi Cody, I wanted to share my updated build.sbt, which works with kafka without giving any error; it may help other users if they face a similar issue. name := "NetworkStreaming" version := "1.0" scalaVersion:= "2.10.5" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-streaming-kafka" %

How to convert Parquet file to a text file.

2016-03-15 Thread Shishir Anshuman
I need to convert the parquet file generated by the spark to a text (csv preferably) file. I want to use the data model outside spark. Any suggestion on how to proceed?

Re: Docker configuration for akka spark streaming

2016-03-15 Thread David Gomez Saavedra
The issue is related to https://issues.apache.org/jira/browse/SPARK-13906 .set("spark.rpc.netty.dispatcher.numThreads","2") seems to fix the problem On Tue, Mar 15, 2016 at 6:45 AM, David Gomez Saavedra wrote: > I have updated the config since I realized the actor system was listening > on

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manas kar
You are quite right. I am getting this error while profiling my module to see what the minimum resources are that I can use to achieve my SLA. My point is that if a resource constraint creates this problem, then this issue is just waiting to happen in a larger scenario (though the probability of happening w

Re: Microsoft SQL dialect issues

2016-03-15 Thread Mich Talebzadeh
Hi, can you please clarify what you are trying to achieve, and I guess you mean Transact-SQL for MSSQL? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Microsoft SQL dialect issues

2016-03-15 Thread Andrés Ivaldi
Hello, I'm trying to use MSSQL, storing data on MSSQL, but I'm having dialect problems. I found this https://mail-archives.apache.org/mod_mbox/spark-issues/201510.mbox/%3cjira.12901078.1443461051000.34556.1444123886...@atlassian.jira%3E That is what is happening to me. It's possible to define the di

How to select from table name using IF(condition, tableA, tableB)?

2016-03-15 Thread Rex X
I want to do a query based on a logic condition to choose between two tables: select * from if(A>B, tableA, tableB) But the "if" function in Hive cannot work within FROM as above. Any idea how?
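
Hive/Spark SQL cannot make FROM dynamic, but the choice can be made outside the query; a hedged sketch, where the condition and table names are placeholders:

    // evaluate the condition first, then interpolate the winning table name
    val table = if (aGreaterThanB) "tableA" else "tableB"
    val result = sqlContext.sql(s"SELECT * FROM $table")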

Re: newbie HDFS S3 best practices

2016-03-15 Thread Frank Austin Nothaft
Hard to say with #1 without knowing your application’s characteristics; for #2, we use conductor with IAM roles, .boto/.aws/credentials files. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 > On Mar 15, 2016, at 11:

Re: bug? using withColumn with colName with dot can't replace column

2016-03-15 Thread Jan Štěrba
First off, I would advise against having dots in column names, that's just playing with fire. Second, the exception is really strange since spark is complaining about a completely unrelated column. I would like to see the df schema before the exception was thrown. -- Jan Sterba https://twitter.com/h

newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
We use the spark-ec2 script to create AWS clusters as needed (we do not use AWS EMR) 1. will we get better performance if we copy data to HDFS before we run instead of reading directly from S3? 2. What is a good way to move results from HDFS to S3? It seems like there are many ways to bulk copy

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
The data in the title is different, so correcting the data in the column requires finding out what the correct data is and then replacing it. Finding the correct data could be tedious, but if some mechanism were in place that could help group the partially matched data, then it might help to do the furt

RE: [MARKETING] Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Iain Cundy
Hi Manas I saw a very similar problem while using mapWithState. Timeout on BlockManager remove leading to a stall. In my case it only occurred when there was a big backlog of micro-batches, combined with a shortage of memory. The adding and removing of blocks between new and old tasks was inte

Best way to process values for key in sorted order

2016-03-15 Thread James Hammerton
Hi, I need to process some events in a specific order based on a timestamp, for each user in my data. I had implemented this by using the dataframe sort method to sort by user id and then sort by the timestamp secondarily, then do a groupBy().mapValues() to process the events for each user. Howe
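
One commonly suggested alternative (an assumption on my part, not from this thread) is a secondary sort via repartitionAndSortWithinPartitions with a composite key; a sketch assuming an events RDD whose elements carry userId and timestamp fields:

    import org.apache.spark.Partitioner

    // partition on userId only, so each user's events land in one partition;
    // sorting the composite (userId, timestamp) key then orders them in time
    class UserPartitioner(n: Int) extends Partitioner {
      def numPartitions: Int = n
      def getPartition(key: Any): Int = key match {
        case (userId: String, _) => math.abs(userId.hashCode) % n
      }
    }

    val keyed = events.map(e => ((e.userId, e.timestamp), e))
    val ordered = keyed.repartitionAndSortWithinPartitions(new UserPartitioner(32))
    // each partition now yields one user's events in timestamp order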

Re: Parition RDD by key to create DataFrames

2016-03-15 Thread Davies Liu
I think you could create a DataFrame with schema (mykey, value1, value2), then partition it by mykey when saving as parquet. val r2 = rdd.map { case (k, v) => Row(k, v._1, v._2) } val df = sqlContext.createDataFrame(r2, schema) df.write.partitionBy("mykey").parquet(path) On Tue, Mar 15, 2016 at 10:33 AM, Moham

bug? using withColumn with colName with dot can't replace column

2016-03-15 Thread Emmanuel
In Spark 1.6 if I do (column name has a dot in it, but is not a nested column):
df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`"))
scala> df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`"))
org.apache.spark.sql.AnalysisException: cannot resolve 'raw.minOfDay' given input colu

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
Is it always the case that one title is a substring of another ? -- Not always. One title can have values like D.O.C, doctor_{areacode}, doc_{dep,areacode} On Mon, Mar 14, 2016 at 10:39 PM, Wail Alkowaileet wrote: > I think you need some sort of fuzzy join ? > Is it always the case that one titl

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Sea
Hi, manas: Maybe you can look at this bug: https://issues.apache.org/jira/browse/SPARK-13566 -- Original message -- From: "manas kar"; Date: Tuesday, March 15, 2016, 10:48; To: "Ted Yu"; Cc: "user"; Subject: Re: mapwithstate Hangs with Error cleaning b

Parition RDD by key to create DataFrames

2016-03-15 Thread Mohamed Nadjib MAMI
Hi, I have a pair RDD of the form: (mykey, (value1, value2)) How can I create a DataFrame having the schema [V1 String, V2 String] to store [value1, value2] and save it into a Parquet table named "mykey"? The createDataFrame() method takes an RDD and a schema (StructType) as parameters. The sc

Questions about Spark On Mesos

2016-03-15 Thread Shuai Lin
Hi list, We (scrapinghub) are planning to deploy spark in a 10+ node cluster, mainly for processing data in HDFS and kafka streaming. We are thinking of using mesos instead of yarn as the cluster resource manager so we can use docker container as the executor and makes deployment easier. But there

Re: create hive context in spark application

2016-03-15 Thread Antonio Si
Thanks Akhil. Yes, spark-shell works fine. In my app, I have a Restful service and from the Restful service, I am calling the spark-api to do some hiveql. That's why I am not using spark-submit. Thanks. Antonio. On Tue, Mar 15, 2016 at 12:02 AM, Akhil Das wrote: > Did you ry submitting your

Re: sparkR issues ?

2016-03-15 Thread Alex Kozlov
Hi Roni, you can probably rename the as.data.frame in $SPARK_HOME/R/pkg/R/DataFrame.R and re-install SparkR by running install-dev.sh On Tue, Mar 15, 2016 at 8:46 AM, roni wrote: > Hi , > Is there a work around for this? > Do i need to file a bug for this? > Thanks > -R > > On Tue, Mar 15, 201

Re: Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi, this is an interesting point of view. I thought the HashPartitioner worked completely differently. Here's my understanding: the HashPartitioner defines how keys are distributed within a dataset between the different partitions, but plays no role in assigning each partition for processing by exe

Release Announcement: XGBoost4J - Portable Distributed XGBoost in Spark, Flink and Dataflow

2016-03-15 Thread Nan Zhu
Dear Spark Users and Developers, We (Distributed (Deep) Machine Learning Community (http://dmlc.ml/)) are happy to announce the release of XGBoost4J (http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html), a Portable Distributed XGBoost in Spark, Fli

Re: sparkR issues ?

2016-03-15 Thread roni
Hi, is there a workaround for this? Do I need to file a bug for this? Thanks -R On Tue, Mar 15, 201

Re: sparkR issues ?

2016-03-15 Thread roni
Alex, no, I have not defined the "dataframe"; it's the Spark default DataFrame. That line is just casting Factor as a dataframe to send to the function. Thanks -R On Mon, Mar 14, 2016 at 11:58 PM, Alex Kozlov wrote: > This seems to be a very unfortunate name collision. SparkR defines it's > own DataF

Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Once again, please use roles, there is no way that you have to specify the access keys in the URI under any situation. Please read Amazon documentation and they will say the same. The only situation when you use the access keys in URI is when you have not read the Amazon documentation :) Regards,

Re: Spark work distribution among execs

2016-03-15 Thread manasdebashiskar
Your input is skewed in terms of the default hash partitioner that is used. Your options are to use a custom partitioner that can re-distribute the data evenly among your executors. I think you will see the same behaviour when you use more executors. It is just that the data skew appears to be les

Re: reading file from S3

2016-03-15 Thread Sabarish Sasidharan
There are many solutions to a problem. Also understand that sometimes your situation might be such. For ex what if you are accessing S3 from your Spark job running in your continuous integration server sitting in your data center or may be a box under your desk. And sometimes you are just trying s

Re: how to set log level of spark executor on YARN(using yarn-cluster mode)

2016-03-15 Thread jkukul
Hi Eric (or rather: anyone who's experiencing a similar situation), I think your problem was that the --files parameter was provided after the application jar. Your command should have looked like this instead: ./bin/spark-submit --class edu.bjut.spark.SparkPageRank --master yarn-cluster

Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Oh!!! What the hell. Please never use the URI s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY. That is a major cause of pain, security issues, code maintenance issues, and of course something that Amazon strongly suggests that we do not use. Please use roles and you will not have to worry about s

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manas kar
I am using spark 1.6. I am not using any broadcast variable. This broadcast variable is probably used by the state management of mapwithState ...Manas On Tue, Mar 15, 2016 at 10:40 AM, Ted Yu wrote: > Which version of Spark are you using ? > > Can you show the code snippet w.r.t. broadcast vari

Re: Streaming app consume multiple kafka topics

2016-03-15 Thread Cody Koeninger
The direct stream gives you access to the topic. The offset range for each partition contains the topic. That way you can create a single stream, and the first thing you do with it is mapPartitions with a switch on topic. Of course, it may make more sense to separate topics into different jobs,
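
A hedged sketch of Cody's pattern: with the direct stream, RDD partition i lines up with offsetRanges(i), so the topic is known per partition (stream creation is omitted; handleA and handleB are placeholder functions):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.mapPartitionsWithIndex { (i, iter) =>
        offsetRanges(i).topic match {   // switch on this partition's topic
          case "topicA" => iter.map(handleA)
          case "topicB" => iter.map(handleB)
          case _        => Iterator.empty
        }
      }.count()   // any action, to trigger the work
    }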

Re: Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi, yes, I'm running the executors with 8 cores each. I also have properly configured executor memory, driver memory, num execs and so on in the submit cmd. I'm a long-time spark user, so please let's skip the dummy cmd configuration stuff and dive into the interesting stuff :) Another strange thing I've n

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Ted Yu
Which version of Spark are you using ? Can you show the code snippet w.r.t. broadcast variable ? Thanks On Tue, Mar 15, 2016 at 6:04 AM, manasdebashiskar wrote: > Hi, > I have a streaming application that takes data from a kafka topic and uses > mapwithstate. > After couple of hours of smoot

Re: reading file from S3

2016-03-15 Thread Sabarish Sasidharan
You have a slash before the bucket name. It should be @. Regards Sab On 15-Mar-2016 4:03 pm, "Yasemin Kaya" wrote: > Hi, > > I am using Spark 1.6.0 standalone and I want to read a txt file from S3 > bucket named yasemindeneme and my file name is deneme.txt. But I am getting > this error. Here is
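
For completeness, the credentials can also be supplied through the Hadoop configuration instead of being embedded in the URI, which side-steps the slash/@ confusion entirely (roles, as suggested elsewhere in this thread, avoid even this); a hedged sketch:

    // keys come from configuration, so the URI stays clean
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
    val lines = sc.textFile("s3n://yasemindeneme/deneme.txt")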

Re: Spark work distribution among execs

2016-03-15 Thread Chitturi Padma
By default spark uses 2 executors with one core each, have you allocated more executors using the command line args as - --num-executors 25 --executor-cores x ??? What do you mean by the difference between the nodes is huge ? Regards, Padma Ch On Tue, Mar 15, 2016 at 6:57 PM, bkapukaranov [via

Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Hi, Try starting your clusters with roles, and you will not have to configure, hard code anything at all. Let me know in case you need any help with this. Regards, Gourav Sengupta On Tue, Mar 15, 2016 at 11:32 AM, Yasemin Kaya wrote: > Hi Safak, > > I changed the Keys but there is no change.

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Thanks, the maven structure is identical to sbt; just the sbt file I will have to replace with pom.xml. I will use your pom.xml to start with. Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
Yes, sbt uses the same structure as maven for source files. > On Mar 15, 2016, at 1:53 PM, Mich Talebzadeh > wrote: > > Thanks the maven structure is identical to sbt. just sbt file I will have to > replace with pom.xml > > I will use your pom.xml to start with it. > > Cheers > > Dr Mich Ta

Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi, I'm running a Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster. I observe a very strange issue. I run a simple job that reads about 1TB of json logs from a remote HDFS cluster and converts them to parquet, then saves them to the local HDFS of the Hadoop cluster. I run it with 25 executors with

Re: Can we use spark inside a web service?

2016-03-15 Thread Andrés Ivaldi
Thanks Evan for the points. I had supposed what you said, but as I don't have enough experience maybe I was missing something, thanks for the answer!! On Mon, Mar 14, 2016 at 7:22 PM, Evan Chan wrote: > Andres, > > A couple points: > > 1) If you look at my post, you can see that you could use Sp

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
You can build using maven from the command line as well. This layout should give you an idea, and here are some resources - http://www.scala-lang.org/old/node/345
project/
  pom.xml - Defines the project
  src/
    main/
      java/ - Contains a

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
sounds like the layout is basically the same as the sbt layout; the sbt file is replaced by pom.xml? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: Installing Spark on Mac

2016-03-15 Thread Aida Tefera
Hi Jakob, sorry for my late reply. I tried to run the below; it came back with "netstat: lunt: unknown or uninstrumented protocol". I also tried uninstalling version 1.6.0 and installing version 1.5.2 with Java 7 and Scala version 2.10.6; got the same error messages. Do you think it would be worth me

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Thanks again. Is there any way one can set this up without eclipse, much like what I did with sbt? I need to know the directory structure for a MVN project. Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
