RE: Spark-submit and Windows / Linux mixed network

2014-11-12 Thread Ashic Mahtab
jar not found :( Seems if I create a directory symlink so that the share path is the same on the unix mount point as in windows, and submit from the drive where the mount point is, then it works. Granted, that's quite an ugly hack. Reverting to serving the jar off http (i.e. using a relative

Best way of transforming stack traces

2014-11-12 Thread Kevin Kilroy
Hi, I have log files with lines that begin with a timestamp. However, those lines continue onto new lines representing Java stack traces. I want to be able to search for the line and then pull out the corresponding stack trace. I was thinking of using either take(n) or reduce to 'peek' ahead at
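One way to keep a stack trace attached to the timestamped line it belongs to (a sketch, not from the thread; it assumes ISO-style timestamps and files small enough to load whole) is to read whole files and split on a lookahead for the timestamp pattern:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("log-records"))
    // Zero-width lookahead keeps the timestamp on each record; (?m) makes ^ match at line starts.
    val recordStart = """(?m)(?=^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2})"""
    // "hdfs:///logs/*.log" is a placeholder path.
    val records = sc.wholeTextFiles("hdfs:///logs/*.log")
      .flatMap { case (_, content) => content.split(recordStart) }
    // Each element is now one timestamped line plus any stack-trace lines that follow it.
    val withTraces = records.filter(_.contains("Exception"))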

RE: scala.MatchError

2014-11-12 Thread Naveen Kumar Pokala
Hi, Do you mean that with Java I shouldn't have an Issue class as a property (attribute) in the Instrument class? Ex: Class Issue { Int a; } Class Instrument { Issue issue; } How about Scala? Does it support such user-defined datatypes in classes? Case class Issue . case class Issue( a:Int = 0)

Nested Complex Type Data Parsing and Transforming to table

2014-11-12 Thread luohui20001
Hi, I got a problem when reading a textfile which contains nested complex type data and got a type mismatch problem. Any hint will be appreciated. The problem takes place at map(s => s.map as type mismatch; found :

Pass RDD to functions

2014-11-12 Thread Deep Pradhan
Hi, Can we pass RDD to functions? Like, can we do the following? def func (temp: RDD[String]): RDD[String] = { //body of the function } Thank You

Re: Scala vs Python performance differences

2014-11-12 Thread Andrew Ash
Jeremy, Did you complete this benchmark in a way that's shareable with those interested here? Andrew On Tue, Apr 15, 2014 at 2:50 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I'd also be interested in seeing such a benchmark. On Tue, Apr 15, 2014 at 9:25 AM, Ian Ferreira

Re: Pass RDD to functions

2014-11-12 Thread Akhil Das
Yes, you can create more and more pipelines with your RDDs. Thanks Best Regards On Wed, Nov 12, 2014 at 3:24 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Can we pass RDD to functions? Like, can we do the following? def func (temp: RDD[String]): RDD[String] = { //body of the
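For completeness, a minimal sketch of what is being asked (names are hypothetical): an RDD is an ordinary value, so it can be passed to and returned from functions, and the transformations inside stay lazy until an action runs.

    import org.apache.spark.rdd.RDD

    def func(temp: RDD[String]): RDD[String] = {
      // body of the function: any chain of transformations works
      temp.filter(_.nonEmpty).map(_.toUpperCase)
    }

    val lines = sc.textFile("hdfs:///some/input")   // placeholder path
    val cleaned = func(lines)                       // nothing executes yet
    println(cleaned.count())                        // the action triggers the pipeline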

Re: Scala vs Python performance differences

2014-11-12 Thread Samarth Mailinglist
I was about to ask this question. On Wed, Nov 12, 2014 at 3:42 PM, Andrew Ash and...@andrewash.com wrote: Jeremy, Did you complete this benchmark in a way that's shareable with those interested here? Andrew On Tue, Apr 15, 2014 at 2:50 PM, Nicholas Chammas nicholas.cham...@gmail.com

Spark and insertion into RDBMS/NoSQL

2014-11-12 Thread nitinkalra2000
Hi All, We are exploring insertion into an RDBMS (SQL Server) through Spark via the JDBC driver. The excerpt from the code is as follows. We are doing the insertion inside an action: Integer res = flatMappedRDD.reduce(new Function2<Integer,Integer,Integer>(){
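A common pattern for this kind of insert (a hedged sketch in Scala rather than the poster's Java, assuming an RDD[Int] and placeholder connection details) is to write inside foreachPartition so each partition opens one connection and sends one batch:

    import java.sql.DriverManager

    flatMappedRDD.foreachPartition { rows =>
      // One connection per partition; URL, credentials and SQL are placeholders.
      val conn = DriverManager.getConnection(
        "jdbc:sqlserver://dbhost:1433;databaseName=mydb", "user", "pass")
      val stmt = conn.prepareStatement("INSERT INTO results (value) VALUES (?)")
      try {
        rows.foreach { v => stmt.setInt(1, v); stmt.addBatch() }
        stmt.executeBatch()
      } finally {
        stmt.close()
        conn.close()
      }
    }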

Number of partitions in RDD for input DStreams

2014-11-12 Thread Juan Rodríguez Hortalá
Hi list, In an excellent blog post on Kafka and Spark Streaming integration ( http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/), Michael Noll poses an assumption about the number of partitions of the RDDs created by input DStreams. He says his

Spark SQL configurations

2014-11-12 Thread Naveen Kumar Pokala
[inline screenshot of Spark SQL configuration properties] Hi, How do I set the above properties on JavaSQLContext? I am not able to see a setConf method on the JavaSQLContext object. I have added the spark core jar and spark assembly jar to my build path, and I am using Spark 1.1.0 and Hadoop 2.4.0. --Naveen

Re: How to kill a Spark job running in cluster mode ?

2014-11-12 Thread Tao Xiao
Thanks for your replies. Actually we can kill a driver by the command bin/spark-class org.apache.spark.deploy.Client kill <spark-master> <driver-id> if you know the driver id. 2014-11-11 22:35 GMT+08:00 Ritesh Kumar Singh riteshoneinamill...@gmail.com : There is a property :

Re: Pass RDD to functions

2014-11-12 Thread qinwei
I think it's ok, feel free to treat an RDD like a common object. qinwei  From: Deep Pradhan Date: 2014-11-12 18:24 To: user@spark.apache.org Subject: Pass RDD to functions Hi, Can we pass RDD to functions? Like, can we do the following? def func (temp: RDD[String]): RDD[String] = { //body of the

Java client connection

2014-11-12 Thread Eduardo Cusa
Hi guys, I am starting to work with Spark from Java, and when I run the following code: SparkConf conf = new SparkConf().setMaster("spark://10.0.2.20:7077").setAppName("SparkTest"); JavaSparkContext sc = new JavaSparkContext(conf); I received the following error and the Java process exits:

RE: Spark SQL configurations

2014-11-12 Thread Naveen Kumar Pokala
Thanks Akhil. -Naveen From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Wednesday, November 12, 2014 6:38 PM To: Naveen Kumar Pokala Cc: user@spark.apache.org Subject: Re: Spark SQL configurations JavaSQLContext.sqlContext.setConf is available. Thanks Best Regards On Wed, Nov 12, 2014
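For reference, a short sketch of the equivalent call in Scala (the Java wrapper reaches the same underlying SQLContext, as the reply above notes); the property name and value here are only examples:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")   // any Spark SQL key/value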

why flatmap has shuffle

2014-11-12 Thread qinwei
Hi, everyone! I consider flatMap a narrow dependency, so why does it have a shuffle, as shown on the web UI? My code is as below: val transferRDD = sc.textFile("hdfs://host:port/path") val rdd = transferRDD.map(line => { val trunks = line.split("\t")

Re: ISpark class not found

2014-11-12 Thread Laird, Benjamin
Sounds like an ipython notebook issue, not an ISpark one. You might want to reinstall: pip install ipython[notebook], which will grab the necessary notebook components like tornado. Try spinning up the ispark console instead of the notebook to see if the ISpark kernel is functioning: ipython console --profile

Snappy error with Spark SQL

2014-11-12 Thread Naveen Kumar Pokala
Hi, I am facing the following problem when I am trying to save my RDD as a parquet file. 14/11/12 07:43:59 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 48,): org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null

Re: Status of MLLib exporting models to PMML

2014-11-12 Thread Villu Ruusmann
Hi DB, DB Tsai wrote: I also worry that the author of JPMML changed the license of jpmml-evaluator due to the interests of his commercial business, and he might change the license of jpmml-model in the future. I am the principal author of the said Java PMML API projects and I want to

Getting py4j.protocol.Py4JError: An error occurred while calling o39.predict. while doing batch prediction using decision trees

2014-11-12 Thread rprabhu
Hello, I'm trying to run a classification task using mllib decision trees. After successfully training the model, I was trying to test the model using some sample rows when I hit this exception. The code snippet that caused this error is : model = DecisionTree.trainClassifier(parsedData,

join 2 tables

2014-11-12 Thread Franco Barrientos
I have 2 tables in a hive context, and I want to select one field from each table where the ids of the tables are equal. For example, val tmp2 = sqlContext.sql("select a.ult_fecha, b.pri_fecha from fecha_ult_compra_u3m as a, fecha_pri_compra_u3m as b where a.id=b.id") but I get an error:

RE: Snappy error with Spark SQL

2014-11-12 Thread Kapil Malik
Hi, Try adding this in spark-env.sh export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64 export

Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
We are running spark on yarn with combined memory 1TB and when trying to cache a table partition(which is 100G), seeing a lot of failed collect stages in the UI and this never succeeds. Because of the failed collect, it seems like the mapPartitions keep getting resubmitted. We have more than

Re: Question about textFileStream

2014-11-12 Thread Saiph Kappa
What if the window is of 5 seconds, and the file takes longer than 5 seconds to be completely scanned? It will still attempt to load the whole file? On Mon, Nov 10, 2014 at 6:24 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Entire file in a window. On Mon, Nov 10, 2014 at 9:20 AM, Saiph

No module named pyspark - latest built

2014-11-12 Thread jamborta
Hi all, I am trying to run spark with the latest build (from branch-1.2). As far as I can see, all the paths are set and SparkContext starts up OK; however, I cannot run anything that goes to the nodes. I get the following error: Error from python worker: /usr/bin/python2.7: No module named

Re: SVMWithSGD default threshold

2014-11-12 Thread Caron
Sean, Thanks a lot for your reply! A few follow up questions: 1. numIterations should be 100, not 100*trainingSetSize, right? 2. My training set has 90k positive data points (with label 1) and 60k negative data points (with label 0). I set my numIterations to 100 as default. I still got the same

Re: SVMWithSGD default threshold

2014-11-12 Thread Sean Owen
OK, it's not class imbalance. Yes, 100 iterations. My other guess is that the stepSize of 1 is way too big for your data. I'd suggest you look at the weights / intercept of the resulting model to see if it makes any sense. You can call clearThreshold on the model, and then it will 'predict' the
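A sketch of that suggestion (the training parameters are placeholder values): inspect the model's weights and intercept, and call clearThreshold so predict returns raw margins instead of 0/1 labels.

    import org.apache.spark.mllib.classification.SVMWithSGD

    // training and test are RDD[LabeledPoint]; 100 iterations, a smaller step size, some regularization.
    val model = SVMWithSGD.train(training, 100, 0.01, 0.01)
    println(s"weights=${model.weights} intercept=${model.intercept}")

    model.clearThreshold()   // predict() now returns the raw margin
    val scoresAndLabels = test.map(p => (model.predict(p.features), p.label))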

Re: Getting py4j.protocol.Py4JError: An error occurred while calling o39.predict. while doing batch prediction using decision trees

2014-11-12 Thread Davies Liu
This is a bug, will be fixed by https://github.com/apache/spark/pull/3230 On Wed, Nov 12, 2014 at 7:20 AM, rprabhu rpra...@ufl.edu wrote: Hello, I'm trying to run a classification task using mllib decision trees. After successfully training the model, I was trying to test the model using some

Re: pyspark get column family and qualifier names from hbase table

2014-11-12 Thread freedafeng
Hi, This is my code: import org.apache.hadoop.hbase.CellUtil /** * JF: convert a Result object into a string with column family and qualifier names. Something like * 'columnfamily1:columnqualifier1:value1;columnfamily2:columnqualifier2:value2' etc. * k-v pairs are separated by ';'. different

Re: Question about textFileStream

2014-11-12 Thread Rishi Yadav
Yes, you can always specify a minimum number of partitions and that would force some parallelism (assuming you have enough cores). On Wed, Nov 12, 2014 at 9:36 AM, Saiph Kappa saiph.ka...@gmail.com wrote: What if the window is of 5 seconds, and the file takes longer than 5 seconds to be

Re: Is there a way to clone a JavaRDD without persisting it

2014-11-12 Thread Daniel Siegmann
As far as I know you basically have two options: let partitions be recomputed (possibly caching / persisting memory only), or persist to disk (and memory) and suffer the cost of writing to disk. The question is which will be more expensive in your case. My experience is you're better off letting

Re: join 2 tables

2014-11-12 Thread Rishi Yadav
please use join syntax. On Wed, Nov 12, 2014 at 8:57 AM, Franco Barrientos franco.barrien...@exalitica.com wrote: I have 2 tables in a hive context, and I want to select one field from each table where the ids of the tables are equal. For example, val tmp2 = sqlContext.sql("select
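For illustration, the same query with an explicit JOIN ... ON clause (a sketch; the table and column names are copied from the question):

    val tmp2 = sqlContext.sql("""
      SELECT a.ult_fecha, b.pri_fecha
      FROM fecha_ult_compra_u3m a
      JOIN fecha_pri_compra_u3m b ON a.id = b.id
    """)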

using RDD result in another RDD

2014-11-12 Thread Adrian Mocanu
Hi, I'd like to use the result of one RDD1 in another RDD2. Normally I would use something like a barrier to make the 2nd RDD wait till the computation of the 1st RDD is done, then include the result from RDD1 in the closure for RDD2. Currently I create another RDD, RDD3, out of the result of RDD1

Re: pyspark get column family and qualifier names from hbase table

2014-11-12 Thread freedafeng
Hi Nick, I saw the HBase api has experienced lots of changes. If I remember correctly, the default hbase in spark 1.1.0 is 0.94.6. The one I am using is 0.98.1. To get the column family names and qualifier names, we need to call different methods for these two different versions. I don't know how

Re: overloaded method value updateStateByKey ... cannot be applied to ... when Key is a Tuple2

2014-11-12 Thread spr
After comparing with previous code, I got it to work by making the return a Some instead of a Tuple2. Perhaps some day I will understand this. spr wrote --code val updateDnsCount = (values: Seq[(Int, Time)], state: Option[(Int, Time)]) => { val currentCount = if

Wildly varying aggregate performance depending on code location

2014-11-12 Thread Jim Carroll
Hello all, I have a really strange thing going on. I have a test data set with 500K lines in a gzipped csv file. I have an array of column processors, one for each column in the dataset. A Processor tracks aggregate state and has a method process(v: String). I'm calling: val processors:

Reading from Hbase using python

2014-11-12 Thread Alan Prando
Hi all, I'm trying to read an hbase table using this example from github ( https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py), however I have two qualifiers in a column family. Ex.: ROW COLUMN+CELL row1 column=f1:1, timestamp=1401883411986,

RE: overloaded method value updateStateByKey ... cannot be applied to ... when Key is a Tuple2

2014-11-12 Thread Adrian Mocanu
My understanding is that the reason you have an Option is so you could filter out tuples when None is returned. This way your state data won't grow forever. -Original Message- From: spr [mailto:s...@yarcdata.com] Sent: November-12-14 2:25 PM To: u...@spark.incubator.apache.org Subject:

Re: using RDD result in another RDD

2014-11-12 Thread Sean Owen
You can't use RDDs inside of RDDs, so this won't work anyway. You could collect the result of RDD1 and broadcast it, perhaps. collect() blocks. On Wed, Nov 12, 2014 at 6:41 PM, Adrian Mocanu amoc...@verticalscope.com wrote: Hi I’d like to use the result of one RDD1 in another RDD2. Normally
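A minimal sketch of that suggestion (names are hypothetical): collect the small result of the first RDD, broadcast it, and read the broadcast value inside the second RDD's closure.

    val summary = rdd1.collect()            // blocks until rdd1 has been computed
    val summaryBc = sc.broadcast(summary)   // shipped once to each executor

    val rdd2 = otherData.map { x =>
      val s = summaryBc.value               // read-only access on the executors
      (x, s.length)                         // ...combine x with the broadcast data...
    }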

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
This is the log output: 2014-11-12 19:07:16,561 INFO thriftserver.SparkExecuteStatementOperation (Logging.scala:logInfo(59)) - Running query 'CACHE TABLE xyz_cached AS SELECT * FROM xyz where date_prefix = 20141112' 2014-11-12 19:07:17,455 INFO Configuration.deprecation

Building spark targz

2014-11-12 Thread Ashwin Shankar
Hi, I just cloned spark from the github and I'm trying to build to generate a tar ball. I'm doing : mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package Although the build is successful, I don't see the targz generated. Am I running the wrong command ? -- Thanks,

Re: Spark and Play

2014-11-12 Thread Donald Szeto
Hi Akshat, If your application is to serve results directly from a SparkContext, you may want to take a look at http://prediction.io. It integrates Spark with spray.io (another REST/web toolkit by Typesafe). Some heavy lifting is done here:

Re: Building spark targz

2014-11-12 Thread Sadhan Sood
Just making sure but are you looking for the tar in assembly/target dir ? On Wed, Nov 12, 2014 at 3:14 PM, Ashwin Shankar ashwinshanka...@gmail.com wrote: Hi, I just cloned spark from the github and I'm trying to build to generate a tar ball. I'm doing : mvn -Pyarn -Phadoop-2.4

Re: spark streaming: stderr does not roll

2014-11-12 Thread Nguyen, Duc
I've also tried setting the aforementioned properties using System.setProperty() as well as on the command line while submitting the job using --conf key=value. All to no success. When I go to the Spark UI and click on that particular streaming job and then the Environment tab, I can see the

Re: Building spark targz

2014-11-12 Thread Ashwin Shankar
Yes, I'm looking at assembly/target. I don't see the tar ball. I only see scala-2.10/spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar ,classes,test-classes, maven-shared-archive-resources,spark-test-classpath.txt. On Wed, Nov 12, 2014 at 12:16 PM, Sadhan Sood sadhan.s...@gmail.com wrote: Just

Re: Building spark targz

2014-11-12 Thread Sean Owen
mvn package doesn't make tarballs. It creates artifacts that will generally appear in target/ and subdirectories, and likewise within modules. Look at make-distribution.sh On Wed, Nov 12, 2014 at 8:14 PM, Ashwin Shankar ashwinshanka...@gmail.com wrote: Hi, I just cloned spark from the github

Re: Reading from Hbase using python

2014-11-12 Thread Ted Yu
Can you give us a bit more detail: the hbase release you're using, and whether you can reproduce it using the hbase shell. I did the following using the hbase shell against 0.98.4: hbase(main):001:0> create 'test', 'f1' 0 row(s) in 2.9140 seconds => Hbase::Table - test hbase(main):002:0> put 'test', 'row1', 'f1:1',

Re: Building spark targz

2014-11-12 Thread Sadhan Sood
I think you can provide -Pbigtop-dist to build the tar. On Wed, Nov 12, 2014 at 3:21 PM, Sean Owen so...@cloudera.com wrote: mvn package doesn't make tarballs. It creates artifacts that will generally appear in target/ and subdirectories, and likewise within modules. Look at

Re: overloaded method value updateStateByKey ... cannot be applied to ... when Key is a Tuple2

2014-11-12 Thread Yana Kadiyska
Adrian, do you know if this is documented somewhere? I was also under the impression that setting a key's value to None would cause the key to be discarded (without any explicit filtering on the user's part) but can not find any official documentation to that effect On Wed, Nov 12, 2014 at 2:43

Re: Reading from Hbase using python

2014-11-12 Thread Ted Yu
To my knowledge, Spark 1.1 comes with HBase 0.94 To utilize HBase 0.98, you will need: https://issues.apache.org/jira/browse/SPARK-1297 You can apply the patch and build Spark yourself. Cheers On Wed, Nov 12, 2014 at 12:57 PM, Alan Prando a...@scanboo.com.br wrote: Hi Ted! Thanks for

How can my java code executing on a slave find the task id?

2014-11-12 Thread Steve Lewis
I am trying to determine how effective partitioning is at parallelizing my tasks. So far I suspect that all the work is done in one task. My plan is to create a number of accumulators - one for each task - and have functions increment the accumulator for the appropriate task (or slave) the values

Re: Reading from Hbase using python

2014-11-12 Thread Ted Yu
Looking at HBaseResultToStringConverter : override def convert(obj: Any): String = { val result = obj.asInstanceOf[Result] Bytes.toStringBinary(result.value()) } Here is the code for Result.value(): public byte [] value() { if (isEmpty()) { return null; }

ec2 script and SPARK_LOCAL_DIRS not created

2014-11-12 Thread Darin McBeath
I'm using spark 1.1 and the provided ec2 scripts to start my cluster (r3.8xlarge machines). From the spark-shell, I can verify that the environment variables are set: scala> System.getenv("SPARK_LOCAL_DIRS") res0: String = /mnt/spark,/mnt2/spark However, when I look on the workers, the directories

Re: Wildly varying aggregate performance depending on code location

2014-11-12 Thread Jim Carroll
Well it looks like this is a scala problem after all. I loaded the file using pure scala and ran the exact same Processors without Spark and I got 20 seconds (with the code in the same file as the 'main') vs 30 seconds (with the exact same code in a different file) on the 500K rows. -- View

Spark SQL Lazy Schema Evaluation

2014-11-12 Thread Corey Nolet
I'm loading sequence files containing json blobs in the value, transforming them into RDD[String] and then using hiveContext.jsonRDD(). It looks like Spark reads the files twice: once when I define the jsonRDD() and then again when I actually make my call to hiveContext.sql(). Looking @ the

RE: overloaded method value updateStateByKey ... cannot be applied to ... when Key is a Tuple2

2014-11-12 Thread Adrian Mocanu
You are correct; the filtering I’m talking about is done implicitly. You don’t have to do it yourself. Spark will do it for you and remove those entries from the state collection. From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: November-12-14 3:50 PM To: Adrian Mocanu Cc: spr;
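In code form (a sketch, not the original thread's snippet): the update function passed to updateStateByKey returns an Option, so Some keeps or updates a key's state while None drops the key from the state DStream.

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations (Spark 1.x)

    // stream: DStream[(K, Int)]; K can be any key type, including a Tuple2.
    // Note: ssc.checkpoint(...) must be set for stateful operations like this.
    val counts = stream.updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
      val total = state.getOrElse(0) + values.sum
      if (total == 0) None   // returning None removes this key from the state
      else Some(total)       // returning Some(...) keeps it
    }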

Re: MLLIB usage: BLAS dependency warning

2014-11-12 Thread jpl
Hi Xiangrui, thank you very much for your response. I looked for the .so as you suggested. It is not here: $ jar tf assembly/target/spark-assembly_2.10-1.1.0-dist/spark-assembly-1.1.0-hadoop2.4.0.jar | grep netlib-native_system-linux-x86_64.so or here: $ jar tf

Re: Spark SQL Lazy Schema Evaluation

2014-11-12 Thread Michael Armbrust
There are a few things you can do here: - Infer the schema on a subset of the data, pass that inferred schema (schemaRDD.schema) as the second argument of jsonRDD. - Hand construct a schema and pass it as the second argument including the fields you are interested in. - Instead load the data
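A sketch of the first two suggestions (variable names are placeholders): infer the schema once on a small sample, or hand-build one, and pass it as the second argument so the full RDD[String] is only scanned when the query runs.

    // jsonStrings: RDD[String] built from the sequence file values
    val sample = hiveContext.jsonRDD(jsonStrings.sample(false, 0.01, 42L))
    val events = hiveContext.jsonRDD(jsonStrings, sample.schema)   // reuse the inferred schema
    events.registerTempTable("events")
    val counts = hiveContext.sql("SELECT count(*) FROM events")    // single pass happens here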

Re: No module named pyspark - latest built

2014-11-12 Thread jamborta
Forgot to mention that this setup works in spark standalone mode; the only problem is when I run on yarn. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18777.html Sent from the Apache Spark User List mailing list

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
...@gmail.com wrote: This is the log output: 2014-11-12 19:07:16,561 INFO thriftserver.SparkExecuteStatementOperation (Logging.scala:logInfo(59)) - Running query 'CACHE TABLE xyz_cached AS SELECT * FROM xyz where date_prefix = 20141112' 2014-11-12 19:07:17,455 INFO Configuration.deprecation

How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Steve Lewis
JavaSparkContext currentContext = ...; Accumulator<Integer> accumulator = currentContext.accumulator(0, "MyAccumulator"); will create an Accumulator of Integers. For many large data problems Integer is too small and Long is a better type. I see a call like the following

Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Sadhan Sood
We noticed while caching data from our hive tables which contain data in compressed sequence file format that it gets uncompressed in memory when getting cached. Is there a way to turn this off and cache the compressed data as is ?

Re: MLLIB usage: BLAS dependency warning

2014-11-12 Thread Xiangrui Meng
That means the -Pnetlib-lgpl option didn't work. Could you use sbt to build the assembly jar and see whether the .so file is inside the assembly jar? Which system and Java version are you using? -Xiangrui On Wed, Nov 12, 2014 at 2:22 PM, jpl jlefe...@soe.ucsc.edu wrote: Hi Xiangrui, thank you

spark.parallelize seems broken on type List

2014-11-12 Thread mod0
Interesting result here. I'm trying to parallelize a list for some simple tests with spark and Ganglia. It seems that spark.parallelize doesn't create partitions except for on the master node on our cluster. The image below shows the CPU utilization per node over three tests. The first two compute
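One quick check worth doing (a sketch; the numbers are arbitrary): parallelize is lazy and the data sits on the driver until an action runs, so ask for an explicit slice count and force an action before reading the Ganglia graphs.

    val rdd = sc.parallelize(1 to 10000000, 32)   // request 32 partitions explicitly
    println(rdd.partitions.size)                  // confirm how many partitions exist
    println(rdd.map(_.toLong).reduce(_ + _))      // only the action runs on the workers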

Map output statuses exceeds frameSize

2014-11-12 Thread pouryas
Hey all I am doing a groupby on nearly 2TB of data and I am getting this error: 2014-11-13 00:25:30 ERROR org.apache.spark.MapOutputTrackerMasterActor - Map output statuses were 32163619 bytes which exceeds spark.akka.frameSize (10485760 bytes). org.apache.spark.SparkException: Map output
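Not from the thread, but the error names the relevant knob: spark.akka.frameSize is given in MB and defaults to 10 in this era, so a common workaround is to raise it (the value below is only an example); reducing the number of partitions that feed the shuffle also shrinks the status message.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("big-groupBy")
      .set("spark.akka.frameSize", "64")   // MB; must exceed the reported map-status size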

Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
Hi all, I have a Spark streaming job which constantly receives messages from Kafka. I was using Spark 1.0.2 and the job had been running for a month. However, now that I am using Spark 1.1.0, the Spark streaming job cannot receive any messages from Kafka. I have not made any change to the

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Tobias Pfeiffer
Bill, However, now that I am using Spark 1.1.0, the Spark streaming job cannot receive any messages from Kafka. I have not made any change to the code. Do you see any suspicious messages in the log output? Tobias

Re: No module named pyspark - latest built

2014-11-12 Thread jamborta
I have figured out that building the fat jar with sbt does not seem to include the pyspark scripts using the following command: sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean publish-local assembly; however, the maven command works OK: mvn -Pdeb -Pyarn -Phadoop-2.3

Re: Getting py4j.protocol.Py4JError: An error occurred while calling o39.predict. while doing batch prediction using decision trees

2014-11-12 Thread rprabhu
Hey Thanks for responding so fast. I ran the code with the fix and it works great. Regards, Rahul -- View this message in context:

Re: No module named pyspark - latest built

2014-11-12 Thread Tamas Jambor
Thanks. Will it work with sbt at some point? On Thu, 13 Nov 2014 01:03 Xiangrui Meng men...@gmail.com wrote: You need to use maven to include python files. See https://github.com/apache/spark/pull/1223 . -Xiangrui On Wed, Nov 12, 2014 at 4:48 PM, jamborta jambo...@gmail.com wrote: I have

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
Hi all, Thanks for the information. I am running Spark streaming in a yarn cluster and the configuration should be correct. I followed the KafkaWordCount example to write the current code three months ago. It has been working for several months. The messages are in json format. Actually, this code worked

Re: No module named pyspark - latest built

2014-11-12 Thread Xiangrui Meng
You need to use maven to include python files. See https://github.com/apache/spark/pull/1223 . -Xiangrui On Wed, Nov 12, 2014 at 4:48 PM, jamborta jambo...@gmail.com wrote: I have figured out that building the fat jar with sbt does not seem to include the pyspark scripts using the following

Cannot submit Spark app to cluster, stuck on “UNDEFINED”

2014-11-12 Thread brother rain
I use this command to submit a spark application to the yarn cluster: export YARN_CONF_DIR=conf bin/spark-submit --class Mining --master yarn-cluster --executor-memory 512m ./target/scala-2.10/mining-assembly-0.1.jar In the Web UI, it is stuck on UNDEFINED. In

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
Adding a call to rdd.repartition() after randomizing the keys has no effect either. code - //partitioning is done like partitionIdx = f(key) % numPartitions //we use random keys to get even partitioning val uniform = other_stream.transform(rdd => { rdd.map({ kv => val k

Re: No module named pyspark - latest built

2014-11-12 Thread Andrew Or
Hey Jamborta, What java version did you build the jar with? 2014-11-12 16:48 GMT-08:00 jamborta jambo...@gmail.com: I have figured out that building the fat jar with sbt does not seem to include the pyspark scripts using the following command: sbt/sbt -Pdeb -Pyarn -Phadoop-2.3

flatMap followed by mapPartitions

2014-11-12 Thread Debasish Das
Hi, I am doing a flatMap followed by mapPartitions to do some blocked operation. flatMap is shuffling data, but this shuffle is strictly shuffling to disk and not over the network, right? Thanks. Deb

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
I have made some progress - the partitioning is very uneven, and everything goes to one partition. I see that spark partitions by key, so I tried this: //partitioning is done like partitionIdx = f(key) % numPartitions //we use random keys to get even partitioning val uniform =

Using data in RDD to specify HDFS directory to write to

2014-11-12 Thread jschindler
I am trying to figure out how to solve a problem. I would like to stream events from Kafka to my Spark Streaming app and write the contents of each RDD out to an HDFS directory. Each event that comes into the app via kafka will be JSON and have an event field with the name of the
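One common way to do this (a hedged sketch; extractEventName and the paths are hypothetical) is to key each record by its event name inside foreachRDD and write each group to its own directory. A custom MultipleTextOutputFormat with saveAsHadoopFile avoids the one-pass-per-event loop, at the cost of more code.

    dstream.foreachRDD { (rdd, time) =>
      val byEvent = rdd.map(json => (extractEventName(json), json)).cache()
      byEvent.keys.distinct().collect().foreach { event =>
        byEvent.filter(_._1 == event).values
          .saveAsTextFile(s"hdfs:///events/$event/${time.milliseconds}")
      }
      byEvent.unpersist()
    }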

Re: overloaded method value updateStateByKey ... cannot be applied to ... when Key is a Tuple2

2014-11-12 Thread Steve Reinhardt
I'm missing something simpler (I think). That is, why do I need a Some instead of Tuple2? Because a Some might or might not be there, but a Tuple2 must be there? Or something like that? From: Adrian Mocanu amoc...@verticalscope.com You are correct; the

RE: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Shao, Saisai
Did you configure the Spark master as local? It should be local[n], with n > 1, for local mode. Besides, there's a Kafka wordcount example in the Spark Streaming examples; you can try that. I've tested with the latest master, it's OK. Thanks Jerry From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Thursday,

Re: Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Cheng Lian
Currently there’s no way to cache the compressed sequence file directly. Spark SQL uses in-memory columnar format while caching table rows, so we must read all the raw data and convert them into columnar format. However, you can enable in-memory columnar compression by setting

Assigning input files to spark partitions

2014-11-12 Thread Pala M Muthaia
Hi, I have a set of input files for a spark program, with each file corresponding to a logical data partition. What is the API/mechanism to assign each input file (or a set of files) to a spark partition, when initializing RDDs? When I create a spark RDD pointing to the directory of files, my

Re: MLLIB usage: BLAS dependency warning

2014-11-12 Thread jpl
Thanks! I used sbt (command below) and the .so file is now there (shown below). Now that I have this new assembly.jar, how do I run the spark-shell so that it can see the .so file when I call the kmeans function? Thanks again for your help with this. sbt/sbt -Dhadoop.version=2.4.0 -Pyarn

Re: Pyspark Error when broadcast numpy array

2014-11-12 Thread bliuab
Dear Liu: I have tested this issue under Spark 1.1.0. The problem is solved under this newer version. On Wed, Nov 12, 2014 at 3:18 PM, Bo Liu bli...@cse.ust.hk wrote: Dear Liu: Thank you for your reply. I will set up an experimental environment for spark-1.1 and test it. On Wed, Nov 12,

Re: How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Sean Owen
It's the exact same API you've already found, and it's documented: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.AccumulatorParam JavaSparkContext has helper methods for int and double but not long. You can just make your own little implementation of

Re: data locality, task distribution

2014-11-12 Thread Aaron Davidson
The fact that the caching percentage went down is highly suspicious. It should generally not decrease unless other cached data took its place, or if unless executors were dying. Do you know if either of these were the case? On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld

Re: flatMap followed by mapPartitions

2014-11-12 Thread Mayur Rustagi
flatMap would have to shuffle data only if the output RDD is expected to be partitioned by some key. RDD[X].flatMap(X => RDD[Y]) If it has to shuffle, it should be local. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Nov 13,

Re: How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Steve Lewis
I see Javadoc-style documentation but nothing that looks like a code sample. I tried the following before asking: public static class LongAccumulableParam implements AccumulableParam<Long,Long>,Serializable { @Override public Long addAccumulator(final Long r, final Long t) {

Re: How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Sean Owen
Look again, the type is AccumulatorParam, not AccumulableParam. But yes, that's what you do. On Thu, Nov 13, 2014 at 4:32 AM, Steve Lewis lordjoe2...@gmail.com wrote: I see Javadoc-style documentation but nothing that looks like a code sample. I tried the following before asking: public
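For reference, the same idea sketched in Scala (Scala ships an implicit AccumulatorParam for Long, so sc.accumulator(0L) already works there; the Java version implements the same two methods on AccumulatorParam<Long>):

    import org.apache.spark.AccumulatorParam

    object LongAccumulatorParam extends AccumulatorParam[Long] {
      def zero(initialValue: Long): Long = 0L
      def addInPlace(r1: Long, r2: Long): Long = r1 + r2
    }

    val counter = sc.accumulator(0L)(LongAccumulatorParam)
    sc.parallelize(1 to 100).foreach(_ => counter += 1L)
    println(counter.value)   // 100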

Re: data locality, task distribution

2014-11-12 Thread Nathan Kronenfeld
Sorry, I think I was not clear in what I meant. I didn't mean it went down within a run, with the same instance. I meant I'd run the whole app, and one time it would cache 100%, and the next run it might cache only 83%. Within a run, it doesn't change. On Wed, Nov 12, 2014 at 11:31 PM, Aaron

Query from two or more tables in Spark SQL. I have done this. Is there any simpler solution?

2014-11-12 Thread akshayhazari
As of now my approach is to fetch all data from tables located in different databases into separate RDDs, then make a union of them and query them together. I want to know whether I can perform a query on them directly along with creating an RDD, i.e. instead of creating two RDDs, firing

Can spark read and write to cassandra without HDFS?

2014-11-12 Thread Kevin Burton
We have all our data in Cassandra so I’d prefer to not have to bring up Hadoop/HDFS as that’s just another thing that can break. But I’m reading that spark requires a shared filesystem like HDFS or S3… Can I use Tachyon or this or something simple for a shared filesystem? -- Founder/CEO

Re: Can spark read and write to cassandra without HDFS?

2014-11-12 Thread Harold Nguyen
Hi Kevin, Yes, Spark can read and write to Cassandra without Hadoop. Have you seen this: https://github.com/datastax/spark-cassandra-connector Harold On Wed, Nov 12, 2014 at 9:28 PM, Kevin Burton bur...@spinn3r.com wrote: We have all our data in Cassandra so I’d prefer to not have to bring
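A tiny sketch of the connector in use (keyspace, table and column names are placeholders, and the connector jar plus spark.cassandra.connection.host must already be configured):

    import com.datastax.spark.connector._

    val rows = sc.cassandraTable("my_keyspace", "my_table")   // read straight from Cassandra
    println(rows.count())

    // write back without HDFS in the middle
    sc.parallelize(Seq((1, "a"), (2, "b")))
      .saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "value"))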

Re: RDD to DStream

2014-11-12 Thread Jianshi Huang
I also discussed this with Liancheng two weeks ago. He suggested using toLocalIterator to collect partitions of the RDD to the driver (in the same order if the RDD is sorted), and then turning each partition into an RDD and putting them in the queue. So: To turn RDD[(timestamp, value)] to DStream 1) Group by

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Jay Vyas
Yup, very important that n > 1 for spark streaming jobs. If local, use local[2]. The thing to remember is that your spark receiver will take a thread to itself and produce data, so you need another thread to consume it. In a cluster manager like yarn or mesos, the word thread is not used
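A minimal sketch of that point (the zookeeper host, group and topic are placeholders): with a single local core the receiver occupies it and nothing is left to process the batches, so local mode needs at least local[2].

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaCheck")  // not local[1]
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = KafkaUtils.createStream(ssc, "zk-host:2181", "check-group", Map("my-topic" -> 1))
      .map(_._2)
    lines.count().print()   // should print non-zero counts if messages arrive
    ssc.start()
    ssc.awaitTermination()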

Re: Can spark read and write to cassandra without HDFS?

2014-11-12 Thread Kevin Burton
Yes. That’s what I was planning on using actually. I was just curious whether intermediate data had to be kept in HDFS but this answers my question. thanks. On Wed, Nov 12, 2014 at 9:33 PM, Harold Nguyen har...@nexgate.com wrote: Hi Kevin, Yes, Spark can read and write to Cassandra without

Re: data locality, task distribution

2014-11-12 Thread Aaron Davidson
Spark's scheduling is pretty simple: it will allocate tasks to open cores on executors, preferring ones where the data is local. It even performs delay scheduling, which means waiting a bit to see if an executor where the data resides locally becomes available. Are your tasks seeing very skewed

Joined RDD

2014-11-12 Thread ajay garg
Hi, I have two RDDs A and B which are created from reading files from HDFS. I have a third RDD C which is created by taking the join of A and B. All three RDDs (A, B and C) are not cached. Now if I perform any action on C (let's say collect), the action is served without reading any data from disk.

Re: MLLIB usage: BLAS dependency warning

2014-11-12 Thread jpl
Hi Xiangrui, All is well. Got it working now, I just recompiled with sbt with the additional package flag and that created all the /bin files. Then when I start spark-shell, the webUI environment show the assembly jar is in spark's classpath entries and now the kmeans function finds it -- no
