Re: sbt/sbt run command returns a JVM problem

2014-09-23 Thread christy
thanks very much, seems working...

Re: Why recommend 2-3 tasks per CPU core ?

2014-09-23 Thread Nicholas Chammas
On Tue, Sep 23, 2014 at 1:58 AM, myasuka myas...@live.com wrote: Thus I want to know why recommend 2-3 tasks per CPU core? You want at least 1 task per core so that you fully utilize the cluster's parallelism. You want 2-3 tasks per core so that tasks are a bit smaller than they would
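A minimal sketch of how this plays out in practice, assuming a hypothetical 40-core cluster (10 executors x 4 cores); the partition counts and paths below are illustrative, not prescriptive:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._

  // Target roughly 2-3 tasks per core: 40 cores -> ~80-120 partitions.
  val conf = new SparkConf()
    .setAppName("parallelism-sketch")
    .set("spark.default.parallelism", "120")  // picked up by reduceByKey, join, etc.
  val sc = new SparkContext(conf)

  sc.textFile("hdfs:///input/path")           // placeholder path
    .repartition(120)                         // or pass the partition count explicitly
    .map(line => (line, 1))
    .reduceByKey(_ + _, 120)
    .saveAsTextFile("hdfs:///output/path")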

Where can I find the module diagram of SPARK?

2014-09-23 Thread Theodore Si
Hi, Please help me with that. BR, Theodore Si

Access resources from jar-local resources folder

2014-09-23 Thread Roberto Coluccio
Hello folks, I have a Spark Streaming application built with Maven (as jar) and deployed with the spark-submit script. The application project has the following (main) structure: myApp src main scala com.mycompany.package MyApp.scala DoSomething.scala ... resources aPerlScript.pl ...

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread jatinpreet
Xiangrui, Yes, the total number of terms is 43839. I have also tried running it using different values of parallelism ranging from 1/core to 10/core. I also used multiple configurations like setting spark.storage.memoryFraction and spark.shuffle.memoryFraction to default values. The point to note

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread jatinpreet
I get the following stacktrace if it is of any help. 14/09/23 15:46:02 INFO scheduler.DAGScheduler: failed: Set() 14/09/23 15:46:02 INFO scheduler.DAGScheduler: Missing parents for Stage 7: List() 14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting Stage 7 (MapPartitionsRDD[24] at

spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Priya Ch
Hi, I am using Spark 1.0.0. In my Spark code I'm trying to persist an RDD to disk as rdd.persist(DISK_ONLY). But unfortunately I couldn't find the location where the RDD has been written to disk. I specified SPARK_LOCAL_DIRS and SPARK_WORKER_DIR to some other location rather than using the default

Re: NullPointerException on reading checkpoint files

2014-09-23 Thread RodrigoB
Hi TD, This is actually an important requirement (recovery of shared variables) for us as we need to spread some referential data across the Spark nodes on application startup. I just bumped into this issue on Spark version 1.0.1. I assume the latest one also doesn't include this capability. Are

Re: Recommended ways to pass functions

2014-09-23 Thread Yanbo Liang
Both of these kinds of functions are OK, but you need to make your class extend Serializable. However, none of these ways of passing functions can avoid shipping the data they reference. If you define a function that does not use member fields of a class or object, you can define it as a val. For

RE: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Shao, Saisai
Hi, spark.local.dir is the directory used to write map output data and persisted RDD blocks, but the file paths are hashed, so you cannot directly find the persisted RDD block files; they will definitely be in those folders on your worker nodes. Thanks Jerry From: Priya Ch

Re: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Chitturi Padma
Is it possible to view the persisted RDD blocks? If I use YARN, would the RDD blocks be persisted to HDFS, and would I then be able to read the HDFS blocks as I can in Hadoop? On Tue, Sep 23, 2014 at 5:56 PM, Shao, Saisai [via Apache Spark User List] ml-node+s1001560n14885...@n3.nabble.com wrote:

Re: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Chitturi Padma
I couldn't even see the spark-id folder in the default /tmp directory of spark.local.dir. On Tue, Sep 23, 2014 at 6:01 PM, Priya Ch learnings.chitt...@gmail.com wrote: Is it possible to view the persisted RDD blocks ? If I use YARN, RDD blocks would be persisted to hdfs then will i be able

Error launching spark application from Windows to Linux YARN Cluster - Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2014-09-23 Thread dxrodri
I am trying to submit a simple SparkPi application from a windows machine which has spark 1.0.2 to a hadoop 2.3.0 cluster running on Linux. SparkPi application can be launched and executed successfully when running on the Linux machine, however, I get the following error when I launch from

RE: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Shao, Saisai
This folder will be created when you start your Spark application under your spark.local.dir, with the name “spark-local-xxx” as a prefix. It's quite strange that you don't see this folder; maybe you missed something. Besides, if Spark cannot create this folder on start, persisting an RDD to disk will be
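A small sketch of the behavior described above, with an assumed directory path; the persisted blocks end up hashed under a spark-local-* folder inside spark.local.dir rather than at a path of your choosing:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  val conf = new SparkConf()
    .setAppName("local-dir-sketch")
    .set("spark.local.dir", "/data/spark/tmp")   // must exist on every worker

  val sc = new SparkContext(conf)
  val rdd = sc.parallelize(1 to 1000000)
  rdd.persist(StorageLevel.DISK_ONLY)
  rdd.count()   // forces materialization; block files appear under /data/spark/tmp/spark-local-*/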

TorrentBroadcast causes java.io.IOException: unexpected exception type

2014-09-23 Thread Arun Ahuja
Since upgrading to Spark 1.1 we have been seeing the following error in the logs: 14/09/23 02:14:42 ERROR executor.Executor: Exception in task 1087.0 in stage 0.0 (TID 607) java.io.IOException: unexpected exception type at

Re: Distributed dictionary building

2014-09-23 Thread Nan Zhu
great, thanks -- Nan Zhu On Tuesday, September 23, 2014 at 9:58 AM, Sean Owen wrote: Yes, Matei made a JIRA last week and I just suggested a PR: https://github.com/apache/spark/pull/2508 On Sep 23, 2014 2:55 PM, Nan Zhu zhunanmcg...@gmail.com (mailto:zhunanmcg...@gmail.com) wrote:

Re: Distributed dictionary building

2014-09-23 Thread Sean Owen
Yes, Matei made a JIRA last week and I just suggested a PR: https://github.com/apache/spark/pull/2508 On Sep 23, 2014 2:55 PM, Nan Zhu zhunanmcg...@gmail.com wrote: shall we document this in the API doc? Best, -- Nan Zhu On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote:

Re: Distributed dictionary building

2014-09-23 Thread Nan Zhu
shall we document this in the API doc? Best, -- Nan Zhu On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote: zipWithUniqueId is also affected... I had to persist the dictionaries to make use of the indices lower down in the flow... On Sun, Sep 21, 2014 at 1:15 AM, Sean
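A minimal sketch of the persist-before-use pattern mentioned in this thread, with placeholder input; caching the zipped RDD keeps the assigned ids stable if the lineage is re-evaluated downstream:

  import org.apache.spark.SparkContext._

  val terms = sc.textFile("hdfs:///dictionary/terms.txt")      // placeholder input
  val dictionary = terms.distinct().zipWithUniqueId().cache()  // pin the assigned ids
  dictionary.count()                                           // materialize once

  val termToId = dictionary.collectAsMap()                     // ids are now safe to reuse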

Re: clarification for some spark on yarn configuration options

2014-09-23 Thread Greg Hill
Thanks for looking into it. I'm trying to avoid making the user pass in any parameters by configuring it to use the right values for the cluster size by default, hence my reliance on the configuration. I'd rather just use spark-defaults.conf than the environment variables, and looking at the

Spark 1.1.0 on EC2

2014-09-23 Thread Gilberto Lira
Hi, what is the best way to use version 1.1.0 of Spark on EC2? Att, Giba

recommended values for spark driver memory?

2014-09-23 Thread Greg Hill
I know the recommendation is it depends, but can people share what sort of memory allocations they're using for their driver processes? I'd like to get an idea of what the range looks like so we can provide sensible defaults without necessarily knowing what the jobs will look like. The

MLlib, what online(streaming) algorithms are available?

2014-09-23 Thread aka.fe2s
Hi, I'm looking for available online ML algorithms (that improve model with new streaming data). The only one I found is linear regression. Is there anything else implemented as part of MLlib? Thanks, Oleksiy.

access javaobject in rdd map

2014-09-23 Thread jamborta
Hi all, I have a Java object that contains an ML model which I would like to use for prediction (in Python). I just want to iterate the data through a mapper and predict for each value. Unfortunately, this fails when it tries to serialise the object to send it to the nodes. Is there a trick

Re: Setup an huge Unserializable Object in a mapper

2014-09-23 Thread matthes
Thank you for the answer and sorry for the double question, but now it works! I have one additional question: is it possible to use a broadcast variable in this object? At the moment I try it in the way below, but the broadcast object is still null. object lookupObject { private var treeFile :

Re: java.lang.ClassNotFoundException on driver class in executor

2014-09-23 Thread Barrington Henry
Hi Andrew, Thanks for the prompt response. I tried command line and it works fine. But, I want to try from IDE for easier debugging and transparency into code execution. I would try and see if there is any way to get the jar over to the executor from within the IDE. - Barrington On Sep 21,

Multiple Kafka Receivers and Union

2014-09-23 Thread Matt Narrell
Hey, Spark 1.1.0 Kafka 0.8.1.1 Hadoop (YARN/HDFS) 2.5.1 I have a five-partition Kafka topic. I can create a single Kafka receiver via KafkaUtils.createStream with five threads in the topic map and consume messages fine. Sifting through the user list and Google, I see that it's possible to

SparkSQL: Freezing while running TPC-H query 5

2014-09-23 Thread Samay
Hi, I am trying to run TPC-H queries with SparkSQL 1.1.0 CLI with 1 r3.4xlarge master + 20 r3.4xlarge slave machines on EC2 (each machine has 16vCPUs, 122GB memory). The TPC-H scale factor I am using is 1000 (i.e. 1000GB of total data). When I try to run TPC-H query 5, the query hangs for a

Re: How to initialize updateStateByKey operation

2014-09-23 Thread Soumitra Kumar
I thought I did a good job ;-) OK, so what is the best way to initialize the updateStateByKey operation? I have counts from a previous spark-submit, and want to load them in the next spark-submit job. - Original Message - From: Soumitra Kumar kumar.soumi...@gmail.com To: spark users
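One possible workaround, sketched under the assumption that the previous run saved tab-separated "word<TAB>count" lines and that ssc (a StreamingContext) and words (a DStream[String]) already exist; this is not an official initial-state API, just the iterator-based updateStateByKey overload plus a broadcast seed:

  import org.apache.spark.HashPartitioner
  import org.apache.spark.streaming.StreamingContext._

  val previous: Map[String, Long] = ssc.sparkContext
    .textFile("hdfs:///app/last-run-counts")                   // placeholder path
    .map { line => val Array(k, v) = line.split("\t"); (k, v.toLong) }
    .collect().toMap
  val initial = ssc.sparkContext.broadcast(previous)

  // The Iterator-based overload exposes the key, so the broadcast map can seed the state.
  val updateFunc = (iter: Iterator[(String, Seq[Long], Option[Long])]) =>
    iter.map { case (key, values, state) =>
      val seed = state.getOrElse(initial.value.getOrElse(key, 0L))
      (key, seed + values.sum)
    }

  val counts = words.map(w => (w, 1L)).updateStateByKey[Long](
    updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)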

Re: access javaobject in rdd map

2014-09-23 Thread Davies Liu
You should create a pure Python object (copy the attributes from Java object), then it could be used in map. Davies On Tue, Sep 23, 2014 at 8:48 AM, jamborta jambo...@gmail.com wrote: Hi all, I have a java object that contains a ML model which I would like to use for prediction (in python).

Spark 1.1.0 hbase_inputformat.py not work

2014-09-23 Thread Gilberto Lira
Hi, I'm trying to run the hbase_inputformat.py example but it's not working. This is the error: Traceback (most recent call last): File /root/spark/examples/src/main/python/hbase_inputformat.py, line 70, in module conf=conf) File /root/spark/python/pyspark/context.py, line 471, in

Re: access javaobject in rdd map

2014-09-23 Thread Tamas Jambor
Hi Davies, Thanks for the reply. I saw that you guys do it that way in the code. Is there no other way? I have implemented all the predict functions in Scala, so I prefer not to reimplement the whole thing in Python. thanks, On Tue, Sep 23, 2014 at 5:40 PM, Davies Liu dav...@databricks.com

Re: Fails to run simple Spark (Hello World) scala program

2014-09-23 Thread Moshe Beeri
Sure, in local mode it works for me as well; the issue is that I ran the master only, and I needed a worker as well. Many thanks, Moshe Beeri. 054-3133943 Email moshe.be...@gmail.com | linkedin http://www.linkedin.com/in/mobee On Mon, Sep 22, 2014 at 9:58 AM, Akhil Das-2 [via Apache Spark User List]

Re: access javaobject in rdd map

2014-09-23 Thread Davies Liu
Right now, there is no way to access the JVM in a Python worker. In order to make this happen, we need to: 1. set up py4j in the Python worker 2. serialize the JVM objects and transfer them to executors 3. link the JVM objects and py4j together to get an interface Before this happens, maybe you could try to

Re: Exception with SparkSql and Avro

2014-09-23 Thread Michael Armbrust
Can you show me the DDL you are using? Here is an example of a way I got the avro serde to work: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L246 Also, this isn't ready for primetime yet, but a quick plug for some ongoing work:

Re: Spark SQL CLI

2014-09-23 Thread Michael Armbrust
You can't directly query JSON tables from the CLI or JDBC server since temporary tables only live for the life of the Spark Context. This PR will eventually (targeted for 1.2) let you do what you want in pure SQL: https://github.com/apache/spark/pull/2475 On Mon, Sep 22, 2014 at 4:52 PM, Yin

Re: Spark SQL CLI

2014-09-23 Thread Michael Armbrust
A workaround for now would be to save the JSON as parquet and the create a metastore parquet table. Using parquet will be much faster for repeated querying. This function might be helpful: import org.apache.spark.sql.hive.HiveMetastoreTypes def createParquetTable(name: String, file: String,
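A rough sketch of that workaround with Spark 1.1 APIs and placeholder paths; the truncated helper above would additionally issue the metastore DDL so the Parquet table survives across sessions:

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)

  // One-off conversion: read the JSON, write it out as Parquet.
  val events = hiveContext.jsonFile("hdfs:///data/events.json")
  events.saveAsParquetFile("hdfs:///data/events.parquet")

  // Repeated querying then goes against the Parquet files.
  val parquetEvents = hiveContext.parquetFile("hdfs:///data/events.parquet")
  parquetEvents.registerTempTable("events")
  hiveContext.sql("SELECT COUNT(*) FROM events").collect()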

Re: Spark 1.1.0 hbase_inputformat.py not work

2014-09-23 Thread freedafeng
I don't know if it's relevant, but I had to compile spark for my specific hbase and hadoop version to make that hbase_inputformat.py work.

Re: spark time out

2014-09-23 Thread Chen Song
I am running the job on 500 executors, each with 8G and 1 core. I see lots of fetch failures in the reduce stage when running a simple reduceByKey (map tasks: 4000, reduce tasks: 200). On Mon, Sep 22, 2014 at 12:22 PM, Chen Song chen.song...@gmail.com wrote: I am using Spark 1.1.0 and have seen a

Transient association error on a 3 nodes cluster

2014-09-23 Thread Edwin
I'm running my application on a three-node cluster (8 cores each, 12 GB memory each), and I receive the following actor error; does anyone have any idea? 14:31:18,061 ERROR [akka.remote.EndpointWriter] (spark-akka.actor.default-dispatcher-17) Transient association error (association remains live):

Re: MLlib, what online(streaming) algorithms are available?

2014-09-23 Thread Liquan Pei
Hi Oleksiy, Right now, only streaming linear regression is available in MLlib. There is work in progress on Streaming K-means and Streaming SVM. Please take a look at the following JIRAs for more information. Streaming K-means https://issues.apache.org/jira/browse/SPARK-3254 Streaming SVM

Re: Setup an huge Unserializable Object in a mapper

2014-09-23 Thread matthes
I solved it :) I moved the lookupObject into the function where I create the broadcast and now all works very well! object lookupObject { private var treeFile : org.apache.spark.broadcast.Broadcast[String] = _ def main(args: Array[String]): Unit = { … val treeFile = sc.broadcast(args(0))
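A cleaned-up sketch of the pattern that works, with hypothetical names: create the broadcast on the driver and let the closure capture the local handle, rather than reading a singleton var that would still be uninitialized on the executors:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.broadcast.Broadcast

  object LookupApp {
    def containsKey(tree: Broadcast[String])(key: String): Boolean =
      tree.value.contains(key)

    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("broadcast-lookup"))
      val treeFile: Broadcast[String] = sc.broadcast(args(0))   // created on the driver

      val hits = sc.parallelize(Seq("a", "b", "c"))
        .filter(containsKey(treeFile))                          // the handle travels in the closure
        .collect()

      println(hits.mkString(", "))
      sc.stop()
    }
  }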

Re: How to initialize updateStateByKey operation

2014-09-23 Thread Tathagata Das
At a high level, the suggestion sounds good to me. However, regarding code, it's best to submit a Pull Request on the Spark GitHub page for community review. You will find more information here: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark On Tue, Sep 23, 2014 at 10:11 PM,

Re: java.lang.NegativeArraySizeException in pyspark

2014-09-23 Thread Davies Liu
Or maybe there is a bug related to the base64 in py4j; could you dump the serialized bytes of the closure to verify this? You could add a line in spark/python/pyspark/rdd.py: ser = CloudPickleSerializer() pickled_command = ser.dumps(command) + print len(pickled_command),

Re: spark1.0 principal component analysis

2014-09-23 Thread st553
sowen wrote: it seems that the singular values from the SVD aren't returned, so I don't know that you can access this directly. It's not clear to me why these aren't returned? The S matrix would be useful to determine a reasonable value for K.

Re: access javaobject in rdd map

2014-09-23 Thread jamborta
Great. Thanks a lot. On 23 Sep 2014 18:44, Davies Liu-2 [via Apache Spark User List] ml-node+s1001560n14908...@n3.nabble.com wrote: Right now, there is no way to access JVM in Python worker, in order to make this happen, we need to do: 1. setup py4j in Python worker 2. serialize the JVM

Re: spark1.0 principal component analysis

2014-09-23 Thread Evan R. Sparks
In its current implementation, the principal components are computed in MLlib in two steps: 1) In a distributed fashion, compute the covariance matrix - the result is a local matrix. 2) On this local matrix, compute the SVD. The sorting comes from the SVD. If you want to get the eigenvalues out,
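A small sketch against MLlib's RowMatrix, using toy data: computeSVD returns the singular values (svd.s) that computePrincipalComponents does not expose, which is what the original question was after:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0, 3.0),
    Vectors.dense(4.0, 5.0, 6.0),
    Vectors.dense(7.0, 8.0, 10.0)))
  val mat = new RowMatrix(rows)

  val svd = mat.computeSVD(2, computeU = false)   // singular values of the data matrix
  println(svd.s)                                  // inspect these to pick a reasonable k
  val pcs = mat.computePrincipalComponents(2)     // the components themselves, no values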

Re: Java Implementation of StreamingContext.fileStream

2014-09-23 Thread Michael Quinlan
Thanks very much for the pointer, which validated my initial approach. It turns out that I was creating a tag for the abstract class InputFormat.class. Using TextInputFormat.class instead fixed my issue. Regards, Mike

Spark SQL 1.1.0 - large insert into parquet runs out of memory

2014-09-23 Thread Dan Dietterich
I am trying to load data from csv format into parquet using Spark SQL. It consistently runs out of memory. The environment is: * standalone cluster using HDFS and Hive metastore from HDP2.0 * spark1.1.0 * parquet jar files (v1.5) explicitly added when starting spark-sql.

Re: Spark SQL 1.1.0 - large insert into parquet runs out of memory

2014-09-23 Thread Michael Armbrust
I would hope that things should work for this kind of workflow. I'm curious if you have tried using saveAsParquetFile instead of inserting directly into a Hive table (you could still register this as an external table afterwards). Right now inserting into Hive tables goes through their

Re: Multiple Kafka Receivers and Union

2014-09-23 Thread Tim Smith
Posting your code would be really helpful in figuring out gotchas. On Tue, Sep 23, 2014 at 9:19 AM, Matt Narrell matt.narr...@gmail.com wrote: Hey, Spark 1.1.0 Kafka 0.8.1.1 Hadoop (YARN/HDFS) 2.5.1 I have a five partition Kafka topic. I can create a single Kafka receiver via

Re: NullPointerException on reading checkpoint files

2014-09-23 Thread Tathagata Das
This is actually very tricky, as there are two pretty big challenges that need to be solved. (i) Checkpointing for broadcast variables: Unlike RDDs, broadcast variables don't have checkpointing support (that is, you cannot write the content of a broadcast variable to HDFS and recover it automatically

Re: Spark SQL 1.1.0 - large insert into parquet runs out of memory

2014-09-23 Thread Dan Dietterich
I have only been using Spark through the SQL front-end (CLI or JDBC). I don't think I have access to saveAsParquetFile from there, do I?

HdfsWordCount only counts some of the words

2014-09-23 Thread SK
Hi, I tried out the HdfsWordCount program in the Streaming module on a cluster. Based on the output, I find that it counts only a few of the words. How can I have it count all the words in the text? I have only one text file in the directory. thanks

General question on persist

2014-09-23 Thread Arun Ahuja
I have a general question on when persisting will be beneficial and when it won't: I have a task that runs as follows: keyedRecordPieces = records.flatMap(record => Seq(key, recordPieces)) partitioned = keyedRecordPieces.partitionBy(KeyPartitioner) partitioned.mapPartitions(doComputation).save()

Re: RDD data checkpoint cleaning

2014-09-23 Thread Tathagata Das
I am not sure what you mean by the data checkpoint continuously increasing, leading to the recovery process taking time. Do you mean that in HDFS you are seeing RDD checkpoint files being continuously written but never being deleted? On Tue, Sep 23, 2014 at 2:40 AM, RodrigoB rodrigo.boav...@aspect.com

Re: General question on persist

2014-09-23 Thread Liquan Pei
Hi Arun, The intermediate results like keyedRecordPieces will not be materialized. This means that if you run partitioned = keyedRecordPieces.partitionBy(KeyPartitioner) partitioned.mapPartitions(doComputation).save() again, the keyedRecordPieces will be re-computed. In this case, cache or
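A short sketch of that advice, reusing the hypothetical names from the thread (records, pieces, keyPartitioner, doComputation, doOtherComputation are assumed); persisting only pays off when the partitioned RDD is consumed more than once or its lineage is expensive to recompute:

  import org.apache.spark.SparkContext._
  import org.apache.spark.storage.StorageLevel

  val keyedRecordPieces = records.flatMap(record => pieces(record))   // assumed helpers
  val partitioned = keyedRecordPieces
    .partitionBy(keyPartitioner)
    .persist(StorageLevel.MEMORY_AND_DISK)   // drop this line if the result is used exactly once

  partitioned.mapPartitions(doComputation).saveAsTextFile("hdfs:///out/run1")
  partitioned.mapPartitions(doOtherComputation).saveAsTextFile("hdfs:///out/run2") // reuses cached data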

Spark Code to read RCFiles

2014-09-23 Thread Pramod Biligiri
Hi, I'm trying to read some data in RCFiles using Spark, but can't seem to find a suitable example anywhere. Currently I've written the following bit of code that lets me count() the no. of records, but when I try to do a collect() or a map(), it fails with a ConcurrentModificationException. I'm

Re: RDD data checkpoint cleaning

2014-09-23 Thread RodrigoB
Hi TD, thanks for getting back on this. Yes, that's what I was experiencing - data checkpoints were being recovered from considerable time before the last data checkpoint, probably since the beginning of the first writes; I would have to confirm. I have some development on this though. These results

Sorting a table in Spark

2014-09-23 Thread Areg Baghdasaryan (BLOOMBERG/ 731 LEX -)
Hello,

Sorting a Table in Spark RDD

2014-09-23 Thread Areg Baghdasaryan (BLOOMBERG/ 731 LEX -)
Hello, So I have created a table in an RDD in Spark in this format: col1 col2 --- 1. 10 11 2. 12 8 3. 9 13 4. 2 3 And the RDD is distributed by the rows (rows 1, 2 on one node and rows 3, 4 on another). I want to sort each column of the table so

Re: General question on persist

2014-09-23 Thread Arun Ahuja
Thanks Liquan, that makes sense, but if I am only doing the computation once, there will essentially be no difference, correct? I had a second question related to mapPartitions: 1) All of the records of the Iterator[T] that a single function call in mapPartitions processes must fit into memory,

Re: RDD data checkpoint cleaning

2014-09-23 Thread RodrigoB
Just a follow-up. Just to make sure about the RDDs not being cleaned up, I just replayed the app both on the windows remote laptop and then on the linux machine and at the same time was observing the RDD folders in HDFS. Confirming the observed behavior: running on the laptop I could see the

Re: clarification for some spark on yarn configuration options

2014-09-23 Thread Andrew Or
Yes... good find. I have filed a JIRA here: https://issues.apache.org/jira/browse/SPARK-3661 and will get to fixing it shortly. Both of these fixes will be available in 1.1.1. Until both of these are merged in, it appears that the only way you can do it now is through --driver-memory. -Andrew

Running Spark from an Intellij worksheet - akka.version error

2014-09-23 Thread adrian
I have an SBT Spark project compiling fine in IntelliJ. However, when I try to create a SparkContext from a worksheet: import org.apache.spark.SparkContext val sc1 = new SparkContext("local[8]", "sc1") I get this error: com.typesafe.config.ConfigException$Missing: No configuration setting found for

Re: Sorting a Table in Spark RDD

2014-09-23 Thread Victor Tso-Guillen
You could pluck out each column in separate rdds, sort them independently, and zip them :) On Tue, Sep 23, 2014 at 2:40 PM, Areg Baghdasaryan (BLOOMBERG/ 731 LEX -) abaghdasa...@bloomberg.net wrote: Hello, So I have crated a table in in RDD in spark in thei format: col1 col2
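A toy sketch of that idea, assuming the table is an RDD[(Int, Int)] of (col1, col2); a single output partition is used here because zip requires the two sorted RDDs to line up partition by partition:

  val table = sc.parallelize(Seq((10, 11), (12, 8), (9, 13), (2, 3)))

  val col1Sorted = table.map(_._1).sortBy(identity, ascending = true, numPartitions = 1)
  val col2Sorted = table.map(_._2).sortBy(identity, ascending = true, numPartitions = 1)

  val sortedColumns = col1Sorted.zip(col2Sorted)
  sortedColumns.collect().foreach(println)   // (2,3), (9,8), (10,11), (12,13)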

Worker Random Port

2014-09-23 Thread Paul Magid
I am trying to get an edge server up and running connecting to our Spark 1.1 cluster. The edge server is in a different DMZ than the rest of the cluster and we have to specifically open firewall ports between the edge server and the rest of the cluster. I can log on to any node in the

Re: Multiple Kafka Receivers and Union

2014-09-23 Thread Tim Smith
Sorry, I am almost Java illiterate but here's my Scala code to do the equivalent (that I have tested to work): val kInStreams = (1 to 10).map { _ => KafkaUtils.createStream(ssc, "zkhost.acme.net:2182", "myGrp", Map("myTopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER) } // Create 10 receivers across the cluster,
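Continuing the sketch above, with the same placeholder host and topic names: union the receivers into one DStream and repartition before the heavier processing:

  import org.apache.spark.streaming.StreamingContext._

  val unified = ssc.union(kInStreams).repartition(40)   // spread work beyond the 10 receiver tasks
  unified.map(_._2)                                     // Kafka streams yield (key, message) pairs
    .flatMap(_.split(" "))
    .map(w => (w, 1L))
    .reduceByKey(_ + _)
    .print()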

Re: Multiple Kafka Receivers and Union

2014-09-23 Thread Matt Narrell
To my eyes, these are functionally equivalent. I’ll try a Scala approach, but this may cause waves for me upstream (e.g., non-Java) Thanks for looking at this. If anyone else can see a glaring issue in the Java approach that would be appreciated. Thanks, Matt On Sep 23, 2014, at 4:13 PM,

SQL status code to indicate success or failure of query

2014-09-23 Thread Du Li
Hi, After executing sql() in SQLContext or HiveContext, is there a way to tell whether the query/command succeeded or failed? Method sql() returns SchemaRDD which either is empty or contains some Rows of results. However, some queries and commands do not return results by nature; being empty

Does anyone have experience with using Hadoop InputFormats?

2014-09-23 Thread Steve Lewis
When I experimented with using an InputFormat I had used in Hadoop for a long time, I found: 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated class, not org.apache.hadoop.mapreduce.lib.input.FileInputFormat) 2) initialize needs to be called in the constructor 3)
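A minimal sketch of wiring an old-API (org.apache.hadoop.mapred) format into Spark; TextInputFormat stands in for the custom format here, and the path is a placeholder:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.TextInputFormat

  val records = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input", 8)

  // Hadoop reuses Writable instances, so copy values out before caching or collecting.
  val lines = records.map { case (_, text) => text.toString }
  lines.take(5).foreach(println)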

IOException running streaming job

2014-09-23 Thread Emil Gustafsson
I'm trying out some streaming with Spark and I'm getting an error that puzzles me since I'm new to Spark. I get this error all the time; 1-2 batches in the stream are processed before the job stops, but never the complete job, and often no batch is processed at all. I use Spark 1.1.0. The job

Re: Spark Code to read RCFiles

2014-09-23 Thread Matei Zaharia
Is your file managed by Hive (and thus present in a Hive metastore)? In that case, Spark SQL (https://spark.apache.org/docs/latest/sql-programming-guide.html) is the easiest way. Matei On September 23, 2014 at 2:26:10 PM, Pramod Biligiri (pramodbilig...@gmail.com) wrote: Hi, I'm trying to
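A short sketch of the Spark SQL route, assuming the RCFile data is already registered as a table in the Hive metastore (the table name is a placeholder):

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  val rows = hiveContext.sql("SELECT * FROM my_rcfile_table LIMIT 10")
  rows.collect().foreach(println)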

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread Xiangrui Meng
Your dataset is small. NaiveBayes should work under the default settings, even in local mode. Could you try local mode first without changing any Spark settings? Since your dataset is small, could you save the vectorized data (RDD[LabeledPoint]) and send me a sample? I want to take a look at the
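A sketch of that sanity check with assumed paths: train in local mode on the vectorized RDD[LabeledPoint] and save a small sample for offline inspection:

  import org.apache.spark.mllib.classification.NaiveBayes
  import org.apache.spark.mllib.regression.LabeledPoint

  val data = sc.objectFile[LabeledPoint]("hdfs:///vectors")        // placeholder path
  val model = NaiveBayes.train(data, lambda = 1.0)

  sc.parallelize(data.take(1000)).saveAsObjectFile("hdfs:///vectors-sample")  // shareable sample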

Re: Exception with SparkSql and Avro

2014-09-23 Thread Zalzberg, Idan (Agoda)
Thanks, I didn't create the tables myself as I have no control over that process. However, these tables are read just fine using the JDBC connection to HiveServer2, so it should be possible. On Sep 24, 2014 12:48 AM, Michael Armbrust mich...@databricks.com wrote: Can you show me the DDL you are

Re: Worker Random Port

2014-09-23 Thread Andrew Ash
Hi Paul, There are several ports you need to configure in order to run in a tight network environment. It sounds like the DMZ that contains the Spark cluster is wide open internally, but you have to poke holes between that and the driver. You should take a look at the port configuration
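A rough sketch of pinning the driver-side ports so specific firewall holes can be opened; the property names follow the Spark 1.1 configuration docs, and the port numbers are arbitrary examples:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("edge-server-app")
    .set("spark.driver.port", "51000")
    .set("spark.fileserver.port", "51100")
    .set("spark.broadcast.port", "51200")
    .set("spark.blockManager.port", "51300")
  val sc = new SparkContext(conf)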

Re: Where can I find the module diagram of SPARK?

2014-09-23 Thread Andrew Ash
Hi Theodore, What do you mean by module diagram? A high level architecture diagram of how the classes are organized into packages? Andrew On Tue, Sep 23, 2014 at 12:46 AM, Theodore Si sjyz...@gmail.com wrote: Hi, Please help me with that. BR, Theodore Si

RE: HdfsWordCount only counts some of the words

2014-09-23 Thread SK
I execute it as follows: $SPARK_HOME/bin/spark-submit --master <master url> --class org.apache.spark.examples.streaming.HdfsWordCount target/scala-2.10/spark_stream_examples-assembly-1.0.jar <hdfsdir> After I start the job, I add a new test file in hdfsdir. It is a large text file which I

Re: Why recommend 2-3 tasks per CPU core ?

2014-09-23 Thread Andrew Ash
Also, you'd rather have 2-3 tasks per core than 1 task per core because if the 1 task per core is actually 1.01 tasks per core, then you have one wave of tasks complete and another wave with very few tasks in it. You get better utilization when you're higher than 1. Aaron Davidson goes

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread jatinpreet
Xiangrui, Thanks for replying. I am using the subset of newsgroup20 data. I will send you the vectorized data for analysis shortly. I have tried running in local mode as well but I get the same OOM exception. I started with 4GB of data but then moved to smaller set to verify that everything

Re: SQL status code to indicate success or failure of query

2014-09-23 Thread Michael Armbrust
An exception should be thrown in the case of failure for DDL commands. On Tue, Sep 23, 2014 at 4:55 PM, Du Li l...@yahoo-inc.com.invalid wrote: Hi, After executing sql() in SQLContext or HiveContext, is there a way to tell whether the query/command succeeded or failed? Method sql()
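A tiny sketch of treating the exception as the status signal (hiveContext is assumed to already exist):

  import scala.util.{Failure, Success, Try}

  Try(hiveContext.sql("CREATE TABLE IF NOT EXISTS t (key INT, value STRING)")) match {
    case Success(_)  => println("command succeeded")
    case Failure(ex) => println(s"command failed: ${ex.getMessage}")
  }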

java.lang.OutOfMemoryError while running SVD MLLib example

2014-09-23 Thread sbir...@wynyardgroup.com
Hello, I am new to Spark. I have downloaded Spark 1.1.0 and am trying to run the TallSkinnySVD.scala example with different input data sizes. I tried with input data of a 1000x1000 matrix and a 5000x5000 matrix. Though I faced some Java heap issues, I added the following parameters in spark-defaults.conf

Converting one RDD to another

2014-09-23 Thread Deep Pradhan
Hi, Is it always possible to get one RDD from another? For example, if I do a *top(K)(Ordering)*, I get an Int, right? (In my example the type is Int.) I do not get an RDD. Can anyone explain this to me? Thank You

Re: Where can I find the module diagram of SPARK?

2014-09-23 Thread David Rowe
Hi Andrew, I can't speak for Theodore, but I would find that incredibly useful. Dave On Wed, Sep 24, 2014 at 11:24 AM, Andrew Ash and...@andrewash.com wrote: Hi Theodore, What do you mean by module diagram? A high level architecture diagram of how the classes are organized into packages?

how long does it take executing ./sbt/sbt assembly

2014-09-23 Thread christy
This process began yesterday and it has already run for more than 20 hours. Is this normal? Does anyone have the same problem? No error has been thrown yet.

Re: Converting one RDD to another

2014-09-23 Thread Zhan Zhang
Here is my understanding: def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = { if (num == 0) { // if 0, return an empty array Array.empty } else { mapPartitions { items => // map each partition to a new one whose iterator consists of a single queue,
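A quick illustration of the point, which also answers the "Converting one RDD to another" question above: top and takeOrdered are actions that return a local Array, while sortBy is the transformation that keeps you in an RDD:

  val rdd = sc.parallelize(Seq(5, 1, 9, 3, 7))

  val topThree: Array[Int] = rdd.top(3)          // action -> Array(9, 7, 5) on the driver
  val smallest: Array[Int] = rdd.takeOrdered(3)  // action -> Array(1, 3, 5)

  val sortedRdd = rdd.sortBy(identity, ascending = false)   // stays distributed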

Re: how long does it take executing ./sbt/sbt assembly

2014-09-23 Thread Zhan Zhang
Definitely something is wrong. For me, it takes 10 to 30 minutes. Thanks. Zhan Zhang On Sep 23, 2014, at 10:02 PM, christy 760948...@qq.com wrote: This process began yesterday and it has already run for more than 20 hours. Is it normal? Any one has the same problem? No error throw out yet. --

Re: how long does it take executing ./sbt/sbt assembly

2014-09-23 Thread Tobias Pfeiffer
Hi, http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Kafka-building-project-with-sbt-assembly-is-extremely-slow-td13152.html -- Maybe related to this? Tobias

java.lang.stackoverflowerror when running Spark shell

2014-09-23 Thread mrshen
I tested the examples according to the docs in the Spark SQL programming guide, but a java.lang.StackOverflowError occurred every time I called sqlContext.sql(...). Meanwhile, it worked fine in a HiveContext. The Hadoop version is 2.2.0, the Spark version is 1.1.0, built with YARN, Hive. I would be

RE: how long does it take executing ./sbt/sbt assembly

2014-09-23 Thread Shao, Saisai
If you have enough memory, the build will be faster (within one minute), since most of the files are cached. Also, you can build your Spark project on a mounted ramfs in Linux; this will also speed up the process. Thanks Jerry -Original Message- From: Zhan Zhang

Can not see any spark metrics on ganglia-web

2014-09-23 Thread tsingfu
I installed Ganglia, and I think it worked well for Hadoop and HBase, as I can see Hadoop/HBase metrics on ganglia-web. I want to use Ganglia to monitor Spark, and I followed these steps: 1) first I did a custom compile with -Pspark-ganglia-lgpl, and it succeeded without