Sharding vs. Per-Timeframe Tables

2015-09-29 Thread Jan Algermissen
Hi, I am using Spark and the Cassandra connector to save customer events for later batch analysis. The primary access pattern later on will be by time-slice. One way to save the events would be to create a C* row per day, for example, and within that row store the events in decreasing time order.

Where are logs for Spark Kafka Yarn on Cloudera

2015-09-29 Thread Rachana Srivastava
Hello all, I am trying to test JavaKafkaWordCount on Yarn; to make sure Yarn is working fine I am saving the output to HDFS. The example works fine in local mode but not in yarn mode. I cannot see any output logged when I change the mode to yarn-client or yarn-cluster, nor can I find any

unsubscribe

2015-09-29 Thread sukesh kumar
unsubscribe -- Thanks & Best Regards Sukesh Kumar

OOM error in Spark worker

2015-09-29 Thread varun sharma
My workers are going OOM over time. I am running a streaming job in Spark 1.4.0. Here is the heap dump of the workers: 16,802 instances of "org.apache.spark.deploy.worker.ExecutorRunner", loaded by "sun.misc.Launcher$AppClassLoader @ 0xdff94088", occupy 488,249,688 (95.80%) bytes. These instances are

RE: nested collection object query

2015-09-29 Thread Tridib Samanta
Well, I figured out a way to use explode. But it returns two rows if there are two matches in the nested array objects. select id from department LATERAL VIEW explode(employee) dummy_table as emp where emp.name = 'employee0' I was looking for an operator that loops through the array and returns true

Hive alter table is failing

2015-09-29 Thread Ophir Cohen
Hi, I'm using Spark on top of Hive. As I want to keep old tables, I store the DataFrame into a tmp table in Hive and, when it finishes successfully, I rename the table. In the last few days I've upgraded to Spark 1.4.1, and as I'm using AWS EMR I got Hive 1.0. Now when I try to rename the table I get the

Re: Monitoring tools for spark streaming

2015-09-29 Thread Adrian Tanase
You can also use the REST api introduced in 1.4 – although it’s harder to parse: * jobs from the same batch are not grouped together * You only get total delay, not scheduling delay From: Hari Shreedharan Date: Tuesday, September 29, 2015 at 5:27 AM To: Shixiong Zhu Cc: Siva,

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Adrian Tanase
The error message is very explicit (partition is under-replicated), I don't think it's related to networking issues. Try to run /home/kafka/bin/kafka-topics.sh --zookeeper localhost/kafka --describe topic_name and see which brokers are missing from the replica assignment. (replace home, zk-quorum

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
"more partitions and replicas than available brokers" -- what would be a good ratio? We've been trying to set up 3 topics with 64 partitions. I'm including the output of "bin/kafka-topics.sh --zookeeper localhost:2181 --describe topic1" below. I think it's symptomatic and confirms your theory,

Re: Hive alter table is failing

2015-09-29 Thread Ophir Cohen
Nope, I'm checking it out, thanks! On Tue, Sep 29, 2015 at 3:30 PM, Ted Yu wrote: > Have you seen this thread ? > http://search-hadoop.com/m/q3RTtGwP431AQ2B41 > > Plugin metastore version for your deployment. > > Cheers > > On Sep 29, 2015, at 5:20 AM, Ophir Cohen

Cant perform full outer join

2015-09-29 Thread Saif.A.Ellafi
Hi all, so I have two dataframes, each with two columns: DATE and VALUE. Performing this: data = data.join(cur_data, data("DATE") === cur_data("DATE"), "outer") returns Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'DATE' is ambiguous, could be: DATE#0, DATE#3.; But

Re: Spark Streaming Log4j Inside Eclipse

2015-09-29 Thread Ashish Soni
I am using the Java streaming context and it doesn't have the method setLogLevel, and I have also tried passing the VM argument in Eclipse and it doesn't work. JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2)); Ashish On Tue, Sep 29, 2015 at 7:23 AM, Adrian Tanase

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
Adrian, Thanks for your response. I just looked at both machines we're testing on and on both the Kafka server process looks OK. Anything specific I can check otherwise? From googling around, I see some posts where folks suggest to check the DNS settings (those appear fine) and to set the

RE: Setting executors per worker - Standalone

2015-09-29 Thread java8964
I don't think you can do that in Standalone mode before 1.5. The best you can do is to have multiple workers per box. One worker can and will only start one executor before Spark 1.5. What you can do is to set "SPARK_WORKER_INSTANCES", which controls how many worker instances you can start per
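
A spark-env.sh sketch of the pre-1.5 workaround described above; the instance count, cores and memory are illustrative assumptions, and the cores/memory settings apply to each worker instance:

    export SPARK_WORKER_INSTANCES=2   # run two workers per box
    export SPARK_WORKER_CORES=2       # cores handed to each worker
    export SPARK_WORKER_MEMORY=4g     # memory handed to each worker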

RE: nested collection object query

2015-09-29 Thread java8964
You have 2 options. Option 1: Use lateral view explode, as you did below, but if you want to remove the duplicates, then use distinct after that. For example, starting from (col1, col2, ArrayOf(Struct)), after explode you get (col1, col2, employee0), (col1, col2, employee1), (col1, col2, employee0); then select distinct col1, col2
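
A hedged Scala sketch of option 1, using the table and column names from the thread and assuming a HiveContext named hiveContext is in scope:

    // Explode the nested employee array, filter on the struct field,
    // and de-duplicate the matching ids.
    val matches = hiveContext.sql(
      """SELECT DISTINCT id
        |FROM department
        |LATERAL VIEW explode(employee) emp_table AS emp
        |WHERE emp.name = 'employee0'""".stripMargin)
    matches.show()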

Re: Fetching Date value from spark.sql.row in Spark 1.2.2

2015-09-29 Thread satish chandra j
Hi all, If there are any alternate solutions to get the Date value from org.apache.spark.sql.Row, please suggest them. Regards, Satish Chandra On Tue, Sep 29, 2015 at 4:41 PM, satish chandra j wrote: > HI All, > Currently using Spark 1.2.2, as getDate method is not defined in this

Re: Merging two avro RDD/DataFrames

2015-09-29 Thread Adrian Tanase
Seems to me that the obvious candidate is loading both master and delta, using join or cogroup then write the new master. Through some clever sharding and key management you might achieve some efficiency gains, but I’d say start here if your numbers are in the hundreds of thousands… should run

Re: Where are logs for Spark Kafka Yarn on Cloudera

2015-09-29 Thread Marcelo Vanzin
(-dev@) Try using the "yarn logs" command to read logs for finished applications. You can also browse the RM UI to find more information about the applications you ran. On Mon, Sep 28, 2015 at 11:37 PM, Rachana Srivastava wrote: > Hello all, > > > > I am
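
For reference, a sketch of the command Marcelo mentions; the application id is a placeholder:

    yarn logs -applicationId application_1443500000000_0001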

Converting a DStream to schemaRDD

2015-09-29 Thread Daniel Haviv
Hi, I have a DStream which is a stream of RDD[String]. How can I pass a DStream to sqlContext.jsonRDD and work with it as a DF ? Thank you. Daniel

RandomForestClassifier does not recognize number of classes, nor can number of classes be set

2015-09-29 Thread Kristina Rogale Plazonic
Hi, I'm trying out the ml.classification.RandomForestClassifier() on a simple dataframe and it returns an exception that the number of classes has not been set in my dataframe. However, I cannot find a function that would set the number of classes, or pass it as an argument anywhere. In mllib, numClasses

Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
I apologize for posting this Kafka related issue into the Spark list. Have gotten no responses on the Kafka list and was hoping someone on this list could shed some light on the below. --- We're running into

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Adrian Tanase
I believe some of the brokers in your cluster died and there are a number of partitions that nobody is currently managing. -adrian From: Dmitry Goldenberg Date: Tuesday, September 29, 2015 at 3:26 PM To: "user@spark.apache.org" Subject: Kafka error "partitions

Re: Hive alter table is failing

2015-09-29 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtGwP431AQ2B41 Plugin metastore version for your deployment. Cheers > On Sep 29, 2015, at 5:20 AM, Ophir Cohen wrote: > > Hi, > > I'm using Spark on top of Hive. > As I want to keep old tables I store the DataFrame

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Cody Koeninger
Try writing and reading to the topics in question using the kafka command line tools, to eliminate your code as a variable. That number of partitions is probably more than sufficient: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
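
A sketch of the kind of command-line check Cody suggests; the broker and ZooKeeper addresses and the topic name are placeholders:

    bin/kafka-console-producer.sh --broker-list localhost:9092 --topic topic1
    bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic topic1 --from-beginning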

Re: Converting a DStream to schemaRDD

2015-09-29 Thread Adrian Tanase
Also check this out https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/scala/src/main/scala/com/databricks/apps/logs/chapter1/LogAnalyzerStreamingSQL.scala From the Databricks reference app: https://github.com/databricks/reference-apps From: Ewan Leith Date:

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
Thanks, Cody. Yes, we did see that writeup from Jay; it seems to just refer to his test with 6 partitions. I've been looking for more of a recipe for what the possible max is vs. what the optimal value may be; haven't found such. KAFKA-899 appears related but it was fixed in Kafka 0.8.2.0 - we're

RE: Converting a DStream to schemaRDD

2015-09-29 Thread Ewan Leith
Something like: dstream.foreachRDD { rdd => val df = sqlContext.read.json(rdd) df.select(…) } https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams Might be the place to start, it’ll convert each batch of dstream into an RDD then let you work
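
A slightly fuller sketch of the same idea, assuming a DStream[String] of JSON named jsonStream and an existing SQLContext; the temp table name is illustrative:

    jsonStream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {                  // skip empty micro-batches
        val df = sqlContext.read.json(rdd)   // infer a schema from the JSON strings
        df.registerTempTable("events")       // expose this batch to SQL
        sqlContext.sql("SELECT count(*) FROM events").show()
      }
    }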

Fetching Date value from spark.sql.row in Spark 1.2.2

2015-09-29 Thread satish chandra j
Hi all, Currently using Spark 1.2.2; as the getDate method is not defined in this version, I am trying to fetch the Date value of a specific column using the *get* method as specified in the docs (ref URL given below:) https://spark.apache.org/docs/1.2.2/api/java/index.html?org/apache/spark/sql/api/java/Row.html

Change Orc split size

2015-09-29 Thread Renu Yadav
Hi, I am reading data from a Hive ORC table using spark-sql, which is taking 256 MB as the split size. How can I change this size? Thanks, Renu

Re: Spark Streaming many subscriptions vs many jobs

2015-09-29 Thread Cody Koeninger
There isn't an easy way of ensuring delivery semantics for producing to kafka (see https://cwiki.apache.org/confluence/display/KAFKA/KIP-27+-+Conditional+Publish). If there's only one logical consumer of the intermediate state, I wouldn't write it back to kafka, i'd just keep it in a single spark

Re: Stopping criteria for gradient descent

2015-09-29 Thread Yanbo Liang
Hi Nishanth, The diff of the solution vectors is compared to a relative tolerance or an absolute tolerance; you can set convergenceTol, which affects the convergence criteria of SGD. 2015-09-17 8:31 GMT+08:00 Nishanth P S : > Hi, > > I am running LogisticRegressionWithSGD in
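
A hedged sketch of how the tolerance might be set, assuming Spark 1.5+ where the SGD optimizer exposes setConvergenceTol; trainingData is an assumed RDD[LabeledPoint]:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    val lr = new LogisticRegressionWithSGD()
    lr.optimizer
      .setNumIterations(200)
      .setConvergenceTol(1e-4)   // stop early once the change in the solution vector is small
    val model = lr.run(trainingData)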

PySpark Checkpoints with Broadcast Variables

2015-09-29 Thread Jason White
I'm having trouble loading a streaming job from a checkpoint when a broadcast variable is defined. I've seen the solution by TD in Scala ( https://issues.apache.org/jira/browse/SPARK-5206) that uses a singleton to get/create an accumulator, but I can't seem to get it to work in PySpark with a

Re: How to find how much data will be trained in mllib or how much of the spark job is completed?

2015-09-29 Thread Robineast
This page gives details on the monitoring available http://spark.apache.org/docs/latest/monitoring.html. You can get a UI showing Jobs, Stages and Tasks with an indication how far completed the job is. The UI is usually on port 4040 of the machine where you run the spark driver program. The

Re: Spark Streaming Log4j Inside Eclipse

2015-09-29 Thread Adrian Tanase
You should set extra Java options for your app via the Eclipse project and specify something like -Dlog4j.configuration=file:/tmp/log4j.properties Sent from my iPhone On 28 Sep 2015, at 18:52, Shixiong Zhu wrote: You can use
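
A minimal log4j.properties sketch that could live at /tmp/log4j.properties; the WARN level for org.apache.spark is just an illustrative choice:

    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
    log4j.logger.org.apache.spark=WARN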

Re: Adding / Removing worker nodes for Spark Streaming

2015-09-29 Thread Adrian Tanase
Just wanted to make sure one thing is really clear – the kafka offsets are part of the actual RDD – in every batch spark is saving the offset ranges for each partition – this in theory will make the data in each batch stable across recovery. The other important thing is that with correct

Spark Streaming many subscriptions vs many jobs

2015-09-29 Thread Arttii
Hi, So I am working on a project where we might end up having a bunch of decoupled logic components that have to run inside Spark Streaming. We are using Kafka as the source of streaming data. My first question is: is it better to chain these pieces of logic together by applying transforms to a single rdd

Re: Does YARN start new executor in place of the failed one?

2015-09-29 Thread Adrian Tanase
In theory, yes - however in practice it seems that it depends on how they die. I’ve recently logged an issue for the case when the machine is restarted. If the executor process dies it generally comes back gracefully. https://issues.apache.org/jira/browse/SPARK-10792 Maybe you can vote up the

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-09-29 Thread Rick Moritz
I specified JavaSerializer in both cases (and attempted to use Kryo in the shell, but failed due to SPARK-6520), and still get the vastly differing performance. Somehow the shell-compiler must impact either serialization or shuffling, but at a level other than the standard REPL API, since Zeppelin

Re: Spark-Kafka Connector issue

2015-09-29 Thread Cody Koeninger
Show the output of bin/kafka-topics.sh --list. Show the actual code with the topic name hardcoded in the set, not loaded from an external file you didn't show. Show the full stacktrace you're getting. On Mon, Sep 28, 2015 at 10:03 PM, Ratika Prasad wrote: > Yes the

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON

2015-09-29 Thread Ted Yu
sqlContext.read.json() expects Path to the JSON file. FYI On Tue, Sep 29, 2015 at 7:23 AM, Fernando Paladini wrote: > Hello guys, > > I'm very new to Spark and I'm having some troubles when reading a JSON to > dataframe on PySpark. > > I'm getting a JSON object from an

Re: Spark SQL: Implementing Custom Data Source

2015-09-29 Thread Michael Armbrust
That's a pretty advanced example that uses experimental APIs. I'd suggest looking at https://github.com/databricks/spark-avro as a reference. On Mon, Sep 28, 2015 at 9:00 PM, Ted Yu wrote: > See this thread: > > http://search-hadoop.com/m/q3RTttmiYDqGc202 > > And: > >

"Method json([class java.util.HashMap]) does not exist" when reading JSON

2015-09-29 Thread Fernando Paladini
Hello guys, I'm very new to Spark and I'm having some trouble reading a JSON into a dataframe on PySpark. I'm getting a JSON object from an API response and I would like to store it in Spark as a DataFrame (I've read that a DataFrame is better than an RDD; is that accurate?). From what I've read

Spark Job/Stage names

2015-09-29 Thread Nithin Asokan
I'm interested to see if anyone knows of a way to have custom job/stage name for Spark Application. I believe I can use *sparkContext.setCallSite(String)* to update job/stage names but it does not let me update each stage name, setting this value will set same text for all job and stage names for

Re: Spark Streaming Log4j Inside Eclipse

2015-09-29 Thread Shixiong Zhu
I mean JavaSparkContext.setLogLevel. You can use it like this: JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2)); jssc.sparkContext().setLogLevel(...); Best Regards, Shixiong Zhu 2015-09-29 22:07 GMT+08:00 Ashish Soni : > I am using

Re: Executor Lost Failure

2015-09-29 Thread Nithin Asokan
Try increasing memory (--conf spark.executor.memory=3g or --executor-memory) for executors. Here is something I noted from your logs 15/09/29 06:32:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_2_1813 in memory. 15/09/29 06:32:03 WARN
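
A hedged spark-submit sketch along those lines; the master, class and jar names are placeholders:

    spark-submit --master yarn-client \
      --executor-memory 3g \
      --class com.example.MyJob \
      my-job.jar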

Re: Executor Lost Failure

2015-09-29 Thread Ted Yu
Can you list the spark-submit command line you used ? Thanks On Tue, Sep 29, 2015 at 9:02 AM, Anup Sawant wrote: > Hi all, > Any idea why I am getting 'Executor heartbeat timed out' ? I am fairly new > to Spark so I have less knowledge about the internals of it. The

Re: SparkContext._active_spark_context returns None

2015-09-29 Thread Ted Yu
bq. the right way to reach JVM in python Can you tell us more about what you want to achieve ? If you want to pass some value to workers, you can use broadcast variable. Cheers On Mon, Sep 28, 2015 at 10:31 PM, YiZhi Liu wrote: > Hi Ted, > > Thank you for reply. The sc

DStream union with different slideDuration

2015-09-29 Thread Goodall, Mark (UK)
Hi, I was wondering if there is a reason for limiting union to only work on streams with the same slideDuration. Looking at UnionDStream.scala, if slideDuration was set to the minimum of the parents, and there was a require to enforce that all slideDuration were divisible wholly by the minimum,

Spark Job/Stage names

2015-09-29 Thread nasokan
I'm interested to see if anyone knows of a way to have custom job/stage name for Spark Application. I believe I can use sparkContext.setCallSite(String) to update job/stage names but it does not let me update each stage name, setting this value will set same text for all job and stage names for

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
We've got Kafka working generally. Can definitely write to it now. There was a snag where num.partitions was set to 12 on one node but to 64 on the other. We fixed this and set num.partitions to 42 and things are working on that side. On Tue, Sep 29, 2015 at 9:39 AM, Cody Koeninger

Executor Lost Failure

2015-09-29 Thread Anup Sawant
Hi all, Any idea why I am getting 'Executor heartbeat timed out' ? I am fairly new to Spark so I have less knowledge about the internals of it. The job was running for a day or so on 102 Gb of data with 40 workers. -Best, Anup. 15/09/29 06:32:03 ERROR TaskSchedulerImpl: Lost executor driver on

Re: Does pyspark in cluster mode need python on individual executor nodes ?

2015-09-29 Thread Ted Yu
I think the answer is yes. Code packaged in pyspark.zip needs python to execute. On Tue, Sep 29, 2015 at 2:08 PM, Ranjana Rajendran < ranjana.rajend...@gmail.com> wrote: > Hi, > > Does a python spark program (which makes use of pyspark ) submitted in > cluster mode need python on the executor

using UDF( defined in Java) in scala through scala

2015-09-29 Thread ogoh
Hello, I have a UDF declared in Java but I'd like to call it from spark-shell, which only supports Scala. Since I am new to Scala, I couldn't figure out how to register the Java UDF using sqlContext.udf.register in Scala. Below is how I tried. I appreciate any help. Thanks, = my UDF in

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
Release of Spark: 1.5.0. Command line invokation: ACME_INGEST_HOME=/mnt/acme/acme-ingest ACME_INGEST_VERSION=0.0.1-SNAPSHOT ACME_BATCH_DURATION_MILLIS=5000 SPARK_MASTER_URL=spark://data1:7077 JAVA_OPTIONS="-Dspark.streaming.kafka.maxRatePerPartition=1000" JAVA_OPTIONS="$JAVA_OPTIONS

Re: Hive ORC Malformed while loading into spark data frame

2015-09-29 Thread Hortonworks
You can try to use data frame for both read and write Thanks Zhan Zhang Sent from my iPhone > On Sep 29, 2015, at 1:56 PM, Umesh Kacha wrote: > > Hi Zang, thanks for the response. Table is created using Spark > hiveContext.sql and data inserted into table also using

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Ted Yu
Mind providing a bit more information: release of Spark, command line for running the Spark job. Cheers On Tue, Sep 29, 2015 at 1:37 PM, Dmitry Goldenberg wrote: > We're seeing this occasionally. Granted, this was caused by a wrinkle in > the Solr schema but this bubbled

unintended consequence of using coalesce operation

2015-09-29 Thread Lan Jiang
Hi there, I ran into an issue when using Spark (v 1.3) to load an avro file through Spark SQL. The code sample is below: val df = sqlContext.load("path-to-avro-file", "com.databricks.spark.avro") val myrdd = df.select("Key", "Name", "binaryfield").rdd val results = myrdd.map(...) val finalResults =

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Ted Yu
Have you tried the following? --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true On Tue, Sep 29, 2015 at 4:38 PM, Dmitry Goldenberg wrote: > Release of Spark: 1.5.0. > > Command line invocation: > >

ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
We're seeing this occasionally. Granted, this was caused by a wrinkle in the Solr schema but this bubbled up all the way in Spark and caused job failures. I just checked and SolrException class is actually in the consumer job jar we use. Is there any reason why Spark cannot find the

Hive ORC Malformed while loading into spark data frame

2015-09-29 Thread unk1102
Hi, I have a Spark job which creates Hive tables in ORC format with partitions. It works well; I can read the data back into a Hive table using the Hive console. But if I try to further process the ORC files generated by the Spark job by loading them into a dataframe, then I get the following exception Caused by:

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Divya Ravichandran
This could be because org.apache.solr.common.SolrException doesn't implement Serializable. This error shows up when Spark is deserializing a class which doesn't implement Serializable. Thanks Divya On Sep 29, 2015 4:37 PM, "Dmitry Goldenberg" wrote: > We're seeing this

Re: Hive ORC Malformed while loading into spark data frame

2015-09-29 Thread Umesh Kacha
Hi Zang, thanks for the response. Table is created using Spark hiveContext.sql and data inserted into table also using hiveContext.sql. Insert into partition table. When I try to load orc data into dataframe I am loading particular partition data stored in path say

Does pyspark in cluster mode need python on individual executor nodes ?

2015-09-29 Thread Ranjana Rajendran
Hi, Does a python spark program (which makes use of pyspark) submitted in cluster mode need python on the executor nodes? Isn't the python program interpreted on the client node from where the job is submitted, while the executors run in the JVM of each of the executor nodes? Thank you,

Self Join reading the HDFS blocks TWICE

2015-09-29 Thread Data Science Education
As part of fairly complex processing, I am executing a self join query using HiveContext against a Hive table to find the latest Transaction, oldest Transaction etc: for a given set of Attributes. I am still using v1.3.1 and so Window functions are not an option. The simplified query looks like

Spark Streaming Standalone 1.5 - Stage cancelled because SparkContext was shut down

2015-09-29 Thread An Tran
Hello All, I have several Spark Streaming applications running on Standalone mode in Spark 1.5. Spark is currently set up for dynamic resource allocation. The issue I am seeing is that I can have about 12 Spark Streaming Jobs running concurrently. Occasionally I would see more than half where

Re: Does pyspark in cluster mode need python on individual executor nodes ?

2015-09-29 Thread Ranjana Rajendran
Thank you Ted. I have Python 2.6 on all the nodes including the client node. I want to instead use Python 2.7. For the PySpark shell, I was able to do this by downloading python 2.7.8 and installing it in a root based out of my home directory and setting PYSPARK_PYTHON to ~/python2.7/bin/python
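
A sketch of pointing both the driver and the executors at Python 2.7, assuming the interpreter is installed at the same path on every node (paths and the job file are placeholders):

    export PYSPARK_PYTHON=/opt/python2.7/bin/python
    export PYSPARK_DRIVER_PYTHON=/opt/python2.7/bin/python
    spark-submit --master yarn-client my_job.py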

Fwd: Query about checkpointing time

2015-09-29 Thread Jatin Ganhotra
Hi, I started doing the amp-camp 5 exercises. I tried the following 2 scenarios: *Scenario #1* val pagecounts = sc.textFile("data/pagecounts") pagecounts.checkpoint pagecounts.count *Scenario #2* val pagecounts =

Spark thrift service and Hive impersonation.

2015-09-29 Thread Jagat Singh
Hi, I have started the Spark thrift service using spark user. Does each user needs to start own thrift server to use it? Using beeline i am able to connect to server and execute show tables; However when we try to execute some real query it runs as spark user and HDFS permissions does not

Re: Hive ORC Malformed while loading into spark data frame

2015-09-29 Thread Hortonworks
How was the table generated, by Hive or by Spark? If you generate the table using Hive but read it with a data frame, it may have some compatibility issues. Thanks Zhan Zhang Sent from my iPhone > On Sep 29, 2015, at 1:47 PM, unk1102 wrote: > > Hi I have a spark job

Re: JobScheduler: Error generating jobs for time for custom InputDStream

2015-09-29 Thread Juan Rodríguez Hortalá
Hi Shixiong, Thanks for your answer. I will take a look at your suggestion; maybe my call to SparkContext.parallelize doesn't work well when there are fewer records to parallelize than partitions. Thanks a lot for your help Greetings, Juan 2015-09-24 2:04 GMT-07:00 Shixiong Zhu

Re: pyspark-Failed to run first

2015-09-29 Thread balajikvijayan
Any updates on this issue? A cursory search shows that others are still experiencing this issue. I'm seeing this occur on trivial data sets in pyspark; however they are running successfully in scala. While this is an acceptable workaround I would like to know if this item is on the spark roadmap

Yahoo's Caffe-on-Spark project

2015-09-29 Thread Thomas Dudziak
http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop I would be curious to learn what the Spark developer's plans are in this area (NNs, GPUs) and what they think of integration with existing NN frameworks like Caffe or Torch. cheers, Tom

Re: Cant perform full outer join

2015-09-29 Thread Terry Hoo
Saif, Maybe you can rename one of the dataframes to different column names first, then do an outer join and a select like this: val cur_d = cur_data.toDF("Date_1", "Value_1") val r = data.join(cur_d, data("DATE") === cur_d("Date_1"), "outer").select($"DATE", $"VALUE", $"Value_1") Thanks, Terry On Tue,

Re: unintended consequence of using coalesce operation

2015-09-29 Thread Michael Armbrust
coalesce is generally to avoid launching too many tasks on a bunch of small files. As a result, the goal is to reduce parallelism (when the overhead of that parallelism is more costly than the gain). You are correct that in your case repartition sounds like a better choice. On Tue, Sep 29, 2015
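
A small Scala sketch of the distinction Michael describes; the path and partition counts are placeholders:

    val df = sqlContext.load("/path/to/avro", "com.databricks.spark.avro")
    val narrowed   = df.rdd.coalesce(8)      // no shuffle: the avro read itself now runs in only 8 tasks
    val reshuffled = df.rdd.repartition(64)  // full shuffle: the read keeps one task per input split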

RE: Spark thrift service and Hive impersonation.

2015-09-29 Thread Mohammed Guller
When a user issues a connect command from Beeline, it asks for username and password. What happens if you give spark as the user name? Also, it looks like the permission for "/data/mytable" is drwxr-x--x. Have you tried changing the permission to allow other users to read? Mohammed From: Jagat

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
I'm actually not sure how either one of these would possibly cause Spark to find SolrException. Whether the driver or executor class path is first, should it not matter, if the class is in the consumer job jar? On Tue, Sep 29, 2015 at 9:12 PM, Dmitry Goldenberg

Re: Hive ORC Malformed while loading into spark data frame

2015-09-29 Thread Umesh Kacha
Hi, I can read/load ORC data created by a Hive table into a dataframe, so why is it throwing a Malformed ORC exception when I try to load data created by hiveContext.sql into a dataframe? On Sep 30, 2015 2:37 AM, "Hortonworks" wrote: > You can try to use data frame for both read and

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
Ted, I think I have tried these settings with the hbase protocol jar, to no avail. I'm going to see if I can try and use these with this SolrException issue though it now may be harder to reproduce it. Thanks for the suggestion. On Tue, Sep 29, 2015 at 8:03 PM, Ted Yu

RE: laziness in textFile reading from HDFS?

2015-09-29 Thread Mohammed Guller
1) It is not required to have the same amount of memory as data. 2) By default the # of partitions are equal to the number of HDFS blocks 3) Yes, the read operation is lazy 4) It is okay to have more number of partitions than number of cores. Mohammed -Original Message- From: davidkl
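
A small Scala sketch illustrating points 2-4; the HDFS path and partition count are placeholders:

    val lines = sc.textFile("hdfs:///data/large.txt", 64)  // lazy: no I/O happens here
    println(lines.partitions.length)                       // typically >= 64 for splittable text
    println(lines.count())                                 // the action triggers the actual read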

Re: SparkContext._active_spark_context returns None

2015-09-29 Thread YiZhi Liu
Hi Ted, I think I've made a mistake. I referred to python/mllib; callJavaFunc in mllib/common.py uses SparkContext._active_spark_context because it is called from the driver. So maybe there is no explicit way to reach the JVM during rdd operations? What I want to achieve is to take a ThriftWritable

RE: Spark thrift service and Hive impersonation.

2015-09-29 Thread Mohammed Guller
Does each user need to start their own thrift server to use it? No. One of the benefits of the Spark Thrift Server is that it allows multiple users to share a single SparkContext. Most likely, you have a file permissions issue. Mohammed From: Jagat Singh [mailto:jagatsi...@gmail.com] Sent: Tuesday,

Re: Monitoring tools for spark streaming

2015-09-29 Thread Otis Gospodnetić
Hi, There's also SPM for Spark -- http://sematext.com/spm/integrations/spark-monitoring.html SPM graphs all Spark metrics and gives you alerting, anomaly detection, etc. and if you ship your Spark and/or other logs to Logsene - http://sematext.com/logsene - you can correlate metrics, logs,

Re: Spark thrift service and Hive impersonation.

2015-09-29 Thread Jagat Singh
Hi, Thanks for your reply. If you see the log message Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table mytable. java.security.AccessControlException: Permission denied: user=spark, access=READ, inode="/data/mytable":tdcprdr:tdcprdr:drwxr-x--x Spark is trying to

Re: Spark SQL: Implementing Custom Data Source

2015-09-29 Thread Michael Armbrust
Yep, we've designed it so that we take care of any translation that needs to be done for you. On Tue, Sep 29, 2015 at 10:39 AM, Jerry Lam wrote: > Hi Michael and Ted, > > Thank you for the reference. Is it true that once I implement a custom > data source, it can be used

Re: Setting executors per worker - Standalone

2015-09-29 Thread James Pirz
Thanks for your help. You were correct about the memory settings. Previously I had following config: --executor-memory 8g --conf spark.executor.cores=1 Which was really conflicting, as in spark-env.sh I had: export SPARK_WORKER_CORES=4 export SPARK_WORKER_MEMORY=8192m So the memory budget per

Re: Spark SQL: Implementing Custom Data Source

2015-09-29 Thread Jerry Lam
Hi Michael and Ted, Thank you for the reference. Is it true that once I implement a custom data source, it can be used in all spark supported language? That is Scala, Java, Python and R. :) I want to take advantage of the interoperability that is already built in spark. Thanks! Jerry On Tue,

Spark SQL deprecating Hive? How will I access Hive metadata in the future?

2015-09-29 Thread YaoPau
I've heard that Spark SQL will be or has already started deprecating HQL. We have Spark SQL + Python jobs that currently read from the Hive metastore to get things like table location and partition values. Will we have to re-code these functions in future releases of Spark (maybe by connecting

How to set System environment variables in Spark

2015-09-29 Thread swetha
Hi, How to set System environment variables when submitting a job? Suppose I have the environment variable as shown below. I have been trying to specify --- -Dcom.w1.p1.config.runOnEnv=dev and --conf -Dcom.w1.p1.config.runOnEnv=dev. But, it does not seem to be working. How to set environment

Re: How to find how much data will be trained in mllib or how much of the spark job is completed?

2015-09-29 Thread Robineast
so you could query the rest api in code. E.g. /applications//stages provides details on the number of active and completed tasks in each stage - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action --
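
A hedged example of hitting that endpoint; the host, port and application id are placeholders:

    curl http://driver-host:4040/api/v1/applications/app-20150929120000-0001/stages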

Best practices to call small spark jobs as part of REST api

2015-09-29 Thread unk1102
Hi, I would like to know any best practices for calling Spark jobs from a REST API. My Spark jobs return results as JSON and that JSON can be used by a UI application. Should we even have a direct HDFS/Spark backend layer in the UI for on-demand queries? Please guide. Thanks much. -- View this message in

Re: How to set System environment variables in Spark

2015-09-29 Thread Nithin Asokan
--conf is used to pass any Spark configuration that starts with spark.*; you can also use "--driver-java-options" to pass any system properties you would like to the driver program. On Tue, Sep 29, 2015 at 2:30 PM swetha wrote: > > Hi, > > How to set System
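
A hedged spark-submit sketch of that suggestion; the property name comes from the original question, and spark.executor.extraJavaOptions is added on the assumption the property is also needed on the executors (class and jar are placeholders):

    spark-submit \
      --driver-java-options "-Dcom.w1.p1.config.runOnEnv=dev" \
      --conf spark.executor.extraJavaOptions=-Dcom.w1.p1.config.runOnEnv=dev \
      --class com.example.MyJob my-job.jar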

input file from tar.gz

2015-09-29 Thread Peter Rudenko
Hi, i have a huge tar.gz file on dfs. This file contains several files, but i want to use only one of them as input. Is it possible to filter somehow a tar.gz schema, something like this: sc.textFile("hdfs:///data/huge.tar.gz#input.txt") Thanks, Peter Rudenko

Spark mailing list confusion

2015-09-29 Thread Robineast
Does anyone have any idea why some topics on the mailing list end up on https://www.mail-archive.com/user@spark.apache.org e.g. this message thread, but not on http://apache-spark-user-list.1001560.n3.nabble.com ? Whilst I get

Re: input file from tar.gz

2015-09-29 Thread Ted Yu
The syntax using '#' is not supported by hdfs natively. YARN resource localization supports such notion. See http://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html Not sure about Spark. On Tue, Sep 29, 2015 at 11:39 AM, Peter

Re: How to set System environment variables in Spark

2015-09-29 Thread Ted Yu
Please see 'spark.executorEnv.[EnvironmentVariableName]' in https://spark.apache.org/docs/latest/configuration.html#runtime-environment FYI On Tue, Sep 29, 2015 at 12:29 PM, swetha wrote: > > Hi, > > How to set System environment variables when submitting a job?
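
A Scala-side sketch of Ted's pointer: any spark.executorEnv.<NAME> entry becomes an environment variable on every executor (the variable name and value here are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("env-example")
      .set("spark.executorEnv.RUN_ON_ENV", "dev")   // exported as RUN_ON_ENV on each executor
    val sc = new SparkContext(conf)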

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON

2015-09-29 Thread Fernando Paladini
Thank you for the awesome, well-explained answers! Actually I've a data_point (simplifying, a sensor inside a physical room) and each data_point has its own point_values (the signals generated by the sensor, including the timestamp of when this signal was generated). That's what I get when I run

spark distributed linear system with sparse data

2015-09-29 Thread Cameron McBride
Hi All, Relatively new to spark and scala, here. I'm trying to solve a simple linear system of A*x = b where A is a sparse matrix, and x and b are dense vectors. Standard fare, really. But the solutions I've found from examples and tests don't seem efficient; I think the use case I need is

Re: Dynamic DAG use-case for spark streaming.

2015-09-29 Thread Tathagata Das
A very basic support that is there in DStream is DStream.transform(), which takes an arbitrary RDD => RDD function. This function can actually choose to do different computation with time. That may be of help to you. On Tue, Sep 29, 2015 at 12:06 PM, Archit Thakur wrote: >
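
A hedged Scala sketch of what DStream.transform allows, assuming a DStream[Int] named stream; the mutable flag is only there to show that the RDD function is re-evaluated each batch and can therefore change with time:

    @volatile var keepEvens = true
    val transformed = stream.transform { rdd =>
      if (keepEvens) rdd.filter(_ % 2 == 0)   // this closure runs once per batch,
      else rdd.filter(_ % 2 != 0)             // so the computation can vary over time
    }
    transformed.print()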

Re: Spark SQL deprecating Hive? How will I access Hive metadata in the future?

2015-09-29 Thread Michael Armbrust
We are not deprecating HiveQL, nor the ability to read metadata from the metastore. On Tue, Sep 29, 2015 at 12:24 PM, YaoPau wrote: > I've heard that Spark SQL will be or has already started deprecating HQL. > We > have Spark SQL + Python jobs that currently read from the
