Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
It really shouldn’t, if anything, running as superuser should ALLOW you to bind to ports 0, 1 etc. It seems very strange that it should even be trying to bind to these ports - maybe a JVM issue? I wonder if the old Apple JVM implementations could have used some different native libraries for

Running Java Program using Eclipse on Existing Spark Cluster

2016-03-09 Thread Gaini Rajeshwar
Hi All, I have one master & 2 workers on my local machine. I wrote the following Java program to count the number of lines in the README.md file (I am using a Maven project to do this): import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.JavaRDD; import

Re: Installing Spark on Mac

2016-03-09 Thread Gaini Rajeshwar
It should just work with these steps. You don't need to configure much. As mentioned, some settings on your machine are overriding default spark settings. Even running as super-user should not be a problem. It works just fine as super-user as well. Can you tell us what version of Java you are

"bootstrapping" DStream state

2016-03-09 Thread Zalzberg, Idan (Agoda)
Hi, I have a spark-streaming application that basically keeps track of a string->string dictionary. So I have messages coming in with updates, like: "A"->"B" And I need to update the dictionary. This seems like a simple use case for the updateStateByKey method. However, my issue is that when
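For context, updateStateByKey has an overload (since Spark 1.3) that accepts an initial RDD, which is the usual way to bootstrap state from an existing snapshot. A minimal sketch, assuming the dictionary snapshot can be loaded as an RDD; the socket source, checkpoint path and "A"->"B" message format are placeholders:

```scala
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BootstrapDictState {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("dict"), Seconds(10))
    ssc.checkpoint("/tmp/checkpoint") // required by updateStateByKey

    // Hypothetical snapshot of the existing dictionary, e.g. loaded from HDFS.
    val initial = ssc.sparkContext.parallelize(Seq("A" -> "B"))

    // Stand-in source; each line is an update like "A->C".
    val updates = ssc.socketTextStream("localhost", 9999)
      .map { line => val Array(k, v) = line.split("->"); (k, v) }

    // Latest value wins; keep the old value when no update arrives in the batch.
    val updateFunc = (values: Seq[String], state: Option[String]) =>
      values.lastOption.orElse(state)

    // The three-argument overload seeds the state with the initial RDD.
    val dict = updates.updateStateByKey(
      updateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      initial)
    dict.print()

    ssc.start(); ssc.awaitTermination()
  }
}
```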

Re: spark streaming doesn't pick new files from HDFS

2016-03-09 Thread srimugunthan dhandapani
I doubt if that's the problem. That's how HDFS lists a directory. Output of a few more commands below. *$ hadoop fs -ls /tmp/* Found 7 items drwxrwxrwx - hdfs supergroup 0 2016-03-10 11:09 /tmp/.cloudera_health_monitoring_canary_files -rw-r--r-- 3 ndsuser1 supergroup 447873024

Re: AVRO vs Parquet

2016-03-09 Thread Guru Medasani
+1 Paul. Both have some pros and cons. Hope this helps. Avro: Pros: 1) Plays nice with other tools, 3rd party or otherwise, or you specifically need some data type in AVRO like binary, but gladly that list is shrinking all the time (yay nested types in Impala). 2) Good for event data that

[Streaming] Batch interval and bulk export

2016-03-09 Thread Li Ming Tsai
Hi, I am doing a few basic operations like map -> reduceByKey -> filter, which is very similar to word count, and I'm saving the results where the count > threshold. Currently the batch window is every 10s, but I would like to save the results to Redshift at a lower frequency instead of every
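One way to export at a lower frequency than the batch interval is to window the counts so that only one bulk write happens per window. A sketch assuming a 60s export cadence; `counts`, `threshold` and `saveToRedshift` are hypothetical names standing in for the poster's pipeline:

```scala
import org.apache.spark.streaming.Seconds

// Assume counts: DStream[(String, Long)] produced by the
// map -> reduceByKey stages on a 10s batch interval.
val exported = counts
  // Tumbling 60s window: window length == slide interval, so each
  // record is exported exactly once, six batches at a time.
  .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(60))
  .filter { case (_, count) => count > threshold }

exported.foreachRDD { rdd =>
  // One bulk write per window instead of one per 10s batch.
  // saveToRedshift is a placeholder for a JDBC/COPY-based writer.
  saveToRedshift(rdd)
}
```

Note that the window duration must be a multiple of the batch interval (60s over 10s batches works).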

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-09 Thread Eran Chinthaka Withana
Hi, I'm also having this issue and cannot get the tasks to work inside Mesos. In my case, the spark-submit command is the following. $SPARK_HOME/bin/spark-submit \ --class com.mycompany.SparkStarter \ --master mesos://mesos-dispatcher:7077 \ --name SparkStarterJob \ --driver-memory 1G \

Re: Dynamic allocation doesn't work on YARN

2016-03-09 Thread Saisai Shao
Still I think this information is not enough to explain the reason. 1. Does your YARN cluster have enough resources to start all 10 executors? 2. Would you please try the latest version, 1.6.0 or the master branch, to see if this is a bug that has already been fixed? 3. You could add

Fwd: Dynamic allocation doesn't work on YARN

2016-03-09 Thread Jy Chen
Sorry, the last configuration is also --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=60s; "--conf" was lost when I copied it to the mail. -- Forwarded message -- From: Jy Chen Date: 2016-03-10 10:09 GMT+08:00 Subject: Re: Dynamic allocation doesn't

Re: Dynamic allocation doesn't work on YARN

2016-03-09 Thread Jy Chen
Hi, My Spark version is 1.5.1 with Hadoop 2.5.0-cdh5.2.0. These are my configurations of dynamic allocation: --master yarn-client --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=0 --conf
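For reference, a typical spark-submit flag set for dynamic allocation on YARN looks like the following. The truncated list above presumably continues in this style; the values below are illustrative, not the poster's actual settings:

```
spark-submit \
  --master yarn-client \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=0 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  ...
```

The external shuffle service must also be started on each NodeManager (spark.shuffle.service.enabled alone is not enough) for executors to be removable without losing shuffle data.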

Re: How to obtain JavaHBaseContext to connection SparkStreaming with HBase

2016-03-09 Thread Ted Yu
bq. Question is how to get maven repository As you may have noted, version has SNAPSHOT in it. Please checkout latest code from master branch and build it yourself. 2.0 release is still a few months away - though backport of hbase-spark module should come in 1.3 release. On Wed, Mar 9, 2016 at

Re: spark streaming doesn't pick new files from HDFS

2016-03-09 Thread Ted Yu
bq. drwxr-xr-x - tomcat7 supergroup 0 2016-03-09 23:16 /tmp/swg If I read the above line correctly, the size of the file was 0. On Wed, Mar 9, 2016 at 10:00 AM, srimugunthan dhandapani < srimugunthan.dhandap...@gmail.com> wrote: > Hi all > I am working in cloudera CDH5.6 and

Re: binary file deserialization

2016-03-09 Thread Andy Sloane
We ended up implementing custom Hadoop InputFormats and RecordReaders by extending FileInputFormat / RecordReader, and using sc.newAPIHadoopFile to read it as an RDD. On Wed, Mar 9, 2016 at 9:15 AM Ruslan Dautkhanov wrote: > We have a huge binary file in a custom
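The pattern described, extending FileInputFormat/RecordReader and reading the result via sc.newAPIHadoopFile, looks roughly like this. MyRecordInputFormat and parseRecord are hypothetical names; the real work lives in the custom RecordReader, which must parse one length-prefixed record per call:

```scala
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.spark.{SparkConf, SparkContext}

// Assume MyRecordInputFormat extends
// org.apache.hadoop.mapreduce.lib.input.FileInputFormat[LongWritable, BytesWritable]
// and its RecordReader reads the length header, then emits the record's raw bytes.

val sc = new SparkContext(new SparkConf().setAppName("binary-read"))

val records = sc.newAPIHadoopFile(
  "hdfs:///data/custom.bin",
  classOf[MyRecordInputFormat], // hypothetical custom InputFormat
  classOf[LongWritable],        // key: record offset in the file
  classOf[BytesWritable])       // value: raw record bytes

// copyBytes() trims the BytesWritable buffer to the record's actual length.
val parsed = records.map { case (_, bytes) => parseRecord(bytes.copyBytes()) }
```

For variable-length records the InputFormat typically also overrides isSplitable to return false, unless record boundaries can be found from an arbitrary split offset.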

Re: binary file deserialization

2016-03-09 Thread Ted Yu
bq. there is a varying number of items for that record If the combination of items is very large, using case class would be tedious. On Wed, Mar 9, 2016 at 9:57 AM, Saurabh Bajaj wrote: > You can load that binary up as a String RDD, then map over that RDD and > convert

Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
That’s very strange. I just un-set my SPARK_HOME env param, downloaded a fresh 1.6.0 tarball, unzipped it to local dir (~/Downloads), and it ran just fine - the driver port is some randomly generated large number. So SPARK_HOME is definitely not needed to run this. Aida, you are not running

Re: Installing Spark on Mac

2016-03-09 Thread Aida Tefera
Hi Jakob, Tried running the command env|grep SPARK; nothing comes back. Tried env|grep Spark, which is the directory I created for Spark once I downloaded the tgz file; it comes back with PWD=/Users/aidatefera/Spark. Tried running ./bin/spark-shell; it comes back with the same error as below, i.e. could

Re: Installing Spark on Mac

2016-03-09 Thread Jakob Odersky
Sorry had a typo in my previous message: > try running just "/bin/spark-shell" please remove the leading slash (/) On Wed, Mar 9, 2016 at 1:39 PM, Aida Tefera wrote: > Hi there, tried echo $SPARK_HOME but nothing comes back so I guess I need to > set it. How would I do

Re: Installing Spark on Mac

2016-03-09 Thread Jakob Odersky
As Tristan mentioned, it looks as though Spark is trying to bind on port 0 and then 1 (which is not allowed). Could it be that some environment variables from your previous installation attempts are polluting your configuration? What does running "env | grep SPARK" show you? Also, try running just

Re: Installing Spark on Mac

2016-03-09 Thread Aida Tefera
Hi Tristan, my apologies, I meant to write Spark and not SCALA. I feel a bit lost at the moment... Perhaps I have missed steps that are implicit to more experienced people. Apart from downloading spark and then following Jakob's steps: 1.

Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
SPARK_HOME and SCALA_HOME are different. I was just wondering whether spark is looking in a different dir for the config files than where you’re running it. If you have not set SPARK_HOME, it should look in the current directory for the /conf dir. The defaults should be relatively safe, I’ve

Re: Installing Spark on Mac

2016-03-09 Thread Aida Tefera
I don't think I set the SCALA_HOME environment variable. Also, I'm unsure whether or not I should launch the scripts; it defaults to a single machine (localhost). Sent from my iPhone > On 9 Mar 2016, at 19:59, Tristan Nixon wrote: > > Also, do you have the SPARK_HOME

Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
Also, do you have the SPARK_HOME environment variable set in your shell, and if so what is it set to? > On Mar 9, 2016, at 1:53 PM, Tristan Nixon wrote: > > There should be a /conf sub-directory wherever you installed spark, which > contains several configuration files.

Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
There should be a /conf sub-directory wherever you installed spark, which contains several configuration files. I believe that the two that you should look at are spark-defaults.conf spark-env.sh > On Mar 9, 2016, at 1:45 PM, Aida Tefera wrote: > > Hi Tristan, thanks
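If in doubt about which settings are actually in effect, a quick diagnostic from the Spark install directory is to dump the non-comment lines of both files and any SPARK_* environment variables (output varies per machine; a fresh download should show nothing but templates):

```
$ cat conf/spark-env.sh conf/spark-defaults.conf 2>/dev/null | grep -v '^#'
$ env | grep SPARK
```

Anything either command prints is overriding Spark's built-in defaults.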

Re: spark 1.6.0 connect to hive metastore

2016-03-09 Thread Suniti Singh
Spark 1.6.0 in embedded mode doesn't connect to the Hive metastore -- https://issues.apache.org/jira/browse/SPARK-9686 https://forums.databricks.com/questions/6512/spark-160-not-able-to-connect-to-hive-metastore.html On Wed, Mar 9, 2016 at 10:48 AM, Suniti Singh wrote: > Hi, > > I

Re: Request to add a new book to the Books section on Spark's website

2016-03-09 Thread Sean Owen
Oh yeah I already added it after your earlier message, have a look. On Wed, Mar 9, 2016 at 7:45 PM, Mohammed Guller wrote: > My book on Spark was recently published. I would like to request it to be > added to the Books section on Spark's website. > > > > Here are the

Request to add a new book to the Books section on Spark's website

2016-03-09 Thread Mohammed Guller
My book on Spark was recently published. I would like to request it to be added to the Books section on Spark's website. Here are the details about the book. Title: Big Data Analytics with Spark Author: Mohammed Guller Link:

Re: Use cases for kafka direct stream messageHandler

2016-03-09 Thread Cody Koeninger
Yeah, to be clear, I'm talking about having only one constructor for a direct stream, that will give you a stream of ConsumerRecord. Different needs for topic subscription, starting offsets, etc could be handled by calling appropriate methods after construction but before starting the stream.

Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
Yeah, according to the standalone documentation http://spark.apache.org/docs/latest/spark-standalone.html the default port should be 7077, which means that something must be overriding this on your installation - look to the conf scripts! > On Mar 9, 2016, at 1:26 PM, Tristan Nixon

Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
Looks like it’s trying to bind on port 0, then 1. Often the low-numbered ports are restricted to system processes and “established” servers (web, ssh, etc.) and so user programs are prevented from binding on them. The default should be to run on a high-numbered port like 8080 or such. What do

Re: Use cases for kafka direct stream messageHandler

2016-03-09 Thread Alan Braithwaite
I'd probably prefer to keep it the way it is, unless it's becoming more like the function without the messageHandler argument. Right now I have code like this, but I wish it were more similar looking: if (parsed.partitions.isEmpty()) { JavaPairInputDStream

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-09 Thread Andy Davidson
Hi Ted and Saurabh, If I use --conf arguments with pyspark I am able to connect. Any idea how I can set these values programmatically? (I work on a notebook server and cannot easily reconfigure the server.) This works: extraPkgs="--packages com.databricks:spark-csv_2.11:1.3.0 \
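For notebook setups where the server cannot be reconfigured, one common approach is to set PYSPARK_SUBMIT_ARGS in the environment before the SparkContext is created; pyspark picks it up at startup. The package coordinates and host below mirror the thread and are illustrative, and the string must end in "pyspark-shell":

```
export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.11:1.3.0 \
  --conf spark.cassandra.connection.host=192.168.1.126 pyspark-shell"
```

This only helps settings read at context creation; connection-level options like the Cassandra host cannot be changed on an already-running SparkContext.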

Re: Installing Spark on Mac

2016-03-09 Thread Aida Tefera
Hi Jakob, Thanks for your suggestion. I downloaded a pre built version with Hadoop and followed your steps I posted the result on the forum thread, not sure if you can see it? I was just wondering whether this means it has been successfully installed as there are a number of warning/error

Re: Specify log4j properties file

2016-03-09 Thread Tristan Nixon
You can also package an alternative log4j config in your jar files > On Mar 9, 2016, at 12:20 PM, Ashic Mahtab wrote: > > Found it. > > You can pass in the jvm parameter log4j.configuration. The following works: > > -Dlog4j.configuration=file:path/to/log4j.properties > > It

Re: spark 1.6.0 connect to hive metastore

2016-03-09 Thread Suniti Singh
Hi, I am able to reproduce this error only when using Spark 1.6.0 and Hive 1.6.0. The hive-site.xml is in the classpath, but somehow Spark rejects the classpath search for hive-site.xml and starts using the default Derby metastore. 16/03/09 10:37:52 INFO MetaStoreDirectSql: Using direct SQL,

Re: Installing Spark on Mac

2016-03-09 Thread Aida
Hi everyone, thanks for all your support I went with your suggestion Cody/Jakob and downloaded a pre-built version with Hadoop this time and I think I am finally making some progress :) ukdrfs01:spark-1.6.0-bin-hadoop2.6 aidatefera$ ./bin/spark-shell --master local[2] log4j:WARN No appenders

Re: Specify log4j properties file

2016-03-09 Thread Matt Narrell
You can also use --files, which doesn't require the file scheme. On Wed, Mar 9, 2016 at 11:20 AM Ashic Mahtab wrote: > Found it. > > You can pass in the jvm parameter log4j.configuration. The following works: > > -Dlog4j.configuration=file:path/to/log4j.properties > > It doesn't

RE: Specify log4j properties file

2016-03-09 Thread Ashic Mahtab
Found it. You can pass in the JVM parameter log4j.configuration. The following works: -Dlog4j.configuration=file:path/to/log4j.properties It doesn't work without the file: prefix though. Tested in 1.6.0. Cheers, Ashic. From: as...@live.com To: user@spark.apache.org Subject: Specify log4j
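When submitting to a cluster, the JVM parameter has to reach each JVM separately, which in practice means passing it through spark-submit; combined with --files (as mentioned elsewhere in this thread) the executors can reference the shipped copy by bare name. Paths, class and jar names below are placeholders:

```
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --files /path/to/log4j.properties \
  --class com.example.Main app.jar
```

The driver uses the absolute local path; the executors use the relative name because --files places the file in each executor's working directory.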

Re: S3 Zip File Loading Advice

2016-03-09 Thread Xinh Huynh
Could you wrap the ZipInputStream in a List, since a subtype of TraversableOnce[?] is required? case (name, content) => List(new ZipInputStream(content.open)) Xinh On Wed, Mar 9, 2016 at 7:07 AM, Benjamin Kim wrote: > Hi Sabarish, > > I found a similar posting online where
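Going one step further, here is how the wrapped stream can be consumed end to end with sc.binaryFiles, which yields (path, PortableDataStream) pairs; the S3 path is a placeholder, and the first-entry assumption matches the single-file archives described in this thread:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream

// binaryFiles returns RDD[(String, PortableDataStream)]; flatMap needs a
// TraversableOnce, so we return an Iterator of lines per archive.
val lines = sc.binaryFiles("s3n://bucket/path/*.zip").flatMap {
  case (name, content: PortableDataStream) =>
    val zis = new ZipInputStream(content.open())
    zis.getNextEntry // single-file archive: position at the first (only) entry
    val reader = new BufferedReader(new InputStreamReader(zis))
    Iterator.continually(reader.readLine()).takeWhile(_ != null)
}
```

Because zip is not splittable, each archive is read by a single task; many small zips parallelize better than one large one.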

spark streaming doesn't pick new files from HDFS

2016-03-09 Thread srimugunthan dhandapani
Hi all, I am working in Cloudera CDH5.6 and the version of Spark is 1.5.0-cdh5.6.0. I have a strange problem: spark streaming works on a directory in the local filesystem but doesn't work for HDFS. My spark streaming program: package com.oreilly.learningsparkexamples.java; import

Specify log4j properties file

2016-03-09 Thread Ashic Mahtab
Hello, Is it possible to provide a log4j properties file when submitting jobs to a cluster? I know that by default Spark looks for a log4j.properties file in the conf directory. I'm looking for a way to specify a different log4j.properties file (external to the application) without pointing to a

Re: binary file deserialization

2016-03-09 Thread Saurabh Bajaj
You can load that binary up as a String RDD, then map over that RDD and convert each row to your case class representing the data. In the map stage you could also map the input string into an RDD of JSON values and use the following function to convert it into a DF

Re: reading the parquet file

2016-03-09 Thread Xinh Huynh
You might want to avoid that unionAll(), which seems to be repeated over 1000 times. Could you do a collect() in each iteration, and collect your results in a local Array instead of a DataFrame? How many rows are returned in "temp1"? Xinh On Tue, Mar 8, 2016 at 10:00 PM, Angel Angel
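The suggestion amounts to collecting each iteration's small result locally instead of growing a DataFrame lineage with a thousand unionAll calls. A sketch, where `keys` and the filter condition stand in for whatever the original loop computes:

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

val results = ArrayBuffer[Row]()
for (key <- keys) {                        // hypothetical iteration
  val temp1 = df.filter(df("id") === key)  // small per-iteration result
  results ++= temp1.collect()              // accumulate locally, no unionAll
}

// Build a single DataFrame at the end if one is still needed:
val all = sqlContext.createDataFrame(sc.parallelize(results), df.schema)
```

This keeps the query plan constant-sized; the repeated-unionAll version makes the plan (and its optimization time) grow with every iteration.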

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Koert Kuipers
i tried with avro input, something like /data/txn_*/* and it works for me On Wed, Mar 9, 2016 at 12:12 PM, Ted Yu wrote: > Koert: > I meant org.apache.hadoop.mapred.FileInputFormat doesn't support multi > level wildcard. > > Cheers > > On Wed, Mar 9, 2016 at 8:22 AM, Koert

binary file deserialization

2016-03-09 Thread Ruslan Dautkhanov
We have a huge binary file in a custom serialization format (e.g. the header tells the length of the record, then there is a varying number of items for that record). This is produced by an old C++ application. What would be the best approach to deserialize it into a Hive table or a Spark RDD? The format is

How to obtain JavaHBaseContext to connection SparkStreaming with HBase

2016-03-09 Thread Rachana Srivastava
I am trying to integrate SparkStreaming with HBase. I am calling following APIs to connect to HBase HConnection hbaseConnection = HConnectionManager.createConnection(conf); hBaseTable = hbaseConnection.getTable(hbaseTable); Since I cannot get the connection and broadcast the connection each

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Ted Yu
Koert: I meant org.apache.hadoop.mapred.FileInputFormat doesn't support multi level wildcard. Cheers On Wed, Mar 9, 2016 at 8:22 AM, Koert Kuipers wrote: > i use multi level wildcard with hadoop fs -ls, which is the exact same > glob function call > > On Wed, Mar 9, 2016 at

Re: spark 1.6.0 connect to hive metastore

2016-03-09 Thread Dave Maughan
Hi, We're having a similar issue. We have a standalone cluster running 1.5.2 with Hive working fine having dropped hive-site.xml into the conf folder. We've just updated to 1.6.0, using the same configuration. Now when starting a spark-shell we get the following: java.lang.RuntimeException:

Multiple Spark taks with Akka FSM

2016-03-09 Thread Andrés Ivaldi
Hello, I'd like to know if this architecture is correct or not. We are studying Spark as our ETL engine; we have a UI designer for the graph, which gives us a model that we want to translate into the corresponding Spark executions. That brings us to Akka FSM: using the same sparkContext for all actors,

Re: Multiple Spark taks with Akka FSM

2016-03-09 Thread Andrés Ivaldi
My mistake, it's not Akka FSM, it's Akka Flow Graphs. On Wed, Mar 9, 2016 at 1:46 PM, Andrés Ivaldi wrote: > Hello, > > I'd like to know if this architecture is correct or not. We are studying > Spark as our ETL engine, we have a UI designer for the graph, this give us > a

Re: Streaming job delays

2016-03-09 Thread Juan Leaniz
Hi, the batch interval is 5min. I actually managed to fix the issue by turning off dynamic allocation and the external shuffle service. This seems to have helped, and now the scheduling delay is between 0-5ms and processing time is about 2.8min, which is lower than my batch interval. I also noticed

Re: HBASE

2016-03-09 Thread Mich Talebzadeh
I agree with Ted's assessment. Big Data space is getting crowded with an amazing array of tools and utilities some disappearing like meteors. Hadoop is definitely a keeper. So are Hive and Spark. Hive is the most stable Data Warehouse on Big Data and Spark is offering an array of impressive

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Koert Kuipers
i use multi level wildcard with hadoop fs -ls, which is the exact same glob function call On Wed, Mar 9, 2016 at 9:24 AM, Ted Yu wrote: > Hadoop glob pattern doesn't support multi level wildcard. > > Thanks > > On Mar 9, 2016, at 6:15 AM, Koert Kuipers

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Christophe Préaud
Hi, Unless I've misunderstood what you want to achieve, you could use: sqlContext.read.json(sc.textFile("/mnt/views-p/base/2016/01/*/*-xyz.json")) Regards, Christophe. On 09/03/16 15:24, Ted Yu wrote: Hadoop glob pattern doesn't support multi level wildcard. Thanks On Mar 9, 2016, at 6:15 AM,

RE: How to add a custom jar file to the Spark driver?

2016-03-09 Thread Gerhard Fiedler
Hi guys, Thanks for responding. Re SPARK_CLASSPATH (Daoyuan): I think you are right. We tried it, and that’s what the warning we got said. Re SparkConf (Daoyuan): We need the custom jar in the driver code, so I don’t know how that would work. Re EMR -u (Sonal): The documentation says that

Re: HBASE

2016-03-09 Thread Ted Yu
bq. it is kind of columnar NoSQL database. The storage format in HBase is not columnar. I would suggest you build upon what you already know (Spark and Hive) and expand on that. Also, if your work uses Big Data technologies, those would be the first to consider getting to know better. On Wed,

Re: S3 Zip File Loading Advice

2016-03-09 Thread Benjamin Kim
Hi Sabarish, I found a similar posting online where I should use the S3 listKeys. http://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd. Is this what you were thinking? And, your assumption is correct. The zipped CSV file contains only a single file. I

HBASE

2016-03-09 Thread Ashok Kumar
Hi Gurus, I am relatively new to Big Data and know a bit about Spark and Hive. I was wondering whether I need to pick up skills on HBase as well. I am not sure how it works but know that it is a kind of columnar NoSQL database. I know it is good to know something new in the Big Data space. Just wondering if

Re: [Streaming + MLlib] Is it only Linear regression supported by online learning?

2016-03-09 Thread diplomatic Guru
Could someone verify this for me? On 8 March 2016 at 14:06, diplomatic Guru wrote: > Hello all, > > I'm using Random Forest for my machine learning (batch), I would like to > use online prediction using Streaming job. However, the document only > states linear

Re: How to use graphx to partition a graph which could assign topologically-close vertices on a same machine?

2016-03-09 Thread Robineast
In GraphX, partitioning relates to edges, not to vertices - vertices are partitioned according to however the RDD that was used to create the graph was partitioned. - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co.

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Ted Yu
Hadoop glob pattern doesn't support multi level wildcard. Thanks > On Mar 9, 2016, at 6:15 AM, Koert Kuipers wrote: > > if its based on HadoopFsRelation shouldn't it support it? HadoopFsRelation > handles globs > >> On Wed, Mar 9, 2016 at 8:56 AM, Ted Yu

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Koert Kuipers
if it's based on HadoopFsRelation, shouldn't it support it? HadoopFsRelation handles globs On Wed, Mar 9, 2016 at 8:56 AM, Ted Yu wrote: > This is currently not supported. > > On Mar 9, 2016, at 4:38 AM, Jakub Liska wrote: > > Hey, > > is something

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Ted Yu
This is currently not supported. > On Mar 9, 2016, at 4:38 AM, Jakub Liska wrote: > > Hey, > > is something like this possible? > > sqlContext.read.json("/mnt/views-p/base/2016/01/*/*-xyz.json") > > I switched to DataFrames because my source files changed from TSV to

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-09 Thread Ted Yu
I am not Spark committer. So I cannot be the shepherd :-) > On Mar 9, 2016, at 2:27 AM, James Hammerton wrote: > > Hi Ted, > > Finally got round to creating this: > https://issues.apache.org/jira/browse/SPARK-13773 > > I hope you don't mind me selecting you as the shepherd

Re: Spark ML - Scaling logistic regression for many features

2016-03-09 Thread Nick Pentreath
Hi Daniel The bottleneck in Spark ML is most likely (a) the fact that the weight vector itself is dense, and (b) the related communication via the driver. A tree aggregation mechanism is used for computing gradient sums (see

DataFrame support for hadoop glob patterns

2016-03-09 Thread Jakub Liska
Hey, is something like this possible? sqlContext.read.json("/mnt/views-p/base/2016/01/*/*-xyz.json") I switched to DataFrames because my source files changed from TSV to JSON but now I'm not able to load the files as I did before. I get this error if I try that :

Re: Streaming job delays

2016-03-09 Thread Matthias Niehoff
Hi, what's your batch interval? If the processing time is consistently bigger than your batch interval, it is totally normal that your scheduling delay goes up. 2016-03-08 23:28 GMT+01:00 jleaniz : > Hi, > > I have a streaming application that reads batches from Flume,

Re: Installing Spark on Mac

2016-03-09 Thread Steve Loughran
> On 8 Mar 2016, at 18:06, Aida wrote: > > Detected Maven Version: 3.0.3 is not in the allowed range 3.3.3. I'd look at that error message and fix it - To unsubscribe, e-mail:

Re: Spark on RAID

2016-03-09 Thread Steve Loughran
On 8 Mar 2016, at 16:34, Eddie Esquivel > wrote: Hello All, In the Spark documentation under "Hardware Requirements" it very clearly states: We recommend having 4-8 disks per node, configured without RAID (just as separate mount

Re: How to add a custom jar file to the Spark driver?

2016-03-09 Thread Sonal Goyal
Hi Gerhard, I just stumbled upon some documentation on EMR - link below. Seems there is a -u option to add jars in S3 to your classpath, have you tried that ? http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html Best Regards, Sonal Founder, Nube

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-09 Thread James Hammerton
Hi Ted, Finally got round to creating this: https://issues.apache.org/jira/browse/SPARK-13773 I hope you don't mind me selecting you as the shepherd for this ticket. Regards, James On 7 March 2016 at 17:50, James Hammerton wrote: > Hi Ted, > > Thanks for getting back - I

Re: updating the Books section on the Spark documentation page

2016-03-09 Thread Sean Owen
No, the site itself (the part not under docs/) is in the ASF SVN repo. A PR wouldn't help. Just request here or dev@ with details. On Wed, Mar 9, 2016 at 7:30 AM, Jan Štěrba wrote: > You could try creating a pull-request on github. > > -Jan > -- > Jan Sterba >

Re: How to display the web ui when running Spark on YARN?

2016-03-09 Thread Shady Xu
Thanks for the reply. I am now trying to configure yarn.web-proxy.address according to https://issues.apache.org/jira/browse/SPARK-5837, but cannot start the standalone web proxy server. I am using CDH 5.0.1 and below is the error log: sbin/yarn-daemon.sh: line 44:

Re: Dynamic allocation doesn't work on YARN

2016-03-09 Thread Saisai Shao
Would you please send out the configurations of dynamic allocation so we could know better. On Wed, Mar 9, 2016 at 4:29 PM, Jy Chen wrote: > Hello everyone: > > I'm trying the dynamic allocation in Spark on YARN. I have followed > configuration steps and started the

[Error]Run Spark job as hdfs user from oozie workflow

2016-03-09 Thread Divya Gehlot
Hi, I have a non-secure Hadoop 2.7.2 cluster on EC2 with Spark 1.5.2. When I am submitting my Spark Scala script through a shell script using an Oozie workflow, I am submitting the job as the hdfs user but it is running as user = "yarn", so all the output gets stored under the user/yarn directory only. When

Re: No event log in /tmp/spark-events

2016-03-09 Thread Yu Xin
Hi Andrew, thanks for the reply. I run SparkPi in a cluster with 1 master + 2 slaves based on YARN; I did not specify the client mode so I think it should be in client mode. I checked the console log and did not find the EventLoggingListener keyword. *Seems the spark-defaults.conf is not passed correctly.* The

Re: S3 Zip File Loading Advice

2016-03-09 Thread Jörn Franke
Oozie may be able to do this for you and integrate with Spark. > On 09 Mar 2016, at 06:03, Benjamin Kim wrote: > > I am wondering if anyone can help. > > Our company stores zipped CSV files in S3, which has been a big headache from > the start. I was wondering if anyone

Fwd: Dynamic allocation doesn't work on YARN

2016-03-09 Thread Jy Chen
Hello everyone: I'm trying dynamic allocation in Spark on YARN. I have followed the configuration steps and started the shuffle service. Now it can request executors when the workload is heavy, but it cannot remove executors. I try to open the spark shell and don't run any command; no executor is

Re: Saving multiple outputs in the same job

2016-03-09 Thread Jeff Zhang
Spark will skip the stage if it is computed by other jobs. That means the common parent RDD of each job only needs to be computed once. But it is still multiple sequential jobs, not concurrent jobs. On Wed, Mar 9, 2016 at 3:29 PM, Jan Štěrba wrote: > Hi Andy, > > its nice to

Re: S3 Zip File Loading Advice

2016-03-09 Thread Sabarish Sasidharan
You can use S3's listKeys API and do a diff between consecutive listKeys to identify what's new. Are there multiple files in each zip? Single file archives are processed just like text as long as it is one of the supported compression formats. Regards Sab On Wed, Mar 9, 2016 at 10:33 AM,