Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread Sanjay Subramanian
Hey guys. On the Hive/Hadoop ecosystem we are using (Cloudera distribution CDH 5.2.x) there are about 300+ Hive tables. The data is stored as text (moving slowly to Parquet) on HDFS. I want to use SparkSQL, point it to the Hive metadata, and be able to define JOINs etc. using a programming structure
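
A minimal sketch of the usual approach, assuming hive-site.xml (pointing at the CDH metastore) is on Spark's classpath; the database, table, and column names here are made up:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc) // sc: the existing SparkContext
    // Tables registered in the Hive metastore are queryable directly, JOINs included.
    val joined = hiveContext.sql(
      "SELECT a.id, b.label FROM db.table_a a JOIN db.table_b b ON a.id = b.id")
    joined.show()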

Re: zip files submitted with --py-files disappear from hdfs after a while on EMR

2015-05-16 Thread jaredtims
Any resolution to this? I am having the same problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/zip-files-submitted-with-py-files-disappear-from-hdfs-after-a-while-on-EMR-tp22342p22919.html Sent from the Apache Spark User List mailing list archive at Nabble.com

RE: Running Spark/YARN on AWS EMR - Issues finding file on hdfs?

2015-05-16 Thread jaredtims
Any resolution to this? I'm having the same problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-YARN-on-AWS-EMR-Issues-finding-file-on-hdfs-tp10214p22918.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-16 Thread Nisrina Luthfiyati
Hi Ayan and Helena, I've considered using Cassandra/HBase but ended up opting to save to worker HDFS because I want to take advantage of data locality, since the data will often be loaded into Spark for further processing. I was also under the impression that saving to the filesystem (instead

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-16 Thread Helena Edelson
…and delegate the "update" part to them. > On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati <nisrina.luthfiy...@gmail.com> wrote: > Hi all, I have a stream of data from Kafka that I want to process and store in hdfs using Spark Streaming.

Re: multiple hdfs folder & files input to PySpark

2015-05-15 Thread Oleg Ruchovets
r: An error occurred while calling o30.partitions. : org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sdo-hdp-bd-master1.development.c4i:8020/user/hdfs/ /input/lprs/2015_05_14/file3.csv Input path does not exist: hdfs://sdo-hdp-bd-master1.development.c4i:8020/us

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-15 Thread ayan guha
> Hi all, I have a stream of data from Kafka that I want to process and store in hdfs using Spark Streaming. Each data has a date/time dimension and I want to write data within the same time dimension to the same hdfs directory. The data stream might be unordered (by time

Grouping and storing unordered time series data stream to HDFS

2015-05-15 Thread Nisrina Luthfiyati
Hi all, I have a stream of data from Kafka that I want to process and store in hdfs using Spark Streaming. Each data has a date/time dimension and I want to write data within the same time dimension to the same hdfs directory. The data stream might be unordered (by time dimension). I'm wond

Required settings for permanent HDFS Spark on EC2

2015-05-12 Thread darugar
Hello, I have Spark 1.3.1 running well on EC2 with ephemeral hdfs using the spark-ec2 script, quite happy with it. I want to switch to persistent-hdfs in order to be able to maintain data between cluster stop/starts. Unfortunately spark-ec2 stop/start causes Spark to revert back from persistent
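
As I recall the spark-ec2 layout of that era (worth verifying on the AMI), the two HDFS installs live side by side under /root, and persistent-hdfs listens on its own port:

    # on the master node
    /root/ephemeral-hdfs/bin/stop-dfs.sh     # stop the ephemeral HDFS
    /root/persistent-hdfs/bin/start-dfs.sh   # start the EBS-backed HDFS
    # then address data explicitly on the persistent namenode, e.g.
    #   hdfs://<master-host>:9010/path/to/data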

Re: Spark can not access jar from HDFS !!

2015-05-11 Thread Ravindra
hiveContext as given below - scala> hiveContext.sql("CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper' USING JAR 'hdfs:///users/ravindra/customUDF2.jar'") I

Re: Spark can not access jar from HDFS !!

2015-05-10 Thread Ravindra
…trying to create custom udfs with hiveContext as given below - scala> hiveContext.sql("CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper' USING JAR 'hdfs:///users/ravindra/customUDF2.jar'") I have put th

Re: Spark can not access jar from HDFS !!

2015-05-09 Thread Michael Armbrust
"CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper' USING JAR 'hdfs:///users/ravindra/customUDF2.jar'") I have put the udf jar in the hdfs at the path given above. The same command works well in the hive shell but failing here

Spark can not access jar from HDFS !!

2015-05-09 Thread Ravindra
Hi All, I am trying to create custom udfs with hiveContext as given below - scala> hiveContext.sql ("CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper' USING JAR 'hdfs:///users/ravindra/customUDF2.jar'") I have put the udf jar in the hdfs
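
A commonly suggested workaround (a sketch, not this thread's confirmed resolution) is to ship the jar with the application so the class is already on the classpath, and drop the USING JAR clause; the class and jar names below are the ones from the thread:

    // fetch the jar locally and launch with it on the classpath, e.g.
    //   hdfs dfs -get /users/ravindra/customUDF2.jar .
    //   spark-shell --jars customUDF2.jar
    hiveContext.sql(
      "CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper'")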

Re: Using spark streaming to load data from Kafka to HDFS

2015-05-06 Thread Rendy Bambang Junior
…refer to http://kafka.apache.org/081/documentation.html#kafkahadoopconsumerapi 2015-05-06 12:22 GMT+08:00 MrAsanjar . : > why not try https://github.com/linkedin/camus - Camus is a Kafka-to-HDFS pipeline > On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior <

Re: Using spark streaming to load data from Kafka to HDFS

2015-05-06 Thread Saisai Shao
Also Kafka has a Hadoop consumer API for doing such things, please refer to http://kafka.apache.org/081/documentation.html#kafkahadoopconsumerapi 2015-05-06 12:22 GMT+08:00 MrAsanjar . : > why not try https://github.com/linkedin/camus - Camus is a Kafka-to-HDFS pipeline > On Tue,

Re: multiple hdfs folder & files input to PySpark

2015-05-05 Thread Ai He
…-sparkcontext-textfile). Thanks > On May 5, 2015, at 5:59 AM, Oleg Ruchovets wrote: > Hi, we are using pyspark 1.3 and input is text files located on hdfs. > file structure > file1.txt > file2.txt >

Re: Using spark streaming to load data from Kafka to HDFS

2015-05-05 Thread MrAsanjar .
why not try https://github.com/linkedin/camus - Camus is a Kafka-to-HDFS pipeline. On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior <rendy.b.jun...@gmail.com> wrote: > Hi all, I am planning to load data from Kafka to HDFS. Is it normal to use spark streaming to load

Using spark streaming to load data from Kafka to HDFS

2015-05-05 Thread Rendy Bambang Junior
Hi all, I am planning to load data from Kafka to HDFS. Is it normal to use Spark Streaming to load data from Kafka to HDFS? What are the concerns in doing this? There is no processing to be done by Spark, only storing data from Kafka to HDFS for storage and for further Spark processing. Rendy
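
For reference, a minimal sketch of the pattern being asked about, against the Spark 1.x streaming API; the ZooKeeper quorum, consumer group, topic, and output path are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaToHdfs")
    val ssc = new StreamingContext(conf, Seconds(60))

    // (zkQuorum, group, topic -> receiver threads) are placeholder values
    val lines = KafkaUtils.createStream(
      ssc, "zk1:2181", "hdfs-loader", Map("events" -> 1)).map(_._2)

    // writes one directory per batch: events-<batch timestamp>.txt
    lines.saveAsTextFiles("hdfs:///data/kafka/events", "txt")

    ssc.start()
    ssc.awaitTermination()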

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
…data node or make minreplication 0. Hdfs is trying to replicate at least one more copy and is not able to find another DN to do that. > On 6 May 2015 09:37, "Sudarshan Murty" wrote: > Another thing - could it be a permission problem? It creates al

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread ayan guha
Try to add one more data node or make minreplication 0. Hdfs is trying to replicate at least one more copy and is not able to find another DN to do that. On 6 May 2015 09:37, "Sudarshan Murty" wrote: > Another thing - could it be a permission problem? > It creates all the directo
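
On a one-datanode cluster the relevant knob is the default block replication in hdfs-site.xml (a sketch; note that 1, not 0, is the lowest sensible value):

    <!-- hdfs-site.xml: with a single DataNode, request a single replica -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>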

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
…seem to indicate that the system is aware that a datanode exists but is excluded from the operation. So, it looks like it is not partitioned and Ambari indicates that HDFS is in good health with one NN, one SN, one DN. I am unable to figure out what the issue is. thanks fo

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
- which seem to indicate that the system is aware that a datanode exists but is excluded from the operation. So, it looks like it is not partitioned and Ambari indicates that HDFS is in good health with one NN, one SN, one DN. I am unable to figure out what the issue is. thanks for your help. On Tue, May

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread ayan guha
What happens when you try to put files into your hdfs from the local filesystem? Looks like it's an hdfs issue rather than a spark thing. On 6 May 2015 05:04, "Sudarshan" wrote: > I have searched all replies to this question & not found an answer. > I am running standalone Sp

saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan
I have searched all replies to this question & not found an answer. I am running standalone Spark 1.3.1 and Hortonworks' HDP 2.2 VM, side by side, on the same machine, and trying to write output of the wordcount program into HDFS (works fine writing to a local file, /tmp/wordcount). Only line I

multiple hdfs folder & files input to PySpark

2015-05-05 Thread Oleg Ruchovets
Hi, we are using pyspark 1.3 and input is text files located on hdfs. File structure: file1.txt, file2.txt, file1.txt, file2.txt, ... Question: 1) What is the way to provide as input for a PySpark job multiple
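
The usual answer here is that textFile accepts comma-separated paths and globs in one call (shown in Scala as a sketch; the PySpark sc.textFile takes the same path string):

    // paths are placeholders; globs and comma-separated lists both work
    val rdd = sc.textFile(
      "hdfs:///input/lprs/2015_05_14/*.csv,hdfs:///input/lprs/2015_05_15/*.csv")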

Spark + Mesos + HDFS resource split

2015-04-27 Thread Ankur Chauhan
Hi, I am building a mesos cluster for the purposes of using it to run spark workloads (in addition to other frameworks). I am under the impression that it is preferable/recommended to run hdfs datanode process, spark slave on the same physical node

Re: Running spark over HDFS

2015-04-21 Thread madhvi
…my system having hadoop cluster. I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial job has not accepted any resources

Re: Running spark over HDFS

2015-04-21 Thread Akhil Das
… Thanks, Best Regards. On Mon, Apr 20, 2015 at 12:22 PM, madhvi wrote: > Hi All, I am new to spark and have installed spark cluster over my system having hadoop cluster. I want to process d

Re: writing to hdfs on master node much faster

2015-04-20 Thread Tamas Jambor
Not sure what would slow it down, as the repartition completes equally fast on all nodes, implying that the data is available on all; then there are a few computation steps, none of them local to the master. On Mon, Apr 20, 2015 at 12:57 PM, Sean Owen wrote: > What machines are HDFS data no

RE: writing to hdfs on master node much faster

2015-04-20 Thread Evo Eftimov
other 2 nodes. From: Sean Owen. Subject: Re: writing to hdfs on master node much faster. > What machines are HDFS data nodes -- just your master? that would explain it

Re: writing to hdfs on master node much faster

2015-04-20 Thread Sean Owen
What machines are HDFS data nodes -- just your master? that would explain it. Otherwise, is it actually the write that's slow or is something else you're doing much faster on the master for other reasons maybe? like you're actually shipping data via the master first in some local

writing to hdfs on master node much faster

2015-04-20 Thread jamborta
Hi all, I have a three node cluster with identical hardware. I am trying a workflow where it reads data from hdfs, repartitions it and runs a few map operations, then writes the results back to hdfs. It looks like all the computation, including the repartitioning and the maps, complete within

Re: Running spark over HDFS

2015-04-20 Thread madhvi
On Monday 20 April 2015 03:18 PM, Archit Thakur wrote: There are a lot of similar problems shared and resolved by users on this same portal. I have been part of those discussions before. Search those, please try them and let us know if you still face problems. Thanks and Regards, Archit Thakur.

Re: Running spark over HDFS

2015-04-20 Thread Archit Thakur
There are a lot of similar problems shared and resolved by users on this same portal. I have been part of those discussions before. Search those, please try them and let us know if you still face problems. Thanks and Regards, Archit Thakur. On Mon, Apr 20, 2015 at 3:05 PM, madhvi wrote: > On Mo

Re: Running spark over HDFS

2015-04-20 Thread madhvi
On Monday 20 April 2015 02:52 PM, SURAJ SHETH wrote: Hi Madhvi, I think the memory requested by your job, i.e. 2.0 GB, is higher than what is available. Please request 256 MB explicitly while creating the Spark Context and try again. Thanks and Regards, Suraj Sheth. Tried the same but still

Re: Running spark over HDFS

2015-04-20 Thread SURAJ SHETH
Hi Madhvi, I think the memory requested by your job, i.e. 2.0 GB, is higher than what is available. Please request 256 MB explicitly while creating the Spark Context and try again. Thanks and Regards, Suraj Sheth

Re: Running spark over HDFS

2015-04-20 Thread SURAJ SHETH
…the master uri as shown in the web UI's top left corner, like spark://someIPorHost:7077, and it should be fine. Thanks, Best Regards. On Mon, Apr 20, 2015 at 12:22 PM, madhvi wrote: > Hi All,

Re: Running spark over HDFS

2015-04-20 Thread madhvi
    …SparkConf().setAppName("JavaWordCount");
    sparkConf.setMaster("spark://192.168.0.119:7077");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://192.168.0.119:9000");

Re: Running spark over HDFS

2015-04-20 Thread Akhil Das
…On Mon, Apr 20, 2015 at 12:22 PM, madhvi wrote: > Hi All, I am new to spark and have installed spark cluster over my system having hadoop cluster. I want to process data stored in HDFS through spark. > When I am running code in eclipse it is g

Re: Running spark over HDFS

2015-04-20 Thread madhvi
I am new to spark and have installed spark cluster over my system having hadoop cluster. I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial job has no

Re: Running spark over HDFS

2015-04-19 Thread Akhil Das
…installed spark cluster over my system having hadoop cluster. I want to process data stored in HDFS through spark. > When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; chec

Running spark over HDFS

2015-04-19 Thread madhvi
Hi All, I am new to spark and have installed spark cluster over my system having hadoop cluster. I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial job has not accepted any
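
The fixes suggested downthread boil down to pointing the application at the standalone master URL shown in the web UI and using a fully qualified HDFS URI; a minimal sketch using the addresses that appear later in the thread (the input path is a placeholder):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("WordCount")
      .setMaster("spark://192.168.0.119:7077") // as shown in the master web UI
    val sc = new org.apache.spark.SparkContext(conf)
    val data = sc.textFile("hdfs://192.168.0.119:9000/input/data.txt")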

Re: Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?

2015-04-17 Thread Reynold Xin
tFile("tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick"); > Next,I just save this DataFrame onto HDFS with below code.It will generate > 36 parquet files too,but the size of each file is about 265M > > tfs.repartition(36).saveAsParquetFile("/user/zhangxf/adClick

RE: Difference between textFile vs hadoopFile (TextInputFormat) on HDFS data

2015-04-08 Thread Puneet Kumar Ojha
Thanks. From: Nick Pentreath. Subject: Re: Difference between textFile vs hadoopFile (TextInputFormat) on HDFS data. There is no difference - textFile calls hadoopFile with a

Re: 'Java heap space' error occurred when query 4G data file from HDFS

2015-04-07 Thread 李铖
…executor, it will lower the memory requirement, while running at a slower speed. Yong. Date: Wed, 8 Apr 2015 04:57:22 +0800. Subject: Re: 'Java heap space' error occurred when query 4G data file from

Re: 'Java heap space' error occurred when query 4G data file from HDFS

2015-04-07 Thread Ted Yu
…concurrency of your executor, it will lower the memory requirement, while running at a slower speed. Yong. Date: Wed, 8 Apr 2015 04:57:22 +0800. Subject: Re: 'Java heap space' error occurred when query 4G data file from

RE: 'Java heap space' error occurred when query 4G data file from HDFS

2015-04-07 Thread java8964
…lower the cores for the executor by setting "-Dspark.deploy.defaultCores=". When you do not have enough memory, reducing the concurrency of your executor will lower the memory requirement, while running at a slower speed. Yong. Date: Wed, 8 Apr 2015 04:57:22 +0800. Subject: Re: 'Java heap space' error

Re: 'Java heap space' error occurred when query 4G data file from HDFS

2015-04-07 Thread 李铖
Any help, please? Help me get the configuration right. 李铖 wrote on Tuesday, 7 April 2015: > In my dev-test env I have 3 virtual machines; every machine has 12G memory, 8 CPU cores. > Here is spark-defaults.conf and spark-env.sh. Maybe some config is not right. > I run this command: *spark-submit --master yarn-

Re: Difference between textFile vs hadoopFile (TextInputFormat) on HDFS data

2015-04-07 Thread Nick Pentreath
…InputFormat) when data is present in HDFS? Will there be any performance gain that can be observed? > Puneet Kumar Ojha > Data Architect | PubMatic <http://www.pubmatic.com/>

Difference between textFile vs hadoopFile (TextInputFormat) on HDFS data

2015-04-07 Thread Puneet Kumar Ojha
Hi, is there any difference between textFile vs hadoopFile (TextInputFormat) when data is present in HDFS? Will there be any performance gain that can be observed? Puneet Kumar Ojha, Data Architect | PubMatic <http://www.pubmatic.com/>

'Java heap space' error occurred when query 4G data file from HDFS

2015-04-07 Thread 李铖
In my dev-test env I have 3 virtual machines; every machine has 12G memory, 8 CPU cores. Here is spark-defaults.conf and spark-env.sh. Maybe some config is not right. I run this command: *spark-submit --master yarn-client --driver-memory 7g --executor-memory 6g /home/hadoop/spark/main.py* exceptio

Re: MLlib: save models to HDFS?

2015-04-03 Thread Xiangrui Meng
In 1.3, you can use model.save(sc, "hdfs path"). You can check the code examples here: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples. -Xiangrui On Fri, Apr 3, 2015 at 2:17 PM, Justin Yip wrote: > Hello Zhou, > > You can look at the recomme
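
A minimal sketch of the 1.3 save/load cycle for a CF model, assuming a ratings RDD prepared elsewhere and a placeholder HDFS path:

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel}

    // ratings: RDD[Rating] prepared elsewhere
    val model = ALS.train(ratings, 10, 10, 0.01) // rank, iterations, lambda
    model.save(sc, "hdfs:///models/als")

    // later, possibly in another job:
    val sameModel = MatrixFactorizationModel.load(sc, "hdfs:///models/als")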

Re: MLlib: save models to HDFS?

2015-04-03 Thread Justin Yip
Hello Zhou, you can look at the recommendation template <http://templates.prediction.io/PredictionIO/template-scala-parallel-recommendation> of PredictionIO. PredictionIO is built on top of Spark, and this template illustrates how you can save the ALS model to HDFS and then reload it

MLlib: save models to HDFS?

2015-04-03 Thread S. Zhou
I am new to MLlib so I have a basic question: is it possible to save MLlib models (particularly CF models) to HDFS and then reload them later? If yes, could you share some sample code (I could not find it in the MLlib tutorial). Thanks!

Re: Spark, snappy and HDFS

2015-04-02 Thread Nick Travers
…/ byte[]. Review what you are writing since it is not BytesWritable / Text. On Thu, Apr 2, 2015 at 3:40 AM, Nick Travers wrote: > I'm actually running this in a separate environment to our HDFS cluster. > I think I've been able to sort out th

Re: Spark, snappy and HDFS

2015-04-02 Thread Sean Owen
…isn't Spark-specific; you do not have a SequenceFile of byte[] / String, but of byte[] / byte[]. Review what you are writing since it is not BytesWritable / Text. On Thu, Apr 2, 2015 at 3:40 AM, Nick Travers wrote: > I'm actually running this in a separate environment to our HDFS cluster. >

Re: Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
I'm actually running this in a separate environment to our HDFS cluster. I think I've been able to sort out the issue by copying /opt/cloudera/parcels/CDH/lib to the machine I'm running this on (I'm just using a one-worker setup at present) and adding the following to s

Re: Spark, snappy and HDFS

2015-04-01 Thread Xianjin YE
…to the spark-env.sh file, but still nothing. On Wed, Apr 1, 2015 at 7:19 PM, Xianjin YE <advance...@gmail.com> wrote: > Can you read a snappy compressed file in hdfs? Looks like the libsnappy.so is not in the hadoop native lib path.

Re: Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
…Apr 1, 2015 at 7:19 PM, Xianjin YE wrote: > Can you read a snappy compressed file in hdfs? Looks like the libsnappy.so is not in the hadoop native lib path. > On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote: > Has anyone else encountered the following error when

Re: Spark, snappy and HDFS

2015-04-01 Thread Xianjin YE
Can you read a snappy compressed file in hdfs? Looks like the libsnappy.so is not in the hadoop native lib path. On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote: > Has anyone else encountered the following error when trying to read a snappy compressed sequence file fro

Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
Has anyone else encountered the following error when trying to read a snappy compressed sequence file from HDFS? *java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z* The following works for me when the file is uncompressed: import org.apache.hadoop.io
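
The direction the thread converges on is making libsnappy.so (and libhadoop.so) visible to the JVMs; a hedged spark-env.sh sketch, with the CDH parcel path taken from later in the thread:

    # spark-env.sh -- assuming the Hadoop native libs live under this path
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
    export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native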

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-04-01 Thread Haoyuan Li
…memory) and the rest for regular mesos tasks? > This means, on each slave node I would have tachyon worker (+ hdfs configuration to talk to s3 or the hdfs datanode) and the mesos slave process. Is this correct? -- Sean -- Haoyuan Li, AMPLab, EECS, UC Berkeley, http://www.cs.berkeley.edu/~haoyuan/

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Sean Bigdatafun
…> This means, on each slave node I would have tachyon worker (+ hdfs configuration to talk to s3 or the hdfs datanode) and the mesos slave process. Is this correct? -- Sean

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Haoyuan Li
…total memory) and the rest for regular mesos tasks? > This depends on your machine spec and workload. The high-level idea is to give Tachyon a memory size equal to the total memory size of the machine minus other processes' memory needs. > This means, on each sla

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Ankur Chauhan
Hi Haoyuan, So on each mesos slave node I should allocate/section off some amount of memory for tachyon (let's say 50% of the total memory) and the rest for regular mesos tasks? This means, on each slave node I would have tachyon worker (+

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Haoyuan Li
…s deployment. I can't seem to figure out the "best practices" around HDFS and Tachyon. The documentation about Spark's data-locality section seems to point that each of my mesos slave nodes should also run a hdfs datanode. This seems fine but I can't seem to

deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Ankur Chauhan
Hi, I am fairly new to the spark ecosystem and I have been trying to setup a spark on mesos deployment. I can't seem to figure out the "best practices" around HDFS and Tachyon. The documentation about Spark's data-locality section

Re: java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-30 Thread nsalian
Try running it like this: sudo -u hdfs spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn hdfs:///user/spark/spark-examples-1.2.0-cdh5.3.2-hadoop2.5.0-cdh5.3.2.jar 10. Caveats: 1) Make sure the permissions of /user/nick are 775 or 777. 2) No need for

Re: Spark-submit not working when application jar is in hdfs

2015-03-30 Thread nsalian
Client mode would not support HDFS jar extraction. I tried this: sudo -u hdfs spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn hdfs:///user/spark/spark-examples-1.2.0-cdh5.3.2-hadoop2.5.0-cdh5.3.2.jar 10 And it worked. -- View this message in context

RE: java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-30 Thread java8964
I think the jar file has to be local; HDFS is not supported yet in Spark. See this answer: http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs > Date: Sun, 29 Mar 2015 22:34:46 -0700 > From: n.e.trav...@gmail.com > To: user@spark.a

Re: java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-29 Thread Akhil Das
What happens when you do: sc.textFile("hdfs://path/to/the_file.txt") Thanks Best Regards On Mon, Mar 30, 2015 at 11:04 AM, Nick Travers wrote: > Hi List, > > I'm following this example here > < > https://github.com/databricks/learning-spark/tree/master/min

java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-29 Thread Nick Travers
    …\
      --class com.oreilly.learningsparkexamples.mini.scala.WordCount \
      hdfs://host.domain.ex/user/nickt/learning-spark-mini-example_2.10-0.0.1.jar \
      hdfs://host.domain.ex/user/nickt/linkage \
      hdfs://host.domain.ex/user/nickt/wordcounts
The jar is submitted fine and I can see it appear on the driver node (i.e. connecting to and reading f

Re: Spark-submit not working when application jar is in hdfs

2015-03-29 Thread dilm
Made it work by using yarn-cluster as master instead of local. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-not-working-when-application-jar-is-in-hdfs-tp21840p22281.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Spark-submit not working when application jar is in hdfs

2015-03-28 Thread Ted Yu
Looking at SparkSubmit#addJarToClasspath():

    uri.getScheme match {
      case "file" | "local" => ...
      case _ => printWarning(s"Skip remote jar $uri.")

It seems the hdfs scheme is not recognized. FYI. On Thu, Feb 26, 2015 at 6:09 PM, dilm

Re: Spark-submit not working when application jar is in hdfs

2015-03-28 Thread rrussell25
Hi, did you resolve this issue or just work around it by keeping your application jar local? Running into the same issue with 1.3. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-not-working-when-application-jar-is-in-hdfs-tp21840p22272.html

Re: Should I do spark-sql query on HDFS or hive?

2015-03-23 Thread Denny Lee
From the standpoint of Spark SQL accessing the files - when it is hitting Hive, it is in effect hitting HDFS as well. Hive provides a great framework where the table structure is already well defined. But underneath it, Hive is just accessing files from HDFS, so you are hitting HDFS either

Re: EC2 cluster created by spark using old HDFS 1.0

2015-03-22 Thread Akhil Das
That's a hadoop version incompatibility issue; you need to make sure everything runs on the same version. Thanks, Best Regards. On Sat, Mar 21, 2015 at 1:24 AM, morfious902002 wrote: > Hi, I created a cluster using the spark-ec2 script. But it installs HDFS version 1.0. I would li

EC2 cluster created by spark using old HDFS 1.0

2015-03-20 Thread morfious902002
Hi, I created a cluster using spark-ec2 script. But it installs HDFS version 1.0. I would like to use this cluster to connect to HIVE installed on a cloudera CDH 5.3 cluster. But I am getting the following error:- org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate

Re: Should I do spark-sql query on HDFS or apache hive?

2015-03-17 Thread Michael Armbrust
I am trying to explain that these are not either/or decisions. You are likely going to be storing the data on HDFS no matter what other choices you make. You can use parquet to store the data whether or not you are addressing files directly on HDFS or using the Hive Metastore to locate the

Re: Should I do spark-sql query on HDFS or apache hive?

2015-03-17 Thread 李铖
Did you mean that parquet is faster than hive format, and hive format is faster than hdfs, for Spark SQL? :) 2015-03-18 1:23 GMT+08:00 Michael Armbrust: > The performance has more to do with the particular format you are using, not where the metadata is coming from. Even hive tabl

Re: Should I do spark-sql query on HDFS or apache hive?

2015-03-17 Thread Michael Armbrust
The performance has more to do with the particular format you are using, not where the metadata is coming from. Even Hive tables are usually read from files on HDFS. You probably should use HiveContext, as its query language is more powerful than SQLContext's. Also, parquet is usually the faster
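
Both routes end at the same HDFS files; a sketch contrasting them (table and path names are invented):

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    // metastore-backed: Hive resolves the table name to file locations
    val viaMetastore = hc.sql("SELECT COUNT(*) FROM clicks")
    // direct: address the same parquet files on HDFS yourself
    val direct = hc.parquetFile("hdfs:///user/hive/warehouse/clicks")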

Re: Unable to saveAsParquetFile to HDFS since Spark 1.3.0

2015-03-17 Thread Cheng Lian
This has been fixed by https://github.com/apache/spark/pull/5020 On 3/18/15 12:24 AM, Franz Graf wrote: Hi all, today we tested Spark 1.3.0. Everything went pretty fine except that I seem to be unable to save an RDD as parquet to HDFS. A minimum example is: import sqlContext.implicits

Unable to saveAsParquetFile to HDFS since Spark 1.3.0

2015-03-17 Thread Franz Graf
Hi all, today we tested Spark 1.3.0. Everything went pretty fine except that I seem to be unable to save an RDD as parquet to HDFS. A minimum example is:

    import sqlContext.implicits._
    // Reading works fine!
    val foo: RDD[String] = spark.textFile("hdfs://")
    // this work
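
A completed version of that minimum example, reconstructed as a sketch (placeholder paths; in 1.3 an RDD[String] needs mapping to a Product type such as Tuple1 before toDF):

    import org.apache.spark.rdd.RDD
    import sqlContext.implicits._

    // Reading works fine!
    val foo: RDD[String] = sc.textFile("hdfs:///some/input")
    // Writing is what failed on 1.3.0 before the fix:
    foo.map(Tuple1.apply).toDF("value").saveAsParquetFile("hdfs:///some/output")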

Should I do spark-sql query on HDFS or apache hive?

2015-03-17 Thread 李铖
Hi everybody. I am new to Spark. Now I want to do interactive sql queries using Spark SQL. Spark SQL can run over Hive or load files directly from hdfs. Which is better or faster? Thanks.

Should I do spark-sql query on HDFS or hive?

2015-03-17 Thread 李铖
Hi everybody. I am new to Spark. Now I want to do interactive sql queries using Spark SQL. Spark SQL can run over Hive or load files directly from hdfs. Which is better or faster? Thanks.

Spark on HDFS vs. Lustre vs. other file systems - formal research and performance evaluation

2015-03-13 Thread Edmon Begoli
All, does anyone have a reference to a publication or other, informal sources (blogs, notes) showing performance of Spark on HDFS vs. other shared (Lustre, etc.) or other file systems (NFS)? I need this for formal performance research. We are currently doing research into this on a very

Re: How to read from hdfs using spark-shell in Intel hadoop?

2015-03-11 Thread Arush Kharbanda
You can add resolvers on SBT using: resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots" On Thu, Feb 26, 2015 at 4:09 PM, MEETHU MATHEW wrote: > Hi, I am not able to read from HDFS (Intel distribution hadoop, Ha

ec2 persistent-hdfs with ebs using spot instances

2015-03-10 Thread Deborah Siegel
Hello, I'm new to ec2. I've set up a spark cluster on ec2 and am using persistent-hdfs with the data nodes mounting ebs. I launched my cluster using spot-instances ./spark-ec2 -k mykeypair -i ~/aws/mykeypair.pem -t m3.xlarge -s 4 -z us-east-1c --spark-version=1.2.0 --spot-price=.032

Re: Spark with data on NFS v HDFS

2015-03-05 Thread Tobias Pfeiffer
…feeding data to Spark from NFS v HDFS? > As I understand it, one performance advantage of using HDFS is that the task will be computed at a cluster node that has the data on its local disk already, so the tasks go to where the data is. In the case of NFS, all data must be downloaded from the file server(

Spark with data on NFS v HDFS

2015-03-05 Thread Ashish Mukherjee
Hello, I understand Spark can be used with Hadoop or standalone. I have certain questions related to use of the correct FS for Spark data. What is the efficiency trade-off in feeding data to Spark from NFS v HDFS? If one is not using Hadoop, is it still usual to house data in HDFS for Spark to

Spark-submit not working when application jar is in hdfs

2015-02-26 Thread dilm
I'm trying to run a spark application using bin/spark-submit. When I reference my application jar inside my local filesystem, it works. However, when I copied my application jar to a directory in hdfs, I get the following exception: Warning: Skip remote jar hdfs://localhost:9000/user/hdfs

How to read from hdfs using spark-shell in Intel hadoop?

2015-02-26 Thread MEETHU MATHEW
Hi, I am not able to read from HDFS (Intel distribution hadoop, Hadoop version is 1.0.3) from spark-shell (spark version is 1.2.1). I built spark using the command mvn -Dhadoop.version=1.0.3 clean package, started spark-shell, and read an HDFS file using sc.textFile(), and the exception is: WARN

Re: bulk writing to HDFS in Spark Streaming?

2015-02-19 Thread Akhil Das
There was already a thread around this, if I understood your question correctly; you can go through it here: https://mail-archives.apache.org/mod_mbox/spark-user/201502.mbox/%3ccannjawtrp0nd3odz-5-_ya351rin81q-9+f2u-qn+vruqy+...@mail.gmail.com%3E Thanks, Best Regards. On Thu, Feb 19, 2015 at 8:16 PM, Chic

bulk writing to HDFS in Spark Streaming?

2015-02-19 Thread Chico Qi
Hi all, in Spark Streaming I want to use DStream.saveAsTextFiles with bulk writing, because the normal saveAsTextFiles cannot finish within the configured batch interval. Maybe a common pool of writers, or another assigned worker for bulk writing? Thanks! B/R Jichao
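
One common pattern (a sketch, not a resolution confirmed in this thread) is to take control of the write inside foreachRDD and coalesce each batch into fewer, larger files:

    // coalesce each batch to a handful of larger files before writing
    stream.foreachRDD { (rdd, time) =>
      rdd.coalesce(4).saveAsTextFile(s"hdfs:///out/batch-${time.milliseconds}")
    }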

Re: Writing to HDFS from spark Streaming

2015-02-16 Thread Sean Owen
…and then pass that as the final argument. On Wed, Feb 11, 2015 at 6:35 AM, Akhil Das wrote: > Did you try: > temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class, (Class) TextOutputFormat.class);

Re: Writing to HDFS from spark Streaming

2015-02-15 Thread Bahubali Jain
…Feb 11, 2015 at 6:35 AM, Akhil Das wrote: > Did you try: > temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class, (Class) TextOutputFormat.class); > Thanks, Best Regards. O

Re: Has Spark 1.2.0 changed EC2 persistent-hdfs?

2015-02-13 Thread Joe Wass
Looks like this is caused by issue SPARK-5008: https://issues.apache.org/jira/browse/SPARK-5008 On 13 February 2015 at 19:04, Joe Wass wrote: > I've updated to Spark 1.2.0 and the EC2 and the persistent-hdfs behaviour > appears to have changed. > > My launch script is &g

Re: Spark standalone and HDFS 2.6

2015-02-13 Thread Sean Owen
…wal/blinkdb) which seems to work only with Spark 0.9. However, if I want to access HDFS I need to compile Spark against the Hadoop version which is running on my cluster (2.6.0). Hence, the versions problem ... On Friday, February 13, 2015 11:28 AM, Sean Owen wrote:

Re: Spark standalone and HDFS 2.6

2015-02-13 Thread Grandl Robert
I am trying to run BlinkDB (https://github.com/sameeragarwal/blinkdb) which seems to work only with Spark 0.9. However, if I want to access HDFS I need to compile Spark against the Hadoop version which is running on my cluster (2.6.0). Hence, the versions problem ... On Friday, February 13

Re: Spark standalone and HDFS 2.6

2015-02-13 Thread Grandl Robert
…2015 at 7:13 PM, Grandl Robert wrote: > Hi guys, probably a dummy question. Do you know how to compile Spark 0.9 to easily integrate with HDFS 2.6.0? I was trying sbt/sbt -Pyarn -Phadoop-2.6 assembly or mvn -Dhadoop.version=2.6.0 -DskipTests clean package, but none of these approaches succeeded. Thanks, Robert
