Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-16 Thread Matei Zaharia
Thanks for sharing this, Brandon! Looks like a great architecture for people to build on. Matei On August 15, 2014 at 2:07:06 PM, Brandon Amos (a...@adobe.com) wrote: Hi Spark community, At Adobe Research, we're happy to open source a prototype technology called Spindle we've been

Re: spark on yarn cluster can't launch

2014-08-16 Thread Sandy Ryza
Hi, Do you know what YARN scheduler you're using and what version of YARN? It seems like this would be caused by YarnClient.getQueueInfo returning null, though, from browsing the YARN code, I'm not sure how this could happen. -Sandy On Fri, Aug 15, 2014 at 11:23 AM, Andrew Or

Re: spark on yarn cluster can't launch

2014-08-16 Thread Sandy Ryza
On closer look, it seems like this can occur if the queue doesn't exist. Filed https://issues.apache.org/jira/browse/SPARK-3082. -Sandy On Sat, Aug 16, 2014 at 12:49 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi, Do you know what YARN scheduler you're using and what version of YARN? It

Re: Updating existing JSON files

2014-08-16 Thread Sean Owen
If you mean you want to overwrite the file in-place while you're reading it, no you can't do that with HDFS. That would be dicey on any file system. If you just want to append to the file, yes HDFS supports appends. I am pretty certain Spark does not have a concept that maps to appending, though I

Re: Question regarding spark data partition and coalesce. Need info on my use case.

2014-08-16 Thread Mayur Rustagi
Quite a good question. I assume you know the size of the cluster going in; you can essentially try to partition the data in some multiple of that, using a RangePartitioner to split the data roughly equally. Dynamic partitions are created based on the number of blocks on the filesystem, hence the
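To make the RangePartitioner idea concrete, here is a minimal pure-Python sketch of what range partitioning does: sample the keys, pick evenly spaced boundary keys, and route each key to a partition by binary search. This is an illustrative simplification, not Spark's actual implementation; the function name is hypothetical.

```python
import bisect

def make_range_partitioner(sample_keys, num_partitions):
    """Hypothetical sketch of range partitioning: derive boundary keys
    from a sample, then map each key to a partition index."""
    sorted_keys = sorted(sample_keys)
    # Pick num_partitions - 1 boundary keys at even intervals in the sample.
    step = len(sorted_keys) / num_partitions
    bounds = [sorted_keys[int(step * i)] for i in range(1, num_partitions)]
    # Binary search places a key into the partition whose range contains it.
    return lambda key: bisect.bisect_right(bounds, key)

part = make_range_partitioner(range(100), 4)
print(part(10), part(30), part(60), part(99))  # → 0 1 2 3
```

With a representative sample, each partition receives a roughly equal share of the keys, which is the property the reply relies on.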

Re: Open source project: Deploy Spark to a cluster with Puppet and Fabric.

2014-08-16 Thread Nicholas Chammas
Hey Brandon, Thank you for sharing this. What is the relationship of this project to the spark-ec2 tool that comes with Spark? Does it provide a superset of the functionality of spark-ec2? Nick On Wednesday, August 13, 2014, bdamosa...@adobe.com wrote: Hi Spark community, We're excited about Spark

Re: Running Spark shell on YARN

2014-08-16 Thread Eric Friedman
+1 for such a document. Eric Friedman On Aug 15, 2014, at 1:10 PM, Kevin Markey kevin.mar...@oracle.com wrote: Sandy and others: Is there a single source of Yarn/Hadoop properties that should be set or reset for running Spark on Yarn? We've sort of stumbled through one property

Re: Running Spark shell on YARN

2014-08-16 Thread Soumya Simanta
I followed this thread http://apache-spark-user-list.1001560.n3.nabble.com/YARN-issues-with-resourcemanager-scheduler-address-td5201.html#a5258 to set SPARK_YARN_USER_ENV to HADOOP_CONF_DIR export SPARK_YARN_USER_ENV=CLASSPATH=$HADOOP_CONF_DIR and used the following command to share conf

RE: Does HiveContext support Parquet?

2014-08-16 Thread Silvio Fiorito
There's really nothing special besides including that jar on your classpath. You just do selects, inserts, etc. as you normally would. The same instructions here apply https://cwiki.apache.org/confluence/display/Hive/Parquet From:

RE: Does HiveContext support Parquet?

2014-08-16 Thread lyc
Thanks for your help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209p12231.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Issue with Spark on EC2 using spark-ec2 script

2014-08-16 Thread rkishore999
I'm also running into the same issue and am blocked here. Were any of you able to get past this issue? I tried using both ephemeral-hdfs and persistent-hdfs. I'm getting the same issue.

RE: Does HiveContext support Parquet?

2014-08-16 Thread Flavio Pompermaier
Hi to all, sorry for not being fully on topic, but I have two quick questions about Parquet tables registered in Hive/Spark: 1) Where are the created tables stored? 2) If I have multiple HiveContexts (one per application) using the same Parquet table, is there any problem if inserting concurrently

Re: spark streaming - lambda architecture

2014-08-16 Thread Jörn Franke
Hi, Maybe this helps you. For the speed layer I think something like complex event processing as it is - to some extent - supported by Spark Streaming can make sense. You process the events as they come in. You store them afterwards. The Spark Streaming web page gives a nice example: trend

RE: Does HiveContext support Parquet?

2014-08-16 Thread Silvio Fiorito
If you're using HiveContext then all metadata is in the Hive metastore as defined in hive-site.xml. Concurrent writes should be fine as long as you're using a concurrent metastore db. From: Flavio Pompermaier (pomperma...@okkam.it) Sent: 8/16/2014 1:26 PM

Re: Does HiveContext support Parquet?

2014-08-16 Thread Michael Armbrust
Hi to all, sorry for not being fully on topic but I have 2 quick questions about Parquet tables registered in Hive/Spark: Using HiveQL to CREATE TABLE will add a table to the metastore / warehouse exactly as it would in hive. Registering is a purely temporary operation that lives with the

kryo out of buffer exception

2014-08-16 Thread Mohit Jaggi
Hi All, I was doing a groupBy and apparently some keys were very frequent making the serializer fail with buffer overflow exception. I did not need a groupBy so I switched to combineByKey in this case but would like to know how to increase the kryo buffer sizes to avoid this error. I hope there is
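For reference, the Kryo buffer sizes are plain Spark configuration settings. A hedged PySpark sketch follows; in the Spark 1.x line the relevant keys were `spark.kryoserializer.buffer.mb` (per-task initial buffer) and `spark.kryoserializer.buffer.max.mb` (the hard cap that the overflow error hits) — check the configuration page for your exact version, as the key names changed in later releases.

```python
from pyspark import SparkConf, SparkContext

# Hedged config sketch (Spark 1.x key names; verify against your version's
# configuration docs). Raising the max buffer avoids the overflow when a
# single serialized value (e.g. one huge grouped key) exceeds the default.
conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryoserializer.buffer.mb", "8")        # initial size
        .set("spark.kryoserializer.buffer.max.mb", "256")  # hard cap
        .setAppName("kryo-buffer-example"))
sc = SparkContext(conf=conf)
```

That said, as the poster found, avoiding the giant value in the first place (combineByKey / reduceByKey instead of groupBy) is the more robust fix.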

Does anyone have a stand alone spark instance running on Windows

2014-08-16 Thread Steve Lewis
I want to look at porting a Hadoop problem to Spark. Eventually I want to run on a Hadoop 2.0 cluster, but while I am learning and porting I want to run small problems on my Windows box. I installed Scala and sbt. I downloaded Spark, and in the Spark directory I can say mvn -Phadoop-0.23

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-16 Thread Debasish Das
Hi Brandon, Looks very cool...will try it out for ad-hoc analysis of our datasets and provide more feedback... Could you please give bit more details about the differences of Spindle architecture compared to Hue + Spark integration (python stack) and Ooyala Jobserver ? Does Spindle allow

Re: iterating with index in pyspark

2014-08-16 Thread Chengi Liu
nevermind folks!!! On Sat, Aug 16, 2014 at 2:22 PM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I have data like following: 1,2,3,4 1,2,3,4 5,6,2,1 and so on.. I would like to create a new rdd as follows: (0,0,1) (0,1,2) (0,2,3) (0,3,4) (1,0,1) .. and so on.. How do I do
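Although the poster solved it, the requested transformation — turn each CSV row into (row_index, column_index, value) triples — is easy to show. Below is a pure-Python sketch of the logic; in PySpark the same idea is typically `rdd.zipWithIndex()` followed by a `flatMap` over the split fields (the variable names here are illustrative).

```python
rows = ["1,2,3,4", "1,2,3,4", "5,6,2,1"]

# Each line becomes one triple per field: (row index, column index, value).
# PySpark equivalent (sketch): rdd.zipWithIndex().flatMap(
#     lambda li: [(li[1], j, int(v)) for j, v in enumerate(li[0].split(","))])
triples = [(i, j, int(v))
           for i, line in enumerate(rows)
           for j, v in enumerate(line.split(","))]

print(triples[:5])  # → [(0, 0, 1), (0, 1, 2), (0, 2, 3), (0, 3, 4), (1, 0, 1)]
```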

s3:// sequence file startup time

2014-08-16 Thread kmatzen
I have some RDD's stored as s3://-backed sequence files sharded into 1000 parts. The startup time is pretty long (~10's of minutes). It's communicating with S3, but I don't know what it's doing. Is it just fetching the metadata from S3 for each part? Is there a way to pipeline this with the

Re: Does anyone have a stand alone spark instance running on Windows

2014-08-16 Thread Manu Suryavansh
Hi, I have built spark-1.0.0 on Windows using Java 7/8 and I have been able to run several examples - here are my notes - http://ml-nlp-ir.blogspot.com/2014/04/building-spark-on-windows-and-cloudera.html on how to build from source and run examples in spark shell. Regards, Manu On Sat, Aug

Re: How to implement multinomial logistic regression (softmax regression) in Spark?

2014-08-16 Thread Cui xp
Hi DB, Thanks for your reply. I saw the slides on SlideShare and am studying them. But one link on the page, https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16579/consoleFull, reports ERROR 404 NOT FOUND.

Re: Does anyone have a stand alone spark instance running on Windows

2014-08-16 Thread Tushar Khairnar
I am also trying to run on Windows and will post once I am able to launch. My guess is that "by hand" probably means manually forming the java command, i.e. the classpath and java options, and then appending the right class name for the worker or master. The Spark scripts follow a hierarchy: start-master or

Program without doing assembly

2014-08-16 Thread Deep Pradhan
Hi, I am just playing around with the code in Spark. I am adding print statements to some of the code that ships with Spark so as to see how it works. Every time I change or add something to the code I have to run the command *SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly*, which is tiresome at times. Is
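One workaround worth noting: sbt's `~` prefix reruns a task automatically whenever source files change, so the rebuild starts as soon as you save. This is a hedged sketch assuming the sbt launcher script shipped in the Spark source tree; the full assembly step itself is still what takes the time.

```shell
# One-time full build (as in the question):
SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly

# Keep sbt resident and rebuild on every source change
# (sbt's ~ prefix = continuous execution; avoids JVM/sbt startup cost):
SPARK_HADOOP_VERSION=2.3.0 sbt/sbt ~assembly
```

For quick syntax checking without a full assembly, `sbt/sbt ~compile` in the same resident session is faster still.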