Broadcast variables: when should I use them?

2015-01-26 Thread frodo777
Hello. I have a number of static Arrays and Maps in my Spark Streaming driver program. They are simple collections, initialized with integer values and strings directly in the code. There is no RDD/DStream involvement here. I do not expect them to contain more than 100 entries, each. They are

Re: Broadcast variables: when should I use them?

2015-01-26 Thread Paolo Platter
Hi, Yes, if they are not big, it's a good practice to broadcast them to avoid serializing them each time you use a closure. Paolo Sent from my Windows Phone From: frodo777 roberto.vaquer...@bitmonlab.com Sent: 26/01/2015 14:34 To:
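
A minimal sketch of the pattern being discussed, assuming a SparkContext named sc; the collection contents and names below are illustrative, not taken from the thread:

    // Small static lookup collections defined once in the driver
    val codes: Map[Int, String] = Map(1 -> "one", 2 -> "two", 3 -> "three")

    // Broadcast once; closures then carry only the lightweight broadcast handle
    val codesBc = sc.broadcast(codes)

    val ids = sc.parallelize(Seq(1, 2, 3, 2))
    val labelled = ids.map(i => codesBc.value.getOrElse(i, "unknown"))
    labelled.collect()   // Array("one", "two", "three", "two")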

spark context not picking up default hadoop filesystem

2015-01-26 Thread jamborta
hi all, I am trying to create a spark context programmatically, using org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the hadoop config that is created during the process is not picking up core-site.xml, so it defaults back to the local file-system. I have set HADOOP_CONF_DIR in

HW imbalance

2015-01-26 Thread Antony Mayi
Hi, is it possible to mix hosts with (significantly) different specs within a cluster (without wasting the extra resources)? For example, having 10 nodes with 36GB RAM/10 CPUs and now trying to add 3 hosts with 128GB/10 CPUs - is there a way to utilize the extra memory in Spark executors (as my

Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Håkan Jonsson
Thanks. Turns out this is a proxy problem somehow. Sorry to bother you. /Håkan On Mon Jan 26 2015 at 11:02:18 AM Franc Carter franc.car...@rozettatech.com wrote: AMI's are specific to an AWS region, so the ami-id of the spark AMI in us-west will be different if it exists. I can't remember

[SQL] Self join with ArrayType columns problems

2015-01-26 Thread Pierre B
Using Spark 1.2.0, we are facing some weird behaviour when performing a self join on a table with an ArrayType field (potential bug?). I have set up a minimal non-working example here: https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f

Re: spark context not picking up default hadoop filesystem

2015-01-26 Thread Akhil Das
Ah, I think when running locally you should give the full HDFS URL, like: val logs = sc.textFile("hdfs://akhldz:9000/sigmoid/logs") Thanks Best Regards On Mon, Jan 26, 2015 at 9:36 PM, Tamas Jambor jambo...@gmail.com wrote: thanks for the reply. I have tried to add SPARK_CLASSPATH, I got a warning that

Re: Issues when combining Spark and a third party java library

2015-01-26 Thread Akhil Das
It's more like Spark is not able to find the Hadoop jars. Try setting HADOOP_CONF_DIR and also make sure the *-site.xml files are available in the CLASSPATH/SPARK_CLASSPATH. Thanks Best Regards On Mon, Jan 26, 2015 at 7:28 PM, Staffan staffan.arvids...@gmail.com wrote: I'm using Maven and Eclipse to

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Michael Armbrust
You can create a partitioned hive table using Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables On Mon, Jan 26, 2015 at 5:40 AM, Danny Yates da...@codeaholics.org wrote: Hi, I've got a bunch of data stored in S3 under directories like this:
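
A hedged sketch of the suggestion above, assuming a Hive-enabled Spark build; the table and column names are illustrative and follow the y=/m=/d= layout from the question:

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)
    hiveCtx.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS events (line STRING)
                   PARTITIONED BY (y INT, m INT, d INT)
                   LOCATION 's3n://blah/'""")
    hiveCtx.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (y=2015, m=1, d=25) " +
                "LOCATION 's3n://blah/y=2015/m=01/d=25/'")
    // Partition pruning: only the matching directories are scanned
    hiveCtx.sql("SELECT count(*) FROM events WHERE y = 2015 AND m = 1").collect()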

Re: spark context not picking up default hadoop filesystem

2015-01-26 Thread Tamas Jambor
thanks for the reply. I have tried to add SPARK_CLASSPATH, I got a warning that it was deprecated (didn't solve the problem), also tried to run with --driver-class-path, which did not work either. I am trying this locally. On Mon Jan 26 2015 at 15:04:03 Akhil Das ak...@sigmoidanalytics.com

Re: Lost task - connection closed

2015-01-26 Thread octavian.ganea
Here is the first error I get at the executors: 15/01/26 17:27:04 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[handle-message-executor-16,5,main] java.lang.StackOverflowError at

Re: spark context not picking up default hadoop filesystem

2015-01-26 Thread Akhil Das
You can also trying adding the core-site.xml in the SPARK_CLASSPATH, btw are you running the application locally? or in standalone mode? Thanks Best Regards On Mon, Jan 26, 2015 at 7:37 PM, jamborta jambo...@gmail.com wrote: hi all, I am trying to create a spark context programmatically,
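
A hedged sketch of making a programmatically created context resolve paths against the cluster HDFS; the config path and namenode address below are assumptions, not taken from the thread:

    import org.apache.hadoop.fs.Path
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("example"))

    // Either load the cluster config explicitly...
    sc.hadoopConfiguration.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
    // ...or set the default filesystem directly
    sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://namenode:8020")

    val logs = sc.textFile("/sigmoid/logs")   // now resolves against HDFS, not the local file system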

Re: HW imbalance

2015-01-26 Thread Antony Mayi
should have said I am running as yarn-client. all I can see is specifying the generic executor memory that is then to be used in all containers. On Monday, 26 January 2015, 16:48, Charles Feduke charles.fed...@gmail.com wrote: You should look at using Mesos. This should abstract

Re: cannot run spark-shell interactively against cluster from remote host - confusing memory warnings

2015-01-26 Thread Akhil Das
When you say remote cluster, you need to make sure of a few things: - No firewall/network is blocking any connection (simply ping from the local machine to the remote IP and vice versa) - Make sure all ports (unless you specify them manually) are open. You can also refer to this discussion,

Error when cache partitioned Parquet table

2015-01-26 Thread ZHENG, Xu-dong
Hi all, I meet the below error when I cache a partitioned Parquet table. It seems that Spark is trying to extract the partition key from the Parquet file itself, so it is not found. But other queries can run successfully, even ones that request the partition key. Is it a bug in SparkSQL? Is there any workaround

Re: HW imbalance

2015-01-26 Thread Charles Feduke
You should look at using Mesos. This should abstract away the individual hosts into a pool of resources and make the different physical specifications manageable. I haven't tried configuring Spark Standalone mode to have different specs on different machines but based on spark-env.sh.template: #

Re: [SQL] Self join with ArrayType columns problems

2015-01-26 Thread Michael Armbrust
It seems likely that there is some sort of bug related to the reuse of array objects that are returned by UDFs. Can you open a JIRA? I'll also note that the sql method on HiveContext does run HiveQL (configured by spark.sql.dialect) and the hql method has been deprecated since 1.1 (and will

Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Danny Yates
Hi, I've got a bunch of data stored in S3 under directories like this: s3n://blah/y=2015/m=01/d=25/lots-of-files.csv In Hive, if I issue a query WHERE y=2015 AND m=01, I get the benefit that it only scans the necessary directories for files to read. As far as I can tell from searching and

Re: [SQL] Self join with ArrayType columns problems

2015-01-26 Thread Dean Wampler
You are creating a HiveContext, then using the sql method instead of hql. Is that deliberate? The code doesn't work if you replace HiveContext with SQLContext. Lots of exceptions are thrown, but I don't have time to investigate now. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd

Issues when combining Spark and a third party java library

2015-01-26 Thread Staffan
I'm using Maven and Eclipse to build my project. I'm letting Maven download all the things I need for running everything, which has worked fine up until now. I need to use the CDK library (https://github.com/egonw/cdk, http://sourceforge.net/projects/cdk/) and once I add the dependencies to my

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Cheng Lian
Currently no if you don't want to use Spark SQL's HiveContext. But we're working on adding partitioning support to the external data sources API, with which you can create, for example, partitioned Parquet tables without using Hive. Cheng On 1/26/15 8:47 AM, Danny Yates wrote: Thanks

Re: SVD in pyspark ?

2015-01-26 Thread Joseph Bradley
Hi Andreas, There unfortunately is not a Python API yet for distributed matrices or their operations. Here's the JIRA to follow to stay up-to-date on it: https://issues.apache.org/jira/browse/SPARK-3956 There are internal wrappers (used to create the Python API), but they are not really public

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Chris Gore
Good to hear there will be partitioning support. I've had some success loading partitioned data specified with Unix globbing format, i.e.: sc.textFile("s3://bucket/directory/dt=2014-11-{2[4-9],30}T00-00-00") would load dates 2014-11-24 through 2014-11-30. Not the most ideal solution, but it

Spark (Streaming?) holding on to Mesos resources

2015-01-26 Thread Gerard Maas
Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources and not releasing them, resulting in resource starvation to all jobs running on the Mesos cluster. For example: This is a job that has spark.cores.max = 4 and spark.executor.memory=3g

Re: HW imbalance

2015-01-26 Thread Sandy Ryza
Hi Antony, Unfortunately, all executors for any single Spark application must have the same amount of memory. It's possible to configure YARN with different amounts of memory for each host (using yarn.nodemanager.resource.memory-mb), so other apps might be able to take advantage of the extra
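
For reference, a hedged sketch of the per-host setting Sandy mentions, placed in yarn-site.xml on the larger nodes; the value below is illustrative:

    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>122880</value>   <!-- roughly 120GB offered to containers on the 128GB hosts -->
    </property>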

Re: Spark webUI - application details page

2015-01-26 Thread spark08011
Where is the history server running? Is it running on the same node as the logs directory?

Re: Error when cache partitioned Parquet table

2015-01-26 Thread Sadhan Sood
Hi Xu-dong, That's probably because your table's partition paths don't look like hdfs://somepath/key=value/*.parquet. Spark is trying to extract the partition key's value from the path while caching, hence the exception is thrown since it can't find one. On Mon, Jan 26, 2015 at 10:45 AM,

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Danny Yates
Thanks Michael. I'm not actually using Hive at the moment - in fact, I'm trying to avoid it if I can. I'm just wondering whether Spark has anything similar I can leverage? Thanks

Re: Lost task - connection closed

2015-01-26 Thread Aaron Davidson
It looks like something weird is going on with your object serialization, perhaps a funny form of self-reference which is not detected by ObjectOutputStream's typical loop avoidance. That, or you have some data structure like a linked list with a parent pointer and you have many thousand elements.

large data set to get rid of exceeds Integer.MAX_VALUE error

2015-01-26 Thread freedafeng
Hi, This seems to be a known issue (see here: http://apache-spark-user-list.1001560.n3.nabble.com/ALS-failure-with-size-gt-Integer-MAX-VALUE-td19982.html) The data set is about 1.5 T bytes. There are 14 region servers. I am not sure how many regions there are for this data set. But very likely

Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Gerard Maas
(looks like the list didn't like a HTML table on the previous email. My excuses for any duplicates) Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources and not releasing them, resulting in resource starvation to all jobs running on the

Spark and S3 server side encryption

2015-01-26 Thread curtkohler
We are trying to create a Spark job that writes out a file to S3 that leverages S3's server-side encryption for sensitive data. Typically this is accomplished by setting the appropriate header on the put request, but it isn't clear whether this capability is exposed in the Spark/Hadoop APIs. Does

SaveAsTextFile to S3 bucket

2015-01-26 Thread Chen, Kevin
Does anyone know if I can save an RDD as a text file to a pre-created directory in an S3 bucket? I have a directory created in the S3 bucket: //nexgen-software/dev When I tried to save an RDD as a text file in this directory: rdd.saveAsTextFile(s3n://nexgen-software/dev/output); I got the following

Re: HDFS Namenode in safemode when I turn off my EC2 instance

2015-01-26 Thread Su She
Hello Sean and Akhil, I shut down the services on Cloudera Manager. I shut them down in the appropriate order and then stopped all services of CM. I then shut down my instances. I then turned my instances back on, but I am getting the same error. 1) I tried hadoop fs -safemode leave and it said

Re: saving rdd to multiple files named by the key

2015-01-26 Thread Aniket Bhatnagar
This might be helpful: http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job On Tue Jan 27 2015 at 07:45:18 Sharon Rapoport sha...@plaid.com wrote: Hi, I have an rdd of [k,v] pairs. I want to save each [v] to a file named [k]. I got them by
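
A hedged sketch of the approach from the linked answer, using the old Hadoop API's MultipleTextOutputFormat; the class name, sample data, and output path are illustrative:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Name each output file after the record's key; drop the key from the file contents.
    // If the same key can appear in several partitions, include `name` in the returned
    // path (e.g. key + "/" + name) to avoid file collisions.
    class KeyAsFileNameOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String]
    }

    val pairs = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2"), ("k1", "v3")), 1)
    pairs.saveAsHadoopFile("/tmp/by-key", classOf[String], classOf[String],
      classOf[KeyAsFileNameOutput])
    // produces /tmp/by-key/k1 and /tmp/by-key/k2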

Mathematical functions in spark sql

2015-01-26 Thread 1esha
Hello everyone! I try to execute select 2/3 and I get 0. Is there any way to cast double to int or something similar? Also it would be cool to get a list of functions supported by Spark SQL. Thanks!

Re: Mathematical functions in spark sql

2015-01-26 Thread Ted Yu
Have you tried floor() or ceil() functions ? According to http://spark.apache.org/sql/, Spark SQL is compatible with Hive SQL. Cheers On Mon, Jan 26, 2015 at 8:29 PM, 1esha alexey.romanc...@gmail.com wrote: Hello everyone! I try execute select 2/3 and I get 0.. Is there any

Spark on Yarn: java.lang.IllegalArgumentException: Invalid rule

2015-01-26 Thread maven
All, I recently tried to build Spark 1.2 on my enterprise server (which has Hadoop 2.3 with YARN). Here are the steps I followed for the build: $ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package $ export SPARK_HOME=/path/to/spark/folder $ export

Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Nick Pentreath
Your output path in rdd.saveAsTextFile(s3n://nexgen-software/dev/output); means it will try to write to /dev/output, which is as expected. If you create the directory /dev/output upfront in your bucket, and try to save to that (empty) directory, what is the behaviour? On Tue, Jan 27,

saving rdd to multiple files named by the key

2015-01-26 Thread Sharon Rapoport
Hi, I have an rdd of [k,v] pairs. I want to save each [v] to a file named [k]. I got them by combining many [k,v] by [k]. I could then save to file by partitions, but that still doesn't allow me to choose the name, and leaves me stuck with foo/part-... Any tips? Thanks, Sharon

Re: spark 1.2 ec2 launch script hang

2015-01-26 Thread Pete Zybrick
Try using an absolute path to the pem file On Jan 26, 2015, at 8:57 PM, ey-chih chow eyc...@hotmail.com wrote: Hi, I used the spark-ec2 script of spark 1.2 to launch a cluster. I have modified the script according to

Re: HDFS Namenode in safemode when I turn off my EC2 instance

2015-01-26 Thread Akhil Das
Command would be: hadoop dfsadmin -safemode leave If you are not able to ping your instances, it can be because you are blocking all ICMP requests. I'm not quite sure why you are not able to ping google.com from your instances. Make sure the internal IP (ifconfig) is proper in the

Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Chen, Kevin
When Spark saves an RDD to a text file, the directory must not exist upfront. It will create a directory and write the data to part- files under that directory. In my use case, I created the directory dev in the bucket ://nexgen-software/dev . I expect it to create output directly under dev and a part-

spark sqlContext udaf

2015-01-26 Thread sunwei
Hi, can anyone show me some examples of using a UDAF with the Spark sqlContext?

Re: Mathematical functions in spark sql

2015-01-26 Thread Alexey Romanchuk
I have tried select ceil(2/3), but got key not found: floor On Tue, Jan 27, 2015 at 11:05 AM, Ted Yu yuzhih...@gmail.com wrote: Have you tried floor() or ceil() functions ? According to http://spark.apache.org/sql/, Spark SQL is compatible with Hive SQL. Cheers On Mon, Jan 26, 2015 at
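
A hedged sketch of the two workarounds touched on in this thread (casting an operand to double, and using a HiveContext so Hive's math functions resolve); it assumes a Hive-enabled build and a SparkContext named sc:

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)
    hiveCtx.sql("SELECT CAST(2 AS DOUBLE) / 3").collect()   // ~0.666 instead of 0
    hiveCtx.sql("SELECT floor(7.3), ceil(7.3)").collect()   // 7, 8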

spark 1.2 ec2 launch script hang

2015-01-26 Thread ey-chih chow
Hi, I used the spark-ec2 script of spark 1.2 to launch a cluster. I have modified the script according to https://github.com/grzegorz-dubicki/spark/commit/5dd8458d2ab9753aae939b3bb33be953e2c13a70 But the script was still hung at the following message: Waiting for cluster to enter 'ssh-ready'

Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Ashish Rangole
By default, the files will be created under the path provided as the argument for saveAsTextFile. This argument is treated as a folder in the bucket and the actual files are created in it with the naming convention part-n, where n is the output partition number. On Mon, Jan 26, 2015 at
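
A small illustration of the layout described above, using the path from the thread and assuming S3 credentials are already configured; the sample data is illustrative:

    val rdd = sc.parallelize(Seq("a", "b", "c"))
    rdd.saveAsTextFile("s3n://nexgen-software/dev/output")
    // expected layout, roughly:
    //   s3n://nexgen-software/dev/output/part-00000
    //   s3n://nexgen-software/dev/output/part-00001
    //   ...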

Re: spark 1.2 - Writing parque fails for timestamp with Unsupported datatype TimestampType

2015-01-26 Thread Manoj Samel
Awesome ! That would be great !! On Mon, Jan 26, 2015 at 3:18 PM, Michael Armbrust mich...@databricks.com wrote: I'm aiming for 1.3. On Mon, Jan 26, 2015 at 3:05 PM, Manoj Samel manojsamelt...@gmail.com wrote: Thanks Michael. I am sure there have been many requests for this support. Any

Re: Spark 1.2 – How to change Default (Random) port ….

2015-01-26 Thread Shailesh Birari
Thanks. But after setting spark.shuffle.blockTransferService to nio, the application fails with Akka client disassociation. 15/01/27 13:38:11 ERROR TaskSchedulerImpl: Lost executor 3 on wynchcs218.wyn.cnw.co.nz: remote Akka client disassociated 15/01/27 13:38:11 INFO TaskSetManager: Re-queueing tasks

Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Franc Carter
AMI's are specific to an AWS region, so the ami-id of the spark AMI in us-west will be different if it exists. I can't remember where but I have a memory of seeing somewhere that the AMI was only in us-east cheers On Mon, Jan 26, 2015 at 8:47 PM, Håkan Jonsson haj...@gmail.com wrote: Thanks,

Re: SparkSQL tasks spend too much time to finish.

2015-01-26 Thread Yi Tian
Hi, San. You need to provide more information to diagnose this problem, such as: 1. What kind of SQL did you execute? 2. If there is a group operation in this SQL, could you gather some statistics on how many unique group keys there are in this case? On 1/26/15 17:01, luohui20...@sina.com wrote:

Re: Eclipse on spark

2015-01-26 Thread Luke Wilson-Mawer
I use this: http://scala-ide.org/ I also use Maven with this archetype: https://github.com/davidB/scala-archetype-simple. To be frank though, you should be fine using SBT. On Sat, Jan 24, 2015 at 6:33 PM, riginos samarasrigi...@gmail.com wrote: How to compile a Spark project in Scala IDE for

Re: [GraphX] Integration with TinkerPop3/Gremlin

2015-01-26 Thread Nicolas Colson
TinkerPop has become an Apache Incubator project and seems to have Spark in mind in their proposal https://wiki.apache.org/incubator/TinkerPopProposal. That's good news! I hope there will be nice collaborations between the communities. On Wed, Jan 7, 2015 at 11:31 AM, Nicolas Colson

Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Håkan Jonsson
Thanks, I also use Spark 1.2 with prebuilt for Hadoop 2.4. I launch both 1.1 and 1.2 with the same command: ./spark-ec2 -k foo -i bar.pem launch mycluster By default this launches in us-east-1. I tried changing the the region using: -r us-west-1 but that had the same result: Could not resolve

Worker never used by our Spark applications

2015-01-26 Thread Federico Ragona
Hello, we are running Spark 1.2.0 standalone on a cluster made up of 4 machines, each of them running one Worker and one of them also running the Master; they are all connected to the same HDFS instance. Until a few days ago, they were all configured with SPARK_WORKER_MEMORY = 18G

Re: RE: Shuffle to HDFS

2015-01-26 Thread bit1...@163.com
I had also thought that the Hadoop mapper output is saved on HDFS, at least if the job only has a Mapper but doesn't have a Reducer. If there is a reducer, will the map output then be saved on local disk? From: Shao, Saisai Date: 2015-01-26 15:23 To: Larry Liu CC:

Re: RE: Shuffle to HDFS

2015-01-26 Thread Sean Owen
If there is no Reducer, there is no shuffle. The Mapper output goes to HDFS, yes. But the question here is about shuffle files, right? Those are written by the Mapper to local disk. Reducers load them from the Mappers over the network then. Shuffle files do not go to HDFS. On Mon, Jan 26, 2015 at

Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Charles Feduke
I definitely have Spark 1.2 running within EC2 using the spark-ec2 scripts. I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later. What parameters are you using when you execute spark-ec2? I am launching in the us-west-1 region (ami-7a320f3f) which may explain things. On Mon Jan 26 2015

Re: Eclipse on spark

2015-01-26 Thread vaquar khan
I am using SBT On 26 Jan 2015 15:54, Luke Wilson-Mawer lukewilsonma...@gmail.com wrote: I use this: http://scala-ide.org/ I also use Maven with this archetype: https://github.com/davidB/scala-archetype-simple. To be frank though, you should be fine using SBT. On Sat, Jan 24, 2015 at 6:33

Re: Pairwise Processing of a List

2015-01-26 Thread Sean Owen
AFAIK ordering is not strictly guaranteed unless the RDD is the product of a sort. I think that in practice, you'll never find elements of a file read in some random order, for example (although see the recent issue about partition ordering potentially depending on how the local file system lists
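
A hedged sketch of one way to pair consecutive elements only after imposing an explicit order, which is the caveat Sean raises; the sample data and variable names are illustrative:

    import org.apache.spark.SparkContext._   // pair RDD functions on Spark 1.x

    val data = sc.parallelize(Seq(3, 1, 4, 1, 5, 9))

    // Impose an explicit total order, then index each element
    val indexed = data.sortBy(identity).zipWithIndex().map(_.swap)   // (index, value)

    // Join each element with its successor: yields (value at i, value at i + 1)
    val consecutive = indexed
      .join(indexed.map { case (i, v) => (i - 1, v) })
      .values
    consecutive.collect()   // pairs of consecutive elements in sorted order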

Re: Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Jörn Franke
Hi, What do your jobs do? Ideally post source code, but some description would already be helpful in order to support you. Memory leaks can have several reasons - it may not be Spark at all. Thank you. On 26 Jan 2015 22:28, Gerard Maas gerard.m...@gmail.com wrote: (looks like the list didn't like

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Michael Armbrust
I'm not actually using Hive at the moment - in fact, I'm trying to avoid it if I can. I'm just wondering whether Spark has anything similar I can leverage? Let me clarify, you do not need to have Hive installed, and what I'm suggesting is completely self-contained in Spark SQL. We support

Re: Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Gerard Maas
Hi Jörn, A memory leak on the job would be contained within the resources reserved for it, wouldn't it? And the job holding resources is not always the same. Sometimes it's one of the Streaming jobs, sometimes it's a heavy batch job that runs every hour. Looks to me that whatever is causing the

Re: spark 1.2 - Writing parque fails for timestamp with Unsupported datatype TimestampType

2015-01-26 Thread Michael Armbrust
I'm aiming for 1.3. On Mon, Jan 26, 2015 at 3:05 PM, Manoj Samel manojsamelt...@gmail.com wrote: Thanks Michael. I am sure there have been many requests for this support. Any release targeted for this? Thanks, On Sat, Jan 24, 2015 at 11:47 AM, Michael Armbrust mich...@databricks.com

Re: spark 1.2 - Writing parque fails for timestamp with Unsupported datatype TimestampType

2015-01-26 Thread Manoj Samel
Thanks Michael. I am sure there have been many requests for this support. Any release targeted for this? Thanks, On Sat, Jan 24, 2015 at 11:47 AM, Michael Armbrust mich...@databricks.com wrote: Those annotations actually don't work because the timestamp is SQL has optional nano-second

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Danny Yates
Ah, well that is interesting. I'll experiment further tomorrow. Thank you for the info!

Re: Spark webUI - application details page

2015-01-26 Thread ilaxes
Hi, I don't have any history server running. As SK already pointed out in a previous post, the history server seems to be required only in Mesos or YARN mode, not in standalone mode. https://spark.apache.org/docs/1.1.1/monitoring.html If Spark is run on Mesos or YARN, it is still possible to