IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Xuelin Cao
Hi, I'm trying to migrate some Hive scripts to Spark SQL. However, I found some statements are incompatible in Spark SQL. Here is my SQL, and the same SQL works fine in the Hive environment: SELECT *if(ad_user_id > 1000, 1000, ad_user_id) as user_id* FROM

Re: Linkage error - duplicate class definition

2015-01-20 Thread Hafiz Mujadid
Have you solved this problem? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Linkage-error-duplicate-class-definition-tp9482p21260.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-20 Thread Sean Owen
You can persist the RDD in (2) right after it is created. It will not cause it to be persisted immediately, but rather the first time it is materialized. If you persist after (3) is calculated, then it will be re-calculated (and persisted) after (4) is calculated. On Tue, Jan 20, 2015 at 3:38 AM,
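
A minimal sketch of the pattern Sean describes, with hypothetical RDD names; persisting right after step (2) means the first action caches the RDD and the later branch reuses the cached copy:

    // step (2): build the shared RDD and mark it for caching (nothing runs yet)
    val parsed = sc.textFile("hdfs:///input").map(_.split(","))
    parsed.persist()

    // steps (3) and (4): two independent computations branching off the same RDD
    val countsByKey = parsed.map(fields => (fields(0), 1L)).reduceByKey(_ + _)
    val distinctIds = parsed.map(_(1)).distinct()

    countsByKey.count()   // materializes and caches `parsed`
    distinctIds.count()   // reuses the cached copy instead of recomputing it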

spark streaming kinesis issue

2015-01-20 Thread Hafiz Mujadid
Hi experts! I am using spark streaming with kinesis and getting this exception while running program java.lang.LinkageError: loader (instance of org/apache/spark/executor/ChildExecutorURLClassLoader$userClassLoader$): attempted duplicate class definition for name:

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Xuelin Cao
Hi, Yes, this is what I'm doing. I'm using hiveContext.hql() to run my query. But the problem still happens. On Tue, Jan 20, 2015 at 7:24 PM, DEVAN M.S. msdeva...@gmail.com wrote: Add one more library libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0" val

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Can you share your code ? Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290 | On Tue, Jan 20, 2015 at 5:03 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hi, Yes, this is what I'm doing. I'm using hiveContext.hql() to run

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Xuelin Cao
Hi, I'm using Spark 1.2 On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote: Hi Xuelin, What version of Spark are you using? Thanks, Daoyuan *From:* Xuelin Cao [mailto:xuelincao2...@gmail.com] *Sent:* Tuesday, January 20, 2015 5:22 PM *To:* User

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-20 Thread Ashish
Thanks Sean ! On Tue, Jan 20, 2015 at 3:32 PM, Sean Owen so...@cloudera.com wrote: You can persist the RDD in (2) right after it is created. It will not cause it to be persisted immediately, but rather the first time it is materialized. If you persist after (3) is calculated, then it will be

Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Night Wolf
In Spark SQL we have Row objects which contain a list of fields that make up a row. A Row has ordinal accessors such as .getInt(0) or getString(2). Say ordinal 0 = ID and ordinal 1 = Name. It becomes hard to remember which ordinal is which, making the code confusing. Say for example I have the
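
A small sketch of the issue with hypothetical column names; ordinals are purely positional, so the meaning of each index has to be remembered:

    // Hypothetical schema: ordinal 0 = id (Int), ordinal 1 = name (String)
    val rows = sqlContext.sql("SELECT id, name FROM people")

    rows.map { row =>
      val id   = row.getInt(0)      // which column was ordinal 0 again?
      val name = row.getString(1)
      (id, name)
    }

    // One common workaround: give the ordinals names once, as constants
    object PeopleCols { val Id = 0; val Name = 1 }
    rows.map(row => (row.getInt(PeopleCols.Id), row.getString(PeopleCols.Name)))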

RE: Does Spark automatically run different stages concurrently when possible?

2015-01-20 Thread Bob Tiernay
I found the following to be a good discussion of the same topic: http://apache-spark-user-list.1001560.n3.nabble.com/The-concurrent-model-of-spark-job-stage-task-td13083.html From: so...@cloudera.com Date: Tue, 20 Jan 2015 10:02:20 + Subject: Re: Does Spark automatically run different

Spark on YARN: java.lang.ClassCastException SerializedLambda to org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1

2015-01-20 Thread thanhtien522
Hi, I try to run Spark on a YARN cluster by setting master to yarn-client in Java code. It works fine with the count task but not with other commands. It threw a ClassCastException: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field

Can multiple streaming apps use the same checkpoint directory?

2015-01-20 Thread Ashic Mahtab
Hi, For client mode spark submits of applications, we can do the following: def createStreamingContext() = { ... val sc = new SparkContext(conf) // Create a StreamingContext with a 1 second batch size val ssc = new StreamingContext(sc, Seconds(1)) } ... val ssc =
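
For reference, a sketch of the full client-mode pattern being discussed, assuming a SparkConf named conf and a hypothetical checkpoint path; StreamingContext.getOrCreate recovers from the checkpoint directory if one exists:

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-app"   // hypothetical path

    def createStreamingContext(): StreamingContext = {
      val sc  = new SparkContext(conf)
      // Create a StreamingContext with a 1 second batch size
      val ssc = new StreamingContext(sc, Seconds(1))
      // ... build the DStream graph here ...
      ssc.checkpoint(checkpointDir)
      ssc
    }

    // Recover from the checkpoint if present, otherwise create a fresh context
    val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
    ssc.start()
    ssc.awaitTermination()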

RE: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Wang, Daoyuan
Hi Xuelin, What version of Spark are you using? Thanks, Daoyuan From: Xuelin Cao [mailto:xuelincao2...@gmail.com] Sent: Tuesday, January 20, 2015 5:22 PM To: User Subject: IF statement doesn't work in Spark-SQL? Hi, I'm trying to migrate some hive scripts to Spark-SQL. However, I found

spark streaming with checkpoint

2015-01-20 Thread balu.naren
I am a beginner to spark streaming, so I have a basic question regarding checkpoints. My use case is to calculate the number of unique users by day. I am using reduceByKey and window for this, where my window duration is 24 hours and slide duration is 5 mins. I am updating the processed record to
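
A minimal sketch of the described setup, assuming a hypothetical DStream of (userId, 1L) pairs named events, with a 24-hour window sliding every 5 minutes:

    import org.apache.spark.streaming.Minutes
    import org.apache.spark.streaming.StreamingContext._   // pair DStream operations on Spark 1.x

    // events: DStream[(String, Long)] of (userId, 1L) pairs -- hypothetical
    val countsPerUser = events.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,   // merge counts for the same user
      Minutes(24 * 60),              // window duration: 24 hours
      Minutes(5))                    // slide duration: 5 minutes

    // one element per distinct user in the window, so count() gives unique users
    val uniqueUsers = countsPerUser.count()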

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Which context are you using HiveContext or SQLContext ? Can you try with HiveContext ?? Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290 | On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hi, I'm using

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Add one more library: libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0" val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) Replace sqlContext with hiveContext. It's working while using HiveContext for me. Devan M.S. | Research Associate | Cyber Security | AMRITA
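
Putting the suggestion together as a sketch, assuming a hypothetical table named ad_logs:

    // build.sbt
    // libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0"

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Hive's IF UDF is available through the HiveContext
    val result = hiveContext.sql(
      "SELECT if(ad_user_id > 1000, 1000, ad_user_id) AS user_id FROM ad_logs")
    result.collect().foreach(println)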

Re: Issues with constants in Spark HiveQL queries

2015-01-20 Thread yana
I run Spark 1.2 and do not have this issue. I don't believe the Hive version would matter (I run Spark 1.2 with the Hive12 profile) but that would be a good test. The last version I tried for you was a cdh4.2 spark1.2 prebuilt without pointing to an external hive install (in fact I tried it on a

RE: spark streaming with checkpoint

2015-01-20 Thread Shao, Saisai
Hi, Seems you have such a large window (24 hours), so the phenomenon of memory increasing is expected, because the window operation will cache the RDDs within this window in memory. So for your requirement, memory should be enough to hold the data of 24 hours. I don't think checkpoint in Spark

Re: Error for first run from iPython Notebook

2015-01-20 Thread Dave
Not sure if anyone who can help has seen this. Any suggestions would be appreciated, thanks! On Mon Jan 19 2015 at 1:50:43 PM Dave dla...@gmail.com wrote: Hi, I've setup my first spark cluster (1 master, 2 workers) and an iPython notebook server that I'm trying to setup to access the

Re: Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Sunita Arvind
The below is not exactly a solution to your question but this is what we are doing. The first time, we do end up doing row.getString() and we immediately parse it through a map function which aligns it to either a case class or a StructType. Then we register it as a table and use just column
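
A sketch of that pattern against the 1.2-era API with hypothetical fields; the ordinal calls happen exactly once, and everything downstream uses names:

    case class Person(id: Int, name: String)

    // Parse each Row once, right after the initial query
    val people = sqlContext.sql("SELECT id, name FROM raw_people")
      .map(row => Person(row.getInt(0), row.getString(1)))

    // Register the typed RDD as a table and refer to columns by name from here on
    import sqlContext.createSchemaRDD
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE id > 100")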

Re: Spark Streaming checkpoint recovery causes IO re-execution

2015-01-20 Thread RodrigoB
Hi Hannes, Good to know I'm not alone on the boat. Sorry about not posting back, I haven't gone in a while onto the user list. It's on my agenda to get over this issue. Will be very important for our recovery implementation. I have done an internal proof of concept but without any conclusions

Saving a mllib model in Spark SQL

2015-01-20 Thread Divyansh Jain
Hey people, I have run into some issues regarding saving the k-means mllib model in Spark SQL by converting to a schema RDD. This is what I am doing: case class Model(id: String, model: org.apache.spark.mllib.clustering.KMeansModel) import sqlContext.createSchemaRDD val rowRdd =

spark-submit --py-files remote: Only local additional python files are supported

2015-01-20 Thread Vladimir Grigor
Hi all! I found this problem when I tried running python application on Amazon's EMR yarn cluster. It is possible to run bundled example applications on EMR but I cannot figure out how to run a little bit more complex python application which depends on some other python scripts. I tried adding

getting started writing unit tests for my app

2015-01-20 Thread Matthew Cornell
Hi Folks, I'm writing a GraphX app and I need to do some test-driven development. I've got Spark running on our little cluster and have built and run some hello world apps, so that's all good. I've looked through the test source and found lots of helpful examples that use SharedSparkContext, and

Aggregate order semantics when spilling

2015-01-20 Thread Justin Uang
Hi, I am trying to aggregate a key based on some timestamp, and I believe that spilling to disk is changing the order of the data fed into the combiner. I have some timeseries data that is of the form: (key, date, other data) Partition 1 (A, 2, ...) (B, 4, ...) (A, 1, ...)

Re: NPE in Parquet

2015-01-20 Thread Iulian Dragoș
It’s an array.length, where the array is null. Looking through the code, it looks like the type converter assumes that FileSystem.globStatus never returns null, but that is not the case. Digging through the Hadoop codebase, inside Globber.glob, here’s what I found: /* * When the input

NPE in Parquet

2015-01-20 Thread Alessandro Baretta
All, I strongly suspect this might be caused by a glitch in the communication with Google Cloud Storage where my job is writing to, as this NPE exception shows up fairly randomly. Any ideas? Exception in thread Thread-126 java.lang.NullPointerException at

PySpark Client

2015-01-20 Thread Chris Beavers
Hey all, Is there any notion of a lightweight python client for submitting jobs to a Spark cluster remotely? If I essentially install Spark on the client machine, and that machine has the same OS, same version of Python, etc., then I'm able to communicate with the cluster just fine. But if Python

RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Wang, Ningjun (LNG-NPV)
Can anybody answer this? Do I have to have hdfs to achieve this? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent: Friday, January 16, 2015 1:15 PM To: Imran

Apply function to all elements along each key

2015-01-20 Thread Luis Guerra
Hi all, I would like to apply a function over all elements for each key (assuming key-value RDD). For instance, imagine I have: import numpy as np a = np.array([[1, 'hola', 'adios'],[2, 'hi', 'bye'],[2, 'hello', 'goodbye']]) a = sc.parallelize(a) Then I want to create a key-value RDD, using the

Re: PySpark Client

2015-01-20 Thread Andrew Or
Hi Chris, Short answer is no, not yet. Longer answer is that PySpark only supports client mode, which means your driver runs on the same machine as your submission client. By corollary this means your submission client must currently depend on all of Spark and its dependencies. There is a patch

Re: spark-submit --py-files remote: Only local additional python files are supported

2015-01-20 Thread Andrew Or
Hi Vladimir, Yes, as the error messages suggests, PySpark currently only supports local files. This does not mean it only runs in local mode, however; you can still run PySpark on any cluster manager (though only in client mode). All this means is that your python files must be on your local file

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-20 Thread Kane Kim
Related question - is execution of different stages optimized? I.e. will a map followed by a filter require two passes, or will they be combined into a single one? On Tue, Jan 20, 2015 at 4:33 AM, Bob Tiernay btier...@hotmail.com wrote: I found the following to be a good discussion of the same topic:

Re: Aggregate order semantics when spilling

2015-01-20 Thread Andrew Or
Hi Justin, I believe the intended semantics of groupByKey or cogroup is that the ordering *within a key* is not preserved if you spill. In fact, the test cases for the ExternalAppendOnlyMap only assert that the Set representation of the results is as expected (see this line
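
If a stable per-key order is required after such a shuffle, one hedged workaround is to re-sort the values within each key after grouping; a sketch with a hypothetical record layout, sorting by the date field:

    // records: RDD[(String, (Int, String))] = (key, (date, otherData)) -- hypothetical
    val grouped = records.groupByKey()

    // Re-impose a deterministic order inside each key after the shuffle/spill
    val sortedWithinKey = grouped.mapValues(vs => vs.toSeq.sortBy(_._1))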

Re: How to output to S3 and keep the order

2015-01-20 Thread Anny Chen
Thanks Aniket! It is working now. Anny On Mon, Jan 19, 2015 at 5:56 PM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: When you repartiton, ordering can get lost. You would need to sort after repartitioning. Aniket On Tue, Jan 20, 2015, 7:08 AM anny9699 anny9...@gmail.com wrote:

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Cheng Lian
IF is implemented as a generic UDF in Hive (GenericUDFIf). It seems that this function can’t be properly resolved. Could you provide a minimum code snippet that reproduces this issue? Cheng On 1/20/15 1:22 AM, Xuelin Cao wrote: Hi, I'm trying to migrate some hive scripts to

Re: Spark SQL: Assigning several aliases to the output (several return values) of an UDF

2015-01-20 Thread Cheng Lian
Guess this can be helpful: http://stackoverflow.com/questions/14252615/stack-function-in-hive-how-to-specify-multiple-aliases On 1/19/15 8:26 AM, mucks17 wrote: Hello I use Hive on Spark and have an issue with assigning several aliases to the output (several return values) of an UDF. I ran

Re: Support for SQL on unions of tables (merge tables?)

2015-01-20 Thread Cheng Lian
I think you can resort to a Hive table partitioned by date https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables On 1/11/15 9:51 PM, Paul Wais wrote: Dear List, What are common approaches for addressing over a union of tables / RDDs? E.g.

Re: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Davies Liu
If the dataset is not huge (in a few GB), you can setup NFS instead of HDFS (which is much harder to setup): 1. export a directory in master (or anyone in the cluster) 2. mount it in the same position across all slaves 3. read/write from it by file:///path/to/monitpoint On Tue, Jan 20, 2015 at
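
A sketch of the read/write side of that setup, assuming rdd is an RDD[String] and /mnt/shared is the NFS mount point visible at the same path on every node (hypothetical):

    // Write the RDD to the shared mount; because every node sees the same path,
    // all part files land in one place
    rdd.saveAsObjectFile("file:///mnt/shared/my-rdd")

    // Read it back later from any node in the cluster
    val restored = sc.objectFile[String]("file:///mnt/shared/my-rdd")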

RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Mohammed Guller
I don’t think it will work without HDFS. Mohammed From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent: Tuesday, January 20, 2015 7:55 AM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: RE: Can I save RDD to local file system and then read it back on spark

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Ted Yu
Please also see this thread: http://search-hadoop.com/m/JW1q5De7pU1 On Tue, Jan 20, 2015 at 3:58 PM, Sean Owen so...@cloudera.com wrote: Guava is shaded in Spark 1.2+. It looks like you are mixing versions of Spark then, with some that still refer to unshaded Guava. Make sure you are not

Re: Saving a mllib model in Spark SQL

2015-01-20 Thread Cheng Lian
Yeah, as Michael said, I forgot that UDT is not a public API. Xiangrui's suggestion makes more sense. Cheng On 1/20/15 12:49 PM, Xiangrui Meng wrote: You can save the cluster centers as a SchemaRDD of two columns (id: Int, center: Array[Double]). When you load it back, you can construct the

Re: Spark Sql reading whole table from cache instead of required coulmns

2015-01-20 Thread Cheng Lian
Hey Surbhit, In this case, the web UI stats is not accurate. Please refer to this thread for an explanation: https://www.mail-archive.com/user@spark.apache.org/msg18919.html Cheng On 1/13/15 1:46 AM, Surbhit wrote: Hi, I am using spark 1.1.0. I am using the spark-sql shell to run all the

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Sean Owen
Guava is shaded in Spark 1.2+. It looks like you are mixing versions of Spark then, with some that still refer to unshaded Guava. Make sure you are not packaging Spark with your app and that you don't have other versions lying around. On Tue, Jan 20, 2015 at 11:55 PM, Shailesh Birari

Re: Error for first run from iPython Notebook

2015-01-20 Thread Felix C
+1. I can confirm this. It says collect fails in Py4J --- Original Message --- From: Dave dla...@gmail.com Sent: January 20, 2015 6:49 AM To: user@spark.apache.org Subject: Re: Error for first run from iPython Notebook Not sure if anyone who can help has seen this. Any suggestions would be

Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Shailesh Birari
Hello, I recently upgraded my setup from Spark 1.1 to Spark 1.2. My existing applications are working fine on ubuntu cluster. But, when I try to execute Spark MLlib application from Eclipse (Windows node) it gives java.lang.NoClassDefFoundError: com/google/common/base/Preconditions exception.

Spark 1.1.0 - spark-submit failed

2015-01-20 Thread ey-chih chow
Hi, I issued the following command in a ec2 cluster launched using spark-ec2: ~/spark/bin/spark-submit --class com.crowdstar.cluster.etl.ParseAndClean --master spark://ec2-54-185-107-113.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --total-executor-cores 4

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-20 Thread Mark Hamstra
A map followed by a filter will not be two stages, but rather one stage that pipelines the map and filter. On Jan 20, 2015, at 10:26 AM, Kane Kim kane.ist...@gmail.com wrote: Related question - is execution of different stages optimized? I.e. map followed by a filter will require 2 loops
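
A small illustration: both transformations are narrow, so Spark applies them element-by-element in one pass per partition within a single stage:

    val nums = sc.parallelize(1 to 1000000)

    // map and filter are pipelined: no intermediate collection is materialized
    val result = nums.map(_ * 2).filter(_ % 3 == 0)

    result.count()   // the action triggers one (single-stage) pass over the data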

Re: S3 Bucket Access

2015-01-20 Thread bbailey
Hi sranga, Were you ever able to get authentication working with the temporary IAM credentials (id, secret, token)? I am in the same situation and it would be great if we could document a solution so others can benefit from this Thanks! sranga wrote Thanks Rishi. That is exactly what I am

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Aaron Davidson
Spark's network-common package depends on guava as a provided dependency in order to avoid conflicting with other libraries (e.g., Hadoop) that depend on specific versions. com/google/common/base/Preconditions has been present in Guava since version 2, so this is likely a dependency not found

Task result deserialization error (1.1.0)

2015-01-20 Thread Dmitriy Lyubimov
Hi, I am getting task result deserialization error (kryo is enabled). Is it some sort of `chill` registration issue at front end? This is application that lists spark as maven dependency (so it gets correct hadoop and chill dependencies in classpath, i checked). Thanks in advance. 15/01/20

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Shailesh Birari
Thanks Aaron. Adding Guava jar resolves the issue. Shailesh On Wed, Jan 21, 2015 at 3:26 PM, Aaron Davidson ilike...@gmail.com wrote: Spark's network-common package depends on guava as a provided dependency in order to avoid conflicting with other libraries (e.g., Hadoop) that depend on

RangePartitioner

2015-01-20 Thread Rishi Yadav
I am joining two tables as below; the program stalls at the log line below and never proceeds. What might be the issue and a possible solution? INFO SparkContext: Starting job: RangePartitioner at Exchange.scala:79 Table 1 has ~450 columns. Table 2 has ~100 columns. Both tables have a few million

Python connector for spark-cassandra

2015-01-20 Thread Nishant Sinha
Hello everyone, Is there a python connector for Spark and Cassandra as there is one for Java? I found a Java connector by DataStax on github: https://github.com/datastax/spark-cassandra-connector I am looking for something similar in Python. Thanks

Re: Spark Streaming with Kafka

2015-01-20 Thread firemonk9
Hi, I am having similar issues. Have you found any resolution ? Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Kafka-tp21222p21276.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Shailesh Birari
Hi Frank, It's a normal Eclipse project where I added Scala and Spark libraries as user libraries. Though I am not attaching any Hadoop libraries, in my application code I have the following line: System.setProperty("hadoop.home.dir", "C:\\SB\\HadoopWin") This Hadoop home dir contains winutils.exe

What will happen if the Driver exits abnormally?

2015-01-20 Thread personal_email0
As in the title. Is there some mechanism to recover so that the job can be completed? Any comments will be very appreciated. Best Regards, Anzhsoft - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional

MapType in spark-sql

2015-01-20 Thread Kevin Jung
Hi all, How can I add MapType and ArrayType to a schema when I create a StructType programmatically? val schema = StructType( schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true))) The above code from the Spark documentation works fine, but if I change StringType to MapType or

Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread TJ Klein
Hi, I just recently tried to migrate from Spark 1.1 to Spark 1.2 - using PySpark. Initially, I was super glad, noticing that Spark 1.2 is way faster than Spark 1.1. However, the initial joy faded quickly when I noticed that my jobs no longer terminated successfully. Using

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Shailesh Birari
Hello, I double checked the libraries. I am linking only with Spark 1.2. Along with Spark 1.2 jars I have Scala 2.10 jars and JRE 7 jars linked and nothing else. Thanks, Shailesh On Wed, Jan 21, 2015 at 12:58 PM, Sean Owen so...@cloudera.com wrote: Guava is shaded in Spark 1.2+. It looks

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-20 Thread Cheng Lian
Hey Yana, Sorry for the late reply, missed this important thread somehow. And many thanks for reporting this. It turned out to be a bug — filter pushdown is only enabled when using client side metadata, which is not expected, because task side metadata code path is more performant. And I

RE: dynamically change receiver for a spark stream

2015-01-20 Thread Shao, Saisai
Hi, I don't think current Spark Streaming support this feature, all the DStream lineage is fixed after the context is started. Also stopping a stream is not supported, instead currently we need to stop the whole streaming context to meet what you want. Thanks Saisai -Original

Re: pyspark sc.textFile uses only 4 out of 32 threads per node

2015-01-20 Thread Nicholas Chammas
Are the gz files roughly equal in size? Do you know that your partitions are roughly balanced? Perhaps some cores get assigned tasks that end very quickly, while others get most of the work. On Sat Jan 17 2015 at 2:02:49 AM Gautham Anil gautham.a...@gmail.com wrote: Hi, Thanks for getting

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Frank Austin Nothaft
Shailesh, To add, are you packaging Hadoop in your app? Hadoop will pull in Guava. Not sure if you are using Maven (or what) to build, but if you can pull up your builds dependency tree, you will likely find com.google.guava being brought in by one of your dependencies. Regards, Frank Austin

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-20 Thread jagaximo
Kevin (Sangwoo) Kim wrote If keys are not too many, You can do like this: val data = List( ("A", Set(1,2,3)), ("A", Set(1,2,4)), ("B", Set(1,2,3)) ) val rdd = sc.parallelize(data) rdd.persist() rdd.filter(_._1 == "A").flatMap(_._2).distinct.count rdd.filter(_._1 ==

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-20 Thread Kevin (Sangwoo) Kim
Great to hear you got a solution!! Cheers! Kevin On Wed Jan 21 2015 at 11:13:44 AM jagaximo takuya_seg...@dwango.co.jp wrote: Kevin (Sangwoo) Kim wrote If keys are not too many, You can do like this: val data = List( ("A", Set(1,2,3)), ("A", Set(1,2,4)), ("B", Set(1,2,3)) ) val

Re: MapType in spark-sql

2015-01-20 Thread Cheng Lian
You need to provide key type, value type for map type, element type for array type, and whether they contain null: StructType(Array( StructField("map_field", MapType(keyType = IntegerType, valueType = StringType, containsNull = true), nullable = true),
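
Expanded into a fuller sketch with hypothetical field names (positional arguments used for brevity; on Spark 1.3+ the same types live under org.apache.spark.sql.types):

    import org.apache.spark.sql._

    val schema = StructType(Array(
      StructField("name", StringType, true),
      StructField("map_field", MapType(IntegerType, StringType, true), true),   // key type, value type, valueContainsNull
      StructField("array_field", ArrayType(StringType, true), true)))           // element type, containsNull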

RE: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Bob Tiernay
If using Maven, one can simply use whatever version they prefer and shade the artifact at build time using something like: <build> <plugins> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-shade-plugin</artifactId> <executions> <execution>

Fwd: [Spark Streaming] The FileInputDStream newFilesOnly=false does not work in 1.2 since

2015-01-20 Thread Terry Hole
Hi, I am trying to move from 1.1 to 1.2 and found that newFilesOnly=false (intended to include old files) does not work anymore. It works great in 1.1; this seems to have been introduced by the last change to this class. Did this flag's behavior change, or is it a regression? The issue should be caused by

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-20 Thread Cheng Lian
On 1/15/15 11:26 PM, Nathan McCarthy wrote: Thanks Cheng! Is there any API I can get access too (e.g. ParquetTableScan) which would allow me to load up the low level/baseRDD of just RDD[Row] so I could avoid the defensive copy (maybe lose our on columnar storage etc.). We have parts of our

Re: Aggregate order semantics when spilling

2015-01-20 Thread Justin Uang
Hi Andrew, Thanks for your response! For our use case, we aren't actually grouping, but rather updating running aggregates. I just picked grouping because it made the example easier to write out. However, when we merge combiners, the combiners have to have data that are adjacent to each other in

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread Davies Liu
Could you provide a short script to reproduce this issue? On Tue, Jan 20, 2015 at 9:00 PM, TJ Klein tjkl...@gmail.com wrote: Hi, I just recently tried to migrate from Spark 1.1 to Spark 1.2 - using PySpark. Initially, I was super glad, noticing that Spark 1.2 is way faster than Spark 1.1.

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-20 Thread Fengyun RAO
the LogParser instance is not serializable, and thus cannot be a broadcast; what’s worse, it contains an LRU cache, which is essential to the performance, and which we would like to share among all the tasks on the same node. If that is the case, what’s the recommended way to share a variable among all
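
One common way to share a non-serializable, per-node object (the singleton approach discussed later in this thread) looks roughly like the following sketch, with a hypothetical LogParser and input path:

    // Not serializable; holds an internal LRU cache -- hypothetical stand-in
    class LogParser { def parse(line: String): String = line /* ... */ }

    object LogParserHolder {
      // Initialized at most once per executor JVM, shared by all tasks on that node
      lazy val parser = new LogParser
    }

    val parsed = sc.textFile(inputPath).mapPartitions { lines =>
      val p = LogParserHolder.parser   // looked up on the executor, never serialized
      lines.map(p.parse)
    }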

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread Tassilo Klein
Hi, It's a bit of a longer script that runs some deep learning training. Therefore it is a bit hard to wrap up easily. Essentially I am having a loop, in which a gradient is computed on each node and collected (this is where it freezes at some point). grads =

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread Davies Liu
Could you try to disable the new feature of reused worker by: spark.python.worker.reuse = false On Tue, Jan 20, 2015 at 11:12 PM, Tassilo Klein tjkl...@bwh.harvard.edu wrote: Hi, It's a bit of a longer script that runs some deep learning training. Therefore it is a bit hard to wrap up easily.

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-20 Thread Sean Owen
I don't know of any reason to think the singleton pattern doesn't work or works differently. I wonder if, for example, task scheduling is different in 1.2 and you have more partitions across more workers and so are loading more copies more slowly into your singletons. On Jan 21, 2015 7:13 AM,

KNN for large data set

2015-01-20 Thread DEVAN M.S.
Hi all, Please help me to find out best way for K-nearest neighbor using spark for large data sets.

spark 1.2 three times slower than spark 1.1, why?

2015-01-20 Thread Fengyun RAO
Currently we are migrating from spark 1.1 to spark 1.2, but found the program 3x slower, with nothing else changed. Note: our program in spark 1.1 has successfully processed a whole year's data, quite stable. The main script is as below: sc.textFile(inputPath) .flatMap(line =

Re: Spark 1.1.0 - spark-submit failed

2015-01-20 Thread Ted Yu
Please check which netty jar(s) are on the classpath. NioWorkerPool(Executor workerExecutor, int workerCount) was added in netty 3.5.4 Cheers On Tue, Jan 20, 2015 at 4:15 PM, ey-chih chow eyc...@hotmail.com wrote: Hi, I issued the following command in a ec2 cluster launched using spark-ec2:

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-20 Thread Davies Liu
Maybe some change related to serialize the closure cause LogParser is not a singleton any more, then it is initialized for every task. Could you change it to a Broadcast? On Tue, Jan 20, 2015 at 10:39 PM, Fengyun RAO raofeng...@gmail.com wrote: Currently we are migrating from spark 1.1 to spark

Closing over a var with changing value in Streaming application

2015-01-20 Thread Tobias Pfeiffer
Hi, I am developing a Spark Streaming application where I want every item in my stream to be assigned a unique, strictly increasing Long. My input data already has RDD-local integers (from 0 to N-1) assigned, so I am doing the following: var totalNumberOfItems = 0L // update the keys of the

Re: Closing over a var with changing value in Streaming application

2015-01-20 Thread Akhil Das
How about using accumulators http://spark.apache.org/docs/1.2.0/programming-guide.html#accumulators? Thanks Best Regards On Wed, Jan 21, 2015 at 12:53 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, I am developing a Spark Streaming application where I want every item in my stream to be
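
For reference, a minimal accumulator sketch (hypothetical DStream named stream); as the reply below notes, this covers counting but not assigning per-item IDs:

    import org.apache.spark.SparkContext._   // accumulator implicits on Spark 1.x

    val totalItems = sc.accumulator(0L)   // driver-visible running count

    stream.foreachRDD { rdd =>
      rdd.foreach(_ => totalItems += 1)                     // updated on the executors
      println(s"items seen so far: ${totalItems.value}")    // readable only on the driver
    }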

Re: How to share a NonSerializable variable among tasks in the same worker node?

2015-01-20 Thread Fengyun RAO
currently we are migrating from 1.1 to 1.2, and found our program 3x slower, maybe due to the singleton hack? Could you explain in detail why, or how the singleton hack works differently in Spark 1.2.0? Thanks! 2015-01-18 20:56 GMT+08:00 octavian.ganea octavian.ga...@inf.ethz.ch: The singleton

Re: [Spark Streaming] The FileInputDStream newFilesOnly=false does not work in 1.2 since

2015-01-20 Thread Sean Owen
See also SPARK-3276 and SPARK-3553. Can you say more about the problem? what are the file timestamps, what happens when you run, what log messages if any are relevant. I do not expect there was any intended behavior change. On Wed, Jan 21, 2015 at 5:17 AM, Terry Hole hujie.ea...@gmail.com wrote:

Re: Closing over a var with changing value in Streaming application

2015-01-20 Thread Tobias Pfeiffer
Hi, On Wed, Jan 21, 2015 at 4:46 PM, Akhil Das ak...@sigmoidanalytics.com wrote: How about using accumulators http://spark.apache.org/docs/1.2.0/programming-guide.html#accumulators? As far as I understand, they solve the part of the problem that I am not worried about, namely increasing the

Re: MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-20 Thread Xiangrui Meng
The assumption of implicit feedback model is that the unobserved ratings are more likely to be negative. So you may want to add some negatives for evaluation. Otherwise, the input ratings are all 1 and the test ratings are all 1 as well. The baseline predictor, which uses the average rating (that

Re: How to create distributed matrixes from hive tables.

2015-01-20 Thread Xiangrui Meng
You can get a SchemaRDD from the Hive table, map it into a RDD of Vectors, and then construct a RowMatrix. The transformations are lazy, so there is no external storage requirement for intermediate data. -Xiangrui On Sun, Jan 18, 2015 at 4:07 AM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi, We
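
A sketch of those steps, assuming a HiveContext named hiveContext and a hypothetical Hive table named features with three numeric columns:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // 1. Query the Hive table into a SchemaRDD
    val rows = hiveContext.sql("SELECT c1, c2, c3 FROM features")

    // 2. Lazily map each Row to an MLlib Vector (no intermediate storage needed)
    val vectors = rows.map(r => Vectors.dense(r.getDouble(0), r.getDouble(1), r.getDouble(2)))

    // 3. Wrap the RDD[Vector] in a distributed RowMatrix
    val mat = new RowMatrix(vectors)
    println(s"${mat.numRows()} x ${mat.numCols()}")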

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-20 Thread Cheng Lian
First of all, even if the underlying dataset is partitioned as expected, a shuffle can’t be avoided. Because Spark SQL knows nothing about the underlying data distribution. However, this does reduce network IO. You can prepare your data like this (say CustomerCode is a string field with
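
A sketch of that preparation step on the RDD side, assuming a hypothetical pair RDD keyed by CustomerCode:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._   // pair RDD functions on Spark 1.x

    // records: RDD[(String, Row)] keyed by CustomerCode -- hypothetical
    val partitioner = new HashPartitioner(200)

    val byCustomer = records
      .partitionBy(partitioner)   // co-locate all rows for the same CustomerCode
      .persist()                  // keep the partitioned layout around for reuse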

Re: Saving a mllib model in Spark SQL

2015-01-20 Thread Xiangrui Meng
You can save the cluster centers as a SchemaRDD of two columns (id: Int, center: Array[Double]). When you load it back, you can construct the k-means model from its cluster centers. -Xiangrui On Tue, Jan 20, 2015 at 11:55 AM, Cheng Lian lian.cs@gmail.com wrote: This is because KMeanModel is
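
A sketch of that approach against the 1.2-era SQL API, assuming model is the trained KMeansModel and using hypothetical paths; the runtime type of the array column on read can vary by version, so the cast is an assumption:

    import org.apache.spark.mllib.clustering.KMeansModel
    import org.apache.spark.mllib.linalg.Vectors
    import sqlContext.createSchemaRDD

    case class Center(id: Int, center: Seq[Double])

    // Save: one (id, center) row per cluster, written as Parquet via Spark SQL
    val centers = model.clusterCenters.zipWithIndex.map { case (c, i) => Center(i, c.toArray.toSeq) }
    sc.parallelize(centers).saveAsParquetFile("hdfs:///models/kmeans-centers")

    // Load: read the centers back and rebuild the model from them
    val restored = new KMeansModel(
      sqlContext.parquetFile("hdfs:///models/kmeans-centers")
        .map(r => (r.getInt(0), r(1).asInstanceOf[Seq[Double]]))   // assumed read-back type
        .collect().sortBy(_._1)
        .map { case (_, c) => Vectors.dense(c.toArray) })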

Re: Is there any way to support multiple users executing SQL on thrift server?

2015-01-20 Thread Cheng Lian
Hey Yi, I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would like to investigate this issue later. Would you please open a JIRA for it? Thanks! Cheng On 1/19/15 1:00 AM, Yi Tian wrote: Is there any way to support multiple users executing SQL on one thrift server? I

Re: Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Cheng Lian
I had once worked on a named row feature but haven’t got time to finish it. It looks like this: sql(...).named.map { row: NamedRow => row[Int]('key) -> row[String]('value) } Basically the named method generates a field name to ordinal map for each RDD partition. This map is then shared

dynamically change receiver for a spark stream

2015-01-20 Thread jamborta
Hi all, we have been trying to set up a stream using a custom receiver that would pick up data from SQL databases. We'd like to keep that stream context running and dynamically change the streams on demand, adding and removing streams based on demand. Alternatively, if a stream is fixed, is it

Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-20 Thread Cheng Lian
spark.sql.parquet.filterPushdown defaults to false because there’s a bug in Parquet which may cause NPE, please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration This bug hasn’t been fixed in Parquet master. We’ll turn this on once the bug is fixed.
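
Enabling the flag (at one's own risk, given the bug mentioned above) is a one-line configuration change, e.g.:

    // per-SQLContext setting
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // or at submit time (hypothetical invocation):
    //   spark-submit --conf spark.sql.parquet.filterPushdown=true ...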

Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-20 Thread Cheng Lian
In Spark SQL, Parquet filter pushdown doesn’t cover HiveTableScan for now. May I ask why you prefer HiveTableScan rather than ParquetTableScan? Cheng On 1/19/15 5:02 PM, Xiaoyu Wang wrote: The spark.sql.parquet.filterPushdown=true has been turned on. But set

Re: dynamically change receiver for a spark stream

2015-01-20 Thread Akhil Das
Can you not do it with RDDs? Thanks Best Regards On Wed, Jan 21, 2015 at 12:38 AM, jamborta jambo...@gmail.com wrote: Hi all, we have been trying to setup a stream using a custom receiver that would pick up data from sql databases. we'd like to keep that stream context running and

Re: Saving a mllib model in Spark SQL

2015-01-20 Thread Cheng Lian
This is because KMeansModel is neither a built-in type nor a user defined type recognized by Spark SQL. I think you can write your own UDT version of KMeansModel in this case. You may refer to o.a.s.mllib.linalg.Vector and o.a.s.mllib.linalg.VectorUDT as an example. Cheng On 1/20/15

Confused about shuffle read and shuffle write

2015-01-20 Thread Darin McBeath
I have the following code in a Spark job. // Get the baseline input file(s) JavaPairRDD<Text, Text> hsfBaselinePairRDDReadable = sc.hadoopFile(baselineInputBucketFile, SequenceFileInputFormat.class, Text.class, Text.class); JavaPairRDD<String, String> hsfBaselinePairRDD =