Re: Spark ANSI SQL Support

2017-01-17 Thread Andrew Ash
Rishabh, Have you come across any ANSI SQL queries that Spark SQL didn't support? I'd be interested to hear if you have. Andrew On Tue, Jan 17, 2017 at 8:14 PM, Deepak Sharma wrote: > From spark documentation page: > Spark SQL can now run all 99 TPC-DS queries. > > On

Middleware-wrappers for Spark

2017-01-17 Thread Rick Moritz
Hi List, I've been following several projects with quite some interest over the past few years, and I've continued to wonder why they're not moving towards being supported by mainstream Spark distributions, and why they aren't more frequently mentioned when it comes to enterprise adoption of Spark.

Re: Spark/Parquet/Statistics question

2017-01-17 Thread Michael Segel
Hi, Lexicographically speaking, min/max should work because strings support a comparator. So anything which supports ordering comparisons (<, >, <=, >=, == …) can also support min and max functions. I guess the question is whether Spark does support this, and if not, why? Yes,

Quick but probably silly question...

2017-01-17 Thread Michael Segel
Hi, While the Parquet file is immutable and the data sets are immutable, how does Spark SQL handle updates or deletes? I mean, if I read in a file using SQL into an RDD, mutate it, e.g. delete a row, and then persist it, I now have two files. If I reread the table back in … will I see duplicates

Spark main thread quits, but the driver JVM doesn't crash

2017-01-17 Thread John Fang
My Spark main thread creates some daemon threads. Then the Spark application throws some exceptions and the main thread quits, but the driver JVM doesn't exit. What can I do? For example: val sparkConf = new SparkConf().setAppName("NetworkWordCount")
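A minimal sketch of one way around this, assuming the goal is to force the driver JVM to exit when the main logic fails (the object name and application logic are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object NetworkWordCount {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("NetworkWordCount")
        val sc = new SparkContext(sparkConf)
        try {
          // ... application logic that may throw ...
        } catch {
          case e: Throwable =>
            e.printStackTrace()
            sc.stop()
            // Force the JVM down even if non-daemon threads are still running.
            sys.exit(1)
        }
      }
    }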

Re: Spark/Parquet/Statistics question

2017-01-17 Thread Jörn Franke
Hello, I am not sure what you mean by min/max for strings; I do not know if this makes sense. What the ORC format has is bloom filters for strings etc. - are you referring to this? In order to apply min/max filters, Spark needs to read the metadata of the file. If the filter is applied or
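For reference, a small sketch of turning on Parquet filter pushdown so that row groups can be skipped based on their min/max metadata (the path and column names are made up; spark.sql.parquet.filterPushdown is the standard config key):

    import org.apache.spark.sql.functions.col

    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
    // Row groups whose min/max statistics exclude the predicate can then be skipped at read time.
    val filtered = spark.read.parquet("/tmp/events.parquet").filter(col("id") > 900)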

Struct as parameter

2017-01-17 Thread Tomas Carini
Hi all. I have the following schema that I need to filter with spark sql |-- msg_id: string (nullable = true) |-- country: string (nullable = true) |-- sent_ts: timestamp (nullable = true) |-- marks: array (nullable = true) ||-- element: struct (containsNull = true) |||-- namespace:
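One possible approach, sketched under the assumption that the goal is to filter rows on a field inside the marks structs (struct field names other than namespace are guesses):

    import org.apache.spark.sql.functions._

    // Flatten the array of structs, then filter on a struct field.
    val exploded = df.select(col("msg_id"), col("country"), col("sent_ts"), explode(col("marks")).as("mark"))
    exploded.filter(col("mark.namespace") === "some_namespace").show()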

Spark main thread quits, but the driver doesn't crash on a standalone cluster

2017-01-17 Thread John Fang
My Spark main thread creates some daemon threads, which may be timer threads. Then the Spark application throws some exceptions and the main thread quits, but the driver JVM doesn't exit on a standalone cluster. Of course, the problem doesn't happen on a YARN cluster, because the application

Re: Quick but probably silly question...

2017-01-17 Thread Jörn Franke
You run compaction, i.e. save the modified/deleted records in a dedicated delta file. Every now and then you compare the original and delta files and create a new version. When querying before compaction, you need to check both the original and the delta file. I don't think ORC needs Tez for it, but it
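A rough sketch of such a compaction pass, assuming both files share a primary key ("id") and a version column ("updated_at"); deleted records would additionally need a tombstone flag to filter out:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val base  = spark.read.parquet("/data/table/base")
    val delta = spark.read.parquet("/data/table/delta")

    // Keep only the newest version of each key across base + delta.
    val w = Window.partitionBy("id").orderBy(col("updated_at").desc)
    val compacted = base.union(delta)
      .withColumn("rn", row_number().over(w))
      .filter(col("rn") === 1)
      .drop("rn")

    compacted.write.mode("overwrite").parquet("/data/table/base_v2")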

Re: How State of mapWithState is distributed on nodes

2017-01-17 Thread manasdebashiskar
Does anyone have an answer? How does the state distribution happen among multiple nodes? I have seen that in a "mapWithState"-based workflow the streaming job simply hangs when the node containing all the state dies because of OOM. ..Manas
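Not an answer from the thread, but for context: mapWithState state is keyed and can be spread over more partitions via StateSpec.numPartitions. A sketch along the lines of the standard stateful word-count example (variable names are illustrative; wordDstream is assumed to be a DStream[(String, Int)]):

    import org.apache.spark.streaming.{State, StateSpec}

    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (word, sum)
    }

    // Spread the keyed state over more partitions instead of the default.
    val stateDstream = wordDstream.mapWithState(
      StateSpec.function(mappingFunc).numPartitions(200))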

Re: Middleware-wrappers for Spark

2017-01-17 Thread Rick Moritz
To give a bit of a background to my post: I'm currently looking into evaluating whether it could be possible to deploy a Spark-powered DataWareHouse-like structure. In particular the client is interested in evaluating Spark's in-memory caches as "transient persistent layer". Although in theory

Re: Middleware-wrappers for Spark

2017-01-17 Thread Sean Owen
On Tue, Jan 17, 2017 at 4:49 PM Rick Moritz wrote: > * Oryx2 - This was more focused on a particular issue, and looked to be a > very nice framework for deploying real-time analytics --- but again, no > real traction. In fact, I've heard of PoCs being done by/for Cloudera, to >

Re: Spark/Parquet/Statistics question

2017-01-17 Thread Dong Jiang
Hi, Thanks for the response. I am referring to a presentation by Ryan Blue. http://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide Specifically, on page 27, clearly, you can generate min/max for BINARY. I don’t know, for sure, if this is generated by Spark or Pig. In

Re: TDD in Spark

2017-01-17 Thread Lars Albertsson
My advice, short version: * Start by testing one job per test. * Use Scalatest or a standard framework. * Generate input datasets with Spark routines, write to local file. * Run job with local master. * Read output with Spark routines, validate only the fields you care about for the test case at
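A small ScalaTest sketch in that spirit; WordCountJob and its run() signature are hypothetical, everything else uses standard Spark/ScalaTest APIs:

    import org.apache.spark.sql.SparkSession
    import org.scalatest.FunSuite

    class WordCountJobSpec extends FunSuite {
      test("counts words in the input file") {
        val spark = SparkSession.builder().master("local[*]").appName("test").getOrCreate()
        import spark.implicits._

        val in  = java.nio.file.Files.createTempDirectory("wc-in").toString + "/words"
        val out = java.nio.file.Files.createTempDirectory("wc-out").toString + "/counts"

        Seq("a b a").toDF("line").write.text(in)        // generate input with Spark routines
        WordCountJob.run(spark, in, out)                // run the job with a local master
        val counts = spark.read.parquet(out).as[(String, Long)].collect().toMap
        assert(counts("a") == 2L)                       // validate only the fields we care about
        spark.stop()
      }
    }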

Re: filter rows by all columns

2017-01-17 Thread Xiaomeng Wan
Thank you Hyukjin, it works. This is what I ended up doing: df.filter(_.toSeq.zipWithIndex.forall(v => v._1.toString().toDouble - means(v._2) <= 3*staddevs(v._2))).show() Regards, Shawn On 16 January 2017 at 18:30, Hyukjin Kwon wrote: > Hi Shawn, > > Could we do this as
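For completeness, one way the means and staddevs arrays could be built (assumption: every column of df is numeric; this variant also uses a two-sided abs() bound rather than the one-sided check above):

    import org.apache.spark.sql.functions._

    val means   = df.select(df.columns.map(c => avg(col(c))): _*).head().toSeq.map(_.toString.toDouble)
    val stddevs = df.select(df.columns.map(c => stddev(col(c))): _*).head().toSeq.map(_.toString.toDouble)

    df.filter(_.toSeq.zipWithIndex.forall { case (v, i) =>
      math.abs(v.toString.toDouble - means(i)) <= 3 * stddevs(i)   // keep rows within 3 sigma on every column
    }).show()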

Spark / Elasticsearch Error: Maybe ES was overloaded? How to throttle down Spark as it writes to ES

2017-01-17 Thread Russell Jurney
How can I throttle Spark as it writes to Elasticsearch? I have already repartitioned down to one partition in an effort to slow the writes. ES indicates it is being overloaded, and I don't know how to slow things down. This is all on one r4.xlarge EC2 node that runs Spark with 25GB of RAM and ES

Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Deepak Sharma
On the SQLContext or HiveContext, you can register the function as a UDF as below: *hiveSqlContext.udf.register("func_name", func(_: String))* Thanks Deepak On Wed, Jan 18, 2017 at 8:45 AM, Sirisha Cheruvu wrote: > Hey > > Can you send me the source code of a hive java udf which

Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Sirisha Cheruvu
This error org.apache.spark.sql.AnalysisException: No handler for Hive udf class com.nexr.platform.hive.udf.GenericUDFNVL2 because: com.nexr.platform.hive.udf.GenericUDFNVL2.; line 1 pos 26 at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$ lookupFunction$2.apply(hiveUDFs.scala:105) at

Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Deepak Sharma
Did you try this with spark-shell? Please try this. $ spark-shell --jars /home/cloudera/Downloads/genudnvl2.jar On the spark shell: val hc = new org.apache.spark.sql.hive.HiveContext(sc); hc.sql("create temporary function nexr_nvl2 as 'com.nexr.platform.hive.udf.GenericUDFNVL2'");

Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Sirisha Cheruvu
Brilliant... it worked. I should pass --jars instead of an ADD JAR statement. On Jan 18, 2017 9:02 AM, "Deepak Sharma" wrote: > Did you try this with spark-shell? > Please try this. > $ spark-shell --jars /home/cloudera/Downloads/genudnvl2.jar > > On the spark shell: >

Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Sirisha Cheruvu
Thank you, Deepak. On Jan 18, 2017 9:02 AM, "Deepak Sharma" wrote: > Did you try this with spark-shell? > Please try this. > $ spark-shell --jars /home/cloudera/Downloads/genudnvl2.jar > > On the spark shell: > val hc = new org.apache.spark.sql.hive.HiveContext(sc); >

How to access metrics for Structured Streaming 2.1

2017-01-17 Thread Heji Kim
Hello. We are trying to migrate and performance test the kafka sink for structured streaming in 2.1. Obviously we miss the beautiful Streaming Statistics ui tab and we are trying to figure out the most reasonable way to monitor event processing rates and lag time. 1. Are the SourceStatus and

Re: Debugging a PythonException with no details

2017-01-17 Thread Nicholas Chammas
Hey Marco, I stopped seeing this error once I started round-tripping intermediate DataFrames to disk. You can read more about what I saw here: https://github.com/graphframes/graphframes/issues/159 Nick On Sat, Jan 14, 2017 at 4:02 PM Marco Mistroni wrote: > It seems it
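A sketch of that round-tripping pattern (the path is illustrative): write the intermediate DataFrame out, then read it back so the downstream plan starts from the files rather than the long lineage:

    intermediateDF.write.mode("overwrite").parquet("/tmp/checkpoints/stage1")
    val stage1 = spark.read.parquet("/tmp/checkpoints/stage1")
    // continue the pipeline from `stage1` instead of `intermediateDF`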

Re: How to access metrics for Structured Streaming 2.1

2017-01-17 Thread Shixiong(Ryan) Zhu
You can use the monitoring APIs of Structured Streaming to get metrics. See http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries On Tue, Jan 17, 2017 at 5:01 PM, Heji Kim wrote: > Hello. We are trying to
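A short sketch of the 2.1 monitoring hooks referenced above, using StreamingQueryListener to log per-trigger rates (the println is just a placeholder for a real metrics sink):

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = {}
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        println(s"inputRowsPerSecond=${p.inputRowsPerSecond} processedRowsPerSecond=${p.processedRowsPerSecond}")
      }
    })
    // A running query can also be polled directly via query.lastProgress / query.recentProgress.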

how to dynamic partition dataframe

2017-01-17 Thread lk_spark
Hi all: I want to partition data by reading a config file that tells me how to partition the current input data. DataFrameWriter has a method declared as: partitionBy(colNames: String*): DataFrameWriter[T]. Why can't I pass the parameter as a Seq[String] or Array[String]? 2017-01-18
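partitionBy takes a Scala varargs parameter (String*), so a Seq or Array read from a config file can be expanded with `: _*`; a small sketch (path and column names illustrative):

    val partitionCols: Seq[String] = Seq("year", "month")   // e.g. read from the config file
    df.write.partitionBy(partitionCols: _*).parquet("/data/output")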

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Raju Bairishetti
Thanks Yong for the response. Adding my responses inline On Tue, Jan 17, 2017 at 10:27 PM, Yong Zhang wrote: > What DB you are using for your Hive meta store, and what types are your > partition columns? > I am using MySql for Hive metastore. Partition columns are

Please help me out!!!! Getting error while trying to run a Hive Java generic UDF in Spark

2017-01-17 Thread Sirisha Cheruvu
Hi everyone, getting the below error while running a Hive Java UDF from the SQL context: org.apache.spark.sql.AnalysisException: No handler for Hive udf class com.nexr.platform.hive.udf.GenericUDFNVL2 because: com.nexr.platform.hive.udf.GenericUDFNVL2.; line 1 pos 26 at

Anyone from Bangalore want to work on Spark projects along with me?

2017-01-17 Thread Sirisha Cheruvu
Hi, just thought of sharing my intention of working together with Spark developers who are also from Bangalore, so that we can brainstorm together and work out solutions on our projects. What say? Expecting a reply.

Re: ScalaReflectionException (class not found) error for user class in spark 2.1.0

2017-01-17 Thread Koert Kuipers
and to be clear, this is not in the REPL or with Hive (both well known situations in which these errors arise) On Mon, Jan 16, 2017 at 11:51 PM, Koert Kuipers wrote: > i am experiencing a ScalaReflectionException exception when doing an > aggregation on a spark-sql DataFrame.

need a hive generic udf which also works on spark sql

2017-01-17 Thread Sirisha Cheruvu
Hi, has anybody tested and tried a generic UDF with an object inspector implementation which successfully ran on both Hive and Spark SQL? Please share the GitHub link or source code file. Thanks in advance, Sirisha
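A minimal GenericUDF sketch against the standard ObjectInspector API (an illustrative upper-case function, not the NVL2 class discussed elsewhere in the thread); once packaged into a jar it can be registered with CREATE TEMPORARY FUNCTION in both Hive and Spark SQL:

    import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

    class GenericUDFUpper extends GenericUDF {
      override def initialize(arguments: Array[ObjectInspector]): ObjectInspector = {
        if (arguments.length != 1)
          throw new UDFArgumentLengthException("upper_gen() takes exactly one argument")
        // Declare that evaluate() returns a Java String.
        PrimitiveObjectInspectorFactory.javaStringObjectInspector
      }

      override def evaluate(arguments: Array[DeferredObject]): AnyRef = {
        val v = arguments(0).get()
        if (v == null) null else v.toString.toUpperCase
      }

      override def getDisplayString(children: Array[String]): String =
        "upper_gen(" + children.mkString(", ") + ")"
    }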

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Raju Bairishetti
Had a high level look into the code. Seems getHiveQlPartitions method from HiveMetastoreCatalog is getting called irrespective of metastorePartitionPruning conf value. It should not fetch all partitions if we set metastorePartitionPruning to true (Default value for this is false) def
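For reference, a sketch of how that flag can be set when building the session (config key as named in the thread; how much pruning it actually enables depends on the Spark version):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .enableHiveSupport()
      .config("spark.sql.hive.metastorePartitionPruning", "true")
      .getOrCreate()
    // or at submit time: spark-submit --conf spark.sql.hive.metastorePartitionPruning=true ...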

Re: GroupBy and Spark Performance issue

2017-01-17 Thread Andy Dang
Repartition wouldn't save you from skewed data unfortunately. The way Spark works now is that it pulls data of the same key to one single partition, and Spark, AFAIK, retains the mapping from key to data in memory. You can use aggregateBykey() or combineByKey() or reduceByKey() to avoid this
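A small sketch of the map-side-combining alternatives mentioned above, assuming pairs is an RDD[(String, Int)]:

    // reduceByKey combines values per key on the map side, so the raw values of a hot key never
    // all sit in one task's memory at once.
    val counts = pairs.reduceByKey(_ + _)

    // aggregateByKey when the result type differs from the value type (here a running (sum, count) per key):
    val sumCounts = pairs.aggregateByKey((0L, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (a, b)   => (a._1 + b._1, a._2 + b._2))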

Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Takeshi Yamamuro
Hi, AFAIK, you can use Hive GenericUDF stuff in Spark without much effort. If you'd like to check test suites about that, have a look at HiveUDFSuite. https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala I also have

Spark/Parquet/Statistics question

2017-01-17 Thread djiang
Hi, I have been looking into how Spark stores statistics (min/max) in Parquet as well as how it uses the info for query optimization. I have got a few questions. First setup: Spark 2.1.0, the following sets up a Dataframe of 1000 rows, with a long type and a string type column. They are sorted
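Roughly the setup being described, as it might look in the spark-shell (path and column names are illustrative):

    import spark.implicits._

    // 1000 rows with a long column and a string column, sorted, written to Parquet.
    val df = spark.sparkContext.parallelize(1 to 1000)
      .map(i => (i.toLong, f"name_$i%04d"))
      .toDF("id", "name")
      .sort("id")
    df.write.mode("overwrite").parquet("/tmp/stats_test")
    // Per-row-group column min/max statistics can then be inspected with e.g. `parquet-tools meta`.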

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Yong Zhang
What DB are you using for your Hive metastore, and what types are your partition columns? You may want to read the discussion in SPARK-6910, and especially the comments in the PR. There are some limitations around partition pruning in Hive/Spark; maybe yours is one of them. Yong

Spark ANSI SQL Support

2017-01-17 Thread Rishabh Bhardwaj
Hi All, Does Spark 2.0 SQL support the full ANSI SQL query standard? Thanks, Rishabh.

Re: Spark ANSI SQL Support

2017-01-17 Thread Deepak Sharma
From the Spark documentation page: Spark SQL can now run all 99 TPC-DS queries. On Jan 18, 2017 9:39 AM, "Rishabh Bhardwaj" wrote: > Hi All, > > Does Spark 2.0 Sql support full ANSI SQL query standards? > > Thanks, > Rishabh. >

Re: Spark / Elasticsearch Error: Maybe ES was overloaded? How to throttle down Spark as it writes to ES

2017-01-17 Thread Koert Kuipers
In our experience you can't, really. There are some settings to make Spark wait longer before retrying when ES is overloaded, but I have never found them to be of much use. Check out these settings; maybe they are of some help: es.batch.size.bytes, es.batch.size.entries, es.http.timeout
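A sketch of how those elasticsearch-hadoop settings might be applied through the DataFrame writer (values are illustrative; the retry options are an assumption about additional es-hadoop knobs, not something from this thread):

    df.write
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200")
      .option("es.batch.size.entries", "100")      // smaller bulk requests
      .option("es.batch.size.bytes", "1mb")
      .option("es.batch.write.retry.count", "10")  // assumption: es-hadoop retry count setting
      .option("es.batch.write.retry.wait", "60s")  // assumption: wait between retries
      .option("es.http.timeout", "5m")
      .save("index/type")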