Rishabh,
Have you come across any ANSI SQL queries that Spark SQL didn't support?
I'd be interested to hear if you have.
Andrew
On Tue, Jan 17, 2017 at 8:14 PM, Deepak Sharma wrote:
> From spark documentation page:
> Spark SQL can now run all 99 TPC-DS queries.
>
> On
Hi List,
I've been following several projects with quite some interest over the past
few years, and I've continued to wonder why they're not moving towards being
supported by mainstream Spark distributions and being mentioned more
frequently when it comes to enterprise adoption of Spark.
Hi,
Lexicographically speaking, min/max should work because strings support a
comparison operator. So anything which supports an ordering (<, >, <=, >=,
==, …) can support min and max functions as well.
I guess the question is whether Spark supports this, and if not, why?
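For what it's worth, a quick spark-shell check (a minimal sketch; the column name is made up) suggests min/max do work on string columns, using lexicographic ordering:

import org.apache.spark.sql.functions.{min, max}

// Hypothetical one-column DataFrame of strings.
val df = Seq("apple", "banana", "cherry").toDF("name")

// min should come back as "apple", max as "cherry" (lexicographic order).
df.agg(min("name"), max("name")).show()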
Yes,
Hi,
While the Parquet file is immutable and the data sets are immutable, how does
Spark SQL handle updates or deletes?
I mean, if I read in a file using SQL into an RDD, mutate it, e.g. delete a row,
and then persist it, I now have two files. If I reread the table back in … will
I see duplicates?
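A minimal sketch of that read-modify-persist flow (paths and column names are hypothetical): writing the filtered result to a new location and pointing later reads at the new path is one way to avoid seeing both old and new files.

import org.apache.spark.sql.functions.col

// Illustrative only; /data/events and the id column are made up.
val df = spark.read.parquet("/data/events")                  // original, immutable files
val updated = df.filter(col("id") =!= 42)                    // "delete" a row by filtering it out
updated.write.mode("overwrite").parquet("/data/events_v2")   // persist as a new data set
// Re-reading /data/events still shows the old rows; read /data/events_v2 instead.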
My Spark main thread creates some daemon threads. Then the Spark application
throws some exceptions and the main thread quits, but the driver JVM doesn't
exit. What can I do?
for example:
val sparkConf = new SparkConf().setAppName("NetworkWordCount")
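One way to handle this (a rough sketch, not specific to your job): catch the exception in main, stop the context, and exit explicitly, since Spark's own non-daemon threads can otherwise keep the driver JVM alive after main returns.

import org.apache.spark.{SparkConf, SparkContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val sc = new SparkContext(sparkConf)
    try {
      // ... application logic that may throw ...
    } catch {
      case e: Exception =>
        e.printStackTrace()
        sc.stop()
        sys.exit(1)  // force the JVM down even if other non-daemon threads are still alive
    }
  }
}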
Hello,
I am not sure what you mean by min/max for strings. I do not know if this makes
sense. What the ORC format has is bloom filters for strings etc. - are you
referring to this?
In order to apply min/max filters Spark needs to read the metadata of the
file. If the filter is applied or
Hi all.
I have the following schema that I need to filter with spark sql
|-- msg_id: string (nullable = true)
|-- country: string (nullable = true)
|-- sent_ts: timestamp (nullable = true)
|-- marks: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- namespace:
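Not knowing the rest of the struct, a sketch of one common approach is to explode the array and filter on the struct fields (only the fields shown above are used; the "US" value is made up):

import org.apache.spark.sql.functions.{col, explode}

// Hypothetical filter: explode marks, then filter on the nested namespace field.
val exploded = df.withColumn("mark", explode(col("marks")))
exploded
  .filter(col("country") === "US" && col("mark.namespace").isNotNull)
  .select("msg_id", "sent_ts", "mark.namespace")
  .show()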
My Spark main thread creates some daemon threads, which may be timer threads.
Then the Spark application throws some exceptions and the main thread quits, but
the driver JVM doesn't crash on a standalone cluster. Of course the problem
doesn't happen on a YARN cluster, because the application
You run compaction, i.e. save the modified/deleted records in a dedicated file.
Every now and then you compare the original and delta file and create a new
version. When querying before compaction you need to check both the original and
the delta file. I don't think ORC needs Tez for it, but it
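A rough sketch of what that compaction step can look like with DataFrames (the key and version columns are made up; the idea is just to keep the latest record per key across the original and delta files):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical layout: both files share a key and a monotonically increasing version.
val original = spark.read.parquet("/data/base")
val delta    = spark.read.parquet("/data/delta")

val w = Window.partitionBy("key").orderBy(col("version").desc)
val compacted = original.union(delta)
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)   // keep only the newest version of each key
  .drop("rn")

compacted.write.mode("overwrite").parquet("/data/base_v2")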
Does anyone have an answer?
How does state distribution happen among multiple nodes?
I have seen that in a "mapWithState" based workflow the streaming job simply
hangs when the node containing all states dies because of OOM.
..Manas
To give a bit of background to my post: I'm currently looking into
evaluating whether it could be possible to deploy a Spark-powered
data-warehouse-like structure.
In particular the client is interested in evaluating Spark's in-memory
caches as a "transient persistent layer". Although in theory
On Tue, Jan 17, 2017 at 4:49 PM Rick Moritz wrote:
> * Oryx2 - This was more focused on a particular issue, and looked to be a
> very nice framework for deploying real-time analytics --- but again, no
> real traction. In fact, I've heard of PoCs being done by/for Cloudera, to
>
Hi,
Thanks for the response.
I am referring to a presentation by Ryan Blue.
http://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide
Specifically, on page 27 you can clearly see that min/max statistics are
generated for BINARY. I don't know for sure if this is generated by Spark or Pig.
In
My advice, short version (a rough test sketch follows the list):
* Start by testing one job per test.
* Use Scalatest or a standard framework.
* Generate input datasets with Spark routines, write to local file.
* Run job with local master.
* Read output with Spark routines, validate only the fields you care
about for the test case at
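A minimal ScalaTest-style sketch of that shape (all names here are made up; it assumes a word-count-style job exposed as a WordCountJob.run(spark, inputPath) function returning key/count pairs):

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class WordCountJobSuite extends FunSuite {
  test("counts words in a small generated input") {
    val spark = SparkSession.builder().master("local[2]").appName("wc-test").getOrCreate()
    import spark.implicits._

    // Generate the input with Spark itself and write it to a local temp directory.
    val in = java.nio.file.Files.createTempDirectory("wc-in").toString
    Seq("a b", "a").toDF("value").write.mode("overwrite").text(in)

    // Run the (hypothetical) job under test against the local master.
    val result = WordCountJob.run(spark, in).collect().toMap

    // Validate only the fields this test case cares about.
    assert(result("a") == 2L)
    assert(result("b") == 1L)
    spark.stop()
  }
}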
Thank you Hyukjin,
It works. This is what I ended up doing:
df.filter(_.toSeq.zipWithIndex.forall(v => v._1.toString().toDouble -
means(v._2) <= 3*staddevs(v._2))).show()
Regards,
Shawn
On 16 January 2017 at 18:30, Hyukjin Kwon wrote:
> Hi Shawn,
>
> Could we do this as
How can I throttle Spark as it writes to Elasticsearch? I have already
repartitioned down to one partition in an effort to slow the writes. ES
indicates it is being overloaded, and I don't know how to slow things down.
This is all on one r4.xlarge EC2 node that runs Spark with 25GB of RAM and
ES
On the SQLContext or HiveContext, you can register the function as a UDF as
below:
hiveSqlContext.udf.register("func_name", func(_: String))
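Once registered, it can be called by that name from SQL (the table and column here are hypothetical):

hiveSqlContext.sql("SELECT func_name(some_column) FROM some_table").show()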
Thanks
Deepak
On Wed, Jan 18, 2017 at 8:45 AM, Sirisha Cheruvu wrote:
> Hey
>
> Can yu send me the source code of hive java udf which
This error
org.apache.spark.sql.AnalysisException: No handler for Hive udf class
com.nexr.platform.hive.udf.GenericUDFNVL2 because:
com.nexr.platform.hive.udf.GenericUDFNVL2.; line 1 pos 26
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:105)
at
Did you try this with spark-shell?
Please try this.
$spark-shell --jars /home/cloudera/Downloads/genudnvl2.jar
On the spark shell:
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("create temporary function nexr_nvl2 as 'com.nexr.platform.hive.udf.GenericUDFNVL2'")
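Then, assuming the usual NVL2 semantics (return the second argument when the first is not null, otherwise the third), it can be invoked like any built-in function; the table and columns below are made up:

hc.sql("select nexr_nvl2(col_a, 'not null', 'was null') from some_table").show()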
Brilliant... it worked.
I should give --jars
instead of the add jar statement.
On Jan 18, 2017 9:02 AM, "Deepak Sharma" wrote:
> Did you tried this with spark-shell?
> Please try this.
> $spark-shell --jars /home/cloudera/Downloads/genudnvl2.jar
>
> On the spark shell:
>
Thank You Deepak
On Jan 18, 2017 9:02 AM, "Deepak Sharma" wrote:
> Did you tried this with spark-shell?
> Please try this.
> $spark-shell --jars /home/cloudera/Downloads/genudnvl2.jar
>
> On the spark shell:
> val hc = new org.apache.spark.sql.hive.HiveContext(sc) ;
>
Hello. We are trying to migrate and performance-test the Kafka sink for
Structured Streaming in 2.1. Obviously we miss the beautiful Streaming
Statistics UI tab and we are trying to figure out the most reasonable way
to monitor event processing rates and lag time.
1. Are the SourceStatus and
Hey Marco,
I stopped seeing this error once I started round-tripping intermediate
DataFrames to disk.
You can read more about what I saw here:
https://github.com/graphframes/graphframes/issues/159
Nick
On Sat, Jan 14, 2017 at 4:02 PM Marco Mistroni wrote:
> It seems it
You can use the monitoring APIs of Structured Streaming to get metrics. See
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
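For example, lastProgress on a running query and a StreamingQueryListener both expose processing rates (a minimal sketch; query stands for whatever StreamingQuery you started):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Poll the latest progress of a running query (includes inputRowsPerSecond,
// processedRowsPerSecond and per-stage durations):
println(query.lastProgress)

// Or register a listener to push metrics into your own monitoring system:
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    println(s"rows/sec processed: ${event.progress.processedRowsPerSecond}")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
})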
On Tue, Jan 17, 2017 at 5:01 PM, Heji Kim wrote:
> Hello. We are trying to
Hi all,
I want to partition data by reading a config file that tells me how to
partition the current input data.
DataFrameWriter has a method: partitionBy(colNames: String*):
DataFrameWriter[T]
Why can't I pass a parameter of type Seq[String] or Array[String]?
2017-01-18
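It is declared as varargs, so a Seq or Array can be passed by expanding it with : _* (a small sketch; column names and path are made up):

val partitionCols: Seq[String] = Seq("country", "dt")   // e.g. read from your config file
df.write
  .partitionBy(partitionCols: _*)   // expand the Seq into the String* varargs parameter
  .parquet("/output/path")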
Thanks Yong for the response. Adding my responses inline
On Tue, Jan 17, 2017 at 10:27 PM, Yong Zhang wrote:
> What DB you are using for your Hive meta store, and what types are your
> partition columns?
>
I am using MySQL for the Hive metastore. Partition columns are
Hi everyone,
I am getting the below error while running a Hive Java UDF from the SQL context:
org.apache.spark.sql.AnalysisException: No handler for Hive udf class
com.nexr.platform.hive.udf.GenericUDFNVL2 because:
com.nexr.platform.hive.udf.GenericUDFNVL2.; line 1 pos 26
at
Hi,
Just thought of sharing my intention of working together with Spark
developers who are also from Bangalore, so that we can brainstorm
together and work out solutions on our projects.
What say?
Expecting a reply.
And to be clear, this is not in the REPL or with Hive (both well-known
situations in which these errors arise).
On Mon, Jan 16, 2017 at 11:51 PM, Koert Kuipers wrote:
> i am experiencing a ScalaReflectionException exception when doing an
> aggregation on a spark-sql DataFrame.
Hi
Has anybody tested a generic UDF with an ObjectInspector implementation
which successfully ran on both Hive and Spark SQL?
Please share the GitHub link or source code file with me.
Thanks in advance
Sirisha
Had a high-level look into the code. It seems the getHiveQlPartitions method from
HiveMetastoreCatalog is getting called irrespective of the metastorePartitionPruning
conf value.
It should not fetch all partitions if we set metastorePartitionPruning to
true (the default value for this is false).
def
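For reference, the flag in question is set like this (a sketch; whether pruning actually kicks in depends on the code path discussed above):

// HiveContext-era style, matching the code being discussed:
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")

// or via spark-submit / spark-defaults:
// --conf spark.sql.hive.metastorePartitionPruning=true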
Repartition wouldn't save you from skewed data, unfortunately. The way Spark
works now is that it pulls data of the same key to one single partition,
and Spark, AFAIK, retains the mapping from key to data in memory.
You can use aggregateByKey() or combineByKey() or reduceByKey() to avoid
this
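For example, reduceByKey combines values map-side before shuffling, so all values of a hot key don't have to be materialized on one executor the way they do with groupByKey (a small sketch; pairs stands for some RDD[(String, Int)]):

// Per-key sums with a map-side combine, then a shuffle of partial sums only.
val counts = pairs.reduceByKey(_ + _)

// Equivalent shape with aggregateByKey, useful when the result type differs from the value type.
val sums = pairs.aggregateByKey(0L)(_ + _, _ + _)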
Hi,
AFAIK, you can use Hive GenericUDF stuff in Spark without much effort.
If you'd like to check test suites about that, you might want to visit
HiveUDFSuite:
https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
I also have
Hi,
I have been looking into how Spark stores statistics (min/max) in Parquet as
well as how it uses that info for query optimization.
I have a few questions.
First, the setup: Spark 2.1.0; the following sets up a DataFrame of 1000 rows,
with a long-type and a string-type column.
They are sorted
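A rough reconstruction of the setup being described (exact column names are guesses):

import spark.implicits._

// 1000 sorted rows: one long column and one string column, written out as Parquet.
val df = (0L until 1000L)
  .map(i => (i, f"name_$i%04d"))
  .toDF("id", "name")

df.write.mode("overwrite").parquet("/tmp/stats_test")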
What DB are you using for your Hive metastore, and what types are your
partition columns?
You may want to read the discussion in SPARK-6910, and especially the
comments in the PR. There are some limitations around partition pruning in
Hive/Spark; maybe yours is one of them.
Yong
Hi All,
Does Spark 2.0 SQL support the full ANSI SQL query standard?
Thanks,
Rishabh.
From the Spark documentation page:
Spark SQL can now run all 99 TPC-DS queries.
On Jan 18, 2017 9:39 AM, "Rishabh Bhardwaj" wrote:
> Hi All,
>
> Does Spark 2.0 Sql support full ANSI SQL query standards?
>
> Thanks,
> Rishabh.
>
In our experience you can't, really.
There are some settings to make Spark wait longer before retrying when ES
is overloaded, but I have never found them of much use.
Check out these settings; maybe they are of some help (a sketch of applying them follows the list):
es.batch.size.bytes
es.batch.size.entries
es.http.timeout
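A sketch of where those settings go when writing with the elasticsearch-hadoop connector (the index name and the particular values are made up):

// Smaller batches and a longer timeout, to give an overloaded ES node more breathing room.
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.batch.size.entries", "500")
  .option("es.batch.size.bytes", "1mb")
  .option("es.http.timeout", "5m")
  .save("my-index/my-type")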