How to start two Workers connected to two different masters

2019-02-27 Thread onmstester onmstester
I have two Java applications sharing the same Spark cluster; the applications should be running on different servers. Based on my experience, if the Spark driver (inside the Java application) connects remotely to the Spark master (which is running on a different node), then the response time to submit a
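A minimal sketch of the setup described: a driver inside a Java application attaching to a remote standalone master. The host/port and app name are assumptions, not values from the thread:

    import org.apache.spark.sql.SparkSession;

    public class RemoteDriver {
        public static void main(String[] args) {
            // The driver runs here, on this server; only the master URL
            // points at the remote node (assumed spark://master-host:7077).
            SparkSession spark = SparkSession.builder()
                    .appName("remote-driver-app")
                    .master("spark://master-host:7077")
                    .getOrCreate();
            spark.stop();
        }
    }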

Fwd: Re: spark-sql force parallel union

2018-11-20 Thread onmstester onmstester
You can use "rollup" or "grouping sets" for multiple dimensions to replace "union" or "union all". On Tue, Nov 20, 2018 at 8:34 PM onmstester onmstester wrote: I'm using Spark-Sql to query Cassandra tables. In Cassandra, I've partitioned
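A hedged sketch of the suggestion: one GROUPING SETS query computes several aggregation levels in a single scan, instead of running separate GROUP BY queries and unioning them. The table and column names (events, bucket, id, value) are assumptions:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class GroupingSetsExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("grouping-sets").getOrCreate();
            spark.read().json("events.json").createOrReplaceTempView("events");
            // One scan computes both aggregation levels, instead of
            // running two GROUP BY queries and unioning their results.
            Dataset<Row> agg = spark.sql(
                "SELECT bucket, id, sum(value) AS total FROM events " +
                "GROUP BY bucket, id GROUPING SETS ((bucket), (bucket, id))");
            agg.show();
            spark.stop();
        }
    }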

spark-sql force parallel union

2018-11-20 Thread onmstester onmstester
I'm using Spark-Sql to query Cassandra tables. In Cassandra, I've partitioned my data by time bucket and an id, so based on the queries I need to union multiple partitions with spark-sql and do the aggregations/group-by on the union result, something like this (see the sketch below): for(all cassandra partitions){ DataSet
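A sketch of that loop in the Java API; mytable, the bucket column, and the bucket list are assumptions standing in for the actual Cassandra partitions:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class UnionPartitions {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("union-partitions").getOrCreate();
            List<String> buckets = Arrays.asList("2018-11-19", "2018-11-20");
            Dataset<Row> result = null;
            for (String bucket : buckets) {
                // One Dataset per Cassandra partition (time bucket + id).
                Dataset<Row> part = spark.sql(
                    "SELECT * FROM mytable WHERE bucket = '" + bucket + "'");
                result = (result == null) ? part : result.union(part);
            }
            // union() is lazy; the group-by below executes over all
            // unioned partitions in one job.
            result.groupBy("id").count().show();
            spark.stop();
        }
    }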

Fwd: How to avoid long-running jobs blocking short-running jobs

2018-11-03 Thread onmstester onmstester
You could use two separate pools with different weights for ETL and REST jobs: with the ETL pool weight at 1 and the REST pool weight at 1000, any time a REST job comes in it is allocated nearly all the resources. Details: https://spark.apache.org/docs/latest/job-scheduling.html
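A sketch of wiring this up; the pool names, weights, and file path are assumptions following the advice above:

    import org.apache.spark.sql.SparkSession;

    public class FairPools {
        public static void main(String[] args) {
            // Assumed fairscheduler.xml on the driver:
            //   <allocations>
            //     <pool name="etl"><weight>1</weight></pool>
            //     <pool name="rest"><weight>1000</weight></pool>
            //   </allocations>
            SparkSession spark = SparkSession.builder()
                    .appName("shared-cluster-app")
                    .config("spark.scheduler.mode", "FAIR")
                    .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
                    .getOrCreate();
            // Each thread tags its jobs with a pool before submitting them:
            spark.sparkContext().setLocalProperty("spark.scheduler.pool", "rest");
        }
    }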

Fwd: use spark cluster in java web service

2018-11-01 Thread onmstester onmstester
Refer to: https://spark.apache.org/docs/latest/quick-start.html 1. Create a singleton SparkContext at initialization of your cluster; the spark-context or spark-sql would then be accessible through a static method anywhere in your application. I recommend using fair scheduling on your context, to share
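A sketch of such a singleton holder (class name, master URL, and app name are assumptions): one SparkSession created at startup and shared by all request threads.

    import org.apache.spark.sql.SparkSession;

    public final class SparkHolder {
        private static volatile SparkSession session;

        // Accessible from anywhere in the web application.
        public static SparkSession get() {
            if (session == null) {
                synchronized (SparkHolder.class) {
                    if (session == null) {
                        session = SparkSession.builder()
                                .appName("web-service")
                                .master("spark://master-host:7077") // assumption
                                .config("spark.scheduler.mode", "FAIR")
                                .getOrCreate();
                    }
                }
            }
            return session;
        }

        private SparkHolder() {}
    }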

Fwd: Having access to spark results

2018-10-25 Thread onmstester onmstester
What about using cache() or saving as a global temp table for subsequent access? Forwarded message From: Affan Syed To: "spark users" Date: Thu, 25 Oct 2018 10:58:43 +0330 Subject: Having access to spark results
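A short sketch of both options; the stand-in Dataset below replaces whatever result the earlier job actually produced:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class KeepResults {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("keep-results").getOrCreate();
            Dataset<Row> results = spark.range(100).toDF("id"); // stand-in for the real result
            results.cache();                                    // keep blocks in executor memory
            results.createOrReplaceGlobalTempView("results");   // global temp table
            // Global temp views live in the reserved global_temp database and
            // remain visible to other SparkSessions until the application stops:
            spark.sql("SELECT * FROM global_temp.results").show();
        }
    }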

Re: Spark In Memory Shuffle

2018-10-18 Thread onmstester onmstester
...the steps to configure this? Thanks. On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester wrote: Hi, I failed to configure Spark for in-memory shuffle, so currently I'm just using a Linux memory-mapped directory (tmpfs) as Spark's working directory, so everything is fast.

Re: Spark In Memory Shuffle

2018-10-17 Thread onmstester onmstester
Hi, I failed to configure Spark for in-memory shuffle, so currently I'm just using a Linux memory-mapped directory (tmpfs) as Spark's working directory, and everything is fast (see the config sketch below). On Wed, 17 Oct 2018 16:41:32 +0330 thomas lavocat wrote: Hi everyone, The possibility to have in
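A sketch of the tmpfs approach, assuming a mount at /mnt/spark-tmpfs and a 64g size; note that on a standalone cluster, a worker's SPARK_LOCAL_DIRS takes precedence over spark.local.dir:

    # Mount a RAM-backed filesystem (mount point and size are assumptions):
    #   mount -t tmpfs -o size=64g tmpfs /mnt/spark-tmpfs
    # spark-defaults.conf: put shuffle/spill scratch files on it
    spark.local.dir  /mnt/spark-tmpfs
    # or, in conf/spark-env.sh on each worker:
    #   SPARK_LOCAL_DIRS=/mnt/spark-tmpfs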

createOrReplaceTempView causes memory leak

2018-06-21 Thread onmstester onmstester
I'm loading some JSON files in a loop, deserializing them into a list of objects, creating a temp table from the list, and running a select on the table (repeating this for every file): for(jsonFile : allJsonFiles){ sqlcontext.sql("select * from mainTable").filter(...).createOrReplaceTempView("table1");
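A sketch of the loop with explicit cleanup between iterations, which is one common mitigation; dropTempView and unpersist are real API calls, but whether they fully resolve the reported leak is not confirmed in the thread. The file list is an assumption:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TempViewLoop {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("tempview-loop").getOrCreate();
            List<String> allJsonFiles = Arrays.asList("a.json", "b.json");
            for (String jsonFile : allJsonFiles) {
                Dataset<Row> df = spark.read().json(jsonFile);
                df.createOrReplaceTempView("table1");
                spark.sql("SELECT * FROM table1").show();
                spark.catalog().dropTempView("table1"); // release the catalog reference
                df.unpersist();                         // release any cached blocks
            }
            spark.stop();
        }
    }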

How to set spark.driver.memory?

2018-06-19 Thread onmstester onmstester
I have a Spark cluster containing 3 nodes and my application is a jar file run with java -jar . How can I set driver.memory for my application? spark-defaults.conf would only be read by ./spark-submit; "java --driver-memory -jar " fails with an exception.
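When the driver is started directly with java -jar (client mode, no spark-submit), the driver JVM is your own java process, so its heap is set with the standard JVM flag; spark.driver.memory only takes effect when spark-submit launches the JVM. A sketch, with the jar name and heap size as assumptions:

    # Size the driver JVM heap directly:
    java -Xmx4g -jar myapp.jar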

enable jmx in standalone mode

2018-06-19 Thread onmstester onmstester
How do I enable JMX for the Spark worker/executor/driver in standalone mode? I have added these: spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \ -Dcom.sun.management.jmxremote.port=9178 \ -Dcom.sun.management.jmxremote.authenticate=false \ -Dcom.sun.management.jmxremote.ssl=false
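A hedged spark-defaults.conf sketch: the driver keeps the port from the thread, while executors use port 0 so several executor JVMs on one host don't collide on a fixed JMX port:

    spark.driver.extraJavaOptions    -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9178 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
    spark.executor.extraJavaOptions  -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=0 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
    # Worker JVMs are configured separately, e.g. via SPARK_WORKER_OPTS in spark-env.sh.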

spark optimized pagination

2018-06-09 Thread onmstester onmstester
Hi, I'm using Spark on top of Cassandra as the backend CRUD store of a RESTful application. Most of the REST APIs retrieve a huge amount of data from Cassandra and do a lot of aggregation on it in Spark, which takes some seconds. Problem: sometimes the output result would be a big list which makes
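One way to serve pages without recomputing the aggregation each request: number the aggregated rows once, cache them, and slice per page. A sketch; the aggregated Dataset, the ordering column, and the page size are assumptions:

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.expressions.Window;

    // aggregated: the Dataset produced by the Spark aggregation.
    // Note: a window without partitionBy funnels rows through one partition,
    // tolerable here only because the aggregated result is already reduced.
    Dataset<Row> numbered = aggregated.withColumn(
            "rn", row_number().over(Window.orderBy(col("id"))));
    numbered.cache(); // compute once, serve many page requests

    int page = 3, size = 100;
    Dataset<Row> pageRows = numbered.where(
            col("rn").gt(page * size).and(col("rn").leq((page + 1) * size)));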

spark sql in-clause problem

2018-05-22 Thread onmstester onmstester
I'm reading from this table in Cassandra: Table mytable ( Integer Key, Integer X, Integer Y ) Using: sparkSqlContext.sql("select * from mytable where key = 1 and (X,Y) in ((1,2),(3,4))") Encountered error: StructType(StructField(X,IntegerType,true), StructField(Y,IntegerType,true))
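A workaround sketch while the tuple IN form is rejected: expand it into explicit OR conditions. Names follow the thread; sparkSqlContext is assumed to be the session/context used above:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> rows = sparkSqlContext.sql(
        "SELECT * FROM mytable WHERE key = 1 AND " +
        "((X = 1 AND Y = 2) OR (X = 3 AND Y = 4))");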

Scala's Seq:_* equivalent in Java

2018-05-15 Thread onmstester onmstester
I could not find how to pass a list to the isin() filter in Java; something like this can be done in Scala: val ids = Array(1,2) df.filter(df("id").isin(ids:_*)).show But in Java, everything that converts a Java List to a Scala Seq fails with an unsupported literal type exception:
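In the Java API, Column.isin takes Object varargs, so no Scala Seq conversion is needed: pass the values directly, or a collection's toArray(). A sketch, with df assumed to have an id column:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    List<Integer> ids = Arrays.asList(1, 2);
    // isin(Object... list): a list's toArray() matches the varargs directly.
    Dataset<Row> filtered = df.filter(df.col("id").isin(ids.toArray()));
    filtered.show();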

spark sql StackOverflow

2018-05-15 Thread onmstester onmstester
Hi, I need to run some queries on a huge amount of input records. The input rate is 100K records/second. A record is like (key1,key2,value) and the application should report occurrences of key1 == something && key2 == somethingElse. The problem is there are too many filters in my query: more than
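Rather than one filter per (key1, key2) combination, a single group-by can count every combination in one pass and the pairs of interest can be looked up afterwards. A sketch, with records standing in for the input Dataset:

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // records: assumed Dataset with columns key1, key2, value.
    Dataset<Row> counts = records.groupBy(col("key1"), col("key2")).count();
    // Look up the combinations of interest from the counted result:
    counts.where(col("key1").equalTo("something")
            .and(col("key2").equalTo("somethingElse"))).show();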