How to start two Workers connected to two different masters

2019-02-27 Thread onmstester onmstester
I have two Java applications sharing the same Spark cluster; the applications should run on different servers.

Based on my experience, if the Spark driver (inside the Java application) connects remotely to the Spark master (running on a different node), the time to submit a job increases significantly compared with the case where the application runs on the same node as the master.

So I need a Spark master beside each of my applications.

On the other hand, I don't want to waste resources by splitting the Spark cluster into two physically separate groups (two separate clusters), so I want to start two workers on each node of the cluster, each worker connecting to one of the masters and each able to access all of the cores. The problem is that Spark won't let me: first it said that a worker already exists, "stop it first". I set SPARK_WORKER_INSTANCES to 2, but on the first try it just creates two workers connected to one of the masters.
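For context, a minimal sketch (my illustration, not from the thread) of what the two-master setup looks like from the application side: each driver points at the master co-located with it, while the workers themselves are shared. Host names and ports are placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class AppOne {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("app-one")
                .setMaster("spark://app1-host:7077"); // master running beside application 1
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        // ... application 1 jobs ...
        spark.stop();
    }
}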






Fwd: Re: spark-sql force parallel union

2018-11-20 Thread onmstester onmstester
Thanks Kathleen,

1. So if I've got 4 DataFrames and I want "dfv1 union dfv2 union dfv3 union dfv4", would it first compute "dfv1 union dfv2" and "dfv3 union dfv4" independently and simultaneously, and then union their results?

2. There are going to be hundreds of partitions to union; wouldn't creating a temp view for each of them be slow?

---- Forwarded message ----
From: kathleen li
Date: Wed, 21 Nov 2018 10:16:21 +0330
Subject: Re: spark-sql force parallel union

You might first write code to construct a query statement with "union all", like below:

scala> val query = "select * from dfv1 union all select * from dfv2 union all select * from dfv3"
query: String = select * from dfv1 union all select * from dfv2 union all select * from dfv3

then write a loop to register each partition as a view, like below:

for (i <- 1 to 3) {
  df.createOrReplaceTempView("dfv" + i)
}

scala> spark.sql(query).explain
== Physical Plan ==
Union
:- LocalTableScan [_1#0, _2#1, _3#2]
:- LocalTableScan [_1#0, _2#1, _3#2]
+- LocalTableScan [_1#0, _2#1, _3#2]

You can use "rollup" or "grouping sets" for multiple dimensions to replace "union" or "union all".

On Tue, Nov 20, 2018 at 8:34 PM onmstester onmstester wrote:

I'm using Spark SQL to query Cassandra tables. In Cassandra, I've partitioned my data by a time bucket and an id, so depending on the query I need to union multiple partitions with Spark SQL and do the aggregations/group-by on the union result, something like this:

for (all cassandra partitions) {
    Dataset<Row> currentPartition = sqlContext.sql(...);
    unionResult = unionResult.union(currentPartition);
}

Increasing the input (the number of loaded partitions) increases response time more than linearly, because the unions are done sequentially. Since there is no harm in doing the unions in parallel, and I don't know how to force Spark to do them in parallel, right now I'm using a thread pool to asynchronously load all partitions in my application (which may cause OOM) and somehow do the sort or simple group-by in Java (which makes me wonder why I'm using Spark at all). The short question is: how do I force Spark SQL to load Cassandra partitions in parallel while doing a union on them? Also, I don't want too many tasks in Spark; with my home-made async solution I use coalesce(1), so each task is fast (only the wait time on Cassandra).
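A minimal Java sketch of the approach suggested in the reply above: register each partition as a temp view, then submit one "union all" statement so Spark plans a single Union over all partitions instead of chaining unions one by one. The loadPartition helper and view names are hypothetical.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UnionAllSketch {
    public static Dataset<Row> unionAll(SparkSession spark, int partitionCount) {
        List<String> selects = new ArrayList<>();
        for (int i = 1; i <= partitionCount; i++) {
            String view = "dfv" + i;
            loadPartition(spark, i).createOrReplaceTempView(view);
            selects.add("select * from " + view);
        }
        // One statement, one plan: "select * from dfv1 union all select * from dfv2 ..."
        return spark.sql(String.join(" union all ", selects));
    }

    // Hypothetical placeholder for however each Cassandra partition is loaded.
    private static Dataset<Row> loadPartition(SparkSession spark, int i) {
        return spark.sql("select * from mytable where bucket = " + i);
    }
}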

Fwd: How to avoid long-running jobs blocking short-running jobs

2018-11-03 Thread onmstester onmstester
You could use two separate fair-scheduler pools with different weights for the ETL jobs and the REST jobs: with an ETL pool weight of about 1 and a REST pool weight of 1000, any time a REST job comes in it is given nearly all of the resources. Details: https://spark.apache.org/docs/latest/job-scheduling.html
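A minimal sketch of that pool setup, assuming a fairscheduler.xml that defines an "etl" pool (weight 1) and a "rest" pool (weight 1000); the pool names and file path are illustrative.

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class FairPoolsSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("shared-app")
                .set("spark.scheduler.mode", "FAIR")
                // fairscheduler.xml defines <pool name="etl"> with weight 1
                // and <pool name="rest"> with weight 1000
                .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml");
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

        // On the thread that submits long-running ETL work:
        spark.sparkContext().setLocalProperty("spark.scheduler.pool", "etl");
        // ... run ETL jobs ...

        // On the thread that submits short analysis queries:
        spark.sparkContext().setLocalProperty("spark.scheduler.pool", "rest");
        // ... run short queries ...
    }
}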
---- Forwarded message ----
From: conner
Date: Sat, 03 Nov 2018 12:34:01 +0330
Subject: How to avoid long-running jobs blocking short-running jobs

Hi,

I use a Spark cluster to run ETL jobs and analysis computations on the data after the ETL stage. The ETL jobs can keep running for several hours, but an analysis computation is a short-running job that finishes in a few seconds. The dilemma I'm trapped in is that my application runs in a single JVM and can't be a cluster application, so there is just one Spark context in my application currently. But while the ETL jobs are running, they occupy all resources, including the worker executors, for so long that they block all my analysis computation jobs. My solution is to find a good way to divide the Spark cluster resources in two: one part for analysis computation jobs, another for ETL jobs. If the ETL part is free, I can allocate analysis computation jobs to it as well. So I want to find a middleware that supports two Spark contexts and can be embedded in my application. I did some research on the third-party project Spark Job Server. It can divide Spark resources by launching another JVM to run a Spark context with specific resources, and these operations are invisible to the upper layer, so it would be a good solution for me. But that project runs in a single JVM and only supports a REST API, and I can't endure transferring the data over TCP again, which is too slow for me. I want to get a result from the Spark cluster over TCP and give it to the view layer to display. Can anybody give me some good suggestions? I shall be so grateful.

Fwd: use spark cluster in java web service

2018-11-01 Thread onmstester onmstester
Refer to https://spark.apache.org/docs/latest/quick-start.html

1. Create a singleton SparkContext at the initialization of your cluster; the Spark context or Spark SQL context will then be accessible through a static method anywhere in your application. I recommend using fair scheduling on the context, to share resources among all input requests:

SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();

2. From then on, with the sc or spark-sql object, something like sparkSql.sql("select * from test").collectAsList() runs as a Spark job and returns the result to your application.
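A minimal sketch of the singleton described in point 1: one SparkSession created lazily and exposed through a static accessor, with FAIR scheduling enabled so concurrent web requests share the cluster. The class name and master URL are illustrative.

import org.apache.spark.sql.SparkSession;

public final class SparkHolder {
    private static volatile SparkSession session;

    private SparkHolder() {}

    public static SparkSession get() {
        if (session == null) {
            synchronized (SparkHolder.class) {
                if (session == null) {
                    session = SparkSession.builder()
                            .appName("web-service")
                            .master("spark://master-host:7077") // assumed master URL
                            .config("spark.scheduler.mode", "FAIR")
                            .getOrCreate();
                }
            }
        }
        return session;
    }
}

// Usage from any request handler:
// List<Row> rows = SparkHolder.get().sql("select * from test").collectAsList();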
---- Forwarded message ----
From: 崔苗 (Data & AI Product Development Dept.) <0049003...@znv.com>
To: "user"
Date: Thu, 01 Nov 2018 10:52:15 +0330
Subject: use spark cluster in java web service

Hi,

We want to use Spark in our Java web service and compute data in the Spark cluster according to each request. Now we have two problems:
1. How do we get a SparkSession for a remote Spark cluster (Spark-on-YARN mode)? We want to keep one SparkSession to execute all data computation.
2. How do we submit to the remote Spark cluster from Java code instead of via spark-submit, since we want to execute the Spark code in the response server?

Thanks for any replies.

Fwd: Having access to spark results

2018-10-25 Thread onmstester onmstester
What about using cache(), or saving the result as a global temp table, for subsequent access?
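A minimal sketch of that suggestion, assuming the query result is a Dataset<Row>: cache it and expose it as a global temp view so later requests can read it back without recomputing. The view name is illustrative.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ResultCacheSketch {
    public static void publish(Dataset<Row> results) {
        results.cache();
        // Global temp views live in the reserved "global_temp" database.
        results.createOrReplaceGlobalTempView("query_results");
    }

    public static Dataset<Row> read(SparkSession spark) {
        return spark.sql("select * from global_temp.query_results");
    }
}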
---- Forwarded message ----
From: Affan Syed
To: "spark users"
Date: Thu, 25 Oct 2018 10:58:43 +0330
Subject: Having access to spark results

Spark users,

We would really like some input on how the results of a Spark query can be made accessible to a web application. Given how widely Spark is used in industry, I would have thought this part would have lots of answers/tutorials, but I didn't find anything. Here are a few options that come to mind:

1) Spark results are saved in another DB (perhaps a traditional one) and a request for the query returns the new table name, for access through a paginated query. That seems doable, although a bit convoluted, as we need to handle the completion of the query.

2) Spark results are pumped into a messaging queue, from which a socket-server-like connection is made.

What confuses me is that other connectors to Spark, like those for Tableau using something like JDBC, should have all the data (not just the top 500 rows that we typically get via Livy or other REST interfaces to Spark). How do those connectors get all the data through a single connection? Can someone with expertise help bring clarity?

Thank you.

Affan

Re: Spark In Memory Shuffle

2018-10-18 Thread onmstester onmstester
Create the ramdisk:

mount tmpfs /mnt/spark -t tmpfs -o size=2G

then point spark.local.dir to the ramdisk. How to set it depends on your deployment strategy; for me it was through the SparkConf object before passing it to the SparkContext:

conf.set("spark.local.dir", "/mnt/spark");

To validate that Spark is actually using your ramdisk (by default it uses /tmp), ls the ramdisk after running some jobs and you should see Spark directories (with the date in the directory name) on your ramdisk.
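A compact sketch of that configuration in Java, assuming the /mnt/spark tmpfs mount above already exists on the nodes where executors run; the application name is illustrative.

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class RamdiskShuffleSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("ramdisk-shuffle")
                .set("spark.local.dir", "/mnt/spark"); // scratch/shuffle space on tmpfs
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        // ... run some jobs, then check /mnt/spark for spark-* directories ...
        spark.stop();
    }
}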
On Wed, 17 Oct 2018 18:57:14 +0330, ☼ R Nair wrote:

What are the steps to configure this? Thanks

On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester wrote:

Hi, I failed to configure Spark for in-memory shuffle, so currently I just use a Linux memory-mapped directory (tmpfs) as Spark's working directory, and everything is fast.

Re: Spark In Memory Shuffle

2018-10-17 Thread onmstester onmstester
Hi, I failed to configure Spark for in-memory shuffle, so currently I just use a Linux memory-mapped directory (tmpfs) as Spark's working directory, and everything is fast.

On Wed, 17 Oct 2018 16:41:32 +0330, thomas lavocat wrote:

Hi everyone,

The possibility of in-memory shuffling is discussed in this issue: https://github.com/apache/spark/pull/5403. That was in 2015. In 2016, the paper "Scaling Spark on HPC Systems" says that Spark still shuffles using disks. I would like to know: What is the current state of in-memory shuffling? Is it implemented in production? Does the current shuffle still use disks to work? Is it possible to somehow do it in RAM only?

Regards,
Thomas

createOrReplaceTempView causes memory leak

2018-06-21 Thread onmstester onmstester
I'm loading some JSON files in a loop, deserializing each into a list of objects, creating a temp table from the list, and running a select on the table (repeated for every file):

for (String jsonFile : allJsonFiles) {
    sqlContext.sql("select * from mainTable").filter(...).createOrReplaceTempView("table1");
    sqlContext.createDataFrame(serializedObjectList, MyObject.class).createOrReplaceTempView("table2");
    sqlContext.sql("select * from table1 join table2 ...").collectAsList();
}

After processing 30 JSON files my application crashes with OOM. If I add these two lines at the end of the for-loop:

sqlContext.dropTempTable("table1");
sqlContext.dropTempTable("table2");

then my app continues to process all 500 JSON files with no problem. I've used createOrReplaceTempView many times in my program; should I drop temp tables everywhere to free memory?

How to set spark.driver.memory?

2018-06-19 Thread onmstester onmstester
I have a Spark cluster containing 3 nodes, and my application is a jar file run with java -jar.

How can I set driver.memory for my application?

spark-defaults.conf is only read by ./spark-submit, and "java --driver-memory -jar" fails with an exception.
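For what it's worth, a hedged sketch of one way to handle this (my suggestion, not from the thread): a driver started with java -jar is just a JVM, so its memory is governed by -Xmx; alternatively, the job can be launched as a child process through Spark's SparkLauncher, which does honor spark.driver.memory. The jar path, main class, master URL and Spark home below are placeholders.

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LaunchWithDriverMemory {
    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
                .setSparkHome("/path/to/spark")
                .setAppResource("/path/to/my-spark-app.jar")
                .setMainClass("com.example.MySparkApp")
                .setMaster("spark://master-host:7077")
                .setConf(SparkLauncher.DRIVER_MEMORY, "4g")
                .startApplication();
        // handle.getState() can be polled, or a listener attached, to track the job.
    }
}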







enable jmx in standalone mode

2018-06-19 Thread onmstester onmstester
How do I enable JMX for the Spark worker/executor/driver in standalone mode?

I have added these lines to spark/conf/spark-defaults.conf:

spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9178 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false
spark.executor.extraJavaOptions=-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=0 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false

then ran stop-slave.sh followed by start-slave.sh.

But using netstat -anop | grep <executor-pid>, there is no port other than the Spark API port associated with the process.







spark optimized pagination

2018-06-09 Thread onmstester onmstester
Hi,

I'm using Spark on top of Cassandra as the backend for CRUD operations of a RESTful application.

Most of the REST APIs retrieve a huge amount of data from Cassandra and do a lot of aggregation on it in Spark, which takes a few seconds.

Problem: sometimes the output is a big list, which makes the client browser throw a "stop script" warning, so we should paginate the result on the server side. But it would be very annoying for the user to wait several seconds on every page for the Cassandra-plus-Spark processing.

Current dummy solution: for now I was thinking about assigning a UUID to each request, which would be sent back and forth between the server side and the client side. The first time a REST API is invoked, the result would be saved in a temp table, and subsequent similar requests (requests for the next pages) would fetch the result from the temp table (instead of the usual flow of retrieving from Cassandra and aggregating in Spark, which takes some time). When a memory limit is reached, the oldest results would be deleted.

Is there any built-in, clean caching strategy in Spark to handle such scenarios?
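A hedged sketch of the per-request caching idea described above; this is not a built-in Spark mechanism, and the class, view naming and eviction policy are illustrative.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RequestResultCache {
    private final SparkSession spark;

    public RequestResultCache(SparkSession spark) {
        this.spark = spark;
    }

    private static String viewName(String requestId) {
        // Replace UUID dashes; temp view names are parsed as SQL identifiers.
        return "result_" + requestId.replace("-", "_");
    }

    // First call for a request: compute the aggregation, cache it, register it.
    public Dataset<Row> computeAndCache(String requestId, String aggregationSql) {
        Dataset<Row> result = spark.sql(aggregationSql).cache();
        result.createOrReplaceTempView(viewName(requestId));
        return result;
    }

    // Subsequent page requests: read the cached result back.
    public Dataset<Row> fetch(String requestId) {
        return spark.table(viewName(requestId));
    }

    // Drop the cached result when the request is finished or evicted.
    public void evict(String requestId) {
        spark.table(viewName(requestId)).unpersist();
        spark.catalog().dropTempView(viewName(requestId));
    }
}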










spark sql in-clause problem

2018-05-22 Thread onmstester onmstester
I'm reading from this table in Cassandra:

Table mytable (
    Integer key,
    Integer X,
    Integer Y
)

using:

sparkSqlContext.sql("select * from mytable where key = 1 and (X, Y) in ((1,2),(3,4))")

and encountered this error:

StructType(StructField(X,IntegerType,true), StructField(Y,IntegerType,true))
!=
StructType(StructField(X,IntegerType,false), StructField(Y,IntegerType,false))
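One possible workaround (my suggestion, not from the thread): the mismatch is in the nullability of the struct fields that the multi-column IN builds, so the IN can be rewritten as a disjunction of per-pair conditions, which avoids the struct comparison entirely. A sketch, using the pairs from the question:

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

public class InClauseWorkaround {
    // Equivalent to: key = 1 and ((X = 1 and Y = 2) or (X = 3 and Y = 4))
    public static Dataset<Row> query(SparkSession spark) {
        int[][] pairs = {{1, 2}, {3, 4}};
        Column condition = lit(false);
        for (int[] p : pairs) {
            condition = condition.or(col("X").equalTo(p[0]).and(col("Y").equalTo(p[1])));
        }
        return spark.table("mytable").filter(col("key").equalTo(1).and(condition));
    }
}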












Scala's Seq:* equivalent in java

2018-05-15 Thread onmstester onmstester
I could not find how to pass a list to the isin() filter in Java. Something like this can be done in Scala:

val ids = Array(1, 2)
df.filter(df("id").isin(ids: _*)).show

But in Java, everything I tried for converting a Java list to a Scala Seq fails with an "unsupported literal type" exception:

JavaConversions.asScalaBuffer(list).toSeq()
JavaConverters.asScalaIteratorConverter(list.iterator()).asScala().toSeq().seq()
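For what it's worth, a sketch of one approach that avoids the Scala conversion entirely: in the Java API, Column.isin takes a varargs Object..., so the list's contents can be passed as an Object[]. The column name is taken from the example above.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class IsinSketch {
    public static Dataset<Row> filterByIds(Dataset<Row> df, List<Integer> ids) {
        // toArray() yields an Object[], which the varargs isin(Object...) accepts.
        return df.filter(df.col("id").isin(ids.toArray()));
    }

    // usage: filterByIds(df, Arrays.asList(1, 2)).show();
}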









spark sql StackOverflow

2018-05-15 Thread onmstester onmstester
Hi,

I need to run some queries on a huge number of input records. The input rate is 100K records/second.

A record is like (key1, key2, value), and the application should report occurrences of "key1 = something and key2 = somethingElse".

The problem is that there are too many filters in my query: more than 3 thousand pairs of key1 and key2 have to be filtered.

I was simply putting 1 million records at a time into a temp table and running a SQL query with Spark SQL on the temp table:

select * from mytemptable where (key1 = something and key2 = somethingElse) or (key1 = someOtherThing and key2 = someAnotherThing) or ... (3 thousand ORs!)

And I encounter a StackOverflowError at ATNConfigSet.java line 178.



So I have two options, IMHO:

1. Put all the key1/key2 filter pairs in another temp table and do a join between the two temp tables (see the sketch after this list).

2. Use Spark Streaming, which I'm not familiar with, and I don't know whether it could handle 3K filters.
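A hedged sketch of option 1, assuming the records and the 3K filter pairs are both available as Datasets; the names are illustrative.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FilterPairsJoin {
    // records has columns (key1, key2, value); filterPairs has columns (key1, key2).
    public static Dataset<Row> matchRecords(SparkSession spark,
                                            Dataset<Row> records,
                                            Dataset<Row> filterPairs) {
        records.createOrReplaceTempView("mytemptable");
        // A 3K-row table is small enough that Spark will normally broadcast it,
        // turning the join into a hash lookup per record.
        filterPairs.createOrReplaceTempView("filterpairs");

        return spark.sql(
            "select t.* from mytemptable t " +
            "join filterpairs f on t.key1 = f.key1 and t.key2 = f.key2");
    }
}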



Which way do you suggest? What is the best solution for my problem, performance-wise?



Thanks in advance